Introduction to Hadoop


Introduction to Hadoop ID2210 Jim Dowling

Large Scale Distributed Computing
- In #Nodes: BitTorrent (millions) - Peer-to-Peer
- In #Instructions/sec: Teraflops, Petaflops, Exascale - Super-Computing
- In #Bytes stored: Facebook: 300+ Petabytes (April 2014)* - Hadoop
- In #Bytes processed/time: Google processed 24 petabytes of data per day in 2013 - Colossus, Spanner, BigQuery, BigTable, Borg, Omega, ...
*http://www.adweek.com/socialtimes/orcfile/434041

Where does Big Data Come From?
- On-line services: PBs per day
- Scientific instruments: PBs per minute
- Whole genome sequencing: 250 GB per person
- Internet-of-Things: will be lots!

What is Big Data? [Figure: Small Data vs. Big Data]

Why is Big Data hot? Companies like Google and Facebook have shown how to extract value from Big Data. Orbitz showed higher prices to Safari users [WSJ '12].

Why is Big Data hot? Big Data helped Obama win the 2012 election through data-driven decision making.* Data said: middle-aged females like contests, dinners, and celebrities. *http://swampland.time.com/2012/11/07/inside-the-secret-world-of-quants-and-data-crunchers-who-helped-obama-win/

Why is Big Data Important in Science? In a wide array of academic fields, the ability to effectively process data is superseding other, more classical modes of research. "More data trumps better algorithms."* *The Unreasonable Effectiveness of Data [Halevy et al. '09]

4 Vs of Big Data: Volume, Velocity, Variety, Veracity/Variability/Value

A quick historical tour of data systems

Batch Sequential Processing: Scan, Sort. IBM 082 Punch Card Sorter. No fault tolerance.

1960s

First Database Management Systems COBOL DBMS

Hierarchical and Network Database Management Systems

You had to know what data you wanted, and how to find it.

Early DBMS were Disk-Aware

Codd's Relational Model: just tell the system what data you want, and it will find it.

SystemR

CREATE TABLE Students(
  id INT PRIMARY KEY,
  firstname VARCHAR(96),
  lastname VARCHAR(96)
);

SELECT * FROM Students WHERE id > 10;

[Figure: layered architecture - Structured Query Language over Views, Relations and Indexes, over Access Methods, over the Disk]

Data Characteristics Change: Finding the Data using a Query Optimizer. Each color in the plan diagram represents a program; each program produces the same result for the query, but each has different performance characteristics depending on changes in the data characteristics.

What if I have lots of Concurrent Queries? Data Integrity using Transactions*: ACID - Atomicity, Consistency, Isolation, Durability. *Jim Gray, The Transaction Concept: Virtues and Limitations

In the 1990s Data Read Rates Increased Dramatically

Distribute within a Data Center Master-Slave Replication Data-location awareness is back: Clients read from slaves, write to master. Possibility of reading stale data.

In the 2000s Data Write Rates Increased Dramatically

Unstructured Data explodes Source: IDC whitepaper. As the Economy contracts, the Digital Universe Explodes. 2009

Key-Value stores don't do Big Data yet. Existing Big Data systems currently only work for a single Data Centre.* *The usual Google exception applies.

Storage and Processing of Big Data

What is Apache Hadoop?
- Huge data sets and large files: gigabyte files, petabyte data sets
- Scales to thousands of nodes on commodity hardware
- No schema required: data can just be copied in, required columns extracted later
- Fault tolerant
- Network topology-aware, data location-aware
- Optimized for analytics: high-throughput file access

Hadoop (version 1) stack: Application on top of MapReduce on top of the Hadoop Filesystem.

HDFS: Hadoop Filesystem. [Figure: a client writes /crawler/bot/jd.io/1 via the Name node; Data nodes store the replicated blocks and send heartbeats to the Name node, which re-replicates under-replicated blocks and rebalances blocks across Data nodes.]
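The slides show no client code, but a minimal sketch of writing and reading an HDFS file through Hadoop's FileSystem API (called from Scala here; the NameNode URI is an illustrative assumption, not from the slides) looks like this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.nio.charset.StandardCharsets

object HdfsExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical NameNode address; normally picked up from core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode:8020")
    val fs = FileSystem.get(conf)
    val path = new Path("/crawler/bot/jd.io/1")

    // Write: the client streams data to DataNodes; the NameNode only tracks metadata
    val out = fs.create(path, true)
    out.write("A hipster website with news".getBytes(StandardCharsets.UTF_8))
    out.close()

    // Read: the client asks the NameNode for block locations, then reads from DataNodes
    val in = fs.open(path)
    val buf = new Array[Byte](1024)
    val n = in.read(buf)
    in.close()
    println(new String(buf, 0, n, StandardCharsets.UTF_8))
  }
}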

HDFS v2 Architecture: Active-Standby replication of the NameNode log; agreement on the active NameNode; faster recovery by cutting the NameNode log. [Figure: the HDFS Client talks to the active NameNode, with a Standby NameNode kept in sync via Journal Nodes, coordinated by Zookeeper Nodes; a Snapshot Node and the DataNodes complete the picture.]

HopsFS Architecture. [Figure: HopsFS and HDFS Clients reach multiple NameNodes (one elected Leader) through a Load Balancer; metadata is stored in NDB; DataNodes store the blocks.]

Processing Big Data

Big Data Processing with No Data Locality: a Job (e.g. over /crawler/bot/jd.io/1) is submitted to a Workflow Manager, which runs it on a Compute Grid node while the data lives elsewhere. This doesn't scale: bandwidth is the bottleneck. [Figure: compute nodes pulling data blocks over the network from storage nodes.]

MapReduce Data Locality: a Job (e.g. over /crawler/bot/jd.io/1) is submitted to the Job Tracker, which schedules Tasks on Task Trackers co-located with the DataNodes holding the blocks; each node processes its local blocks and writes result file(s) R. [Figure: Job Tracker at the top; Task Trackers and DataNodes co-located below.]

MapReduce* 1. Programming Paradigm 2. Processing Pipeline (moving computation to data) *Dean et al, OSDI 04

MapReduce Programming Paradigm: map(record) -> {(key_1, value_1), ..., (key_n, value_n)}; reduce((key, {value_1, ..., value_m})) -> output
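As a sketch of the contract (illustrative Scala, not Hadoop's actual API), the two user-supplied functions have roughly these shapes:

// The MapReduce contract as a Scala trait; names and types are illustrative
trait MapReduceJob[K1, V1, K2, V2, Out] {
  // map: one input record to zero or more intermediate key/value pairs
  def map(record: (K1, V1)): Seq[(K2, V2)]
  // reduce: one intermediate key plus all values grouped under it, to a final output
  def reduce(key: K2, values: Iterable[V2]): Out
}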

MapReduce Programming Paradigm: also found in functional programming languages, MongoDB, and Cassandra.

Example: Building a Web Search Index. map(url, doc) -> {(term_1, url), ..., (term_n, url)}; reduce((term, {url_1, ..., url_m})) -> (term, (posting list of urls, count))

Example: Building a Web Search Index. map(("jd.io", "A hipster website with news")) -> { emit("a", "jd.io"), emit("hipster", "jd.io"), emit("website", "jd.io"), emit("with", "jd.io"), emit("news", "jd.io") }

Example: Building a Web Search Index. map(("hn.io", "Hacker hipster news")) -> { emit("hacker", "hn.io"), emit("hipster", "hn.io"), emit("news", "hn.io") }

Example: Building a Web Search Index. reduce(("hipster", {"jd.io", "hn.io"})) -> ("hipster", (["jd.io", "hn.io"], 2))

Example: Building a Web Search Index. reduce(("website", {"jd.io"})) -> ("website", (["jd.io"], 1))

Example: Building a Web Search Index. reduce(("news", {"jd.io", "hn.io"})) -> ("news", (["jd.io", "hn.io"], 2))

MapReduce Map Phase: map(url, doc) -> {(term_1, url), ..., (term_n, url)}. [Figure: Mappers 1-6 run on the DataNodes holding the input blocks, processing local data.]

MapReduce Shuffle Phase: group by term, shuffling over the network using a Partitioner. [Figure: intermediate (term, url) pairs are partitioned into key ranges A-D, E-H, I-L, M-P, Q-T, U-Z across the DataNodes.]

MapReduce Reduce Phase: reduce((term, {url_1, ..., url_m})) -> (term, (posting list of urls, count)). [Figure: Reducers 1-6, one per key range (A-D ... U-Z), each writing its output on a DataNode.]
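To make the three phases concrete, here is a minimal local simulation of the inverted-index job with plain Scala collections (a sketch of mine; no Hadoop involved, URLs taken from the example slides):

object InvertedIndexSketch {
  // map: emit (term, url) for every term in the document
  def map(url: String, doc: String): Seq[(String, String)] =
    doc.toLowerCase.split("\\s+").toSeq.map(term => (term, url))

  // reduce: turn a term and its list of URLs into a posting list plus a count
  def reduce(term: String, urls: Seq[String]): (String, (Seq[String], Int)) =
    (term, (urls.distinct, urls.distinct.size))

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      "jd.io" -> "A hipster website with news",
      "hn.io" -> "Hacker hipster news"
    )

    // Map phase: apply map to every input record
    val mapped = docs.flatMap { case (url, doc) => map(url, doc) }

    // Shuffle phase: group intermediate (term, url) pairs by term
    val shuffled = mapped.groupBy(_._1).map { case (term, pairs) => (term, pairs.map(_._2)) }

    // Reduce phase: apply reduce to each term and its URLs
    val index = shuffled.map { case (term, urls) => reduce(term, urls) }

    index.toSeq.sortBy(_._1).foreach(println)
    // prints e.g. (hipster,(List(jd.io, hn.io),2)) and (website,(List(jd.io),1))
  }
}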

Hadoop 2.x
- Hadoop 1.x: single processing framework (batch apps). Stack: MapReduce (resource mgmt, job scheduler, data processing) on HDFS (distributed storage).
- Hadoop 2.x: multiple processing frameworks (batch, interactive, streaming). Stack: MapReduce (data processing) and others (Spark, MPI, Giraph, etc.) on YARN (resource mgmt, job scheduler) on HDFS (distributed storage).

MapReduce and MPI as YARN Applications [Murthy et al., Apache Hadoop YARN: Yet Another Resource Negotiator, SOCC '13]

Data Locality in Hadoop v2

Limitations of MapReduce [Zaharia '11]
- MapReduce is based on an acyclic data flow from stable storage to stable storage
  - Slow: writes data to HDFS at every stage in the pipeline
- Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
  - Iterative algorithms (machine learning, graphs)
  - Interactive data mining tools (R, Excel, Python)
[Figure: Input -> Map -> Reduce -> Output, chained across multiple Map/Reduce stages]

Iterative Data Processing Frameworks

val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words.groupBy { word => word }.count()
val output = counts.write(wordsOutput, RecordDataSinkFormat())
val plan = new ScalaPlan(Seq(output))

Spark: Resilient Distributed Datasets
- Allow apps to keep working sets in memory for efficient reuse
- Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability
- Resilient distributed datasets (RDDs): immutable, partitioned collections of objects, created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage; can be cached for efficient reuse
- Actions on RDDs: count, reduce, collect, save, ...

Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.

val lines = spark.textFile("hdfs://...")            // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
val messages = errors.map(_.split("\t")(2))
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count          // Action
cachedMsgs.filter(_.contains("bar")).count
...

[Figure: the Driver sends tasks to Workers; each Worker reads its HDFS block and keeps results in an in-memory cache.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

Apache Flink DataFlow Operators: Map, FlatMap, Project, Filter, Reduce, GroupReduce, Aggregate, Distinct, Join, CoGroup, Union, Iterate, Delta Iterate, Vertex Update, Accumulators. [Figure: example dataflow - Source -> Map -> Reduce -> Iterate -> Join -> Reduce -> Sink, with a second Source -> Map feeding the Join.] *Alexandrov et al.: The Stratosphere Platform for Big Data Analytics, VLDB Journal 5/2014
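For comparison with the Stratosphere-style snippet earlier, a word count written against Flink's Scala DataSet API (a sketch of mine, not from the slides) uses a few of the operators listed above:

import org.apache.flink.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Tiny in-memory input; a real job would read from HDFS with env.readTextFile(...)
    val text = env.fromElements("A hipster website with news", "Hacker hipster news")

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+")) // FlatMap operator
      .map((_, 1))                          // one (word, 1) pair per occurrence
      .groupBy(0)                           // group on the word field
      .sum(1)                               // aggregate the counts

    counts.print()
  }
}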

Built-in vs. driver-based looping
- Driver-based: the client runs Step, Step, Step, ...; the loop lives outside the system, in the driver program, so an iterative program looks like many independent jobs.
- Built-in: dataflows with feedback edges (e.g. Flink's map/join iteration); the system is iteration-aware and can optimize the job.
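As a concrete illustration of driver-based looping (a sketch of mine, not from the slides), here is an iterative computation where the loop lives in the Spark driver and every pass launches an independent job over a cached RDD; Flink's iterate operator would instead push the loop into the dataflow itself:

import org.apache.spark.{SparkConf, SparkContext}

object DriverLoopExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-loop").setMaster("local[*]"))

    // Working set cached in memory and reused on every iteration
    val xs = sc.parallelize(Seq(1.0, 4.0, 9.0, 16.0, 25.0)).cache()
    val n = xs.count()

    // Toy 1-D gradient descent towards the mean: the loop runs in the driver,
    // so Spark sees each iteration as an unrelated job (no global view of the loop)
    var mu = 0.0
    for (_ <- 1 to 20) {
      val gradient = xs.map(x => mu - x).sum() / n
      mu -= 0.5 * gradient
    }
    println(s"estimated mean = $mu") // converges towards 11.0

    sc.stop()
  }
}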

Hadoop on the Cloud. Cloud Computing traditionally separates storage and computation:
- Compute: EC2 (Amazon Web Services), Nova (OpenStack)
- Storage: Elastic Block Storage and S3 (Amazon Web Services), Swift object storage (OpenStack)
- VM images: Glance (OpenStack)

Data Locality for Hadoop on the Cloud
- Cloud hardware configurations should support data locality
- Hadoop's original topology awareness breaks
- Placing more than one VM containing block replicas for the same file on the same physical host increases correlated failures
- VMware introduced a NodeGroup-aware topology (HADOOP-8468)

Conclusions
- Hadoop is the open-source enabling technology for Big Data
- YARN is rapidly becoming the operating system for the Data Center
- Apache Spark and Flink are in-memory processing frameworks for Hadoop

References
Dean et al., MapReduce: Simplified Data Processing on Large Clusters, OSDI '04.
Shvachko, HDFS Scalability: The Limits to Growth, USENIX ;login:, April 2010.
Murthy et al., Apache Hadoop YARN: Yet Another Resource Negotiator, SOCC '13.
Processing a Trillion Cells per Mouse Click, VLDB '12.