Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Lecture 6: Programming Hadoop II. Cloud Application Development (SE808, School of Software, Sun Yat-Sen University). Yabo (Arber) Xu

Outline
- Hadoop streaming
- Side data distribution
- Hadoop Zen
- System integration

HADOOP STREAMING

Motivation
- You want to use a scripting language
  - Faster development time
  - Easier to read, debug
  - Use existing libraries
- You (still) have lots of data

Hadoop Streaming
- Interfaces Hadoop MapReduce with arbitrary program code
- Uses stdin and stdout for data flow
- You define a separate program for each of the mapper and the reducer

WordCount in Shell (Simplified)
- Input format: one word per line
- Input files: ./input/*
- WordCount in shell:
    cat * | sort | uniq -c
- Can we leverage MR to run it on thousands of files and nodes?

Hadoop Streaming + Shell
    hadoop jar <hadoop>/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -input input \
        -output output \
        -mapper /bin/cat \
        -reducer '/bin/uniq -c'
Note: make sure these shell commands are installed on every node.

Reusing Programs
- Identity mapper/reducer: cat
- Summing: wc
    wc -l a.txt
- Field selection: cut
    cat /etc/passwd | cut -d: -f1 > user.txt
- Filtering: awk

Data Format
- Input (key, val) pairs sent in as lines of input:
    key (tab) val (newline)
- Data naturally transmitted as text
- You emit lines of the same form on stdout for output (key, val) pairs

Hadoop Streaming + Python

Map: wcmap.py

    #!/usr/bin/python
    import re
    import sys

    # emit "word<TAB>1" for every word read from stdin
    for line in sys.stdin:
        for word in line.strip().split():
            print word + "\t1"

Reduce: wcred.py

    #!/usr/bin/python
    import re
    import sys

    # sum the counts for each word read from stdin
    w2c = {}
    for line in sys.stdin:
        if len(line.strip()) != 0:
            (k, v) = line.strip().split("\t")
            w2c[k] = w2c.get(k, 0) + int(v)
    for w, c in w2c.items():
        print "%s\t%d" % (w, c)
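Because MapReduce sorts the map output by key before it reaches the reducer, a streaming reducer can also be written without buffering a dictionary of all words: it sums counts for one key at a time and emits a line whenever the key changes. A minimal sketch of that alternative (not from the original slides; the file name wcred_stream.py is made up for illustration):

    #!/usr/bin/python
    import sys

    # Relies on Hadoop delivering reducer input sorted by key, so all
    # "word<TAB>count" lines for a given word arrive consecutively.
    current_word = None
    current_count = 0
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        word, count = line.split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print "%s\t%d" % (current_word, current_count)
            current_word = word
            current_count = int(count)
    if current_word is not None:
        print "%s\t%d" % (current_word, current_count)

Memory use stays constant no matter how many distinct words appear, which matters once the vocabulary no longer fits in the reducer's heap.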

Hadoop Streaming + Python
Test locally:
    cat ../data/test.txt | python wcmap.py | sort | python wcred.py
Run on Hadoop:
    hadoop jar <hadoop>/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -input input \
        -output output \
        -mapper wcmap.py \
        -reducer wcred.py \
        -file wcmap.py \
        -file wcred.py

Hadoop Streaming Advanced Features
- Supports Java classes
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper
- Supports the Hadoop Aggregate operators
    -reducer aggregate    // sum/min/max
- Set job parameters by jobconf
    -jobconf mapred.reduce.tasks=12
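As a concrete illustration of the aggregate reducer, the mapper names the aggregation to perform in its output key; with the Aggregate package's documented LongValueSum prefix, word counting needs no hand-written reducer at all. A rough sketch under that assumption (the script name wcaggmap.py is made up for illustration):

    #!/usr/bin/python
    import sys

    # Each line asks "-reducer aggregate" to sum the 1s emitted per word:
    #   LongValueSum:<word> <TAB> 1
    for line in sys.stdin:
        for word in line.strip().split():
            print "LongValueSum:" + word + "\t" + "1"

The job is then launched like the earlier streaming examples, with -mapper wcaggmap.py and -reducer aggregate.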

Side Data Distribution
- Side data: extra read-only data needed by a job to process the main dataset
- Using the job configuration (for small metadata only)
- Using the distributed cache

Side Data Caching via Job Configuration
- Used for metadata of no more than a few kilobytes
- Loaded by the JobTracker / TaskTracker / JVM sub-process
Usage
- Set in the job configuration:
    Configuration conf = new Configuration();
    conf.set("line-prefix", "[SYSTEM]: ");
    conf.addResource("test.xml");
    Job job = new Job(conf, "wordcount");
- Get in the Mapper/Reducer:
    context.getConfiguration().get("line-prefix");
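Streaming scripts have no Context object, but Hadoop Streaming is documented to pass job configuration properties to the mapper/reducer processes as environment variables, with non-alphanumeric characters in the property name turned into underscores. A rough sketch under that assumption, reading the same line-prefix value from a Python mapper (the script is illustrative, not from the slides):

    #!/usr/bin/python
    import os
    import sys

    # "line-prefix" set with -jobconf (or conf.set) is assumed to appear
    # in the task environment as "line_prefix".
    prefix = os.environ.get("line_prefix", "")
    for line in sys.stdin:
        print prefix + line.rstrip("\n")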

Side Data Distribution: Distributed Cache
- A service that copies files and archives to the task nodes at runtime
- Files are cached in the local file system of the tasktracker and may be shared among different tasks
Usage
- The hadoop jar -files / -archives options:
    hadoop jar -files /test/file/file.1
- The DistributedCache class
- Access in the Mapper/Reducer:
    FileReader reader = new FileReader("god.txt");
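The same mechanism is handy from streaming: a file shipped with -file (or -files) is placed in the task's current working directory, so a script can simply open it by name. A minimal sketch of a mapper filtering against a shipped stopword list (stopwords.txt and the script itself are illustrative, not from the slides):

    #!/usr/bin/python
    import sys

    # stopwords.txt travels with the job (e.g. -file stopwords.txt) and is
    # found in the task's current working directory.
    with open("stopwords.txt") as f:
        stopwords = set(w.strip() for w in f)

    for line in sys.stdin:
        for word in line.strip().split():
            if word not in stopwords:
                print word + "\t1"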

Hadoop Zen
- Don't get frustrated (take a deep breath)
  - Remember this when you experience those W$*#T@F! moments
- This is bleeding-edge technology:
  - Lots of bugs
  - Stability issues
  - Even lost data
  - To upgrade or not to upgrade (damned either way)?
  - Poor documentation (or none)
- But Hadoop is the path to data nirvana

System Integration
- Front-end
  - Real-time
  - Customer-facing
  - Well-defined workflow
- Back-end
  - Batch
  - Internal
  - Ad hoc analytics

[Diagram: interactive web applications. Customers' browsers talk to the server-side software stack through an AJAX interface (HTTP request / response); the stack consists of a web server, middleware, and a DB server.]

Typical Scale-Out Strategies
- LAMP stack as the standard building block
- Lots of each (load balanced, possibly virtualized):
  - Web servers
  - Application servers
  - Cache servers
  - RDBMS
- Reliability achieved through replication
- Most workloads are easily partitioned
  - Partition by user
  - Partition by geography

- Caching servers: 15 million requests per second, 95% handled by memcache (15 TB of RAM)
- Database layer: 800 eight-core Linux servers running MySQL (40 TB user data)
Source: Technology Review (July/August 2008)

[Diagram: the same interactive web-application stack (browsers, AJAX interface, web server, middleware, DB server) extended with a Hadoop cluster (MapReduce on HDFS) that internal analysts use for back-end batch processing.]

OK, so now we have gone through the MapReduce basics.