Testing Hadoop Applications. Tom Wheeler




About The Presenter
Tom Wheeler
Software Engineer, etc.
Greater St. Louis Area | Information Technology and Services
Current:
- Senior Curriculum Developer at Cloudera
Past:
- Principal Software Engineer at Object Computing, Inc. (Boeing)
- Software Engineer, Level IV at WebMD
- Senior Software Engineer at Teralogix
- Senior Programmer/Analyst at A.G. Edwards and Sons, Inc.

About The Presentation
What's ahead
- Unit Testing
- Integration Testing
- Performance Testing
- Diagnostics

About The Audience
What I expect from you
- Basic knowledge of Apache Hadoop
- Basic understanding of MapReduce (Java)
But I'll assume no particular knowledge of testing

Fundamental Concepts
What is testing?
- In my view, it's an application of computer security

Definitions
- Unit testing
  - Verifies function of each unit in isolation
- Integration testing
  - Verifies that the system works as a whole
- Performance testing
  - Verifies that code makes efficient use of resources
- Diagnostics
  - More about forensics than testing

Why is Testing Important?
[Figure: Cost of Fixing Code Defects. Axes: Expense vs. Time. Stages shown: Design, Implementation, Production.]

Overview of Unit Testing
- Verifies correctness for a unit of code
- Key features of unit tests
  - Simple
  - Isolated
  - Deterministic
  - Automated

Benefits of Unit Tests
- An investment in code quality
- Can prevent regressions
- Actually saves development time
  - Tests run much faster than Hadoop jobs
- Helps with refactoring

Introducing JUnit
- JUnit is a popular library for Java unit testing
- Open source (CPL)
- Add JAR file to your project
- Inspired many xUnit libraries for other languages

JUnit Basics
Import statements removed for brevity

Our class:

    public class Calculator {
        public int add(int first, int second) {
            return first + second;
        }
    }

Our test:

    public class CalculatorTest {

        private Calculator calc;

        @Before
        public void setup() {
            calc = new Calculator();
        }

        @Test
        public void testAdd() {
            assertEquals(8, calc.add(5, 3));
        }
    }

Log Event Counting Example
- Let's look at a simple MapReduce example
- Our goal is to summarize log events by level
  - Mapper: Parses log to extract level
  - Reducer: Sums up occurrences by level
- Seeing the data flow first will help illustrate the job

Mapper Input
Log data produced by an application using Log4J

    2012-09-06 22:16:49.391 CDT INFO "This can wait"
    2012-09-06 22:16:49.392 CDT INFO "Blah blah"
    2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
    2012-09-06 22:16:49.395 CDT INFO "More blather"
    2012-09-06 22:16:49.397 CDT WARN "Hey there"
    2012-09-06 22:16:49.398 CDT INFO "Spewing data"
    2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

Mapper Output
- Key is log level parsed from a single line in the file
- Value is a literal 1 (since there's one level per line)

    INFO   1
    INFO   1
    WARN   1
    INFO   1
    WARN   1
    INFO   1
    ERROR  1

Reducer Input
Hadoop sorts and groups the keys

    ERROR  [1]
    INFO   [1, 1, 1, 1]
    WARN   [1, 1]

Reducer Output
The final output is a per-level summary

    ERROR  1
    INFO   4
    WARN   2

Data Flow for Entire MapReduce Job
[Figure: the four stages from the preceding slides, numbered in order: (1) Map input, the seven log lines; (2) Map output, one (level, 1) pair per line; (3) Shuffle and Sort, producing the Reduce input ERROR [1], INFO [1, 1, 1, 1], WARN [1, 1]; (4) Reduce output ERROR 1, INFO 4, WARN 2.]
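
The data flow on these slides can be imitated in plain Java, which may help when reasoning about what each stage should produce. The sketch below is not the job's actual code and uses no Hadoop classes; the class name LogLevelFlowSketch and its methods are hypothetical, and it assumes the level is always the fourth whitespace-separated field of a line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the job's data flow; names are illustrative only.
class LogLevelFlowSketch {

    // "Map" step: pick the log level out of one line.
    // Assumes the layout: date, time, time zone, level, message.
    static String mapLine(String line) {
        return line.trim().split("\\s+")[3];
    }

    // "Shuffle and sort" step: group a literal 1 per occurrence under each
    // key. A TreeMap keeps the keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<String> levels) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String level : levels) {
            grouped.computeIfAbsent(level, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // "Reduce" step: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    // Chains the three steps for a list of input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<String> levels = new ArrayList<>();
        for (String line : lines) {
            levels.add(mapLine(line));
        }
        return reduce(shuffle(levels));
    }
}
```

Passing the seven sample lines from the Mapper Input slide to run(...) gives a map with the entries ERROR=1, INFO=4, WARN=2, matching the Reducer Output slide.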

Show Me the Code
- Let's examine the code for this job
- Then I'll run it
- And soon we'll see how to test and improve it

What's Wrong With This?
- Mapper is complex and hard to test
- How could we improve it?
  - Refactor to decouple business logic from Hadoop API
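
One way to picture the refactoring suggested here is to move the parsing rule into a small class that knows nothing about Hadoop. The following is a minimal sketch under assumptions, not the code shown in the talk: the class name LogLevelParser, the method parseLevel, and the choice to return an empty Optional for malformed lines are all hypothetical.

```java
import java.util.Optional;

// Hypothetical helper: the parsing rule lives here, free of Hadoop types,
// so an ordinary JUnit test can call it directly.
class LogLevelParser {

    // Returns the level from a line laid out as
    //   <date> <time> <zone> <LEVEL> "<message>"
    // or an empty Optional when the line is null or has fewer than four fields.
    static Optional<String> parseLevel(String line) {
        if (line == null) {
            return Optional.empty();
        }
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 4) {
            return Optional.empty();
        }
        return Optional.of(fields[3]);
    }
}

// The Mapper's map() method would then shrink to a thin wrapper: convert the
// input value to a String, call parseLevel, and emit (level, 1) through the
// Hadoop API only when a level is present.
```

With this split, the parsing rule can be covered by fast unit tests, and only the thin wrapper still depends on Hadoop.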

Unit Testing and External Dependencies
- We can validate core business logic with JUnit
- But our Mapper and Reducer still aren't fully tested
- How can we verify them?

Introducing MRUnit
- It's a JUnit extension for testing MapReduce code
- Open source (Apache licensed)
- Active development team
- Simulates much of Hadoop's core API, including
  - InputSplit
  - OutputCollector
  - Reporter
  - Counters

MRUnit Demo
I'll now demonstrate how to use MRUnit
- Testing the Mapper
- Testing the Reducer

Limitations of MRUnit
There are a few things MRUnit can't test (yet)
- Multiple lines of input for a Mapper
- Jobs that use DistributedCache
- Partitioners
- Hadoop Streaming

Overview of Integration Testing
- Verifies that the system works as a whole
- This can mean two things
  - Your units of code work with one another
  - Your code works with the underlying system

Testing Mappers and Reducers Together
- Unit tests verify Mappers and Reducers separately
- Also need to ensure they work together
- MRUnit can help here too

MiniMRCluster and MiniDFSCluster
- MRUnit can test Mapper and Reducer integration
  - But not the entire job
- Two mini-cluster classes
  - MiniMRCluster simulates MapReduce
  - MiniDFSCluster simulates HDFS
- Let's see an example

Integration Testing the Hadoop Stack
- Your code depends on many other components
- Your code doesn't function properly unless they do

[Figure: the stack, top to bottom: Your Code, Hadoop MapReduce, Hadoop HDFS, JVM, Operating System, Hardware.]

Integration Testing with iTest
- iTest is a product of Apache Bigtop
- Can be used for custom integration testing
- iTest is written in Groovy
  - The tests themselves can be in Java or Groovy

iTest Example
Import statements removed for brevity

    public class RunHadoopJobsTest {

        @Test
        public void testJobSuccess() {
            Shell sh = new Shell("/bin/bash -s");
            sh.exec("hadoop jar /tmp/myjob.jar MyDriver input output");
            assertEquals(0, sh.getRet());
        }

        @Test
        public void testJobFail() {
            Shell sh = new Shell("/bin/bash -s");
            sh.exec("hadoop jar /tmp/myjob.jar MyDriver bogus_args");
            assertEquals(1, sh.getRet());
        }
    }

Resource Utilization
Optimization is a tradeoff between resources
- CPU
- Memory
- Disk I/O
- Network I/O
- Developer time
- Hardware cost

Profiling

    public class ExampleDriver {

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ExampleDriver.class);
            conf.setJobName("My Slow Job");

            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            conf.setProfileEnabled(true);
            conf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,"
                    + "depth=6,force=n,thread=y,verbose=n,file=%s");

            conf.setMapperClass(MySlowMapper.class);
            conf.setReducerClass(MySlowReducer.class);

            // other driver code follows...

Profiling
One .profile file per Map / Reduce task attempt

    $ ls *.profile
    attempt_201206281229_0070_m_000000_0.profile  attempt_201206281229_0070_r_000000_0.profile
    attempt_201206281229_0070_m_000001_0.profile  attempt_201206281229_0070_r_000001_0.profile
    attempt_201206281229_0070_m_000002_0.profile  attempt_201206281229_0070_r_000002_0.profile
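
When many profile files accumulate, it can help to decode which task attempt each one belongs to. The sketch below assumes the file names follow the pattern visible in the listing (job identifier, then m or r for map or reduce, then task number and attempt number); the class ProfileFileName and its describe method are hypothetical helpers, not part of Hadoop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that decodes a task-attempt profile file name.
class ProfileFileName {

    // Groups: 1 = first job id part, 2 = job sequence number,
    // 3 = task type (m or r), 4 = task number, 5 = attempt number.
    private static final Pattern NAME = Pattern.compile(
            "attempt_(\\d+)_(\\d+)_([mr])_(\\d+)_(\\d+)\\.profile");

    // Returns a short human-readable description of the file name,
    // or throws IllegalArgumentException if the name does not match.
    static String describe(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected name: " + fileName);
        }
        String kind = m.group(3).equals("m") ? "map" : "reduce";
        int task = Integer.parseInt(m.group(4));
        int attempt = Integer.parseInt(m.group(5));
        return kind + " task " + task + ", attempt " + attempt
                + " of job " + m.group(1) + "_" + m.group(2);
    }
}
```

For example, describe("attempt_201206281229_0070_r_000001_0.profile") returns "reduce task 1, attempt 0 of job 201206281229_0070".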

Overview of Diagnostics
- Testing is proactive
- Diagnostics are reactive
- Many diagnostic tools available, including
  - Web UI
  - Logs
  - Counters
  - Job history

Web UI
Each Hadoop daemon has its own Web application

    Daemon              URL
    JobTracker          http://hostname:50030/
    TaskTracker         http://hostname:50060/
    NameNode            http://hostname:50070/
    Secondary NameNode  http://hostname:50090/
    DataNode            http://hostname:50075/

JobTracker Web UI Example

JobTracker Job History Page

Task Attempt List
Some columns removed to better fit the screen

Logs
- Most important logs for MapReduce developers
  - stdout
  - stderr
  - syslog
- Demo: How to debug error using Web UI and logs

Counters

Viewing Current Configuration

Job Properties

Recap
- There are many types of testing
  - Unit
  - Integration
  - Performance
- Diagnostics help us find the problems we missed

Resources for More Info
Johannes Link, ISBN: 1-55860-868-0

Resources for More Info (cont'd)
Tom White, ISBN: 1-449-31152-0

Resources for More Info (cont'd)
Charlie Hunt and Binu John, ISBN: 0-137-14252-8

Resources for More Info (cont'd)
Eric Sammer, ISBN: 1-449-32705-2

Conclusion
- Testing helps develop trust in a system
  - It should behave as you expect it to

Thank You! Tom Wheeler, Senior Curriculum Developer, Cloudera