Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development. Dr. Dominik Benz, inovex GmbH 2013/06/03, Berlin Buzzwords



Similar documents
ITG Software Engineering

Big Data Training - Hackveda

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Qsoft Inc

Constructing a Data Lake: Hadoop and Oracle Database United!

Peers Techno log ies Pv t. L td. HADOOP

Workshop on Hadoop with Big Data

Cloudera Certified Developer for Apache Hadoop

Big Data Course Highlights

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Connecting Hadoop with Oracle Database

Xiaoming Gao Hui Li Thilina Gunarathne

Hadoop Setup. 1 Cluster

Big Business, Big Data, Industrialized Workload

A Performance Analysis of Distributed Indexing using Terrier

Open source Google-style large scale data analysis with Hadoop

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

HDFS. Hadoop Distributed File System

Hadoop IST 734 SS CHUNG

Complete Java Classes Hadoop Syllabus Contact No:

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Important Notice. (c) Cloudera, Inc. All rights reserved.

Implement Hadoop jobs to extract business value from large and varied data sets

The Inside Scoop on Hadoop

Unit Testing webmethods Integrations using JUnit Practicing TDD for EAI projects

How To Create A Data Visualization With Apache Spark And Zeppelin

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

CS 378 Big Data Programming. Lecture 2 Map- Reduce

Big Data for Investment Research Management

Data Domain Profiling and Data Masking for Hadoop

Cost-Effective Business Intelligence with Red Hat and Open Source

Chase Wu New Jersey Ins0tute of Technology

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION

BIG DATA - HADOOP PROFESSIONAL amron

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

CS 378 Big Data Programming

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

BIG DATA HADOOP TRAINING

COURSE CONTENT Big Data and Hadoop Training

Big Data for Investment Research Management

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Testing Big data is one of the biggest

Apache Bigtop: 100% Apache Bigdata management distribution. (and so much more!)

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Map Reduce & Hadoop Recommended Text:

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Building Scalable Big Data Pipelines

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Open source large scale distributed data management with Google s MapReduce and Bigtable

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Using distributed technologies to analyze Big Data

Hadoop and Hive Development at Facebook. Dhruba Borthakur Zheng Shao {dhruba, Presented at Hadoop World, New York October 2, 2009

Media Upload and Sharing Website using HBASE

Building a data analytics platform with Hadoop, Python and R

ITG Software Engineering

Parquet. Columnar storage for the people

June JMS and Hadoop Agent. Automic Workload Automation

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

How To Write A Nosql Database In Spring Data Project

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview

Introduction To Hive

Real-time Streaming Analysis for Hadoop and Flume. Aaron Kimball odiago, inc. OSCON Data 2011

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

APACHE HADOOP JERRIN JOSEPH CSU ID#

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Big Data Too Big To Ignore

Practical Hadoop by Example

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

Integrating SAP BusinessObjects with Hadoop. Using a multi-node Hadoop Cluster

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

So What s the Big Deal?

Business Intelligence for Big Data

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Hadoop Development & BI- 0 to 100

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Database Performance with In-Memory Solutions

Internals of Hadoop Application Framework and Distributed File System

America s Most Wanted a metric to detect persistently faulty machines in Hadoop

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop and Map-Reduce. Swati Gore

American International Journal of Research in Science, Technology, Engineering & Mathematics

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Native Connectivity to Big Data Sources in MSTR 10

Data processing goes big

Data Analyst Program- 0 to 100

<Insert Picture Here> Big Data

How To Scale Out Of A Nosql Database

Hadoop Introduction coreservlets.com and Dima May coreservlets.com and Dima May

Transcription:

Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development Dr. Dominik Benz, inovex GmbH 2013/06/03, Berlin Buzzwords

Who speaks the Elephant language? Class A extends Mapper ROI, $$, apt-get install? TDD!??????? Write/execute tests, specify acceptance criteria, 2

The road tobig Data QA our Big Data QA problem the FitNesse approach test data definition / selection result inspection job & workflow control 3

QA problem Web Intelligence@ 1&1 BI reporting, web analytics, ~ 1 billion log events / day, ~ 1 TB (thrift) logfiles DWH Hadoop Cluster chains of MR jobs, running on 20 nodes / 8 cores / 96 GB RAM (CDH) 4

QA problem An exemplary workflow? create?? inspect (sample) (binary) control input data formats workflows Log Files Log Files (thrift) Log Files (thrift) (thrift) MR job1 Intermediate result (avro) MR job2 DWH (RDBMS) 5

QA problem Existing Approaches method tests what? issues for our usecase JUnit isolated functions no integration, Java syntax MRUnit 1 mapper+ 1 reducer little integration, Java syntax itest hadoop jobs/workflows Java / Groovy syntax Scripts/CLI (manual) scripting/inspect. script chaos, syntax FitNesse as suitable addition/ solution! 6

The road tobig Data QA Big Data QA is different! the FitNesse approach test data definition / selection result inspection job & workflow control 7

FitNesse In a nutshell fully integrated standalone wiki and acceptance testing framework executable Wiki- Pages (returning test results) (almost) natural language test specification connectiontosut via (Java-) Fixtures 8

FitNesse Architecture Overview Browser script check num results 3 FitNesse Server Fixtures public int numresults {... } calling java methods from wiki, compare return values Integrates with REST, Jenkins System under Test 9

FitNesse An Exemplary Test 10

FitNesse Exemplary Test Source!path /home/inovex/lib/*.jar script Hadoop upload viewlog.csv to hdfs /testdata/ hadoop job from jar viewlog.jar [...] show job output check number of output files 3 11

FitNesse Hadoop Fixture Java Code public class Hadoop { public boolean uploadtohdfs(string localfile, String remotefile) {...} public boolean hadoopjobfromjar(string jar, String input, String output) {...} public String joboutput() {...} } public String numberofoutputfiles() {...} 12

The road tobig Data QA Big Data QA is different! Fitnesse Wiki test execution! test data definition / selection result inspection job & workflow control 13

Test Data CSV 14

Test Data Thrift Big Data: Efficient data transfer among heterogeneous sources Define Interface via IDL, Compiler for many languages 15

Test Data Real World Data Dev/Test HadoopCluster: Identical Hardware likeprod, but fewer nodes (random/biased) sampling e.g. on daily basis Feedback loop: identify special cases from real data include them in (manual) data definition Gradually increase test coverage/ artefact quality 16

The road tobig Data QA Big Data QA is different! FitNesse Wiki test execution! Define CSV / thrift / realworld test data! result inspection job & workflow control 17

Job Control Swiss Army Knife: Shell Execute arbitrary(shell) commands Mainly a wrapper around apache.commons.exec.commandline 18

Job Control Hadoop Fixture Hide complexity from test authors define appropriate test language via (Java) method names re-use other fixtures(shell, ) internally 19

Job Control Workflows & Suites FitNesse allows to group tests into suites MR job1 Can be used to simulate MR processing chains SetupSuite/ TearDownSuiteforcreating/ destroying test conditions MR job2 Tests can still be executed individually 20

The road tobig Data QA Big Data QA is different! FitNesse Wiki test execution! Define CSV / thrift / realworld data! result inspection Use suites & fixtures for jobs/workflows! 21

Results Data Warehouse / Hive Validate RDBMS contents(via JDBC) E.g. for checking the final result Oruse Hive+ Hive-Server toqueryrawdata 22

Results Pig Execute arbitrary pig commands from Wiki page Inspect e.g. binary intermediate results(avro, ) 23

Results Pig Fixture extends PigServer public class PigConsole extends PigServer { public void loadavrofileusingalias(string filename, String alias) { this.registerquery( alias + "= LOAD" + filename + "USING" + AVRO_STORAGE_LOADER + ";"); } } 24

Results Server Infrastructure TestEnvironments ProjA ProjB Fitnesse Master TestConfigurations ProjA ProjB dev qs live dev qs live Import / edit tests remotely Import / edit config remotely ProjA Dev ProjA Slave dev qs live QS ProjA Slave Live ProjA Slave 25

Thank you! dominik.benz@inovex.de Big Data QA is different! FitNesse Wiki test execution! Define CSV / thrift / realworld data! Inspect results via Pig/Hive Use suites & fixtures for jobs/workflows! 26

Want more? Inovex trains you! Android Developer Training (3 days, Karlsruhe/München) Certified Scrum Developer Training (5 days, Köln) Hadoop Developer Training (3 days, Karlsruhe/Köln) Liferay Portal-Developer Training (4 days, Karlsruhe) Liferay Portal-Admin Training (3 days, Karlsruhe) Pentaho Data Integration Training (4 days, München/Köln) information and registration at www.inovex.de/offene-trainings 27

Inovex @bbuzz Stefan Kathrin Bernhard Jörg Andrew Christian Christian 28