Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development. Dr. Dominik Benz, inovex GmbH 2013/06/03, Berlin Buzzwords

1 Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development Dr. Dominik Benz, inovex GmbH 2013/06/03, Berlin Buzzwords

2 Who speaks the Elephant language? Class A extends Mapper; ROI, $$; apt-get install; TDD!? Write/execute tests, specify acceptance criteria, ...

3 The road to Big Data QA: our Big Data QA problem; the FitNesse approach; test data definition / selection; result inspection; job & workflow control

4 QA problem: Web; 1&1 BI (reporting, web analytics, ...); ~1 billion log events / day, ~1 TB (Thrift) logfiles; DWH; Hadoop cluster: chains of MR jobs, running on 20 nodes / 8 cores / 96 GB RAM (CDH)

5 QA problem: an exemplary workflow
[Diagram: log files (Thrift) -> MR job1 -> intermediate result (Avro) -> MR job2 -> DWH (RDBMS). Open questions: create (sample) input data? inspect (binary) data formats? control workflows?]

6 QA problem: existing approaches
method | tests what? | issues for our use case
JUnit | isolated functions | no integration, Java syntax
MRUnit | 1 mapper + 1 reducer | little integration, Java syntax
iTest | Hadoop jobs/workflows | Java / Groovy syntax
Scripts/CLI | (manual) scripting/inspection | script chaos, syntax
FitNesse as a suitable addition / solution!

7 The road to Big Data QA: Big Data QA is different! The FitNesse approach; test data definition / selection; result inspection; job & workflow control

8 FitNesse in a nutshell: fully integrated standalone wiki and acceptance testing framework; executable wiki pages (returning test results); (almost) natural-language test specification; connection to the SUT via (Java) fixtures

9 FitNesse: architecture overview
[Diagram: browser -> FitNesse server -> fixtures -> system under test. Wiki row: script, check num results 3. Fixture: public int numResults() {...}]
Calls Java methods from the wiki and compares their return values. Integrates with REST, Jenkins.

10 FitNesse: an exemplary test

11 FitNesse: exemplary test source

!path /home/inovex/lib/*.jar

|script|Hadoop|
|upload|viewlog.csv|to hdfs|/testdata/|
|hadoop job from jar|viewlog.jar|[...]|
|show|job output|
|check|number of output files|3|

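12 FitNesse: Hadoop fixture, Java code

public class Hadoop {
    // Wiki row: upload <local file> to hdfs <remote path>
    public boolean uploadToHdfs(String localFile, String remoteFile) {...}

    // Wiki row: hadoop job from jar <jar> [...]
    public boolean hadoopJobFromJar(String jar, String input, String output) {...}

    // Wiki row: show job output
    public String jobOutput() {...}

    // Wiki row: check number of output files <expected>
    public String numberOfOutputFiles() {...}
}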
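
The link between a wiki row and a fixture method can be illustrated with a small, self-contained sketch. It is not FitNesse's implementation: it only mimics the convention that the name cells of a script row ("upload", "to hdfs") are joined into a camel-case method name (uploadToHdfs) while the cells in between are passed as String arguments. The classes WikiRowDispatch and DemoFixture and all their methods are hypothetical names chosen for this illustration, and the demo fixture records calls instead of talking to a cluster.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical mini-dispatcher; FitNesse does considerably more than this.
class WikiRowDispatch {

    // "upload to hdfs" -> "uploadToHdfs"
    static String toMethodName(String phrase) {
        String[] words = phrase.trim().split("\\s+");
        StringBuilder name = new StringBuilder(words[0].toLowerCase());
        for (int i = 1; i < words.length; i++) {
            name.append(Character.toUpperCase(words[i].charAt(0)))
                .append(words[i].substring(1).toLowerCase());
        }
        return name.toString();
    }

    // Cells alternate: name part, argument, name part, argument, ...
    static Object invokeRow(Object fixture, String... cells) {
        StringBuilder phrase = new StringBuilder();
        List<String> args = new ArrayList<>();
        for (int i = 0; i < cells.length; i++) {
            if (i % 2 == 0) {
                phrase.append(cells[i]).append(' ');
            } else {
                args.add(cells[i]);
            }
        }
        // This sketch assumes every fixture parameter is a String.
        Class<?>[] types = new Class<?>[args.size()];
        Arrays.fill(types, String.class);
        try {
            Method m = fixture.getClass().getMethod(toMethodName(phrase.toString()), types);
            return m.invoke(fixture, args.toArray());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching fixture method for row", e);
        }
    }

    // A "check" row passes if the returned value, as text, equals the expected cell.
    static boolean check(Object fixture, String expected, String... cells) {
        return expected.equals(String.valueOf(invokeRow(fixture, cells)));
    }

    public static void main(String[] unused) {
        DemoFixture fixture = new DemoFixture();
        invokeRow(fixture, "upload", "viewlog.csv", "to hdfs", "/testdata/");
        System.out.println(fixture.calls);
        System.out.println(check(fixture, "3", "number of output files"));
    }
}

// Stand-in for a real fixture: records calls, returns canned values.
class DemoFixture {
    final List<String> calls = new ArrayList<>();

    public boolean uploadToHdfs(String localFile, String remoteDir) {
        calls.add("upload " + localFile + " -> " + remoteDir);
        return true;
    }

    public String numberOfOutputFiles() {
        return "3";
    }
}
```

The point of the sketch is that test authors only see the phrases in the wiki row, while the Java method names define the vocabulary they can use.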

13 The road to Big Data QA: Big Data QA is different! FitNesse wiki test execution! Test data definition / selection; result inspection; job & workflow control

14 Test data: CSV

15 Test data: Thrift. Big Data: efficient data transfer among heterogeneous sources. Define the interface via an IDL; compilers exist for many languages.

16 Test data: real-world data. Dev/test Hadoop cluster: identical hardware to prod, but fewer nodes. (Random/biased) sampling, e.g. on a daily basis. Feedback loop: identify special cases from real data and include them in the (manual) data definition. Gradually increase test coverage / artefact quality.
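
One way to read "(random/biased) sampling" is sketched below, under assumptions that are not from the slides: log events arrive as text lines, a fixed-size uniform sample is drawn with reservoir sampling, and the bias consists of always keeping lines that match a "special case" predicate. The class LogSampler, its sample method and the ERROR marker are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical sampler: uniform reservoir sample plus unconditional special cases.
class LogSampler {

    static List<String> sample(Iterable<String> lines, int k, long seed,
                               Predicate<String> alwaysKeep) {
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        Random rnd = new Random(seed);          // fixed seed keeps test data reproducible
        List<String> special = new ArrayList<>();
        List<String> reservoir = new ArrayList<>(k);
        int seen = 0;                           // ordinary lines seen so far
        for (String line : lines) {
            if (alwaysKeep.test(line)) {        // biased part: special cases always survive
                special.add(line);
                continue;
            }
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(line);            // fill the reservoir first
            } else {
                int j = rnd.nextInt(seen);      // uniform in [0, seen)
                if (j < k) {
                    reservoir.set(j, line);     // replace with probability k / seen
                }
            }
        }
        List<String> result = new ArrayList<>(special);
        result.addAll(reservoir);
        return result;
    }

    public static void main(String[] unused) {
        List<String> day = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            day.add(i % 500 == 7 ? "ERROR event-" + i : "event-" + i);
        }
        List<String> testData = sample(day, 10, 42L, l -> l.startsWith("ERROR"));
        System.out.println(testData.size() + " lines selected");
    }
}
```

Special cases found this way can then be copied into the manually defined test data, which is the feedback loop the slide describes.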

17 The road to Big Data QA: Big Data QA is different! FitNesse wiki test execution! Define CSV / Thrift / real-world test data! Result inspection; job & workflow control

18 Job control: the Swiss Army knife, Shell. Executes arbitrary (shell) commands. Mainly a wrapper around apache.commons.exec.CommandLine.
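
A shell fixture of this kind might look roughly like the sketch below. Note the substitution: to stay self-contained it uses java.lang.ProcessBuilder from the JDK rather than Apache Commons Exec, and it assumes a POSIX `sh` is on the path. The class name ShellFixture and the methods execute, output and exitCode are hypothetical, not the names used in the talk.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical fixture: runs a command line through "sh -c" and keeps the result.
class ShellFixture {
    private String output = "";
    private int exitCode = -1;

    // Returns true if the command exits with status 0, so a wiki row can assert on it.
    public boolean execute(String command) {
        try {
            Process process = new ProcessBuilder("sh", "-c", command)
                    .redirectErrorStream(true)   // merge stderr into stdout
                    .start();
            // Read everything before waiting, so a full pipe cannot block the child.
            output = new String(process.getInputStream().readAllBytes(),
                                StandardCharsets.UTF_8);
            exitCode = process.waitFor();
            return exitCode == 0;
        } catch (IOException e) {
            output = e.toString();
            exitCode = -1;
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt flag
            exitCode = -1;
            return false;
        }
    }

    // Combined stdout/stderr of the last command, without surrounding whitespace.
    public String output() {
        return output.trim();
    }

    public int exitCode() {
        return exitCode;
    }

    public static void main(String[] unused) {
        ShellFixture shell = new ShellFixture();
        System.out.println(shell.execute("echo hello") + " / " + shell.output());
    }
}
```

Returning a boolean and exposing the output as a separate method keeps the fixture usable from wiki rows that either assert success or check the printed text.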

19 Job control: Hadoop fixture. Hide complexity from test authors; define an appropriate test language via (Java) method names; re-use other fixtures (Shell, ...) internally.

20 Job control: workflows & suites. FitNesse allows grouping tests into suites. Suites can be used to simulate MR processing chains (MR job1 -> MR job2). SuiteSetUp / SuiteTearDown pages create / destroy the test conditions. Tests can still be executed individually.

21 The road to Big Data QA: Big Data QA is different! FitNesse wiki test execution! Define CSV / Thrift / real-world data! Result inspection; use suites & fixtures for jobs/workflows!

22 Results: data warehouse / Hive. Validate RDBMS contents (via JDBC), e.g. for checking the final result. Or use Hive + Hive Server to query raw data.

23 Results: Pig. Execute arbitrary Pig commands from a wiki page. Inspect e.g. binary intermediate results (Avro, ...).

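24 Results: Pig fixture extends PigServer

import java.io.IOException;
import org.apache.pig.PigServer;

public class PigConsole extends PigServer {
    // (constructor and the AVRO_STORAGE_LOADER constant are not shown on the slide)

    // Registers a LOAD statement so the Avro file can be queried under the given alias.
    // filename is inserted verbatim; Pig's LOAD expects a single-quoted path.
    public void loadAvroFileUsingAlias(String filename, String alias) throws IOException {
        this.registerQuery(
            alias + " = LOAD " + filename + " USING " + AVRO_STORAGE_LOADER + ";");
    }
}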

25 Results: server infrastructure
[Diagram: a FitNesse master holds TestEnvironments and TestConfigurations for each project (ProjA, ProjB) and stage (dev, qs, live). Per-stage FitNesse slaves (ProjA dev, ProjA QS, ProjA live) import / edit tests and config remotely.]

26 Thank you! Big Data QA is different! FitNesse wiki test execution! Define CSV / Thrift / real-world data! Inspect results via Pig/Hive! Use suites & fixtures for jobs/workflows!

27 Want more? inovex trains you! Android Developer Training (3 days, Karlsruhe/München); Certified Scrum Developer Training (5 days, Köln); Hadoop Developer Training (3 days, Karlsruhe/Köln); Liferay Portal-Developer Training (4 days, Karlsruhe); Liferay Portal-Admin Training (3 days, Karlsruhe); Pentaho Data Integration Training (4 days, München/Köln). Information and registration at

28 Stefan, Kathrin, Bernhard, Jörg, Andrew, Christian, Christian
