Gerrit and Jenkins for Big Data Continuous Delivery. Santa Clara, CA, September 2-3


Ronald Gallagher
10 years ago

1 Gerrit and Jenkins for Big Data Continuous Delivery
Santa Clara, CA, September 2-3

2 About GerritForge
- Founded in 2009 in London
- Committed to OpenSource

3 The Team
Luca Milanesio
- Co-founder and Director of GerritForge
- Over 20 years in Agile Development and ALM
- OpenSource contributor to many projects (BigData, Continuous Integration, Git/Gerrit)
Antonios Chalkiopoulos
- Author of "Programming MapReduce with Scalding"
- Open source contributor to many BigData projects
- Working on the "land-of-hadoop" (landoop.com)


4 The Team (2)
Tiago Palma
- Data Warehouse & Big Data Development
- Senior Data Modeler
- Big Data infrastructure specialist
Stefano Galarraga
- 20 years of Agile Development
- Middleware, Big Data, Reactive Distributed Systems
- Open Source contributor to many BigData projects


5 Agenda
- Why continuous deployment on BigData?
- Our Development Lifecycle ingredients
  - Gerrit, Jenkins, Mesos, Marathon, CDH / Spark
- Topics to address in BigData development
  - Type of tests (Unit vs. Integration)
  - Testing the "real thing" (aka the Cluster)
- Our BigData virtualised infrastructure
  - Marathon, Mesos and Docker all around
- Live (minimised) Demo


6 WHY?
- Early BigData had no process at all, so it could fail at any time
- Mature BigData is a mission-critical decision maker
- Need for more stable software engineering methodologies:
  - Test-Driven Development (Stefano's ScaldingUnit)
  - Continuous Integration with Jenkins
  - Integration & Performance testing
  - Code review and validation


7 Code-Review BigData Lifecycle (1)
- Git used by distributed teams (UK, Israel, India)
- Topics and Code Review
- Jenkins build on every patch-set
- Commits reviewed / approved via Gerrit
- Submit

8 Code-Review BigData Lifecycle (2)

9 Code-Review BigData Lifecycle (3)
Submitting a Topic automatically:
- merges all patch-sets (semi-atomically)
- triggers a longer chain of CI steps
- promotes an RC if everything passes
Jenkins automation via Gerrit Trigger Plugin
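As an illustration of how a CI step might look up every change that belongs to a topic, the following is a minimal sketch against Gerrit's REST change query. The host name and the topic name are placeholders, and the helper names are made up for this example; only the `/changes/?q=` endpoint, the `topic:` and `status:` search operators, and the `)]}'` prefix that Gerrit puts in front of JSON responses are taken from Gerrit itself.

```python
import json
from urllib.parse import quote

# Gerrit prefixes JSON responses with this marker to prevent XSSI attacks.
GERRIT_MAGIC_PREFIX = ")]}'"


def topic_query_url(base_url: str, topic: str) -> str:
    """Build the REST URL that lists the open changes sharing one topic."""
    query = quote(f"topic:{topic} status:open", safe=":")
    return f"{base_url.rstrip('/')}/changes/?q={query}"


def parse_gerrit_json(body: str):
    """Strip the XSSI-protection prefix (if present) and decode the JSON."""
    if body.startswith(GERRIT_MAGIC_PREFIX):
        body = body[len(GERRIT_MAGIC_PREFIX):]
    return json.loads(body)


if __name__ == "__main__":
    # Hypothetical host and topic, used only to show the resulting URL.
    print(topic_query_url("https://gerrit.example.com", "etl-oracle"))
```

The HTTP call itself is left out on purpose, so the sketch can be read and run without a Gerrit server.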


10 Ingredients: Gerrit
- Git-based Code Review system
- Pre-commit review
- Allows multiple validation steps (pipeline)
- Validation + Integration flags
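To make the idea of validation flags concrete, here is a rough sketch of the payload a CI job could send to Gerrit's "set review" REST endpoint to vote on a label. `Verified` is a commonly configured Gerrit label; a separate `Integration` label is an assumption here, since the slides do not name the labels used, and all URLs are placeholders. In a Jenkins setup the Gerrit Trigger plugin normally reports the vote itself, so this only shows the shape of the data.

```python
import json


def review_endpoint(base_url: str, change_id: str, revision_id: str = "current") -> str:
    """URL of Gerrit's authenticated 'set review' endpoint for one revision."""
    return f"{base_url.rstrip('/')}/a/changes/{change_id}/revisions/{revision_id}/review"


def build_review(build_passed: bool, build_url: str, label: str = "Verified") -> dict:
    """Build a ReviewInput body that votes +1 or -1 on the given label."""
    score = 1 if build_passed else -1
    verdict = "succeeded" if build_passed else "failed"
    return {
        "message": f"Build {verdict}: {build_url}",
        "labels": {label: score},
    }


if __name__ == "__main__":
    # Hypothetical change number and build URL, for illustration only.
    print(review_endpoint("https://gerrit.example.com", "12345"))
    print(json.dumps(build_review(True, "https://jenkins.example.com/job/etl/7/"), indent=2))
```

Keeping one label per pipeline stage lets a fast per-patch-set build and a slower integration run vote independently on the same change.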

11 Ingredients: Jenkins
Plugins:
- Gerrit Trigger
- Docker build step
- Post-build script plugin

12 Fitting CDH Into this Picture: Integration Test
- Run integration tests in a CDH-enabled Docker container
- Hadoop/local and Spark/standalone are not enough:
  - Need to test class serialisation
  - Validate packaged fat-jars (library conflicts with CDH)
  - Performance on a real cluster


13 Fitting CDH Into this Picture
- Acceptance / performance tests with short-lived CDH clusters
- Solution: Mesos, Marathon and Docker
  - Ephemeral clusters with defined capacity
  - Automatic cluster configuration
  - All controlled via Docker/Mesos


14 Mesos + Marathon
Apache Mesos
- Abstracts CPU, memory, storage and other compute resources away from machines
Marathon Framework
- Runs on top of Mesos
- Keeps long-running applications up by restarting tasks that stop
- REST API for managing and scaling services
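The REST API mentioned above can be illustrated with a small sketch that prepares (but does not send) a request to create a Docker-based Marathon application. The Marathon host, the application id, the image name and the resource figures are all placeholders; the field names follow the Marathon app definition format (`id`, `instances`, `cpus`, `mem`, `container`), which may differ in detail between Marathon versions.

```python
import json
from urllib.request import Request


def docker_app(app_id: str, image: str, instances: int, cpus: float, mem_mb: int) -> dict:
    """Describe a Docker-based Marathon application with a fixed capacity."""
    return {
        "id": app_id,
        "instances": instances,
        "cpus": cpus,
        "mem": mem_mb,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
    }


def create_app_request(marathon_url: str, app: dict) -> Request:
    """Prepare the POST /v2/apps request; the caller decides when to send it."""
    return Request(
        url=f"{marathon_url.rstrip('/')}/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical registry, image and sizing, for illustration only.
    app = docker_app("/cdh/agent", "registry.example.com/cdh-agent:latest", 3, 1.0, 4096)
    req = create_app_request("http://marathon.example.com:8080", app)
    print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would submit it; scaling the service later amounts to changing `instances` on the same application.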


15 CDH Components
CDH distribution:
- Apache Spark
- Hadoop HDFS
- YARN

16 Integration Test Flow on CDH Cluster
Components: Jenkins Master, Marathon, Mesos Master, Mesos Slave (Slave Host running Docker), Private Docker Registry
Flow:
- POST to the Marathon REST API to start 1 Docker container with Cloudera Manager and N Docker containers with Cloudera agents
- The Marathon framework receives resource offers from the Mesos Master and submits the tasks
- The task is sent to the Mesos Slave
- The Docker image is fetched from the Docker registry if not present on the Slave host
- The Mesos Slave starts the Docker container
- Waiting for the containers ("Waiting for Dockers") until they are up ("Dockers UP")
- Install Cloudera packages via the Cloudera Manager API using Python
- Deploy the ETL, run the ETL and the Integration Tests
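The waiting step in this flow can be sketched as a simple polling loop. This is an assumption about how such a wait could be written, not the code used in the demo: the status document is expected to look like Marathon's `GET /v2/apps/{appId}` response, where the `app` object reports `tasksRunning`, and the function that fetches it is injected so the loop can be exercised without a cluster.

```python
import time
from typing import Callable


def tasks_ready(app_status: dict, expected: int) -> bool:
    """True once the reported number of running tasks reaches the target."""
    app = app_status.get("app", {})
    return app.get("tasksRunning", 0) >= expected


def wait_for_cluster(
    fetch_status: Callable[[], dict],
    expected: int,
    attempts: int = 30,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until all containers are up, or give up after `attempts` polls."""
    for _ in range(attempts):
        if tasks_ready(fetch_status(), expected):
            return True
        sleep(delay_s)
    return False


if __name__ == "__main__":
    # Canned responses standing in for successive Marathon status calls.
    responses = iter([{"app": {"tasksRunning": 1}}, {"app": {"tasksRunning": 4}}])
    print(wait_for_cluster(lambda: next(responses), expected=4, delay_s=0))
```

Only after this returns `True` would the job move on to installing the Cloudera packages and deploying the ETL; a `False` result is a natural point to fail the build and tear the containers down.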


17 Unit and Integration Tests sample
Test project:
- Test Spark project
- ETL from Oracle to HDFS
- Unit tests directly on the Spark logic
- Integration tests for every patch-set:
  - VERY small dataset just for this demo
  - CDH and Oracle Docker images


18 Unit and Integration Tests
[Diagram labels: Jenkins; init; Build Job; Hadoop pseudo-distributed mode; Spark Standalone; init/read HDFS; submit job]

19 DEMO
Small-scale BigData Delivery Pipeline

20 References
Replay, share or extend the continuous delivery pipeline demo:
- Blog:
- Twitter: @GerritReview @GitEnterprise @GerritForge
- Learn Gerrit Code Review:
- Get in touch with GerritForge: mailto:

21 Thanks to our Sponsors!