Gerrit and Jenkins for Big Data Continuous Delivery. Santa Clara, CA, September 2-3



Similar documents
Mobile Development with Git, Gerrit & Jenkins

Savanna Hadoop on. OpenStack. Savanna Technical Lead

Developing Plugins for Cloud Scale

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

White paper: Delivering Business Value with Apache Mesos

Real Time Data Processing using Spark Streaming

How To Choose A Data Flow Pipeline From A Data Processing Platform

DevOps Course Content

Soma: Linked Data Infrastructure

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Jenkins Slave Cloud with Apache Mesos. Klaus Azesberger Reinhard Kiesswetter Infonova GmbH

Workshop on Hadoop with Big Data

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Hadoop on OpenStack Cloud. Dmitry Mescheryakov Software

Dell In-Memory Appliance for Cloudera Enterprise

Cisco IT Hadoop Journey

Qsoft Inc

Integrate Master Data with Big Data using Oracle Table Access for Hadoop

CloudStack and Big Data. Sebastien May 22nd 2013 LinuxTag, Berlin

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Big Data Analytics. Lucas Rego Drumond

Hadoop and Map-Reduce. Swati Gore

A Performance Analysis of Distributed Indexing using Terrier

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Business Intelligence in Microservice Architecture. Debarshi bol.com

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Chase Wu New Jersey Ins0tute of Technology

A Brief Introduction to Apache Tez

HiBench Introduction. Carson Wang Software & Services Group

RED HAT CONTAINER STRATEGY

Big Data Analytics OverOnline Transactional Data Set

BIG DATA HADOOP TRAINING

H2O on Hadoop. September 30,

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Big Data and Scripting Systems beyond Hadoop

The Virtualization Practice

Hadoop IST 734 SS CHUNG

The Future of Data Management

Jenkins World Tour 2015 Santa Clara, CA, September 2-3

How Bigtop Leveraged Docker for Build Automation and One-Click Hadoop Provisioning

Embracing Spark as the Scalable Data Analytics Platform for the Enterprise

HADOOP BIG DATA DEVELOPER TRAINING AGENDA

WHAT S NEW IN SAS 9.4

Data Analytics Infrastructure

NoSQL and Hadoop Technologies On Oracle Cloud

Native Connectivity to Big Data Sources in MSTR 10

Private Cloud Management

Certified Big Data and Apache Hadoop Developer VS-1221

Building a data analytics platform with Hadoop, Python and R

The KPMG-NL Big Data team 16 March 2015

Massively! Continuous Integration! A case study for Jenkins at cloud-scale

Unified Big Data Processing with Apache Spark. Matei

Talend Big Data. Delivering instant value from all your data. Talend

Gregory PaaS team

Ali Ghodsi Head of PM and Engineering Databricks

Practicing Continuous Delivery using Hudson. Winston Prakash Oracle Corporation

Hadoop & Spark Using Amazon EMR

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Extending Hadoop beyond MapReduce

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

Distributed Scheduling with Apache Mesos in the Cloud. PhillyETE - April, 2015 Diptanu Gon

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

The Inside Scoop on Hadoop

Enabling Continuous Delivery for Java Projects with Oracle Cloud Services (Oracle PaaS) Siva Rama Krishna Oracle India

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Sujee Maniyam, ElephantScale

HDP Hadoop From concept to deployment.

Ubuntu and Hadoop: the perfect match

Leveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015

Continuous Integration

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

DevOps. Jesse Pai Robert Monical 8/14/2015

Can t We All Just Get Along? Spark and Resource Management on Hadoop

Lessons Learned: Building a Big Data Research and Education Infrastructure

Continuous integration with Jenkins CI

Cloudera Manager Introduction

CoDe:U Git Flow - a Continuous Delivery Approach

Big Data Training - Hackveda

DevOps Best Practices for Mobile Apps. Sanjeev Sharma IBM Software Group

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Cisco Application-Centric Infrastructure (ACI) and Linux Containers

Moving From Hadoop to Spark

What is Big Data? Concepts, Ideas and Principles. Hitesh Dharamdasani

WEBAPP PATTERN FOR APACHE TOMCAT - USER GUIDE

A central continuous integration platform

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

November 12 th 13 th London: Mastering Continuous Integration with Jenkins

How Intel IT Successfully Migrated to Cloudera Apache Hadoop*

Big Data Course Highlights

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Managing large clusters resources

TUT NoSQL Seminar (Oracle) Big Data

Oracle Big Data Fundamentals Ed 1 NEW

Transcription:

Gerrit and Jenkins for Big Data Continuous Delivery Santa Clara, CA, September 2-3 1

About GerritForge Founded in 2009 in London Committed to OpenSource 2

The Team Luca Milanesio Co-founder and Director of GerritForge over 20 years in Agile Development and ALM OpenSource contributor to many projects (BigData, Continuous Integration, Git/Gerrit) Antonios Chalkiopulos Author of Programming MapReduce with Scalding Open source contributor to many BigData projects Working on the "land-of-hadoop' (landoop.com) 3

The Team (2) Tiago Palma Data Warehouse & Big Data Development Senior Data Modeler Big Data infrastructure specialist Stefano Galarraga 20 years of Agile Development Middleware, Big Data, Reactive Distributed Systems. Open Source contributor to many BigData projects. 4

Agenda Why continuous deployment on BigData? Our Development Lifecycle ingredients Gerrit, Jenkins, Mesos, Marathon, CDH / Spark Topics to address in BigData development Type of tests (Unit vs. Integration) Testing the "real thing" (aka the Cluster) Our BigData virtualised infrastructure Marathon, Mesos and Dockers all around Live (minimised) Demo 5

WHY? Early BigData had no process at all = may fail at any time Mature BigData is mission critical decision maker Need for more stable sw-engineering methodologies: Test-Driven Development (Stefano's ScaldingUnit) Continuous Integration with Jenkins Integration & Performance testing Code review and validation 6

Code-Review BigData Lifecycle (1) GIT used by distributed teams (UK, Israel, India) Topics and Code Review Jenkins build on every patch-set Commits reviewed / approved via Gerrit Submit 7

Code-Review BigData Lifecycle (2) 8

Code-Review BigData Lifecycle (3) Submitting a Topic automatically does: all patch-sets merged (semi-atomically) trigger a longer chain of CI steps automatically promote a RC if everything passes Jenkins automation via Gerrit Trigger Plugin 9

Ingredients: Gerrit Git-based Code Review system Pre-commit review Allows multiple validation steps (pipeline) Validation + Integration flags 10

Ingredients: Jenkins Plugins: Gerrit trigger Docker build step Post-build script plugin 11

Fitting CDH Into this Picture Integration Test Running integration tests into an CDH-enabled docker container Hadoop/local and Spark/standalone is not enough Need to test classes serialisation Validate package fat-jars (libs conflicts with CDH) Performance on a real cluster 12

Fitting CDH Into this Picture Acceptance / performance test with short-lived CDHs Solution: Mesos, Marathon and Docker: Ephemeral clusters with defined capacity Automatic cluster-config All controlled via Docker/Mesos 13

Mesos + Marathon Apache Mesos Abstracts CPU, memory, storage, other compute resources away from machines Marathon Framework Runs on top of Mesos Guarantees that long-running applications never stop REST API for managing and scaling services 14

CDH Components CDH 5.4.1 distribution Apache Spark Hadoop HDFS YARN 15

Integration Test Flow on CDH Cluster Slave Host Jenkins Master Marathon Mesos Master Mesos Slave Docker Private Docker Registry Waiting for Dockers POST to Marathon REST API to start 1 docker container with Cloudera Manager and N docker containers with cloudera agents Marathon Framework receives resource offers from Mesos Master and submits the tasks The task is sent to the Mesos Slave Install Cloudera packages via Cloudera Manager API using Python Mesos slave starts the docker container Dockers UP Docker image is fetched from Docker registry if not present in Slave host Deploy the ETL, run the ETL and the Integration Tests 16

Unit and Integration Tests sample Test project: Test Spark project ETL from Oracle to HDFS Unit-test directly on Spark logic Integration tests for every patch-set: VERY small dataset just for this demo CDH and Oracle Docker Images 17

Unit and Integration Tests Jenkins init Build Job O Hadoop Pseudodistributed mode Spark Standalone Init/read HDFS Submit job 18

DEMO Small-scale of BigData Delivery Pipeline 19

References Replay, share or extend the continuous delivery pipeline demo: https://github.com/gerritforge Blog: https://gitenterprise.me Twitter: @GerritReview @GitEnterprise @GerritForge Learn Gerrit Code Review: http://gerrithub.io/book Get in touch with GerritForge: mailto: info@gerritforge.com 20

Thanks to our Sponsors! 21