Designing Agile Data Pipelines Ashish Singh Software Engineer, Cloudera

About Me
Software Engineer @ Cloudera
Contributed to Kafka, Hive, Parquet, and Sentry
Used to work in HPC
@singhasdev

Big Data is stuck at The Lab.

We want to move to The Factory


What does it mean to Systemize?
- Ability to easily add new data sources
- Easily improve and expand analytics
- Ease data access by standardizing metadata and storage
- Ability to discover mistakes and to recover from them
- Ability to safely experiment with new approaches

We will discuss: architectures, patterns, ingest, storage, schemas, metadata, streaming, experimenting, recovery.
We will not discuss: actual decision making, data science, machine learning, algorithms.

So how do we build real data architectures?

The Data Bus

Data pipelines start like this: a single client sending data to a backend.

Then we reuse them: several clients sending to the same backend.

Then we add multiple backends: the clients now send to several backends.

Then it starts to look like this: every client wired to every backend.

With maybe some of this: still more point-to-point connections crossing the diagram.

Adding applications should be easier. We need:
- Shared infrastructure for sending records
- Infrastructure that must scale
- A set of agreed-upon record schemas

Kafka-Based Ingest Architecture: source systems act as producers writing to Kafka brokers; consumers (Hadoop, security systems, real-time monitoring, the data warehouse) read from Kafka. Kafka decouples data pipelines.
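As a rough sketch of the producer side (using the standard kafka-clients API; the broker address and record contents here are invented for illustration):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")  // hypothetical broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Each source system just writes to a topic; Hadoop, monitoring, and the
// warehouse consume independently, so the pipelines stay decoupled.
producer.send(new ProducerRecord[String, String]("pharmacy.fraud.orders.raw", "order-42", "{...}"))
producer.close()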

Retain All Data

Data Pipeline, the traditional view: raw data → clean data → aggregated data → enriched data. Only the final output counts; the intermediate raw and clean data are seen as a waste of disk space.

It is all valuable data: raw, clean, aggregated, enriched, and filtered data all feed dashboards, reports, data scientists, and alerts (OMG).

Hadoop-Based ETL: the filesystem is the DB.
/user/
/user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/<partition>
/data/pharmacy/fraud/orders/date=2030
/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated

Store intermediate data:
/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>
/etl/pharmacy/fraud/orders/raw/date=2030
/etl/pharmacy/fraud/orders/deduped/date=2030
/etl/pharmacy/fraud/orders/validated/date=2030
/etl/pharmacy/fraud/orders_labs/merged/date=2030
/etl/pharmacy/fraud/orders_labs/aggregated/date=2030
/etl/pharmacy/fraud/orders_labs/ranked/date=2030
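In Spark terms, each stage just reads the previous stage's directory and writes its own; a minimal sketch following the layout above (treating dedup as a simple distinct() is an assumption for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dedup-orders"))
val raw = sc.textFile("/etl/pharmacy/fraud/orders/raw/date=2030")
// Write the deduplicated stage where the next job expects to find it.
raw.distinct().saveAsTextFile("/etl/pharmacy/fraud/orders/deduped/date=2030")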

Batch ETL is old news.

Small Problem!
HDFS is optimized for large chunks of data.
Don't write individual events or micro-batches.
Think 100MB-2GB batches.
What do we do with small events?

Well, we have this data bus: a topic split into partitions, each an ordered log of offsets where new writes are appended at the end and old records stay readable.

Kafka has topics. How about:
<biz unit>.<app>.<dataset>.<stage>
pharmacy.fraud.orders.raw
pharmacy.fraud.orders.deduped
pharmacy.fraud.orders.validated
pharmacy.fraud.orders_labs.merged

It's (almost) all topics: the raw, clean, aggregated, enriched, and filtered data each flow through their own topic on the way to dashboards, reports, data scientists, and alerts.

Benefits:
- Recover from accidents
- Debug suspicious results
- Fix algorithm errors
- Experiment with new algorithms

Kinda Lambda

Lambda Architecture:
- Immutable events
- Store intermediate stages
- Combine batches and streams
- Reprocessing

What we don't like:
- Maintaining two applications
- Often in two languages
- That do the same thing

Pain Avoidance #1: Use Spark + Spark Streaming
Spark is awesome for batch, so why not?
- The new kid that isn't that new anymore
- Easily 10x less code
- Extremely easy and powerful API
- Very good for machine learning
- Scala, Java, and Python
- RDDs
- DAG engine

Spark Streaming:
- Calling Spark in a loop
- Extends RDDs with DStreams
- Very little code changes from ETL to streaming

Spark Streaming: before the first batch, a receiver reads the source into an RDD; each batch then makes a single pass (filter, count, print) while the receiver collects the next batch's RDD.

Small Example

val sparkConf = new SparkConf().setMaster(args(0)).setAppName(this.getClass.getCanonicalName)
val ssc = new StreamingContext(sparkConf, Seconds(10))
// Create the DStream from data sent over the network
val dstream = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)
// Count the errors in each RDD in the stream
val errCountStream = dstream.transform(rdd => ErrorCount.countErrors(rdd))
// Keep a running count of errors across batches
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)
errCountStream.foreachRDD(rdd => {
  System.out.println("Errors this minute: %d".format(rdd.first()._2))
})
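The slide leaves ErrorCount.countErrors and updateFunc undefined; a minimal sketch of what they could look like (these bodies are guesses, not the deck's code):

import org.apache.spark.rdd.RDD

object ErrorCount {
  // Count lines containing "ERROR", keyed by a constant so
  // updateStateByKey can accumulate a running total per key.
  def countErrors(rdd: RDD[String]): RDD[(String, Int)] =
    rdd.filter(_.contains("ERROR")).map(_ => ("ERROR", 1)).reduceByKey(_ + _)
}

// Add this batch's counts to the state carried over from earlier batches.
val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))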

Pain Avoidance #2: Split the Stream
Why do we even need stream + batch?
- Batch efficiencies
- Re-process to fix errors
- Re-process after delayed arrival
What if we could re-play data?

Kafka + Stream Processing: a streaming app (v1) consumes the partition log and maintains a result set that an application queries.

Let's re-process with a new algorithm: streaming app v2 consumes the same log from the beginning and builds result set 2, while streaming app v1 keeps serving the application.
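A rough sketch of what replaying looks like with the standard Kafka consumer API (the broker address and group name are made up; a new consumer group with auto.offset.reset=earliest re-reads the retained log from the start):

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")  // hypothetical broker
props.put("group.id", "fraud-detector-v2")      // new group = independent offsets
props.put("auto.offset.reset", "earliest")      // start from the oldest retained record
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("pharmacy.fraud.orders.validated").asJava)
// v2 re-reads history and builds its own result set, while v1 keeps serving.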

Oh no, we just got a bunch of data for yesterday! One streaming app keeps processing today's records while a second instance replays the log to catch up on yesterday's.

Note: no need to choose between the approaches. There are good reasons to do both.

Prediction: the batch vs. streaming distinction is going away.

Yes, you really need a Schema

Schema is a MUST HAVE for data integration.

The tangle again: every client wired to every backend.

Remember that we want this? Source systems as producers, Kafka brokers in the middle, and consumers: Hadoop, security systems, real-time monitoring, the data warehouse.

This means we need this: the same Kafka architecture plus a Schema Repository that producers and consumers both consult.

We can do it in a few ways:
- People go around asking each other: "So, what does the 5th field of the messages in topic Blah contain?"
- There's utility code for reading/writing messages that everyone reuses
- Schema embedded in the message
- A centralized repository for schemas, where each message or each topic carries a schema ID
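As a sketch of that last option, the common framing is a magic byte plus a schema ID in front of the payload (the exact layout here is an assumption, not something the deck specifies):

import java.nio.ByteBuffer

// Hypothetical wire format: [magic byte][4-byte schema ID][Avro payload].
// A consumer reads the ID, fetches the schema from the repository, then decodes.
def frame(schemaId: Int, avroPayload: Array[Byte]): Array[Byte] = {
  val buf = ByteBuffer.allocate(1 + 4 + avroPayload.length)
  buf.put(0.toByte)     // magic byte identifying the framing version
  buf.putInt(schemaId)  // key into the schema repository
  buf.put(avroPayload)
  buf.array()
}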

I ❤ Avro
- Define a schema
- Generate code for objects
- Serialize/deserialize into bytes or JSON
- Embed the schema in files/records, or not
- Support for our favorite languages (except Go)
- Schema evolution: add and remove fields without breaking anything
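A small sketch of that last point with the standard Avro API (the Order record and its fields are invented for illustration):

import org.apache.avro.Schema

// v1: what old producers wrote.
val v1 = new Schema.Parser().parse(
  """{"type": "record", "name": "Order",
    |  "fields": [{"name": "id", "type": "long"}]}""".stripMargin)

// v2: adds a field with a default, so data written with v1
// can still be read with v2 -- this is what makes evolution safe.
val v2 = new Schema.Parser().parse(
  """{"type": "record", "name": "Order",
    |  "fields": [{"name": "id", "type": "long"},
    |             {"name": "pharmacy", "type": "string", "default": ""}]}""".stripMargin)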

Schemas are Agile:
- Schemas allow adding readers and writers easily
- Schemas allow modifying readers and writers independently
- Schemas can evolve as the system grows


Whoa, that was a lot of stuff!

Recap, if you remember nothing else:
After the POC, it's time for production.
Goal: evolve fast without breaking things.
For this you need:
- Keep all data
- Design the pipeline for error recovery, batch or stream
- Integrate with a data bus
- And schemas

Thank you