Designing Agile Data Pipelines
Ashish Singh, Software Engineer, Cloudera
About Me
Software Engineer @ Cloudera
Contributed to Kafka, Hive, Parquet and Sentry
Used to work in HPC
@singhasdev
Big Data is stuck at The Lab.
We want to move to The Factory
What does it mean to Systemize?
Ability to easily add new data sources
Easily improve and expand analytics
Ease data access by standardizing metadata and storage
Ability to discover mistakes and to recover from them
Ability to safely experiment with new approaches
We will discuss: Architectures, Patterns, Ingest, Storage, Schemas, Metadata, Streaming, Experimenting, Recovery
We will not discuss: Actual decision making, Data Science, Machine Learning, Algorithms
So how do we build real data architectures?
The Data Bus
Data Pipelines start like this: one Client talking to one Backend.
Then we reuse them: several Clients talking to the same Backend.
Then we add multiple backends: Clients talking to the Backend and to Another Backend.
Then it starts to look like this: every Client wired to every Backend.
With maybe some of this: even more point-to-point connections on top.
Adding applications should be easier. We need:
Shared infrastructure for sending records
Infrastructure that must scale
A set of agreed-upon record schemas
Kafka Based Ingest Architecture
Producers (Source Systems) → Kafka Brokers → Consumers (Hadoop, Security Systems, Real-time monitoring, Data Warehouse)
Kafka decouples Data Pipelines
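Not on the slide, but as a rough sketch of the producer side of that picture (the broker address and payload are made up; the topic name follows the convention introduced later in this deck):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// The source system only knows the topic it writes to; it never sees the consumers.
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")  // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("pharmacy.fraud.orders.raw", "order-42", """{"amount": 19.99}"""))
producer.close()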
Retain All Data
Data Pipeline, Traditional View:
Input (Raw data) → Clean data → Aggregated data → Enriched data → Output
The intermediate data is treated as a waste of disk space.
It is all valuable data:
Raw data, Clean data, Aggregated data, Enriched data and Filtered data all feed dashboards, reports, data scientists and alerts.
Hadoop Based ETL: The FileSystem is the DB
/user/
/user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/<partition>
/data/pharmacy/fraud/orders/date=2030
/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated
Store intermediate data
/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>
/etl/pharmacy/fraud/orders/raw/date=2030
/etl/pharmacy/fraud/orders/deduped/date=2030
/etl/pharmacy/fraud/orders/validated/date=2030
/etl/pharmacy/fraud/orders_labs/merged/date=2030
/etl/pharmacy/fraud/orders_labs/aggregated/date=2030
/etl/pharmacy/fraud/orders_labs/ranked/date=2030
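A tiny sketch (the helper name is invented, not from the talk) of how that path convention could be encoded once and reused by every job:

// Build paths following /etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>.
def etlPath(bizUnit: String, app: String, dataset: String, stage: String, datasetId: String): String =
  s"/etl/$bizUnit/$app/$dataset/$stage/$datasetId"

// etlPath("pharmacy", "fraud", "orders", "validated", "date=2030")
//   returns "/etl/pharmacy/fraud/orders/validated/date=2030"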
Batch ETL is old news
Small Problem!
HDFS is optimized for large chunks of data
Don't write individual events or micro-batches
Think 100M-2G batches
What do we do with small events?
Well, we have this data bus
[Diagram: a Kafka topic split into partitions; each partition is a numbered, append-only log. Writes go to the new end, older records stay available.]
Kafka has topics. How about:
<biz unit>.<app>.<dataset>.<stage>
pharmacy.fraud.orders.raw
pharmacy.fraud.orders.deduped
pharmacy.fraud.orders.validated
pharmacy.fraud.orders_labs.merged
It's (almost) all topics
[Diagram: the same pipeline as before, with Raw, Clean, Aggregated, Enriched and Filtered data as topics feeding dashboards, reports, data scientists and alerts.]
Benefits:
Recover from accidents
Debug suspicious results
Fix algorithm errors
Experiment with new algorithms
Kinda Lambda
Lambda Architecture:
Immutable events
Store intermediate stages
Combine Batches and Streams
Reprocessing
What we don't like:
Maintaining two applications
Often in two languages
That do the same thing
Pain Avoidance #1: Use Spark + Spark Streaming
Spark is awesome for batch, so why not?
The New Kid that isn't that New Anymore
Easily 10x less code
Extremely Easy and Powerful API
Very good for machine learning
Scala, Java, and Python
RDDs
DAG Engine
Spark Streaming:
Calling Spark in a Loop
Extends RDDs with DStream
Very Little Code Changes from ETL to Streaming
Spark Streaming
[Diagram: a Receiver turns the Source into one RDD per batch interval; each batch makes a single pass through Filter → Count → Print, and the next batch repeats the same steps.]
Small Example

val sparkConf = new SparkConf().setMaster(args(0)).setAppName(this.getClass.getCanonicalName)
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Create the DStream from data sent over the network
val dstream = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)

// Counting the errors in each RDD in the stream
val errCountStream = dstream.transform(rdd => ErrorCount.countErrors(rdd))
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)
errCountStream.foreachRDD(rdd => {
  System.out.println("Errors this minute: %d".format(rdd.first()._2))
})
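The slide leaves ErrorCount.countErrors and updateFunc out. A plausible updateFunc (a sketch, not the talk's actual code) would keep a running per-key error total across batches:

// For updateStateByKey[Int]: combine this batch's counts with the previous state.
val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
  (newCounts, state) => Some(newCounts.sum + state.getOrElse(0))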
Pain Avoidance #2: Split the Stream
Why do we even need stream + batch?
Batch efficiencies
Re-process to fix errors
Re-process after delayed arrival
What if we could re-play data?
Kafka + Stream Processing
[Diagram: Streaming App v1 reads the topic and writes Result set 1, which an App consumes.]
Let's re-process with a new algorithm
[Diagram: Streaming App v2 reads the same topic from the beginning and builds Result set 2, while Streaming App v1 and Result set 1 keep serving the App.]
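A rough sketch of what that replay relies on. The group id, broker address and processWithNewAlgorithm below are placeholders, and the modern Kafka consumer API is used for illustration:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

// A new group.id plus auto.offset.reset=earliest makes Kafka hand the new
// application the whole retained log, starting from the oldest record.
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "fraud-detector-v2")       // new group, so no committed offsets yet
props.put("auto.offset.reset", "earliest")       // begin at the start of the topic
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("pharmacy.fraud.orders.validated"))
while (true) {
  val records = consumer.poll(Duration.ofMillis(500)).iterator()
  while (records.hasNext) {
    processWithNewAlgorithm(records.next().value())  // placeholder for the "v2" logic
  }
}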
Oh no, we just got a bunch of data for yesterday!
[Diagram: one Streaming App instance processes today's records while another re-reads the topic and re-processes yesterday's.]
Note: No need to choose between the approaches. There are good reasons to do both.
Prediction: Batch vs. Streaming distinction is going away.
Yes, you really need a Schema
Schema is a MUST HAVE for data integration
[Diagram: the earlier tangle of Clients talking to multiple Backends, again.]
Remember that we want this?
Producers (Source Systems) → Kafka Brokers → Consumers (Hadoop, Security Systems, Real-time monitoring, Data Warehouse)
This means we need this:
Source Systems → Kafka → Hadoop, Security Systems, Real-time monitoring, Data Warehouse, all sharing a Schema Repository.
We can do it in a few ways:
People go around asking each other: "So, what does the 5th field of the messages in topic Blah contain?"
There's utility code for reading/writing messages that everyone reuses
Schema embedded in the message
A centralized repository for schemas, where each message has a Schema ID, or each topic has a Schema ID
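As a rough illustration of the "each message has a Schema ID" option (the 4-byte framing is an assumption for the sketch, not any particular registry's wire format):

import java.nio.ByteBuffer

// Prepend the schema ID to the serialized payload; a consumer reads the ID,
// fetches that schema from the central repository, then decodes the payload.
def frame(schemaId: Int, payload: Array[Byte]): Array[Byte] =
  ByteBuffer.allocate(4 + payload.length).putInt(schemaId).put(payload).array()

def unframe(message: Array[Byte]): (Int, Array[Byte]) = {
  val buf = ByteBuffer.wrap(message)
  val schemaId = buf.getInt()
  val payload = new Array[Byte](buf.remaining())
  buf.get(payload)
  (schemaId, payload)
}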
I ♥ Avro
Define Schema
Generate code for objects
Serialize / Deserialize into Bytes or JSON
Embed schema in files / records, or not
Support for our favorite languages (except Go)
Schema Evolution: add and remove fields without breaking anything
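A small sketch of that workflow using Avro's GenericRecord API. The Order schema and its fields are invented for illustration; the optional notes field is the kind of addition that schema evolution allows without breaking existing readers:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter}
import org.apache.avro.io.EncoderFactory

// Hypothetical "Order" schema for the pharmacy.fraud.orders topics.
val schemaJson =
  """{"type": "record", "name": "Order", "namespace": "pharmacy.fraud",
    | "fields": [
    |   {"name": "order_id", "type": "string"},
    |   {"name": "amount",   "type": "double"},
    |   {"name": "notes",    "type": ["null", "string"], "default": null}
    | ]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

// Build a record and serialize it to Avro binary, ready to go into a Kafka message.
val order = new GenericData.Record(schema)
order.put("order_id", "42")
order.put("amount", 19.99)

val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericData.Record](schema).write(order, encoder)
encoder.flush()
val bytes = out.toByteArray  // payload for a ProducerRecord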
Schemas are Agile
Schemas allow adding readers and writers easily
Schemas allow modifying readers and writers independently
Schemas can evolve as the system grows
Woah, that was lots of stuff!
Recap, if you remember nothing else:
After the POC, it's time for production
Goal: Evolve fast without breaking things
For this you need to:
Keep all data
Design the pipeline for error recovery, batch or stream
Integrate with a data bus
And Schemas
Thank you