Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

1 Designing Agile Data Pipelines Ashish Singh Software Engineer, Cloudera

3 Big Data is stuck at The Lab.

4 We want to move to The Factory.

6 What does it mean to Systemize?
- Ability to easily add new data sources
- Easily improve and expand analytics
- Ease data access by standardizing metadata and storage
- Ability to discover mistakes and to recover from them
- Ability to safely experiment with new approaches

7 We will discuss:
- Architectures
- Patterns
- Ingest
- Storage
- Schemas
- Metadata
- Streaming
- Experimenting
- Recovery
We will not discuss:
- Actual decision making
- Data Science
- Machine learning
- Algorithms

8 So how do we build real data architectures?

9 The Data Bus

10 Data pipelines start like this. [Diagram: one client, one backend]

11 Then we reuse them. [Diagram: four clients, one backend]

12 Then we add multiple backends. [Diagram: four clients, two backends]

13 Then it starts to look like this. [Diagram: four clients, four backends]

14 With maybe some of this. [Diagram: four clients, four backends]

15 Adding applications should be easier
We need:
- Shared infrastructure for sending records
- Infrastructure must scale
- Set of agreed-upon record schemas

16 Kafka Based Ingest Architecture
Kafka decouples data pipelines.
[Diagram: Producers (four source systems) -> Brokers (Kafka) -> Consumers (Hadoop, security systems, real-time monitoring, data warehouse)]

17 Retain All Data

18 Data Pipeline: Traditional View
[Diagram labels: Input, Raw data, Clean data, Aggregated data, Enriched data, Output, Waste of diskspace]

[Diagram fragment: Filtered data, Dashboard, Report, Data]

20 Hadoop Based ETL: The FileSystem is the DB
- /user/
- /user/gshapira/testdata/orders
- /data/<database>/<table>/<partition>
- /data/<biz unit>/<app>/<dataset>/partition
- /data/pharmacy/fraud/orders/date=2030
- /etl/<biz unit>/<app>/<dataset>/<stage>
- /etl/pharmacy/fraud/orders/validated

21 Store intermediate data
- /etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>
- /etl/pharmacy/fraud/orders/raw/date=2030
- /etl/pharmacy/fraud/orders/deduped/date=2030
- /etl/pharmacy/fraud/orders/validated/date=2030
- /etl/pharmacy/fraud/orders_labs/merged/date=2030
- /etl/pharmacy/fraud/orders_labs/aggregated/date=2030
- /etl/pharmacy/fraud/orders_labs/ranked/date=2030
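The directory convention above can be captured in a small helper so that every job derives its input and output locations the same way. This is only a sketch: `EtlLocation`, its field names, and `atStage` are hypothetical names chosen to mirror the placeholders on the slide, not part of any Hadoop API.

```scala
// Hypothetical helper; field names mirror the slide's placeholders.
final case class EtlLocation(bizUnit: String, app: String, dataset: String,
                             stage: String, partition: String) {
  // Renders /etl/<biz unit>/<app>/<dataset>/<stage>/<partition>
  def path: String =
    Seq("etl", bizUnit, app, dataset, stage, partition).mkString("/", "/", "")

  // Same dataset and partition, different processing stage.
  def atStage(next: String): EtlLocation = copy(stage = next)
}

object EtlLayoutExample {
  val raw: EtlLocation = EtlLocation("pharmacy", "fraud", "orders", "raw", "date=2030")

  // Paths for each stage a partition passes through.
  val stages: Seq[String] =
    Seq("raw", "deduped", "validated").map(s => raw.atStage(s).path)
}
```

Because each stage writes to its own directory, a job can be re-run from any intermediate stage without touching the raw input.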

22 Batch ETL is old news

23 Small Problem!
- HDFS is optimized for large chunks of data
- Don't write individual events or micro-batches
- Think 100M-2G batches
- What do we do with small events?

24 Well, we have this data bus
[Diagram: three partitions, each ordered from old to new, with writes appended at the new end]

25 Kafka has topics
How about <biz unit>.<app>.<dataset>.<stage>?
- pharmacy.fraud.orders.raw
- pharmacy.fraud.orders.deduped
- pharmacy.fraud.orders.validated
- pharmacy.fraud.orders_labs.merged
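A naming convention like this is easiest to keep when topic names are built and parsed by shared code instead of by hand. The sketch below is plain Scala with no Kafka dependency; `Topic` and `parse` are hypothetical names used only for illustration.

```scala
object TopicNames {
  // One topic per <biz unit>.<app>.<dataset>.<stage>
  final case class Topic(bizUnit: String, app: String, dataset: String, stage: String) {
    def name: String = Seq(bizUnit, app, dataset, stage).mkString(".")
  }

  // Accept only names with exactly four non-empty dot-separated parts.
  def parse(name: String): Option[Topic] = name.split('.') match {
    case Array(b, a, d, s) if Seq(b, a, d, s).forall(_.nonEmpty) => Some(Topic(b, a, d, s))
    case _                                                        => None
  }
}
```

A consumer could use `parse` to reject or ignore topics that do not follow the convention.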

27 Benefits
- Recover from accidents
- Debug suspicious results
- Fix algorithm errors
- Experiment with new algorithms

28 Kinda Lambda

29 Lambda Architecture
- Immutable events
- Store intermediate stages
- Combine Batches and Streams
- Reprocessing

30 What we don't like
- Maintaining two applications
- Often in two languages
- That do the same thing

31 Pain Avoidance #1: Use Spark + Spark Streaming
- Spark is awesome for batch, so why not?
- The New Kid that isn't that New Anymore
- Easily 10x less code
- Extremely Easy and Powerful API
- Very good for machine learning
- Scala, Java, and Python
- RDDs
- DAG Engine

32 Spark Streaming
- Calling Spark in a Loop
- Extends RDDs with DStream
- Very Little Code Changes from ETL to Streaming

33 Spark Streaming
[Diagram: timeline of pre-first batch, first batch, and second batch. In each interval a receiver reads from the source into an RDD; from the first batch on, the previously collected RDD is processed in a single pass: filter, count, print.]
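The "calling Spark in a loop" idea can be illustrated without Spark at all: the same function that processes a whole batch is applied to each micro-batch in turn. This is a simulation in plain Scala collections, not Spark code; `countErrors`, `runBatch`, and `runMicroBatches` are hypothetical names.

```scala
object MicroBatchSketch {
  // The "ETL" logic: a single pass that filters error lines and counts them.
  def countErrors(lines: Seq[String]): Int = lines.count(_.contains("ERROR"))

  // Batch mode: run the logic once over everything.
  def runBatch(all: Seq[String]): Int = countErrors(all)

  // Streaming mode, simulated: slice the input into micro-batches
  // and run the very same logic on each slice.
  def runMicroBatches(all: Seq[String], batchSize: Int): Seq[Int] =
    all.grouped(batchSize).map(countErrors).toSeq
}
```

Since the per-batch logic is shared, moving from ETL to streaming changes how the data is sliced, not what is done to it.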

34 Small Example
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // ErrorCount.countErrors and updateFunc are not shown on this slide
    val sparkConf = new SparkConf().setMaster(args(0)).setAppName(this.getClass.getCanonicalName)
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // Create the DStream from data sent over the network
    val dStream = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    // Count the errors in each RDD in the stream
    val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))
    val stateStream = errCountStream.updateStateByKey[Int](updateFunc)
    errCountStream.foreachRDD(rdd => {
      System.out.println("Errors this minute:%d".format(rdd.first()._2))
    })
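The slide does not show the update function passed to `updateStateByKey`. Spark Streaming expects a function of the shape `(Seq[V], Option[S]) => Option[S]`: the new values for a key in the current batch plus the previous state. A minimal sketch, assuming the intent is a running total of error counts (the slide does not confirm this), could look like the following; `ErrorState` is a hypothetical wrapper name.

```scala
object ErrorState {
  // newCounts: counts seen for a key in this batch
  // runningTotal: state carried over from earlier batches, if any
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (newCounts, runningTotal) => Some(runningTotal.getOrElse(0) + newCounts.sum)
}
```

Returning `Some(...)` keeps the key in the state; in Spark Streaming, returning `None` would drop it.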

35 Pain Avoidance #2: Split the Stream
Why do we even need stream + batch?
- Batch efficiencies
- Re-process to fix errors
- Re-process after delayed arrival
What if we could re-play data?

36 Kafka + Stream Processing
[Diagram: Streaming App v1 -> Result set -> App]

37 Let's Re-Process with new algorithm
[Diagram: Streaming App v1 -> Result set -> App; Streaming App v2 -> Result set 2]

38 Let's Re-Process with new algorithm
[Diagram: Streaming App v1 -> Result set -> App; Streaming App v2 -> Result set 2]

39 Oh no, we just got a bunch of data for yesterday!
[Diagram: one Streaming App instance for today, another for yesterday]
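Re-processing by replay works because the log keeps its records and readers choose where to start. The sketch below stands in for a retained Kafka partition with an in-memory, append-only log addressed by offset; `Log`, `appV1`, and `appV2` are hypothetical names, and no Kafka client API is used.

```scala
object ReplaySketch {
  // Stand-in for a retained partition: append-only, addressed by offset.
  final class Log[A] {
    private var records = Vector.empty[A]

    // Appends a record and returns its offset.
    def append(a: A): Long = { records :+= a; records.size - 1L }

    // Returns every record from the given offset onward; nothing is consumed.
    def readFrom(offset: Long): Seq[A] = records.drop(offset.toInt)
  }

  // Two versions of the same hypothetical streaming app.
  def appV1(amounts: Seq[Int]): Int = amounts.sum
  def appV2(amounts: Seq[Int]): Int = amounts.filter(_ > 0).sum // v2 ignores negative amounts
}
```

Running `appV2` over `readFrom(0)` produces a second result set next to the one from `appV1`, and a separate instance can read from an older offset to handle late data while the main instance keeps up with today.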

40 Note: No need to choose between the approaches. There are good reasons to do both.

41 Prediction: Batch vs. Streaming distinction is going away.

42 Yes, you really need a Schema

43 Schema is a MUST HAVE for data integration

44 [Diagram: four clients, four backends]

45 Remember that we want this?
[Diagram: Producers (four source systems) -> Brokers (Kafka) -> Consumers (Hadoop, security systems, real-time monitoring, data warehouse)]

46 This means we need this:
[Diagram: four source systems -> Kafka -> Hadoop, security systems, real-time monitoring, data warehouse, with a Schema Repository alongside Kafka]

47 We can do it in a few ways
- People go around asking each other: "So, what does the 5th field of the messages in topic Blah contain?"
- There's utility code for reading/writing messages that everyone reuses
- Schema embedded in the message
- A centralized repository for schemas
  - Each message has Schema ID
  - Each topic has Schema ID
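The "centralized repository, each message has Schema ID" option can be sketched in a few lines. This is an in-memory illustration only, not the API of any real schema registry; `SchemaRepository`, `Schema`, `Message`, and `decode` are hypothetical names.

```scala
object SchemaRepoSketch {
  final case class Schema(fields: Seq[String])
  // A message carries only a schema ID plus positional values.
  final case class Message(schemaId: Int, values: Seq[String])

  final class SchemaRepository {
    private var byId = Map.empty[Int, Schema]

    // Registering the same schema twice returns the same ID.
    def register(schema: Schema): Int =
      byId.collectFirst { case (id, s) if s == schema => id }.getOrElse {
        val id = byId.size + 1
        byId += (id -> schema)
        id
      }

    def lookup(id: Int): Option[Schema] = byId.get(id)
  }

  // A reader resolves the ID to field names instead of asking a colleague.
  def decode(repo: SchemaRepository, msg: Message): Option[Map[String, String]] =
    repo.lookup(msg.schemaId).map(s => s.fields.zip(msg.values).toMap)
}
```

With an ID per message, the question about the 5th field is answered by a lookup, and messages stay small because the schema itself is not embedded.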

48 I ♥ Avro
- Define Schema
- Generate code for objects
- Serialize / Deserialize into Bytes or JSON
- Embed schema in files / records or not
- Support for our favorite languages. Except Go.
- Schema Evolution: add and remove fields without breaking anything
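Schema evolution is what lets fields be added and removed safely: the reader resolves each record against its own schema, ignoring fields it does not know and filling in defaults for fields the writer did not send. The sketch below imitates that idea with plain Scala maps; it is not Avro's API, and `Field` and `resolve` are hypothetical names.

```scala
object EvolutionSketch {
  // A reader-side field: a name plus an optional default for records that lack it.
  final case class Field(name: String, default: Option[String] = None)

  // Resolve a written record against the reader's fields:
  // unknown written fields are dropped, missing ones fall back to their default.
  def resolve(written: Map[String, String],
              reader: Seq[Field]): Either[String, Map[String, String]] = {
    val resolved = reader.map(f => f.name -> written.get(f.name).orElse(f.default))
    resolved.collectFirst { case (n, None) => s"missing field with no default: $n" } match {
      case Some(err) => Left(err)
      case None      => Right(resolved.collect { case (n, Some(v)) => n -> v }.toMap)
    }
  }
}
```

An old record read by a newer reader picks up the default for the added field, and a new record read by an older reader simply loses the fields that reader never asked for.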

49 Schemas are Agile
- Schemas allow adding readers and writers easily
- Schemas allow modifying readers and writers independently
- Schemas can evolve as the system grows

51 Woah, that was lots of stuff!

52 Recap: if you remember nothing else
- After the POC, it's time for production
- Goal: Evolve fast without breaking things
- For this you need:
  - Keep all data
  - Design pipeline for error recovery, batch or stream
  - Integrate with a data bus
  - And Schemas

53 Thank you
