Simplifying Big Data with Apache Crunch. Micah Whitacre @mkwhit

1 Simplifying Big Data with Apache Crunch. Micah Whitacre @mkwhit

11 Semantic Chart Search; Medical Alerting System; Cloud Based EMR; Population Health Management

12 Problem moves from scaling architecture...

13 Problem moves from not only scaling architecture... To how to scale the knowledge

15 Battling the 3 V's

19 Battling the 3 V's: Daily, weekly, monthly uploads; 60+ different data formats; Constant streams for near real time; 2+ TB of streaming data daily

20 Population Health: Avro, CSV, Vertica, HBase -> Normalize Data -> Apply Algorithms -> Load Data for Displays -> HBase, Solr, Vertica

21 CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

22 Mapper, Reducer

26 Struggle to fit into single MapReduce job; Integration done through persistence; Custom impls of common patterns; Evolving Requirements

27 Prep for Bulk Load CSV Process Reference Data Process Raw Data using Reference CSV HBase Filter Out Invalid Data Group Data By Person Process Raw Person Data Anonymize Data Avro Create Person Record Avro

28 Easy integration between teams; Focus on processing steps; Shallow learning curve; Ability to tune for performance

29 Apache Crunch: Compose processing into pipelines; Open source FlumeJava implementation; Transformation through functions (not jobs); Utilizes POJOs (hides serialization)

30 CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

31 CSV Processing Pipeline Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

32 Pipeline: Programmatic description of DAG; Supports lazy execution; Implementations indicate runtime (MapReduce, Spark, Memory)

33 Pipeline pipeline = new MRPipeline(Driver.class, conf); Pipeline pipeline = MemPipeline.getInstance(); Pipeline pipeline = new SparkPipeline(sparkContext, app);

34 Source: Reads various inputs; At least one required per pipeline; Creates initial collections for processing; Custom implementations

35 Source formats: Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV. Data types: Strings, AvroRecords, Results, POJOs, Protobufs, Thrift, Writables

36 pipeline.read( From.textFile(path));

37 pipeline.read( new TextFileSource(path,ptype));

38 PType<String> ptype = ...; pipeline.read( new TextFileSource(path, ptype));

39 PType: Hides serialization; Exposes data in native Java forms; Supports composing complex types; Avro, Thrift, and Protocol Buffers

40 Multiple Serialization Types: Serialization Type = PTypeFamily; Avro & Writable available; Can't mix families in a single type; Can easily convert between families

41 PType<Integer> inttypes = Writables.ints(); PType<String> stringType = Avros.strings(); PType<Person> personType = Avros.records(Person.class);

42 PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);

43 PTableType<String, Person> tableType = Avros.tableOf(stringType, personType);

44 PType<String> ptype = ...; PCollection<String> strings = pipeline.read( new TextFileSource(path, ptype));

45 PCollection: Immutable; Unsorted; Not created, only read or transformed; Represents potential data

46 CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

47 PCollection<String> Process Reference Data PCollection<RefData>

48 DoFn: Simple API to implement; Transforms PCollection between forms; Location for custom logic; Processes one element at a time

49 DoFn: for each item emits 0:M items; MapFn: emits 1:1; FilterFn: returns boolean

50 DoFn API: class ExampleDoFn extends DoFn<String, RefData> { ... } (String: type of data in; RefData: type of data out)

51 public void process(String s, Emitter<RefData> emitter) { RefData data = ...; emitter.emit(data); } (String s: data in; Emitter<RefData>: data out)

52 PCollection<String> refstrings = ...; PCollection<RefData> refs = refstrings.parallelDo(fn, Avros.records(RefData.class));

53 PCollection<String> datastrs = ...; PCollection<Data> data = datastrs.parallelDo(difffn, Avros.records(Data.class));
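
The DoFn slides above describe three shapes of function: 0:M emission through an emitter, 1:1 mapping, and boolean filtering. A minimal plain-Java sketch of those shapes follows. It deliberately does not use the Crunch classes: `DoFnSketch`, its nested `Emitter` and `DoFn` interfaces, and the local `parallelDo`, `mapFn`, `filterFn`, and `splitWords` helpers are hypothetical stand-ins written only to illustrate the contract.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Plain-Java stand-in for the DoFn idea; these are not the Crunch classes. */
final class DoFnSketch {

    /** Collects whatever a function chooses to emit. */
    interface Emitter<T> {
        void emit(T value);
    }

    /** Sees one input element at a time and may emit zero or more outputs. */
    interface DoFn<S, T> {
        void process(S input, Emitter<T> emitter);
    }

    /** MapFn-style helper: exactly one output per input. */
    static <S, T> DoFn<S, T> mapFn(Function<S, T> f) {
        return (input, emitter) -> emitter.emit(f.apply(input));
    }

    /** FilterFn-style helper: the input passes through only when accepted. */
    static <S> DoFn<S, S> filterFn(Predicate<S> accept) {
        return (input, emitter) -> {
            if (accept.test(input)) {
                emitter.emit(input);
            }
        };
    }

    /** 0:M example: one line in, one output per non-empty word. */
    static DoFn<String, String> splitWords() {
        return (line, emitter) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        };
    }

    /** Applies fn to every element in order (local stand-in for parallelDo). */
    static <S, T> List<T> parallelDo(List<S> inputs, DoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = parallelDo(Arrays.asList("a b", "", "c"), splitWords());
        DoFn<String, Integer> length = mapFn(String::length);
        DoFn<Integer, Integer> keepLarge = filterFn(v -> v > 3);
        System.out.println(words);
        System.out.println(parallelDo(words, length));
        System.out.println(parallelDo(Arrays.asList(1, 5, 2, 7), keepLarge));
    }
}
```

The empty input line produces no output at all, which is the "0" in 0:M. The 1:1 and filter cases are expressed here as special cases of the general function, matching how the slides present MapFn and FilterFn as variations on DoFn.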

54 CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

55 Hmm, now I need to join... But they don't have a common key? We need a PTable

56 PTable<K, V>: Immutable & Unsorted; Multimap of Keys and Values; Variation of PCollection<Pair<K, V>>; Joins, Cogroups, Group By Key

57 class ExampleDoFn extends DoFn<String, RefData>{... }

58 class ExampleDoFn extends DoFn<String, Pair<String, RefData>>{... }

59 PCollection<String> refstrings = ...; PTable<String, RefData> refs = refstrings.parallelDo(fn, Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

60 PTable<String, RefData> refs = ...; PTable<String, Data> data = ...;

61 data.join(refs); (inner join)

62 PTable<String, Pair<Data, RefData>> joineddata = data.join(refs);

63 Joins: right, left, inner, outer; Eliminates custom impls; Mapside, BloomFilter, Sharded
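
To make the join semantics on the preceding slides concrete, here is a small plain-Java model of keying two collections and inner-joining them. It is only a sketch under assumptions: `JoinSketch`, `keyByFirstColumn`, and `innerJoin` are hypothetical names, `Map.Entry` stands in for Crunch's `Pair`, the "key,value" line layout is invented for the example, and nothing here reflects how Crunch actually plans or executes a join.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

/** Plain-Java model of an inner join between two keyed collections. */
final class JoinSketch {

    /** Keys each "key,value" line by the text before the first comma; other lines are dropped. */
    static List<Entry<String, String>> keyByFirstColumn(List<String> lines) {
        List<Entry<String, String>> table = new ArrayList<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            if (comma > 0) {
                table.add(new SimpleImmutableEntry<>(
                        line.substring(0, comma), line.substring(comma + 1)));
            }
        }
        return table;
    }

    /** Inner join: one output per (left, right) pair that shares a key, in left order. */
    static <K, U, V> List<Entry<K, Entry<U, V>>> innerJoin(
            List<Entry<K, U>> left, List<Entry<K, V>> right) {
        // Index the right side by key; a key may carry several values (multimap).
        Map<K, List<V>> rightByKey = new HashMap<>();
        for (Entry<K, V> r : right) {
            rightByKey.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        List<Entry<K, Entry<U, V>>> joined = new ArrayList<>();
        for (Entry<K, U> l : left) {
            List<V> matches = rightByKey.get(l.getKey());
            if (matches == null) {
                continue; // inner join: left rows without a match are dropped
            }
            for (V rightValue : matches) {
                Entry<U, V> pair = new SimpleImmutableEntry<>(l.getValue(), rightValue);
                joined.add(new SimpleImmutableEntry<>(l.getKey(), pair));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> data =
                keyByFirstColumn(Arrays.asList("p1,visit", "p2,lab", "bad line", "p1,rx"));
        List<Entry<String, String>> refs =
                keyByFirstColumn(Arrays.asList("p1,Clinic A", "p3,Clinic C"));
        System.out.println(innerJoin(data, refs));
    }
}
```

The result shape mirrors the `PTable<String, Pair<Data, RefData>>` on slide 62: each key maps to a pair of the left and right values. Keys present on only one side (`p2`, `p3`) drop out of an inner join.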

64 CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

66 FilterFn API: class MyFilterFn extends FilterFn<...> { ... } (type parameter: type of data in)

67 public boolean accept (... value){ return value > 3; }

68 PCollection<Model> values = ...; PCollection<Model> filtered = values.filter(new MyFilterFn());

69 CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

70 Keyed By PersonId: PTable<String, Model> models = ...;

71 PTable<String, Model> models = ...; PGroupedTable<String, Model> groupedmodels = models.groupByKey();

72 PGroupedTable<K, V>: Immutable & Sorted; PCollection<Pair<K, Iterable<V>>>
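
The grouped shape described above, a sorted collection of keys each paired with an iterable of values, can be modelled in a few lines of plain Java. This is a sketch under assumptions, not Crunch code: `GroupByKeySketch`, `entry`, and the local `groupByKey` are hypothetical helpers, and a `TreeMap` is used only to imitate the "sorted by key" property.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Plain-Java model of grouping a keyed collection by key. */
final class GroupByKeySketch {

    /** Convenience constructor for a key/value pair. */
    static <K, V> Map.Entry<K, V> entry(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    /**
     * Collects all values that share a key. Keys come back sorted; values keep
     * input order here, which a distributed runtime would not necessarily guarantee.
     */
    static <K extends Comparable<K>, V> SortedMap<K, List<V>> groupByKey(
            List<Map.Entry<K, V>> table) {
        SortedMap<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> row : table) {
            grouped.computeIfAbsent(row.getKey(), k -> new ArrayList<>()).add(row.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> models = new ArrayList<>();
        models.add(entry("p2", "allergy"));
        models.add(entry("p1", "visit"));
        models.add(entry("p2", "lab"));
        // prints {p1=[visit], p2=[allergy, lab]}
        System.out.println(groupByKey(models));
    }
}
```

With person IDs as keys, each grouped entry holds everything known about one person, which is the input the "Create Person Record" step needs.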

73 CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

75 PCollection<Person> persons = ...;

76 PCollection<Person> persons = ...; pipeline.write(persons, To.avroFile(path));

77 PCollection<Person> persons = ...; pipeline.write(persons, new AvroFileTarget(path));

78 Target: Persists PCollection; At least one required per pipeline; Custom implementations

79 Target data types: Strings, AvroRecords, Results, POJOs, Protobufs, Thrift, Writables. Formats: Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV

80 CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

81 Execution: Pipeline pipeline = ...; ... pipeline.write(...); PipelineResult result = pipeline.done();
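
The deck describes collections as "potential data" and pipelines as lazily executed, with work triggered at `done()`. The toy model below illustrates that idea in plain Java. Everything in it is an assumption made for illustration: `LazyPipelineSketch`, `Planned`, and the `elementsProcessed` counter are hypothetical, the methods only borrow the names `read`, `parallelDo`, `write`, and `done`, and `done()` returns nothing here rather than a result object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

/** Toy model of deferred execution: stages are recorded, then run by done(). */
final class LazyPipelineSketch {
    private final List<Runnable> plannedWrites = new ArrayList<>();
    int elementsProcessed = 0; // counts work actually performed

    /** A description of data; nothing is computed until the plan is run. */
    final class Planned<T> {
        private final Supplier<List<T>> plan;

        private Planned(Supplier<List<T>> plan) {
            this.plan = plan;
        }

        /** Records a transformation without running it. */
        <R> Planned<R> parallelDo(Function<T, R> fn) {
            return new Planned<R>(() -> {
                List<R> out = new ArrayList<>();
                for (T element : plan.get()) {
                    elementsProcessed++;
                    out.add(fn.apply(element));
                }
                return out;
            });
        }
    }

    /** Registers a source; the list is not touched yet. */
    <T> Planned<T> read(List<T> source) {
        return new Planned<T>(() -> source);
    }

    /** Registers an output; still nothing runs. */
    <T> void write(Planned<T> collection, List<T> target) {
        plannedWrites.add(() -> target.addAll(collection.plan.get()));
    }

    /** Runs every registered write, which pulls data through the recorded stages. */
    void done() {
        for (Runnable write : plannedWrites) {
            write.run();
        }
        plannedWrites.clear();
    }

    public static void main(String[] args) {
        LazyPipelineSketch pipeline = new LazyPipelineSketch();
        List<Integer> sink = new ArrayList<>();
        Planned<String> lines = pipeline.read(Arrays.asList("a", "bb", "ccc"));
        pipeline.write(lines.parallelDo(String::length), sink);
        System.out.println(sink + " processed=" + pipeline.elementsProcessed);
        pipeline.done();
        System.out.println(sink + " processed=" + pipeline.elementsProcessed);
    }
}
```

Before `done()` the sink is empty and no element has been processed; afterwards the sink holds the three lengths. Deferring work this way is what gives a planner the chance to look at the whole graph before deciding how to run it.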

82 Map CSV Reduce Process Reference Data Process Raw Data using Reference CSV Reduce Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

83 Tuning: Tweak pipeline for performance; GroupingOptions/ParallelDoOptions; Scale factors

84 Functionality first: Focus on the transformations; Smaller learning curve; Less fragility

85 Iterate with confidence: Integration through PCollections; Extend pipeline for new features

86 Links
