Simplifying Big Data with Apache Crunch. Micah Whitacre @mkwhit



Similar documents
Hadoop: The Definitive Guide

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Apache Flink Next-gen data analysis. Kostas

Professional Hadoop Solutions

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Big Data and Hadoop. Module 1: Introduction to Big Data and Hadoop. Module 2: Hadoop Distributed File System. Module 3: MapReduce

COURSE CONTENT Big Data and Hadoop Training

Real-time Streaming Analysis for Hadoop and Flume. Aaron Kimball odiago, inc. OSCON Data 2011

Unified Batch & Stream Processing Platform

Google Cloud Data Platform & Services. Gregor Hohpe

ITG Software Engineering

Scaling Out With Apache Spark. DTL Meeting Slides based on

Comparing SQL and NOSQL databases

Apache Flink. Fast and Reliable Large-Scale Data Processing

Peers Techno log ies Pv t. L td. HADOOP

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Hadoop & Spark Using Amazon EMR

Cloudera Certified Developer for Apache Hadoop

BIG DATA - HADOOP PROFESSIONAL amron

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

Schema Design Patterns for a Peta-Scale World. Aaron Kimball Chief Architect, WibiData

Xiaoming Gao Hui Li Thilina Gunarathne

Google Cloud Dataflow

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Big Data looks Tiny from the Stratosphere

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

Hadoop Ecosystem B Y R A H I M A.

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

HDFS. Hadoop Distributed File System

Data storing and data access

This is a brief tutorial that explains the basics of Spark SQL programming.

Safe Harbor Statement

Hadoop: The Definitive Guide

Zynga Analytics Leveraging Big Data to Make Games More Fun and Social

Moving From Hadoop to Spark

CSE-E5430 Scalable Cloud Computing Lecture 2

Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF

Parquet. Columnar storage for the people

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

The Hadoop Eco System Shanghai Data Science Meetup

Hadoop Project for IDEAL in CS5604

FlumeJava: Easy, Efficient Data-Parallel Pipelines

Big Data Frameworks: Scala and Spark Tutorial

Architectures for massive data management

Hadoop IST 734 SS CHUNG

Implement Hadoop jobs to extract business value from large and varied data sets

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Kafka & Redis for Big Data Solutions

Open source large scale distributed data management with Google s MapReduce and Bigtable

Workshop on Hadoop with Big Data

Upcoming Announcements

What is Analytic Infrastructure and Why Should You Care?

SQL on NoSQL (and all of the data) With Apache Drill

Big Data and Apache Hadoop s MapReduce

Large Scale Text Analysis Using the Map/Reduce

Brave New World: Hadoop vs. Spark

Hadoop Distributed File System (HDFS) Overview

Open source Google-style large scale data analysis with Hadoop

Building Scalable Big Data Pipelines

AVRO - SERIALIZATION

Real Time Data Processing using Spark Streaming

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Apache HBase. Crazy dances on the elephant back

Big Data for Investment Research Management

A framework for easy development of Big Data applications

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

SOLUTION BRIEF. JUST THE FAQs: Moving Big Data with Bulk Load.

Finding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Big Data for Investment Research Management

On a Hadoop-based Analytics Service System

Complete Java Classes Hadoop Syllabus Contact No:

How To Write A Trusted Analytics Platform (Tap)

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Apache HBase: the Hadoop Database

NON-INTRUSIVE TRANSACTION MINING FRAMEWORK TO CAPTURE CHARACTERISTICS DURING ENTERPRISE MODERNIZATION ON CLOUD

Java SE 8 Programming

Hadoop and Map-Reduce. Swati Gore

How To Write A Nosql Database In Spring Data Project

CS 378 Big Data Programming. Lecture 9 Complex Writable Types

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Unified Big Data Processing with Apache Spark. Matei

Introduction to Spark

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Transcription:

Simplifying Big Data with Apache Crunch Micah Whitacre @mkwhit

Semantic Chart Search Medical Alerting System Cloud Based EMR Population Health Management

Problem moves from scaling architecture...

Problem moves from not only scaling architecture... To how to scale the knowledge

Battling the 3 V s

Daily, weekly, monthly uploads Battling the 3 V s

Daily, weekly, monthly uploads 60+ different data formats Battling the 3 V s

Daily, weekly, monthly uploads 60+ different data formats Battling the 3 V s Constant streams for near real time

Daily, weekly, monthly uploads 60+ different data formats Battling the 3 V s Constant streams for near real time 2+ TB of streaming data daily

Population Health Avro CSV Vertica HBase Normalize Data Apply Algorithms Load Data for Displays HBase Solr Vertica

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

M a p p e r R e d u c e r

Struggle to fit into single MapReduce job

Struggle to fit into single MapReduce job Integration done through persistence

Struggle to fit into single MapReduce job Integration done through persistence Custom impls of common patterns

Struggle to fit into single MapReduce job Integration done through persistence Custom impls of common patterns Evolving Requirements

Prep for Bulk Load CSV Process Reference Data Process Raw Data using Reference CSV HBase Filter Out Invalid Data Group Data By Person Process Raw Person Data Anonymize Data Avro Create Person Record Avro

Easy integration between teams Focus on processing steps Shallow learning curve Ability to tune for performance

Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Transformation through fns (not job) Utilizes POJOs (hides serialization)

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

CSV Processing Pipeline Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

Pipeline Programmatic description of DAG Supports lazy execution Implementations indicate runtime MapReduce, Spark, Memory

Pipeline pipeline = new MRPipeline(Driver.class, conf); Pipeline pipeline = MemPipeline.getIntance(); Pipeline pipeline = new SparkPipeline(sparkContext, app );

Source Reads various inputs At least one required per pipeline Creates initial collections for processing Custom implementations

Source Sequence Files Avro Parquet HBase JDBC HFiles Text CSV Strings AvroRecords Results POJOs Protobufs Thrift Writables

pipeline.read( From.textFile(path));

pipeline.read( new TextFileSource(path,ptype));

PType<String> ptype = ; pipeline.read( new TextFileSource(path,ptype));

PType Hides serialization Exposes data in native Java forms Supports composing complex types Avro, Thrift, and Protocol Buffers

Multiple Serialization Types Serialization Type = PTypeFamily Avro & Writable available Can t mix families in single type Can easily convert between families

PType<Integer> inttypes = Writables.ints(); PType<String> stringtype = Avros.strings(); PType<Person> persontype = Avros.records(Person.class);

PType<Pair<String, Person>> pairtype = Avros.pairs(stringType, persontype);

PTableType<String, Person> tabletype = Avros.tableOf(stringType,personType);

PType<String> ptype = ; PCollection<String> strings = pipeline.read( new TextFileSource(path, ptype));

PCollection Immutable Unsorted Not created only read or transformed Represents potential data

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

PCollection<String> Process Reference Data PCollection<RefData>

DoFn Simple API to implement Transforms PCollection between forms Location for custom logic Processes one element at a time

For each item emits 0:M items MapFn - emits 1:1 FilterFn - returns boolean

DoFn API class ExampleDoFn extends DoFn<String, RefData>{... } Type of Data In Type of Data Out

Type of Data In Type of Data Out public void process (String s, Emitter<RefData> emitter) { RefData data = ; emitter.emit(data); }

PCollection<String> refstrings PCollection<RefData> refs = refstrings.paralleldo(fn, Avros.records(RefData.class));

PCollection<String> datastrs... PCollection<RefData> refs = datastrs.paralleldo(difffn, Avros.records(Data.class));

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

Hmm now I need to join... But they don t have a common key? We need a PTable

PTable<K, V> Immutable & Unsorted Multimap of Keys and Values Variation PCollection<Pair<K, V>> Joins, Cogroups, Group By Key

class ExampleDoFn extends DoFn<String, RefData>{... }

class ExampleDoFn extends DoFn<String, Pair<String, RefData>>{... }

PCollection<String> refstrings PTable<String, RefData> refs = refstrings.paralleldo(fn, Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

PTable<String, RefData> refs ; PTable<String, Data> data ;

data.join(refs); (inner join)

PTable<String, Pair<Data, RefData>> joineddata = data.join(refs);

Joins right, left, inner, outer Eliminates custom impls Mapside, BloomFilter, Sharded

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

FilterFn API class MyFilterFn extends FilterFn<...>{... Type of Data In }

public boolean accept (... value){ return value > 3; }

PCollection<Model> values = ; PCollection<Model> filtered = values.filter(new MyFilterFn());

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

Keyed By PersonId PTable<String,Model> models = ;

PTable<String,Model> models = ; PGroupedTable<String, Model> groupedmodels = models.groupbykey();

PGroupedTable<K, V> Immutable & Sorted PCollection<Pair<K, Iterable<V>>>

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

PCollection<Person> persons = ;

PCollection<Person> persons = ; pipeline.write(persons, To.avroFile(path));

PCollection<Person> persons = ; pipeline.write(persons, new AvroFileTarget(path));

Target Persists PCollection At least one required per pipeline Custom implementations

Target Strings AvroRecords Results POJOs Protobufs Thrift Writables Sequence Files Avro Parquet HBase JDBC HFiles Text CSV

CSV Process Reference Data Process Raw Data using Reference CSV Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

Execution Pipeline pipeline = ;... pipeline.write(...); PipelineResult result = pipeline.done();

Map CSV Reduce Process Reference Data Process Raw Data using Reference CSV Reduce Process Raw Person Data Filter Out Invalid Data Group Data By Person Create Person Record Avro

Tuning Tweak pipeline for performance GroupingOptions/ParallelDoOptions Scale factors

Functionality first Focus on the transformations Smaller learning curve Less fragility

Iterate with confidence Integration through PCollections Extend pipeline for new features

Links http://crunch.apache.org/ http://dl.acm.org/citation.cfm?id=1806596.1806638 http://www.quora.com/apache-hadoop/what-are-thedifferences-between-crunch-and-cascading