Google Cloud Dataflow



Google Cloud Dataflow. Cosmin Arad, Senior Software Engineer (carad@google.com). August 7, 2015

Agenda 1 Dataflow Overview 2 Dataflow SDK Concepts (Programming Model) 3 Cloud Dataflow Service 4 Demo: Counting Words! 5 Questions and Discussion

Cloud Dataflow: History of Big Data at Google (a timeline spanning 2002-2013): GFS, MapReduce, BigTable, Dremel, Pregel, Flume, Colossus, Spanner, MillWheel.

Big Data on Google Cloud Platform. Capture: Pub/Sub, Logs, App Engine, BigQuery streaming. Store: Cloud Storage (objects), Cloud SQL (MySQL), Cloud Datastore (NoSQL), BigTable (structured). Process: Dataflow (stream and batch), Hadoop and Spark (on GCE). Analyze: BigQuery, Hadoop and Spark (on GCE), larger Hadoop ecosystem.

What is Cloud Dataflow? Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines. Cloud Dataflow is also a managed service for executing parallelized data processing pipelines.

Where might you use Cloud Dataflow? ETL Analysis Orchestration

Where might you use Cloud Dataflow? Movement, filtering, enrichment, shaping, reduction, batch computation, continuous computation, composition, external orchestration, simulation.

Dataflow SDK Concepts (Programming Model)

Dataflow SDK(s): Easily construct parallelized data processing pipelines using an intuitive set of programming abstractions. Google supported and open sourced: do what the user expects; no knobs whenever possible; build for extensibility; unified batch & streaming semantics. Java 7 (public) @ github.com/googlecloudplatform/dataflowjavasdk. Python 2 (in progress). Community sourced: Scala @ github.com/darkjh/scalaflow and Scala @ github.com/jhlch/scala-dataflow-dsl.

Dataflow Java SDK Release Process: weekly and monthly release cadences.

Pipeline: A directed graph of data processing transformations. Optimized and executed as a unit. May include multiple inputs and multiple outputs. May encompass many logical MapReduce or MillWheel operations. PCollections conceptually flow through the pipeline.

Runners specify how a pipeline should run. Direct Runner: local, in-memory execution; great for developing and unit tests. Cloud Dataflow Service: in batch mode, GCE instances poll for work items to execute; in streaming mode, GCE instances are set up in a semi-permanent topology. Community sourced: Spark runner from Cloudera @ github.com/cloudera/spark-dataflow; Flink runner from dataArtisans @ github.com/dataartisans/flink-dataflow.
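As a concrete illustration, here is a minimal sketch of selecting a runner through pipeline options, assuming the 1.x Dataflow Java SDK (com.google.cloud.dataflow.sdk.*); option and runner names may differ in other SDK versions.

// Sketch only: choosing a runner via PipelineOptions (1.x Dataflow Java SDK assumed).
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

public class RunnerSelection {
  public static void main(String[] args) {
    // Parse --project, --stagingLocation, --runner, etc. from the command line.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    // Run on the managed service; omit this (or pass --runner=DirectPipelineRunner)
    // for local, in-memory execution.
    options.setRunner(DataflowPipelineRunner.class);
    Pipeline p = Pipeline.create(options);
    // ... apply transforms here ...
    p.run();
  }
}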

Example: #HashTag Autocompletion

Tweets -> Read -> {Go Hawks #Seahawks!, #Seattle works museum pass. Free! Go #PatriotsNation! Having fun at #seaside, ...} -> ExtractTags -> {seahawks, seattle, patriotsnation, lovemypats, ...} -> Count -> {seahawks->5M, seattle->2M, patriots->9M, ...} -> ExpandPrefixes -> {d->(deflategate, 10M), d->(denver, 2M), ..., sea->(seahawks, 5M), sea->(seaside, 2M), ...} -> Top(3) -> Write -> Predictions: {d->[deflategate, desafiodatransa, djokovic], ..., de->[deflategate, desafiodatransa, dead50], ...}

The #HashTag pipeline as code (stages: Read, ExtractTags, Count, ExpandPrefixes, Top(3), Write):
Pipeline p = Pipeline.create();
p.begin()
    .apply(TextIO.Read.from("gs://..."))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.Write.to("gs://..."));
p.run();

Dataflow Basics. Pipeline: a directed graph of steps operating on data.
Pipeline p = Pipeline.create();
p.run();

Dataflow Basics. Pipeline: a directed graph of steps operating on data. Data: PCollection, an immutable collection of same-typed elements that can be encoded; also PCollectionTuple and PCollectionList.
Pipeline p = Pipeline.create();
p.begin()
    .apply(TextIO.Read.from("gs://..."))
    .apply(TextIO.Write.to("gs://..."));
p.run();

Dataflow Basics. Pipeline: a directed graph of steps operating on data. Data: PCollection, an immutable collection of same-typed elements that can be encoded; also PCollectionTuple and PCollectionList. Transformation: a step that operates on data. Core transforms: ParDo, GroupByKey, Combine, Flatten. Composite and custom transforms.
Pipeline p = Pipeline.create();
p.begin()
    .apply(TextIO.Read.from("gs://..."))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.Write.to("gs://..."));
p.run();

Dataflow Basics. The ExpandPrefixes DoFn used above:
class ExpandPrefixes ... {
  ...
  public void processElement(ProcessContext c) {
    String word = c.element().getKey();
    for (int i = 1; i <= word.length(); i++) {
      String prefix = word.substring(0, i);
      c.output(KV.of(prefix, c.element()));
    }
  }
}
Pipeline p = Pipeline.create();
p.begin()
    .apply(TextIO.Read.from("gs://..."))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.perElement())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.Write.to("gs://..."));
p.run();

PCollections: A collection of data of type T in a pipeline. May be either bounded or unbounded in size. Created by using a PTransform to: build from a java.util.Collection, read from a backing data store, or transform an existing PCollection. Often contain key-value pairs using KV<K, V>. {Seahawks, NFC, Champions, Seattle, ...} {..., NFC Champions #GreenBay, Green Bay #superbowl!, ... #GoHawks, ...}
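A minimal sketch of those three creation paths, assuming the 1.x Dataflow Java SDK; the file path and element values are illustrative only.

// Sketch: three ways to create a PCollection (1.x Dataflow Java SDK assumed).
import java.util.Arrays;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class PCollectionExamples {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // 1. Build from an in-memory java.util.Collection.
    PCollection<String> teams = p
        .apply(Create.of(Arrays.asList("Seahawks", "NFC", "Champions", "Seattle")))
        .setCoder(StringUtf8Coder.of());

    // 2. Read from a backing data store (the path is a placeholder).
    PCollection<String> tweets = p.apply(TextIO.Read.from("gs://my-bucket/tweets/*"));

    // 3. Transform an existing PCollection; the result is a PCollection of KVs.
    PCollection<KV<String, Long>> counts = teams.apply(Count.<String>perElement());

    p.run();
  }
}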

Inputs & Outputs ("Your Source/Sink Here"). Read from standard Google Cloud Platform data sources: GCS, Pub/Sub, BigQuery, Datastore, ... Write your own custom source by teaching Dataflow how to read it in parallel. Write to standard Google Cloud Platform data sinks: GCS, BigQuery, Pub/Sub, Datastore, ... Can use a combination of text, JSON, XML, and Avro formatted data.
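A sketch of a few of the built-in sources and sinks, assuming the 1.x Dataflow Java SDK connectors; the table, topic, and bucket names are placeholders.

// Sketch: built-in sources and sinks (1.x Dataflow Java SDK assumed;
// all table, topic, and bucket names below are placeholders).
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.values.PCollection;

class IOExamples {
  // Bounded sources and sinks, typically used in batch mode.
  static void batchSources(Pipeline p) {
    PCollection<String> lines = p.apply(TextIO.Read.from("gs://my-bucket/input/*"));
    PCollection<TableRow> rows =
        p.apply(BigQueryIO.Read.from("my-project:my_dataset.my_table"));
    lines.apply(TextIO.Write.to("gs://my-bucket/output"));
  }

  // Unbounded source, used when the pipeline runs in streaming mode.
  static void streamingSource(Pipeline p) {
    PCollection<String> messages =
        p.apply(PubsubIO.Read.topic("projects/my-project/topics/my-topic"));
  }
}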

Coders: A Coder<T> explains how an element of type T can be written to disk or communicated between machines. Every PCollection<T> needs a valid coder in case the service decides to communicate those values between machines. Encoded values are used to compare keys, so encodings need to be deterministic. Coder inference can infer a coder for many basic Java objects; AvroCoder is available for others.
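A sketch of declaring coders explicitly, assuming the 1.x Dataflow Java SDK; MyRecord is a hypothetical user-defined type.

// Sketch: declaring coders explicitly (1.x Dataflow Java SDK assumed;
// MyRecord is a made-up user type for illustration).
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.coders.DefaultCoder;
import com.google.cloud.dataflow.sdk.coders.KvCoder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.coders.VarLongCoder;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Option 1: annotate the class so coder inference picks AvroCoder automatically.
@DefaultCoder(AvroCoder.class)
class MyRecord {
  String tag;
  long count;
}

class CoderExamples {
  // Option 2: set a coder on a PCollection explicitly when inference can't help.
  static void setCoders(PCollection<KV<String, Long>> counts,
                        PCollection<MyRecord> records) {
    counts.setCoder(KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of()));
    records.setCoder(AvroCoder.of(MyRecord.class));
  }
}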

ParDo ( Parallel Do ) Processes each element of a PCollection independently using a user-provided DoFn {Seahawks, NFC, Champions, Seattle,...} LowerCase {seahawks, nfc, champions, seattle,...}

ParDo ( Parallel Do ) Processes each element of a PCollection independently using a user-provided DoFn.
PCollection<String> tweets = ...;
tweets.apply(ParDo.of(
    new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().toLowerCase());
      }
    }));
{Seahawks, NFC, Champions, Seattle,...} LowerCase {seahawks, nfc, champions, seattle,...}

ParDo ( Parallel Do ) Processes each element of a PCollection independently using a user-provided DoFn {Seahawks, NFC, Champions, Seattle,...} FilterOutSWords {NFC, Champions,...}

ParDo ( Parallel Do ) Processes each element of a PCollection independently using a user-provided DoFn {Seahawks, NFC, Champions, Seattle,...} ExpandPrefixes {s, se, sea, seah, seaha, seahaw, seahawk, seahawks, n, nf, nfc, c, ch, cha, cham, champ, champi, champio, champion, champions, s, se, sea, seat, seatt, seattl, seattle,...}

ParDo ( Parallel Do ) Processes each element of a PCollection independently using a user-provided DoFn {Seahawks, NFC, Champions, Seattle,...} KeyByFirstLetter {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>,...}

ParDo ( Parallel Do ) Processes each element of a PCollection independently using a user-provided DoFn. Elements are processed in arbitrary bundles (e.g. shards): startBundle(), processElement()*, finishBundle(). Supports arbitrary amounts of parallelization. Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo. {Seahawks, NFC, Champions, Seattle,...} KeyByFirstLetter {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>,...}
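A sketch of those bundle lifecycle hooks, assuming the 1.x Dataflow Java SDK's DoFn API; KeyByFirstLetter mirrors the hypothetical transform named on the slide.

// Sketch: a DoFn with the bundle lifecycle hooks (1.x Dataflow Java SDK assumed).
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.KV;

// Keys each word by its first letter, mirroring the KeyByFirstLetter step above.
class KeyByFirstLetter extends DoFn<String, KV<String, String>> {
  private transient int elementsInBundle;

  @Override
  public void startBundle(Context c) {
    // Called once per bundle, before any elements; good for per-bundle setup.
    elementsInBundle = 0;
  }

  @Override
  public void processElement(ProcessContext c) {
    String word = c.element();
    if (!word.isEmpty()) {
      c.output(KV.of(word.substring(0, 1), word));
      elementsInBundle++;
    }
  }

  @Override
  public void finishBundle(Context c) {
    // Called once per bundle, after all elements; good for flushing buffers.
    // (The per-bundle count kept here is for illustration only.)
  }
}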

GroupByKey: Takes a PCollection of key-value pairs and gathers up all values with the same key. Corresponds to the shuffle phase in Hadoop. How do you do a GroupByKey on an unbounded PCollection? {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>,...} GroupByKey {KV<S, {Seahawks, Seattle, ...}>, KV<N, {NFC, ...}>, KV<C, {Champions, ...}>}
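A minimal sketch of applying GroupByKey, assuming the 1.x Dataflow Java SDK; the helper class and input shape are illustrative.

// Sketch: GroupByKey gathers all values that share a key (1.x SDK assumed).
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

class GroupByKeyExample {
  static PCollection<KV<String, Iterable<String>>> group(
      PCollection<KV<String, String>> keyedWords) {
    // {KV<S, Seahawks>, KV<S, Seattle>, KV<N, NFC>, ...} becomes
    // {KV<S, {Seahawks, Seattle, ...}>, KV<N, {NFC, ...}>, ...}.
    // On an unbounded PCollection this requires windowing (next slide).
    return keyedWords.apply(GroupByKey.<String, String>create());
  }
}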

Windows (diagram: Nighttime / Mid-Day / Nighttime): Logically divide up or group the elements of a PCollection into finite windows. Fixed windows (hourly, daily, ...), sliding windows, sessions. Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections. Window.into() can be called at any point in the pipeline and will be applied when needed. Can be tied to arrival time or custom event time.
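A sketch of the three window types named above, assuming the 1.x Dataflow Java SDK windowing API; the durations are arbitrary examples.

// Sketch: fixed, sliding, and session windows (1.x Dataflow Java SDK assumed).
import org.joda.time.Duration;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Sessions;
import com.google.cloud.dataflow.sdk.transforms.windowing.SlidingWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

class WindowExamples {
  static void window(PCollection<KV<String, Long>> events) {
    // Fixed windows: one window per hour.
    events.apply(Window.<KV<String, Long>>into(
        FixedWindows.of(Duration.standardHours(1))));

    // Sliding windows: 30-minute windows emitted every 5 minutes.
    events.apply(Window.<KV<String, Long>>into(
        SlidingWindows.of(Duration.standardMinutes(30))
            .every(Duration.standardMinutes(5))));

    // Sessions: bursts of activity separated by gaps of at least 10 minutes.
    events.apply(Window.<KV<String, Long>>into(
        Sessions.withGapDuration(Duration.standardMinutes(10))));
  }
}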

[Diagram: event time vs. processing time, showing the watermark and event-time skew.]

Triggers Determine when to emit elements into an aggregated Window. Provide flexibility for dealing with time skew and data lag. Example use: Deal with late-arriving data. (Someone was in the woods playing Candy Crush offline.) Example use: Get early results, before all the data in a given window has arrived. (Want to know # users per hour, with updates every 5 minutes.)

Late & Speculative Results:
PCollection<KV<String, Long>> sums = Pipeline.begin()
    .read("userrequests")
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(
            AfterEach.inOrder(
                Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()
                        .alignedTo(Duration.standardMinutes(1)))
                    .orFinally(AfterWatermark.pastEndOfWindow()),
                Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                .orFinally(
                    AfterWatermark.pastEndOfWindow()
                        .plusDelayOf(Duration.standardDays(7))))
        .accumulatingFiredPanes())
    .apply(new Sum());

Composite PTransforms: Define new PTransforms by building up subgraphs of existing transforms. Some utilities are included in the SDK: Count, RemoveDuplicates, Join, Min, Max, Sum, ... You can define your own, for modularity, code reuse, and a better monitoring experience. Example: Count = Pair With Ones -> GroupByKey -> Sum Values.
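As a concrete illustration of that Count expansion, here is a minimal sketch of a composite PTransform, assuming the 1.x Dataflow Java SDK; the class name PerElementCount and its helper DoFn are made up for this example (the SDK's own Count.perElement() already provides this).

// Sketch: a composite PTransform built from Pair With Ones -> GroupByKey ->
// Sum Values (1.x Dataflow Java SDK assumed; PerElementCount is a made-up name).
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.PTransform;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.Sum;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

class PerElementCount
    extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {

  // "Pair With Ones": emit (word, 1) for every element.
  static class PairWithOne extends DoFn<String, KV<String, Long>> {
    @Override
    public void processElement(ProcessContext c) {
      c.output(KV.of(c.element(), 1L));
    }
  }

  @Override
  public PCollection<KV<String, Long>> apply(PCollection<String> words) {
    return words
        .apply(ParDo.of(new PairWithOne()))
        // GroupByKey + Sum Values, expressed as a per-key sum.
        .apply(Sum.<String>longsPerKey());
  }
}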


Cloud Dataflow Service

Life of a Pipeline: user code & SDK is deployed & scheduled onto the managed service (Job Manager and Work Manager) running on Google Cloud Platform; progress & logs feed the Monitoring UI.

Cloud Dataflow Service Fundamentals. Pipeline optimization: modular code, efficient execution. Smart workers: lifecycle management, auto-scaling, and dynamic work rebalancing. Easy monitoring: Dataflow UI, RESTful API and CLI, integration with Cloud Logging, Cloud Debugger, etc.

Graph Optimization: ParDo fusion (producer-consumer, sibling, with intelligent fusion boundaries); combiner lifting (e.g. partial aggregations before reduction); flatten unzipping; reshard placement; ...

Optimizer: ParallelDo fusion. Producer-consumer fusion merges a ParallelDo C feeding a ParallelDo D into a single C+D; sibling fusion merges ParallelDos C and D that read the same input into one C+D. (Diagram legend: ParallelDo, GroupByKey (GBK), CombineValues.)

Optimizer: combiner lifting. A CombineValues that follows a GroupByKey is lifted so that partial per-key combining (A+, B+) happens before the GBK, and the partial results are merged afterwards. (Diagram legend: ParallelDo, GroupByKey (GBK), CombineValues.)
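A sketch of the kind of transform combiner lifting applies to, assuming the 1.x Dataflow Java SDK: writing the aggregation as a per-key Sum (a Combine) rather than a GroupByKey plus a summing ParDo lets the service pre-aggregate on each worker before the shuffle.

// Sketch: an aggregation expressed as a Combine so the optimizer can lift
// partial combining ahead of the GroupByKey (1.x Dataflow Java SDK assumed).
import com.google.cloud.dataflow.sdk.transforms.Sum;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

class CombinerLiftingExample {
  static PCollection<KV<String, Long>> sumPerKey(
      PCollection<KV<String, Long>> counts) {
    // Equivalent to GroupByKey followed by summing each key's values, but the
    // combine function is known to the service, so each worker can pre-aggregate
    // its local values before they are shuffled.
    return counts.apply(Sum.<String>longsPerKey());
  }
}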

Worker Lifecycle Management Deploy Schedule & Monitor Tear Down

Worker Pool Auto-Scaling: the worker pool is resized as load changes (e.g. 50, 800, 1,200, and 5,000 RPS).

Dynamic Work Rebalancing: e.g. 100 mins. vs. 65 mins. for the same job.

Monitoring

Fully-managed Service. Pipeline management: validation, pipeline execution graph optimizations, dynamic and adaptive sharding of computation stages, Monitoring UI. Cloud resource management: spin up worker VMs, set up logging, manage exports, teardown.

Benefits of Dataflow on Google Cloud Platform. Ease of use: no performance tuning required; highly scalable and performant out of the box; novel techniques to lower end-to-end execution time; intuitive programming model + Java SDK; no dichotomy between batch and streaming processing; integrated with GCP (VMs, GCS, BigQuery, Datastore, ...). Total cost of ownership: benefits from GCE's cost model.

Optimizing your time: no-ops, no-knobs, zero-config. In typical data processing, time is split across programming, monitoring, resource provisioning, performance tuning, handling growing scale, utilization improvements, deployment & configuration, and reliability. Data processing with Google Cloud Dataflow leaves more time for programming and digging into your data.

Demo: Counting Words!

Highlights from the live demo: WordCount code (see the SDK concepts in action); running on the Dataflow Service; monitoring job progress in the Dataflow Monitoring UI; looking at worker logs in Cloud Logging; using the CLI.
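For reference, a minimal WordCount along the lines of the one demoed, assuming the 1.x Dataflow Java SDK; the GCS paths are placeholders, not the ones used in the live demo.

// Sketch: a minimal WordCount in the spirit of the demo (1.x Dataflow Java SDK
// assumed; the GCS paths are placeholders).
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class MinimalWordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(TextIO.Read.from("gs://my-bucket/input/*"))
        // Split each line into words.
        .apply(ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            for (String word : c.element().split("[^a-zA-Z']+")) {
              if (!word.isEmpty()) {
                c.output(word);
              }
            }
          }
        }))
        // Count occurrences of each word.
        .apply(Count.<String>perElement())
        // Format and write the results.
        .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(c.element().getKey() + ": " + c.element().getValue());
          }
        }))
        .apply(TextIO.Write.to("gs://my-bucket/output/wordcounts"));

    p.run();
  }
}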

Questions and Discussion

Getting Started cloud.google.com/dataflow/getting-started github.com/googlecloudplatform/dataflowjavasdk stackoverflow.com/questions/tagged/google-cloud-dataflow