Big Data on Google Cloud

1 Big Data on Google Cloud
Using Cloud Dataflow, BigQuery, and friends to process data the Cloud way
William Vambenepe, Lead Product Manager for Big Data, Google Cloud

2 Agenda
1. Big Data at Google
2. Managing data through its lifecycle
3. Google Cloud Dataflow
4. Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...
5. Optimizing your time
6. References and follow-up

3 Big Data at Google

4 Building on Google's infrastructure
- 1.5 million devices activated every day (over a billion devices)
- 6 billion hours watched every month (10h uploaded every minute)
- 20 billion pages crawled every day

5 Hardware and data center innovation

6 Software innovation: BigQuery, MapReduce, Cloud Dataflow, Flume, Dremel, GFS, Bigtable, MillWheel, Spanner, Pregel, Colossus

7 Managing data through its lifecycle

8 Data lifecycle (diagram labels): Real-time analytics & alerts; Google App Engine; Cloud Pub/Sub; Cloud Logs; Google Analytics Premium; Google Cloud Storage; Stream; Batch; Cloud Dataflow; BigQuery Storage (tables); Cloud Storage (files); BigQuery Analytics (SQL); Cloud Dataflow

9 Data usage organization maturity lifecycle
- Prescriptive + Predictive, Exploratory, Descriptive
- Predictive + Exploratory, Descriptive
- Exploratory + Descriptive
- Descriptive

10 Supporting organizations with operational ease of use
- no administration
- most powerful tools in the easiest way
- constant experimentation with low risks & cost
- easy collaboration across teams and organizations
- low costs without requiring usage commitments
- best performance & virtually unlimited scale
- always on

11 Google Cloud Dataflow

12 Data lifecycle (diagram labels): Real-time analytics & alerts; Google App Engine; Cloud Pub/Sub; Cloud Logs; Google Analytics Premium; Google Cloud Storage; Stream; Batch; Cloud Dataflow; BigQuery Storage (tables); Cloud Storage (files); BigQuery Analytics (SQL); Cloud Dataflow

13 What is Cloud Dataflow?
- Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines
- Cloud Dataflow is a managed service for executing parallelized data processing pipelines
Download from GitHub:
Use on Google Cloud:

14 Cloud Dataflow SDK - Logical Model
Pipeline {
  Who   => Inputs
  What  => Transforms
  Where => Windows
  When  => Watermarks + Triggers
  To    => Outputs
}
Unified programming model for both batch and stream processing.
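To make the What / Where / When split concrete, here is a minimal sketch in plain Python. It does not use the Dataflow SDK; the function names (`fixed_window`, `windowed_sums`, `closed_windows`), the 60-second window size, and the sample events are all assumptions made for illustration. It sums values per key (What), groups them into fixed event-time windows (Where), and releases only the windows that a watermark has passed (When). Triggers are left out.

```python
from collections import defaultdict

def fixed_window(timestamp, size):
    """Return the [start, end) fixed window that contains an event timestamp."""
    start = (timestamp // size) * size
    return (start, start + size)

def windowed_sums(events, size):
    """What: sum values per key. Where: per fixed event-time window."""
    sums = defaultdict(int)
    for key, value, timestamp in events:
        sums[(key, fixed_window(timestamp, size))] += value
    return dict(sums)

def closed_windows(sums, watermark):
    """When: keep only windows whose end is at or before the watermark."""
    return {k: v for k, v in sums.items() if k[1][1] <= watermark}

# Hypothetical (key, value, event_time_seconds) records.
events = [("alice", 3, 5), ("bob", 2, 12), ("alice", 4, 58), ("alice", 1, 61)]
sums = windowed_sums(events, size=60)
ready = closed_windows(sums, watermark=60)
```

With the watermark at 60, only the two results for the window [0, 60) are in `ready`; the event at time 61 stays pending in the [60, 120) window.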

15 Cloud Dataflow Pipeline
- A directed acyclic graph (DAG) of data processing transformations
- Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
- May include multiple inputs and multiple outputs
- May encompass many logical MapReduce operations
- PCollections flow through the pipeline
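The idea of collections flowing through a graph of transformations can be sketched without the SDK. The `ToyCollection` class below is a hypothetical stand-in for a PCollection, not the Dataflow API: each transform returns a new collection, and one collection can feed several downstream branches, which is what makes the pipeline a graph with multiple outputs rather than a single chain.

```python
class ToyCollection:
    """Hypothetical stand-in for a PCollection: an immutable bag of elements."""

    def __init__(self, elements):
        self.elements = tuple(elements)

    def flat_map(self, fn):
        """Apply fn to each element and flatten the resulting iterables."""
        return ToyCollection(out for e in self.elements for out in fn(e))

    def filter(self, predicate):
        """Keep only the elements for which predicate is true."""
        return ToyCollection(e for e in self.elements if predicate(e))

    def count_per_element(self):
        """Group identical elements and count them, sorted by element."""
        counts = {}
        for e in self.elements:
            counts[e] = counts.get(e, 0) + 1
        return ToyCollection(sorted(counts.items()))

# One input, one shared intermediate collection, two outputs.
lines = ToyCollection(["the cloud way", "the data way"])
words = lines.flat_map(str.split)
word_counts = words.count_per_element()              # output 1
long_words = words.filter(lambda w: len(w) > 3)      # output 2
```

Both `word_counts` and `long_words` read from `words`, so a service that sees the whole graph can compute `words` once and share it.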

16 Life of a Dataflow Pipeline (diagram labels): User Code & SDK; Graph optimization; Deploy & Schedule; Managed Service (Job Manager, Work Manager); Progress & Logs; Monitoring UI; Google Cloud Platform

17 Worker lifecycle management throughout batch execution: Deploy, Schedule & Monitor, Tear Down

18 Worker Optimization 100 mins. vs. 65 mins.

19 Continuous worker scaling for long-lived streaming pipelines (chart: load over time at 800 RPS, 1,200 RPS, 5,000 RPS, then 50 RPS)
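As a rough illustration of what scaling workers with load means, the sketch below sizes a worker pool for each load level on the chart. The policy, the per-worker capacity of 250 RPS, and the worker limits are assumptions chosen for the example; the slides do not describe the algorithm the service actually uses.

```python
import math

def workers_needed(rps, per_worker_rps=250, min_workers=1, max_workers=100):
    """Hypothetical policy: enough workers for the load, clamped to a range."""
    target = math.ceil(rps / per_worker_rps)
    return max(min_workers, min(max_workers, target))

# Load levels taken from the chart, in the order they appear over time.
load_over_time = [800, 1200, 5000, 50]
plan = [workers_needed(rps) for rps in load_over_time]
# plan == [4, 5, 20, 1]
```

Under these assumed numbers the pool grows from 4 to 20 workers as load peaks and shrinks back to 1 when traffic drops, so a long-lived pipeline does not have to be provisioned for its peak.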

20 Portability: Cloud Dataflow Runners
Run the same code in multiple modes using different runners:
- Direct Runner: for local, in-memory execution. Great for development and unit tests.
- Cloud Dataflow Service Runner: runs on the fully managed Dataflow Service. Your code runs distributed across GCE instances.
- Community sourced: Spark runner at github.com/cloudera/spark-dataflow; Flink runner coming soon from data Artisans.
The most productive and portable data pipeline SDK.
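The runner idea, one pipeline definition executed by interchangeable back ends, can be sketched in plain Python. Nothing here is the Dataflow SDK: `build_steps`, `InMemoryRunner`, and `ShardedRunner` are hypothetical names, and the sharded runner only imitates distribution by processing contiguous slices of the input one after another.

```python
import math

def build_steps():
    """The pipeline definition: element-wise steps, independent of any runner."""
    return [str.strip, str.lower, len]

class InMemoryRunner:
    """Runs every element through every step in a single local loop."""

    def run(self, steps, data):
        results = []
        for element in data:
            for step in steps:
                element = step(element)
            results.append(element)
        return results

class ShardedRunner:
    """Splits the input into contiguous shards as a stand-in for distribution."""

    def __init__(self, num_shards):
        self.num_shards = num_shards

    def run(self, steps, data):
        size = max(1, math.ceil(len(data) / self.num_shards))
        shards = [data[i:i + size] for i in range(0, len(data), size)]
        local = InMemoryRunner()
        results = []
        for shard in shards:
            results.extend(local.run(steps, shard))
        return results

data = ["  BigQuery ", "Pub/Sub", " Dataflow"]
direct = InMemoryRunner().run(build_steps(), data)
sharded = ShardedRunner(num_shards=2).run(build_steps(), data)
```

Both runners return the same result for the same steps, which is the property that lets a pipeline be tested locally and then submitted elsewhere unchanged.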

21 Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...

22 Data lifecycle (diagram labels): Real-time analytics & alerts; Google App Engine; Cloud Pub/Sub; Cloud Logs; Google Analytics Premium; Google Cloud Storage; Stream; Batch; Cloud Dataflow; BigQuery Storage (tables); Cloud Storage (files); BigQuery Analytics (SQL); Cloud Dataflow

23 Cloud Pub/Sub
- Many-to-many asynchronous messaging
- Fast and reliable
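Many-to-many asynchronous messaging can be illustrated with a small in-memory model. `ToyTopic` is a hypothetical class, not the Cloud Pub/Sub client: any number of publishers write to a topic, every subscription receives its own copy of each message, and subscribers pull whenever they are ready instead of being called by the publisher.

```python
from collections import deque

class ToyTopic:
    """Hypothetical in-memory topic with one message queue per subscription."""

    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        """Create a subscription; it receives messages published from now on."""
        self.subscriptions[name] = deque()

    def publish(self, message):
        """Fan the message out to every subscription's queue."""
        for queue in self.subscriptions.values():
            queue.append(message)

    def pull(self, name):
        """Return the oldest pending message for a subscription, or None."""
        queue = self.subscriptions[name]
        return queue.popleft() if queue else None

topic = ToyTopic()
topic.subscribe("analytics")
topic.subscribe("alerts")
topic.publish("event from publisher A")   # two independent publishers
topic.publish("event from publisher B")
first_for_analytics = topic.pull("analytics")
first_for_alerts = topic.pull("alerts")
```

Each subscription drains at its own pace, so a slow consumer does not hold up a fast one.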

24 BigQuery
- Ingest data via streaming (100K rows/second/table) or file loader
- Process interactive SQL queries on TB or PB of data
- Zero administration; just upload data and send queries
- Pay for storage and query separately, based on actual usage
- Non-technical analysts can drive queries on massive datasets using BI tools (e.g. Tableau)
- Highly available: data replication in multiple geographies
- Secure and easy collaboration: access to data is controlled using customer-owned ACLs
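The streaming figure above lends itself to a quick back-of-the-envelope check. The sketch below uses only the 100K rows/second/table number from the slide; the row count and the assumption that rows spread evenly across tables are made up for the example.

```python
STREAMING_ROWS_PER_SECOND = 100_000  # per-table figure quoted on the slide

def seconds_to_stream(total_rows, tables=1):
    """Time to stream total_rows at the quoted rate, split evenly over tables."""
    return total_rows / (STREAMING_ROWS_PER_SECOND * tables)

# Hypothetical workload: one billion rows.
one_table = seconds_to_stream(1_000_000_000)
four_tables = seconds_to_stream(1_000_000_000, tables=4)
```

At the quoted rate, one billion rows take 10,000 seconds (a little under three hours) into a single table, or 2,500 seconds if the data is spread evenly over four tables.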

25 Hadoop and Spark (diagram labels): Master Node; Name Node (optional); Work Nodes; HDFS (optional); Connectors (BigQuery Connector, GCS Connector); Local SSD; PD SSD; PD standard; bdutil orchestration

26 Optimizing your time

36 References and follow-up

37 Getting Started Cloud Dataflow Service: Questions: SDK: BigQuery Cloud Pub/Sub Hadoop and Spark Contact me

38 Thank You! cloud.google.com
