Big Data on Google Cloud

Big Data on Google Cloud
Using Cloud Dataflow, BigQuery, and friends to process data the Cloud way
William Vambenepe, Lead Product Manager for Big Data, Google Cloud Platform
@vambenepe / vbp@google.com

Agenda
1. Big Data at Google
2. Managing data through its lifecycle
3. Google Cloud Dataflow
4. Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...
5. Optimizing your time
6. References and follow-up

Big Data at Google

Building on Google's infrastructure
- 1.5 million devices activated every day (over a billion devices)
- 6 billion hours watched every month (10h uploaded every minute)
- 20 billion pages crawled every day

Hardware and data center innovation

Software innovation (timeline figure)
Systems shown: BigQuery, MapReduce, Cloud Dataflow, Flume, Dremel, GFS, Big Table, MillWheel, Spanner, Pregel, Colossus
Time axis: 2002, 2004, 2006, 2008, 2010, 2012, 2013

Managing data through its lifecycle

Data lifecycle (diagram)
Labels: Google App Engine, Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage; Stream, Batch; Cloud Dataflow; BigQuery Storage (tables), Cloud Storage (files); BigQuery Analytics (SQL), Cloud Dataflow; Real-time analytics & alerts

Data usage organization maturity lifecycle
- Prescriptive + Predictive + Exploratory + Descriptive
- Predictive + Exploratory + Descriptive
- Exploratory + Descriptive
- Descriptive

Supporting organizations with operational ease of use
- no administration
- most powerful tools in the easiest way
- constant experimentation with low risks & cost
- easy collaboration across teams and organizations
- low costs without requiring usage commitments
- best performance & virtually unlimited scale
- always on

Google Cloud Dataflow

What is Cloud Dataflow?
- Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines
- Cloud Dataflow is a managed service for executing parallelized data processing pipelines
Download from GitHub: https://github.com/googlecloudplatform/dataflowjavasdk
Use on Google Cloud: https://cloud.google.com/dataflow/

Cloud Dataflow SDK - Logical Model
Pipeline {
  Who => Inputs
  What => Transforms
  Where => Windows
  When => Watermarks + Triggers
  To => Outputs
}
Unified programming model for both batch & stream processing.
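
The five parts of the logical model can be illustrated with a small plain-Python sketch. This does not use the Dataflow SDK; the event data, the 10-second fixed window, and the names `window_start` and `run_pipeline` are all assumptions made up for illustration. It only shows how inputs, a transform, a window assignment, and a watermark-based emission rule fit together.

```python
from collections import defaultdict

# Hypothetical input: (event_time_seconds, key, value) tuples.
EVENTS = [
    (1, "a", 10),
    (4, "b", 5),
    (12, "a", 7),
    (14, "a", 1),
    (21, "b", 2),
]

WINDOW_SIZE = 10  # seconds; an assumed fixed-window width


def window_start(event_time, size=WINDOW_SIZE):
    """Where: assign an event to the fixed window containing its timestamp."""
    return (event_time // size) * size


def run_pipeline(events, watermark):
    """Sum values per (window, key); emit only windows the watermark has passed."""
    sums = defaultdict(int)
    for event_time, key, value in events:  # Who: inputs
        sums[(window_start(event_time), key)] += value  # What: transform
    # When: treat a window as complete once the watermark reaches its end.
    return {
        window_key: total
        for window_key, total in sums.items()
        if window_key[0] + WINDOW_SIZE <= watermark
    }  # To: outputs


results = run_pipeline(EVENTS, watermark=20)
print(results)
```

With the watermark at 20, the windows starting at 0 and 10 are emitted and the window starting at 20 is held back. A real streaming system would also need a policy for late data, which this sketch leaves out.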

Cloud Dataflow Pipeline
- A Directed Acyclic Graph of data processing transformations
- Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
- May include multiple inputs and multiple outputs
- May encompass many logical MapReduce operations
- PCollections flow through the pipeline
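
The idea of a pipeline as a directed acyclic graph with multiple inputs and outputs can be sketched in plain Python. This is not the Dataflow SDK: the step names, the `STEPS` table, and `run_dag` are hypothetical, and the sketch simply runs each step after its upstream steps using the standard library's topological sort (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps a name to (function, names of upstream steps).
# All step names and data are illustrative only.
STEPS = {
    "read_clicks": (lambda: [3, 1, 2], []),
    "read_views": (lambda: [10, 20], []),
    "merge": (lambda clicks, views: clicks + views, ["read_clicks", "read_views"]),
    "total": (lambda merged: sum(merged), ["merge"]),
    "largest": (lambda merged: max(merged), ["merge"]),
}


def run_dag(steps):
    """Execute every step after its upstream steps; return all step outputs."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        outputs[name] = func(*(outputs[dep] for dep in deps))
    return outputs


outputs = run_dag(STEPS)
print(outputs["total"], outputs["largest"])
```

Here two inputs feed one merge step, which in turn feeds two outputs. A cycle in `STEPS` would make `static_order()` raise `graphlib.CycleError`, which is why the graph has to be acyclic.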

Life of a Dataflow Pipeline (diagram)
Labels: User Code & SDK, Graph optimization, Managed Service, Job Manager, Work Manager, Deploy & Schedule, Progress & Logs, Monitoring UI, Google Cloud Platform

Worker Lifecycle Management throughout batch execution
- Deploy
- Schedule & Monitor
- Tear Down

Worker Optimization 100 mins. vs. 65 mins.

Continuous worker scaling for long-lived streaming pipelines
(chart of load over time; labeled load levels: 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS)
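
As a rough illustration of scaling workers with load, the sketch below sizes a worker pool for each load level named on the slide. The per-worker capacity of 100 RPS, the bounds, and the function `workers_needed` are assumptions invented for this example; the slide does not say how the Dataflow service actually decides, and the order of the load levels in the original chart is not recoverable from the transcript.

```python
import math


def workers_needed(rps, per_worker_rps, min_workers=1, max_workers=100):
    """Return a worker count that covers `rps`, clamped to the given bounds."""
    needed = math.ceil(rps / per_worker_rps)
    return max(min_workers, min(max_workers, needed))


# Load levels taken from the slide, in the order they appear in the transcript.
LOAD_RPS = [800, 1200, 5000, 50]

# Assumed capacity: each worker handles 100 requests per second.
plan = [workers_needed(rps, per_worker_rps=100) for rps in LOAD_RPS]
print(plan)
```

Under these assumptions the pool grows from 8 to 50 workers at peak and shrinks to a single worker when load drops to 50 RPS.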

Portability: Cloud Dataflow Runners
Run the same code in multiple modes using different runners
- Direct Runner: for local, in-memory execution. Great for development and unit tests
- Cloud Dataflow Service Runner: runs on the fully managed Dataflow Service. Your code runs distributed across GCE instances
- Community sourced: Spark runner @ github.com/cloudera/spark-dataflow; Flink runner coming soon from dataartisans
The most productive and portable Data pipeline SDK.
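
The runner idea, one pipeline definition executed by interchangeable back ends, can be sketched without any SDK. The classes `LocalRunner` and `PooledRunner` and the `square` transform are hypothetical names for this illustration and are not the Dataflow runner classes; the thread pool merely stands in for a distributed service.

```python
from concurrent.futures import ThreadPoolExecutor


def square(element):
    """The user-defined transform; it knows nothing about how it is executed."""
    return element * element


class LocalRunner:
    """Runs the transform in-process, one element at a time."""

    def run(self, transform, elements):
        return [transform(element) for element in elements]


class PooledRunner:
    """Runs the same transform on a pool of worker threads."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, transform, elements):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(transform, elements))


data = [1, 2, 3, 4]
local_result = LocalRunner().run(square, data)
pooled_result = PooledRunner().run(square, data)
print(local_result == pooled_result)
```

Because the transform is independent of the runner, the in-process runner is convenient for unit tests while the same code can later be handed to a different execution back end.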

Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...

Cloud Pub/Sub
- Many-to-many asynchronous messaging
- Fast and reliable
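
Many-to-many asynchronous messaging can be illustrated with a toy in-memory topic. This is not the Cloud Pub/Sub API; the `Topic` class and its `subscribe`, `publish`, and `pull` methods are assumptions made up for the sketch. Any number of publishers can post to the topic, each subscription gets its own copy of every message, and publishers never wait for subscribers to read.

```python
from collections import deque


class Topic:
    """A toy in-memory topic with one independent queue per subscription."""

    def __init__(self):
        self._subscriptions = {}

    def subscribe(self, name):
        """Create a subscription; it receives messages published from now on."""
        self._subscriptions.setdefault(name, deque())

    def publish(self, message):
        """Deliver a copy of the message to every subscription, without blocking."""
        for queue in self._subscriptions.values():
            queue.append(message)

    def pull(self, name):
        """Return the next message for a subscription, or None if it is empty."""
        queue = self._subscriptions[name]
        return queue.popleft() if queue else None


topic = Topic()
topic.subscribe("billing")
topic.subscribe("analytics")

# Two independent publishers post to the same topic.
topic.publish("order-1 from web frontend")
topic.publish("order-2 from mobile app")

first_billing = topic.pull("billing")
first_analytics = topic.pull("analytics")
print(first_billing, "|", first_analytics)
```

Each subscription drains at its own pace, so a slow consumer does not hold back a fast one. A real service would add durable storage, acknowledgements, and redelivery, none of which is modeled here.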

BigQuery
- Ingest data via streaming (100K rows/second/table) or file loader
- Process interactive SQL queries on TB or PB of data
- Zero administration; just upload data and send queries
- Pay for storage and query separately, based on actual usage
- Non-technical analysts can drive queries on massive datasets using BI tools (e.g. Tableau)
- Highly available: data replication in multiple geographies
- Secure and easy collaboration: access to data is controlled using customer-owned ACLs

Hadoop and Spark (cluster diagram)
Labels: bdutil orchestration; Master Node; Name Node (optional); Work Nodes (three shown); HDFS (optional); Connectors: BigQuery Connector, GCS Connector; Local SSD, PD SSD, PD standard

Optimizing your time

Optimizing Your Time

References and follow-up

Getting Started
- Cloud Dataflow Service: https://cloud.google.com/dataflow
- Questions: https://stackoverflow.com/questions/tagged/google-cloud-dataflow
- SDK: https://github.com/googlecloudplatform/dataflowjavasdk
- BigQuery: https://cloud.google.com/bigquery/
- Cloud Pub/Sub: https://cloud.google.com/pubsub/
- Hadoop and Spark: https://cloud.google.com/hadoop/
Contact me
- Twitter: @vambenepe
- email: vbp@google.com

Thank You! cloud.google.com