Google Cloud Data Platform & Services. Gregor Hohpe


All About Data

We Have More of It: Internet data is more easily available; logs capture user & system behavior; cheap storage lets us keep more of it.

Beyond just Relational (Structure): Relational (Hosted SQL), Record-oriented (Bigtable), Nested (Protocol Buffers), Graphs (Pregel).

Beyond just Relational (Analysis needs): Number crunching (MapReduce / FlumeJava), Ad hoc queries (Dremel), Precise vs. estimated results (Sawzall), Model generation & prediction (Prediction API).

Google Storage for Developers Store your data in Google's cloud

What Is Google Storage? Store your data in Google's cloud o any format, any amount, any time You control access to your data o private, shared, or public Access via Google APIs or 3rd party tools/libraries

Sample Use Cases: Static content hosting, e.g. static HTML, images, music, video. Backup and recovery, e.g. personal data, business records. Sharing, e.g. sharing data with your customers. Data storage for applications, e.g. as a storage backend for Android, App Engine, and cloud-based apps. Storage for computation, e.g. BigQuery, Prediction API.

Core Features: RESTful API (GET, PUT, POST, HEAD, DELETE; resources identified by URI; compatible with S3). Organized in buckets: flat containers, i.e. no bucket hierarchy. Buckets contain objects: any type, up to 100 GB per object. Access control for Google accounts, for individuals and groups. Authenticate by signing requests with access keys, or via web browser login.
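The bucket/object addressing scheme behind that RESTful API can be sketched in a few lines of Python. This is an illustration only: the helper name is invented, and the endpoint host reflects the Google Storage documentation of the era, not an official client library.

```python
# Sketch of the flat bucket/object URI scheme described above (illustrative).
def storage_url(bucket, obj=None):
    base = "https://commondatastorage.googleapis.com"
    path = "/" + bucket + ("/" + obj if obj else "")
    return base + path

# A GET on the bucket URL lists its objects; a GET on an object URL fetches it.
print(storage_url("mybucket"))
print(storage_url("mybucket", "photos/cat.jpg"))
```

Because there is no bucket hierarchy, "photos/cat.jpg" above is simply an object name that happens to contain a slash, not a nested folder.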

Performance and Scalability: Objects of any type, up to 100 GB per object. Unlimited number of objects; thousands of buckets. All data replicated to multiple US data centers. Data delivery leverages Google's worldwide network. Bucket names matching your domain names are reserved for your use. Read-your-writes data consistency. Range GET support.

gsutil (command-line tool for accessing Google Storage)

Pricing & Availability: Storage $0.17/GB/month. Network: upload to Google $0.10/GB; download from Google $0.15/GB for the Americas and EMEA, $0.30/GB for APAC. Requests: PUT, POST, LIST $0.01 per 1,000 requests; GET, HEAD $0.01 per 10,000 requests. Free storage (up to 100 GB) during the limited preview period; no SLA. http://code.google.com/apis/storage

GOOGLE PREDICTION API GOOGLE'S PREDICTION ENGINE IN THE CLOUD

Introducing the Google Prediction API Google's machine learning technology Now available externally RESTful Web service

How does it work? 1. TRAIN: The Prediction API finds relevant features in the sample data during training. Labeled examples: "english": The quick brown fox jumped over the lazy dog. "english": To err is human, but to really foul things up you need a computer. "spanish": No hay mal que por bien no venga. "spanish": La tercera es la vencida. 2. PREDICT: The Prediction API later searches for those features during prediction. ?: To be or not to be, that is the question. ?: La fe mueve montañas.

Countless applications: customer sentiment, transaction risk, species identification, message routing, diagnostics, churn prediction, legal docket classification, suspicious activity, work roster assignment, inappropriate content, product recommendation, political bias, uplift marketing, email filtering, career counseling... and many more.

Using Your Data with Prediction API & BigQuery: 1. Upload: upload your data to Google Storage. 2. Process: import it into BigQuery tables and train a Prediction API model. 3. Act: run queries and make predictions from your apps.

Step 1: Upload Training data: outputs and input features Data format: comma separated value format (CSV), result in first column "english","to err is human, but to really..." "spanish","no hay mal que por bien no venga."... Upload to Google Storage gsutil cp ${data} gs://yourbucket/${data}
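Assembling that training file is plain CSV work: label in the first column, input text after it. A minimal sketch using Python's standard csv module, with rows mirroring the slide's examples:

```python
# Build the training CSV described above: result label first, features after.
import csv
import io

rows = [
    ("english", "to err is human, but to really foul things up you need a computer."),
    ("spanish", "no hay mal que por bien no venga."),
]
buf = io.StringIO()
csv.writer(buf).writerows(rows)   # quotes any text field containing commas
print(buf.getvalue())
# The resulting file is what gets copied up with: gsutil cp <file> gs://yourbucket/
```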

Step 2: Train To train a model: POST prediction/v1.1/training?data=mybucket%2fmydata Training runs asynchronously. To check whether it has finished: GET prediction/v1.1/training/mybucket%2fmydata {"data": {"data": "mybucket/mydata", "modelinfo": "estimated accuracy: 0.xx"}}
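Since training is asynchronous, a client polls that GET endpoint and inspects the returned JSON. A sketch of the parsing step, using a canned response (the accuracy value here is invented for illustration):

```python
# Parse the training-status JSON shown above (canned, illustrative response).
import json

status = '{"data": {"data": "mybucket/mydata", "modelinfo": "estimated accuracy: 0.85"}}'
info = json.loads(status)["data"]
done = "estimated accuracy" in info["modelinfo"]   # accuracy appears once trained
print(done)
```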

Step 3: Predict POST prediction/v1.1/query/mybucket%2fmydata/predict { "data": { "input": { "text": [ "J'aime X! C'est le meilleur" ]}}} Response: { "data": { "kind": "prediction#output", "outputlabel": "french", "outputmulti": [ {"label": "french", "score": x.xx}, {"label": "english", "score": x.xx}, {"label": "spanish", "score": x.xx} ]}}
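On the client side, "outputlabel" already names the winner, but the full score distribution lives in "outputmulti". A sketch of picking the top-scoring label from a response shaped like the one above (the scores here are invented):

```python
# Read a prediction response like the one above; scores are made up.
response = {"data": {
    "kind": "prediction#output",
    "outputlabel": "french",
    "outputmulti": [
        {"label": "french",  "score": 0.92},
        {"label": "english", "score": 0.05},
        {"label": "spanish", "score": 0.03},
    ]}}
best = max(response["data"]["outputmulti"], key=lambda e: e["score"])
print(best["label"])  # french
```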

Step 3: Predict Apply the trained model to make predictions on new data:

import httplib
# put new data in JSON format
params = { ... }
header = {"Content-Type": "application/json"}
conn = httplib.HTTPConnection("www.googleapis.com")
conn.request("POST", "/prediction/v1.1/query/mybucket%2fmydata/predict", params, header)
print conn.getresponse()

Prediction API Capabilities Data Input Features: numeric or unstructured text Output: up to hundreds of discrete categories Training Many machine learning techniques Automatically selected Performed asynchronously Access from many platforms: Web app from Google App Engine Apps Script (e.g. from Google Spreadsheet) Desktop app

GOOGLE BIGQUERY INTERACTIVE ANALYSIS OF LARGE DATASETS

Key Capabilities Scalable: Billions of rows Fast: Response in seconds Simple: Queries in SQL-like language Accessible via: o REST o JSON-RPC o Google App Scripts

Writing Queries Compact subset of SQL o SELECT... FROM... WHERE... GROUP BY... ORDER BY... LIMIT...; Common functions o Math, String, Time,... Additional statistical approximations o TOP o COUNT DISTINCT

BigQuery via REST GET /bigquery/v1/tables/{table name} GET /bigquery/v1/query?q={query} Sample JSON reply: { "results": { "fields": [ {"id": "count(*)", "type": "uint64"}, ... ], "rows": [ {"f": [{"v": "2949"}, ...]}, {"f": [{"v": "5387"}, ...]}, ... ] } } Also supports JSON-RPC
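Two small client-side chores follow from that interface: URL-encoding the SQL into the q= parameter, and unnesting the reply's rows (each row has an "f" list of fields, each field a "v" value). A sketch with an invented query and a canned reply shaped like the sample above:

```python
# Compose the query GET above and flatten the nested reply (illustrative data).
from urllib.parse import quote

query = "SELECT COUNT(*) FROM mytable GROUP BY word;"
url = "/bigquery/v1/query?q=" + quote(query)   # SQL must be URL-encoded

reply = {"results": {
    "fields": [{"id": "count(*)", "type": "uint64"}],
    "rows": [{"f": [{"v": "2949"}]}, {"f": [{"v": "5387"}]}]}}
# Each row's "f" list holds field cells; each cell's "v" holds the value.
values = [cell["v"] for row in reply["results"]["rows"] for cell in row["f"]]
print(values)  # ['2949', '5387']
```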

Example Using BigQuery Shell Python DB API 2.0 + B. Clapper's sqlcmd http://www.clapper.org/software/python/sqlcmd/

WORTH READING SOME OF WHAT IS UNDER THE COVERS

FlumeJava Processing Pipeline A Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph.
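The deferred-evaluation idea can be shown with a toy in a few lines. This is not the FlumeJava API (which is Java over MapReduce); it only illustrates the principle that operations record a plan and nothing executes until the pipeline runs, which is what gives the library room to optimize the dataflow graph first.

```python
# Toy deferred-evaluation pipeline in the spirit of FlumeJava (names invented).
class PCollection:
    def __init__(self, plan=()):
        self.plan = plan                      # recorded operations, no results yet

    def parallel_do(self, fn):                # defers work: just extends the plan
        return PCollection(self.plan + (fn,))

    def run(self, data):                      # only now does the plan execute
        for fn in self.plan:
            data = [fn(x) for x in data]      # stand-in for a parallel stage
        return data

pipeline = PCollection().parallel_do(lambda x: x * 2).parallel_do(lambda x: x + 1)
print(pipeline.run([1, 2, 3]))  # [3, 5, 7]
```

Note that building `pipeline` does no work at all; a real implementation would inspect `plan` and fuse the two stages into one pass before executing.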

Sawzall Parallel Data Analysis We present a system for automating analyses. A filtering phase, in which a query is expressed using a new programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
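A toy version of the two-phase split, in Python rather than the Sawzall language: the filtering phase runs independently on each record and emits (key, value) pairs, so it parallelizes trivially; the aggregation phase (a sum table, here a Counter) collates them. The log records are invented.

```python
# Toy two-phase analysis in the spirit of Sawzall (illustrative, not the DSL).
from collections import Counter

log = ["GET /a", "GET /b", "POST /a", "GET /a"]

def filtering_phase(record):          # runs independently per record
    method, _path = record.split()
    yield method, 1                   # emit to the aggregation phase

aggregate = Counter()                 # stands in for a distributed sum table
for record in log:
    for key, count in filtering_phase(record):
        aggregate[key] += count
print(dict(aggregate))  # {'GET': 3, 'POST': 1}
```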

Dremel Ad Hoc Query A scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.
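The columnar-layout idea is easy to demonstrate in miniature: pivoting rows into per-column arrays means an aggregation scans only the column it needs, instead of reading whole records. A toy illustration with invented data (Dremel's actual format additionally encodes nesting with repetition and definition levels):

```python
# Toy row-to-column pivot illustrating the columnar layout mentioned above.
rows = [{"lang": "en", "views": 10},
        {"lang": "es", "views": 4},
        {"lang": "en", "views": 7}]
columns = {name: [r[name] for r in rows] for name in rows[0]}
total_views = sum(columns["views"])   # touches one column, not whole records
print(total_views)  # 21
```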

Pregel Computational Model for Graphs Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms.
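A toy superstep loop shows the vertex-centric style: in each iteration every vertex reads the messages sent in the previous iteration, updates its state, and sends messages along its edges. This classic maximum-propagation example is illustrative only, not the Pregel API, and the graph is invented.

```python
# Toy Pregel-style supersteps: propagate the maximum value through a graph.
graph = {1: [2], 2: [1, 3], 3: [2]}     # adjacency list (undirected chain)
value = {1: 3, 2: 6, 3: 2}              # per-vertex state
msgs = {v: [] for v in graph}           # inboxes for the current superstep

for _ in range(len(graph)):             # enough supersteps for a small chain
    new_msgs = {v: [] for v in graph}
    for v in graph:
        received = max(msgs[v], default=value[v])  # read previous messages
        value[v] = max(value[v], received)         # update own state
        for n in graph[v]:
            new_msgs[n].append(value[v])           # send along edges
    msgs = new_msgs
print(value)  # every vertex ends at the global maximum, 6
```

A real implementation also lets a vertex "vote to halt" so the computation stops when no messages are in flight, rather than running a fixed number of supersteps.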

More information: Google Storage for Developers http://code.google.com/apis/storage; Prediction API http://code.google.com/apis/predict; BigQuery http://code.google.com/apis/bigquery; Dremel http://research.google.com/pubs/pub36632.html