Querying Massive Data Sets in the Cloud with Google BigQuery and Java Kon Soulianidis JavaOne 2014
Agenda Massive Data Qualification A Big Data Problem What is BigQuery? BigQuery Java API Getting data in & out BigQuery streaming meets Java 8 Streams
Massive Data Credentials Have you ever written a report? Has it taken a long time to run? > 1 minute > 5 minutes > 20 minutes > 1 hour > 1 day Hadoop & other Big Data tools?
Our Project And why BigQuery over everything else?
Client Problem Domain Client runs a number of popular Australian websites Knows who is logged in 3rd party analytics service tracks site visits Reports of what their users were looking at, and encoded demographic data A year's worth of logs was 30 billion rows, 15 TB
Open Questions Which groups (demographics) were looking at which sites? Were there patterns? Could they adapt their content? Could they change advertising shown based on who used the site most? Can we have real time analysis and live dashboards of KPIs?
Solutions Batch processes to pull new logs into BQ Single Page WebApp for reports (JS) Dashboard WebApp Spring Boot backend app polling BQ via the API Clients connecting via WebSockets, receiving updates every 3 seconds
What is BigQuery?
What is BigQuery? A DataStore, Hosted in the Cloud, Uses SQL Queries, Hierarchical Data Structures, Sourced from JSON and CSVs, Blazingly Fast

CSV:   Name, Age
       Duke, 20
JSON:  { "name": "Duke", "age": 20 }
SQL:   SELECT name, age FROM cloud_files
Just how fast? Analysing a month's worth of web impressions (1.7 billion rows, 700 GB): 10 seconds. Aggregating a single TB of data: 25 seconds. Analysing a year's worth of web impressions (30 billion rows, 15 terabytes): 330 seconds.
Demo of the BigQuery Web UI
BigQuery History & Use Dremel - internal project used at Google Google uses it across its entire product suite: Crawled Web Documents, Android Install Data, Spam Analysis, Crash Reporting, Builds, Resource Monitoring BigQuery is the public implementation
Datatypes String Float Integer Boolean Timestamp Record (Nested / Repeatable)
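When consuming query results in Java, it helps to decide up front which Java type each BigQuery type string maps onto (cell values come back as strings over JSON). A minimal sketch - the class, the method, and the mapping choices here are ours, not part of the BigQuery API; RECORD is omitted because nested/repeated fields need recursive handling:

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class BigQueryTypes {

    // BigQuery type string -> the Java type we choose to parse cell values into.
    private static final Map<String, Class<?>> JAVA_TYPES = new HashMap<>();
    static {
        JAVA_TYPES.put("STRING", String.class);
        JAVA_TYPES.put("FLOAT", Double.class);
        JAVA_TYPES.put("INTEGER", Long.class);
        JAVA_TYPES.put("BOOLEAN", Boolean.class);
        JAVA_TYPES.put("TIMESTAMP", Instant.class);
    }

    public static Class<?> javaTypeFor(String bigQueryType) {
        Class<?> t = JAVA_TYPES.get(bigQueryType);
        if (t == null) {
            throw new IllegalArgumentException("Unmapped BigQuery type: " + bigQueryType);
        }
        return t;
    }
}
```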
Accessing BigQuery WebUI Command Line JDBC connectors (Starschema-SQL Driver) Reporting Tools (Tableau = Happy Clients) API
Pricing: Loading and Streaming Pay as you go pricing model Storage + Compute Transferring CSV/JSON data in is free Storage $0.026 per GB, per month Streaming data is $0.01 per 100,000 inserts (Free until Jan 1st 2015)
Pricing: Queries Queries $5 per TB Charged on the amount of data in the columns referenced Not charged for cached queries or queries that fail LIMITing rows does nothing to reduce cost, because pricing is column based Table Decorators, which restrict based on the age of the data, do reduce cost
Query Quotas 20,000 queries per day Interactive and Batch Limit on the number of concurrent interactive queries https://cloud.google.com/bigquery/quotapolicy
Storage and Scalability How does it get to be so fast?
Columnar Datastore: Traditional DB
Loads in each row, all columns, regardless of whether the query needs them or not.

ID  NAME   GENDER  EMAIL
1   Mary   F       mary@a.com
2   Mark   M       mark@a.com
3   Maria  F       null
4   Mike   M       mike@a.com

SELECT id, email FROM table;
Columnar Datastore: BigQuery
Columns are stored separately; only the data queried is loaded, and better compression is achieved.

ID:     1, 2, 3, 4
NAME:   Mary, Mark, Maria, Mike
GENDER: F, M, F, M
EMAIL:  mary@a.com, mark@a.com, null, mike@a.com

SELECT id, email FROM table;
Tree Architecture The query runs on thousands of nodes simultaneously and combines the results back in seconds, dispatching down a massively parallel tree and aggregating results from the leaves. Source: Google BigQuery Technical White-paper
Alternatives Oracle RDBMS Amazon Redshift Hadoop MapReduce Apache Drill
Why BQ? Easily imports data from Google Cloud Storage No setup fees or hardware required. Easy account setup. You already know SQL. APIs in Java, JS, Python, .NET, and more
MapReduce vs Dremel Source: Google BigQuery Technical White-paper
Java API Connecting and Authenticating Reading the Schema Loading a table Querying results Using Table Decorators
API Authentication OAuth2 - usually a client email and secret Google has a way of setting this up via the cloud console. Different OAuth workflows based on whether your app is web based, server to server, desktop or mobile only. Cloud console allows you to set these up independently and provide credentials for each scenario
Authenticating

/**
 * Builders for Credential and Bigquery objects
 */
public class BigQueryServiceAccountAuth {

    private static final String SCOPE = "https://www.googleapis.com/auth/bigquery";
    private static final String SCOPE_INSERTDATA = "https://www.googleapis.com/auth/cloud-platform";
    private static final HttpTransport TRANSPORT = new NetHttpTransport();
    private static final JsonFactory JSON_FACTORY = new JacksonFactory();
    private static final ServiceAccountCredentials CREDENTIALS = new ServiceAccountCredentials();

    public static GoogleCredential newCredential()
            throws GeneralSecurityException, IOException, URISyntaxException {
        return new GoogleCredential.Builder()
                .setTransport(TRANSPORT)
                .setJsonFactory(JSON_FACTORY)
                .setServiceAccountId(CREDENTIALS.getServiceAccountId())
                .setServiceAccountScopes(Arrays.asList(SCOPE, SCOPE_INSERTDATA))
                .setServiceAccountPrivateKeyFromP12File(CREDENTIALS.getP12File())
                .build();
    }

    public static Bigquery newBigquery(GoogleCredential credential) {
        return new Bigquery.Builder(TRANSPORT, JSON_FACTORY, credential)
                .setApplicationName(CREDENTIALS.getApplicationUserAgent())
                .setHttpRequestInitializer(credential)
                .build();
    }
}
Java API Connecting and Authenticating Reading the Schema Loading a table Querying results Using Table Decorators
Query Metadata with Java 8 Streams

Bigquery.Projects.List projectList = bigquery.projects().list();
final ProjectList list = projectList.execute();
final List<ProjectList.Projects> projects = list.getProjects();
projects.parallelStream()
    .peek(p -> out.format("PROJECT Id:%s Name:%s%n", p.getId(), p.getFriendlyName()))
    .map((ProjectList.Projects project) -> LambdaExceptionWrapper.propagate(
        () -> bigquery.datasets().list(project.getId()).execute()))
    .flatMap(datasetList -> datasetList.getDatasets().parallelStream())
    .peek(ds -> out.format("%nDATASET id:%s%n", ds.getId()))
    .map(ds -> LambdaExceptionWrapper.propagate(
        () -> bigquery.tables().list(
            ds.getDatasetReference().getProjectId(),
            ds.getDatasetReference().getDatasetId()).execute()))
    .map(tableList -> Optional.ofNullable(tableList.getTables()))
    .flatMap(tables -> tables.orElse(Collections.emptyList()).parallelStream())
    .forEach(table -> out.format("TABLE id:%s%n", table.getId()));
Query Metadata with Java 8 Streams

PROJECT Id:java-one-talk Name:JavaOneTalk
DATASET id:java-one-talk:inserttest
TABLE id:java-one-talk:inserttest.stream_in
TABLE id:java-one-talk:inserttest.transactions_20140928_22240
TABLE id:java-one-talk:inserttest.wikipedia_edit_stats
TABLE id:java-one-talk:inserttest.wikipedia_title_edit_count
Java API Connecting and Authenticating Reading the Schema Loading a table Querying results Using Table Decorators
Creating a table - Java

Table table = new Table();
table.setDescription("CSVLoadTest.create a table");
String tableName = YOUR_TABLE_NAME;
table.setTableReference(new TableReference()
    .setDatasetId(DATASET_ID)
    .setTableId(tableName));
TableSchema schema = new TableSchema().setFields(
    Arrays.asList(
        new TableFieldSchema().setName("Date").setType("TIMESTAMP"),
        new TableFieldSchema().setName("Amount").setType("FLOAT"),
        new TableFieldSchema().setName("Subscriber_ID").setType("INTEGER")));
table.setSchema(schema);
table = bq.tables().insert(PROJECT_ID, DATASET_ID, table).execute();
Creating a table - Groovy

def table = new Table()
table.setDescription("CSVLoadTest.create a table")
String tableName = newTableName()
table.setTableReference(new TableReference(datasetId: DATASET_ID, tableId: tableName))
def schema = new TableSchema(
    fields: [new TableFieldSchema(name: 'Date', type: 'TIMESTAMP'),
             new TableFieldSchema(name: 'Amount', type: 'FLOAT'),
             new TableFieldSchema(name: 'Subscriber_ID', type: 'INTEGER')])
table.setSchema(schema)
table = bq.tables().insert(PROJECT_ID, DATASET_ID, table).execute()
Loading a CSV File into a table

def table = new Table(description: "CSVLoadTest.load a csv into BQ")
def (TableSchema schema, TableReference tableReference) = configureTableRefAndSchema(table)

// The tricky bit is setting the correct mime type and setting
// the sourceUris to an empty list (null will error)
Job job = bq.jobs().insert(PROJECT_ID,
    new Job(configuration: new JobConfiguration(
        load: new JobConfigurationLoad()
            .setSourceUris(Collections.emptyList())
            .setSchema(schema)
            .setDestinationTable(tableReference)
            .setSkipLeadingRows(1))),
    new FileContent('application/octet-stream', file)).execute()

// poll the job until it is done
while (job.getStatus().getState() == "PENDING") {
    print "."
    job = bq.jobs().get(PROJECT_ID, job.getJobReference()?.getJobId()).execute()
}
Java API Connecting and Authenticating Reading the Schema Loading a table Querying results Using Table Decorators
Querying a Table

QueryResponse response = bq.jobs().query(PROJECT_ID, new QueryRequest(
    defaultDataset: new DatasetReference(projectId: PROJECT_ID, datasetId: DATASET_ID),
    // queries in SQL are great for existing tooling to work with
    query: "SELECT Subscriber_ID, COUNT(*) as Transactions " +
           "FROM transactions_20140929_162727 GROUP BY Subscriber_ID",
    // defaults worth mentioning:
    // used to test if a query has any errors; returns an empty result if none
    dryRun: false,
    // how long the API call waits before handing back; jobComplete may be
    // false and the job may need to be polled
    timeoutMs: 10_000,
    // save money & reuse the results of cached queries from the last 24 hrs
    useQueryCache: true,
    // rows to return per page; for large result sets keep the page size
    // down (~1000) to improve reliability - also an interesting way to
    // batch pages of results back
    maxResults: null,
)).execute()
queryResponseUntilComplete(response)
Java API Connecting and Authenticating Reading the Schema Loading a table Querying results Using Table Decorators
Table Decorators
We can provide a time range for how long ago to get results from. For example, results in the last hour (3,600,000 milliseconds ago):

def query = "SELECT * FROM [$DATASET_ID:stream_in@-3600000--1000]"
Job job = bq.jobs().insert(PROJECT_ID, new Job(configuration:
    new JobConfiguration(query: new JobConfigurationQuery(
        query: query,
        // covers the case where the query returns a large dataset
        allowLargeResults: true,
        destinationTable: new TableReference(projectId: PROJECT_ID,
            datasetId: DATASET_ID, tableId: newTableName()),
        // use BATCH priority if you don't want to fill your pipeline
        priority: 'INTERACTIVE',
        // use cached results, but this doesn't work when a destination table is specified
        useQueryCache: true,
        // there are also WRITE_EMPTY and WRITE_TRUNCATE
        writeDisposition: 'WRITE_APPEND',
    )))).execute()
queryJobUntilDone(job)
Gotchas and other annoyances
Annoyances, Running SQL Queries Prefers denormalized tables Joining too many small tables impacts performance When to use the EACH keyword in GROUP BY and JOIN Uncertain error messages: "Resources Exceeded"
Gotchas The Shuffle error: trying to de-duplicate a large table using a GROUP EACH BY over all rows Hitting the concurrent rate limit when our ETL processes were running alongside large ad-hoc queries in our reporting tools
Annoyances, API API error messages sometimes suck

400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Required parameter is missing",
    "reason" : "required"
  } ],
  "message" : "Required parameter is missing"
}
Conclusion BigQuery - everyone can use big data Fast means of accessing the data API provides integration with other tools and frameworks SQL-like integration means older reporting tools can run queries against it Enables the use of Java 8 features
References An inside look at Google BigQuery parleys.com - How Google BigQuery works (for dummies) by Martin Gorner BigQuery Java API BigQuery Query Reference
Image Credits Clouds by Paul VanDerWerf, Creative Commons 2.0 Google BigQuery Datacenter images by Connie Zhou http://www.google.com/about/datacenters/gallery/#/tech Let us solve the big data problem, The Marketoonist Sheriff Badge, Runner on starting blocks, 17.5 speed sign, Photodune Tree Architecture & Performance Comparisons with MapReduce, Google MelbJVM Logo by Noel Richards www.nsquaredstudio.com.au
Thank you Kon Soulianidis Principal Engineer Gradleware Organiser MelbJVM kon at outrospective dot org http://melbjvm.com http://gradleware.com http://github.com/neversleepz