Big Data with the Google Cloud Platform



Similar documents
The Cloud to the rescue!

Google Cloud Platform The basics

Powering the Future with Google Cloud Platform. Brad

API Analytics with Redis and Google Bigquery. javier

Getting Started with Google Cloud Platform

A Quick Introduction to Google's Cloud Technologies

Getting Started with vcloud Air Object Storage powered by Google Cloud Platform

Big Data on Google Cloud

Google Cloud Data Platform & Services. Gregor Hohpe

PUBLIC Performance Optimization Guide

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt

Cloud Elements! Marketing Hub Provisioning and Usage Guide!

PaaS - Platform as a Service Google App Engine

Every Silver Lining Has a Vault in the Cloud

A programming model in Cloud: MapReduce

Lab 1 Whatsup Watson Hands-On Lab

Querying Massive Data Sets in the Cloud with Google BigQuery and Java. Kon Soulianidis JavaOne 2014

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

PaaS Operation Manual

GCM for Android Setup Guide

Technical Overview Simple, Scalable, Object Storage Software

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Introduction to Cloud Computing

Research Paper Available online at: A COMPARATIVE STUDY OF CLOUD COMPUTING SERVICE PROVIDERS

AWS CodePipeline. User Guide API Version

Learning Web App Development

Cloud Models and Platforms

Cloud Computing Training

13.6 PRODUCT RELEASE AND ROADMAP UPDATE ERIC HARLESS SENIOR PRODUCT MANAGER

About This Document 3. Integration and Automation Capabilities 4. Command-Line Interface (CLI) 8. API RPC Protocol 9.

Data Integrator Performance Optimization Guide

IBM Digital Experience. Using Modern Web Development Tools and Technology with IBM Digital Experience

APP DEVELOPMENT ON THE CLOUD MADE EASY WITH PAAS

Create apps with the efficiency of a cold blooded cyborg

Foundations for your. portable cloud

File S1: Supplementary Information of CloudDOE

Building a CloudStack UI for the Enterprise

Cloud Computing. Adam Barker

Collaborative Open Market to Place Objects at your Service

Connected Data. Connected Data requirements for SSO

ur skills.com

Getting Started with AWS. Hosting a Static Website

Cloudfinder for Office 365 User Guide. November 2013

Hadoop & its Usage at Facebook

Developer support in a federated Platform-as-a-Service environment

Visualizing Big Data. Activity 1: Volume, Variety, Velocity

CLOUD COMPUTING. When it's smarter to rent than to buy.. Presented by Anand Tirumani

Product Manual. MDM On Premise Installation Version 8.1. Last Updated: 06/07/15

itunes Store Publisher User Guide Version 1.1

Last time. Today. IaaS Providers. Amazon Web Services, overview

Cloud Service Models. Seminar Cloud Computing and Web Services. Eeva Savolainen

Using IBM dashdb With IBM Embeddable Reporting Service

SUSE Cloud Installation: Best Practices Using an Existing SMT and KVM Environment

Microsoft Azure Data Technologies: An Overview

Developing Cloud Applications using IBM Bluemix. Brian DePradine (Development lead Liberty buildpack)

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

SUSE Cloud. OpenStack End User Guide. February 20, 2015

Oracle Communications WebRTC Session Controller: Basic Admin. Student Guide

What s Happening to the Mainframe? Mobile? Social? Cloud? Big Data?

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi

iway Roadmap: 2011 and Beyond Dave Watson SVP, iway Software

Acronis Backup & Recovery 10 Advanced Server Virtual Edition. Quick Start Guide

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Windows Intune Walkthrough: Windows Phone 8 Management

How to Ingest Data into Google BigQuery using Talend for Big Data. A Technical Solution Paper from Saama Technologies, Inc.

A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL

Cloud. Hosted Exchange Administration Manual

Networks and Services

4/25/2016 C. M. Boyd, Practical Data Visualization with JavaScript Talk Handout

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

Mobile App Analytics Features Overview

System Administration Training Guide. S100 Installation and Site Management

Use of Cloud for the Public Sector Information (PSI) Portal

DreamFactory on Microsoft SQL Azure

RDS Building Centralized Monitoring and Control

Cloud Elements ecommerce Hub Provisioning Guide API Version 2.0 BETA

Copyright Pivotal Software Inc, of 10

Performance Analysis and Design of a Mobile Web Services on Cloud Servers

Introduction to Mobile Access Gateway Installation

Android In The Cloud: A New PaaS Computing Platform

Aspera Direct-to-Cloud Storage WHITE PAPER

This presentation covers virtual application shared services supplied with IBM Workload Deployer version 3.1.

Getting Started with AWS. Static Website Hosting

What s Happening to the Mainframe? Mobile? Social? Cloud? Big Data?

Application. 1.1 About This Tutorial Tutorial Requirements Provided Files

Hadoop & its Usage at Facebook

Simple Storage Service (S3)

HYBRID CLOUD SUPPORT FOR LARGE SCALE ANALYTICS AND WEB PROCESSING. Navraj Chohan, Anand Gupta, Chris Bunch, Kowshik Prakasam, and Chandra Krintz

Building your Big Data Architecture on Amazon Web Services

Cloudwork Dashboard User Manual

CLOUD STORAGE USING HADOOP AND PLAY

Logentries Insights: The State of Log Management & Analytics for AWS

Amazon S3 Cloud Backup Solution Contents

Transcription:

Big Data with the Google Cloud Platform Nacho Coloma CTO & Founder at Extrema Sistemas Google Developer Expert for the Google Cloud Platform @nachocoloma http://gplus.to/icoloma

For the past 15 years, Google has been building the most powerful cloud infrastructure on the planet. Images by Connie Zhou

security included

Google Cloud Platform Compute App Engine (PaaS) Compute Engine (IaaS) Storage Cloud Storage Cloud SQL Services Cloud Datastore BigQuery Let s talk about these Cloud Endpoints

Cloud Storage # create a file and copy it into Cloud Storage echo "Hello world" > foo.txt gsutil cp foo.txt gs://<my_bucket> gsutil ls gs://<my_bucket> # Open a browser at https://storage.cloud.google.com/<your bucket>/<your Object>

Invoking Cloud Storage Project CLI UI Code API CLI: command line GUI: web console JSON: REST API Cloud Storage

Why go cloud? Specially if I already have my own data center

Do-it-all yourself

Exploring the Cloud Do it yourself Applications Data Runtime Middleware O/S Virtualization Servers Storage Networking You manage IaaS PaaS Infrastructure-as-a-Service Platform-as-a-Service Applications Data Runtime Middleware O/S Virtualization Servers Storage Networking Vendor managed Applications Data Runtime Middleware O/S Virtualization Servers Storage Networking

1 minute at Google scale 100 hours 1000 new devices 3 million searches and also... 100 million gigabytes 1 billion users 1 billion users 1 billion activated devices

Disaster recovery

Internal bandwidth Data will move through the internal Google infrastructure as long as possible

Internal bandwidth Data will move through the internal Google infrastructure as long as possible e h c a c e g d Also: e

Cloud Storage: Measure bandwidth # From the EU zone $ time gsutil cp gs://cloud-platform-solutions-training-exercise-eu/10m-file.txt. Downloading: 10 MB/10 MB real 0m10.503s user 0m0.620s sys 0m0.456s # From the US zone $ time gsutil cp gs://cloud-platform-solutions-training-exercise/10m-file.txt. Downloading: 10 MB/10 MB real 0m11.141s user 0m0.604s sys 0m0.448s

Partial responses What will happen after clicking here?

Resumable file transfer Used by gsutil automatically for files > 2MB Just execute the same command again after a failed upload or download. Can also be used with the REST API

Parallel uploads and composition # Use the -m option for parallel copying gsutil -m cp <file1> <file2> <file3> gs://<bucket> # To upload in parallel, split your file into smaller pieces $ split -b 1000000 rand-splity.txt rand-s-part$ gsutil -m cp rand-s-part-* gs://bucket/dir/ $ rm rand-s-part-* $ gsutil compose gs://bucket/rand-s-part-* gs://bucket/big-file $ gsutil -m rm gs://bucket/dir/rand-s-part-*

ACLs Google Accounts (by ID or e-mail) Google Groups (by ID or e-mail) Users of a Google Apps domain AllAuthenticatedUsers AllUsers Project groups Project team members Project editors Project owners

Durable Reduced Availability (DRA) Enables you to store data at lower cost than standard storage (via fewer replicas) Lower costs Lower availability Same durability Same performance!!!

Object versioning Buckets can enable object versioning, to undelete files or recover previous versions of your objects.

Google Cloud Platform Compute App Engine (PaaS) Compute Engine (IaaS) Storage Cloud Storage Cloud SQL Cloud storage Services Cloud Datastore BigQuery Cloud Endpoints Big Data analysis

MapReduce and NoSQL when all you have is a hammer, everything looks like a nail Photo: 0Four

Who is already using AngularJS? The question that many JavaScript developers are asking

The HTTP Archive Introduced in 1996 Registers the Alexa Top 1,000,000 Sites About 400GB of raw CSV data That s answers to a lot of questions

url Websites using AngularJS in 2014 sites using jquery sites using AngularJS Jan 399,258 1297 Feb 423,018 1603 Mar 411,149 1691 Apr 406,239 2004 Not exactly up-to-date, right? rank http://www.pixnet.net/ 122 http://www.zoosk.com/ 1256 http://www.nasa.gov/ 1284 http://www.udemy.com/ 1783 http://www.itar-tass.com/ 3277 http://www.virgin-atlantic.com/ 3449 http://www.imgbox.com/ 3876 http://www.mensfitness.com/ 3995 http://www.shape.com/ 4453 http://www.weddingwire.com/ 4554 http://www.vanityfair.com/ 5228 http://www.openstat.ru/ 5513

How can we be sure? SELECT pages.pageid, url, pages.rank rank FROM [httparchive:runs.2014_03_01_pages] as pages JOIN ( SELECT pageid FROM (TABLE_QUERY([httparchive:runs], 'REGEXP_MATCH(table_id, r"^2014.*requests")')) WHERE REGEXP_MATCH(url, r'angular.*\.js') GROUP BY pageid ) as lib ON lib.pageid = pages.pageid WHERE rank IS NOT NULL ORDER BY rank asc; e t a d i l a v o t y r e u q a e We hav Source: http://bigqueri.es

Google innovations in the last twelve years MapReduce GFS 2002 Dremel Big Table 2004 2006 Spanner Compute Engine Colossus 2008 2010 2012 2013 Awesomeness starts here

Google BigQuery Analyze terabytes of data in seconds Data imported in bulk as CSV or JSON Supports streaming up to 100K updates/sec per table Use the browser tool, the command-line tool or REST API

s u r u o e r a o h W? s er

t u o b a g n i n i a l p m o c s r e s u t r O o p p u ZER s r e t t i w T Yahoo!) d d n e a v il o a m Mail, Hot G g in s the rtewitm u re ter, all we (behind

BigQuery is a prototyping tool Answers questions that you need to ask once in your life. Has a flexible interface to launch queries interactively, thinking on your feet. Processes terabytes of data in seconds. Processes streaming of data in real time. It s much easier than developing Map Reduce manually.

What are the top 100 most active Ruby repositories on GitHub? SELECT repository_name, count(repository_name) as pushes, repository_description, repository_url FROM [githubarchive:github.timeline] WHERE type="pushevent" AND repository_language="ruby" AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00') GROUP BY repository_name, repository_description, repository_url ORDER BY pushes DESC LIMIT 100 Source: http://bigqueri.es/t/what-are-the-top-100-most-active-ruby-repositories-on-github/9

Time to give these queries a spin Do we have a minute? Photo: Alex Lomix

Much more flexible than SQL Multi-valued attributes lived_in: [ { city: La Laguna, since: 19752903 }, { city: Madrid, since: 20010101 }, { city: Cologne, since: 20130401 } ] Correlation and nth percentile SELECT CORR(temperature, number_of_people) Data manipulation: dates, urls, regex, IP...

Cost of BigQuery Loading data Free Exporting data Free Storage $0.026 per GB/month Interactive queries $0.005 per GB processed Batch queries $0.005 per GB processed Not for dashboards: If you need to launch your query frequently, it s more cost effective to use MapReduce or SQL

Questions? Nacho Coloma CTO & Founder at Extrema Sistemas Google Developer Expert for the Google Cloud Platform @nachocoloma http://gplus.to/icoloma