HYBRID CLOUD SUPPORT FOR LARGE SCALE ANALYTICS AND WEB PROCESSING. Navraj Chohan, Anand Gupta, Chris Bunch, Kowshik Prakasam, and Chandra Krintz

Similar documents

appscale: open-source platform-level cloud computing

Hybrid Cloud Support for Large Scale Analytics and Web Processing

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

PaaS - Platform as a Service Google App Engine

Neptune. A Domain Specific Language for Deploying HPC Software on Cloud Platforms. Chris Bunch Navraj Chohan Chandra Krintz Khawaja Shams

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Modeling Public Pensions with Mathematica and Python II

AppScale: Open-Source Platform-As-A-Service

Private Clouds with Open Source

Cloud Computing Training

24/11/14. During this course. Internet is everywhere. Frequency barrier hit. Management costs increase. Advanced Distributed Systems Cloud Computing

Written examination in Cloud Computing

APP DEVELOPMENT ON THE CLOUD MADE EASY WITH PAAS

Scalable Architecture on Amazon AWS Cloud

The Cloud to the rescue!

Unified API Governance in the New API Economy

Google Cloud Platform The basics

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

COMPARISON OF OPEN-SOURCE PAAS ARCHITECTURAL COMPONENTS

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

OpenStack Ecosystem and Xen Cloud Platform

Cover Story. Cloud Computing: A Paradigm Shift in IT Infrastructure

AppScale: Scalable and Open AppEngine Application Development and Deployment

A Quick Introduction to Google's Cloud Technologies

Cloud Computing Summary and Preparation for Examination

Cloud Computing: Making the right choices

Unified Big Data Analytics Pipeline. 连城

Google App Engine for Java. Lars Vogel Twitter:

Cloud Computing. Technologies and Types

Open Source for Cloud Infrastructure

Active Cloud DB: A RESTful Software-as-a-Service for Language Agnostic Access to Distributed Datastores

An Evaluation of Distributed Datastores Using the AppScale Cloud Platform

CloudFTP: A free Storage Cloud

Users VM A A A. Application. Compute/Storage/Network. VM Virtual Machine. On-Premises Data Center

Hadoop in the Hybrid Cloud

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

GigaSpaces Real-Time Analytics for Big Data

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

Cloud Computing. Lecture 24 Cloud Platform Comparison

A programming model in Cloud: MapReduce

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Cloud Courses Description

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Development Model for the Cloud Paradigm Shift of the Same Old Same Old? Dr. Umit Yalcinalp, Salesforce.com Developer Evangelist

Platforms in the Cloud

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Enterprise PaaS Evaluation Guide

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Cloud computing - Architecting in the cloud

Cloud Computing. Chapter 3 Platform as a Service (PaaS)

Elastic Cloud Computing in the Open Cirrus Testbed implemented via Eucalyptus

CUMULUX WHICH CLOUD PLATFORM IS RIGHT FOR YOU? COMPARING CLOUD PLATFORMS. Review Business and Technology Series

Efficient Network Marketing - Fabien Hermenier A.M.a.a.a.C.

Introduction to Cloud Computing

Introduction to Cloud Computing

Cloud Computing and Big Data What Technical Writers Need to Know

Cloud Computing at Google. Architecture

Certified Cloud Computing Professional VS-1067

A Cost-Evaluation of MapReduce Applications in the Cloud

Public Cloud Offerings and Private Cloud Options. Week 2 Lecture 4. M. Ali Babar

Cloud Computing Now and the Future Development of the IaaS

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

BIG DATA TRENDS AND TECHNOLOGIES

6.S897 Large-Scale Systems

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Data Integration Checklist

Moving From Hadoop to Spark

Assignment # 1 (Cloud Computing Security)

How To Understand Cloud Computing

Distributed Data Parallel Computing: The Sector Perspective on Big Data

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Fundamentals of Cloud Computing

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

JAVA IN THE CLOUD PAAS PLATFORM IN COMPARISON

The Inside Scoop on Hadoop

Structured Data Storage

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

How To Write A Trusted Analytics Platform (Tap)

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Cloud Courses Description

Viswanath Nandigam Sriram Krishnan Chaitan Baru

Real Time Big Data Processing

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases

Time-to-live. Adrian Colyer CTO SpringSource

Lecture 6 Cloud Application Development, using Google App Engine as an example

Big data blue print for cloud architecture

Ø Teaching Evaluations. q Open March 3 through 16. Ø Final Exam. q Thursday, March 19, 4-7PM. Ø 2 flavors: q Public Cloud, available to public

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Scaling Pinterest. Yash Nelapati Ascii Artist. Pinterest Engineering. Saturday, August 31, 13

A Web Base Information System Using Cloud Computing

How To Compare Cloud Computing To Cloud Platforms And Cloud Computing

Big Data on Microsoft Platform

Open Source Technologies on Microsoft Azure

Development of nosql data storage for the ATLAS PanDA Monitoring System

Cloudy with a chance of 0-day

Practical Google App Engine Applications in Python

Big Data Pipeline and Analytics Platform

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Data Services Advisory

Transcription:

HYBRID CLOUD SUPPORT FOR LARGE SCALE ANALYTICS AND WEB PROCESSING Navraj Chohan, Anand Gupta, Chris Bunch, Kowshik Prakasam, and Chandra Krintz

Overview Google App Engine (GAE) GAE Analytics Libraries AppScale Hybrid Data synchronization Hive Analytics Evaluation Conclusion

Google App Engine

Google App Engine Platform-as-a-Service Developers focus on their applications Test locally then deploy on Google s infrastructure Language Support Python 2.5 and 2.7 Java Go 500,000 existing apps Auto-scaling, pay-as-you-go Web requests, background tasks, and storage

Google App Engine APIs Name Datastore Memcache Blobstore Channel Images Mail Users Task Queues URL Fetch XMPP Description Key/Value object storage Distributed caching service Storage of large files Long lived JavaScript connections Simple image manipulation Sending and receiving email Login service with Google Accounts Background tasks Resource fetching with HTTP request XMPP-compatible messaging service

Google App Engine Restrictions to enforce scalability and security Limited query support Runtime restrictions No socket access, no file system access White-list of libraries

Data Analytics in GAE

Task Queues Background tasks of 10 minutes taskqueue.add(url='/path/to/my/worker ) Task names to prevent fork bombs Task Queue Chaining Splitting up large background jobs is the burden of the developer GAE Analytics libraries abstract away the Task Queue (TQ)

Fantasm Based on the Task Queue Uses memcache and DB to manage state State machine driven Specified in YAML Iterate over a large dataset Fan-in to join data

Pipeline Task Queue based Chains tasks into a workflow class Add(pipeline.Pipeline): def run(self, a, b): return a + b class Multiply(pipeline.Pipeline): def run(self, a, b): return a * b class LinearFunc(pipeline.Pipeline): def run(self, x, slope=1, offset=0): # y = m*x + b mx = yield Multiply(x, slope) yield Add(mx, offset)

GAE MapReduce Built on top of GAE infrastructure Not Google s internal MapReduce or Hadoop Parallel processing and reductions on large datasets Map across a particular type of object Must scan the entire type (no subsets) Multiple MR jobs can be linked with Pipeline

AppScale

AppScale Private PaaS with GAE API compatibility Application portability Engenders a developer community Distributed and fault tolerant API implementations Leverages open source and new software systems Supports Python, Java, and Go languages Infrastructure agnostic One virtual machine with all components KVM, Xen, EC2, Eucalyptus, Openstack, etc Datastore agnostic Cassandra, HBase, Hypertable, etc

The AppScale Stack Python GAE Server Java GAE Server Go GAE Server ejabberd (xmpp & channel API) Blobstore server Routing (HAProxy and Nginx) memcached AppController Cassandra 14 Footer Text

GAE Analytics in AppScale Task Queue with RabbitMQ

Analytic Libraries in GAE Learning curve for libraries Analytics must be part of application code Can introduce bugs Can disrupt the user experience May require significant code and time to implement

AppScale and GAE Hybrid

Hybrid Solution Connect GAE to AppScale Scale of GAE for OTAP Flexibility of AppScale for OLAP Contributions Datastore mirroring Hive queries

Datastore Library Datastore library for asynchronous updates Best Effort (BE) with async URL Fetch Eventual Consistency (EC) with Task Queue No transaction support GAE AppScale Service User

Hive Support in AppScale Run SQL statements which translate to Hadoop MR jobs We provide the mapping interface from GAE data to Hive queries GAE AppScale Data Scientist

AppScale Stack with Hive Support Python GAE Server Java GAE Server Go GAE Server ejabberd (xmpp & channel API) Blobstore server Routing (HAProxy and Nginx) memcached AppController Hive Hadoop CassandraFS

Hive Queries Simple, short, and fast on-demand queries Offline processing of online data Does not impact user experience Does not introduce bugs or code bloat No ETL, data processing in-place SELECT COUNT(*) FROM appid_kind;

Measurements

Cross Cloud Measurements Async GAE TaskQueue EC2 Region A EC2 Region B

Cross Cloud Data Transfer Round-trip Time and Bandwidth Between a GAE Application and Different EC2 Regions

Cross Cloud Task Queue Delay Round-trip time from multiple regions to a deployed GAE application with task queue delay

Analytic Execution Time Fantasm (GAE) Pipeline (GAE) MapReduce (GAE) Hive (AppScale) Aggregate 11334.59 98.34 377.70 20.94 Grep 10360.40 98.89 227.57 10.69 Join 10147.75 159.96 256.40 23.41 Subset 78.28 3.81 237.75 20.66 Wordcount 10977.50 222.14 840.71 21.54 Processing 100K Entities (5 trials)

Cost Analysis Data synchronization Bandwidth: $0.12/GB Data backup Benefits Programmer productivity Ad-hoc analysis

Related Work Private PaaS offerings Red Hat s OpenShift VMWare s CloudFoundry GAE Compatible TyphoonAE MapReduce Amazon Elastic MapReduce Mesos framework

Thanks Check out AppScale at: appscale.cs.ucsb.edu