Viswanath Nandigam Sriram Krishnan Chaitan Baru

Similar documents
Sriram Krishnan, Ph.D.

Cloud Computing. Course: Designing and Implementing Service Oriented Business Processes

IS PRIVATE CLOUD A UNICORN?


The NIST Definition of Cloud Computing (Draft)

White Paper on CLOUD COMPUTING

Introduction to Cloud Computing

Cloud Computing Submitted By : Fahim Ilyas ( ) Submitted To : Martin Johnson Submitted On: 31 st May, 2009

See Appendix A for the complete definition which includes the five essential characteristics, three service models, and four deployment models.

Cloud Models and Platforms

Introduction to Cloud Computing

The NIST Definition of Cloud Computing

Cloud Computing. Adam Barker

An Introduction to Cloud Computing Concepts

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

INTRODUCTION TO CLOUD COMPUTING CEN483 PARALLEL AND DISTRIBUTED SYSTEMS

A CLOUD-BASED FRAMEWORK FOR ONLINE MANAGEMENT OF MASSIVE BIMS USING HADOOP AND WEBGL

Business Intelligence (BI) Cloud. Prepared By: Pavan Inabathini

Cloud Computing and Amazon Web Services

Tamanna Roy Rayat & Bahra Institute of Engineering & Technology, Punjab, India talk2tamanna@gmail.com

Capability Paper. Today, aerospace and defense (A&D) companies find

Cloud computing - Architecting in the cloud

Analysis and Strategy for the Performance Testing in Cloud Computing

SURVEY OF ADAPTING CLOUD COMPUTING IN HEALTHCARE

Chapter 2 Cloud Computing

Cloud definitions you've been pretending to understand. Jack Daniel, Reluctant CISSP, MVP Community Development Manager, Astaro

Cloud Computing Summary and Preparation for Examination

What Is It? Business Architecture Research Challenges Bibliography. Cloud Computing. Research Challenges Overview. Carlos Eduardo Moreira dos Santos

Cloud Computing. Chapter 1 Introducing Cloud Computing

Cloud Computing Now and the Future Development of the IaaS

What is Cloud Computing? First, a little history. Demystifying Cloud Computing. Mainframe Era ( ) Workstation Era ( ) Xerox Star 1981!

Cloud Computing. Chapter 1 Introducing Cloud Computing

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Kent State University s Cloud Strategy

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Enhancing Operational Capacities and Capabilities through Cloud Technologies

Managing Cloud Computing Risk

Getting Familiar with Cloud Terminology. Cloud Dictionary

Grid Computing Vs. Cloud Computing

Cloud Computing. Chapter 1 Introducing Cloud Computing

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study

MapReduce (in the cloud)

A white paper from Fordway on CLOUD COMPUTING. Why private cloud should be your first step on the cloud computing journey - and how to get there

Contents. 1. Introduction

Soft Computing Models for Cloud Service Optimization

Zadara Storage Cloud A

A programming model in Cloud: MapReduce

OVERVIEW Cloud Deployment Services

Lecture 02a Cloud Computing I

CLOUD COMPUTING USING HADOOP TECHNOLOGY

A Service for Data-Intensive Computations on Virtual Clusters

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Cloud Computing an introduction

CHAPTER 8 CLOUD COMPUTING

BIG DATA TRENDS AND TECHNOLOGIES

CSO Cloud Computing Study. January 2012

Private Cloud 201 How to Build a Private Cloud

Introduction to Cloud Services

The Hybrid Cloud: Bringing Cloud-Based IT Services to State Government

Cloud Computing; What is it, How long has it been here, and Where is it going?

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Technology & Business Overview of Cloud Computing

Sistemi Operativi e Reti. Cloud Computing

Cloud Computing Services and its Application

Chapter 7. Using Hadoop Cluster and MapReduce

Trends in Business Intelligence

Datacenters and Cloud Computing. Jia Rao Assistant Professor in CS

wu.cloud: Insights Gained from Operating a Private Cloud System

Essential Characteristics of Cloud Computing: On-Demand Self-Service Rapid Elasticity Location Independence Resource Pooling Measured Service

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

Assignment # 1 (Cloud Computing Security)

Introduction to Hadoop

Cloud Computing: The Next Computing Paradigm

Mobile Hybrid Cloud Computing Issues and Solutions

24/11/14. During this course. Internet is everywhere. Frequency barrier hit. Management costs increase. Advanced Distributed Systems Cloud Computing

Amazon EC2 Product Details Page 1 of 5

Cloud Computing An Elephant In The Dark

CLOUD COMPUTING. When It's smarter to rent than to buy

CUMULUX WHICH CLOUD PLATFORM IS RIGHT FOR YOU? COMPARING CLOUD PLATFORMS. Review Business and Technology Series

High Performance Computing Cloud Computing. Dr. Rami YARED

Tutorial on Client-Server Architecture

Certified Cloud Computing Professional Sample Material

How To Understand Cloud Computing

Cloud Computing Service Models, Types of Clouds and their Architectures, Challenges.

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM

SCADA Cloud Computing

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

How To Get A Cloud Based System In Your Country

Transcription:

Viswanath Nandigam Sriram Krishnan Chaitan Baru

Traditional Database Implementations for large-scale spatial data Data Partitioning Spatial Extensions Pros and Cons Cloud Computing Introduction Relevance to spatial data and OpenTopography

Data Partitioning Indexing Regular & Spatial

Data Partitioning Strategies - Round Robin Equal distribution across nodes by data volume - Hash All data with the same key value go to same node - Range All data within a range of values go to the same node P P P P P M M M M M D D D D D Partitioning Strategy Dataset

DB2 Enterprise Server Edition (Version 8.2 fixpack 14) Data Partitioning Feature (DPF) DB2 Spatial Extender (Version 8.2) Advanced spatial data types Advanced spatial data types functions and features GEON LIDAR Cluster Hardware 8 x (dual core Intel Xeon 3.0 GHz processors with 8GB of memory and 750 GB local disk on each) Four 3TB disk arrays - 1.5 TB local to each node Total Disk ~ 18 TB

Northern San Andreas Fault Data 1205 148544.74364 6135780.64 2074881.21 1557.35 1 5 5.48 14 V 1205 148544.76241 6135770.33 2074880.95 1526.20 2 6 5.45 0 G 1205 148544.76246 6135775.01 2074884.30 1529.68 1 5 5.47 15 G 1205 148544.78122 6135765.79 2074881.62 1524.31 1 5 5.44 80 B Column 1 : Date - Day or week of acquisition Column 2 : Time - GPS time stamp uniquely identifying laser pulse time Column 3 and 4 : X and Y Column 5 : Elevation. Column 6 : Return number - Return number of this return. Column 7 : Number of Returns - Number of returns for this pulse. Column 8 : Off Nadir Angle - Angle between nadir and transmitted pulse Column 9 : Return Intensity - Intensity of return pulse Column 10: Classification Code - Classification of return where: B-Blunder; G-Ground or water; V-Vegetation; S-Building/Structure; N-Not ground or water - Could be V or S

Grid index that is optimized for two dimensional data Logical square grids

Pros Simple SQL-based querying of data Robust production-quality software/hardware stack High performance access to data Cons Increase in DB size due to indexing Extract, Transform, Load (ETL) overhead Fault Tolerance Scalability Costs (licensing, hardware)

Size of spatial objects in database ASCII File X, Y,..attributes Data + ST_POINT + Spatial/Other Indexes ~ 7 X Increase GARLOCK Data: Raw file: 113 GB Size in DB: 852 GB

Preprocess raw LIDAR data files (10%) DDL, loading process (25%) Tablespaces, bufferpools, tables, indexes, load and spatial generation scripts Define Spatial Reference System and spatial columns (5%) db2se create_srs MYSRS -srsname UTM_10N srsid 1011 xoffset 0 - yoffset 0 coordsysname WGS_1984_UTM_ZONE_10N; Populate the spatial column (35%) db2 update mytable set GEOMETRY = db2gse.gsepoint(x, Y, 1011,... ); Create spatial and other indexes (20%) db2 create index MYSINDX on mytable(geometry) extend using db2gse.spatial_index(1000,0,0) Optimize tables and update meta data table (5%)

+ Nodes

What is Cloud Computing? And what s in it for me?

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction

Infrastructure as a Service (IaaS) Platform as a Service (PaaS) Software as a Service (SaaS)

Infrastructure as a Service (IaaS) The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources Consumer is able to deploy and run arbitrary software, which can include operating systems and applications The consumer does not manage or control the underlying cloud infrastructure But has control over operating systems, storage, deployed applications, and possibly select networking components (e.g., firewalls, load balancers)

Amazon Elastic Compute Cloud (EC2) A web service that provides resizable compute capacity in the cloud. Configure an Amazon Machine Instance (AMI) and load it into the Amazon EC2 service Quickly scale capacity, both up and down, as your computing requirements change Amazon Simple Storage Service (S3) A simple web services interface that can be used to store and retrieve large amounts of data, at any time, from anywhere on the web It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites

Platform as a Service (PaaS) The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created applications using programming languages and tools supported by the provider (e.g., java, python,.net) The consumer does not manage or control the underlying cloud infrastructure, network, servers, operating systems, or storage But the consumer has control over the deployed applications and possibly application hosting environment configurations.

*Credit: http://rdn-consulting.com/blog/2009/02/07/exploring-cloud-computing-development/

Software as a Service (SaaS) The capability provided to the consumer is to use the provider s applications running on a cloud infrastructure Accessible from various client devices through a thin client interface such as a Web browser (e.g., web-based email) The consumer does not manage or control the underlying cloud infrastructure With the possible exception of limited user-specific application configuration settings.

Google Maps API http://code.google.com/apis/maps/ Users are provided with simple APIs for maps Uses cloud resources at the back-end Facebook http://www.facebook.com Social networking site using a suite of cloud-based tools at the back-end Animoto http://www.animoto.com Service that makes videos from user uploaded images Uses Amazon EC2 at the back-end

* Source: McKinsey & Co

Cloud computing necessitates a different approach for large scale data processing Use of non-specialized commodity hardware and low latency interconnect between nodes Not suitable for implementing traditional databases which require specialized hardware configurations Traditional databases also not known to scale very well to thousands of nodes which may be available on the cloud

Programming environment for very large scale data processing on commodity resources Managing task executions and data transfers in a shared nothing environment MapReduce: Infrastructure to support data scatter / gather Distributed data repository ( file system ) Google File System (GFS) Hadoop Distributed File System (HDFS) Round-robin partitioning of data MapReduce Google s proprietary implementation Hadoop Apache, open source implementation

Scalable and conducive to data-intensive and data-parallel applications Fault-tolerant by design workers can be restarted on failures Ability to run on non-specialized commodity hardware And cloud resources such as AWS

Success can be attributed to: Ease of use complexity is hidden by the library A large variety of problems are easily expressible in this model Implementation scales to 1000s of machines Restricting the programming model makes it easy to parallelize and deal with failures

Use of Infrastructure as a Service (IaaS) Access to the IBM-Google CluE facility and HP-Yahoo Cloud computing testbed at NCSA/UIUC provided via the NSF CluE Grant Access to AWS resources via an AWS in Education Research Grant Providing Software as a Service (SaaS) using Hadoop

map(string key, String value): // key: document name // value: document contents for each line l in value: if (match) { // if this point resides within the user specified bounding box } w = f(l) // analogous to the sql query, selecting the required fields from the line l EmitIntermediate(w, "1"); reduce(string key, Iterator values): // key: a query result // values: a list of counts (should be 1, and can be ignored) Emit(key);

Evaluation of approaches Parallel DB2 and Hadoop Performance comparison for representative queries Overheads for the approaches (programmer and space overheads) Production use Implementation of a Hadoop-based solution for the OpenTopography Portal, if possible

Data Load Times 1TB/Cluster Execution time Distributed Grep * A Comparison of Approaches to Large-scale Data Analysis, Pavlo et al, SIGMOD 09

Cloud-based technologies may be appropriate for OpenTopography Use of Cloud Infrastructure as a Service (IaaS) for hosting LiDAR data and applications Use of Hadoop-based technologies to provide Cloud Software as a Service (SaaS) More information: http://cloudstor.sdsc.edu Questions?

Cloud computing means different things to different people A recent paper found that there are more than 20 definitions of clouds A Break in the Clouds: Towards a Cloud Definition, 2009; Vaquero, Rodero-Merino, Caceres, Linder We will use the the NIST (National Institute of Standards & Technology) definition

A consumer can unilaterally provision computing capabilities, such as server time and network storage As needed Automatically Without requiring human interaction from service provider

Capabilities are available over the network Accessed through standard mechanisms Heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs). Typically via light-weight Web service protocols

The provider s computing resources are pooled to serve all consumers using a multi-tenant model Different physical and virtual resources dynamically assigned and reassigned according to consumer demand. The customer generally has no control or knowledge over the exact location of the provided resources But may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter) Examples of resources include storage, processing, memory, network bandwidth, and virtual machines

Capabilities can be rapidly and elastically provisioned to quickly scale up And rapidly released to quickly scale down To the consumer, the capabilities available for provisioning often appear to be infinite Can be purchased in any quantity at any time

Automatic control and optimization of resource use by leveraging a metering capability At the level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts) Resource usage can be monitored, controlled, and reported providing transparency for both the provider and consumer of the utilized service

Enterprises incur no infrastructure capital costs, just operational costs on a pay-per-use basis Elimination of up-front costs Architecture specifics are abstracted Capacity can be scaled up or down dynamically, and immediately Illusion of infinite resources, on-demand The underlying hardware can be anywhere geographically

Private Cloud Operated solely for an organization May be managed by the organization or a third party and may exist on premise or off premise. Community Cloud Shared by several organizations and supports a specific community that has shared concerns May be managed by the organizations or a third party and may exist on premise or off premise

Public Cloud Made available to the general public or a large industry group Owned by an organization selling cloud services Hybrid Cloud A composition of two or more clouds (private, community, or public) Bound together by standardized or proprietary technology that enables data and application portability

Pay as you go no upfront costs, contracts, or commitments EC2 From $0.1-$1.2 per hour depending on type of machine and OS used Users also pay for data transfer in and out of EC2 http://aws.amazon.com/ec2/#pricing S3 From $0.15-$0.2 per GB per month for storage Users pay for data uploads and downloads http://aws.amazon.com/s3/#pricing

Free persistent storage up to 500MB and enough bandwidth for about 5 million page views a month Additional storage and bandwidth can be purchased Storage around $0.15 per GB per month Bandwidth around $0.1 per GB per month CPU usage around $0.1 per hour

The computation takes a set of input key/ value pairs, and Produces a set of output key/value pairs. The MapReduce library expresses the computation as two functions Map and Reduce All data resides in a distributed file system