A METHOD FOR MINIMIZING COMPUTING CORE COSTS IN CLOUD INFRASTRUCTURES THAT HOST LOCATION-BASED ADVERTISING SERVICES

A Thesis Presented to the Faculty of San Diego State University

In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science

by Vikram Kumar Ramanna

Spring 2013

Copyright 2013 by Vikram Kumar Ramanna. All Rights Reserved.

DEDICATION

I dedicate this book to my family, who have always been there for me and have never doubted my dreams.

ABSTRACT OF THE THESIS

A Method for Minimizing Computing Core Costs in Cloud Infrastructures that Host Location-Based Advertising Services
by Vikram Kumar Ramanna
Master of Science in Computer Science
San Diego State University, 2013

Cloud computing, an increasingly popular paradigm for accessing computing resources over the Internet, provides services to a large number of remote users with diverse requirements. A popular cloud-service model is Infrastructure as a Service (IaaS), exemplified by Amazon's Elastic Compute Cloud (EC2). In this model, users are given access to virtual machines on which they can install and run arbitrary applications, including relational database systems and geographic information systems (GIS). Location-based services (LBS) that offer targeted, real-time advertising are an emerging retail practice wherein a mobile user receives offers for goods and services through a smart phone application. These advertisements can be targeted to individual potential customers by correlating a smart phone user's interests to goods and services being offered within close proximity of the user. In this work, we examine the problem of establishing a Service Level Agreement (SLA) to determine the appropriate number of microprocessor cores required to constrain the query response time for a targeted advertisement to reach a mobile customer within approachable distance of a Point of Sale (POS). We assume the optimum number of cores required to maintain an SLA is the one that minimizes microprocessor core expenses, charged by infrastructure providers, while maximizing application service provider revenues derived from POS transaction fees. This problem is challenging because changes in the number of microprocessor cores assigned to database resources can change the time taken to transmit, receive, and interpret a targeted advertisement sent to a potential customer in motion. We develop a methodology to establish an equilibrium state between the utility gained from POS transaction revenues and the costs incurred from purchasing microprocessor cores from infrastructure providers. We present approaches based on exponential, linear, and Huff models of customer purchase decisions. From these models, the marginal cost and marginal revenue are calculated to determine the optimal number of microprocessor cores to purchase and assign to database resources.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS

CHAPTER
1 INTRODUCTION
  1.1 Motivation
  1.2 Background
2 REVIEW OF THE LITERATURE
3 CLOUD COMPUTING
  3.1 Virtualization vs. Cloud Computing
  3.2 The Eucalyptus Cloud Architecture
  3.3 Eucalyptus Cloud Overview
4 BUILD PRIVATE CLOUD USING EUCALYPTUS
  4.1 Eucalyptus Networking Modes
  4.2 Securing Eucalyptus
    4.2.1 Configuring Secure Sockets Layer (SSL)
    4.2.2 Add a NC
  4.3 Eucalyptus and AWS
    4.3.1 AWS Compatibility
    4.3.2 S3 Tools
    4.3.3 EC2 Tools
    4.3.4 Bundle and Image for Amazon EC2
  4.4 Eucalyptus Dashboard
5 POSTGRESQL
  5.1 Brief History of PostgreSQL
  5.2 Architectural Fundamentals of PostgreSQL
  5.3 Why Use PostgreSQL
6 POSTGIS
  6.1 Postgres Spatial Capabilities
  6.2 Tools that Support PostGIS
  6.3 Spatial Data Types
  6.4 Haversine Formula
  6.5 Vincenty's Formula
  6.6 Spatial Relationships and Measurements
    6.6.1 ST_Distance
    6.6.2 ST_Distance_Sphere
    6.6.3 ST_Distance_Spheroid
7 SYSTEM IMPLEMENTATION AND OVERVIEW
  7.1 Data Mining and Simulation Setup
  7.2 Data Analytics and Model Assumptions
  7.3 Linear Model
  7.4 Exponential Model
  7.5 Huff Model
  7.6 Equilibrium Point
8 EXPERIMENT RESULT AND ANALYSIS
  8.1 Probability Models
  8.2 Marginal Revenue
  8.3 Marginal Revenue vs. Marginal Cost
9 CONCLUSION AND FUTURE IMPLEMENTATIONS
  9.1 Conclusion
  9.2 Future Work
BIBLIOGRAPHY

LIST OF TABLES

Table 6.1. Percentage Error between the Haversine and Vincenty Methods, for Two Points on Earth that are Exactly 10, 20, 30, 40, and 50m Apart
Table 7.1. Amazon EC2 Pricing per Instance-Hour Consumed for Each Instance
Table 8.1. Nomenclature

LIST OF FIGURES

Figure 1.1. The cycle shows different stages for a cloud consumer to determine the optimal number of microprocessor cores to invest to drive a database, based on revenue generated through point of sale transaction fees.
Figure 1.2. Marginal revenue vs. marginal cost (MR=MC curve).
Figure 3.1. Eucalyptus based cloud architecture.
Figure 4.1. Components of Eucalyptus architecture.
Figure 4.2. Network architecture decision flow chart to determine the network configuration for Eucalyptus setup.
Figure 4.3. Eucalyptus compatibility with Amazon EC2 cloud.
Figure 4.4. Dashboard for Eucalyptus services.
Figure 6.1. Hierarchy of geometry and geography data types.
Figure 6.2. Triangle in Euclidean plane transformed onto a spherical plane.
Figure 6.3. Law of Haversines for a spherical triangle.
Figure 7.1. Setup of the private Eucalyptus cloud used in this thesis.
Figure 7.2. Cloud microprocessor core load as a function of the number of records inserted.
Figure 7.3. UML diagram of database storage for a user's geospatial data.
Figure 7.4. Flow chart for parallel insertion of geospatial records into our Eucalyptus hosted PostgreSQL database for measuring cloud infrastructure load.
Figure 7.5. PostGIS API example for computing a great circle distance using the Haversine formula and Vincenty's formula, respectively.
Figure 7.6. An offer for goods and/or services sold at a point of sale is sent to users once they enter a purchase frontier, based on an interest correlation.
Figure 7.7. A mobile application transmits geospatial information to a cloud infrastructure when a user crosses into a purchase frontier.
Figure 7.8. Dbacktrack is the distance a customer must backtrack to reach the point of sale upon receiving an offer.
Figure 7.9. Upon receiving an offer, the probability of a customer backtracking to a POS to make a purchase based on the offer received is determined by a particular probability distribution.
Figure 7.10. Probability as a function of Dbacktrack, linear model.
Figure 7.11. Probability as a function of Dbacktrack, exponential model.
Figure 7.12. Profit maximization curve with equilibrium point shown.
Figure 7.13. Screenshot of Amazon EC2 online marginal cost calculator.
Figure 8.1. Different stages of a smartphone user receiving an offer, once the user enters the purchase frontier.
Figure 8.2. Total processing time ttotal for 1,000,000 0.5 Mb geo-location records on a 10 Mbps network, as a function of cloud microprocessor core count.
Figure 8.3. Per user processing time tuser based on geo-location data exchange on a 4G mobile network (8 Mbps download, 4 Mbps upload) for 1E6 0.5 Mb geo-location records, as a function of cloud microprocessor core count.
Figure 8.4. Two different customer purchase scenarios, considering a POS in Fashion Valley Mall located in San Diego, California.
Figure 8.5. Marginal per user processing time saved, based on geo-location data exchange on a 4G mobile network (8 Mbps download, 4 Mbps upload) for 1E6 0.5 Mb geo-location records, as a function of incremental cloud microprocessor core addition.
Figure 8.6. The distance traveled in meters by a user upon entering a purchase frontier until receiving a geo-location proximity list with one or more offers, assuming a walking speed of 5 km/h.
Figure 8.7. Distance a customer must backtrack to reach a point of sale upon receiving an offer on a 4G mobile smart phone, as a function of microprocessor core count.
Figure 8.8. A single user at different distances D1, D2, D3, and D4 away from different POSs having different attractiveness and size; probability of the user going to a particular POS as described by the Huff model.
Figure 8.9. A single user at different distances D1, D2, D3, and D4 away from a single POS; probability of the user going to the POS as described by the proximity Huff model.
Figure 8.10. Probability a user backtracks to a point of sale upon receiving an offer on a 4G mobile smart phone, as a function of microprocessor core count.
Figure 8.11. Probability a user backtracks to a point of sale upon receiving an offer on a 4G mobile smart phone, as a function of microprocessor core count, assuming a scaled exponential probability distribution.
Figure 8.12. Different Dbacktrack distances travelled by a smartphone user, varying the number of cores on the cloud infrastructure, walking at an average rate of 5 km/h.
Figure 8.13. Probability a user backtracks to a point of sale upon receiving an offer on a 4G mobile smart phone, as a function of microprocessor core count.
Figure 8.14. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming a linear model for purchase probability and a $0.10 transaction fee on each purchase.
Figure 8.15. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming an exponential model for purchase probability and a $0.10 transaction fee on each purchase.
Figure 8.16. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming a linear model for purchase probability and a $0.25 transaction fee on each purchase.
Figure 8.17. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming an exponential model for purchase probability and a $0.25 transaction fee on each purchase.
Figure 8.18. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming a linear model for purchase probability and a $0.50 transaction fee on each purchase.
Figure 8.19. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming an exponential model for purchase probability and a $0.50 transaction fee on each purchase.
Figure 8.20. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming a linear model for purchase probability and a $1.00 transaction fee on each purchase.
Figure 8.21. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming an exponential model for purchase probability and a $1.00 transaction fee on each purchase.
Figure 8.22. Marginal cost vs. marginal revenue: marginal revenue based on the probability of moving to a POS using a linear model on an 8 Mb/s (4G mobile) network, for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.10; marginal cost based on the Amazon t1.micro instance.
Figure 8.23. Marginal cost vs. marginal revenue: marginal revenue based on the probability of moving to a POS using a linear model on an 8 Mb/s (4G mobile) network, for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.50; marginal cost based on the Amazon t1.micro instance.
Figure 8.24. Marginal cost vs. marginal revenue: marginal revenue based on the probability of moving to a POS using an exponential model on an 8 Mb/s (4G mobile) network, for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.50.
Figure 8.25. Marginal cost vs. marginal revenue: marginal revenue based on the probability of moving to a POS using a linear model on an 8 Mb/s (4G mobile) network, for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.25.
Figure 8.26. Marginal cost vs. marginal revenue: marginal revenue based on the probability of moving to a POS using an exponential model on an 8 Mb/s (4G mobile) network, for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.25.
Figure 8.27. Marginal cost vs. marginal revenue: marginal revenue based on the probability of the user moving to a POS using a linear model on an 8 Mb/s (4G mobile) network, for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$1.00.
Figure 8.28. Marginal cost vs. marginal revenue: marginal revenue based on the probability of the user moving to a POS using an exponential model on an 8 Mb/s (4G mobile) network, for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$1.00.
Figure 9.1. Scenario of two different users travelling at different distances from a POS. User 1, whose path is tangential to the POS, is less likely to backtrack, as indicated with a red cross.

ACKNOWLEDGEMENTS

It is a pleasure to thank the many people who made this thesis possible. This work would not have been possible without the support of my thesis chair, Dr. Mahasweta Sarkar, who gave me the chance to work on this project. I would like to gratefully acknowledge the supervision of my advisor, Dr. Christopher Paolini, who has been abundantly helpful and has assisted me in numerous ways. The discussions I had with him were invaluable. Special thanks to my thesis committee member, Prof. Carl Eckberg, for his support and invaluable advice. My final words go to my family. I want to thank my parents, whose love and guidance is with me in whatever I pursue.

CHAPTER 1

INTRODUCTION

1.1 MOTIVATION

The problem we solve in this thesis is to derive a way for cloud customers to dynamically choose an optimal number of microprocessor cores to drive a relational database, hosted in a cloud infrastructure, that supports location-based advertising services for mobile smart-phone applications, as shown in Figure 1.1. Upon choosing a particular commercial cloud provider, a mobile application firm hosts its back-end database application within a cloud instance, which is used to send offers for goods or services to mobile smart-phone users. Based on the revenue generated by transaction fees charged to smart-phone users who consent to an offer by purchasing goods or services at a point of sale, the mobile application firm then chooses to either add or remove microprocessor cores to maximize revenue.

1.2 BACKGROUND

Cloud computing is a general term for anything that involves the delivery of hosted services over the Internet or a Local Area Network (LAN). These services can be broadly classified into three categories: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The name cloud was inspired by the symbol used to represent the Internet in many diagrams. Infrastructure as a Service is sometimes referred to as Hardware as a Service (HaaS); it is a provision model in which an organization outsources resources such as hardware, storage, servers, and networking components to support operations on demand. The service provider is completely responsible for running, housing, and maintaining these resources. The client is typically billed on a per-use basis. IaaS can be obtained as a public or private infrastructure, depending on the type of network on which the infrastructure is deployed. If the infrastructure consists of shared resources deployed on the Internet, then it is a public cloud; if the infrastructure that emulates cloud computing features is on a private network, then it is a private cloud.

Figure 1.1. The cycle shows different stages for a cloud consumer to determine the optimal number of microprocessor cores to invest to drive a database, based on revenue generated through point of sale transaction fees.

Listed below are some features and characteristics of IaaS:
- Resources are distributed as a service.
- Resources can be dynamically scaled.
- Costs vary according to a utility pricing model.
- Multiple users generally share a single piece of hardware.

The IaaS model is suitable where demand is very volatile, with spikes or troughs in infrastructure demand; where there is a limit on capital expenditure; and for a specific line of business with trial or temporary infrastructure needs. IaaS is a form of hosting; an IaaS provider will generally provide the hardware and administrative services needed to store applications and a platform for running applications. Scaling of memory, storage, and bandwidth is included, and vendors compete on the performance and pricing offered for their dynamic services. In our work, we have built a cloud network using

the open source, Amazon EC2-compatible Eucalyptus [1] cloud platform to host the database and Web services required by a smartphone application. A Service Level Agreement (SLA) is an agreement between a cloud provider and a cloud consumer that completely describes the list of services provided to the consumer by the provider, along with the metrics used to determine whether the consumer is being delivered the services as promised. An SLA also details the responsibilities of the consumer and the provider, the remedies available to both if the terms of the SLA are not met, how the SLA will change over time, and the requirements for consumer requests. Public clouds tend to offer negotiable SLAs. The specific definitions of pertinent SLA terms are important. Below are a few definitions, drawn from the Google Apps SLA [2], that are discussed in this thesis:
- Downtime means more than a five percent user error rate for a setup.
- Downtime Period means, for a domain, a period of ten consecutive minutes of downtime. Intermittent downtime for a period of less than ten minutes will not be counted towards any downtime periods.
- Monthly Uptime Percentage means the total number of minutes in a calendar month, minus the number of minutes of downtime suffered from all downtime periods in the calendar month, divided by the total number of minutes in the calendar month.
- Scheduled Downtime means those times where a cloud provider notified a customer of periods of downtime at least five days prior to the commencement of such downtime. There will be no more than twelve hours of scheduled downtime per calendar year. Scheduled downtime is not considered downtime for the purposes of the SLA and will not be counted towards any downtime periods.

Thus, an SLA is a contract between a consumer and a service provider that sets expectations for their business relationship. An SLA needs to be written to protect the cloud service(s) according to the level of risk. The goal is to have an SLA that both the cloud consumer and provider can understand and agree to, including an exit strategy. The SLA should be considered the document that establishes the business partnership between the parties and is used to mitigate any problems. In this thesis, we design a model that a cloud consumer can use to estimate the number of microprocessor cloud resources needed, based on mobile application requirements, and to establish an SLA with a particular cloud provider for meeting those requirements.
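As a concrete illustration of how these terms compose, the following minimal sketch (an illustrative example under the Google Apps-style rules above, not part of the thesis) computes a Monthly Uptime Percentage, treating downtime runs shorter than ten minutes as intermittent and therefore uncounted:

```python
def monthly_uptime_percentage(downtime_runs, days_in_month=30):
    """downtime_runs: list of consecutive-downtime run lengths in minutes.
    Runs under 10 minutes are intermittent downtime and do not count."""
    total_minutes = days_in_month * 24 * 60
    counted = sum(run for run in downtime_runs if run >= 10)
    return 100.0 * (total_minutes - counted) / total_minutes

# Two 15-minute outages and one 5-minute blip in a 30-day month:
print(round(monthly_uptime_percentage([15, 15, 5]), 3))  # 99.931
```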

Developments in technologies such as wireless networks, the Internet, Geographic Information Systems (GIS), and Global Positioning Systems (GPS) have given birth to a new type of information technology called Location Based Services (LBS). Location Based Services are computational services that locate a mobile user geographically and deliver information services to the user based on that location. A Location Based Service is a geospatial mobile application that provides services based on a user's geographical location. An example of such a service is a food service application that informs mobile device holders of nearby restaurants. The first generation of LBSs was reactive and client-server focused: users would query for information and receive a response from the server. With the advancement of push notification techniques, improved mobile Internet access, and widespread adoption of the Web 2.0 paradigm, next-generation user-to-user interactive LBSs evolved, in which information is pushed asynchronously to users based on their location rather than users having to query for services. The benefits of LBSs, from the users' and providers' perspectives, are:
- An LBS avoids users having to manually enter their physical location and interests and search for services; the LBS automatically handles the exchange of information and services based on a user's interests.
- By sharing location-tagged information, there is global awareness of localized information provided by all users.

Location Based Services can be classified according to the kinds of information and services provided. Common classifications include:
- Information Services: LBS provide information about dining places, tourist attractions, gas stations, etc.
- Entertainment Services: LBS provide some form of entertainment and are typically integrated with social media services such as Facebook, LinkedIn, and Twitter.
- Accident Services: LBS provide information, based on one's location, about nearby automobile accidents, traffic congestion, or road emergencies.
- Navigation and Route Assistance: LBS assist drivers with navigation and road selection to minimize either travel time or distance.

LBS-based applications are another form of networking based on places one frequents. Typically, when using LBS applications, one gains points, discounts, and other rewards from participating businesses.
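To make the core LBS operation concrete, here is a hypothetical sketch of the kind of proximity lookup such a service performs, written against a PostGIS-enabled PostgreSQL database (the technology stack used later in this thesis). The table name pos_locations, its columns, the connection string, and the 200 m radius are illustrative assumptions, not the thesis's actual schema:

```python
import psycopg2

def nearby_points_of_sale(lon, lat, radius_m=200):
    """Return (name, distance_m) for every POS within radius_m meters
    of the user's reported position, nearest first."""
    conn = psycopg2.connect("dbname=lbs user=postgres")
    cur = conn.cursor()
    # ST_DWithin on the geography type filters by great-circle distance
    # in meters; ST_Distance then reports the exact distance.
    cur.execute(
        """
        SELECT name,
               ST_Distance(geom::geography,
                           ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography) AS d
        FROM pos_locations
        WHERE ST_DWithin(geom::geography,
                         ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography, %s)
        ORDER BY d
        """,
        (lon, lat, lon, lat, radius_m),
    )
    rows = cur.fetchall()
    conn.close()
    return rows
```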

According to the Pew Internet and American Life Project research [3]:
- 7% of adults who go online with a mobile phone use location-based services.
- 8% of online adults aged 18-29 use location-based services (more than online adults in any other age group).
- 10% of online Hispanics use these services, more than online Caucasians (3%) or online African-Americans (5%).
- 6% of online men use location-based services, compared to 3% of women.

In a typical LBS use case, people use a mobile application to check in at a meeting or perhaps a community-wide event (e.g., a parade, football stadium, field house, or classroom). People checking in are then alerted to the presence of others at the same location. There are a number of location-based applications available for mobile phones that allow user-to-user communication via texting. Geospatial functions are widely used in earth science modeling and applications. The fundamental purpose of geospatial processing is to allow the user to automate Geographic Information System (GIS) tasks. Almost all uses of GIS involve the repetition of work, and this creates the need for methods to automate, document, and share multiple-step procedures. The geospatial processing functions in GIS have been developed over many decades and are well used in desktop-based computing. The rapid development of Web technology and smart phone applications has made it possible to share and process large volumes of distributed geospatial data through the Web and smart phones. However, a powerful, dependable, and flexible information infrastructure is required to process geospatial data into information useful to mobile phone users. Cloud computing is rapidly emerging as a technology that almost every business providing or consuming software, hardware, and infrastructure can leverage. The technology and architecture that cloud service and deployment models offer is a key area of research for deploying GIS technology to users of mobile applications. From a cloud provider's perspective, the key aspect of a cloud is the ability to dynamically scale and provide computational resources in an efficient way via the Internet. From a client mobile application provider's perspective, the ability to utilize cloud facilities on demand, without managing underlying infrastructure and dealing with the related investment and maintenance costs, is of paramount importance. Some popular cloud computing platforms are Amazon Web Services (AWS) [4], Google App Engine [5], and

Microsoft Azure [6]. Users of these platforms can request microprocessors, storage, databases, and other services, thereby gaining access to a suite of elastic IT infrastructure services that match the dynamic demands of smart phone users. The purpose of using a storage service in, say, Amazon AWS is to manage application data by processing and storing output data for further computation. The Amazon AWS platform provides two storage services: the Amazon Simple Storage Service (S3) and the Amazon Elastic Block Store (EBS). Amazon EBS provides block-level storage volumes (network-attached volumes that persist independently from the life of an instance) for use with Amazon EC2 instances. (EC2 allows scalable deployment of applications by providing a Web service through which a user can boot an Amazon Machine Image to create a virtual machine.) Because Amazon EBS volumes are off-instance storage that persists independently from the life of an instance, a user can create volumes that can be mounted as devices by EC2 instances. Compared with Amazon EBS, S3 is subject to delays before writes appear in the system, whereas EBS has no consistency delays. Also, an EBS volume can only be accessed by one machine at a time, whereas snapshots on S3 can be shared. Amazon S3 provides a highly durable storage infrastructure designed for mission-critical and primary data storage. Some GIS enterprises, such as the Environmental Systems Research Institute (ESRI), have made progress in moving their products and services into cloud computing. ESRI, in collaboration with Amazon, uses the public cloud environment in several different ways, and currently the following ESRI options can be deployed on AWS:
- ArcGIS Server
- ArcLogistics (a cloud application for optimizing navigation routing)
- Business Analyst Online (a cloud application for geographic analysis of demographic, consumer, business, and other data)

Geospatial processing functions have been developed within professional GIS applications for decades. However, most commercial GIS packages do not support an open processing and analysis environment, which means that their geospatial functions can only be used within their own proprietary software. Hence, geospatial processing functions that are required to implement location-based services need to be adapted to the cloud computing

environment. A cloud platform should have the ability to scale computing resources up or down automatically, according to current mobile application usage conditions. Single-core geospatial processing, using only one microprocessor core within a cloud platform, cannot exploit the main advantage of cloud computing, which is the flexibility to increase the number of computing resources during demand spikes. In this context, the desired SLA is between a mobile app provider and the cloud platform provider, and the agreement is to always deliver the least number of cores required to maximize transaction revenue to the client app provider. Such elasticity is a major feature of cloud computing and is necessary to maintain the desired SLA. When demand decreases, it is also necessary to scale down assigned computing resources to minimize costs to the mobile application provider. First, the geospatial processing functions deployed within a cloud computing platform should provide such flexibility in a transparent way to end-users. Second, because most current commercial cloud computing providers do not provide free usage, the consumer should take economic costs into account when deploying a mobile application in a commercial cloud environment. We have designed a new methodology that considers both of these factors. There is an extensive literature on resource management techniques for commercial data centers. Utility is often adopted as a metric for resource allocation, and frameworks built on utility functions can be used to optimize resources. Utility functions provide criteria for trading off between multiple competing systems. In this thesis, we propose and evaluate a new model for maximizing the net revenue of a cloud consumer who deploys location-based mobile applications, where net revenue is defined as the fees received from providing location-based services to smart phone users minus the fees paid for cloud infrastructure usage:

R = T_Fees - C_Fees (1.1)

where R is net revenue, T_Fees is the fees received from providing LBS to smart phone users, and C_Fees is the fees paid for cloud infrastructure usage.

Profit maximization is one of the important factors that a cloud consumer should always consider, and it is assumed that the cloud consumer always tries to maximize profit. The intersection of marginal revenue and marginal cost is the equilibrium point. There are

several approaches to explaining the equilibrium point with regard to profit maximization. One is the marginal revenue (MR)-marginal cost (MC) approach [7]. Our model uses the MR-MC approach. If a cloud customer wants to maximize profit, it must operate at the equilibrium point where marginal cost equals marginal revenue. Marginal cost, in the context of our model, is defined as the increase in cost paid by the mobile application provider when investing in exactly one additional resource in the cloud infrastructure. Marginal revenue is defined as the change in total revenue resulting from an additional unit of resource added to the cloud infrastructure; equivalently, this is the slope of the total revenue function. Where marginal cost is less than marginal revenue, each extra unit invested in adding a resource returns more revenue than it costs, so more should be invested in adding resources to the cloud infrastructure to maximize profit, as shown in Figure 1.2. Where marginal cost is greater than marginal revenue, each extra unit invested in adding a resource returns less revenue than it costs, so further investment in adding resources to the cloud infrastructure would decrease profit.

Figure 1.2. Marginal revenue vs. marginal cost (MR=MC curve).
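To make the MR-MC rule concrete, the following is a minimal Python sketch of the equilibrium search. The revenue and cost curves here are illustrative assumptions only; the thesis derives revenue from the purchase-probability models of Chapter 7 and cost from published per-instance pricing (Table 7.1):

```python
import math

def marginal(f, n):
    """Change in f caused by adding the n-th core."""
    return f(n) - f(n - 1)

def optimal_cores(revenue, cost, max_cores=64):
    """Largest core count whose marginal revenue still covers its
    marginal cost: the MR >= MC region of Figure 1.2."""
    best = 1
    for n in range(2, max_cores + 1):
        if marginal(revenue, n) >= marginal(cost, n):
            best = n
        else:
            break  # past the equilibrium point, each added core loses money
    return best

# Hypothetical curves: revenue saturates as extra cores stop improving
# response time; cost grows linearly with a per-core-hour rate.
revenue = lambda n: 120.0 * (1.0 - math.exp(-0.25 * n))  # $/hour
cost = lambda n: 0.02 * n                                # $/hour

print(optimal_cores(revenue, cost))  # prints 29 for these made-up curves
```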

CHAPTER 2

REVIEW OF THE LITERATURE

The literature identifies the importance of the increasingly realized vision of a computing paradigm known as cloud computing, as highlighted by Buyya et al. [8], and a number of advantages cloud computing offers for the deployment of data-intensive applications. One important promise, highlighted by Kossmann et al. [9], is the (virtually) unlimited throughput achieved by adding servers as the workload increases. The literature also treats cloud computing as an increasingly popular technology for accessing computing resources over network connections, in accordance with the most accepted definition of cloud computing, from NIST (the National Institute of Standards and Technology), which lays out five essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service, as highlighted by Kossmann et al. The literature further highlights cloud computing usage models such as enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Furthermore, the literature presents various physical resources, infrastructure, middleware platforms, and applications that are being provided and consumed as services in cloud computing, highlighting the service model Infrastructure as a Service (IaaS), which displaces in-house servers, storage, and networks by providing them on demand, as highlighted by Mateljan et al. [10]. Buyya et al. [8] introduce cloud computing and provide an architecture for creating clouds with market-oriented resource allocation by leveraging technologies such as Virtual Machines (VMs). They also provide insights on market-based resource management strategies that encompass both customer-driven service management and computational risk management to sustain Service Level Agreement (SLA)-oriented resource allocation. In addition, they reveal thoughts on interconnecting clouds for dynamically creating global cloud exchanges and markets. Also presented in this study are some representative cloud

platforms, towards realizing market-oriented resource allocation in clouds. Furthermore, they highlight the difference between High Performance Computing (HPC) workloads and Internet-based services workloads. They also describe a meta-negotiation infrastructure to establish global cloud exchanges and markets, and illustrate a case study of harnessing storage clouds for high performance content delivery. In the paper "An Evaluation of Alternative Architectures for Transaction Processing in the Cloud" [9], Kossmann et al. evaluate alternative architectures for transaction processing in a cloud. They define the benefits of cloud computing and discuss how cloud computing promises a number of advantages for the deployment of data-intensive applications, highlighting reduced cost with a pay-as-you-go business model. They also highlight the (virtually) unlimited throughput that can be achieved by adding additional servers when the workload increases. The paper lists alternative architectures for effecting cloud computing for database applications and reports the results of a comprehensive evaluation of existing commercial cloud services that have adopted these architectures. The focus of the work is on transaction processing (i.e., read and update workloads) rather than analytics or Online Analytical Processing (OLAP) workloads, which have recently gained a great deal of attention. The results are surprising in several ways. Most important, Kossmann et al. point out that all major vendors have adopted a different architecture for their cloud services; as a result, the cost and performance of the services vary significantly depending on the workload. Considering return on investment, the paper "Cloud Database-as-a-Service (DaaS) - ROI" [10] by V. Mateljan, D. Cisic, and D. Ogrizovic presents the cloud computing concept with an overview of the main services, cloud-native and cloud-capable relational and non-relational databases, and some advantages and drawbacks of cloud Database-as-a-Service. Their proposed Return on Investment (ROI) analysis outlines how to decide, and what to consider, when determining whether an application is suited for a cloud computing environment, on-premise (in-house) infrastructure, or outsourcing to managed services. Regarding the economic model of cloud computing, Kevin Buell and James Collofello [11] introduced the pay-as-you-go economic model of cloud computing in their work "Cost Excessive Paths in Cloud Based Services", showing how this model leads naturally to an earn-as-you-go profit model for many cloud-based services and how these applications

can benefit from low-level analyses for cost optimization and verification. Buell et al. present a static analysis approach for determining which control flow paths in cloud applications can exceed a cost threshold. Their approach builds on tools used in Worst-Case Execution Time analysis that provide a tight bound on processing time, and includes provisions for adding bandwidth, storage, and service costs. Their approach determines the magnitude of cost excess for nodes in an application's call graph, so that cloud developers can better understand where to focus their efforts to lower costs (or deem some excesses acceptable based on business case analysis). Ugur et al. [12] analyze current trends in data management systems, such as cloud and multi-tenant databases, and how they are leading to data processing environments that concurrently execute heterogeneous query workloads. A query workload is any group of queries that run on a data server. Ugur et al. stress how important it is for these systems to satisfy diverse performance expectations in newly emerging settings, with the main aim of avoiding potential Quality-of-Service (QoS) violations. They show how this model relies heavily on performance predictability, i.e., the ability to estimate the impact of concurrent query execution on the performance of individual queries in a continuously evolving workload. They present a modeling approach to estimate the impact of concurrency on query performance for analytical workloads. Their solution relies on the analysis of query behavior in isolation, pairwise query interactions, and sampling techniques to predict resource contention under various query mixes and concurrency levels. They introduce a simple yet powerful single-value metric that accurately captures the joint effects of disk and memory contention on query performance, and also discuss predicting the execution behavior of a time-varying query workload through query interaction timelines, i.e., a fine-grained estimation of the time segments during which discrete workloads will be executed concurrently. Their experimental evaluation demonstrates models that can provide query latency predictions within approximately 20% of the actual values in the average case. One of the key service criteria is optimizing search performance, reducing search time, and reducing the search costs of cloud databases. Liu Jia and Huang Ting-Lei [13] present a method for enhancing the efficiency of searching databases hosted in a cloud in their paper "Dynamic Route Scheduling for Optimization of Cloud Database". They describe how cloud

computing is developing as a key computing platform for sharing resources, and they highlight the difficulty of achieving effective routing of storage resources (storage resource routing is the process of optimizing the efficiency and speed with which the available drive space is utilized) in a cloud. Huang et al. combine the ant colony algorithm (a probabilistic technique for solving computational problems that can be used to find good paths through graphs) [14] with cloud database search, and show improvements over the traditional ant colony algorithm. Pengcheng Xiong, Yun Chi, Shenghuo Zhu, Hyun Jin Moon, Calton Pu, and Hakan Hacigümüs [15] investigate how cloud-hosted resources are shared among different clients, and why intelligently managing and allocating resources among various clients is important for system providers, in their paper "Intelligent Management of Virtualized Resources for Database Systems in Cloud Environment". The business model of most system providers relies on managing infrastructure resources in a cost-effective manner while satisfying client service level agreements (SLAs). In this paper, Xiong et al. address the issue of how to intelligently manage resources in a shared cloud database system and present SmartSLA, a cost-aware resource management system. The SmartSLA solution consists of two main components: a system modeling module and a resource allocation decision module. The system modeling module uses machine learning techniques to learn a model that describes the potential profit margins for each client under different resource allocations. Based on the learned model, the resource allocation decision module dynamically adjusts the resource allocations in order to achieve optimum profits. Xiong et al. also present an evaluation of SmartSLA using the TPC-W benchmark [16] with workload characteristics derived from real-life systems. Performance results indicate that SmartSLA can successfully compute predictive models under different hardware resource allocations, such as CPU and memory, as well as database-specific resources, such as the number of replicas in a database system. Replicas are created to make a database available in different locations, networks, or time zones. Experimental results show how their solution provides intelligent service differentiation according to factors such as variable workloads, SLA levels, and resource costs, and can deliver improved profit margins. In the paper "Calibrating the Huff Model Using ArcGIS Business Analyst" [17], Dr. David Huff and Bradley M. McCallum of ESRI Inc. illustrate how the parameters of the generalized version of the Huff Model (presented in Chapter 7) can be estimated statistically using ArcGIS Business Analyst [18]. Emphasis is directed toward the application and interpretation of the model in addressing spatial interaction problems. Maps, diagrams, charts, and graphs generated from ArcGIS Business Analyst are used to facilitate decision making. The paper discusses the Huff Model, describing it as a tool for formulating and evaluating business geographic decisions. The paper also discusses how the development of geographic information system (GIS) technology has drawn even more attention to the model, and the various settings where the model is used, from estimating market potential and predicting consumer shopping selections to profiling and targeting consumers.
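As a rough illustration of the generalized Huff Model discussed here (and presented formally in Chapter 7), the following sketch computes patronage probabilities from store attractiveness and distance. The parameter values and inputs are illustrative assumptions, not calibrated estimates of the kind Business Analyst produces:

```python
def huff_probabilities(attractiveness, distances, alpha=1.0, beta=2.0):
    """Generalized Huff Model: the probability that a consumer patronizes
    store j is proportional to A_j^alpha / D_j^beta, normalized over all
    candidate stores. alpha weights attractiveness; beta penalizes distance."""
    utilities = [(a ** alpha) / (d ** beta)
                 for a, d in zip(attractiveness, distances)]
    total = sum(utilities)
    return [u / total for u in utilities]

# Three candidate stores (hypothetical sizes and distances in km):
print(huff_probabilities([100, 50, 80], [0.5, 0.4, 1.2]))
```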

Methodologies used for setting network charges need to recover the costs of capital, operation, and maintenance of a network and provide forward-looking, economically efficient signals for both consumers and generators. A novel long-run marginal cost (LRMC) pricing methodology based on an analytical method is proposed by Gu et al. [19] to reflect, through sensitivity analysis, the impacts on long-run costs imposed by a nodal injection. In the proposed LRMC approach, the change in the present value of future reinforcement with respect to a nodal power increment is represented by three partial differentiations: (1) the sensitivity of circuit loading level with regard to nodal injection, (2) the sensitivity of time to reinforce with respect to circuit loading level, and (3) the sensitivity of the present value of future reinforcement with respect to time to reinforce. Using this sensitivity approach, the long-run marginal cost method calculates charges that accurately reflect very small changes in nodal generation and demand. Data centers offering computational and storage capacity for users are associated with high electricity consumption and high costs of running the data center itself. Cloud providers face the problem of choosing the right number of servers to run in order to avoid over-provisioning, a major contributor to excessive power consumption, while meeting availability and performance requirements. Mazzucco et al. [20] address the problem of maximizing the revenues of cloud providers by trimming down their electricity costs. They propose and evaluate energy-aware allocation policies that aim to maximize the average revenue received by the provider per unit time. This is achieved by improving the utilization of the server farm, i.e., by powering excess servers off. The policies are based on (i) dynamic estimates of user demand, and (ii) models of system behavior.

Pricing of cloud resources is both a fundamental component of the cloud economy and a crucial system parameter for the cloud operator, because it directly impacts customer usage patterns and the utilization level of the infrastructure. The problem of finding the optimal pricing policy for cloud providers such as Amazon is tackled by Xu et al. [21] by adopting a revenue management framework from economics that deals with the problem of selling perishable resources, such as airline seats and hotel reservations, in order to maximize the expected revenue from a population of price-sensitive customers. They present an infinite-horizon stochastic dynamic program for the revenue maximization problem in the cloud, with stochastic demand arrivals and departures. The optimal price is only a function of the system utilization, not a function of time. Another integrated approach to the problem of revenue-cost optimization for cloud-based application service providers with stringent QoS requirements is proposed by Duong et al. [22], which combines resource provisioning algorithms and request scheduling disciplines. The main goal is to maximize the service provider's revenue by satisfying predefined QoS requirements while, at the same time, minimizing cloud resource cost. They consider the QoS-aware revenue-cost optimization problem for latency-sensitive application service providers (ASPs). They propose an approach that combines efficient cloud resource provisioning with request scheduling to meet the QoS requirement, in terms of users' request waiting time, while reducing server resource cost. The revenue maximization problem of managing a server farm has been examined by Marlon et al. [23], who maximize the net revenue earned by a cloud provider by renting servers to customers according to a typical Platform-as-a-Service model. They propose and evaluate a model for maximizing the net revenue of a cloud provider, where net revenue is defined as the fees for server usage minus energy costs and penalties paid by the provider for service unavailability. The main novelty of their study is that it takes into account penalties due to unavailability, and it distinguishes between customers who are entitled to penalty payments due to unavailability and those who are served on a best-effort basis. The problem of resource allocation for cost-effective data processing services has been studied by Prasad et al. [24], taking into account not just processing power and memory requirements but also network speed, reliability, and data throughput. They provide a method for resource provisioning over the cloud for bulk data processing. They present an

algorithm to determine SLA parameters for a given workload and available resources. They also present various data partitioning strategies to improve data throughput. They consider the hybrid case where resources from both the cloud and the customer site are utilized and the two sites are connected by a network of limited bandwidth. They also study the effect of session lifetime on data partitioning for multi-session processing. A different methodology, maximizing the cloud provider's revenue through Service Level Agreement (SLA)-based dynamic resource allocation, has been studied by Feng et al. [25]. They formalize the resource allocation problem using queuing theory. Their queuing-theory-based mathematical formulation models an application's requirements using parameters such as resource quantity, request arrival, service time, and pricing model. They propose optimal SLA-based resource allocation algorithms among different cloud service instances, considering various Quality of Service (QoS) parameters such as pricing mechanisms, arrival rates, service rates, and available resources, by which cloud providers can maximize their revenues. Bouchenak [26] introduces the SLAaaS (SLA aware as a Service) model for ad-hoc management of cloud computing resources for quality of service. SLAaaS enables a systematic and transparent integration of service levels and SLAs into the cloud. In order to guarantee the SLA of SLAaaS clouds, automated control for dynamic cloud elasticity should be provided. This aims to meet quality-of-service requirements such as performance and availability while minimizing cloud cost. SLA-aware dynamic elastic clouds can be provided through online observation, monitoring, modeling, and automated control of the cloud. The development of spatial data infrastructures (SDIs) brings about the Web-based sharing of large volumes of distributed geospatial data and computational resources. A powerful, dependable, and flexible information infrastructure is required to process heterogeneous and distributed data into information and knowledge. The emergence of cloud computing technology brings a new Information Technology (IT) infrastructure to general users. Shao et al. [27] implement a geoprocessing service that integrates Amazon cloud computing and geoprocessing functions to provide geoprocessing competence in a distributed Web environment.

CHAPTER 3

CLOUD COMPUTING

Cloud computing is a model that gives access to computers and their functionality via the Internet or a local area network. Users of a cloud request this access from a set of Web services that manage a pool of computing resources such as machines, networks, storage devices, operating systems, application development environments, and application programs. When granted, a fraction of the resources in the pool is dedicated to the requesting user until the user releases it. The model is called cloud computing because the user cannot actually see or specify the physical location and organization of the equipment hosting the resources the user is ultimately allowed to use. That is, requested resources are drawn from a cloud of resources when they are granted to a user and returned to the cloud when they are released. In short, a cloud, in the context of distributed computing, is a set of machines and Web services that implement computing resource allocation.

3.1 VIRTUALIZATION VS. CLOUD COMPUTING

Virtualization is the ability to run virtual machines on top of a hypervisor. A virtual machine (VM) is a software implementation of a machine that executes programs like a physical machine. Each VM includes its own kernel, operating system, supporting libraries, and applications. A hypervisor provides a uniform abstraction of the underlying physical machine, and multiple VMs can execute simultaneously on a single hypervisor. The decoupling of the VM from the underlying physical hardware allows the same VM to execute on different physical machines. Thus, virtualization is seen as an enabler for cloud computing, allowing the cloud computing provider the flexibility to move and allocate the computing resources requested by the user wherever physical resources are available. The possible cloud service styles are:
- Infrastructure as a Service (IaaS): IaaS clouds provide access to collections of virtualized computer hardware resources, including machines, network, and storage. With IaaS, users assemble their own virtual cluster on which they are responsible for installing, maintaining, and executing their own software stack.
- Platform as a Service (PaaS): PaaS-style clouds provide access to a programming or runtime environment with scalable compute and data structures embedded within it. With PaaS, users develop and execute their own applications within an environment offered by the service provider.
- Software as a Service (SaaS): SaaS-style clouds deliver access to collections of software application programs. SaaS providers offer users access to specific application programs controlled and executed on the provider's infrastructure. SaaS is often referred to as Software on Demand.

3.2 THE EUCALYPTUS CLOUD ARCHITECTURE

The flexibility of cloud computing has its origin in the combination of virtualization technologies with Web services. One definition of cloud in the computing literature is: "Building on compute and storage virtualization, and leveraging the modern Web, cloud computing provides scalable, network-centric, abstracted IT infrastructure, platforms, and applications as on-demand services that are billed by consumption" [28]. Cloud computing offers various advantages with respect to the grid paradigm. Grid computing is the federation of computer resources from multiple locations to reach a common goal. The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. The grid deals with the complexity of operating federated and autonomous data centers (a federated database system is a type of DBMS that transparently maps multiple databases into a single database). In a cloud, all resources are traded as a service over the Internet, based on a utility model, and only consumed resources are accounted for, following the pay-as-you-go principle. Cloud infrastructures implement central control; however, commercial infrastructure cloud providers often use proprietary middleware systems. With the introduction of Eucalyptus [1], an open source solution for the construction of private clouds is available. Eucalyptus is open source software, originally developed at the University of California, Santa Barbara, that implements cloud computing on compute clusters. Eucalyptus implements Infrastructure as a Service (IaaS) while giving users the ability to run and control virtual machine instances, via the Xen [29] hypervisor or the Kernel-based Virtual Machine (KVM) [30], deployed across a variety of physical resources. By implementing the same interfaces and protocols, Eucalyptus is able to leverage the same tools known to work with the original Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), and Elastic Block Store (EBS) [4]. Eucalyptus has the potential to help establish an open cloud computing de facto standard. The main components of Eucalyptus are the Cloud Controller (CLC), Cluster Controller (CC), and Node Controller (NC). The implementation of the Amazon S3 interface is called Walrus [1]. Eucalyptus also includes block storage functionality similar to EBS, which is explained in Chapter 4. The NC runs on every node in the cloud, alongside a Xen hypervisor or KVM. The NC provides information about free resources to the CC. The CC schedules the distribution of virtual machines to the NCs and collects resource capacity information. The CLC collects resource information from the CCs and operates like a meta-scheduler in the cloud.
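To illustrate this interface compatibility, the sketch below uses the boto library, commonly used with Amazon EC2, to talk to a Eucalyptus Cloud Controller instead; the endpoint host and credentials are placeholders, not values from this thesis's deployment:

```python
import boto
from boto.ec2.regioninfo import RegionInfo

# Point boto's EC2 client at the Eucalyptus CLC instead of Amazon.
region = RegionInfo(name="eucalyptus", endpoint="clc.example.edu")
conn = boto.connect_ec2(
    aws_access_key_id="EUCA_ACCESS_KEY",
    aws_secret_access_key="EUCA_SECRET_KEY",
    is_secure=False,
    region=region,
    port=8773,                        # default Eucalyptus CLC port
    path="/services/Eucalyptus",
)

# The same EC2 API calls work against either cloud.
for zone in conn.get_all_zones():
    print(zone.name, zone.state)
```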

18 Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3) and Elastic Block Storage(EBS) [4]. Eucalyptus has the potential to help establish an open cloud computing de facto standard. The main components of Eucalyptus are the Cloud Controller (CLC), Cluster Controller (CC) and Node Controller (NC). The implementation of the Amazon S3 interface is called Walrus [1]. Eucalyptus also includes a feature of block storage functionality that is similar to EBS which is explained in Chapter 4. The NC runs on every node in the cloud as well as a Xen Hypervisor or KVM. The NC provides information about free resources to the CC. The CC schedules the distribution of virtual machines to the NC and collects resource capacity information. The CLC collects resource information from the CC and operates like a meta-scheduler in the cloud. 3.3 EUCALYPTUS CLOUD OVERVIEW Eucalyptus is built on six components: Cloud Controller, Walrus, Cluster Controller, Storage Controller, Node Controller, and an optional VMWare Broker [1]. Each component is a stand-alone web service. The architecture allows Eucalyptus to expose each web service as a well-defined, language-agnostic API, and to support existing web service standards for secure communication between its components. The Cloud Controller is the entry-point into the cloud for administrators, developers, project managers, and end-users. The CLC queries other components for information about resources, makes high-level scheduling decisions, and makes requests to the Cluster Controllers. As the interface to the management platform, the CLC is responsible for exposing and managing underlying virtualized resources (servers, network, and storage). You can access the CLC through command line tools that are compatible with Amazon s Elastic Compute Cloud (EC2) and through a web-based dashboard. Walrus allows users to store persistent data, organized as buckets and objects. One can use Walrus to create, delete, and list buckets, or to put, get, and delete objects, or to set access control policies. Walrus is an interface compatible with Amazon s Simple Storage Service (S3). Walrus provides a mechanism for storing and accessing virtual machine images and user data. Walrus can be accessed by end-users, whether the user is running a client from outside the cloud, or from a virtual machine instance running inside the cloud. The Cluster Controller generally executes on a machine that has network connectivity to both the machines running the NC and to the machine running the CLC. CCs gather information

19 about a set of NCs and schedule virtual machine (VM) execution on specific NCs. The CC also manages virtual machine networks. All NCs associated with a single CC must reside on the same subnet. The Storage Controller (SC) provides functionality similar to the Amazon Elastic Block Store (Amazon EBS). The SC is capable of interfacing with various storage systems (NFS, iscsi, SAN devices, etc.). Elastic block storage exports storage volumes that can be attached by a VM and mounted or accessed as a raw block device. EBS volumes persist past VM termination and are commonly used to store persistent data. Eucalyptus implements availability zones as separate cluster with in the cloud. An EBS volume cannot be shared between VMs and can only be accessed within the same availability zone in which the VM is running. Users can create snapshots from EBS volumes. Snapshots are stored in Walrus and made available across availability zones. Eucalyptus with SAN support lets you use your enterprise-grade SAN devices to host EBS storage within a Eucalyptus cloud. The Node Controller (NC) executes on any machine that hosts VM instances. The NC controls VM activities, including the execution, inspection, and termination of VM instances. The NC also fetches and maintains a local cache of instance images, and queries and controls system software (host OS and the hypervisor) in response to queries and control requests from the CC. The NC is also responsible for the management of the virtual network endpoint which adds an endpoint to a machine for other resources on the Internet or other virtual networks to communicate with it. The VMware Broker (Broker or VB) is an optional Eucalyptus component activated only in versions of Eucalyptus with VMware support. The Broker enables Eucalyptus to deploy virtual machines (VMs) on VMware infrastructure elements. The Broker mediates all interactions between the CC and VMware hypervisors Elastic Sky X (ESX/ESXi), either directly or through VMware vcenter [31] as shown in Figure 3.1.

Figure 3.1. Eucalyptus-based cloud architecture.

CHAPTER 4
BUILD PRIVATE CLOUD USING EUCALYPTUS

We have built a private cloud using Eucalyptus [1] upon which we run location-based service simulations. We can provision and scale collections of resources drawn from the private Eucalyptus cloud via a web service interface. We have deployed the private cloud within the existing campus data center and behind the college firewall; the cloud is subject to campus security measures, which provides a high degree of security over code and data. Figure 4.1 depicts the relation between Eucalyptus components in a generalized deployment.

Figure 4.1. Components of Eucalyptus architecture.

The Cloud Controller (CLC) and Walrus are the cloud components that communicate with the cluster components, the Cluster Controllers (CCs) and Storage Controllers (SCs). The CCs and SCs communicate with the Node Controllers (NCs). The networks between machines hosting these components are configured using Transmission Control Protocol (TCP) socket connections. The virtual machines (VMs) run on the node controllers. The cluster controllers are used as software application routers between clients outside the Eucalyptus cloud and the virtual machines. The VMs can also use the routing framework already in place, without the CC software routers. However, depending on the layer-2 isolation characteristics of the existing network, one might not be able to implement all of the security features supported by Eucalyptus.

4.1 EUCALYPTUS NETWORKING MODES

Eucalyptus provides the functionality to overlay a virtual network on top of an existing physical network and supports four different networking modes: managed, managed no Virtual Local Area Network (VLAN), system, and static. Each mode is designed for a different level of security and flexibility, as shown in Figure 4.2 [1]. These modes direct Eucalyptus to use different network features in order to manage the virtual networks that connect VMs to each other and to clients external to Eucalyptus. A Eucalyptus installation must be compatible with local site policies and configurations (e.g., firewall rules), and the Eucalyptus configuration and deployment interfaces allow a wide range of options for specifying how it should be deployed. In the Eucalyptus setup we have employed for our work, we have chosen the managed no VLAN configuration, following the decision flow shown in Figure 4.2. In managed no VLAN mode, Eucalyptus fully manages the local VM instance network and provides all of the networking features. Without VLAN isolation at the bridge level, however, it is possible in this mode for a root user to interfere with the Ethernet traffic of other VMs running on the same layer-2 network. The configuration involves elastic IP addresses (static IP addresses designed for dynamic cloud computing); an elastic IP address is associated with an account, not with an instance. We have also defined a Security Group, a set of rules that the Eucalyptus infrastructure applies to the incoming packets for instances in our managed no VLAN mode. Each of the defined rules specifies the source IP/network, the protocol type, and the destination ports.

4.2 SECURING EUCALYPTUS

As defined by the Amazon Web Services specification [4], messages received over the Simple Object Access Protocol (SOAP) or Query interfaces are required to carry time stamps in order to prevent replay attacks. Eucalyptus strictly enforces checks on the time stamps of received messages for the proper functioning of the cloud infrastructure; hence it is crucial to keep all clocks constantly synchronized. In our setup we have synchronized the clocks using the Network Time Protocol daemon (ntpd) on all machines hosting the Eucalyptus components. Messages whose time stamps have expired (older than 15 minutes) are rejected.

Figure 4.2. Network architecture decision flow chart to determine the network configuration for a Eucalyptus setup. Source: Eucalyptus. Open Source Private and Hybrid Clouds from Eucalyptus, 2008. http://www.eucalyptus.com, accessed Feb. 2013.

SOAP interface requests secured with WS-Security also expire [32]. Eucalyptus allows up to 20 seconds of clock drift between machines when checking time stamps for expiration. To protect against replay attacks, the CLC caches messages for 15 minutes. Any tool used to interact with the CLC must set its time-stamp expiration element to a value less than 15 minutes from the current time. This is generally not an issue with standard tools, such as euca2ools and the Amazon EC2 API tools. In our configuration we have set the value using the following euca2ools command:

euca-modify-property -p bootstrap.webservices.replay_skew_window_sec=15

4.2.1 Configuring Secure Sockets Layer (SSL)

To connect to a Eucalyptus cloud using SSL, a valid certificate is required for the CLC. Eucalyptus uses a PKCS #12 (Personal Information Exchange Syntax Standard) keystore [33], and the CLC should be enabled to use this keystore. If Walrus and other tools expect to speak SSL on port 443, the requests can be forwarded by modifying the iptables rules.

4.2.2 Adding an NC

We increased the system's capacity by adding more NCs. To add the extra NC marconi.sdsu.edu, the following command is issued on the CLC:

/usr/sbin/euca_conf --register-nodes "marconi.sdsu.edu"

4.3 EUCALYPTUS AND AWS

Eucalyptus is based on the AWS API; this means we can reuse many existing AWS-compatible tools, images, and scripts to maintain the cloud environment.

4.3.1 AWS Compatibility

The Eucalyptus implementation of the AWS API is such that tools in the cloud ecosystem can communicate with Eucalyptus and AWS using the same API. Eucalyptus is compatible with the following AWS features:

Amazon Elastic Compute Cloud (EC2)
Amazon Elastic Block Storage (EBS)
Amazon Machine Image (AMI)
Amazon Simple Storage Service (S3)
Amazon Identity and Access Management (IAM)

The Eucalyptus components are likewise open source and communicate with each other using well-defined web service definitions, with an additional communication layer that exposes the Amazon-compatible interface, as shown in Figure 4.3 [1]. The Eucalyptus framework provides access to the AWS tools ecosystem, which includes monitoring, cloud service, and image management.

Figure 4.3. Eucalyptus compatibility with the Amazon EC2 cloud. Source: Eucalyptus. Open Source Private and Hybrid Clouds from Eucalyptus, 2008. http://www.eucalyptus.com, accessed Feb. 2013.

We use Eucalyptus as the open source reference implementation of the AWS API, with the advantage of a common web services platform between AWS and Eucalyptus supporting EC2, EBS, S3, and IAM.

By using AWS compatibility with Eucalyptus, we meet the compliance requirements and satisfy regulations that require keeping private data in our own data center.

4.3.2 S3 Tools

S3 tools are compatible with Amazon S3. These tools are designed to interact with S3 bucket storage and work well with Eucalyptus Walrus.

4.3.3 EC2 Tools

EC2 tools are compatible with Amazon EC2 and are designed for the control and management of VM instances, EBS volumes, elastic IPs, and security groups; they work with both EC2 and Eucalyptus.

4.3.4 Bundle and Image for Amazon EC2

Images can be uploaded without changes to Amazon EC2 and run as Amazon Machine Images (AMIs) in the public cloud. To upload an EMI image file to Amazon, the following command is used:

euca-bundle-image -i <image_name> -r <architecture> \
  -c <cert_filename> -k <private_key_filename> \
  --ec2cert <path_to_cert_file>

Eucalyptus can be used to build a flexible IT architecture, and it is the easiest and most efficient way to get started building a private cloud. Eucalyptus is open source, compatible with multiple hypervisors including Xen, KVM, and VMware, and can be installed on any major distribution of Linux.

4.4 EUCALYPTUS DASHBOARD

This section describes the method and implementation of Eucalyptus dashboard access. The Eucalyptus Dashboard is a web-based interface that allows users to manage their private clouds using specific private identities. The Dashboard is accessed from a browser over a secure HTTP connection to https://<clc_ip_address>:8443. The Dashboard provides quick links for standard administrative actions and queries, as well as a robust search mechanism for finding information or tasks quickly by building a custom search. The Dashboard gives two ways to get information: by search or by following links. To show the member users of an account, we can click Accounts in the quick links, select the account, and then click on the Member users link in the Properties section. Alternatively, one can use the Search box and type: user:account=<account_name>. The Dashboard provides information about the number of instances, storage, network, and security groups, as well as information about images and the status of the instances, as shown in Figure 4.4.

Figure 4.4. Dashboard for Eucalyptus services.

After logging in, we can see all the features in the user console against the AWS account. The access key appears in the upper right, as seen in Figure 4.4.

CHAPTER 5
POSTGRESQL

PostgreSQL [34] is an object-relational database management system (ORDBMS) based on POSTGRES Version 4.2 [35], developed at the University of California at Berkeley Computer Science Department. PostgreSQL is an open-source descendant of this original Berkeley code. It supports a large part of the SQL standard and has many modern features, such as:

Support for complex queries: SQL does not follow a sequential programming approach. SQL joins and GROUP BY queries are among the complex queries supported in PostgreSQL.

Foreign keys: A foreign key is a constraint that specifies that the values in a group of columns must match the values appearing in some row of another table.

Triggers: A trigger is a specification that the database should automatically execute a particular function whenever a certain operation is performed.

Views: Views are named, stored queries in the database. They are called and executed each time a view is included in a query to the database. This is implemented using the rule system.

Transactional integrity: Where a combination of multiple inserts or updates is required to complete a transaction, their full completion is enforced by the database.

Multi-version concurrency control: This feature frees data tables for simultaneous use by readers and writers.

PostgreSQL can be extended by the user in many ways, for example by adding new:

Data types
Functions
Operators
Aggregate functions
Index methods
Procedural languages

PostgreSQL conforms to liberal licensing terms, which means it can be used, modified, and distributed by anyone free of charge for any purpose, be it private, commercial, or academic.

5.1 BRIEF HISTORY OF POSTGRESQL

The object-relational database management system now known as PostgreSQL is derived from the POSTGRES package written at the University of California at Berkeley and reflects more than two decades of development. The implementation of POSTGRES began in 1986; the initial concepts for the system were presented in the design of POSTGRES, and the definition of the initial data model appeared in a paper titled "The POSTGRES data model" [36]. POSTGRES has undergone several major releases since 1986, starting with the demoware system that became operational in 1987 and was presented at the ACM-SIGMOD Conference [37]. Version 2 was released in June 1990 with a new rule system for defining active database rules, typically stored procedures and triggers. Version 3, released in 1991, added support for multiple storage managers, an improved query executor, and a rewritten rule system. Subsequent releases, up through Postgres95, focused on portability and reliability. POSTGRES has been used to implement various research and production applications, a few of which are an asteroid tracking database, a medical information database, and several geographic information systems [38]. POSTGRES has also served as an educational tool at several universities, and its external user communities have grown rapidly, with many users contributing. It became increasingly obvious, however, that maintenance of the prototype code and support was taking up large amounts of time that should have been devoted to database research. In an effort to reduce this support burden, the Berkeley POSTGRES project officially ended with Version 4.2.

5.2 ARCHITECTURAL FUNDAMENTALS OF POSTGRESQL

In this section, a description of the PostgreSQL system architecture is presented. Similar to other databases, PostgreSQL uses a client/server model. A PostgreSQL session consists of two processes: a server process and a client application invoked by a user. The server process accepts connections to a database from client applications, performs database actions, and manages the database files; the database server part is called postgres. Client applications can be very diverse in nature: a client could be a text-based tool, a graphical tool, a web server that accesses the database, a mobile application, or a specialized database maintenance tool. Some client applications are supplied with the PostgreSQL distribution, and others are developed by users. As is typical of client/server applications, the client and the server can be on different hosts, in which case both applications communicate over a TCP/IP network connection; files that can be accessed on a client machine might then not be accessible on the database server machine. The PostgreSQL server can handle multiple concurrent connections from clients; in our work we have used this feature and have performed simulations for different numbers of connections. To handle concurrent connections, the server forks a new process for each connection. From that point on, the client and the new server process communicate without intervention by the original postgres process. Thus, the master server process is always running, waiting for client connections, whereas the client and its associated server process come and go on an ad hoc basis.

5.3 WHY USE POSTGRESQL

Compared to other open source databases, PostgreSQL has many features that are not present elsewhere. One of them is the option of writing database functions in many different languages; such functions can return simple scalar values as well as data sets for building aggregate functions. The commonly used languages are the standard ones: SQL, PL/pgSQL, and C. In addition to these three, PL/Perl, PL/Python, PL/Tcl, PL/sh, PL/R, and PL/Java are also available; to use these languages, their corresponding environments (Perl, Python, Tcl, Java, and R) must be installed. IBM DB2 and Microsoft SQL Server, by contrast, use Microsoft .NET functions, where the code does not reside in the database. There also exists a tool called PL/Parrot, a procedural language handler for the Parrot system, which can combine multiple dialects of languages in one procedural language.

PostgreSQL has support for array data structures. PostgreSQL, Oracle, and IBM DB2 stand out among databases in treating arrays as a genuinely useful feature. In PostgreSQL, one can define any table column as an array of strings, numbers, dates, geometries, or even one's own data type creations. This comes in handy for matrix-like analysis or aggregation. In addition, one can convert any single column of a row list to an array, which is particularly useful when manipulating geometries.

Moreover, PostgreSQL has a feature called table inheritance, which is similar to object multiple inheritance. Table inheritance is often used for table-partitioning strategies. An example of using table inheritance in the field of statistics is:

statistics
  - statistics_2010_04 (inherits statistics)
  - statistics_2010_05 (inherits statistics)

In the above example, consider a sample with 1,000,000 rows in each table, where each table has a constraint to make sure only data for the matching month gets stored in it. Inheritance is useful when we need to select data between two particular months: one would scan only the tables for those months and not query the other tables, which decreases the query response time. A second benefit is that each small table has a small index, and hence read operations are faster than against a master table with one large index. Another advantage is that each table can be vacuumed or indexed independently of the data in the other tables.

PostgreSQL also supports a multicolumn feature. This feature gives the ability to combine and use multiple single-column indexes in a query, and to define aggregate functions that take more than one column as input. Because we usually think of aggregates as taking a single column as input, this feature is less exploited and can be hard to visualize.

PostgreSQL is based on a simple single-process-per-user client/server model, in which there is exactly a one-to-one mapping between client and server processes. Since PostgreSQL does not know ahead of time how many connections will be made, it uses a master process, called postgres, that listens at a specified TCP/IP port for incoming connections and spawns a new server process whenever a connection request is detected. Semaphores and shared memory are used to ensure data integrity throughout concurrent data access whenever server tasks communicate with each other. Any PostgreSQL client process must understand the PostgreSQL protocol. Many clients are built using the C-language library; libpq is one of the library interfaces for PostgreSQL, but there are numerous independent implementations of the protocol, like the Java JDBC driver. Once a connection is established, the client process sends its query to the backend server. A query is not parsed on the frontend client; it is transmitted as plain text. The server then parses the query, initiates an execution plan, retrieves the matching rows, and transmits those rows over the established connection back to the client.
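As a small illustration of this client/server flow, the sketch below uses the psycopg2 driver (one of the many libpq-based client implementations) to open two connections and run a query on each. PostgreSQL's pg_backend_pid() function reports the server process serving a connection, showing that each connection is handled by its own forked backend. The connection string is a placeholder, and the sketch is illustrative rather than part of our simulation code.

import psycopg2  # Python client library built on libpq

DSN = "host=localhost dbname=postgres user=postgres"  # placeholder credentials

# Open two connections; the postgres master process forks one backend for each.
for _ in range(2):
    conn = psycopg2.connect(DSN)
    with conn.cursor() as cur:
        # The query is sent as plain text and parsed on the server side.
        cur.execute("SELECT pg_backend_pid(), version()")
        pid, version = cur.fetchone()
        print(f"served by backend process {pid}: {version}")
    conn.close()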

CHAPTER 6
POSTGIS

PostGIS is an open-source extension for PostgreSQL that allows GIS (Geographic Information Systems) objects to be stored in the database and provides functions for the analysis and processing of GIS objects. PostgreSQL also provides support for Generalized Search Tree (GiST)-based R-Tree spatial indexes. PostGIS enables spatial functionality within the PostgreSQL object-relational database management system (ORDBMS) and is a free and open source (FOSS) library. In contrast to RDBMS tables that store scalar types such as numbers, text, and dates, we define an object-relational database as one that can store more complex types of objects in relational columns. PostgreSQL allows users to define custom data types, new functions, and operators that handle these new custom types.

6.1 POSTGRES SPATIAL CAPABILITIES

People in many walks of life are now empowered to answer the question "Where is something?" by identifying it on a neatly detailed, interactive map. This has been made possible by many popular map sites such as Google Maps, Virtual Earth, MapQuest, and Yahoo. We are no longer restricted to giving descriptive textual directions for situating a location or finding the right way to a place. Being able to locate ourselves on a map has eliminated the perennial problem of identifying where we are on a paper map. The benefits of mapping do not end at getting directions. Mapping can be a great resource for analyzing patterns in data and has been beneficial for both small and large organizations. For example, a national pizza chain can estimate where to locate its next grand opening based on visual data plots of the addresses of pizza lovers. Political organizations can concentrate their route walks by seeing on a map where the undecided or unregistered voters are located. While interactive mapping has given unprecedented power to users, using mapping services still requires users to gather point data and place that data on a map. More critically, the reasoning that germinates from an interactive map is entirely visual.

Considering the pizza example again, the chain may be able to visually inspect a map showing pizza lovers by means of pushpins to see the concentration of pizza lovers in a city or arbitrary sales region. However, suppose the chain has more of a gourmet offering; then the pizza chain would want to locate sites in the midst of mid- to high-income pizza lovers and would want to sort pizza lovers by income level. The pizza chain could use pushpins of different colors on the interactive map to indicate the various income tiers, but the heuristic visual reasoning is now much more complicated: the planner must keep the varying colors or icons of the pins in mind while watching for concentrations of pushpins. The problem becomes even more complex with the addition of new variables, such as households with more than two children. This problem of information overload can be solved with the help of spatial databases.

6.2 TOOLS THAT SUPPORT POSTGIS

The following vendors currently support PostGIS in their desktop/web offerings:

Cadcorp SIS: Cadcorp supports more than 160 GIS formats, including direct support for all other high-end spatial database offerings. Cadcorp is partially funding the raster support in PostGIS, and SIS is preferred among modelers for both desktop and web-based apps.

Safe FME: Safe FME is preferred for high-end extract, transform, load (ETL) transactions. Safe makes ETL tools for GIS data; these tools allow GIS data transport to different formats and databases as a simple drag, drop, and schedule exercise. Safe contributes both monetary and developer support for GEOS.

Manifold: Manifold is liked by many spatial database analysts and people who like SQL, and it supports PostGIS extensively. Manifold supports Oracle Locator/Spatial, PostGIS, SQL Server 2008, IBM DB2, and MySQL; it released support for PostGIS in its version 8.0.

ziggis: ziggis allows one to access PostGIS data without an Arc Spatial Database Engine (ArcSDE) license; ziggis does not work with ArcGIS Server as of this writing. ziggis is a desktop plug-in for the ESRI ArcGIS desktop that works with version 9.2 and above.

ArcGIS: ArcGIS is best known for its cartography capabilities. In ArcGIS 9.3, ESRI introduced support for PostGIS. ArcGIS is expensive and is independent of the different versions of PostGIS or PostgreSQL; it requires an ArcSDE Server license for PostGIS and works only with PostGIS 1.4 and below (as of ArcGIS 10).

Pitney Bowes MapInfo 10: MapInfo is a popular tool for GIS Visual Basic programmers using its MapBasic interface. Pitney Bowes introduced support for PostGIS in its recent MapInfo 10 offering. MapInfo is preferred by lightweight GIS users and database analysts because of its rich query options and easy data import menus, and it enjoys a rich history of integration with MS Office products.

PostGIS has far exceeded the spatial offerings of MySQL and has garnered more support in the free open source GIS arena than any other spatial database; there are too many PostGIS open source tools to list. PostGIS has also gained strong commercial vendor support, similar to the support available for the other players in this market such as Oracle, SQL Server, or IBM DB2.

6.3 SPATIAL DATA TYPES

A spatial database offers two families of spatial data types: geometry (planar, Euclidean) and geography (geodetic). Figure 6.1 [39] depicts the geometry and geography data types. The data types colored in blue can be called through their instances, whereas the types colored in yellow can be invoked directly.

Figure 6.1. Hierarchy of geometry and geography data types. Source: Microsoft. Spatial Objects, 2005. http://technet.microsoft.com/en-us/library/bb964711.aspx, accessed Feb. 2013.

The main differences between the two types of spatial data are in how the data is stored and manipulated.

6.4 HAVERSINE FORMULA

The Haversine formula [40] computes the great-circle distance between two points based on their longitude and latitude values. The great-circle distance is the shortest distance between any two points on the surface of a sphere, assuming spherical geometry. The formula is a general result in spherical trigonometry, relating the sides and angles of spherical triangles, as shown in Figure 6.2. It is based on the law of Haversines and has been widely used in navigation.

Figure 6.2. Triangle in a Euclidean plane transformed onto a spherical plane.

For two points on a sphere of radius r, with latitudes φ1 and φ2, longitudes λ1 and λ2, and separations Δφ and Δλ, where the angles are in radians, the distance d between the two points along a great circle of the sphere can be computed using the Haversine formula, which is a function of r, φ1, φ2, λ1, and λ2. For any two points on a sphere, the Haversine of the central angle between them is calculated using the formula

\[ \operatorname{haversin}\left(\frac{d}{r}\right) = \operatorname{haversin}(\phi_2 - \phi_1) + \cos(\phi_1)\cos(\phi_2)\operatorname{haversin}(\lambda_2 - \lambda_1) \tag{6.1} \]

where haversin is the Haversine function:

\[ \operatorname{haversin}(\theta) = \sin^2\left(\frac{\theta}{2}\right) = \frac{1 - \cos(\theta)}{2} \tag{6.2} \]

Here d is the distance between the two points (along a great circle of the sphere), r is the radius of the sphere, φ1 and φ2 are the latitudes of points 1 and 2, and λ1 and λ2 are the longitudes of points 1 and 2.

A central angle is an angle whose vertex is the center of a sphere and whose sides pass through two points on the sphere, forming an arc between those two points; the angle subtended by this arc is termed the central angle. A central angle is also referred to as the arc segment's angular distance. From Eq. 6.1, solving for d and assuming the angles are measured in radians, we have

\[ d = r \operatorname{haversin}^{-1}(h) = 2r \arcsin\left(\sqrt{h}\right) \tag{6.3} \]

where h is haversin(d/r), so that

\[ d = 2r \arcsin\left(\sqrt{\operatorname{haversin}(\phi_2 - \phi_1) + \cos(\phi_1)\cos(\phi_2)\operatorname{haversin}(\lambda_2 - \lambda_1)}\right) \tag{6.4} \]

\[ d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) \tag{6.5} \]

In Eq. 6.5, the value of h does not exceed 1; h approaches 1 for antipodal points (the antipodal point of a given point is the diametrically opposite point on the sphere), and in this region many numerical errors tend to arise in the formula when finite precision is used. Eq. 6.5 could instead be written using cosines (the law of spherical cosines) rather than the Haversine function; however, if the two points are very close together, an erroneous value may result. Either formula is only appropriate for obtaining approximate distances when applied to the Earth, which is not a perfect sphere: the Earth's radius r varies from 6356.87 km at the poles to 6378.14 km at the equator, and the radius of curvature of a north-south line at the equator differs from the radius of curvature at the poles by 1%. More accurate methods that consider the Earth's ellipsoidal shape use Vincenty's formulae [41].

The law of Haversines is defined on a unit sphere with three points u, v, and w forming a triangle on its surface. With the three sides of the triangle having lengths a, b, and c, respectively, and C the angle of the corner opposite c, the law of Haversines is given by

\[ \operatorname{haversin}(c) = \operatorname{haversin}(a - b) + \sin(a)\sin(b)\operatorname{haversin}(C) \tag{6.6} \]

In Figure 6.3, the sides a, b, and c of the triangle on the unit sphere are of equal length, with equal angles subtended by those sides. To obtain the Haversine formula, a special case is considered where u is at the North Pole and v and w are the two points whose separation d is being determined. In this case, a and b are the colatitudes (90° − latitude), and C is the longitude difference Δλ.
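A direct implementation of Eq. 6.5 is short. The sketch below is a minimal Python version that uses the spherical Earth radius of 6,370,986 m assumed by PostGIS's ST_Distance_Sphere (see Section 6.6.2); it is an illustration, not the PostGIS implementation itself.

import math

def haversine_distance(lat1, lon1, lat2, lon2, r=6370986.0):
    """Great-circle distance (Eq. 6.5) on a sphere of radius r meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(h))  # d = 2r*arcsin(sqrt(h)), Eq. 6.3

For the Barcelona and Paris coordinates used in the examples of Section 6.6, this should return approximately the same 831 km distance reported by ST_Distance_Sphere.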

Figure 6.3. Law of Haversines for a spherical triangle.

6.5 VINCENTY'S FORMULA

Thaddeus Vincenty [42] devised formulae that calculate the geodesic distance between two latitude/longitude points on Earth using an ellipsoidal model of the Earth. Vincenty's formulae [41] are accurate to 0.5 mm on the ellipsoid being used. The Vincenty formula that calculates the distance between two points is referred to as the inverse formula. It is computed by iterating the following formulae, looping until the change in λ is negligible (e.g., 10^-12, corresponding to about 0.006 mm):

In each iteration:

\[ \sin\sigma = \sqrt{(\cos U_2 \sin\lambda)^2 + (\cos U_1 \sin U_2 - \sin U_1 \cos U_2 \cos\lambda)^2} \tag{6.7} \]

\[ \cos\sigma = \sin U_1 \sin U_2 + \cos U_1 \cos U_2 \cos\lambda \tag{6.8} \]

\[ \sigma = \operatorname{atan2}(\sin\sigma, \cos\sigma) \tag{6.9} \]

\[ \sin\alpha = \frac{\cos U_1 \cos U_2 \sin\lambda}{\sin\sigma} \tag{6.10} \]

\[ \cos^2\alpha = 1 - \sin^2\alpha \quad \text{(trig identity)} \tag{6.11} \]

\[ \cos 2\sigma_m = \cos\sigma - \frac{2 \sin U_1 \sin U_2}{\cos^2\alpha} \tag{6.12} \]

\[ C = \frac{f}{16}\cos^2\alpha\left[4 + f\left(4 - 3\cos^2\alpha\right)\right] \tag{6.13} \]

\[ \lambda = L + (1 - C) f \sin\alpha \left\{\sigma + C \sin\sigma\left[\cos 2\sigma_m + C \cos\sigma\left(-1 + 2\cos^2 2\sigma_m\right)\right]\right\} \tag{6.14} \]

Once λ has converged, the following are evaluated:

\[ u^2 = \cos^2\alpha \, \frac{a^2 - b^2}{b^2} \tag{6.15} \]

\[ A = 1 + \frac{u^2}{16384}\left\{4096 + u^2\left[-768 + u^2\left(320 - 175u^2\right)\right]\right\} \tag{6.16} \]

\[ B = \frac{u^2}{1024}\left\{256 + u^2\left[-128 + u^2\left(74 - 47u^2\right)\right]\right\} \tag{6.17} \]

\[ \Delta\sigma = B \sin\sigma\left\{\cos 2\sigma_m + \frac{B}{4}\left[\cos\sigma\left(-1 + 2\cos^2 2\sigma_m\right) - \frac{B}{6}\cos 2\sigma_m\left(-3 + 4\sin^2\sigma\right)\left(-3 + 4\cos^2 2\sigma_m\right)\right]\right\} \tag{6.18} \]

\[ s = bA(\sigma - \Delta\sigma) \tag{6.19} \]

\[ \alpha_1 = \operatorname{atan2}\left(\cos U_2 \sin\lambda, \; \cos U_1 \sin U_2 - \sin U_1 \cos U_2 \cos\lambda\right) \tag{6.20} \]

\[ \alpha_2 = \operatorname{atan2}\left(\cos U_1 \sin\lambda, \; -\sin U_1 \cos U_2 + \cos U_1 \sin U_2 \cos\lambda\right) \tag{6.21} \]

where a and b are the major and minor semi-axes of the ellipsoid, f is the ellipsoid flattening factor (a − b)/a, φ1 and φ2 are the geodetic latitudes (positive north of the equator), L is the longitudinal difference, U1 = atan[(1 − f) tan φ1] and U2 = atan[(1 − f) tan φ2] are the reduced latitudes, λ is the difference in longitude on an auxiliary sphere (initialized to L), σ is the angular distance between the two points on the sphere, σm is the angular distance on the sphere from the equator to the midpoint of the line,
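Translated into code, the iteration is compact. The following is a minimal Python sketch of the inverse formula using WGS-84 ellipsoid parameters (our assumption; any a and f can be substituted). It follows Eqs. 6.7 through 6.19 and computes only the distance s; it is an illustration, not the implementation PostGIS uses.

import math

def vincenty_distance(lat1, lon1, lat2, lon2):
    """Geodesic distance s (Eq. 6.19) on the WGS-84 ellipsoid, in meters."""
    a = 6378137.0               # major semi-axis (m), WGS-84
    f = 1 / 298.257223563       # flattening (a - b)/a
    b = (1 - f) * a             # minor semi-axis (m)

    L = math.radians(lon2 - lon1)
    U1 = math.atan((1 - f) * math.tan(math.radians(lat1)))  # reduced latitudes
    U2 = math.atan((1 - f) * math.tan(math.radians(lat2)))
    sinU1, cosU1 = math.sin(U1), math.cos(U1)
    sinU2, cosU2 = math.sin(U2), math.cos(U2)

    lam = L
    for _ in range(200):        # iterate Eqs. 6.7-6.14 until lambda converges
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.sqrt((cosU2 * sin_lam) ** 2 +
                              (cosU1 * sinU2 - sinU1 * cosU2 * cos_lam) ** 2)
        if sin_sigma == 0.0:
            return 0.0          # coincident points
        cos_sigma = sinU1 * sinU2 + cosU1 * cosU2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cosU1 * cosU2 * sin_lam / sin_sigma
        cos2_alpha = 1 - sin_alpha ** 2
        # Over equatorial lines cos^2(alpha) = 0; set cos(2*sigma_m) to 0.
        cos_2sm = (cos_sigma - 2 * sinU1 * sinU2 / cos2_alpha
                   if cos2_alpha else 0.0)
        C = f / 16 * cos2_alpha * (4 + f * (4 - 3 * cos2_alpha))
        lam_prev = lam
        lam = L + (1 - C) * f * sin_alpha * (
            sigma + C * sin_sigma * (cos_2sm +
                                     C * cos_sigma * (-1 + 2 * cos_2sm ** 2)))
        if abs(lam - lam_prev) < 1e-12:
            break

    u2 = cos2_alpha * (a ** 2 - b ** 2) / b ** 2                    # Eq. 6.15
    A = 1 + u2 / 16384 * (4096 + u2 * (-768 + u2 * (320 - 175 * u2)))
    B = u2 / 1024 * (256 + u2 * (-128 + u2 * (74 - 47 * u2)))
    delta_sigma = B * sin_sigma * (cos_2sm + B / 4 * (
        cos_sigma * (-1 + 2 * cos_2sm ** 2) -
        B / 6 * cos_2sm * (-3 + 4 * sin_sigma ** 2) *
        (-3 + 4 * cos_2sm ** 2)))                                   # Eq. 6.18
    return b * A * (sigma - delta_sigma)                            # Eq. 6.19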

s is the distance between points a and b (in the same length units as a and b), α1 is the initial bearing, or forward azimuth, and α2 is the final bearing (in the direction a → b).

Vincenty observed that Eq. 6.14 becomes indeterminate over equatorial lines (since cos²α = 0); in this case, setting cos 2σm to 0 computes the result correctly. He also noted that the formula may have no solution between two nearly antipodal points, in which case the iteration limit is reached without convergence. Some implementations of Vincenty's formula inefficiently use a large number of trigonometric functions; Vincenty devised his solution with implementation efficiency in mind, and his formula uses just one each of sin, cos, sqrt, and atan2 per iteration; generally three to four iterations yield a satisfactory result. The trigonometric functions here take angles in radians, so latitudes, longitudes, and bearings in degrees should be converted to radians using the conversion formula

\[ \mathrm{rad} = \mathrm{deg} \times \frac{\pi}{180} \tag{6.22} \]

The west direction on a compass is considered negative when using signed decimal degrees. The atan2 function used in the formulae above takes two arguments, atan2(y, x), and computes the arc tangent of the ratio y/x. This is more flexible than atan(y/x), since it handles the case x = 0 and returns values in all four quadrants, from -π to +π, whereas the atan function returns values only in the range -π/2 to +π/2. Geodesic enthusiasts prefer Vincenty over Haversine, since the Vincenty model yields results accurate to 0.5 mm compared to roughly 0.3% accuracy for Haversine. We calculated the error percentage of Haversine relative to Vincenty, as shown in Table 6.1. For our model the choice between the two formulae is not very significant: no user actually travels on either a theoretical spheroid or a sphere, both methods yield an approximation, and they have a negligible difference in calculation time.

6.6 SPATIAL RELATIONSHIPS AND MEASUREMENTS

This section discusses the PostGIS functions that conform to the SQL/MM 3 standard. SQL/MM defines the default Spatial Reference System Identifier (SRID) of all geometry constructors as 0; PostGIS uses a default SRID of -1.

Table 6.1. Percentage Error between the Haversine and Vincenty Methods, for Two Points on Earth that are Exactly 10, 20, 30, 40, and 50 m Apart

Vincenty Formula (m)   Haversine Formula (m)   Haversine Error (m)   Percentage Error
10                     10.11                   0.11                  1.1
20                     20.14                   0.14                  0.7
30                     30.24                   0.24                  0.8
40                     40.25                   0.25                  0.625
50                     50.37                   0.37                  0.74

6.6.1 ST_Distance

This function returns the 2-dimensional minimum distance between two geometries on a Cartesian plane. It implements the OpenGIS Simple Features Implementation Specification for SQL and also implements the SQL/MM specification (SQL/MM 3). The syntax for using this method in PostGIS is:

float ST_Distance(geometry g1, geometry g2);

Examples of ST_Distance:

SELECT ST_Distance(ga, gb)
FROM (SELECT ST_GeomFromText('POINT(5 4 2 8)') AS ga,
             ST_GeomFromText('POINT(8 6 10 2)') AS gb) AS foo;
--Result: 3.605551275463989

SELECT ST_Distance(ga, gb)
FROM (SELECT ST_GeomFromText('POLYGON((10 30, 0 10, 30 20, 30 30, 20 40, 10 30))') AS ga,
             ST_GeomFromText('GEOMETRYCOLLECTION(POINT(40 30),
                LINESTRING(50 40, 70 20, 70 10, 70 10, 70 10))') AS gb) AS foo;
--Result: 10.0

--Distance Barcelona-Paris in degrees
SELECT ST_Distance(Barcelona, Paris) AS dist_degrees
FROM (SELECT ST_GeomFromText('POINT(2.183333 41.383333)', 4326) AS Barcelona,
             ST_GeomFromText('POINT(2.350833 48.856667)', 4326) AS Paris) AS foo;

 dist_degrees
------------------
 7.47521085492282

6.6.2 ST_Distance_Sphere

This method returns the distance between two latitude/longitude geospatial points. ST_Distance_Sphere assumes the Earth to be a sphere of radius 6,370,986 meters. The function does not look at the SRID of a point geometry and always assumes the latitude and longitude to be in World Geodetic System (WGS) 84. The method is based on the great-circle distance, also known as the orthodromic distance, which is the shortest distance between any two points on the sphere measured along a path on the surface of the sphere; the linear distance is obtained by multiplying the spherical (angular) distance by the radius of the Earth. The ST_Distance_Sphere API is based on the Haversine formula. The syntax for the method is:

float ST_Distance_Sphere(geometry geomlonlatA, geometry geomlonlatB);

An example using the above method:

SELECT ST_Distance_Sphere(Barcelona, Paris)
FROM (SELECT ST_GeomFromText('POINT(2.183333 41.383333)', 4326) AS Barcelona,
             ST_GeomFromText('POINT(2.350833 48.856667)', 4326) AS Paris) AS foo;

 st_distance_sphere
--------------------
 831098.439081409

6.6.3 ST_Distance_Spheroid

This method returns the minimum distance between two longitude/latitude geometries for a given spheroid, modeling the Earth as an ellipsoid. The function looks at the SRID of the point geometry; it is slower but more accurate than ST_Distance_Sphere, and it is based on Vincenty's formula. The syntax for this method is:

float ST_Distance_Spheroid(geometry geomlonlatA, geometry geomlonlatB, spheroid measurement_spheroid);

An example using the method:

--PostgreSQL
SELECT ST_Distance_Spheroid(ga, gb) AS distance_hayford1924,
       ST_Distance_Spheroid(ga, gb, 25830) AS distance_grs80,
       ST_Distance_Spheroid(ga, gb,
         'SPHEROID["WGS 84",6378137,298.257223563]') AS distance_wgs84
FROM (SELECT ST_GeomFromText('POINT(-3.52 41.0)', 23030) AS ga,
             ST_GeomFromText('POINT(2.33 48.86)', 23030) AS gb) AS foo;

--Result
 distance_hayford1924 |  distance_grs80  |  distance_wgs84
----------------------+------------------+------------------
     987364.755996404 | 987329.984551754 | 987329.984556481

CHAPTER 7
SYSTEM IMPLEMENTATION AND OVERVIEW

7.1 DATA MINING AND SIMULATION SETUP

The review of literature presented in Chapter 2 highlights the concepts of cloud computing, SLAs between a cloud consumer and a cloud provider, location-based advertising, and the concept of profit maximization. In this chapter, we discuss the data mining and data analysis techniques used in our implementation. We have based our model on that of a cloud infrastructure customer hosting a smart phone application in a cloud to provide location-based advertising services for mobile users. We propose a methodology whereby a cloud customer can determine their optimal investment in cloud resources based on the revenue they obtain through transaction fees gained from point of sale purchases. This chapter focuses on setting up a Eucalyptus cloud, on the database design and implementation for maintaining real-time geospatial data, and on controlling the cloud microprocessor resources used to process the streams of real-time geospatial data transmitted by users of the mobile application.

Because location-based advertising services are sensitive to the user's physical position, it is imperative that a mobile application user receive offers for services when they are as close as possible to the respective point of sale. Our model hypothesizes that a user's desire to respond to an offer received on his or her smart device (e.g., smart phone), walk to a point of sale, and purchase a product is inversely proportional to the distance the user is from the point of sale at the time the offer is received. Therefore, to minimize this distance, we wish to minimize the response time between the moment a user sends his or her current location estimate (e.g., GPS position) to the cloud through a web service and the moment an offer is received and displayed on the user's phone.

To host such a database managing real-time geospatial data, we developed a private cloud using the Eucalyptus software discussed in Chapters 3 and 4, which is based on the Amazon EC2 web services. The Eucalyptus cloud model requires a minimum of two servers, one for the front-end controller and the other for the node controller. In our setup we deployed two servers in the initial iteration and later added another server with a higher configuration to increase the number of microprocessor cores: one server serves as the front controller and the other two act as node controllers. The node controllers host all the virtual machine instances that are created, and the front controller controls the cloud resources and nodes. The node controllers and the front controller are connected via a private switch, and the front controller is accessible from the public network via the college web server, as shown in Figure 7.1. The front controller has two network interface cards (NICs). In our setup, dspserv.sdsu.edu is configured as the front controller, on which the cloud controller, Walrus, and storage controller processes of Eucalyptus run. There are two server nodes for instance storage: marconi.sdsu.edu and dsp.sdsu.edu.

Figure 7.1. Setup of the private Eucalyptus cloud used in this thesis.

The Eucalyptus network model is based on the managed no VLAN mode described in Chapter 4. A network bridge is configured to connect the controllers to the cluster network. Once the virtual instances boot, the node controllers attach them to the configured bridge, thus connecting them to the public network. In our model, the node controllers are configured with a bridge network interface, and each instance that is created gets an elastic IP address set up in the Eucalyptus configuration file.

The virtual instances run standard Linux distributions. The Linux images are configured and deployed with euca2ools [43] support. Different instance types can be created based on the hardware configuration available on the node controller. In our cloud setup, we have created three instances of type c1.xlarge running the Ubuntu [44] distribution. The instances host three different services: the database, the web-service container, and the simulation that inserts geospatial records into the database. We use PostgreSQL for database support and PostGIS [45] for performing the geospatial calculations. As a first step, we ensured consistent system functionality by inserting records and performing a few geospatial functions on a single core. As we started formulating our model, we collected results while varying the number of microprocessor cores available to the cloud instances. We measure the cloud statistics using the Nagios [46] monitoring tool. One measure is the load average, i.e., a gauge of how many processes on average concurrently demand processor attention. This is evaluated by inserting a large number of records of varying sizes and calculating the cloud load. We observed that, as the cloud resources and the number of inserted records varied, the load curve increased exponentially and reached a saturation stage, as shown in Figure 7.2. Based on this result we devised an optimal core resource allocation model for our cloud hosting smart phone applications.

We control the load on the cloud infrastructure using a multithreaded automation script, sketched below, which inserts records of varying number and size into a PostgreSQL database executing on an instance. While inserting records in parallel using the multithreaded script, we varied the number of microprocessor cores allocated to PostgreSQL. Figure 7.4 describes the flow of the automation script for data collection. The automation framework is built using the Python [47] scripting language. Different methodologies (e.g., great-circle algorithms) were used to find the records within the 50 m radius; we repeated these calculations while varying the cloud resources and recorded measurements. To insert geospatial records into the database according to the schema shown in Figure 7.3, we used a random number generator to produce random latitude and longitude coordinates along with example customer profile data, forming record sizes of 10 kB, 0.5 MB, and 1 MB per inserted record. These record sizes are based on the typical sizes of user profile data transmitted at a given frequency from a user's smart phone to the Web Service container cloud instance while the user is in motion, walking through a shopping area.
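The following is a minimal sketch of such a multithreaded insertion script. The table name, column layout, and connection string are illustrative placeholders rather than our exact schema (Figure 7.3).

import random
import threading
import psycopg2  # PostgreSQL client library

DSN = "host=<db_instance_ip> dbname=lbs user=postgres"  # placeholder

def insert_records(n_records, payload_bytes):
    """Insert n_records random geospatial records with payloads of a given size."""
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    payload = "x" * payload_bytes       # stands in for serialized profile data
    for _ in range(n_records):
        lat = random.uniform(-90.0, 90.0)     # random user position
        lon = random.uniform(-180.0, 180.0)
        cur.execute(
            "INSERT INTO user_geo (latitude, longitude, profile) "
            "VALUES (%s, %s, %s)",
            (lat, lon, payload))
    conn.commit()
    conn.close()

# Eight threads inserting 10 kB records in parallel to load the database cores.
threads = [threading.Thread(target=insert_records, args=(10000, 10 * 1024))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()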

Figure 7.2. Cloud microprocessor core load as a function of the number of records inserted.

Figure 7.3. UML diagram of database storage for a user's geospatial data.

We designed an automation script to simulate a live scenario of a million smart phone users running a mobile application that updates their geospatial location, as shown in the flowchart in Figure 7.4. Using a random data generator to insert a million records of different sizes into the cloud database, we computed the different t_Insertion times. As records were inserted into the database, we computed the list of users within a 50 m radius of a simulated POS. We used the ST_Distance_Sphere [48] and ST_Distance_Spheroid [49] APIs provided by PostGIS. These functions take two latitude/longitude values as input and compute the distance between the two points using the Haversine formula and Vincenty's formulae, respectively. An example of using these functions to compute the list of records within a 50 m radius in PostgreSQL is shown in Figure 7.5 (using ST_Distance_Sphere and ST_Distance_Spheroid).

Figure 7.4. Flow chart for parallel insertion of geospatial records into our Eucalyptus-hosted PostgreSQL database for measuring cloud infrastructure load.

Figure 7.5. PostGIS API example for computing a great-circle distance using the Haversine formula and Vincenty's formulae, respectively.

The times taken to compute these quantities are designated t_Geospatial,sphere and t_Geospatial,spheroid, respectively. While computing these two timings, the users who currently reside within the purchase frontier of a POS (we define the purchase frontier as the boundary of the 50 m radius around the POS) are sent an offer for goods and/or services based on the interest attributes specified in their profiles. A time factor t_Correlation is computed to account for the time taken to perform a statistical correlation of user interests with the goods and/or services offered at a POS. After determining which users inside a purchase frontier to send an offer to, based on the interest correlation, an offer message is sent to each highly correlated user through a Web Service response, as shown in Figure 7.6. This Web Service response time is designated t_Reception. These time quantities were measured as a function of the number of microprocessor cores assigned to the PostgreSQL cloud instance. We varied the number of microprocessor cores assigned to an instance by enabling and disabling each core using the following Linux mechanism:

echo 0 > /sys/devices/system/cpu/cpu7/online

Individual CPU attributes are contained in subdirectories named by the kernel's logical CPU number, and the CPU number is varied based on the hardware configuration of available cores. Writing 0 powers a core off and writing 1 powers it on. Offline CPUs are those not being scheduled, either because they have been taken offline or because they exceed the limit of CPUs allowed by the kernel configuration; online CPUs are those that are online and being scheduled.

Figure 7.6. An offer for goods and/or services sold at a point of sale is sent to users once they enter a purchase frontier, based on an interest correlation. Offers are received through a web service response message.

7.2 DATA ANALYTICS AND MODEL ASSUMPTIONS

On completing over ten simulations, the total time t_Total is computed using Eq. 7.1:

\[ t_{Total,10BaseT} = t_{Insertion,10BaseT} + t_{Geospatial} + t_{Correlation} + t_{Reception,10BaseT} \tag{7.1} \]

where:

t_Total,10BaseT        Total processing time
t_Insertion,10BaseT    Total geo-location record transmission and insertion time
t_Geospatial,sphere    Total geospatial computation time using the Haversine model (50 m)
t_Geospatial,spheroid  Total geospatial computation time using Vincenty's model (50 m)
t_Reception,10BaseT    Total geo-location proximity list reception time
t_Correlation          Total time taken by the cloud infrastructure to perform the statistical correlation

Our simulations were executed on a 10BaseT network (10 Mbps) with the cloud infrastructure on the same LAN. To scale these timings to a smart phone network using 4G Long Term Evolution (LTE) bit rates, we scaled the t_Reception and t_Insertion results to an average 4G network uplink rate of 4 Mbps and a downlink rate of 8 Mbps, and computed a per-user processing time that we designate t_User. In our model, we assume that a user's geospatial data is transmitted to the cloud the moment the user enters the purchase frontier of a POS, and that the user gets a response back with an offer for goods and/or services. The per-user processing time t_User is the time between the user entering the purchase frontier and the user receiving an offer. We assume people walk at an average rate of 5 km/h [49]; thus the distance travelled by a customer is t_User times the average walking speed, as shown in Eq. 7.2:

\[ \text{distance travelled} = t_{User} \times \text{average walking speed (5 km/h)} \tag{7.2} \]

where t_User is the per-user total processing time.

For the distance computation, we have considered a purchase frontier with a circular boundary around a POS and have considered cases where the user travels at distances of 50 m (tangential to the circular purchase frontier), 40 m, 30 m, 20 m, and 10 m from the POS, as shown in Figure 7.7. We assume the walking path to be tangential to a POS. Having estimated the distance travelled by a user upon receiving an offer, we estimate the distance to backtrack to a POS by calculating the hypotenuse of the triangle formed. Figure 7.8 shows the distance to backtrack as a function of the number of PostgreSQL-assigned microprocessor cores and the distance the customer is from the POS. From the computed backtrack distances, we estimate the probability of a person travelling back to the POS to make a purchase, as shown in Figure 7.9, using three different models: Huff, exponential, and linear. These three models are presented to study the variation in customer purchase probability as a function of the backtrack distance and the number of microprocessor cores assigned to a database instance.

Figure 7.7. A mobile application transmits geospatial information to a cloud infrastructure when a user crosses into a purchase frontier.

7.3 LINEAR MODEL

The linear model is a simple model we designed to map a backtrack distance D_backtrack to a probability value (ranging from 0 to 1) of a customer travelling back to a POS to make a purchase after receiving an offer. This model is a function of distance only: for every unit of distance away from the POS we assume the same amount of change in the probability of the customer travelling back to the POS. The model is computed using the d_max and d_min parameters:

\[ P_{linear} = \frac{D_{backtrack} - d_{min}}{d_{max} - d_{min}} \tag{7.3} \]

where d_max is the maximum backtracking distance found for each purchase frontier radius, d_min is the minimum backtracking distance, and D_backtrack is the actual backtracking distance.

Figure 7.8. D_backtrack is the distance a customer must backtrack to reach the point of sale upon receiving an offer.

With the linear model, the probability varies linearly with the backtrack distance, as shown in Figure 7.10.

7.4 EXPONENTIAL MODEL

This model is based on the exponential distribution, where we assume a customer's reluctance to backtrack to a POS increases exponentially with the distance to the POS, as shown in Figure 7.11. We computed the probability of a customer travelling back to a POS as a function of the number of microprocessor cores assigned to a database instance. Compared to the linear model, the exponential model returns a backtracking probability in the range of 20% to 80%. In this model, as the distance from the POS increases by a fixed measure, the probability of a person not making a purchase increases exponentially. For example, if a customer is 30 meters away from the POS, the customer is 60% likely to make a purchase by travelling back to the POS, but at a distance of 40 meters the likelihood is reduced to 30%. The probability of not making a purchase grows exponentially with every unit of distance, as shown in Figure 7.11.
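Both distance-to-probability mappings are easy to sketch in code. In the snippet below, the linear mapping follows Eq. 7.3, while the exponential mapping is one plausible parameterization chosen to reproduce the 60%-at-30 m and 30%-at-40 m behavior described above; the decay constant is our assumption, not a value measured in this thesis.

import math

def p_linear(d_backtrack, d_min, d_max):
    """Linear model (Eq. 7.3), valid for d_min <= d_backtrack <= d_max."""
    return (d_backtrack - d_min) / (d_max - d_min)

def p_purchase_exponential(d_backtrack, d_ref=30.0, p_ref=0.6, k=0.0693):
    """Exponential model: purchase probability decays exponentially with distance.
    With k ~= ln(2)/10, the probability halves every 10 m (0.6 at 30 m, 0.3 at 40 m)."""
    return p_ref * math.exp(-k * (d_backtrack - d_ref))

print(p_purchase_exponential(30))  # ~0.60
print(p_purchase_exponential(40))  # ~0.30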

Figure 7.9. Upon receiving an offer, the probability of a customer backtracking to a POS to make a purchase based on the offer received is determined by a particular probability distribution.

Figure 7.10. Probability as a function of D_backtrack, linear model.

Figure 7.11. Probability as a function of D_backtrack, exponential model.

7.5 HUFF MODEL

This model is based on the gravity model of spatial interaction [50] and is used for the analysis of market areas for retail outlets. Huff (1963) formulated the model on the basis of the gravity model, making it probabilistic in nature. His model states that the probability of demand at location i being satisfied by a retail outlet at location j equals the relative attractiveness of the j-th retail outlet, with attractiveness a function of a scaling factor α:

\[ \text{attractiveness} = \frac{(\text{size of the retail outlet } j)^{\alpha}}{(\text{distance between locations } i \text{ and } j)^{\lambda}} \tag{7.4} \]

where λ is a distance-decay scaling parameter. The Huff model introduces customer choice and discounts demand; it is a tradeoff between the attenuating nature of distance and the size of a retail outlet. The model assumes that demand is inelastic (when a price change has no effect on the supply and demand of a good or service, demand is considered inelastic). In our implementation of the Huff model, we consider one store, and the total set of customers would patronize each store regardless of the store's attraction, size, or location. Hence, the probability of a person travelling back to a POS and making a purchase is not modeled effectively by the Huff model.

7.6 EQUILIBRIUM POINT

When a customer makes a purchase at a POS after travelling the backtrack distance upon receiving an offer, we assume the cloud consumer (i.e., the mobile application provider) charges a transaction fee for every purchase made. We then computed the marginal revenue a cloud consumer would make based on the probability of a customer making a purchase. We varied the transaction fee that the cloud consumer would charge the customer from $0.10 to $1, and computed the marginal revenue for the linear and exponential models. In any business model, a firm aims for profit maximization; to obtain profit maximization, the marginal revenue should equal the marginal cost. Figure 7.12 describes the profit maximization concept graphically. Marginal cost (MC) is the change in total cost that arises from investing in (i.e., purchasing from a cloud provider) one additional microprocessor core. Marginal revenue (MR) is the additional revenue that will be generated by investing in one additional microprocessor core. Figure 7.12 shows the equilibrium point, the point at which MR = MC; this is the point of maximum profit. The region to the left of the equilibrium point is the positive profit region, where more investment could be made to add resources to the cloud infrastructure and increase profit. The region to the right of the equilibrium point is the negative profit region, where further investment in microprocessor core resources from a cloud provider would not produce appreciable profit for the investment made. The MC in our model is calculated based on the pricing of the Amazon EC2 cloud model, as shown in Figure 7.13 [51] and Table 7.1, for a micro instance; micro instances are optimized for applications that require lower throughput but may still periodically consume significant compute cycles. A micro instance (t1.micro) [52] provides a small amount of consistent CPU resources and allows an increase in CPU capacity in short bursts when additional cycles are available. Micro instances are well suited to lower-throughput applications, smart phone applications, and Web sites that require additional compute cycles periodically. MR is computed by accounting for the transaction fees gained at the POS. In our model, we have varied the transaction fee a firm (i.e., the cloud consumer) charges the customer for making a purchase at a POS, and have performed an analysis based on the different results obtained. We present these findings in the next chapter.

Figure 7.12. Profit maximization curve with equilibrium point shown.

Figure 7.13. Screenshot of Amazon EC2 online marginal cost calculator. Source: Amazon. EC2 Calc, 2009. http://calculator.s3.amazonaws.com/calc5.html, accessed Feb. 2013.

Table 7.1. Amazon EC2 Pricing per Instance-Hour Consumed for Each Instance

Standard On-Demand Instances     Linux/UNIX usage    Windows usage
Small (Default)                  $0.065 per Hour     $0.125 per Hour
Medium                           $0.130 per Hour     $0.250 per Hour
Large                            $0.260 per Hour     $0.500 per Hour
Extra Large                      $0.520 per Hour     $1.000 per Hour

Second Generation Standard On-Demand Instances
Extra Large                      $0.550 per Hour     $1.060 per Hour
Double Extra Large               $1.100 per Hour     $2.120 per Hour

Micro On-Demand Instances
Micro                            $0.025 per Hour     $0.035 per Hour
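As a minimal sketch of the marginal-cost side, the following Python fragment converts the Table 7.1 micro-instance rate into a per-core monthly cost, under the simplifying assumption that each additional core is billed like one additional t1.micro instance running continuously; the 730-hour month is an assumption of this sketch, not a figure from our experiments.

# Sketch of MC derived from Table 7.1, assuming one added core is billed
# like one added t1.micro instance-hour (Linux/UNIX rate).

T1_MICRO_HOURLY = 0.025   # $/hour for a t1.micro instance (Table 7.1)
HOURS_PER_MONTH = 730     # assumed billing period

def marginal_cost(cores_added=1):
    """Monthly cost of purchasing `cores_added` additional micro cores."""
    return cores_added * T1_MICRO_HOURLY * HOURS_PER_MONTH

print(marginal_cost())  # 18.25 dollars per month per added core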

CHAPTER 8

EXPERIMENT RESULTS AND ANALYSIS

See Table 8.1 for nomenclature. The different stages of our approach to designing our model are shown in Figure 8.1. In our approach, the user's smart phone application updates his or her current geospatial location upon entering the purchase frontier of a POS, and the cloud infrastructure computes and sends an offer to the user some time after the user has travelled a distance.

We observe that the total processing time required to insert a million records decreases as a function of the microprocessor core count. The simulations were run for two different geospatial functions; ST_Distance_Spheroid took more time than ST_Distance_Sphere, as seen in Figure 8.2 and Figure 8.3. We scaled the simulation results for the total processing time of a 10BaseT user:

t_Total,10BaseT = t_Insertion,10BaseT + t_Geospatial + t_Correlation + t_Reception,10BaseT    (8.1)

Eq. 8.2 describes the calculation of the per-user processing time:

t_User = t_Total,4G / N    (8.2)

In our hypothetical model design we considered a 4G mobile network having an uplink of 4 Mb/s and a downlink of 8 Mb/s, for scenarios of a user walking at distances of 10 m, 20 m, 30 m, 40 m, and 50 m away from the store, as shown in Figure 8.4, which depicts the Fashion Valley Mall shopping complex [53] with purchase frontier circles of different radii. The green dashed line in Figure 8.4 [53] shows the path of one user walking at a distance of 10 m away from the store. The yellow dashed line shows the path of another user walking tangential to the 50 m purchase frontier. The user on the green path receives an offer at a shorter distance from the POS, compared to the user on the yellow path, and will thus have a shorter backtrack distance to the POS. The backtrack distances are marked using solid lines.
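The following Python sketch restates Eqs. 8.1 and 8.2 directly; the stage timings passed in are placeholders chosen only to show the shape of the calculation, not measured values from our simulations.

# Sketch of Eqs. 8.1-8.2: total processing time as the sum of the stages in
# Table 8.1, and per-user time as the 4G total divided by the N records.

def t_total(t_insertion, t_geospatial, t_correlation, t_reception):
    return t_insertion + t_geospatial + t_correlation + t_reception  # Eq. 8.1

def t_user(t_total_4g, n_records):
    return t_total_4g / n_records  # Eq. 8.2

total = t_total(1200.0, 300.0, 80.0, 400.0)  # seconds; illustrative values
print(t_user(total, 1_000_000))              # per-user time in seconds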

Table 8.1. Nomenclature

Variable                 Variable Description
t_Total                  Total processing time
t_Insertion              Total geo-location record transmission and insertion time
t_Geospatial,sphere      Total geospatial computation time using the haversine model (50 m)
t_Geospatial,spheroid    Total geospatial computation time using Vincenty's model (50 m)
t_Reception              Total geo-location proximity list reception time
t_Correlation            Total time taken by the cloud infrastructure to perform a statistical correlation
t_User                   Per-user total processing time
N                        Number of geo-location records to insert for load analysis
x                        Distance to POS
D_Travelled              Distance travelled in meters by a user upon entering a purchase frontier until receiving an offer
D_Backtrack              Distance a customer must backtrack to reach a point of sale upon receiving an offer
r_POS                    Radius of the purchase frontier centered on the POS

From the per-user processing time, we computed the marginal user time. Figure 8.5 shows that the time saved peaks at four cores, after which little additional time is saved; after six cores, the time saved is close to zero. The proximity list is the list of all POSs within the proximity of the user, based on the user's interest in goods and services.

Based on the user processing time, we computed the total distance a person would travel by the time he or she receives an offer, assuming the user is continuously moving. This calculation assumes the user walks at an average speed of 5 km/h [49]. Using the distance formula, we computed the distance the user would have travelled and plotted it as seen in Figure 8.6 [49]. From D_Travelled we compute the D_Backtrack distance using the Pythagorean formula, with D_Travelled and r_POS being the legs of the triangle and D_Backtrack the hypotenuse in Figure 7.8. We plot D_Backtrack as a function of microprocessor core count and notice that it decreases as the core count grows, as shown in Figure 8.7:

D_Backtrack = sqrt(D_Travelled^2 + r_POS^2)    (8.3)

where r_POS = 50, 40, 30, 20, or 10 m.
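A small Python sketch of Eq. 8.3 follows; the 20 s per-user offer latency is a hypothetical input, while the 5 km/h walking speed is the published average [49].

import math

# Sketch of Eq. 8.3: D_Backtrack is the hypotenuse of the right triangle
# whose legs are D_Travelled and the purchase-frontier radius r_POS.

WALK_SPEED_M_S = 5000 / 3600  # 5 km/h average walking speed [49]

def d_travelled(t_user_seconds):
    """Distance walked before the offer arrives (continuous motion)."""
    return WALK_SPEED_M_S * t_user_seconds

def d_backtrack(d_trav, r_pos):
    return math.hypot(d_trav, r_pos)  # sqrt(d_trav^2 + r_pos^2)

for r in (10, 20, 30, 40, 50):  # purchase-frontier radii, meters
    print(r, round(d_backtrack(d_travelled(20.0), r), 1))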

Figure 8.1. Different stages of a smartphone user receiving an offer, once the user enters the purchase frontier.

8.1 PROBABILITY MODELS

Huff's model, as discussed in Section 7.5, applies to a scenario such as that shown in Figure 8.8. The formula for the Huff model is

P_i,j = (A_j^α / D_i,j^β) / Σ_{j=1}^{n} (A_j^α / D_i,j^β)    (8.4)

where
A_j is a measure of the attractiveness of store j, such as square footage
D_i,j is the distance from the user's location i to store j
α is an attractiveness parameter, estimated from empirical observations

Figure 8.2. Total processing time t_Total for 1,000,000 0.5 Mb geo-location records on a 10 Mbps network, as a function of cloud microprocessor core count.

β is the distance decay parameter, estimated from empirical observations
n is the total number of stores, including store j.

Since in our hypothetical model we have a single store with a mobile user, we formulated the proximity Huff model by changing the indices, as shown in Figure 8.9. We generalized the Huff model to our design and obtained Eq. 8.5:

P_huff,mobile = (s / d_j^a) / Σ_{i=1}^{8} (s / d_i^a)    (8.5)

where
s is the attraction factor, scaled to 200
a is a scale of the attraction factor
d is the distance to the POS
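A minimal Python sketch of the proximity Huff model of Eq. 8.5 is shown below; s = 200, a single store, and eight observed positions follow the text, while the distance list itself is hypothetical.

# Sketch of Eq. 8.5: proximity Huff probability for a single store and a
# mobile user observed at eight distances. The distance list is hypothetical.

def proximity_huff(distances, j, s=200.0, a=1.0):
    """Share of probability mass at the user's j-th observed distance."""
    weights = [s / (d ** a) for d in distances]
    return weights[j] / sum(weights)

dists = [10, 20, 30, 40, 50, 60, 70, 80]   # meters
print(round(proximity_huff(dists, 0), 3))  # the closest position dominates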

Figure 8.3. Per-user processing time t_User based on geo-location data exchange on a 4G mobile network (8 Mbps download, 4 Mbps upload) for 1E6 0.5 Mb geo-location records, as a function of cloud microprocessor core count.

The probability range produced by the Huff model was not acceptable for our use: as we observed from Figure 8.10, there is at most a 45% probability that a user would travel the backtrack distance and make a purchase at the POS. Since our model is designed per store, and the attractiveness is equal across stores, we do not obtain a probability above 50%. However, if a user receives multiple offers for multiple stores, the user's interests can be correlated to each store, and the attractiveness parameter can then be taken as a function of the user's correlation coefficient for each store, i.e.,

A_i = f(ρ_i)    (8.6)

where ρ_i is the user's correlation coefficient for store i. The Huff model is used in spatial analysis and assumes the probability of a customer visiting a store to make a purchase is a function of the distance to the POS, the store's attractiveness, and the attractiveness of competing stores.

Figure 8.4. Two different customer purchase scenarios, considering a POS in Fashion Valley Mall, San Diego, California. Source: Fashion Valley. Fashion Valley Mall, 2010. http://fashion-valley.mallsite.us/, accessed Feb. 2013.

The exponential distribution is a family of continuous probability distributions that describes the time between events in a Poisson process, in which events occur at a constant rate and independently of each other. The probability density function of an exponential distribution is defined as

f(x; λ) = λ e^{-λx} for x ≥ 0, and 0 for x < 0    (8.7)

where λ is the rate parameter of the events. In Eq. 8.7, for λ = 1, the probability density ranges between 0 and 1. Hence we designed our exponential model as

P = e^{-λx}, λ = 1, x = D_Backtrack / 50    (8.8)

This model was chosen because, for λ = 1, the exponential pdf varies from 0 to 1 to designate a 0% to 100% chance, respectively, that a customer will backtrack to a POS upon receiving an offer.
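The scaled exponential model of Eq. 8.8 can be sketched in a few lines of Python; the sample backtrack distances below are illustrative, while the 50 m scale is the maximum purchase-frontier radius from the text.

import math

# Sketch of Eq. 8.8: backtrack probability from the exponential pdf with
# lambda = 1 and the backtrack distance scaled by the 50 m frontier radius.

def p_exponential(d_backtrack, lam=1.0, scale=50.0):
    return math.exp(-lam * d_backtrack / scale)

for d in (0, 10, 25, 50):  # meters, illustrative
    print(d, round(p_exponential(d), 2))  # 1.0, 0.82, 0.61, 0.37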

Figure 8.5. Marginal per-user processing time saved, based on geo-location data exchange on a 4G mobile network (8 Mbps download, 4 Mbps upload) for 1E6 0.5 Mb geo-location records, as a function of incremental cloud microprocessor core addition.

From the results we observe there is a 20-80% probability that the user would travel the backtrack distance and make a purchase at the POS, as shown in Figure 8.11. Using the linear model, the probability scales between 0 and 1, and we notice that for core counts of four or more, the probability that the user would travel the backtrack distance is 100%. The linear model is given by Eq. 8.9:

P_linear = (d_max - d_backtrack) / (d_max - d_min)    (8.9)

where
d_max is the maximum backtracking distance found for each purchase frontier radius
d_min is the minimum backtracking distance
d_backtrack is the actual backtracking distance

Figure 8.6. The distance traveled in meters by a user upon entering a purchase frontier until receiving a geo-location proximity list with one or more offers, assuming a walking speed of 5 km/h. Source: N. Carey and R. Knoblauch. Establishing pedestrian walking speeds. FHWA, Transportation and Records, Washington, DC, 1975.

Here d_max and d_min are the maximum and minimum distances the user would have travelled, as a function of the microprocessor core count. As shown in Figures 7.9 and 8.12, the greater the number of microprocessor cores, the shorter the distance travelled by the user; the distance travelled is inversely related to the core count, as seen in Figure 8.13. The shorter the distance travelled by the user, the higher the probability that the user travels the backtrack distance to make a purchase upon receiving the offer.

8.2 MARGINAL REVENUE

Marginal revenue is the additional revenue generated by increasing the microprocessor core count by one unit, that is, the additional revenue generated by adding another microprocessor core to a cloud instance supporting a database service.
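A hedged Python sketch ties the linear model of Eq. 8.9 to this marginal-revenue definition; the per-core backtrack distances and the figure of 1,000 expected purchases are hypothetical placeholders, and only the $0.10 transaction fee comes from our experiments.

# Sketch combining Eq. 8.9 with the marginal-revenue definition above.
# d_by_core lists D_Backtrack for 1, 2, ... cores; all inputs are illustrative.

def p_linear(d_backtrack, d_min, d_max):
    """Eq. 8.9: probability falls linearly as the backtrack distance grows."""
    return (d_max - d_backtrack) / (d_max - d_min)

def marginal_revenue(d_by_core, d_min, d_max, fee, expected_purchases):
    """MR of each added core: extra expected fee revenue from a shorter backtrack."""
    p = [p_linear(d, d_min, d_max) for d in d_by_core]
    return [fee * expected_purchases * (p[i] - p[i - 1]) for i in range(1, len(p))]

d_by_core = [60, 45, 35, 30, 28, 27.5]  # meters per core count, illustrative
print(marginal_revenue(d_by_core, 27.5, 60, 0.10, 1000))

Because the backtrack distance improves less and less with each added core, the MR series produced by this sketch declines steeply after the first few cores, mirroring the shape of the plots that follow.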

Figure 8.7. Distance a customer must backtrack to reach a point of sale upon receiving an offer on a 4G mobile smart phone, as a function of microprocessor core count.

Figure 8.8. A single user at different distances D1, D2, D3, and D4 away from different POSs of different attractiveness and size; the probability of the user going to a particular POS is described by the Huff model.

Figure 8.9. A single user at different distances D1, D2, D3, and D4 away from a single POS; the probability of the user going to the POS is described by the proximity Huff model.

Figure 8.10. Probability a user backtracks to a point of sale upon receiving an offer on a 4G mobile smart phone, as a function of microprocessor core count. Probability is modeled using the Huff method.

Figure 8.11. Probability a user backtracks to a point of sale upon receiving an offer on a 4G mobile smart phone, as a function of microprocessor core count, assuming a scaled exponential probability distribution.

We plotted the marginal revenue for a transaction fee of $0.10 using the linear model (Figure 8.14); from the plot we observed that the marginal revenue decreases steeply after four cores. We also plotted this for different r_POS and observe that the marginal revenue is maximal for a distance of 10 m from the POS. Figure 8.15 shows the plot of marginal revenue for a transaction fee of $0.10 using the exponential model. From Figure 8.15 we observe that the marginal revenue likewise decreases steeply after four cores, and that it is maximal for a distance of 10 m from the POS.

Based on the marginal plots of the linear and exponential models, we calculated the marginal cost as a function of the number of cores using Amazon EC2 [51] pricing, as shown in Table 7.1, and plotted it over the marginal revenue. We considered a micro instance [52] as the Amazon EC2 instance type for our model design. Micro instances (t1.micro) provide a small amount of consistent CPU resources and allow one to increase CPU capacity in short bursts when additional cycles are available; these are well suited for lower-throughput applications and mobile applications that require additional compute cycles periodically.

Figure 8.12. Different D_Backtrack distances travelled by a smartphone user walking at an average rate of 5 km/h, for varying numbers of cores in the cloud infrastructure.

We plotted the marginal revenue curves for transaction fees of $0.25, $0.50, and $1.00 for the exponential and linear models, as seen in Figures 8.16 to 8.21. In Figure 8.22, we make a very important observation about the equilibrium point, the intersection of marginal revenue and marginal cost, at a core count of four.

8.3 MARGINAL REVENUE VS. MARGINAL COST

Using the MR-MC analysis, we varied the transaction fee over the values $0.25, $0.50, and $1.00. In Figure 8.22 we observe that the equilibrium core count lies between 4 and 5. Since the core count can only be an integer, the cloud consumer can decide to have four or five cores for profit maximization. In Figure 8.23 and Figure 8.24 we observe that the core count is closer to five for the linear model and is between 4 and 5 for the exponential model.

Figure 8.13. Probability a user backtracks to a point of sale upon receiving an offer on a 4G mobile smart phone, as a function of microprocessor core count. Probability is modeled using a linear model.

Since the core count can only be an integer, the cloud consumer can invest in five cores for profit maximization. Cloud elasticity, which allows the core count to be adjusted on demand or in real time, leaves this choice to the cloud consumer. For a transaction fee of $0.25 we plotted the MR-MC curves, as shown in Figure 8.25 and Figure 8.26, and observed that the optimal number of cores is between 4 and 5 for the linear model and between 3 and 4 for the exponential model. Similarly, on increasing the transaction fee to a dollar, we observed that the optimal number of cores required for profit maximization is close to 5 for the linear model and close to 4 for the exponential model, as seen in Figure 8.27 and Figure 8.28.
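The MR-MC comparison of this section can be sketched as a simple search for the last core whose marginal revenue still covers its marginal cost; the MR and MC series below are illustrative stand-ins shaped like Figures 8.22 to 8.28, not our measured curves.

# Sketch of the equilibrium-point rule: keep adding cores while MR >= MC.
# Both series are illustrative; index i is the (i+1)-th added core.

def equilibrium_cores(mr, mc):
    n = 0
    for r, c in zip(mr, mc):
        if r < c:
            break
        n += 1
    return n

mr = [46.2, 30.8, 15.4, 6.2, 1.5]  # $ per added core, illustrative
mc = [5.0, 5.0, 5.0, 5.0, 5.0]     # flat per-core cost, illustrative
print(equilibrium_cores(mr, mc))   # -> 4, consistent with the 4-5 core result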

Figure 8.14. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming a linear model for purchase probability, and a $0.10 transaction fee on each purchase.

Figure 8.15. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming an exponential model for purchase probability, and a $0.10 transaction fee on each purchase.

Figure 8.16. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming a linear model for purchase probability, and a $0.25 transaction fee on each purchase.

Figure 8.17. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming an exponential model for purchase probability, and a $0.25 transaction fee on each purchase.

Figure 8.18. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming a linear model for purchase probability, and a $0.50 transaction fee on each purchase.

Figure 8.19. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming an exponential model for purchase probability, and a $0.50 transaction fee on each purchase.

Figure 8.20. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming a linear model for purchase probability, and a $1.00 transaction fee on each purchase.

Figure 8.21. Marginal revenue based on the probability of a customer backtracking to a POS and making a purchase, assuming an exponential model for purchase probability, and a $1.00 transaction fee on each purchase.

Figure 8.22. Marginal cost vs. marginal revenue: Marginal revenue based on the probability of moving to a POS using a linear model, on an 8 Mb/s (4G mobile network), for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.10. Marginal cost based on the Amazon t1.micro instance.

Figure 8.23. Marginal cost vs. marginal revenue: Marginal revenue based on the probability of moving to a POS using a linear model, on an 8 Mb/s (4G mobile network), for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.50. Marginal cost based on the Amazon t1.micro instance.

Figure 8.24. Marginal cost vs. marginal revenue: Marginal revenue based on the probability of moving to a POS using an exponential model, on an 8 Mb/s (4G mobile network), for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.50. Marginal cost based on the Amazon t1.micro instance.

Figure 8.25. Marginal cost vs. marginal revenue: Marginal revenue based on the probability of moving to a POS using a linear model on an 8 Mb/s (4G mobile network), for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.25. Marginal cost based on the Amazon t1.micro instance.

Figure 8.26. Marginal cost vs. marginal revenue: Marginal revenue based on the probability of moving to a POS using an exponential model on an 8 Mb/s (4G mobile network), for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$0.25. Marginal cost based on the Amazon t1.micro instance.

Figure 8.27. Marginal cost vs. marginal revenue: Marginal revenue based on the probability of the user moving to a POS using a linear model, on an 8 Mb/s (4G mobile network), for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$1.00. Marginal cost based on the Amazon t1.micro instance.

Figure 8.28. Marginal cost vs. marginal revenue: Marginal revenue based on the probability of the user moving to a POS using an exponential model, on an 8 Mb/s (4G mobile network), for two sets of PostGIS APIs, for a total of 1,000,000 records of 0.5 Mb each and T=$1.00. Marginal cost based on the Amazon t1.micro instance.

CHAPTER 9

CONCLUSION AND FUTURE IMPLEMENTATIONS

Using the resources of cloud IaaS services and the economics of profit maximization, we have proposed a new methodology for cloud consumers to decide when, and when not, to invest in additional microprocessor cores in a cloud infrastructure.

9.1 CONCLUSION

We provide a new model that can be used to estimate the number of microprocessor cores a cloud infrastructure should have, for a given business model, to maximize profit. This model can be generalized to any for-profit mobile application hosted in a cloud to determine, based on the revenue earned in return, the proper amount of resources to invest in the cloud. In this thesis, we considered the example of a smart phone application hosted in a Eucalyptus/Amazon EC2 cloud, where the transaction fees are appropriate for small consumer purchases. The equilibrium point, determined by the intersection of the marginal revenue and marginal cost curves, gives a cloud consumer insight into the approximate amount of resources to invest. We formulated this model by considering how fast an offer for goods and services is sent to a mobile user, as a function of the number of invested microprocessor cores in a cloud, and the probability of the user travelling a backtrack distance and making a purchase at a point of sale.

Based on our chosen behavioral models, if a user is near a point of sale and receives an offer within a 10 m radius, the user has a 100% probability of making a purchase at the POS under the linear model, and an 80% probability under the exponential model. If the user is at the boundary of the purchase frontier, the probability is much lower that the user will choose to travel the backtrack distance to make a purchase, as shown in Figure 9.1. We have hypothesized that our backtracking probability model simulates actual consumer behavior. In the linear model, the reluctance of a user to travel back to a POS is proportional to the distance the user has travelled, but in the exponential model the reluctance increases exponentially with each fixed distance travelled by the user.

Figure 9.1. Scenario of two different users travelling at different distances from a POS. User1, who is tangential to the POS, is less likely to backtrack, as indicated by the red cross. User2, who is 20 m away from the POS, is more likely to backtrack to the POS upon receiving an offer for goods or services, as indicated by the green check.

Other, more complex models could be incorporated based on this idea. Our scheme can also assist a consumer in comparing different cloud services: by plotting the marginal revenue curves against the different marginal costs of different cloud providers, the consumer can estimate which provider would yield greater profit for specific POS transaction fees. Our model can be generalized to any cloud provider, any backtrack distance calculation, any POS purchase probability model, and any POS transaction fee, to determine the optimal core count for profit maximization. Our results showed that the optimal number of cores required for profit maximization is 4 to 5, based on an Amazon EC2 cloud micro instance and our selected transaction fees. From our results for the exponential model, we found that for a transaction fee of $0.25, the optimal number of cores required for profit maximization is 3 to 4, and for a transaction fee of $1.00, it is 4 to 5.

We also observed that increasing the transaction fee generates more revenue per sale, and this revenue can be invested in adding microprocessor resources to a database instance of a cloud infrastructure to reduce offer latency.

9.2 FUTURE WORK

We plan to measure actual mobile consumer shopping behavior by logging when users visit a POS upon receiving an offer on a smart phone. We will record how far each user is from a POS and whether or not the user backtracked to the POS upon receiving the offer. These data can be used to validate our linear and exponential models, or to provide insight into new models that govern shopping behavior.

BIBLIOGRAPHY

[1] Eucalyptus. Open Source Private and Hybrid Clouds from Eucalyptus, 2008. http://www.eucalyptus.com, accessed Feb. 2013.
[2] Google. Apps SLA, 2001. http://www.google.com/apps/intl/en/terms/reseller_sla.html, accessed Feb. 2013.
[3] K. Zickuhr. 4% of Online Americans Use Location-Based Services, 2010. http://www.pewinternet.org/reports/2010/location-basedservices/overview.aspx?view=all, accessed Feb. 2013.
[4] Amazon. AWS, 2005. http://aws.amazon.com/, accessed Feb. 2013.
[5] Google. Google Cloud, 2005. https://cloud.google.com/products/, accessed Feb. 2013.
[6] Microsoft. Windows Azure, 2007. http://www.windowsazure.com/en-us/, accessed Feb. 2013.
[7] S. Landsburg. Price Theory and Applications. South-Western, Cincinnati, OH, 2002.
[8] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic. Cloud Computing and Emerging IT Platforms. ACM, The Netherlands, 2009.
[9] T. Kraska, S. Loesing, and D. Kossmann. An evaluation of alternative architectures for transaction processing in the cloud. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, 2010. ACM.
[10] V. Mateljan, D. Cisic, and D. Ogrizovic. Cloud database-as-a-service (DaaS) - ROI. Proceedings of the 33rd International MIPRO Convention, Opatija, Croatia, 2010. IEEE.
[11] K. Buell and J. Collofello. Cost excessive paths in cloud based services. Proceedings of the 2012 IEEE 13th International Conference on Information Reuse and Integration (IRI), Las Vegas, NV, 2012. IEEE.
[12] J. Duggan, U. Cetintemel, O. Papaemmanouil, and E. Upfal. Performance prediction for concurrent database workloads. Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece, 2011. ACM.
[13] J. Liu and H. Ting-Lei. Dynamic route scheduling for optimization of cloud database. Proceedings of the International Conference on Intelligent Computing and Integrated Systems (ICISS), Guilin, China, 2010. IEEE.
[14] A. Colorni, M. Dorigo, and V. Maniezzo. Distributed optimization by ant colonies. Proceedings of the First European Conference on Artificial Life, Paris, France, 1992.
[15] P. Xiong, Y. Chi, S. Zhu, H. J. Moon, C. Pu, and H. Hacigümüş. Intelligent management of virtualized resources. Proceedings of the 2011 IEEE 27th

International Conference on Data Engineering (ICDE), Hannover, Germany, 2011. IEEE.
[16] TPC. TPC-W Benchmark, 1998. http://www.tpc.org/tpcw/, accessed Feb. 2013.
[17] D. Huff. Huff, 2003. http://www.esri.com/news/arcuser/1003/files/huff.pdf, accessed Feb. 2013.
[18] ESRI. ArcGIS Business Analyst Server, 2007. http://www.esri.com/news/arcnews/fall07articles/ag-ba-server.html, accessed Feb. 2013.
[19] C. Gu and F. R. Li. Long-run marginal cost pricing based on analytical method for revenue reconciliation. IEEE Transactions on Power Systems, 26:103-110, 2011.
[20] M. Mazzucco, D. Dyachuk, and R. Deters. Maximizing cloud providers revenues via energy aware allocation policies. Proceedings of the 2010 IEEE 3rd International Conference on Cloud Computing (CLOUD), Miami, FL, 2010. IEEE.
[21] H. Xu and B. Li. Maximizing revenue with dynamic cloud pricing: The infinite horizon case. Proceedings of the IEEE International Conference on Communications (ICC), Ottawa, Ontario, Canada, 2012. IEEE.
[22] D. Ta, L. Xiaorong, and M. G. Rick. QoS-aware revenue-cost optimization. Proceedings of the IEEE/ACM 16th International Symposium on Distributed Simulation and Real Time Applications, Salford, England, 2012. IEEE.
[23] M. Mazzucco, M. Vasar, and M. Dumas. Squeezing out the cloud via profit-maximizing resource allocation policies. Proceedings of the 20th International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), Washington, DC, 2012. IEEE.
[24] K. H. Prasad, A. T. Faruquie, V. V. Subramaniam, M. Mohania, and V. Girish. Resource allocation and SLA determination for large data processing services over cloud. Proceedings of the 2010 IEEE International Conference on Services Computing, Miami, FL, 2010. IEEE.
[25] G. Feng, S. K. Garg, R. Buyya, and W. Li. Revenue maximization using adaptive resource provisioning. Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing, Beijing, China, 2012. IEEE.
[26] S. Bouchenak. Automated Control for SLA-Aware Elastic Clouds. ACM, New York, NY, 2010.
[27] Y. Shao, L. Di, B. Guo, J. Gong, and Y. Bai. Geoprocessing on the Amazon cloud computing platform - AWS. Proceedings of the Agro-Geoinformatics 2012 First International Conference, Shanghai, China, 2012. ACM.
[28] C. Baun, M. Kunze, J. Nimis, and S. Tai. Cloud Computing: Web-Based Dynamic IT Services. Springer, Karlsruhe, Germany, 2011.
[29] Xen. Xen 4.2 Releases, 2012. http://xen.org/news/2012/09_xen_4_2_releases.html, accessed Feb. 2013.

[30] Linux KVM. KVM, 2001. http://www.linux-kvm.org/page/main_page, accessed Feb. 2013.
[31] VMware. VMware vCenter, 2003. http://www.vmware.com/products/vcenterserver/overview.html, accessed Feb. 2013.
[32] P. Hallam-Baker. WS-Security SOAP, 2003. https://www.oasis-open.org/committees/download.php/2314/wss-soapmessagesecurity-13-050103-merged.pdf, accessed Feb. 2013.
[33] RSA Laboratories. PKCS#12, 2001. http://www.rsa.com/rsalabs/node.asp?id=2138, accessed Feb. 2013.
[34] PostgreSQL. About, 2001. http://www.postgresql.org/, accessed Feb. 2013.
[35] U.C. Berkeley. PostgreSQL 4.2, 1985. http://db.cs.berkeley.edu/postgres.html, accessed Feb. 2013.
[36] L. A. Rowe. VLDB 87: Proceedings of the 13th International Conference on Very Large Data Bases, California, 1987. Morgan Kaufmann Publishers, Inc.
[37] G. M. Lohman. Grammar-like functional rules for representing query optimization alternatives. Proceedings of the SIGMOD Conference, New York, NY, 1988. ACM.
[38] PostgreSQL. Applications, 1991. http://www.postgresql.org/download/products/7/, accessed Feb. 2013.
[39] Microsoft. Spatial Objects, 2005. http://technet.microsoft.com/enus/library/bb964711.aspx, accessed Feb. 2013.
[40] R. W. Sinnott. Virtues of the Haversine. Sky and Telescope, 68(2):159, 1984.
[41] T. Vincenty. Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations. Surv. Rev., XXIII(176):88-93, 1975.
[42] C. Veness. Vincenty Formula for Distance between Two Latitude/Longitude Points, 2010. http://www.movable-type.co.uk/scripts/latlong-vincenty.html, accessed Feb. 2013.
[43] Eucalyptus. Euca2ools, 2010. http://www.eucalyptus.com/download/euca2ools, accessed Feb. 2013.
[44] Ubuntu. Meet Ubuntu, 2001. http://www.ubuntu.com/, accessed Feb. 2013.
[45] PostGIS. About PostGIS, 2000. http://postgis.net/, accessed Feb. 2013.
[46] Nagios. Nagios Is the Industry Standard in IT, 1998. http://www.nagios.org/, accessed Feb. 2013.
[47] Python. Python Programming Language Official Website, 1998. http://www.python.org/, accessed Feb. 2013.
[48] PostGIS. API Documentation, 2001. http://postgis.refractions.net/documentation/manual-1.4/, accessed Feb. 2013.
[49] N. Carey and R. Knoblauch. Establishing pedestrian walking speeds. FHWA, Transportation and Records, Washington, DC, 1975.

[50] K. E. Haynes and A. S. Fotheringham. Gravity and Spatial Interaction Models. Sage Publications, Beverly Hills, CA, 1984.
[51] Amazon. EC2 Calc, 2009. http://calculator.s3.amazonaws.com/calc5.html, accessed Feb. 2013.
[52] Amazon. Micro Instance, 2006. http://docs.aws.amazon.com/awsec2/latest/userguide/concepts_micro_instances.html, accessed Feb. 2013.
[53] Fashion Valley. Fashion Valley Mall, 2010. http://fashion-valley.mallsite.us/, accessed Feb. 2013.