Politecnico di Torino. Porto Institutional Repository




[Article] A cost-effective cloud computing framework for accelerating multimedia communication simulations

Original Citation: D. Angeli, E. Masala (2012). A cost-effective cloud computing framework for accelerating multimedia communication simulations. In: JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, vol. 72 n. 10, pp. 1373-1385. - ISSN 0743-7315

Availability: This version is available at: http://porto.polito.it/2501543/ since: July 2012

Publisher: Elsevier

Published version: DOI:10.1016/j.jpdc.2012.06.005

Terms of use: This article is made available under terms and conditions applicable to Open Access Policy Article ("Public - All rights reserved"), as described at http://porto.polito.it/terms_and_conditions.html

Porto, the institutional repository of the Politecnico di Torino, is provided by the University Library and the IT-Services. The aim is to enable open access to all the world.

A Cost-Effective Cloud Computing Framework for Accelerating Multimedia Communication Simulations

Daniele Angeli, Enrico Masala
Control and Computer Engineering Dept., Politecnico di Torino, corso Duca degli Abruzzi, 24, 10129 Torino, Italy. Phone: +39-011-0907036, Fax: +39-011-0907099

Abstract

Multimedia communication research and development often requires computationally intensive simulations in order to develop and investigate the performance of new optimization algorithms. Depending on the simulations, they may require even a few days to test an adequate set of conditions due to the complexity of the algorithms. The traditional approach to speed up this type of relatively small simulations, which require several develop-simulate-reconfigure cycles, is indeed to run them in parallel on a few computers and leave them idle while developing the technique for the next simulation cycle. This work proposes a new cost-effective framework based on cloud computing for accelerating the development process, in which resources are obtained on demand and paid only for their actual usage. Issues are addressed both analytically and practically, by running actual test cases, i.e., simulations of video communications on a packet lossy network, using a commercial cloud computing service. A software framework has also been developed to simplify the management of the virtual machines in the cloud. Results show that it is economically convenient to use the considered cloud computing service, especially in terms of reduced development time and costs, with respect to a solution using dedicated computers, when the development time is higher than one hour. If more development time is needed between simulations, the economic advantage progressively reduces as the computational complexity of the simulation increases.

Keywords: Multimedia Simulations, Cloud Computing, Video Communication, Amazon EC2, Cloud Cost Comparison

1. Introduction

Nearly all works that propose new algorithms and techniques in the multimedia communication field include simulation results in order to test the performance of the proposed systems.
Corresponding author. Email: masala@polito.it.
NOTICE: this is the author's version of a work that was accepted for publication in Journal of Parallel and Distributed Computing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version is being published in Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2012.06.005
Preprint submitted to Journal of Parallel and Distributed Computing, July 9, 2012

Validation of new algorithms and ideas through simulation is indeed a fundamental part of this type of research due to the complexity of multimedia telecommunication systems. However, the performance of the proposed systems has to be evaluated in many different network scenarios and for several values of all the key parameters (e.g., available bandwidth, channel noise). Moreover, to achieve statistically significant results, simulations are often repeated many times and then results are averaged. In addition, for research purposes, software and simulators are usually developed only as a prototype, i.e., not optimized for speed. For instance, the video test model software available to researchers is usually one or two orders of magnitude slower than commercial software which, however, might not be suitable for research, since it does not come with the source code needed to experiment with new techniques. As a consequence, the time spent in running this type of simulations and getting results might be significant, sometimes even a few days. Interpreting such results leads to performance improvements and bug fixes that need to be tested again with other simulations. Focusing on the development stage of these simulations and on relatively small size simulations implies that there is the potential for several development-simulation-reconfiguration cycles in a single day, and computational resources are idle between simulation runs. Therefore, the time needed to get simulation results plays a key role in trying to speed up the research activities.

A commonly used approach to speed up simulations is to run them in parallel on several computers. However, this approach strongly depends on several variables, e.g., computer power and availability. Buying new, dedicated computers might not be affordable in case of small research groups since the number of computers should be high and their usage ratio would be low due to the dead times between the various simulation runs needed to improve the algorithms and fix bugs.
Accessing large computer resources could be difficult as well since currently it is not easy to acquire resources to spend in computation costs when the hardware is not owned. Indeed, in typical research projects which include funding for multimedia communication research, high-performance computing is not seen as one of the primary goals of the project, and costs are usually dominated by items such as staff and development of testbeds. Therefore, the cost and risk of acquiring a significant number of computers entirely rest on the research group.

An effective technique to speed up simulations could be to rent computing resources in the cloud and run the computations in parallel; however, there is a significant lack of works in literature that quantify the advantages or disadvantages of such a solution, especially in terms of economic costs in a practical case. This paper addresses this issue by providing quantitative results, including cost comparisons, that can help in taking the most effective decisions.

Note that the type of scientific tasks considered in this work does not fit well into the class of high performance computing (HPC) problems, since in that case requirements are different: the problem is well known, and algorithms to solve it are well tested. The requirement is generally limited to running that type of algorithms in the most cost-efficient way, which typically implies they are batch-scheduled. In the considered scenario, instead, researchers want to run the simulation code as soon as possible to speed up further improvements. The type of simulations addressed in this work is better described by the many task computing (MTC) definition, which denotes high-performance computations comprising multiple distinct activities, coupled via file system operations [1]. Multimedia communication simulations considered here are fully parallelizable by nature, making them perfectly suitable for a cloud computing environment. The possibility to parallelize simulations stems from the fact that results are usually averaged on a number of different simulation runs that do not have dependency among them.
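The independence of the runs can be illustrated with a minimal Python sketch. It is not the simulator used in the paper: `simulate_run` is a hypothetical toy model (fraction of packets delivered on a lossy channel), and the loss rates and seed counts are arbitrary. The only point is that each (seed, parameter) pair is a self-contained task, so any map-like dispatcher, here a local process pool, can execute the tasks in any order before the results are averaged.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor


def simulate_run(seed: int, loss_rate: float, packets: int = 10_000) -> float:
    """One toy simulation run: fraction of packets delivered on a lossy channel."""
    rng = random.Random(seed)  # independent random stream for each run
    delivered = sum(rng.random() >= loss_rate for _ in range(packets))
    return delivered / packets


def _run(task):
    """Unpack one (seed, loss_rate) task; kept at module level so it can be pickled."""
    seed, loss_rate = task
    return loss_rate, simulate_run(seed, loss_rate)


def average_over_runs(loss_rates, seeds, map_fn=map):
    """Run every (seed, parameter) pair independently, then average per parameter."""
    tasks = [(seed, rate) for rate in loss_rates for seed in seeds]
    results = {rate: [] for rate in loss_rates}
    for rate, value in map_fn(_run, tasks):
        results[rate].append(value)
    return {rate: statistics.mean(values) for rate, values in results.items()}


if __name__ == "__main__":
    # The tasks share no state, so a process pool can run them concurrently.
    with ProcessPoolExecutor() as pool:
        averages = average_over_runs([0.01, 0.05, 0.10], range(20), map_fn=pool.map)
    for rate, avg in averages.items():
        print(f"loss rate {rate:.2f}: mean delivery ratio {avg:.4f}")
```

Because every run is seeded explicitly, the serial and the parallel execution give the same averages; only the elapsed time changes.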
Moreover, using several values for the parameters as the input of the algorithms adds another dimension to the problem, which again increases the possibility to further parallelize

the simulation.

The contribution of this paper is twofold. First, it provides a simple software framework that can be used to automate all the operations involved in setting up and managing the cloud environment for the specific simulation to be run, as well as to efficiently and quickly collect the simulation results. Second, it analyzes the cost-performance tradeoff using several actual simulation examples taken from the video communication research area, i.e., H.264/AVC video communications on a packet lossy channel. An analytical approach is employed in order to investigate both the economic costs and the performance of the proposed approach. Moreover, actual prices of a major commercial cloud computing provider are used to quantify, in a practical way, the suitability, economic profitability and development speed up of employing the cloud computing approach for relatively short scientific simulations, such as the ones faced by many researchers, in a realistic scenario where the typical work pattern of researchers is also considered, i.e., they do not work 24 hours a day while computers do.

The paper is organized as follows. Section 2 analyzes the related work in the field. Then, Section 3 investigates the requirements of typical simulations in the multimedia communication field and their suitability for cloud computing. Section 4 describes the developed software framework to automate running simulations in the cloud. In Section 5, a brief performance analysis of the various instances in the Amazon cloud computing infrastructure is presented. Section 6 describes the case studies used in this work, followed by Section 7 which analytically investigates the cost performance tradeoffs with practical examples in the case of both a single simulation and a whole research activity comprising several simulations. Conclusions are drawn in Section 8.

2. Related work

Some works have been presented in recent years on the profitability of using cloud computing services in order to improve the performance of running large scientific applications.
Cloud-based services indeed claim that they can achieve significant cost savings over owned computational resources, due to the pay-per-use approach and reduced costs in maintenance and administration which are spread on a large user basis [2]. Until recently, most of the scientific tasks were run on clusters and grids, and many works explored how to optimize the performance of scientific applications in such specific contexts. A taxonomy of scientific workflow systems for grid computing is presented in, e.g., [3]. However, cloud is not a completely new concept with respect to grids; it indeed has intricate connections to the grid computing paradigm and other technologies such as utility and cluster computing, as well as with distributed systems in general [4].

Several works investigated different aspects involved in running scientific workflows in the cloud, for instance focusing on optimal data placement inside the cloud [5], the overall experience and main issues faced when the cloud is used [6] and the suitability of cloud storage systems such as Amazon S3 for the scientific community [7].

Other works addressed the costs of using cloud computing to perform tasks traditionally addressed by means of an HPC approach with dedicated computational resources. Findings indicate that in this scenario profitability is somehow limited, at least with current commercially available cloud computing platforms [2]. Indeed the performance of general purpose cloud computing systems, such as the virtual machines provided by Amazon [8], is generally up to an order of magnitude lower than that of conventional HPC clusters [9] and is comparable to that of low-performance clusters [10]. Nevertheless, due to the savings achieved by means of the large scale of these cloud

computing systems, they seem to be a good solution for scientific computing workloads that require resources in an instant and temporary way [11], although capacity planning can be quite difficult since traditional capacity planning models do not work well [12].

Many works employ benchmarks aimed at predicting the performance of complex scientific applications. Often, these benchmarks focus on testing the efficiency of the communication between the various computing nodes, which is an important factor in some types of applications. For instance, [13] focuses on establishing theoretical performance bounds for the case of a large number of highly parallel tasks competing for CPU and network resources.

The type of simulations addressed in this work fits into the many task computing (MTC) definition [1]. MTC has been investigated, for instance, in [14], where a data diffusion approach is presented to enable data intensive MTC, in particular dealing with issues such as acquiring computing and storage resources dynamically, replicating data in response to demand, and scheduling computations close to data both under static and dynamic resource provisioning scenarios. Frameworks for task dispatch in such scenarios have also been proposed recently, such as Falkon [15], which simplify the rapid execution of many tasks on architectures such as computer clusters by means of a dispatcher and a multi-level scheduling system to separate resource acquisition from task dispatch. These frameworks are complemented by means of languages suitable for scalable parallel scripting of scientific computing tasks, such as Swift [16].

Some works have been presented to compare the performance achieved by means of the cloud with other approaches based on desktop workstations, local clusters, and HPC shared resources with reference to sample scientific workloads. For instance, in [17] a comparison is performed among all these approaches, mainly focusing on getting reliable predictions of the performance of the various architectures depending on the workflows.
However, no economic cost comparisons between the different platforms are shown. Another work [18] considers a practical scientific task traditionally run on a local cluster. The authors study a cloud alternative based on the Amazon infrastructure, first developing a method to create a virtual cluster using EC2 instances to make portability easier, then investigating how the different data storage methods provided by Amazon impact on the performance. While costs of the Amazon cloud are considered in detail for the proposed architectures, no cost comparisons with the previous cluster-based architecture are presented.

This work helps in quantifying the economic advantage and potential drawbacks in replacing computers dedicated to simulation in a small research lab with a cloud computing solution. To the best of our knowledge, no works have addressed this issue so far with reference to the relatively small size simulations presented in this paper, apart from our short preliminary study presented in [19]. Even though this might seem a quite peculiar simulation scenario, many researchers, at least in the multimedia communication field, share the need to perform simulations of the size discussed here. Note also that the computational requirements of these simulations are constantly increasing due to the tendency to move towards high quality, high resolution images and video, urging researchers to find cost effective ways to deal with these types of simulations.

3. Analysis of Simulation Requirements

Typical simulations in the multimedia communication field involve running the same set of algorithms many times with different random seeds at each iteration. The objective is to evaluate the performance of the system under test by averaging the results achieved in various conditions, e.g., different realizations of a packet lossy channel, so that confidence intervals are minimized. Clearly, such a setup allows many simulations to run in parallel, since no interaction among them

is required except when merging the results at the end of the simulation. Despite the conceptual simplicity, the actual computational load can be high since multimedia codecs might be prototypes only, not optimized for speed, and channel models or other robustness techniques might be complex to simulate. Moreover, consider that to achieve statistically significant results many different test signals, e.g., video sequences, should be used in the experiments so that techniques are validated across a range of different input conditions.

Other similarly heavy tasks might include extensive precomputations in order to optimize the performance of algorithms that are supposed to run in real time once deployed in an actual system. As an example, consider a system optimizing the transmission policy of packets in a streaming scenario. Some precomputed values regarding the characteristics of the content, such as the distortion that would be caused by the loss of some parts of the data, can be useful to the optimization algorithms, but these values need to be computed in advance (see, e.g., [20, 21]).

Therefore, due to the typical peculiarities of multimedia communication simulations, very little effort is needed to parallelize them. Often, no interaction is needed until collection of results (or not at all when considering different input signals). Even in more complex cases, such as precomputation, multimedia signals can usually be easily split into different independent segments, for instance groups of pictures (GOP) in video sequences, and processed almost independently. Therefore, the simulation types described in this section can take full advantage of the availability of multiple computing units, as in a cloud environment. Parallelism can be exploited both at the CPU level, using more CPUs, and within the CPU, taking advantage of multiple cores. As with modern computers, cloud environments offer multicore CPUs in the highest performance tiers, which indeed require parallelism for a cost effective exploitation of the resources.

4. The Cloud Simulation Software Framework

In this work we focus on the Amazon AWS offer as of October 2011, which provides the Elastic Compute Cloud service, in brief Amazon EC2, that includes a number of instance types with different characteristics in terms of CPU power, RAM size and I/O performance. The Amazon AWS platform allows controlling the deployment of resources in different ways, for instance by using a web interface or by means of an API, available for different languages. In all cases (web or API), the deployment of virtual systems in a remote environment and their monitoring requires several operations, although conceptually simple. While for simple operations and management of virtual servers this task can be easily accomplished by a human operator through, e.g., the web interface, a more time efficient system is needed to manage at the same time the activation, configuration and deactivation of a number of instances in order to automatically create the virtual environment needed by the simulations. A simulation could, in fact, require creating, for instance, ten virtual machines, each one fed with different input parameters so that it operates on the correct set of data, then checking that every one of them is correctly running, and finally collecting the resulting data in a single central point.

Since the simulations considered in this work are quite short when carried out in the cloud computing environment, the time spent in setting up the appropriate simulation environment (e.g., activating instances, feeding them with the correct startup files, etc.) must be minimized, otherwise the advantage of cloud computing in terms of speed is reduced. To minimize the set up time, an automatic system is needed.

A number of frameworks have been proposed in literature to address the issue of dispatching tasks to a computer system (e.g., a cluster or a grid) where they are usually received by batch

schedulers. However, their dispatching time can be high [15] because they usually support rich functionalities such as multiple queues, flexible dispatch policies and accounting. Lightweight approaches have also been proposed, for instance the Falkon framework [15], which reduces the task dispatch time by eliminating the support for some of the features. However, in a relatively small size simulation with almost no dependencies between the tasks, the use of such frameworks, aimed at large scale scientific simulations, provides many more features than the ones effectively needed. For these reasons, we designed and implemented our software framework, named Cloud Simulation System (CSS), with the aim to create a very lightweight support for the execution of our simulations. The main design criteria were to be able to automate the execution in the cloud of the simulations of the type considered in this work, and to automatically take care of all the aspects of configuration of the cloud, e.g., starting and terminating instances, uploading the initial data for each instance and downloading the results. Note that these aspects must be adapted to the specific cloud technology used regardless of which framework is employed; therefore, even if more complex frameworks were used, the time reduction in setting up the framework would have been limited, also considering the time needed to learn and adapt the features of a new framework for our aims.

Figure 1 shows the general architecture of the CSS.

Figure 1: General architecture of the proposed cloud simulation software framework.

The software has been developed in the Java language in order to be portable on different platforms. First, an offline step is needed, that is, the preparation of a virtual machine image, named AMI in the Amazon terminology, containing the tools needed for the simulation and a few parameterized commands, usually scripts, that can both run a set of simulations (controlled by an input file) and save the simulation results to a storage system, for instance the S3 provided by Amazon.
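As an illustration of how per-instance input files could be prepared, the following Python sketch enumerates independent simulations and deals them to a given number of instances. It is only a sketch under stated assumptions: the line format, the sequence names, the parameter values and the `list_set_N.txt` naming are hypothetical and are not taken from the CSS implementation.

```python
from itertools import product


def build_task_list(sequences, loss_rates, seeds):
    """Enumerate every independent simulation as one text line of parameters."""
    return [f"{seq} {rate} {seed}"
            for seq, rate, seed in product(sequences, loss_rates, seeds)]


def split_round_robin(tasks, num_instances):
    """Deal tasks to instances in turn so that list sizes differ by at most one."""
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    return [tasks[i::num_instances] for i in range(num_instances)]


# Illustrative values only: two test sequences, two loss rates, five seeds.
tasks = build_task_list(["foreman", "mobile"], [0.01, 0.05], range(5))
per_instance = split_round_robin(tasks, 3)
for index, chunk in enumerate(per_instance, start=1):
    # Hypothetical file name; each instance would receive only its own list.
    print(f"list_set_{index}.txt: {len(chunk)} simulations")
```

A round-robin split keeps the load balanced when all runs have a similar duration; if run times differ widely across parameters, a split weighted by expected run time would be a better fit.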
One of the key components of the architecture is the controller computer, which is initially fed with a simulation description, in XML format, of the activities to carry out, including the specific set of input parameters for each single EC2 instance. The controller automatically performs a

number of activities needed to ensure the successful execution of the set of simulations specified in the XML file. The main activities include:

1. activating new instances;
2. running the simulation software;
3. periodically monitoring the status of the instances to get early warnings in case some software included in the simulation fails or crashes;
4. checking for the end of the simulation;
5. downloading the results from the remote storage systems.

In more detail, the system creates instances using the API provided by the Amazon Java SDK. The software is packaged in a runnable JAR archive and the main options and operations that will be performed are specified in the XML configuration file. Several parameters can be specified, such as the number and type of Amazon instances to use, in which region they will be launched and which commands they will execute at startup. It is also possible to specify a user defined tag to run more than one simulation set at the same time in the cloud. A sample file is shown in Fig. 2. A separate file contains the access credentials to the Amazon AWS platform.

    <?xml version="1.0" encoding="utf-8"?>
    <simul tag="sim1">
      <options>
        <cloudregion>eu_ireland</cloudregion>
        <amiid>ami-12345678</amiid>
        <vmskind>c1.medium</vmskind>
        <cloudkeypair>ec2-key</cloudkeypair>
        <s3cmdloc>/home/ubuntu/s3/s3cmd/s3cmd</s3cmdloc>
        <s3configloc>/home/ubuntu/.s3cfg</s3configloc>
      </options>
      <cloudvm>
        <commandlist datatosave="res*.txt" execloc="/home/ubuntu/simul/h264/">
          <command>./sim.sh list_set_3.txt</command>
        </commandlist>
      </cloudvm>
      ...
    </simul>

Figure 2: Sample simulation description file for the developed cloud simulation software framework.

The monitoring of the simulation is performed through a set of scripts that periodically connect to all the instances involved in the simulation and check their memory and CPU utilization. If any of these metrics shows anomalous values, an automatic email alert is sent to a predefined address, including the details of the instances that are having issues.
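To make the structure of the description file concrete, the sketch below reads a trimmed version of the sample with Python's standard `xml.etree.ElementTree` module. The CSS itself is written in Java, so this is not the authors' parser. The element names (`simul`, `options`, `cloudvm`, `commandlist`, `command`) are assumed to be spelled as in the sample file, and the shape of the returned dictionary is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample description; element names are assumed spellings.
DESCRIPTION = """<?xml version="1.0" encoding="utf-8"?>
<simul tag="sim1">
  <options>
    <cloudregion>eu_ireland</cloudregion>
    <amiid>ami-12345678</amiid>
    <vmskind>c1.medium</vmskind>
  </options>
  <cloudvm>
    <commandlist datatosave="res*.txt" execloc="/home/ubuntu/simul/h264/">
      <command>./sim.sh list_set_3.txt</command>
    </commandlist>
  </cloudvm>
</simul>
"""


def parse_description(xml_text: str) -> dict:
    """Extract the global options and the per-VM command lists."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    options = {child.tag: (child.text or "").strip() for child in root.find("options")}
    vms = []
    for vm in root.findall("cloudvm"):  # one <cloudvm> element per instance to start
        for cmdlist in vm.findall("commandlist"):
            vms.append({
                "execloc": cmdlist.get("execloc"),        # working directory on the VM
                "datatosave": cmdlist.get("datatosave"),  # pattern of result files
                "commands": [(c.text or "").strip() for c in cmdlist.findall("command")],
            })
    return {"tag": root.get("tag"), "options": options, "vms": vms}


config = parse_description(DESCRIPTION)
print(config["tag"], config["options"]["vmskind"], len(config["vms"]))
```

Each entry of `vms` carries what a controller would need for one instance: where to run, which commands to execute and which result files to save.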
When all simulations end, files are downloaded from the Amazon S3 storage system, used by all the instances to save their results. The application developed in this framework automatically detects the end of the simulation and then downloads the resulting data to the local system.

The described framework can efficiently run simulations in the Amazon AWS platform. In order to optimize the cost performance tradeoff, suitable options must be chosen in the configuration file, for instance the most efficient type of EC2 instance for the given simulation. The

next sections will investigate how to configure the developed framework in order to maximize the performance and minimize the cost of running the simulation in the cloud.

5. Performance Analysis of EC2 Instances

The computational power unit used by Amazon is the EC2 compute unit (ECU), defined as the equivalent to the CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. Table 1 summarizes the characteristics of the Amazon EC2 offer in the EU region as of Oct. 2011 [22]. Note that there is a particular type of instance, named micro, whose characteristics cannot be defined as easily as those of the other ones. More details about the micro instance are presented later in this work. The key quantities peculiar to each instance are represented using the following symbols: $\varphi_i$ is the cost/h/ECU for instance type $i$, $\nu_i$ is the number of cores and $E_i$ the number of nominal ECU per core. The meaning of all the symbols used throughout the paper is reported in Table 2.

Table 1: Characteristics of the available EC2 instances and costs in the EU region (Oct. 2011).

Name             ECU/core   # cores   RAM (GB)   I/O perf.   cost/h ($)   cost/h/ECU ($)
                 [$E_i$]    [$\nu_i$]                                     [$\varphi_i$]
std.small        1          1         1.7        Moderate    0.095        0.095
std.large        2          2         7.5        High        0.38         0.095
std.xlarge       2          4         15         High        0.76         0.095
hi-cpu.medium    2.5        2         1.7        Moderate    0.19         0.038
hi-cpu.xlarge    2.5        8         7          High        0.76         0.038
hi-mem.xlarge    3.25       2         17.1       Moderate    0.57         0.088
hi-mem.dxlarge   3.25       4         34.2       High        1.14         0.088
hi-mem.qxlarge   3.25       8         68.4       High        2.28         0.088
micro            up to 2    1         0.613      Low         0.025        -

5.1. Raw Computing Performance

First, the CPU performance of the different instances has been assessed by using a simple CPU-intensive program, i.e., computing the MD5 hash of a randomly-generated 100 MB file. The experiment is repeated 100 times to cache the file into the RAM so that the performance of the storage system does not affect the measurements. Results are reported in Table 3.
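A comparable measurement can be sketched in a few lines of Python using the standard `hashlib` module. This is not the program used for the paper's measurements: as a simplification, the random data is kept in a memory buffer instead of a file, and the helper `relative_power`, which scales a measured time against a reference machine whose power is set to 1.0, is a hypothetical name introduced here.

```python
import hashlib
import os
import time


def time_md5(data: bytes, repetitions: int) -> float:
    """Return the wall-clock seconds needed to hash `data` `repetitions` times."""
    start = time.perf_counter()
    for _ in range(repetitions):
        hashlib.md5(data).hexdigest()
    return time.perf_counter() - start


def relative_power(t_reference: float, t_instance: float, p_reference: float = 1.0) -> float:
    """Computing power relative to a reference machine: shorter time means more power."""
    return t_reference / t_instance * p_reference


if __name__ == "__main__":
    # Small demo sizes; the experiment in the text uses 100 MB and 100 repetitions.
    payload = os.urandom(10 * 1024 * 1024)
    elapsed = time_md5(payload, repetitions=5)
    print(f"5 MD5 passes over 10 MB: {elapsed:.3f} s")
    # Assumed example: a reference machine that needs twice the time has half the power.
    print(relative_power(t_reference=2 * elapsed, t_instance=elapsed))
```

Since the buffer already resides in RAM, storage speed does not influence the timing, which is the same effect the repeated runs on a cached file aim for.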
The effective computing power \(P_i^{(1)}\) of instance \(i\) is computed as

\[ P_i^{(1)} = \frac{t_{\mathrm{std.small}}^{(1)}\, P_{\mathrm{std.small}}^{(1)}}{t_i^{(1)}} \qquad (1) \]

where \(t_i^{(1)}\) is the time, as seen by the user, needed by the instance to compute the MD5 value 100 times using a single process. As a reference, the \(P^{(1)}\) value for std.small has been set equal to 1.00, so that values can be directly compared with the nominal speed in ECU as declared by Amazon AWS. Superscript (1) indicates that the performance is achieved using only one core.

Note also that the micro instance differs from the others since it is not suitable for a continuous computing load. It provides a good alternative for instances that are idle most of the time but sometimes must deal with short bursts of load, such as low-traffic web servers. Moreover, there are no guarantees that a minimum amount of processing power will be available at any time even when the instance is running, making it a sort of best-effort offer. For the micro instance, Table 3 shows the ECU/core declared by Amazon, while time and effective computational power are averaged over several cycles of burst and slow-down periods. For completeness, note that when the micro instance performed at maximum computational speed, it reached \(P^{(1)} = 3.38\) in our experiments, while it provided only \(P^{(1)} = 0.09\) in the slow phase. Due to this behavior, this type of instance will not be considered further in this work.

Table 2: Symbols used throughout the paper.

| Symbol | Meaning |
|---|---|
| \(i\) | Instance of type \(i\) |
| \(E_i\) | ECU/core for instance type \(i\) |
| \(\nu_i\) | Number of cores of instance type \(i\) |
| \(\varphi_i\) | Cost ($)/h/ECU of instance type \(i\) |
| \(P_i^{(N)}\) | Effective computing power of instance type \(i\) using \(N\) cores |
| \(p_i^{(N)}\) | Effective computing power parameter, i.e., \(P_i^{(N)}\) normalized by \(E_i\) |
| \(C_i^{(N)}\) | Cost ($) of one hour of instance type \(i\) considering its \(p_i^{(N)}\) |
| \(n\) | Number of processes running in parallel |
| \(S\) | Computing energy (ECU h) needed to run a simulation |
| \(K\) | Number of instances used to run the simulation in the cloud |
| \(T_S\) | Time (h) required to run a simulation, requiring energy \(S\), in the cloud |
| \(C_S\) | Cost ($) of running a simulation, requiring energy \(S\), in the cloud |
| \(\eta\) | Instance usage efficiency in the cloud |
| \(T_D\) | Time (h) spent to study and modify the algorithm in each cycle |
| \(T_C\) | Total time (h) of one cycle |
| \(n_{PC}\) | Number of PCs needed to achieve the same time performance of the cloud |
| \(L_{PC}\) | Lifetime (h) of a PC |
| \(C_{PC}\) | Cost ($) of a PC including running costs for \(L_{PC}\) |
| \(N_{cy}\) | Number of cycles used for the development of a given technique |
| \(C_{cloud}\) | Total cost ($) of the cloud solution |
| \(C_{nPC}\) | Total cost ($) of the solution based on \(n\) PCs |
| \(f\) | Time increase factor due to operators working during daytime only |
| \(C_{ratio}\) | Ratio of the cost of the cloud solution to the cost of the nPC-based solution |

When multiple cores are available on a given instance, processes can be run in parallel. Figure 3 shows the effective computing power of the various instances while performing the same CPU-intensive task (MD5 hash) using a different number of processes in parallel. The effective computing power parameter \(p_i^{(N)}\) is given by Eq. (2), where the time interval \(t_i^{(N)}\) refers to the time elapsed between the start of the first process and the end of the last running process:

\[ p_i^{(N)} = \frac{t_{\mathrm{std.small}}^{(1)}\, P_{\mathrm{std.small}}^{(1)}}{t_i^{(N)}\, E_i} \qquad (2) \]

Note that, differently from Eq. (1), the \(p\) value is normalized by the nominal ECU/core value \(E_i\), so that values can be easily compared with one another. As expected, performance tends to decrease slightly when the number of processes increases. This result confirms that each instance can be loaded with CPU-intensive parallel processes up to the number of cores without incurring an unreasonable performance reduction.
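As a quick check of Eq. (1) and Eq. (2), the following Python sketch recomputes the single-core effective computing power from the MD5 times listed in Table 3 and normalizes it by the nominal ECU/core of Table 1. The function and variable names are illustrative and are not part of the framework; for a single process, \(t_i^{(N)}\) is assumed to coincide with \(t_i^{(1)}\).

```python
# MD5 execution times in seconds (Table 3) and nominal ECU/core (Table 1).
MD5_TIME_S = {
    "std.small": 100,
    "std.large": 36,
    "std.xlarge": 33,
    "h-cpu.medium": 42,
    "h-cpu.xlarge": 32,
}
NOMINAL_ECU_PER_CORE = {
    "std.small": 1.0,
    "std.large": 2.0,
    "std.xlarge": 2.0,
    "h-cpu.medium": 2.5,
    "h-cpu.xlarge": 2.5,
}


def effective_power(t_instance, t_ref=100.0, p_ref=1.0):
    """Eq. (1): effective computing power, with std.small as the reference."""
    return t_ref * p_ref / t_instance


def effective_power_parameter(t_instance, ecu_per_core, t_ref=100.0, p_ref=1.0):
    """Eq. (2): effective computing power normalized by the nominal ECU/core."""
    return t_ref * p_ref / (t_instance * ecu_per_core)


for name, t in MD5_TIME_S.items():
    P = effective_power(t)
    p = effective_power_parameter(t, NOMINAL_ECU_PER_CORE[name])
    print(f"{name:13s} P = {P:.3f}   p = {p:.3f}")
```

The computed \(P\) values agree with the last column of Table 3 to within rounding; values of \(p\) above 1 indicate an instance that outperforms its nominal ECU rating on this task.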

Figure 3: Effective computing power of parallel CPU-intensive tasks on different instances (normalized by the nominal ECU/core). [Plot: relative performance versus number of processes, from 1 to 8, for the std.small, std.large, std.xlarge, h-cpu.medium and h-cpu.xlarge instances.]

Figure 4 shows the cost, per process, for one hour of each instance type for each effective computing power unit \(p\) as previously defined. The cost is defined as

\[ C_i^{(N)} = \frac{\nu_i E_i \varphi_i}{n\, p_i^{(N)}} \qquad (3) \]

where \(\nu_i E_i \varphi_i\) is the cost of one hour of instance \(i\) and \(n\) is the number of processes running in parallel. Since the cost of the instance is constant regardless of the number of processes running on it, the cost per process clearly decreases when the number of processes run in parallel on the instance increases, up to the number of available cores. Note that each single point in Fig. 4 considers the effective computing power units \(p\) that can be achieved with that specific number of processes. Considering the MD5 hash task, the best performance-cost ratio is provided by the h-cpu.xlarge instance type, followed by std.xlarge, std.large and h-cpu.medium.

Table 3: Experimental measurements of CPU computing performance of EC2 instances, using only one core. Time refers to the MD5 task.

| Name | Nominal ECU/core | Time (s) | Effective comp. power \(P\) (std.small = 1.00) |
|---|---|---|---|
| std.small | 1 | 100 | 1.00 |
| std.large | 2 | 36 | 2.78 |
| std.xlarge | 2 | 33 | 3.03 |
| h-cpu.medium | 2.5 | 42 | 2.38 |
| h-cpu.xlarge | 2.5 | 32 | 3.13 |
| micro | up to 2 | 152 | 0.66 |

Figure 4: Cost per \(p\) units, for each process, depending on the instance type. [Plot: cost per \(p\) units, per process ($/h), versus number of processes, from 1 to 8, for the std.small, std.large, std.xlarge, h-cpu.medium and h-cpu.xlarge instances.]

5.2. I/O Performance

In addition to CPU-intensive tasks, the I/O performance of the various instances has also been measured. This is important since simulations often require I/O activity, especially when dealing with uncompressed video sequences, as is generally the case with video quality simulations. Table 4 reports the performance of the I/O subsystem, as measured by the iozone tool [23], for all the instances considered in this work. The tool has been run with a record size of 32 KBytes and a file size larger than the maximum amount of available memory, to reduce as much as possible the influence of the operating system disk cache. For convenience, the column named I/O classification reports the Amazon classification of the I/O performance, where M means Moderate and H means High. The performance is mostly aligned with the classification, with higher values for the std instance types, especially for write operations, compared to the h-cpu instances. However, note that, as stated by Amazon [22], due to the shared nature of the I/O subsystem across multiple instances, performance is highly variable depending on the time at which the instance is run.

Table 4: Experimental measurements of I/O performance of the various instances, normalized values where h-cpu.medium = 1.00 (the last row shows absolute values in KBytes/s, using record size = 32 KBytes).

| Instance type | I/O classif. | Read | Write | Random read | Random write |
|---|---|---|---|---|---|
| std.small | M | 1.41 | 2.31 | 1.11 | 1.19 |
| std.large | H | 1.52 | 2.32 | 1.64 | 1.58 |
| std.xlarge | H | 1.46 | 2.10 | 1.91 | 1.52 |
| h-cpu.medium | M | 1.00 | 1.00 | 1.00 | 1.00 |
| h-cpu.xlarge | H | 1.47 | 1.29 | 1.50 | 0.92 |
| h-cpu.medium (absolute, KBytes/s) | M | 71317 | 14454 | 4545 | 8511 |
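Since Table 4 reports normalized values, the absolute throughput of any instance can be recovered by multiplying each entry by the corresponding h-cpu.medium value in the last row. The short Python sketch below illustrates the conversion; the dictionary layout and function name are illustrative choices and not part of the measurement tool.

```python
# Absolute throughput of the h-cpu.medium reference in KBytes/s (last row of Table 4).
REFERENCE_KBPS = {
    "read": 71317,
    "write": 14454,
    "random_read": 4545,
    "random_write": 8511,
}

# Normalized measurements from Table 4 (h-cpu.medium = 1.00).
NORMALIZED = {
    "std.small":    {"read": 1.41, "write": 2.31, "random_read": 1.11, "random_write": 1.19},
    "std.large":    {"read": 1.52, "write": 2.32, "random_read": 1.64, "random_write": 1.58},
    "std.xlarge":   {"read": 1.46, "write": 2.10, "random_read": 1.91, "random_write": 1.52},
    "h-cpu.medium": {"read": 1.00, "write": 1.00, "random_read": 1.00, "random_write": 1.00},
    "h-cpu.xlarge": {"read": 1.47, "write": 1.29, "random_read": 1.50, "random_write": 0.92},
}


def absolute_throughput(instance, operation):
    """Convert a normalized Table 4 entry back to KBytes/s."""
    return NORMALIZED[instance][operation] * REFERENCE_KBPS[operation]


for instance in NORMALIZED:
    write_mb_s = absolute_throughput(instance, "write") / 1024
    print(f"{instance:13s} sequential write: {write_mb_s:.1f} MBytes/s")
```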

Table 5: Parameters of some representative communication simulations.

| Parameter | Typical values | Sim1 | Sim2 | Sim3 |
|---|---|---|---|---|
| Resolution | 352×288 to 1920×1080 | 352×288 | 704×576 | 1280×720 |
| Pixels per frame | 100K-2000K | 101,376 | 405,504 | 921,600 |
| Sequence length (frames) | 180-300 | 300 | 300 | 300 |
| Uncompressed video sequence (MB) | 26-890 | 43.5 | 174 | 395.5 |
| Input files, scripts and executables (MB) | 0.5-1 | 0.61 | 0.60 | 0.64 |
| Channel realizations | 30-50 | 50 | 40 | 30 |
| # of values for channel parameter (e.g., SNR) | 3-5 | 5 | 4 | 4 |
| # of values for algorithm parameter (e.g., maximum packet size) | 3-5 | 5 | 4 | 4 |
| # of sequences | 4-5 | 4 | 4 | 4 |
| Algorithm for quality measure | PSNR, SSIM, PVQM | PSNR | SSIM | PVQM |
| Total size of results (compressed) (MB) | 200-600 | 595 | 305 | 228 |

6. Case Study: Sample Simulations

6.1. Simulation Characteristics

In order to assess the performance in realistic cases, we describe the requirements of some multimedia communication simulations typically used in research activities. The most important part of the dataset in multimedia experiments is the set of video sequences, typically stored in uncompressed format. Generally, sequence length ranges from 6 to 10 s at 30 frames per second, i.e., 180 to 300 frames. The corresponding size ranges from about 26 MB up to 890 MB depending on video resolution, from CIF (352×288) to FullHD (1920×1080). Four or five video sequences are generally enough to represent a range of video contents suitable to draw reliable conclusions about the presented techniques. Since results are computed as the average performance over different channel realizations, from 30 to 50 different random channel realizations are needed to achieve statistically significant results. Moreover, often a couple of parameters can be varied, e.g., one in the channel model and one in the algorithms to be tested, thus three (the minimum to plot a curve) to five values have to be tested for each parameter.
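The sequence sizes and the number of simulation runs implied by Table 5 can be reproduced with a few lines of Python. This is only a sketch under two assumptions that are not stated in the text but are consistent with the reported sizes: sequences are stored as 8-bit YUV 4:2:0 (1.5 bytes per pixel), and 1 MB is taken as 2^20 bytes. The run count is assumed to be the plain product of the parameter combinations; all names are illustrative.

```python
BYTES_PER_PIXEL = 1.5   # assumption: 8-bit YUV 4:2:0 sampling
MB = 2 ** 20            # assumption: 1 MB = 2^20 bytes


def sequence_size_mb(width, height, frames, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size of one uncompressed video sequence in MB."""
    return width * height * bytes_per_pixel * frames / MB


def number_of_runs(realizations, channel_values, algorithm_values, sequences):
    """Number of transmission simulations, one per parameter combination."""
    return realizations * channel_values * algorithm_values * sequences


# Parameters of the three sample simulations, taken from Table 5.
SIMULATIONS = {
    "Sim1": dict(width=352, height=288, frames=300, realizations=50,
                 channel_values=5, algorithm_values=5, sequences=4),
    "Sim2": dict(width=704, height=576, frames=300, realizations=40,
                 channel_values=4, algorithm_values=4, sequences=4),
    "Sim3": dict(width=1280, height=720, frames=300, realizations=30,
                 channel_values=4, algorithm_values=4, sequences=4),
}

for name, s in SIMULATIONS.items():
    size = sequence_size_mb(s["width"], s["height"], s["frames"])
    runs = number_of_runs(s["realizations"], s["channel_values"],
                          s["algorithm_values"], s["sequences"])
    print(f"{name}: {size:.1f} MB per sequence, {runs} runs")
```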
Once each transmission simulation has been performed, the video decoder is run, e.g., the H.264 standard test model software [24] as done in this work, thus obtaining a distorted video sequence whose size in bytes is the same as that of the original uncompressed video sequence. Finally, performance can be measured using various video quality metrics, ranging from the simple mean squared error (MSE), which can be immediately mapped into PSNR values [25], the most commonly used measure in the literature, to much more complex algorithms that try to account for the characteristics of the human visual system, e.g., SSIM and PVQM [26, 27]. Note that once the previous performance metrics have been computed (typically, a single floating point number for each frame of the decoded sequence), the decoded video sequence can be discarded, therefore the maximum temporary storage occupancy is limited to the size of one video sequence.

Table 5 provides actual values for three representative simulations, respectively a low, moderate and high complexity simulation, that will be used as examples in the remaining part of the paper. Note that once the number of combinations of parameters has been decided, it is also possible to compute an estimate of the size of the results produced by the simulation, as shown by the last row of the table.

Figure 5: Relative time performance for various frame sizes and quality evaluation algorithms. Time is assumed to be equal to one for CIF frame size (about 100 Kpixel), evaluated with PSNR. [Plot: normalized time versus frame size (Kpixel), from 0 to 2000, for PVQM, SSIM and PSNR.]

6.2. Simulation Complexity

The time needed to run Sim1 sequentially on, e.g., an Intel i5 M560 processor at 2.67 GHz with 4 GB RAM is about 29,500 s, i.e., more than 8 hours. Such a long time might significantly slow down the development of transmission optimization techniques, since every time the algorithm is modified, for any reason, simulations should be run again. Therefore, speeding up simulations is clearly worthwhile. Using all the computing power available on the i5 computer requires 17,250 s to run Sim1. By means of a CPU-intensive task such as the MD5 hash described in Section 5.1, it can be seen that the computer performance is equal to about 2.75 ECU per core, i.e., 5.5 ECU in total (assuming the performance achieved by the std.small instance as the reference, equal to 1 ECU). Using the same symbols introduced at the beginning of Section 5, \(\nu_{PC} = 2\) and \(E_{PC} = 2.75\). Thus, we conclude that Sim1 would require about 26.38 hours on a 1 ECU processor, i.e., the computing energy \(S\) needed to run it is 26.38 ECU h.

Considering the nominal computational power stated by Amazon for the various instance types, the total cost would be 2.51 $ when using the std family of instances or 1.00 $ when using the cheaper h-cpu family of instances. The value is obtained by multiplying the computing energy \(S\) by \(\varphi_i\), i.e., the cost/h/ECU shown in Table 1, which is the same for all instances belonging to the same family. Note that the cost obtained in this way is independent of the number of instances used to perform the simulation, since the code can be significantly parallelized as discussed in Section 3.
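The arithmetic behind the computing energy and the nominal cloud cost can be sketched as follows. The function names are illustrative. Note that the rounded figures quoted above (17,250 s, 2 cores, 2.75 ECU per core) give about 26.35 ECU h, slightly below the 26.38 ECU h used in the text, presumably because of rounding in the quoted time or ECU rating.

```python
def computing_energy_ecu_h(run_time_s, cores, ecu_per_core):
    """Computing energy S (ECU h): run time in hours times the total ECU of the machine."""
    return run_time_s / 3600.0 * cores * ecu_per_core


def nominal_cost(energy_ecu_h, cost_per_ecu_hour):
    """Nominal cloud cost ($): S multiplied by the cost/h/ECU of the instance family."""
    return energy_ecu_h * cost_per_ecu_hour


# Energy derived from the rounded measurements quoted for the i5 PC.
S_FROM_MEASUREMENTS = computing_energy_ecu_h(17250, cores=2, ecu_per_core=2.75)

# Value of S reported in the text for Sim1, and cost/h/ECU per family (Table 1).
S_SIM1 = 26.38
COST_PER_ECU_HOUR = {"std": 0.095, "h-cpu": 0.038}

print(f"S from measurements: {S_FROM_MEASUREMENTS:.2f} ECU h")
for family, phi in COST_PER_ECU_HOUR.items():
    print(f"{family:6s} nominal cost of Sim1: {nominal_cost(S_SIM1, phi):.2f} $")
```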
Varying the size of the video frame or the algorithm used to measure the video quality performance of the communication changes the computing energy needed for the simulation. Figure 5 shows that the amount of computation increases linearly with the frame size in pixels. Moreover, the increase of the computational cost of running more complex quality evaluation algorithms is approximately constant (in percentage) when compared with the PSNR algorithm. The SSIM algorithm incurs about a 50% increase with respect to PSNR, while PVQM requires about 180% additional computation time. Table 6 reports the complexity in terms of ECU h of each sample simulation described in Table 5, as well as the time needed to run them on the i5 PC using all CPU cores and the cost of running them in the cloud using the cheapest family of instance types. For a high-complexity simulation such as Sim3, the time required by the PC is more than two days.

Table 6: Computing energy (in brackets the additional energy for compression of final results), execution time and cloud costs on the cheapest family of instances depending on the simulation example.

| Sim. ID | Computing energy (ECU h) | Time on i5 PC (h) | Cloud instance cost ($) | Download of results ($) |
|---|---|---|---|---|
| Sim1 | 26.38 (0.028) | 4.80 | 1.00 | 0.06 |
| Sim2 | 101.27 (0.014) | 18.41 | 3.85 | 0.03 |
| Sim3 | 316.94 (0.011) | 57.63 | 12.04 | 0.02 |

Table 7: I/O performance depending on the simulation example.

| Sim. ID | Upload of sequences (only first time) (s) | Upload script, exe and setup data (s) | Store in S3 (s) | Download results (s) |
|---|---|---|---|---|
| Sim1 | 4.4 | < 1 | 30.0 | 94.4 |
| Sim2 | 17.4 | < 1 | 4.4 | 48.3 |
| Sim3 | 39.6 | < 1 | 1.1 | 36.3 |

6.3. Simulation Setup, I/O and Memory Requirements

As highlighted in Section 4, during the preparation phase of the AMI, sequences could be preloaded in the AMI itself, since it is likely that simulations have to be repeated many times during the development of multimedia optimization techniques, as described in Section 7.2. Storing data in the Amazon cloud (e.g., in the AMI) has a very limited cost, 0.15 $ to store 1 GB for one month (fractions of size and time are charged on an hourly pro rata basis), which would be enough to hold nearly 6 sequences of the type employed in Sim2. Data transfer costs to the cloud are zero, while transferring from the cloud costs 0.15 $ per GB. Transfer speed is not an issue since researchers can typically use high-speed university network connections.
For instance, from the authors' university in Italy the typical transfer speed to and from an active instance in the AWS region in the EU (Ireland) is about 11 MB/s and 6 MB/s, respectively. In our tests, the most critical part has been transferring from an active instance to the S3 storage within the Amazon infrastructure, at the end of simulations when files have to be stored in S3. This could be done at an average of 3.3 MB/s from each instance. Downloading large files from S3 to a local PC in the university can be done at about 6.3 MB/s. Table 7 shows the transfer times needed to load uncompressed sequences, to transfer the scripts, executables and control files needed to run the simulation, and to collect the results, including the time to store data in S3 (for each instance). Note that in the provided examples, for simplicity, results are stored in text files that are then compressed and downloaded. The data size is determined by the number of combinations of parameters, which is the highest in Sim1, thus implying more data to download. However, the time is quite limited (about one minute and a half in the worst case) if compared with the duration of the simulation, and could be further reduced by storing data in a more optimized way, e.g., using a binary format to represent numbers instead of text. The computing energy needed to compress result files has already been accounted for in Table 6.

In our experiments, instance activation times have always been less than about 45 seconds, with typical values around 30 seconds. Therefore they are comparable to, or sometimes less than, the time needed to download the results, and much less than the time needed to run simulations at the most cost-effective tradeoff point, i.e., about one hour, as will be determined in Section 7.

The types of computations involved in multimedia communications are typically CPU intensive. However, it is often necessary to move large amounts of data within the computer system (e.g., reading and writing large files, that is, the video sequences). The operating system disk cache almost always helps in this regard by using a large amount of memory for this purpose, thus effectively keeping most of the data in memory. An alternative approach to ensure that RAM is used to access these files could be to create, for instance, a RAM disk and keep the most critical files there, such as the video sequence currently tested. We experimented with this solution, moving all files to a RAM disk whose size is about 80% of the available memory, and measuring the values shown in Table 8.

Table 8: Improvement of execution time using a RAM disk instead of the default Amazon storage (EBS) for the instance. h-cpu.medium instance type.

| Sim. ID | Time reduction (%) |
|---|---|
| Sim1 | 4.0 |
| Sim2 | 0.9 |
| Sim3 | 0.1 |

Some modest improvements can be achieved in the case of Sim1 (4%) while they are negligible for Sim2 and Sim3. An additional experiment, not shown in the table, which uses the PSNR quality measure and the video frame resolution employed in Sim3, shows a 9.2% time improvement.
We attribute this behavior to the fact that when the PSNR quality measure is used, it is much more important to have fast access to the video sequence files (both the original and the decoded one for the current simulation experiment) since PSNR only performs very simple computations. For more CPU-intensive algorithms such as SSIM and PVQM, faster access to the video files brings almost no improvement. However, even a 9.2% time improvement is not sufficient to justify the use of instances with much more memory, which cost more than twice as much per unit of computational power.

7. Analytical Analysis

7.1. Single Simulation

In order to mathematically characterize the time and cost needed to perform a given simulation, the following notation is introduced. The amount of workload associated with the given simulation is denoted by \(S\), as already mentioned. Let \(K\) be the number of instances used to perform the simulation, and \(i\) be the type of the instances, assuming that only one type is used to run the whole simulation. The other variables, corresponding to the characteristics and costs of each instance type, have already been defined in Section 5. The time \(T_S\) needed to perform a given simulation that requires \(S\) computing energy using \(K\) instances of type \(i\) is given by

\[ T_S(i, K) = \frac{S}{\nu_i E_i K}. \qquad (4) \]

The value is inversely proportional to the number of instances \(K\), i.e., the computation speed can be increased as desired just by increasing the number of instances \(K\), provided that the simulation task can be split into a sufficiently high number of parallel processes. However, note that Amazon charges every partial hour as one hour, therefore the exact cost of running the given simulation is obtained by rounding up the time used on each instance to the next integer hour. The cost is given by

\[ C_S(i, K) = \nu_i E_i \varphi_i K \left\lceil T_S(i, K) \right\rceil = \nu_i E_i \varphi_i K \left\lceil \frac{S}{\nu_i E_i K} \right\rceil \qquad (5) \]

where the function \(\lceil \cdot \rceil\) represents the smallest integer greater than or equal to the argument. We introduce an efficiency value \(\eta\) which represents the ratio of the time interval in which each instance is performing computations to the time interval the instance is paid for, that is,

\[ \eta = \frac{T_S(i, K)}{\left\lceil T_S(i, K) \right\rceil}. \qquad (6) \]

Eq. (6) presents the behavior shown in Figure 6. The function has periodic local maxima (value equal to 1) when \(T_S\) is an integer number of hours, and it can be lower bounded as

\[ \eta > \frac{T_S}{1 + T_S} \qquad (7) \]

implying that efficiency tends to 1 for \(T_S\) values much greater than 1 hour.

Figure 6: Efficiency as a function of the actual time used to perform the simulation. [Plot: efficiency and its lower bound versus time (h), from 0 to 10.]
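Eqs. (4) to (7) translate directly into code. The following Python sketch evaluates them for Sim1 (\(S\) = 26.38 ECU h, Section 6.2) on h-cpu.medium instances, using the nominal values of Table 1; the function names are illustrative and the instance choice is only an example.

```python
import math


def simulation_time(S, cores, ecu_per_core, K):
    """Eq. (4): time (h) to run a simulation of energy S on K instances."""
    return S / (cores * ecu_per_core * K)


def simulation_cost(S, cores, ecu_per_core, phi, K):
    """Eq. (5): cost ($), with every partial hour charged as a full hour."""
    T = simulation_time(S, cores, ecu_per_core, K)
    return cores * ecu_per_core * phi * K * math.ceil(T)


def efficiency(S, cores, ecu_per_core, K):
    """Eq. (6): ratio of the time used to the time paid for."""
    T = simulation_time(S, cores, ecu_per_core, K)
    return T / math.ceil(T)


def efficiency_lower_bound(T):
    """Eq. (7): lower bound of the efficiency for a simulation time T (h)."""
    return T / (1.0 + T)


# Sim1 on h-cpu.medium: 2 cores, 2.5 ECU/core, 0.038 $/h/ECU (Table 1).
S, CORES, ECU, PHI = 26.38, 2, 2.5, 0.038
for K in range(1, 9):
    T = simulation_time(S, CORES, ECU, K)
    C = simulation_cost(S, CORES, ECU, PHI, K)
    eta = efficiency(S, CORES, ECU, K)
    print(f"K={K}: T_S={T:.2f} h  C_S={C:.2f} $  eta={eta:.2f} "
          f"(bound {efficiency_lower_bound(T):.2f})")
```

The cost never drops below \(S\varphi_i\), about 1.00 $ in this example, and approaches it only when the simulation time is close to an integer number of hours.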

Eq. (5) can be rewritten as

\[ C_S(i, K) = \frac{\nu_i E_i K}{S} \left\lceil \frac{S}{\nu_i E_i K} \right\rceil S \varphi_i = S \varphi_i \, \frac{\left\lceil \frac{S}{\nu_i E_i K} \right\rceil}{\frac{S}{\nu_i E_i K}} = S \varphi_i \, \frac{\left\lceil T_S(i, K) \right\rceil}{T_S(i, K)} = \frac{S \varphi_i}{\eta}. \qquad (8) \]

It is clear that there is a lower bound to the cost of a simulation requiring energy \(S\), equal to \(S \varphi_i\), achieved when the efficiency is one. This happens when the simulation time is exactly a multiple of one hour. In all other cases, \(\eta < 1\) and the cost is higher than the minimum value \(S \varphi_i\). Efficiency tends to one when the simulation time increases. Now consider a simulation time shorter than one hour, so that \(\lceil T_S \rceil = 1\). Substituting it in Eq. (5) and using Eq. (4), the cost for the specific case \(T_S < 1\) can be written as

\[ C_S = \frac{S \varphi_i}{T_S}. \qquad (9) \]

The previous equation shows that in such a condition, i.e., maximum simulation speed-up, due to the Amazon pricing policy on partial hours, the cost is inversely proportional to the simulation time.

For the Sim1 test case, the cost that would be needed to complete all the tasks according to the nominal ECU values is shown in Figure 7. Each line represents a different instance type. Each point on the line corresponds to a different number of instances \(K\). For high values of simulation time, curves are approximately flat since the efficiency \(\eta\) is high, therefore the cost is almost constant. When the simulation time is decreased by increasing the number of instances \(K\), curves tend to follow a hyperbolic trend, which is due to the trend of the lower bound on the efficiency. However, efficiency oscillates between the lower bound and one, therefore the hyperbolic trend is often interrupted. When the time is lower than one hour, the trend becomes hyperbolic for all the instance types, as stated by Eq. (9). Moreover, only two hyperbolae are present, which correspond to the two possible \(\varphi_i\) values set by Amazon for the two considered families of instances, that is, std and h-cpu.

Figure 7: Total cost of Sim1 as a function of the time that would be needed to complete all the tasks according to the nominal ECU values. Labels within the graph show the number of instances (\(K\)) corresponding to the point (with the name of the instance type in case of ambiguity). [Plot: total cost ($) versus total time (h), from 0 to 11, for the std.small, std.large, std.xlarge, h-cpu.medium and h-cpu.xlarge instances.]

Note that the previous analysis does not consider the cost of disk I/O operations on the Amazon cloud computing platform. However, in all our experiments, this has always been negligible compared to the instance costs, as detailed in Section 6.3.

The previous analysis assumes that the nominal computing power, in terms of ECU stated by Amazon, allows the running time of a given task to be computed on any type of instance. Actually, this is not true, as already shown in Table 3 for the case of a CPU-intensive task. Actual tests on EC2 have been performed by running, using the developed framework, a small set of each of the simulations included in Sim1, Sim2 and Sim3. The proposed framework has been configured, each time, to use different types of instances, so that we computed the processing power that can be achieved by using every type of instance. For every run, only one type of instance was tested, i.e., we did not mix instances of different types, to make performance comparison easier. Results are reported in Table 9 for Sim1, with reference to the performance provided by the std.small instance type. Note that these results only apply to the considered simulation, since they are influenced by both CPU and I/O activity. For each instance, a number of processes equal to the number of available cores has been used to maximize the exploitation of the resources of each instance.

Table 9: Experimental measurements of computing performance of EC2 instances using as many processes as the number of available cores (Sim1).

| Name | Nominal ECU/core | Number of cores | Effective comp. power \(P\) (std.small = 1.00) |
|---|---|---|---|
| std.small | 1 | 1 | 1.00 |
| std.large | 2 | 2 | 1.54 |
| std.xlarge | 2 | 4 | 1.32 |
| h-cpu.medium | 2.5 | 2 | 2.37 |
| h-cpu.xlarge | 2.5 | 8 | 1.36 |
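One way to read Table 9 together with Table 1 is to convert the nominal price per ECU hour into a price per effectively delivered ECU hour. The Python sketch below does this under the assumption, not stated explicitly in the text, that the effective computing power of Table 9 is a per-core figure directly comparable with the nominal ECU/core; the function name and data layout are illustrative.

```python
# name: (nominal ECU/core, cost/h/ECU in $, effective computing power from Table 9)
INSTANCES = {
    "std.small":    (1.0, 0.095, 1.00),
    "std.large":    (2.0, 0.095, 1.54),
    "std.xlarge":   (2.0, 0.095, 1.32),
    "h-cpu.medium": (2.5, 0.038, 2.37),
    "h-cpu.xlarge": (2.5, 0.038, 1.36),
}


def effective_cost_per_ecu_hour(nominal_ecu, phi, effective_power):
    """Cost ($) of one effectively delivered ECU hour.

    Assumes effective_power is measured per core, so that the nominal
    price phi is scaled by the ratio nominal / effective power.
    """
    return phi * nominal_ecu / effective_power


ranking = sorted(INSTANCES, key=lambda name: effective_cost_per_ecu_hour(*INSTANCES[name]))
for name in ranking:
    cost = effective_cost_per_ecu_hour(*INSTANCES[name])
    print(f"{name:13s} {cost:.3f} $ per effective ECU h")
```

Under this reading, the instance with the lowest effective cost for Sim1 is h-cpu.medium, whose measured performance is closest to its nominal rating.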
It is clear that the best performance is provided by the h-cpu.medium instance type. We attribute this behavior to the low number of virtual cores, which seems to perform better in the EC2 infrastructure, and to the fact that the h-cpu family of instances is probably better suited for mainly CPU-bound tasks such as the ones of our simulations. To get a clear overview of the actual performance that can be achieved through the Amazon infrastructure, Figure 8 shows the actual times and costs that can be achieved by using our proposed framework to run all the tasks included in Sim1, with various tradeoffs between time and cost, depending on the number of instances and the instance types. A comparison with Figure 7 shows the same general trend, but there exist some discrepancies with respect to the theoretical behavior expected from the nominal ECU values. From Figure 8 it is clear that the most convenient instance type, in terms of both time and cost, for this particular type of simulation is h-cpu.medium. The same applies to Sim2 and Sim3. As a final remark, note that Figure 8 does not show the point corresponding to running the simulations on the dedicated computer, since the price would be more than two orders of magnitude higher than the one shown for the same set of simulations in the cloud, while the time is fixed at 17,250 s, that is, nearly 5 hours. The next section includes a comparison of the time and cost, in a realistic usage case, for the development of a new transmission algorithm based on simulation results obtained by means of the dedicated computer or the cloud system.

Figure 8: Total cost of Sim1 as a function of the time needed to complete all the tasks in actual simulations. Different points on the same line correspond to different numbers of deployed instances \(K\). Labels within the graph show the \(K\) value corresponding to the point. [Plot: total cost ($) versus total time (h), from 0 to 11, for the std.small, std.large, std.xlarge, h-cpu.medium and h-cpu.xlarge instances.]

7.2. Multiple Simulations

This section aims at quantifying the costs and speed-up of the cloud computing solution with respect to the dedicated computer solution when simulations are part of an actual research activity. In such a scenario, simulation experiments such as Sim1, Sim2 and Sim3 are run many times to investigate, improve and refine the performance of the algorithms. In order to investigate the cost and time required by using a cloud simulation system rather than a PC, we assume that the research activity that leads to the development of a new algorithm is composed of a number of consecutive simulation cycles. A simulation cycle is the basic unit of the development process. One simulation cycle includes the time needed for investigation and a few modifications of the algorithm as well as the time needed to run the simulations once, i.e.,

\[ T_C = T_D + T_S \qquad (10) \]

where \(T_C\) is the duration of the simulation cycle, \(T_D\) is the time spent to study and modify the algorithm, and \(T_S\) is the time spent to perform the simulation, either using the cloud computing or the dedicated computer solution. Figure 9 illustrates the situation. Moreover, in order to better model reality, we consider that operators (e.g., researchers) will work during the daytime only. Therefore, if \(T_D\) implies that the operator's work cannot be completed within the day, the remaining part of the work will be carried out at the beginning of the next day. When the work is completed, a new simulation cycle can start.
Clearly, if simulations extend past the end of the working day, they can continue, since computers can always work at night. Figure 10 illustrates the situation. In the following, we assume a working day equal to 11 hours and \(T_D\) values ranging from one hour up to the duration of the working day.
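The cycle model described above can be illustrated with a small Python sketch. It is a simplified reading of the text and not the authors' exact model: the clock is assumed to start at the beginning of a working day, the operator is assumed to work only during the first 11 hours of each 24-hour day, simulations start as soon as the operator's work ends and run around the clock, and the next cycle begins as soon as the simulation ends (or the next morning if that happens outside working hours). All names are illustrative; time is kept in integer minutes to avoid floating-point edge cases.

```python
MIN_PER_DAY = 24 * 60


def operator_work(t, work_min, workday_min):
    """Advance clock t (minutes) by work_min minutes of operator activity.

    The operator can only work during the first workday_min minutes of each day;
    unfinished work is resumed at the beginning of the next day.
    """
    remaining = work_min
    while remaining > 0:
        time_of_day = t % MIN_PER_DAY
        if time_of_day >= workday_min:
            t += MIN_PER_DAY - time_of_day  # wait until the next morning
            continue
        step = min(remaining, workday_min - time_of_day)
        t += step
        remaining -= step
    return t


def development_time_h(n_cycles, t_d_h, t_s_h, workday_h=11):
    """Elapsed time (h) of n_cycles cycles, each made of T_D of work plus T_S of simulation."""
    t_d, t_s, workday = (round(x * 60) for x in (t_d_h, t_s_h, workday_h))
    t = 0
    for _ in range(n_cycles):
        t = operator_work(t, t_d, workday)  # study and modify the algorithm (daytime only)
        t += t_s                            # simulations also run at night
    return t / 60


# Two cycles with T_D = 6 h: Sim1 on the PC (T_S = 4.8 h, Table 6) versus a
# cloud configuration completing the simulation in about one hour.
print(development_time_h(2, 6, 4.8), development_time_h(2, 6, 1))
```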