Ecotopia: An Ecological Framework for Change Management in Distributed Systems

Tudor Dumitraş 1, Daniela Roşu 2, Asit Dan 2, and Priya Narasimhan 1

1 ECE Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
2 IBM T.J. Watson Research Center, Hawthorne, NY 10532, USA
tudor@cmu.edu, drosu@us.ibm.com, asit@us.ibm.com, priya@cs.cmu.edu

Abstract. Dynamic change management in an autonomic, service-oriented infrastructure is likely to disrupt the critical services delivered by the infrastructure. Furthermore, change management must accommodate complex real-world systems, where dependability and performance objectives are managed across multiple distributed service components and have specific criticality/value models. In this paper, we present Ecotopia, a framework for change management in complex service-oriented architectures (SOA) that is ecological in its intent: it schedules change operations with the goal of minimizing the service-delivery disruptions by accounting for their impact on the SOA environment. The change-planning functionality of Ecotopia is split between multiple objective advisors and a system-level change-orchestrator component. The objective advisors assess the change impact on service delivery by estimating the expected values of the Key Performance Indicators (KPIs), during and after change. The orchestrator uses the KPI estimations to assess the per-objective and overall business-value changes over a long time horizon and to identify the scheduling plan that maximizes the overall business value. Ecotopia handles both external change requests, like software upgrades, and internal change requests, like fault-recovery actions. We evaluate the Ecotopia framework using two realistic change-management scenarios in distributed enterprise systems.

Keywords: Dynamic Change Management, Service Orchestration, Fault-Tolerant Architecture, Performability, Autonomic Computing.

1 Introduction

Enterprises demand highly available online systems and satisfactory service levels (e.g., average response time) in the face of change.
The kinds of changes that can occur are diverse, and can include recovery actions in response to failures, or upgrades due to new versions of software that become available. Current change-management strategies, for the most part, tend to execute a change request as soon as possible (e.g., as soon as a fault is detected or an upgrade is requested), rather than looking for the best time to do so. The downtime (or the perceived lack of responsiveness/availability) due to change management can disrupt the performance expectations of services and have an adverse effect on business.

R. de Lemos et al. (Eds.): Architecting Dependable Systems IV, LNCS 4615, pp. 262-286, 2007. Springer-Verlag Berlin Heidelberg 2007

[Figure 1: external requests for HW and SW upgrades and internal system-management events (e.g., faults, expected workload changes) generate a list of change operations; together with the enterprise SLAs (e.g., response time, availability, recovery time), the change planner uses them to generate a timed change schedule.]

Fig. 1. Dynamic change management is likely to disrupt the critical services running in the IT infrastructure. Ecotopia handles changes based on both external requests (e.g., software upgrades) and events detected internally by the autonomic management infrastructure (e.g., faults) while taking into account their impact on the service-level agreements. The output is a timed schedule that seeks to wait for the most opportune time to apply each change operation and to maximize the enterprise business value.

Industry analysts indicate that "unmanaged change is one of the leading causes of downtime or missed service-level agreements (SLAs)." [1] Gartner Group states that to address the 80 percent of unplanned downtime caused by "people failures," enterprises should invest in improving their change and problem management processes (to reduce downtime caused by application failures) and in automation tools, such as job scheduling and event management (to reduce downtime caused by operator errors). [2] Thus, we hypothesize that it is more appropriate to seek the most opportune time to execute the change operations in a distributed service-oriented infrastructure, based on the change's impact on the service-level objectives (e.g., response time, availability, and recovery time). Such an impact-sensitive change-management strategy aims to respect the overall performance and dependability guarantees of the running services, while still allowing the system to incorporate changes of various kinds.

Fig. 1 illustrates the main elements of the change-planning problem. In typical IT infrastructures, there are multiple kinds of change operations, originating from various sources.
Some changes are planned in advance (e.g., deploying new applications, upgrading obsolete software, increasing the system capacity), and are derived from an external request for change (RFC). In other cases, changes are due to firefighting (i.e., mitigating the negative effects of unplanned situations), and are triggered by internal system-management events, e.g., faults or load surges. Change requests are characterized by a set of (partially) ordered change operations and by change objectives such as the deadline for implementing the change. The change-operation planner must produce a timed change schedule for executing the changes and, in the process, must consider both the impact of the changes on all the relevant quality-of-service requirements, as expressed by service-level objectives (SLOs), and the objectives of each change operation.

An SLO defines bounds and targets for a level-of-service metric (e.g., response time, recovery time, availability), called a Key Performance Indicator (KPI). An SLO also has a specific business value metric (e.g., the penalties associated with a missed change deadline or with degraded performance) for gauging the utility of fulfilling the objective [3]. The change schedule must maximize the aggregated business value associated with all of the enterprise's SLOs. This optimization must span a long time horizon, to account for both transient effects that might occur during the change execution and permanent effects that might persist after the change has been finalized.

The change planner must be ecological in nature, i.e., it must assess the impact of the change on the environment and its SLOs by considering a number of factors: the inter-dependencies among various system components, the available prior knowledge of workload fluctuations or anticipated load surges during prime time, as well as the degree of resource sharing across heterogeneous, off-the-shelf components that sometimes span independent administrative domains. In these environments, the high-level service objectives translate into component-level objectives that can be managed by component-specific configuration managers. For example, a workload manager prioritizes and routes the service requests by monitoring the response-time objectives, while a dependability manager primes backup nodes in anticipation of failures and performs recovery by monitoring the availability objectives. These managers use extensive, and sometimes proprietary, domain knowledge (e.g., workload characteristics, resource-utilization models), and can perform sophisticated request classification, prioritization, monitoring and request routing [4].

As a result, we believe that the complexity and the distributed nature of objective management in real-world systems make it infeasible for a fully centralized change-operation planner to directly assess the impact of change operations on each service KPI. Rather, the impact on service KPIs should be estimated by the component-specific managers that control these services.
However, component-specific managers might not be able to directly assess the SLO business values necessary for estimating the overall change impact, either because they do not directly implement the enterprise SLO models or because the service spans multiple managers and administrative domains.

Building on this principle, we propose Ecotopia, a change-management framework that decouples the impact assessment (handled by multiple objective advisors, e.g., performance and dependability advisors) from the change-operation scheduling (handled by a change orchestrator). The orchestrator builds the change-operation schedule and estimates its business value impact based on the service KPIs predicted by the objective advisors. The advisors are software components that incorporate the domain knowledge to answer "what-if" questions about service KPIs (such as performance and availability forecasts), given a description of the change operations and the timing properties associated with their execution. The orchestrator leverages the advisors' predictions to compute the per-objective and the aggregate business value, and to converge towards an optimal change-operation schedule through an iterative refinement process. The objective advisors themselves can be composite, third-party services.

The novel characteristics of the Ecotopia framework for orchestrating change-management operations are:

• Rich what-if interaction model that enables the use of fine-grained objective-advisor knowledge for an effective change scheduling decision. Our what-if model includes:
    • Timeline of prediction points: the advisors inform the orchestrator of the expected workload changes during the scheduling timeline. The orchestrator uses these guidelines to bootstrap the scheduling algorithms.
    • Proactive actions: the advisors can inform the orchestrator about specific actions that may improve the impact on KPIs during related change operations. The orchestrator can include these operations in the final schedule if they result in an improved overall business value.
• Integrated management of both internal (e.g., faults, workload changes) and external (e.g., upgrades, capacity increases) changes. This approach is necessary because both types of changes affect a common pool of resources and services. Existing solutions [5, 6] assume different decision makers for the two types of changes.
• Complex business value functions for SLOs and change-request deadlines that can change along with the underlying enterprise service models, enabled by compliance with the WS-Agreement standard [3]. Existing solutions support only priority-based models [5] or embedded, hard-coded utility functions [4, 7, 8].
• Optimization based on the long-term impact of change on performance and dependability objectives, accounting for both the time during and after execution of the change. Existing solutions consider only one of the two impact components (e.g., [7] considers the impact during change execution, [5, 9] consider the impact after the change).

In Section 2 we compare Ecotopia with the state of the art in impact-aware change management. Section 3 describes the design of the Ecotopia framework and Section 4 describes the current implementation. Section 5 presents two case studies of change management that we use to validate our architecture. Section 6 discusses the applicability of our ecological approach for realistic systems and outlines directions for future work.
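To make the what-if model listed above more concrete, the following Python sketch shows one way an advisor reply with a timeline of prediction points and suggested proactive actions could be represented, and how a business value could be accumulated over the decision horizon, covering both the period during the change and the period after it. The class names, the piecewise-constant KPI assumption, and the penalty-per-hour value function are illustrative assumptions; the framework itself expresses business value through WS-Agreement, which is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PredictionPoint:
    """Hypothetical prediction point: KPI values that hold from `time` onward."""
    time: float                 # hours from the start of the decision horizon
    kpis: Dict[str, float]      # KPI name -> predicted value


@dataclass
class WhatIfReply:
    """Hypothetical advisor reply: prediction points plus suggested actions."""
    points: List[PredictionPoint]
    proactive_actions: List[str] = field(default_factory=list)


# Assumed value model: business value accrued per hour as a function of a KPI.
ValueRate = Callable[[float], float]


def business_value(reply: WhatIfReply,
                   value_rates: Dict[str, ValueRate],
                   horizon_end: float) -> float:
    """Sum value over the horizon, treating KPIs as constant between points."""
    points = sorted(reply.points, key=lambda p: p.time)
    total = 0.0
    for i, point in enumerate(points):
        end = points[i + 1].time if i + 1 < len(points) else horizon_end
        duration = max(0.0, min(end, horizon_end) - point.time)
        for kpi_name, rate in value_rates.items():
            total += rate(point.kpis[kpi_name]) * duration
    return total


def response_time_rate(seconds: float) -> float:
    """Assumed SLO: no penalty up to 2 s, 100 units per hour above it."""
    return 0.0 if seconds <= 2.0 else -100.0


reply = WhatIfReply(
    points=[
        PredictionPoint(0.0, {"response_time": 1.0}),   # before the change
        PredictionPoint(2.0, {"response_time": 3.0}),   # during the change
        PredictionPoint(3.0, {"response_time": 1.5}),   # after the change
    ],
    proactive_actions=["checkpoint database"],
)
print(business_value(reply, {"response_time": response_time_rate}, 24.0))
```

In this toy reply the only penalized interval is the one hour during which the change is executing, so the printed value is -100.0. An orchestrator could compare such totals across tentative schedules, with and without the suggested proactive actions.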
2 Background

In their seminal paper, Segal and Frieder [10] identify a set of general requirements for any dynamic updating system: preserving program correctness (during and after the update), minimizing human intervention, supporting program restructuring and low-level program changes (e.g., both implementations and interfaces), supporting distributed programs (communicating across mutually distrustful administrative domains), not requiring special-purpose hardware and not constraining the language and environment. Their survey illustrates that in general, research has focused on mechanisms for implementing change at different levels of granularity (e.g., replacing components, objects, procedures), rather than on impact assessment and coordination of distributed changes. Kramer and Magee [11] note that faults, as well as live upgrades, might have a disruptive effect on the functionality of a distributed system, and that the techniques to mitigate these problems could be combined in a unified framework. For instance, a change-management system that totally separates the functional application concerns from the configuration management concerns (such as Kramer and Magee's Conic system) can provide a good basis for implementing fault recovery [11]. Conversely, an infrastructure built for fault tolerance can provide a good basis for live upgrades because of the inherent redundancy [12, 13]. For example, a fault-tolerant CORBA system using the interception approach provides all the ingredients needed for dynamic change management of CORBA objects, including an interceptor (i.e., the indirection layer needed when switching to a new version), replication mechanisms (for incrementally upgrading some replicas while others continue to provide service) and state extraction/restoration mechanisms (for maintaining consistency between versions) [12].

In the Ecotopia framework, we also adopt this unifying approach of considering both external (e.g., software upgrades) and internal change requests (e.g., operations needed to mitigate the effects of a fault). Additionally, the goal of our ecological framework is to manage the impact of change management on the SOA environment (the running services and the existing resources). We assess this impact by asking and answering what-if questions about the outcome of the change operations. We assume some advance knowledge of the workload, as a running system has different behavioral profiles depending on the system load and the outcome of the changes will depend on the workload as well. Ecotopia tries to minimize the negative impact on the environment by using the answers to the what-if questions to determine the most opportune time to apply the changes, given the existing resources, the state of the running services and the workload.

2.1 Workload Prediction

Many workloads are characterized by a day-night periodicity [14]: the incoming request load increases during the day, with comparable peak request rates from day to day, and decreases at nighttime to a very low baseline level.
System administrators take advantage of this knowledge to over-provision the system for the highest expected loads [15] and to run maintenance activities (such as change management) during the night. There are also workloads with more complex patterns. The 1998 World Cup workload¹ [16] shows that the incoming load increases suddenly around game times, with lower peaks for the games played over a weekend. This trend is typical for sites dedicated to sporting events; this can be observed on Alexa.com² [17], by comparing statistics for two different sites covering the same event (e.g., f1.com and fi-live.com): even if the peak loads are different, the access patterns are the same. On-line auction sites, such as ebay.com, exhibit similar load surges before the closing time of an auction. Furthermore, recognizable patterns of warnings and notifications that precede system events may facilitate the workload prediction [18, 19].

Ecotopia uses the ability to predict when the system is under high and low load for optimizing across multiple service-level objectives. For instance, an enterprise system may have two objectives: performance, expressed as average response time, and dependability, expressed as the expected recovery time after a system failure. After a fault (which, unlike a failure, does not completely disable the system), Ecotopia relies on knowledge of the workload to schedule the reconfiguration operations when the incoming load is low and avoid the penalties due to downtime during a busy period. Note that we do not assume that flash-crowd events (sudden load surges due to an unexpected increase in the site's popularity) are predictable; however, we show that exploiting irregular but predictable workloads such as the World Cup '98 trace [16] allows Ecotopia to improve the scheduling of change operations when pursuing multiple objectives. Workload prediction is an optional part of the framework; Ecotopia's orchestrator can function with third-party advisors that answer what-if questions without providing workload predictions, e.g., [8].

¹ This is the workload of a website dedicated to the 1998 soccer World Cup in France. With 1.4 billion requests in the server logs, this is the largest web workload ever analyzed.
² Alexa is a tool for comparing statistics on the popularity and workloads of different websites.

2.2 What-if Questions

Existing service-orchestration products [5, 20] perform resource arbitration between node groups by evaluating the impact just after the resource changes are enacted. While allowing the orchestration of distributed services [4], this approach is limited because it ignores the long-term impact of change management (e.g., interaction with expected workload change). The CHAMPS project [7] focuses on scheduling operations to satisfy external RFC deadlines. It develops a complex dependency-tracking framework and formulates the scheduling problem as the optimization of a generic cost function given a set of constraints (representing the impact during change, e.g., due to service unavailability), providing a centralized approach for both scheduling and impact analysis. Our work is based on the observation that centralized impact evaluation is not appropriate for complex enterprise environments. The problem of optimizing business value in a decentralized manner has also been addressed in the context of autonomic management of storage systems.
Hippodrome [9] refines the initial configuration of a storage system through an iterative process, using a performance model to estimate the throughput and capacity of a particular configuration. Like our framework, Hippodrome separates optimization from impact assessment, although the interactions between the two components are more tightly integrated and are based on a proprietary protocol. We submit that for complex systems integrating multi-vendor components we need an open communication protocol, for instance based on Web Services. The K2 middleware [21] goes further in distributing the autonomic management functionality by eliminating the centralized decision-maker and allowing individual allocation pools to manage their own objectives. In K2, distributed decision algorithms determine the goal configuration and the allocation pools start moving in that direction; if conditions change part-way through reconfiguration, the system changes its direction without having to invalidate the previous plan. However, none of these systems considers the evolution in time of the KPIs and the long-term impact of their decisions, both of which are necessary for avoiding system instability and minimizing the overall business impact.

Thereska et al. [8] define a resource advisor predicting the impact of data placement and encoding choices on performance. The advisor has a hierarchical design, based on several what-if modules (for predicting the CPU, network and disk delays and cache hit rates) that can be combined for end-to-end KPI predictions. Although it does not account for the detailed KPI evolution (it does not attempt to predict incoming request rates), the advisor continuously monitors the infrastructure and uses historical data to overprovision the system based on the peak loads observed. The authors report that prediction errors are less than 15% in most cases. This is an example of a third-party objective advisor that could be connected to the Ecotopia framework. Our orchestrator does not need to know the details of the performance models for storage systems; instead, it can use the what-if predictions to perform ecological change management.

2.3 Timing the Application of Change Operations

The idea of waiting for the most opportune time to apply a change is widely accepted with respect to security patches for enterprise infrastructures. Beattie et al. [22] show that there is a sweet spot for the time when security patches should be applied. Patches applied too early, without enough testing in the field, may introduce critical bugs or may conflict with local configurations. Patches applied too late leave the system exposed to security threats for an extended period of time. The authors argue that patching should be delayed until the risk of a security breach outweighs the risk of introducing bugs, and they develop a mathematical model for estimating the optimal time to apply a security patch. Gorbenko et al. [23] tackle the problem of achieving high dependability of composite Web Services undergoing online upgrades of their components. They advocate running multiple versions of a service in parallel and using third-party interception middleware to switch to a new replica when the confidence in its correctness is sufficiently high. The confidence-in-correctness metric is computed by comparing the responses from different versions of a service and using Bayesian inference to reason about future failure rates.
This approach is the closest to our focus on the long-term impact of change operations, except that we use impact assessment across multiple service-level objectives and we use standard metrics, such as business value, for evaluating this impact.

Roşu et al. [24] introduce the approach of evaluating change plans based on actual SLO business values, which are computed by the orchestrator based on the service KPIs provided by objective advisors. Ecotopia extends this approach to a comprehensive what-if protocol appropriate for management of complex change requests. Other change orchestration solutions evaluate change plans in disconnection from the actual SLOs of the enterprise, based on hard-coded utility models embedded in the resource advisors [4, 5, 8]. In [25], the change manager uses the WS-Agreement specification to define business value parameters, whereas the specification of the objective and business value functions is hard-coded in the orchestrator implementation. Neither of these approaches is appropriate for systems in which the objective and value models can evolve in time.

3 Design of an Ecological Change-Management Framework

A primary design goal for a change-management framework that targets distributed, service-oriented infrastructures is to make minimal assumptions about the kinds of knobs that the various software components are prepared to expose to a change-management system for enabling the control of change impact. The key to achieving this goal is the separation of scheduling and impact analysis. In Ecotopia, these tasks are performed by different components, which may come from different providers.

Service orchestration refers to an executable business process that combines multiple services by defining their interactions dynamically, with the goal of aligning the behavior of the composite service with the business objectives [20]. Ecotopia contains an orchestration engine that queries multiple objective advisors for KPI predictions and combines their outputs into a change-operation schedule. The predictions are based on detailed domain knowledge of each system component, but this knowledge is not exposed outside the objective advisors. Instead, the advisors answer simple what-if questions [8] about the impact of concrete change operations on service KPIs, considering the workload and the tentative schedules of these operations. The orchestration is driven by the enterprise SLAs, which define methods for computing the business value [3] that corresponds to the predicted KPI values. The business value reflects the utility of a given change schedule, allowing us to compare schedules and make an ecological choice: considering the impact on the IT environment, we select the change schedule that minimizes the service-delivery disruptions and that maximizes the overall business value.

General assumptions. We assume that KPI predictions can be derived from some knowledge of future incoming loads, either because the workloads exhibit clear trends [14, 16], or because fluctuations are preceded by recognizable patterns of warnings and notifications [18, 19]. Furthermore, we assume that the execution times of all the change operations submitted to the Ecotopia orchestrator can be estimated and that services do not have hard real-time constraints (which is typical of enterprise systems).

3.1 Framework Components

Fig. 2 illustrates the main components and interactions in the Ecotopia framework. The ChangeManager receives high-level RFCs, decomposes them into finer-grained change operations and related dependencies, and forwards them to a centralized component called the orchestrator. The orchestrator receives the list of change operations and their execution constraints and generates a change plan through an iterative process. Distributed components called objective advisors analyze the impact of planned change operations; the orchestrator identifies the relevant advisors by querying the SystemConfigurationDatabase. The objective advisors represent the service managers in the infrastructure and can use manager-specific knowledge to estimate the impact of a change plan on the service KPIs. The orchestrator consumes these estimations and schedules the change operations with the goal of maximizing the overall business value. The interaction between the orchestrator and the advisors is based on the Web Services standard, which facilitates compatibility in a complex system with components built by different providers. The orchestrator sends the final schedule to the ScheduleExecutor, which triggers the change operations at the indicated times. The ChangeManager is analogous to the Task Graph Builder from [7], and the ScheduleExecutor is similar to the TIO Provisioning Manager [5]. In this paper, we focus on the orchestrator, the objective advisors and their interactions, which are novel.

[Figure 2: architecture diagram showing the Change Manager, the objective advisors (e.g., dependability and performance advisors), the System Configuration database, the orchestrator, and the Schedule Executor.]

Fig. 2. Ecotopia's distributed ecological architecture for change management separates the tasks of impact assessment (performed by the objective advisors) and change scheduling (performed by the orchestrator). The orchestrator receives requests for change, queries the objective advisors with what-if questions about the tentative change schedule and uses the answers to refine the schedule with the goal of maximizing business value. The what-if interactions are based on an open protocol that allows the integration of third-party objective advisors.

Objective advisors. The objective advisors (e.g., performance and dependability advisors) exploit the functionality provided by the component-specific configuration managers. The advisors can be hierarchical and may span multiple administrative domains in order to manage end-to-end KPIs (in a similar manner to the resource advisor described in [8]). The Ecotopia advisors estimate the impact of observed, predicted, or scheduled events on a few service KPIs; for instance, we can define a performance advisor that predicts violations of the response-time objectives. The predictions do not depend on the actual enterprise business-value models, which are handled by the orchestrator. The API of the advisors contains two functions, shown in Table 1. GetCurrentKPIs() queries the KPI predictions if changes are not applied and is used to assess the baseline for the change impact.
GetImpactKPIs() retrieves the KPI predictions given a tentative change-operation schedule and is used to assess the impact of the change schedule. These function invocations are synchronous (i.e., the requestor waits to receive the KPI predictions before proceeding). The reply includes the KPI predictions for the entire time horizon of the decision. This might span

multiple timeline points where the service KPIs change due to specific events such as expected workload changes or failures. These timeline points are called prediction points. The advisor reply includes one set of KPI predictions for each prediction point on the decision horizon. The replies can also suggest a set of proactive actions that are expected to improve the KPIs in conjunction with the change operations (e.g., a checkpoint database action might reduce the recovery time). Proactive actions are included in the final change-operation schedule only if they improve the overall business value.

Orchestrator. The orchestrator is a resource broker and a change-operation planner. The orchestrator starts scheduling a group of change operations in two situations (see Table 1): (i) InitiateChange() indicates that a change sequence has been initiated, following an RFC; (ii) InitiateResourceBrokering() indicates that a predicted or observed infrastructure event (e.g., a fault, a workload change) mandates a resource reassignment. All of these invocations on the orchestrator are asynchronous (i.e., a response containing the schedule is not provided immediately). During the scheduling process, the orchestrator communicates with the objective advisors, asking what-if questions in order to assess the impact of tentative change-operation schedules on the future service KPI values.

Table 1. APIs of the Ecotopia framework components

Orchestrator
  InitiateChange(): request for scheduling a group of change operations derived from an RFC.
  InitiateResourceBrokering(): request for reallocation of resources (e.g., nodes) to mitigate the impact of an event detected by the system management infrastructure (e.g., a hardware fault).
  ChangeSLA(): request for integration of SLA updates.
Objective Advisors
  GetCurrentKPIs(): request for current KPI predictions for a given time interval, assuming no change is applied (i.e., only infrastructure events such as workload variation or node failures will occur).
  GetImpactKPIs(): request for KPI predictions over a given time interval for a schedule of change operations.

Based on the predicted KPIs, the orchestrator creates a tentative change-operation schedule and computes its overall business value (BV). The SLA defines service-level objectives based on the monitored KPIs (e.g., a target for the average response time) and associates a business-value function with each SLO (e.g., a penalty for each request that misses the target). The orchestrator computes the overall BV for a particular state of the system by adding the business values of all the services and SLOs defined in

the service-level agreement. A change schedule will modify the overall BV by altering the state of the system and its monitored KPIs. When the orchestrator needs to choose among several alternative options for changing the system (e.g., whether to include a proactive action in the schedule or not; all the possible times for scheduling a change operation), it uses the overall BV to select the best change-operation schedule. The overall BV reflects the utility of a change schedule and provides a way of comparing the effects of changes affecting multiple KPIs and SLOs.

The orchestrator is also invoked when an SLA has changed, through ChangeSLA(), which indicates a modification in the overall business-value calculations. The orchestrator retrieves the new SLOs and the corresponding BV expressions and automatically updates its scheduling engine (more comprehensive mechanisms for managing SLA updates are described in [24]). This is a reflexive hook allowing the orchestrator to update itself. In this case the change is applied immediately or at a specified time in the future, so it does not go through the scheduling process. New service-level agreements are typically defined in order to realign the business and IT objectives of the enterprise; therefore, the effect of the new SLAs must be reflected as soon as they are available.

The goal of change-operation scheduling is to maximize the business value for a certain time horizon. The Ecotopia orchestrator computes schedules for change-operation groups, which correspond to a request for change (RFC) or to a request for resource brokering. A schedule indicates when each individual change operation from the group will start executing. Using the overall business value, defined in the current SLAs, to compare different schedules, the orchestrator converges, through an iterative process, to the best feasible schedule.

3.2 What-If Interaction Protocol

The interaction protocol is at the heart of the Ecotopia framework. As shown in Fig.
2, a change sequence is initiated by the ChangeManager with the InitiateChange() function, or by an advisor with the InitiateResourceBrokering() function. The orchestrator initiates the what-if interaction by calling the GetCurrentKPIs() function of each of the advisors to learn about their prediction points during the decision time horizon and to establish a baseline state for assessing the impact of the proposed schedules. Then the orchestrator creates and refines schedules through an incremental process. It invokes the GetImpactKPIs() function on each of the advisors to acquire the KPI predictions necessary for assessing the impact of each of the proposed partial and complete schedules.

The orchestrator and the objective advisors exchange all the information about the current change group and change-operation schedule needed to assess the impact on the KPIs and to improve the schedule. Table 2 summarizes these parameters. A change operation is defined by a name, a scope and a set of properties. The name is an enterprise-specific descriptor (e.g., "Upgrade database software to version 10.0") recognized by all of the related objective advisors and service managers. The scope identifies the resources (e.g., "database node DB 1") involved in the operation.

Table 2. Scheduling parameters

  $CG(n, e_{1..n}, d_{1..n}, R, D)$   Change-operation group
  $n$                                 Number of change operations in the group
  $e_i$                               Change operation
  $e'_i$                              Optional change operation
  $d_i$                               Duration of change operation $e_i$
  $R(e_i, e_j)$                       True if $e_i$ must be executed before $e_j$
  $D$                                 Deadline of the change group
  $m$                                 Number of prediction points
  $Pp_k$                              Prediction point
  $T_H$                               Time horizon for scheduling and impact assessment
  $t_i$                               Time instant when change operation $e_i$ is scheduled to begin

The properties are a list of <name, value> pairs that describe operation characteristics such as the duration of executing the operation, the additional load imposed, etc. Change operations can be mandatory, such as the operations derived from an RFC, or optional, such as the resource-brokering operations. The scheduler can discard optional operations if they do not improve the business value. The set of operations in a group may expand during the scheduling process due to the proactive actions suggested by the objective advisors; in general, proactive actions can be considered optional.

Each change group defines a partial order among its constituent operations, indicating their precedence dependencies. A group may also specify a deadline for completing the execution of all its constituent operations and a business-value expression reflecting the penalty of late completion, which will be factored into the overall business value of the system to be maximized by the orchestrator. If the deadline information is missing, then the aggregated business value of the SLOs is the only criterion for selecting a schedule. A change-operation group can be preempted by the arrival of a group with a higher priority (e.g., if a previous change has damaged the system and needs to be rolled back).

The orchestrator uses the current KPI predictions as scheduling guidelines.
The scheduler starts by invoking the GetCurrentKPIs() function of the objective advisors to retrieve the future variation of all the relevant KPIs due to infrastructure events (e.g., faults, workload surges) and changes that have already been scheduled. These prediction points indicate the time instants when the objective advisors expect the KPIs to change. After the scheduling of a change group is completed, the advisors add its impact on the infrastructure to the current KPI predictions.

To minimize the communication costs, the orchestrator might cache business-value information for partial schedules. Each unique schedule is tagged with an identifier (similar to a hash key), known to the orchestrator and advisors, and its related KPI predictions are saved. The orchestrator retrieves the predictions whenever it modifies the partial schedule by adding one or more change operations, and thereby avoids repeating most of the computations.
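The what-if exchange and the schedule cache described in this section can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the Ecotopia implementation: the class and function names (ChangeOperation, StubAdvisor, WhatIfCache, schedule_id), the use of SHA-256 as the "hash key", and the toy impact model in which latency rises by 50 ms while an operation is running.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeOperation:
    name: str        # enterprise-specific descriptor
    scope: str       # resources involved in the operation
    duration: float  # one of the <name, value> properties


def schedule_id(schedule):
    """Tag a (partial) schedule {operation: start_time} with a stable identifier."""
    canonical = ";".join(
        f"{op.name}|{op.scope}@{start}"
        for op, start in sorted(schedule.items(), key=lambda kv: (kv[1], kv[0].name))
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


class StubAdvisor:
    """Hypothetical advisor: latency rises by 50 ms while any operation runs."""

    def __init__(self, baseline):
        self.baseline = baseline  # [(prediction_point, kpi_value), ...]
        self.calls = 0            # counts what-if invocations

    def get_current_kpis(self, horizon):
        # Baseline predictions, assuming no change is applied.
        return [(t, v) for t, v in self.baseline if t <= horizon]

    def get_impact_kpis(self, schedule, horizon):
        self.calls += 1

        def busy(t):
            return any(s <= t < s + op.duration for op, s in schedule.items())

        return [(t, v + 50.0 if busy(t) else v)
                for t, v in self.get_current_kpis(horizon)]


class WhatIfCache:
    """Orchestrator-side cache of KPI predictions, keyed by schedule identifier."""

    def __init__(self, advisor):
        self.advisor = advisor
        self._cache = {}

    def impact(self, schedule, horizon):
        key = (schedule_id(schedule), horizon)
        if key not in self._cache:
            self._cache[key] = self.advisor.get_impact_kpis(schedule, horizon)
        return self._cache[key]
```

Asking the same what-if question twice through WhatIfCache reaches the advisor only once, which is the saving the caching scheme aims for. Such a cache is only sound if the advisor's predictions are deterministic for a given schedule.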

4 Ecotopia Implementation

In this paper, we focus on the implementation of Ecotopia's orchestrator. The objective advisors rely on functionality provided by component-specific configuration managers [4, 5, 26, 27]. These managers encapsulate the extensive, and sometimes proprietary, domain knowledge (e.g., workload characteristics, resource-utilization models) needed for assessing the impact of change operations on the service KPIs. For evaluating our framework, we have developed configurable emulators for the objective advisors. We implement the orchestrator and the objective advisors as Web Services, which means that the orchestrator can interact with any third-party advisors that support the what-if interaction protocol described in Section 3.2.

4.1 Objective-Advisor Implementation

While the orchestrator is a centralized component, the objective advisors are distributed. Ecotopia uses an objective advisor for each SLO of each service defined in the service-level agreement. For example, a performance advisor monitors the service to assess the response time, and a dependability advisor assesses the recovery time and the availability based on the amount of redundancy available in the current configuration.

We implement the objective advisors in our framework in a hierarchical manner: as each service is composed of several other services, the advisor that corresponds to a top-level service queries several lower-level advisors corresponding to the component services. Every resource from the IT infrastructure is treated as a service: the network, the CPU, the disk, etc. have service-level objectives specifying the target for a set of KPIs, such as response time, throughput and recovery time. The service composition and the mapping of services onto physical resources define a request queuing path for each service. A change operation modifies this queuing path, either by altering its structure (e.g., by defining a new service composition), or by modifying the parameters of the component queues (e.g., by replacing a CPU with a faster one or by removing a replica from a load-balanced system).
The advisors use this domain knowledge to answer "what-if" questions about service KPIs (such as performance and availability forecasts), based on the description of the change operations and the schedule. The advisors corresponding to the primitive services contain analytical models of the corresponding resources and estimate the value of the KPIs based on the workload and configuration. For instance, the performance advisors estimate the response time of a primitive resource using the operational laws of queuing theory [28, 29], based on the incoming request rates and the known peak throughput of the resource. Higher-level advisors compute their KPI predictions by combining the outputs of the lower-level advisors along the corresponding queuing path. The composite queuing paths can be either sequential (e.g., a request travels through a front-end, a local-area network and then a back-end) or parallel (e.g., a load balancer forwards the request to one of several servers for further processing). The parallel queuing paths do not necessarily have the same length; for instance, a request for a data item present in a proxy cache has a shorter path than a request that results in a cache miss and that

needs to be forwarded to the application server for processing. The parallel queues have probabilities associated with each alternative path, representing the percentage of requests that travel along those paths. Our implementation is similar to the resource advisor described in [8]; in addition, we leverage workload predictions to estimate the long-term KPI variation.

KPIs change in time; therefore, the advisors provide KPI estimations as time-varying functions $KPI(t)$. A KPI value is assumed to hold for a period of time, until some event causes the KPI to take another value. This means that $KPI(t)$ is a step function, as shown in Fig. 3. When replying to the invocation of GetCurrentKPIs(), the objective advisor will provide a list of pairs <$Pp_k$, $KPI(Pp_k)$>, indicating the times (prediction points) $Pp_k$ when the KPI is expected to change and the corresponding KPI values (see Table 2). GetImpactKPIs() returns a similar list, indicating the effect of the suggested change schedule on the KPIs, computed using the service queuing path created by the change.

[Figure 3: step-function plot of response time against time, with value changes at $t_0, t_1, t_2, t_3, \ldots, t_n$.]

Fig. 3. A KPI (e.g., average latency) varies in time, depending on the workload and the system configuration. We represent this variation by a vector of <$t$, $KPI(t)$> pairs indicating the time when a KPI changes and the new value. This corresponds to a step function as shown in the figure.

4.2 Orchestrator Implementation

The orchestrator generates change-operation schedules, which associate start times $t_1, t_2, \ldots, t_n$ with operations $e_1, e_2, \ldots, e_n$, respectively, which have the respective durations $d_1, d_2, \ldots, d_n$ (see Table 2). The schedule must comply with the partial ordering among operations and the group deadline $D$ (if defined). During scheduling, the orchestrator queries the objective advisors for predictions of the impact on KPIs during the relevant time horizon and uses these predictions to compute the overall business value and to refine the schedule.
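The two compliance conditions on a schedule (precedence and deadline) can be checked mechanically. The following Python sketch is illustrative only: the function name is_feasible and the dictionary-based representation are assumptions, and it interprets "$e_i$ must be executed before $e_j$" as "$e_i$ must complete before $e_j$ starts".

```python
def is_feasible(starts, durations, precedes, deadline=None):
    """Check a schedule against the precedence relation R and the deadline D.

    starts, durations: dicts keyed by operation name (t_i and d_i).
    precedes: iterable of (a, b) pairs, meaning a must complete before b starts.
    deadline: group deadline D, or None if the group defines no deadline.
    """
    for a, b in precedes:
        if starts[a] + durations[a] > starts[b]:
            return False
    if deadline is not None:
        # Every operation must have completed by the deadline.
        return all(starts[e] + durations[e] <= deadline for e in starts)
    return True


# Hypothetical three-step upgrade: hand off the service, upgrade, restore.
durations = {"handoff": 1, "upgrade": 4, "restore": 1}
order = [("handoff", "upgrade"), ("upgrade", "restore")]
ok = is_feasible({"handoff": 0, "upgrade": 2, "restore": 6}, durations, order, deadline=8)
```

Here ok is True; moving the deadline to 6 or starting the upgrade at time 0 makes the same check fail.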
The time horizon $T_H$ must be long enough to include the deadline $D$, but in general will be longer, in order to account for the KPI impact after the change has been executed.

The aim of the scheduling process is to provide the best possible business value. The orchestrator does not know the closed-form equation that yields the overall business value because part of this computation is performed inside the objective advisors, which act as black boxes for the orchestrator. In scheduling-theory terms, this means that the scheduling problem has an unknown objective function [30]. Given that the complexity of scheduling algorithms depends on their objective functions, it is impossible for us to reason about the complexity of our problem. Moreover, even if we had a closed-form expression for the business value, this would most likely be a non-regular objective function (a regular objective function is nondecreasing in the completion times of the change operations); there are few theoretical results for scheduling problems with non-regular objective functions. We therefore

focus on approximate scheduling algorithms that make the best effort to compute a solution close to the optimal schedule.

Business-value model. The SLO business values are functions that associate a dollar value with various levels of service provided by the system. A service-level objective defines a target for a particular KPI. A service may have multiple SLOs (some of these objectives may track a common KPI, e.g., the target bounds for average latency and maximum latency), and each SLO has a business-value function. Since the KPIs change in time (see Fig. 3), the business values are also time-variable functions. At time $t$, a KPI value is $KPI(t)$ and the corresponding business value is $BV_{SLO}(KPI(t))$. For each KPI that changes at times $t_0, t_1, \ldots, t_n$, the business value for the time interval $[t_0, t_n]$ is computed using a weighted average:

$$BV_{SLO}([t_0, t_n]) = \frac{\sum_{i=0}^{n-1} BV_{SLO}(KPI(t_i))\,(t_{i+1} - t_i)}{t_n - t_0} \qquad (1)$$

The business-value functions of different SLOs are designed to be additive. They are used for reasoning about the multiple impacts of various change operations and for selecting the best trade-offs. We add the business values of all the SLOs to compute the overall business value, which reflects the utility of the proposed schedule of operations:

$$BV_{All}([t_0, t_n]) = \sum_{k \in \text{All SLOs}} BV_{SLO_k}([t_0, t_n])$$

Scheduling assumptions. In this paper, we make a few simplifying assumptions about our scheduling problem. First, we assume that all the operations in a change group are mandatory (there are no proactive actions). Second, we assume that all the change-operation groups have explicit deadlines. When not defined explicitly, the deadline can be fixed to the end of the time horizon for business-value evaluation; it makes no sense to schedule operations past this time horizon because we would not be able to see their impact on the business value. Third, the operations in a change group are totally ordered (i.e., an operation must complete before the next one can begin).
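As an illustration of the business-value model, the time-weighted average of Equation (1) and the additive overall business value can be sketched in Python as follows. The function names bv_slo and bv_all, the list-of-pairs representation of the step function, and the two example business-value functions are assumptions made for this sketch only.

```python
def bv_slo(points, t_end, bv):
    """Time-weighted average business value of one SLO over [t_0, t_end].

    points: sorted [(t_i, kpi_value), ...], the step function KPI(t);
            the value at t_i holds until the next point (or until t_end).
    bv:     business-value function mapping a KPI value to a dollar value.
    """
    t0 = points[0][0]
    if t_end <= t0:
        raise ValueError("the interval must have positive length")
    total = 0.0
    for i, (t_i, kpi) in enumerate(points):
        t_next = points[i + 1][0] if i + 1 < len(points) else t_end
        total += bv(kpi) * (t_next - t_i)
    return total / (t_end - t0)


def bv_all(slos, t_end):
    """Overall business value: the sum over all SLOs (additive by design)."""
    return sum(bv_slo(points, t_end, bv) for points, bv in slos)


# Hypothetical SLOs: latency (ms) penalized at $1 per ms, availability rewarded.
latency = [(0, 100.0), (10, 300.0)]
availability = [(0, 0.5)]
overall = bv_all([(latency, lambda ms: -ms), (availability, lambda a: 1000 * a)], 20)
```

With these toy numbers, the latency SLO averages to -200 over [0, 20], the availability SLO contributes 500, and the overall business value is 300.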
While these assumptions are somewhat constraining, we believe that in practice there are many change-management situations that satisfy these constraints (we provide an example in Section 5).

Scheduling algorithms. The algorithms we have implemented are based on the following pattern. Each operation $e_k$ has a feasible scheduling interval $[t_k^{\min}, t_k^{\max}]$, defined by the earliest and latest times when $e_k$ can be scheduled to allow enough time for the prior and subsequent operations:

$$t_k^{\min} = \sum_{i=1}^{k-1} d_i \qquad (2)$$

$$t_k^{\max} = D - \sum_{i=k}^{n} d_i \qquad (3)$$
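For a totally ordered group, the feasible interval of an operation follows directly from the durations: the earliest start is the total duration of the preceding operations, and the latest start is the deadline minus the total duration of the operation itself and its successors. The Python sketch below assumes that time starts at 0 and that operations are numbered from 1; the function names feasible_interval and candidate_starts are illustrative.

```python
def feasible_interval(durations, k, deadline):
    """Earliest and latest start of the k-th operation (1-based) of an ordered group."""
    earliest = sum(durations[: k - 1])             # time needed by e_1 .. e_{k-1}
    latest = deadline - sum(durations[k - 1:])     # leave room for e_k .. e_n
    if latest < earliest:
        raise ValueError("the deadline is too tight for this change group")
    return earliest, latest


def candidate_starts(durations, k, deadline, prediction_points):
    """Earliest time, latest time, and every prediction point in between."""
    lo, hi = feasible_interval(durations, k, deadline)
    return sorted({lo, hi, *(p for p in prediction_points if lo <= p <= hi)})


# Three operations lasting 2, 3 and 1 time units, with deadline D = 10.
window = feasible_interval([2, 3, 1], 2, 10)
starts = candidate_starts([2, 3, 1], 2, 10, [1, 4, 5, 8])
```

In this example the second operation can start anywhere in [2, 6], and the candidate start times are 2, 4, 5 and 6.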

Using these bounds, we try to schedule each change operation at the earliest possible time, the latest possible time and at all the $m$ prediction points (time instants indicating the future variation of the KPIs) that fall within this feasible interval.

[Figure 4: timeline from 0 to $D$ with prediction points $Pp_1, Pp_2, \ldots, Pp_m$. Operation $e_k$ is placed at time $t_k$ with duration $d_k$; operations $\{e_1, \ldots, e_{k-1}\}$ fall before it and $\{e_{k+1}, \ldots, e_n\}$ after it.]

Fig. 4. Our Greedy algorithm for scheduling change operations first chooses the change operation $e_k$ and the time $t_k$ that yield the best business value. This placement splits the timeline and the change-operation group in two, and we apply the same algorithm to the two halves of the problem.

The baseline scheduler is a backtracking algorithm that generates and evaluates all of the possible placements for the change operations in a group. We start with the first operation $e_1$ and we place it at all the prediction points from its feasibility interval ($t_1 = 0$, $t_1 = Pp_1$, $t_1 = Pp_2$, etc.). For each of these values of $t_1$, we repeat the algorithm for the remaining operations and the new boundaries of the timeline (since we have started with the first operation, the deadline stays the same and the start time becomes $t_1 + d_1$, the time when $e_1$ will complete). When we have successfully scheduled all the operations from the change group, we compute the corresponding business value by invoking GetImpactKPIs() on the relevant advisors. We then backtrack to try other possible placements of $e_n$, then of $e_{n-1}$, etc., and we save the schedule that generates the highest business value. If the KPIs are expressed as step functions, as shown in Fig. 3, and the business values are linear functions of the KPI values (which would make them step functions as well), this algorithm generates the optimal schedule.

For each operation $e_k$, there may be $m$ assignments of $t_k$. An assignment of $t_{n-1}$ will be tested in combination with $m$ assignments of $t_n$. An assignment of $t_{n-2}$ will be tested with $m$ assignments of $t_{n-1}$, each of which will be tested with $m$ assignments of $t_n$; therefore, an assignment of $t_{n-2}$ requires $m^2$ more operations for determining the best corresponding business value.
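The exhaustive search described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the business_value callback stands in for the calls to GetImpactKPIs() and the business-value computation, the operations are totally ordered, and time starts at 0. The name backtracking_schedule and the toy load model in the example are hypothetical.

```python
def backtracking_schedule(durations, deadline, prediction_points, business_value):
    """Try every candidate placement and keep the schedule with the best value.

    durations: [d_1, ..., d_n] of a totally ordered change group.
    business_value: callable taking the list of start times [t_1, ..., t_n];
                    a stand-in for querying the advisors with a full schedule.
    Returns (best_value, best_start_times).
    """
    n = len(durations)
    best = (float("-inf"), None)

    def place(k, start_min, starts):
        nonlocal best
        if k == n:  # all operations placed: evaluate the complete schedule
            value = business_value(starts)
            if value > best[0]:
                best = (value, list(starts))
            return
        latest = deadline - sum(durations[k:])  # leave room for later operations
        if latest < start_min:
            return
        candidates = sorted(
            {start_min, latest,
             *(p for p in prediction_points if start_min <= p <= latest)}
        )
        for t in candidates:
            starts.append(t)
            place(k + 1, t + durations[k], starts)  # next op starts after this one
            starts.pop()

    place(0, 0, [])
    return best


# Toy model: the penalty of an operation is the load at its start time
# multiplied by its duration.
def load(t):
    return 5 if t < 3 else 1 if t < 6 else 4


durations = [2, 1]
result = backtracking_schedule(
    durations, 10, [3, 6],
    lambda starts: -sum(load(t) * d for t, d in zip(starts, durations)),
)
```

In this toy setting the search places both operations in the low-load window, starting at times 3 and 5, for a business value of -3. The nested enumeration of candidate start times is what makes the worst-case cost exponential in the number of operations.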
By induction, this algorithm, henceforth called Backtracking, has the worst-case complexity $O(m^n)$.

A more realistic scheduler uses a polynomial best-effort algorithm that is not guaranteed to provide an optimal solution. We achieve this with a greedy algorithm: we place each operation $e_k$ at each prediction point from its feasibility interval and we compute the business value that corresponds to this placement (during this step, we are only interested in the impact of $e_k$, so we invoke GetImpactKPIs() on the relevant advisors for a schedule that contains only $e_k$). We select the operation and the placement that yield the best possible business value. This placement splits the timeline and the change-operation group in two, and the same algorithm is applied recursively to the two segments of the problem, as shown in Fig. 4. Operations $e_1, \ldots, e_{k-1}$ will be scheduled in the interval $[0, t_k]$, and operations $e_{k+1}, \ldots, e_n$ will be scheduled in the interval $[t_k + d_k, D]$. The first iteration of this algorithm performs $nm$ BV comparisons. In the worst case, the timeline partitioning will be skewed such that $e_1$ will be chosen and all the prediction points will fall after $t_1 + d_1$. The second iteration will then require $m(n-1)$ BV

comparisons. Since there are $n$ iterations, this algorithm (Greedy1) has the complexity $O(n^2 m)$.

This algorithm has the disadvantage that it tends to give priority to the short operations that have a small negative impact. These operations get the best placements, sometimes leaving the large operations to be scheduled during busier periods, thus affecting the overall business value. To avoid this situation, we can modify the selection condition in the following manner: at each iteration, we choose the operation $e_k$ that displays the largest business-value variation depending on the scheduling time. This strategy leads to selecting the operation most sensitive to placement first. This algorithm, called Greedy2, has the same complexity as the previous one: $O(n^2 m)$.

Schedule Stability. The schedules generated by the orchestrator remain constant in the absence of any additional change requests, SLA updates or system management events such as faults or workload changes. Fig. 5 shows that all the changes that might affect the final schedules are always initiated outside the scheduling loop involving the orchestrator and the advisors, which ensures the stability of our protocol. The advisors generate deterministic KPI predictions for a given change group (i.e., the same tentative schedule will yield the same predictions).³ The predictions returned by GetCurrentKPIs() will be adjusted in between change groups because the effects of the change that has just been scheduled are factored into the KPI predictions; however, no such adjustment is performed inside the scheduling loop. The algorithms presented above are guaranteed to converge if the KPI predictions are deterministic for a given change group. Other autonomic management systems based on iterative optimization loops [9, 21] may oscillate between borderline decisions because a resource reconfiguration will affect the performance metrics, which may subsequently trigger another reconfiguration.
Ecotopia, where all of the changes are initiated outside the scheduling loop and the what-if analysis considers a long time horizon, guarantees that such infinite cyclic dependencies are broken and that thrashing cannot occur.

³ The interaction protocol described in Section 3.2 also relies on this property because the orchestrator and the advisors cache the KPI predictions corresponding to partial schedules.

Canceling and Undoing Scheduled Changes. One corner case when the KPI predictions are not deterministic is when a fault or a load-surge prediction occurs while the scheduling loop is executing. Rather than updating the KPI predictions, in this case, we cancel the scheduling of the change group in order to avoid confusing the scheduler. Moreover, a fault or a load surge will typically be associated with a change request that has the highest urgency, so it is important to start scheduling this change as soon as possible. In general, whenever the orchestrator receives an urgent change request, it will preempt the currently executing scheduling process, and will start working on the new request immediately.

In some cases, it becomes obvious that a scheduled change does not have the desired effect and must be abandoned. If the change group has been scheduled but not yet implemented, it can be canceled easily. More often, however, this decision is taken only after the change has been finalized. In this case, another change has to be

scheduled to undo the effects of the previous one. The logs of the orchestrator can assist this operation by defining the reverse operations needed to undo the undesirable change, but the process must be guided by an administrator since the autonomic infrastructure has failed to take into account the negative effects of the change. In many cases, these errors are due to bad SLAs, which then have to be reworked by the system administrator. If the KPI predictions are accurate enough, we are confident that human interventions for correcting the orchestrator's decisions will be uncommon. Note that, since the decision to undo is not made by the orchestrator, the stability guarantees described above are not affected.

[Figure 5: scheduling-loop diagram. RFCs reach the Change Manager, which sends change groups to the Orchestrator. The Orchestrator and the Objective Advisors exchange tentative schedules and KPI predictions; SLAs, fault notifications, workload predictions and monitoring data from system management enter from outside the loop. The final schedule goes to schedule execution.]

Fig. 5. The scheduling loop of Ecotopia is designed such that all the change requests originate from outside of the iterative interaction between the orchestrator and the objective advisors. This ensures that the scheduling process does not oscillate between borderline decisions.

5 Case Study: Two-Tiered Enterprise Infrastructure

We consider a two-tiered system, where the physical hosts are organized in independently managed node groups. The first tier is a node group of application servers managed by application-server middleware (e.g., IBM WebSphere Extended Deployment [6]) and the second tier is a node group of database servers, managed by database cluster infrastructure (e.g., Oracle Clusterware [27]). The two node-group managers perform various management tasks (e.g., load balancing, request routing, fault recovery). This infrastructure, illustrated in Fig. 6, provides two services, each mapped onto corresponding application-server and database services. The two services processing Web transactions are load-balanced across three application servers, Srv 1 to Srv 3.
These front-end services query two database services that connect to separate database partitions. The database group comprises three nodes: DB 1 acts as primary server for Service1 and as backup for Service2; DB 2 is part of the logical primary server for Service2, which is distributed on two database nodes;

DB 3 is also part of the logical primary for Service2 and it is a backup for Service1 as well. Each of the two enterprise services has response time, recovery time and availability objectives. The business value associated with these SLOs depends on the related KPIs, such as total number of transactions, number of transactions with response time below target, etc.

A performance advisor evaluates the impact of change operations on the end-to-end response time for each service by exploiting the knowledge provided by the node-group managers (e.g., expected workload variations, service overheads). Similarly, a dependability advisor evaluates the impact on the recovery time and the availability SLOs.

[Figure 6: front-end application-server group (Srv 1, Srv 2, Srv 3) hosting Service 1 and Service 2, and database group with DB 1 (Service 1 primary, Service 2 backup), DB 2 (Service 2 primary) and DB 3 (Service 2 primary, Service 1 backup).]

Fig. 6. Example: two-tier system

5.1 Qualitative Evaluation

For evaluating the Ecotopia change-management framework in this context, we discuss two realistic change-management scenarios for this case study: a crash of node DB 1 and an upgrade of the database software. We complement this analysis with measurements illustrating the trade-off between the cost and the loss of optimality of different scheduling algorithms (Section 5.2).

Scenario 1: Hardware crash. When the dependability advisor detects the crash of DB 1, the corresponding node-group manager takes immediate recovery measures. The database recovery manager handles the failover of Service1 to its backup node, DB 3. As a result, DB 3 handles queries for both services, while DB 2 continues to handle only queries for Service2. However, since the database group now has fewer nodes, and an accompanying higher risk of failing the availability objectives, the change-management system must decide whether removing one node from the application-server group and adding it to the database group would improve the overall business value and when these operations should be scheduled.

[Figure 7: timelines for nodes Srv 1 to Srv 3 and DB 1 to DB 3 showing the DB 1 crash, the removal of a node from the application-server group, its addition to the database group, the hand-offs and the checkpoint, together with curves for the workload, response time, recovery time and availability of Service1 and Service2 and for the business value. Points (a) to (d) mark the events discussed in the text.]

Fig. 7. Hardware crash and fault-management scenario

Fig. 7 shows the impact of these change operations. After the crash of DB 1, the lack of a backup leads to a sharp decrease of the predicted availability of Service1 and a drop in the corresponding business value, indicated by point (a) in the figure. However, since the load of Service2 is high at this point, transferring a node from the application-server group to the database group would fail to meet the response-time objective. Therefore, the orchestrator delays the change operations until the load of Service2 decreases, at point (b). During the node transfer, the response time increases for both services, but after the hand-off point (c) the response times, as well as the availability of Service1, return to normal. However, since Service2 has been continuously sending queries to the database, its log kept growing, leading to an increase of the recovery time. To solve this problem, the dependability advisor requests a proactive action in the form of a database checkpoint (synchronizing the modified data blocks in memory with the disk and shortening the log processed during recovery). After the checkpoint, indicated by point (d), the response time and the recovery time for Service2 decrease to normal operating levels.

Scenario 2: Database upgrade. A similar impact analysis must be undertaken when upgrading the database software (Fig. 8). In this case, a request for change is decomposed into finer-grained change operations: each database node is upgraded separately and, for upgrading DB 1, Service1 is handed off to DB 3 (its backup) before the upgrade and restored at the end. The analysis must consider the impact of