A 10-year Perspective on Software Engineering for Self-Adaptive Systems
SEAMS 2013
David Garlan
May 2013, San Francisco, CA

A 10-year Perspective
- In 2002 Alex Wolf, Jeff Kramer and I organized the first ACM SIGSOFT Workshop on Self-Healing Systems (WOSS '02)
  - Charleston, South Carolina (at FSE-10)
  - 35 participants, 22 presentations
  - 2 days
  - Evolved into today's SEAMS
- It seems appropriate to reflect on what has gone on since that event
  - What were the challenges then?
  - What are they now?
  - Has there been progress?
Talk Outline
- Partial history of the field
  - Precursors; WOSS '02; what followed
- Personal reflections on the Rainbow System
  - What we did right; what we got wrong
- Challenges moving forward
  - Some open problems and ideas for progress

Timeline
[Figure: timeline from 2002 to 2012, with WOSS '02 at the start and "We Are Here" at SEAMS '13]
- WOSS: Workshop on Self-Healing Systems
- SEAMS: SW Engineering for Adaptive and Self-Managing Systems
WOSS '02

"The goal of this workshop is to bring together researchers and practitioners from many [areas] to discuss the fundamental principles, state of the art, and critical challenges of self-healing systems. ... we intend to focus on the software engineering aspects, including the software languages, techniques and mechanisms that can be used to support dynamic adaptive behavior."
ACM SIGSOFT* Workshop on Self-Healing Systems (WOSS '02)
November 18-19, 2002
Charleston, SC
* With partial support from the NASA High Dependability Computing Project

Workshop Logistics
- Sessions organized by affinity groups
  - 3-4 papers per 1.5 hour session
- Session chairs, as noted on program
  - See your chair before your session
- Emphasis on discussion
  - 10-minute research overviews from each paper
  - Half-hour breaks between sessions
- Joint lunch on Monday
- Open (uncommitted) sessions at end of each day (for extra discussion, topics that don't have another home, etc.)
Software Engineering Today
- Common assumptions
  - Known and stable system requirements
  - Known and stable operating environment
  - Control over the development of assembled parts
  - Development time and run time are completely separate
  - Systems can be taken down for maintenance
- Consequences
  - Focus on improving development time processes, techniques, notations
  - Provide high assurance through testing, rigorous specification, modeling, verification, etc.
  - Expect end users to do manual installation and upgrades

Isn't This Good Enough?
- Increasingly, systems
  - are composed of parts built by many organizations
  - must run continuously
  - operate in environments where resources change frequently
  - are used by mobile users
- For such systems, traditional methods break down
  - Exhaustive verification and testing not possible
  - Manual reconfiguration does not scale
  - Off-line repair and enhancement is not an option
What Has to Change?
- Goal: systems automatically and optimally adapt to handle
  - changes in user needs
  - variable resources
  - faults
  - mobility
- But how?
- Answer: Move from open-loop to closed-loop systems
[Figure: closed loop between "Adaptive Control" and "Executing System", with arrows labeled "Sense" and "Affect"]

Many Approaches
- Programming language support
- Algorithms (e.g., self-stabilizing, machine learning)
- Architecture-based adaptation
- Operating systems support
- Domain-specific techniques (e.g., distributed databases, pub-sub architectures, ...)
- Adaptable middleware
- Support for user mobility
- Fault tolerant system design (e.g., graceful degradation)
- Biologically-inspired models
- Inferring correct system behavior through observation
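The move from open-loop to closed-loop operation on the "What Has to Change?" slide can be illustrated with a toy controller. This is a minimal sketch, not taken from the talk: the `ExecutingSystem` class, its `sense` and `affect` methods, and the load and target numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExecutingSystem:
    """Toy stand-in for the managed system (hypothetical)."""
    servers: int = 1
    load: float = 100.0  # requests per second arriving at the system

    def sense(self) -> float:
        """Report the observed load per server."""
        return self.load / self.servers

    def affect(self, delta_servers: int) -> None:
        """Apply a change chosen by the controller; never drop below one server."""
        self.servers = max(1, self.servers + delta_servers)


def adaptive_control(system: ExecutingSystem, target: float, steps: int) -> list:
    """Closed loop: sense, compare against the target, affect, repeat."""
    history = []
    for _ in range(steps):
        per_server = system.sense()
        if per_server > target:
            system.affect(+1)   # overloaded: add capacity
        elif per_server < target / 2 and system.servers > 1:
            system.affect(-1)   # badly underused: release capacity
        history.append(system.servers)
    return history
```

An open-loop system would fix the server count at deployment time. Here the count settles once the sensed load per server falls at or below the target.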
Why Have a Workshop?
- Understand the relationships between these different approaches
- Identify the software engineering challenges and opportunities
- Create a common vocabulary (or possibly a reference model)

Affinity Groups (in order of appearance)
- Architecture-based adaptation
- Systems: operating systems, distributed systems, databases
- Middleware & mobility
- Programming languages
- User-centric approaches: requirements specification, inference
- New paradigms: neural nets, biologically-inspired computing, homeostatic systems
Outcomes
- It was a motley crew representing a wide range of areas of expertise
- We had a lot of fun
- We decided to do it again
[Figure: Self-Adaptive Systems surrounded by contributing areas: Control Systems, Biology, Fault Tolerance, Human Immune System, Software Architecture, AI]

Timeline
[Figure: timeline from 2002 to 2012 marking IBM Autonomic, DASADA, WOSS '02, WOSS '04, TAAS, ICAC, SASO, SOAR, SEAMS '06, IJAC, and "We Are Here" at SEAMS '13]
- WOSS: Workshop on Self-Healing Systems
- SEAMS: SW Engineering for Adaptive and Self-Managing Systems
- DASADA: Dynamic Assembly for System Adaptability, Dependability, and Assurance
- ICAC: Intl. Conf. on Autonomic Computing
- SASO: Self-Adaptive and Self-Organizing Systems
- SOAR: Self-Organizing Architectures
- TAAS: ACM Transactions on Autonomous and Adaptive Systems
- IJAC: International Journal of Autonomic Computing
- IBM Autonomic:
  - A.G. Ganek and T.A. Corbi, The Dawning of the Autonomic Computing Era. IBM Systems J., vol. 42, no. 1, 2003, pp. 5-18.
  - J.O. Kephart and D.M. Chess, The Vision of Autonomic Computing. IEEE Computer, vol. 36, no. 1, January 2003.
Forces Behind the Expansion
- New system requirements
  - Maintain availability and utility in the presence of variable environments, faults, attacks, changing requirements, mobility, reliance on systems outside our direct control, ...
- Traditional solutions increasingly inadequate
  - Humans are expensive and error prone
  - Embedded mechanisms (e.g., timeouts, exceptions) didn't work
- Recognition that control systems are not just for hardware
  - Same principles can be applied to software systems

IBM MAPE-K
The Google File System
Source: The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. SOSP 2003.

Summary Part 1
- Since the first WOSS in 2002 the field of self-adaptive systems has exploded, driven by technological and economic forces
  - Conferences
  - Publications
  - Journals
  - Industrial attention
- From the outset it represented an area of convergence for many disciplines
  - Fault-tolerance, architecture, AI, etc.
  - But with a distinct software engineering flavor of its own
Talk Outline
- Partial history of the field
  - Precursors; WOSS '02; what followed
- Personal reflections on the Rainbow System
  - What we got right; what we got wrong
- Challenges moving forward
  - Some open problems and ideas for progress

Rainbow Goals
Engineer self-adaptation to support:
- Cost-effectiveness (relative to handcrafted solutions)
- Automation of routine adaptations performed by humans
- Legacy systems
- Domain-specific adaptations
- Ease of changing adaptation strategies and policies
- Reasoning about the effects of self-adaptation actions and strategies
Rainbow Approach
- A framework that
  - Allows one to add a control layer to existing systems
  - Uses architecture models to detect problems and reason about repair
  - Can be tailored to specific domains
  - Separates concerns through multiple extension points: probes, actuators, models, fault detection, repair
- A language (Stitch) for programming repair actions that
  - Addresses self-adaptation concerns

The Rainbow Framework
[Figure: an Architecture Layer (Adaptation Manager, Model Manager) connected through a Translation Infrastructure to a System Layer (Target System)]
Rainbow Framework Overview
[Figure: Rainbow components by layer, annotated with their responsibilities]
- Architecture Layer
  - Adaptation Manager: decides on the best adaptation
  - Strategy Executor: carries out that adaptation
  - Architecture Evaluator: detects problem in system
  - Model Manager: manages architecture and environment model
- Translation Infrastructure: bridges abstraction (aggregate info up, map action down)
  - Gauges: relate system info to model
  - Effectors: change state in target system
  - Probes: extract system information
  - System API, Resource Discovery
- System Layer
  - Target System
- Customization points are marked throughout the framework
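The division of responsibilities in the framework overview can be sketched as one pass through the loop, with one function per component. This is an illustration only and does not reflect Rainbow's actual APIs: every function name, the dictionary-based model, and the `MAX_LATENCY` value are assumptions made for the sketch.

```python
from statistics import mean
from typing import Optional

MAX_LATENCY = 2.0  # seconds; hypothetical threshold


def probe(target: dict) -> list:
    """Probe: extract raw information from the target system."""
    return target["recent_latencies"]


def gauge(samples: list) -> float:
    """Gauge: aggregate raw samples into a model-level property."""
    return mean(samples)


def update_model(model: dict, latency: float) -> None:
    """Model manager: keep the architecture model in step with the system."""
    model["latency"] = latency


def evaluate(model: dict) -> bool:
    """Architecture evaluator: True when the model violates the constraint."""
    return model["latency"] > MAX_LATENCY


def choose_adaptation(model: dict) -> str:
    """Adaptation manager: pick an adaptation (a trivial rule, for illustration)."""
    return "add_server" if model["servers"] < model["max_servers"] else "lower_fidelity"


def execute(action: str, model: dict, target: dict) -> None:
    """Strategy executor plus effector: map the model-level action down to the system."""
    if action == "add_server":
        target["servers"] += 1
        model["servers"] += 1
    else:
        target["fidelity"] = "low"


def control_cycle(model: dict, target: dict) -> Optional[str]:
    """One pass: probe, gauge, update model, evaluate, decide, execute."""
    update_model(model, gauge(probe(target)))
    if not evaluate(model):
        return None
    action = choose_adaptation(model)
    execute(action, model, target)
    return action
```

Keeping each responsibility behind its own function mirrors the customization points on the slide: a different domain would swap the probe, gauge, or effector without touching the loop.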
Self-Adaptation Example: Znn.com
[Figure: Client 1 ... Client n connect over the Net to a Load Balancer, which fronts a server pool of WebServer 1 ... WebServer k; the web servers reach the Backend DB over ODBC-Conn]

Self-Adaptation Example: Znn.com
- Adaptation Condition: client request-response time must fall within threshold
- Monitored properties: Latency, Load, ResponseTime
- Possible actions on the load balancer
  - restartlb
- Possible actions on the server pool
  - enlistservers
  - dischargeservers
  - restartwebserver
  - lowerfidelity
  - raisefidelity
Znn.com: Rainbow Customizations (Part 1)
[Figure: the Rainbow framework diagram annotated with Znn.com customizations]
- Strategy Executor (SX) operators: addserver, removeserver, setfidelity
- Architecture Evaluator (AE) constraint: ClientT.reqRespLatency <= MAX_LATENCY
- Effectors: activateserver.pl, deactivateserver.pl, setfidelity.pl
- Probes: PingRTTLatency, Bandwidth, Load, Fidelity, Cost
- Gauges: ClientT.reqRespLatency, HttpConnT.bandwidth, ServerT.load, ServerT.fidelity, ServerT.cost
- Legend: MM = Model Manager, AE = Architecture Evaluator, AM = Adaptation Manager, SX = Strategy Executor

Stitch: A Language for Specifying Self-Adaptation Strategies
- Control-system model: Decision tree in which selection of next action in a strategy depends on observed effects of previous action
- Value system: Utility-based selection of best strategy allows context-sensitive adaptation
- Asynchrony: Explicit timing delays capture settling time
- Uncertainty: The effect of a given tactic/strategy is known only within some probability
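The control-system and asynchrony points on the Stitch slide can be sketched in ordinary Python. This is not Stitch syntax: the `Node` class, the `run_strategy` function, the two toy tactics, and their effects on latency are hypothetical. The sketch shows only the decision-tree and settling-time ideas and leaves out utility and probabilities.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

MAX_LATENCY = 2.0  # seconds; hypothetical threshold


@dataclass
class Node:
    """One step of a strategy: a tactic, its settling time, and guarded successors."""
    tactic: Callable[[dict], None]
    settle_seconds: float = 0.0
    children: list = field(default_factory=list)  # (condition, Node) pairs


def run_strategy(root: Node, system: dict, wait=time.sleep) -> list:
    """Run tactics down the tree; each branch is chosen from the observed effect."""
    trace = []
    node = root
    while node is not None:
        node.tactic(system)
        trace.append(node.tactic.__name__)
        wait(node.settle_seconds)  # let the effect settle before observing it
        node = next((child for cond, child in node.children if cond(system)), None)
    return trace


def enlist_server(system: dict) -> None:
    system["servers"] += 1
    system["latency"] *= 0.6   # assumed effect, for illustration


def lower_fidelity(system: dict) -> None:
    system["fidelity"] = "textual"
    system["latency"] *= 0.3   # assumed effect, for illustration


def still_slow(system: dict) -> bool:
    return system["latency"] > MAX_LATENCY


# Enlist a server first; lower fidelity only if latency is still too high.
STRATEGY = Node(enlist_server, 5.0, [(still_slow, Node(lower_fidelity, 5.0))])
```

Passing `wait=lambda _: None` skips the settling delays, which is convenient when exercising the tree in a test.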
Strategy Selection
- Given:
  - Quality dimensions and weights (e.g., 4)
  - A strategy with N nodes
  - Branch probabilities as shown
  - Tactic cost-benefit attributes
- Propagate cost-benefit vectors up the tree, reduced by branch probabilities
- Merge expected vector with current conditions (assume: [1025, 3.5, 0, 0])
- Evaluate quality attributes against utility functions: u_latency(), u_quality(), u_cost(), u_disruption()
- Compute weighted sum to get utility score: (w_latency, w_quality, w_cost, w_disruption) = (0.5, 0.3, 0.1, 0.1) [sum = 1]
- Algorithm: given tree g with node x and its children c:
  $\mathrm{EAAV}(g) = \mathit{sysAV} + \mathrm{AggAV}(\mathrm{root}(g))$
  $\mathrm{AggAV}(x) = \mathit{cbAV}(x) + \sum_{c} \mathrm{prob}(x, c)\,\mathrm{AggAV}(c)$
[Figure: strategy tree with tactic nodes T0, T1, T2 and leaves "done" and "fail", annotated with branch probabilities (75%/25% and 60%/40%) and four-element cost-benefit vectors; resulting score = 0.58]

Znn.com: Rainbow Customizations (Part 2)
[Figure: the Rainbow framework diagram annotated with Znn.com utilities, tactics, and strategies]
- Objectives: timely response (ur), high-quality content (uf), low-provision cost (uc)
- Tactics [T]: switchtotextual, switchtomultimedia, enlistservers [ur: -1000, uf: 0, uc: +1.00], dischargeservers
- Strategies [S]: SimpleReduceResponseTime, ImproveFidelity
- Utility function for ur (value: utility): 0: 1.00, 500: 0.90, 1500: 0.50, 4000: 0.00
- Weights: ur: 0.3, uf: 0.4, uc: 0.2, usf: 0.1
- Legend: MM = Model Manager, AE = Architecture Evaluator, AM = Adaptation Manager, SX = Strategy Executor
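The EAAV and AggAV recurrences on the Strategy Selection slide can be sketched in a few lines. This is an illustration under stated assumptions, not Rainbow's implementation. The tree, the system vector, the cost utility, and the weights are hypothetical. Only two quality dimensions are used. The response-time utility points are the ones listed for ur on the Part 2 slide, read here as milliseconds and joined by straight lines, both of which are assumptions.

```python
from bisect import bisect_right


def agg_av(node: dict) -> list:
    """AggAV(x) = cbAV(x) + sum over children c of prob(x, c) * AggAV(c)."""
    total = list(node["cbav"])
    for prob, child in node.get("children", []):
        child_av = agg_av(child)
        total = [t + prob * c for t, c in zip(total, child_av)]
    return total


def eaav(sysav: list, root: dict) -> list:
    """EAAV(g) = sysAV + AggAV(root(g)): expected attributes after the strategy."""
    return [s + a for s, a in zip(sysav, agg_av(root))]


def utility(points: list, x: float) -> float:
    """Piecewise-linear utility through sorted (value, utility) points, clamped at the ends."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def score(av: list, utilities: list, weights: list) -> float:
    """Weighted sum of per-dimension utilities."""
    return sum(w * utility(u, v) for v, u, w in zip(av, utilities, weights))


# Hypothetical strategy: each vector is [response time change, cost change].
TREE = {
    "cbav": [-500.0, 1.0],
    "children": [
        (0.75, {"cbav": [-250.0, 1.0]}),  # 75% branch: a second tactic runs
        (0.25, {"cbav": [0.0, 0.0]}),     # 25% branch: done
    ],
}
SYSAV = [1687.5, 2.25]                    # hypothetical current conditions
U_RESPONSE = [(0, 1.00), (500, 0.90), (1500, 0.50), (4000, 0.00)]
U_COST = [(0, 1.0), (10, 0.0)]            # hypothetical
WEIGHTS = [0.7, 0.3]                      # hypothetical; sums to 1
```

With these inputs, `eaav(SYSAV, TREE)` gives `[1000.0, 4.0]`. A selector would compute `score` for every applicable strategy and run the one with the highest value.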
Znn.com Experiment Run Results
Aggregate Z.com Webpage JMeter Latency Data (ms response)

URL                       #Samples  Average  Median  90% Line  Min  Max    Error%  Throughput  KB/sec
0415K No Rainbow          1200      7981     1953    37514     93   52201  0.00%   8.4/sec     6.92
0415M Control+Monitoring  1200      5926     1531    26905     78   45936  0.00%   10.6/sec    10
0415L SimpleReduce1       1200      1502     16      2187      0    22202  0.17%   29.5/sec    78.26
0415D simplereduce0       1200      1674     172     2546      0    24220  1.25%   14.2/sec    38.36
0415F lowestfidelity      1200      4        0       16        0    62     0.00%   49.2/sec    162.73

- Requests over 10 sec: K = 217, L = 80 (36.9% of K)
- Requests over 1 sec: K = 1120, L = 298 (26.6% of K)
[Figure: latency (sec) against timepoint of request (sec) for 0415K-NoRainbow and 0415L-SimpleReduce1]

[Figure: two panels, "Control run" and "Adaptation run", plotting latency (s) against time elapsed (s) for Client1 to Client5, with the constraint at latency = 2 secs and the point where the system adapts marked]
- Data shows that our adaptation approach improves overall system performance
System Administration Evaluation
- Sys-admin interview
  - Results: Stitch concepts seem natural fit for sys-admin routines
  - Methodology: priming, interview, compose Stitch script from scenarios
  - White problem scenarios: scripts represented in Stitch
  - Almossawi problem scenarios: structure matches Stitch strategies
- Analysis using CMU sys-admin example: NetBWE
  - Results: Rainbow captured adaptation concerns
  - Stitch hoisted policies buried in Perl code
- Distinguishable adaptation tasks
  - Core commands as operators
  - Coarser-grained sequence of commands (step) with conditions of applicability and intended effects
  - Adaptations with intermediate condition-actions and observations
[Figure: NetBWE diagram with elements Switch, Router, Usage Log, Incident DB, NetBlock, Quagar, Ac Pt, and Internet uplink link, and operations loadstate, sense usage, logusage, doviolation, and block]

Rainbow Reflection: What we got right
- Embodies an explicit control systems architecture
- Supports model-based adaptation (in this case through architectural models)
  - Models are updated by gauges and form the basis for all decision making
- Framework allows domain specialization
  - Style of system, repertoire of strategies, ...
- Repair language (Stitch) incorporates:
  - Control perspective guiding each step
  - Utility to select best strategy
  - Timing to account for asynchrony
  - Probabilities to handle uncertainty
Rainbow Reflection: What we got wrong
- Fixed set of strategies
  - What do you do if no strategy can help?
- Tactic outcomes specified in utility space
  - Limits amount of reasoning one can do about strategies
- Repair triggered by constraint violation
  - Does not easily accommodate homeostatic adaptation
- Humans not in the loop
  - Not prohibited, but not easily supported
- Complexity in number of plug-ins for installation
  - Deployment is difficult

Summary Part 2
- Rainbow attempted to capture the self-adaptive control loop through
  - a framework that can be specialized for specific domains
  - a language for programming adaptations
- Recognized the first class nature of
  - Control, Timing, Uncertainty, Utility, Analysis
- But left many issues unaddressed or inadequately supported
  - Humans in the loop, dynamically generated strategies, tactic specification, and deployability
Talk Outline
- Partial history of the field
  - Precursors; WOSS '02; what followed
- Personal reflections on the Rainbow System
  - What we got right; what we got wrong
- Challenges moving forward
  - Some open problems and ideas for progress

Some Challenges
1. Uncertainty
2. Combining Reactive and Deliberative Adaptation
3. Humans in the Loop
4. Architecting for Adaptability
5. Proactivity
6. Concurrency, preemption, synchronization
7. Distribution
1. Uncertainty
- Uncertainty is everywhere
  - Monitor: Sensing is imprecise and partial
  - Analyze: Detection and diagnosis are often ambiguous
  - Plan: The effect of a tactic/strategy cannot be accurately predicted
  - Execute: The system may not behave the way we expect, or may be subject to other influences outside our control
  - Knowledge: Run-time models are only approximate
  - Environment: Hard to predict or control what's outside the system (loads, attacks, ...)
- How do we bring this under control without overwhelming our ability to reason about the adaptation?

Ideas
- Adaptive probing
  - Reduce uncertainty when problems occur
- Adaptive diagnosis
  - Expand window of observation when cause is uncertain
- Machine learning
  - Improve our understanding over time
- Make uncertainty first class in our models (both design models and run-time models)
  - Exploit emerging probabilistic model checking tools
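One way to read the adaptive probing idea is to probe more often while a problem is being observed and to back off once things look healthy. The function below is a minimal sketch of that reading. The name `next_probe_interval`, the halving and doubling policy, and the interval bounds are assumptions, not something the talk prescribes.

```python
def next_probe_interval(current: float, problem_observed: bool,
                        min_interval: float = 1.0,
                        max_interval: float = 60.0) -> float:
    """Return the next probing interval in seconds.

    Halve the interval while a problem is observed, to gather more evidence.
    Double it otherwise, to keep monitoring overhead low.
    """
    if problem_observed:
        return max(min_interval, current / 2)
    return min(max_interval, current * 2)


def simulate(observations: list, start: float = 16.0) -> list:
    """Apply the policy to a sequence of problem/no-problem observations."""
    intervals = []
    interval = start
    for problem in observations:
        interval = next_probe_interval(interval, problem)
        intervals.append(interval)
    return intervals
```

The bounds stop the probe rate from growing without limit during a long incident, which matters because monitoring itself consumes resources.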
2. Combining Reactive and Deliberative
- Reactive adaptation (à la Rainbow) is efficient and supports design-time analysis but may not cover all situations
- Deliberative adaptation allows one to reason a way to a solution in unfamiliar situations, but may be slow
- Can we combine these?
  - Supervisory planning to set longer-term goals (à la Magee and Kramer)?
  - In parallel and collaboratively?

Ideas
System 1
- Intuitive
- Reactive
- Fast
- Low resource consumption
- Unconscious
- Useful if not always right
- Example: 2 + 2

System 2
- Analytical
- Deliberative
- Slow
- High resource consumption
- Conscious
- More often right
- Example: 24 * 17

Example: Arrows
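One possible combination of the two modes is to answer from a table of precomputed plans when the situation is familiar and to fall back to search otherwise. The sketch below assumes a toy model of a web system. The state encoding, the `latency` formula, the action set, and the `REACTIVE_PLANS` table are all hypothetical.

```python
from collections import deque

# State is (servers, fidelity_level); both actions and the model are toy assumptions.
ACTIONS = {
    "add_server": lambda s: (s[0] + 1, s[1]),
    "lower_fidelity": lambda s: (s[0], max(0, s[1] - 1)),
}

# "System 1": canned responses to situations anticipated at design time.
REACTIVE_PLANS = {"latency_violation": ["add_server"]}


def latency(state: tuple) -> float:
    """Hypothetical latency model: richer content is slower, more servers are faster."""
    servers, fidelity = state
    return 4.0 * (1 + fidelity) / servers


def apply_plan(state: tuple, plan: list) -> tuple:
    for name in plan:
        state = ACTIONS[name](state)
    return state


def deliberate(state: tuple, goal, max_depth: int = 4):
    """'System 2': breadth-first search for the shortest plan that reaches the goal."""
    queue = deque([(state, [])])
    while queue:
        current, plan = queue.popleft()
        if goal(current):
            return plan
        if len(plan) < max_depth:
            for name, act in ACTIONS.items():
                queue.append((act(current), plan + [name]))
    return None


def adapt(state: tuple, problem: str, goal):
    """Use the fast path when the problem is known, the slow path otherwise."""
    if problem in REACTIVE_PLANS:
        return "reactive", REACTIVE_PLANS[problem]
    return "deliberative", deliberate(state, goal)
```

The reactive path costs a dictionary lookup, while the search grows exponentially with plan depth, which is the efficiency trade-off the slides describe.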
3. Humans in the Loop
- Taking deliberative approaches even further, how can we involve humans in the adaptation loop?
  - Providing situational awareness to operators
  - Using them to improve adaptation by combining autonomic and human guidance for mixed-initiative control
  - Learning from human behavior
- How does that impact our ability to reason about adaptive systems?
  - Humans add an even larger source of uncertainty, compounding Challenge 1

Ideas
- Perhaps approach the problem by considering humans in various roles in the control loop:
  - Sensors (providing inputs about the state of the system and the environment)
  - Actuators (carrying out actions that may not be automatable)
  - As knowledge/model augmentation
  - As decision makers (developing new tactics/strategies)
- Learn from new breed of interactive robots
4. Architecting for Adaptability
- It is routine now to consider various "-ilities" when architecting systems
- What does it mean to architect a system (under control of an autonomic layer) for adaptability?
  - Example: monitorability
  - Example: controllability
- What tradeoffs are involved and how can one balance these?
  - Example: monitorability may impact performance
  - Example: controllability may impact security

Ideas
- Make adaptability part of architecture design reviews
- Create a catalog of adaptability-enhancing design patterns/tactics
  - Some may be generic
    - Example: pub-sub mechanisms support monitorability of communication
  - Others will be domain-specific
    - Example: Stateless servers support flexible server reconfiguration in web-oriented domains
5. Proactivity
- Today most self-adaptive systems are reactive
  - A problem occurs, and the system addresses it
- But in many cases it may be better to do things before the problem occurs
  - Example: Adapt the number of servers in anticipation of high load
  - Example: Moving target approaches to security
  - Example: periodic rebooting of suspicious subcomponents
- How do we rationally balance the cost and benefits of a proactive approach?

Ideas
- Change the basis for strategy selection from instantaneous utility to aggregate utility
- Develop better ways to predict future behavior
  - Game theory for adversarial environments
  - Predictive calculi (cf. Poladian thesis)
  - Use machine learning
- Learn from the planning under uncertainty community
  - They have insight into when it is better to accept short-term pain for long-term gain
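The difference between instantaneous and aggregate utility can be seen in a small worked example. This is a minimal sketch: the forecast values, the option names, and the optional discount factor are hypothetical, and where such predictions come from is left open.

```python
def aggregate_utility(predicted: list, discount: float = 1.0) -> float:
    """Sum of predicted utilities over the horizon, optionally discounted per step."""
    return sum(u * discount ** t for t, u in enumerate(predicted))


def select_instantaneous(options: dict) -> str:
    """Pick the option with the best utility right now (first time step only)."""
    return max(options, key=lambda name: options[name][0])


def select_aggregate(options: dict, discount: float = 1.0) -> str:
    """Pick the option with the best utility summed over the whole horizon."""
    return max(options, key=lambda name: aggregate_utility(options[name], discount))


# Hypothetical utility forecasts per time step as load ramps up.
FORECAST = {
    "do_nothing":        [0.9, 0.6, 0.3, 0.3],  # fine now, degrades under load
    "enlist_server_now": [0.7, 0.8, 0.8, 0.8],  # pays a cost now, holds up later
}
```

Instantaneous selection prefers `do_nothing` because 0.9 beats 0.7 at the first step. Aggregate selection prefers `enlist_server_now` because its total over the horizon is 3.1 against 2.1, which is the short-term pain for long-term gain trade described above.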
6. Concurrency, Preemption, Synch
- Many autonomic systems execute a single adaptation at a time
  - Easier to reason about
- But many potential benefits to concurrent adaptation
  - Performance through parallelism
  - Rapid response when new opportunities arise
- How do we make this manageable?
  - Guarantee concurrent adaptations don't interfere
  - Know when to terminate a long-running but unproductive adaptation
  - Interrupt a low-priority adaptation with a high-priority one

7. Distribution
- Centralized adaptation architectures simplify control but may not scale
  - Systems of systems built by different parties
  - Ultra-large scale systems with no central control or authority
- How can we coordinate adaptation among many managed systems?
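Returning to the preemption and termination questions under Challenge 6, one simple policy is to run adaptations one step at a time from a priority queue, so that a newly arrived high-priority adaptation takes over at the next step boundary and an adaptation that exceeds a step budget is stopped. The scheduler below is a hypothetical sketch of that policy. It does not address the harder question of guaranteeing that concurrent adaptations do not interfere.

```python
import heapq


def schedule(arrivals: dict, max_steps: int) -> list:
    """Run adaptations step by step; lower priority numbers run first.

    arrivals maps a tick to a list of (priority, name, steps_needed) tuples.
    An adaptation that has used max_steps without finishing is terminated.
    """
    ready = []  # heap of [priority, name, remaining_steps, used_steps]
    log = []
    tick = 0
    last_arrival = max(arrivals, default=-1)
    while ready or tick <= last_arrival:
        for priority, name, steps in arrivals.get(tick, []):
            heapq.heappush(ready, [priority, name, steps, 0])
        if ready:
            job = heapq.heappop(ready)   # the highest-priority adaptation wins the tick
            job[2] -= 1
            job[3] += 1
            if job[2] == 0:
                log.append(f"{tick}: {job[1]} done")
            elif job[3] >= max_steps:
                log.append(f"{tick}: {job[1]} terminated")
            else:
                log.append(f"{tick}: {job[1]} step")
                heapq.heappush(ready, job)  # may be preempted before its next step
        tick += 1
    return log
```

For example, `schedule({0: [(2, "rebalance", 3)], 1: [(1, "restart_lb", 1)]}, max_steps=5)` runs one step of `rebalance`, lets `restart_lb` preempt it at tick 1, and then resumes `rebalance` until it finishes.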
Ideas
- Explore alternative integration architectures
  - Hierarchical, e.g., master-slave
  - Cooperative, e.g., pub-sub based
  - Nearest-neighbor topologies, e.g., combined with self-stabilizing algorithms
- Social networks in which nodes are managed systems
  - Systems are part of social ensembles with similar interests and trust relations
Other Challenges
- Assurances
  - Resiliency of adaptive mechanisms themselves
  - Run-time simulation, model checking, testing
- Cyber-Physical Systems
  - Combining physical and software control
- Security
  - Many flavors
  - Adversarial nature of the environment
- Benchmark problems and metrics
  - Znn is a start, but need more
- Compositionality

Other Challenges
- Reconciling multiple mechanisms
  - Low and high level
  - Reactive and proactive
  - Immediate and deliberative
  - Automated and human-directed
  - Centralized and distributed
Summary
- The field has grown substantially from the early days of WOSS '02 to the point where it is a relatively well-established subfield
- First generation systems, like Rainbow, have taught us a lot
- Many challenges remain

Acknowledgements
- Funders
  - DARPA: Initial investigations
  - NSF: Refinements and currently Fault Localization
  - CMU-Portugal: Analysis, Planning
  - Samsung Electronics: Applications to manufacturing control
  - NSA, Navy, Army
- Graduate students and colleagues (many)
  - Especially Owen Cheng and Bradley Schmerl
- More information: www.cs.cmu.edu/~able
The End

SEAMS
- SEAMS 2012 in Zürich, Switzerland
- SEAMS 2011 in Hawaii, USA
- SEAMS 2010 in Cape Town, South Africa
- SEAMS 2009 in Vancouver, British Columbia, Canada
- SEAMS 2008 in Leipzig, Germany
- SEAMS 2007 in Minneapolis, Minnesota, USA
- SEAMS 2006 in Shanghai, China
- WOSS 2004 in Newport Beach, CA
- WOSS 2002 in Charleston, SC
Select References
- Rainbow
  - Software Architecture-Based Self-Adaptation. David Garlan, Bradley Schmerl and Shang-Wen Cheng. Autonomic Computing and Networking, ISBN 978-0-387-89827-8, Springer, 2009.
  - Increasing System Dependability through Architecture-based Self-repair. Garlan, Cheng & Schmerl. Architecting Dependable Systems, Springer-Verlag, 2003.
- Diagnosis
  - Architecture-based Run-time Fault Diagnosis. Paulo Casanova, Bradley Schmerl, David Garlan and Rui Abreu. Proc. of the 5th European Conference on Software Architecture, Sept 2011.
- Prediction
  - [Pol07] Leveraging Resource Prediction for Anticipatory Dynamic Configuration. Poladian et al. Proc. 1st IEEE Intl. Conf. on Self-Adaptive and Self-Organizing Systems (SASO), 2007.