Software Health Management An Introduction. Gabor Karsai Vanderbilt University/ISIS

Similar documents

Proactive, Resource-Aware, Tunable Real-time Fault-tolerant Middleware

The Service Availability Forum Specification for High Availability Middleware

CHAPTER 1: OPERATING SYSTEM FUNDAMENTALS

The EMSX Platform. A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks. A White Paper.

Mixed-Criticality Systems Based on Time- Triggered Ethernet with Multiple Ring Topologies. University of Siegen Mohammed Abuteir, Roman Obermaisser

Methods and Tools For Embedded Distributed System Scheduling and Schedulability Analysis

Embedded Systems Lecture 9: Reliability & Fault Tolerance. Björn Franke University of Edinburgh

Real-Time Component Software. slide credits: H. Kopetz, P. Puschner

FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING

The Temporal Firewall--A Standardized Interface in the Time-Triggered Architecture

Replication on Virtual Machines

HRG Assessment: Stratus everrun Enterprise

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

MultiPARTES. Virtualization on Heterogeneous Multicore Platforms. 2012/7/18 Slides by TU Wien, UPV, fentiss, UPM

Trends in Embedded Software Engineering

EWeb: Highly Scalable Client Transparent Fault Tolerant System for Cloud based Web Applications

Modular Communication Infrastructure Design with Quality of Service

Safety Issues in Automotive Software

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL

Verifying Semantic of System Composition for an Aspect-Oriented Approach

High Availability Storage

Designing Real-Time and Embedded Systems with the COMET/UML method

How To Understand The Concept Of A Distributed System

Multiagent Reputation Management to Achieve Robust Software Using Redundancy

Real Time Programming: Concepts

Fault Tolerance & Reliability CDA Chapter 3 RAID & Sample Commercial FT Systems

The MILS Component Integration Approach To Secure Information Sharing

Diagnosis and Fault-Tolerant Control

Aperiodic Task Scheduling

Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

Using UML Part Two Behavioral Modeling Diagrams

Unified Batch & Stream Processing Platform

FactoryTalk View Site Edition V5.0 (CPR9) Server Redundancy Guidelines

EVALUATION. WA1844 WebSphere Process Server 7.0 Programming Using WebSphere Integration COPY. Developer

Architecture Design & Sequence Diagram. Week 7

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

The ConTract Model. Helmut Wächter, Andreas Reuter. November 9, 1999

How To Secure Cloud Computing

Software Defined Security Mechanisms for Critical Infrastructure Management

System Aware Cyber Security

Principles and characteristics of distributed systems and environments

Providing Load Balancing and Fault Tolerance in the OSGi Service Platform

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM

Embedded Systems. 6. Real-Time Operating Systems

A Data Centric Approach for Modular Assurance. Workshop on Real-time, Embedded and Enterprise-Scale Time-Critical Systems 23 March 2011

Quotes from Object-Oriented Software Construction

How To Test Automatically

Expansion of a Framework for a Data- Intensive Wide-Area Application to the Java Language

Real Time Network Server Monitoring using Smartphone with Dynamic Load Balancing

Contents. System Development Models and Methods. Design Abstraction and Views. Synthesis. Control/Data-Flow Models. System Synthesis Models

Verification of Triple Modular Redundancy (TMR) Insertion for Reliable and Trusted Systems

Chapter 3 Operating-System Structures

DeviceNet Communication Manual

Cloud Resilient Architecture (CRA) -Design and Analysis. Hamid Alipour Salim Hariri Youssif-Al-Nashif

MODEL DRIVEN DEVELOPMENT OF BUSINESS PROCESS MONITORING AND CONTROL SYSTEMS

Stretching A Wolfpack Cluster Of Servers For Disaster Tolerance. Dick Wilkins Program Manager Hewlett-Packard Co. Redmond, WA dick_wilkins@hp.

From Control Loops to Software

Chapter 1 - Web Server Management and Cluster Topology

Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce

SCALABILITY AND AVAILABILITY

Distributed Databases

Safety Requirements Specification Guideline

Availability Digest. Stratus Avance Brings Availability to the Edge February 2009

FIPA agent based network distributed control system

Remote Copy Technology of ETERNUS6000 and ETERNUS3000 Disk Arrays

secure intelligence collection and assessment system Your business technologists. Powering progress

Cloud Computing. Up until now

Oracle WebLogic Server 11g Administration

Software Architecture Case Study. Air Traffic Control - Designing for High Availability

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

EonStor DS remote replication feature guide

AndroLIFT: A Tool for Android Application Life Cycles

Augmented Search for Web Applications. New frontier in big log data analysis and application intelligence

Disaster Recovery, Business Continuity, and Hosting Solutions for IFS Applications IFS CUSTOMER SUMMIT 2011, CHICAGO

Software Testing Strategies and Techniques

Budapest University of Technology and Economics Department of Measurement and Information Systems. Business Process Modeling

Managing a Fibre Channel Storage Area Network

Applying 4+1 View Architecture with UML 2. White Paper

A Framework for Highly Available Services Based on Group Communication

Value Paper Author: Edgar C. Ramirez. Diverse redundancy used in SIS technology to achieve higher safety integrity

Review from last time. CS 537 Lecture 3 OS Structure. OS structure. What you should learn from this lecture

Architectures for Distributed Real-time Systems

Introduction to Embedded Systems. Software Update Problem

Cisco Active Network Abstraction Gateway High Availability Solution

Managing and Maintaining a Windows Server 2003 Network Environment

Suppor&ng the Design of Safety Cri&cal Systems Using AADL

Transcription:

Software Health Management An Introduction Gabor Karsai Vanderbilt University/ISIS Tutorial at PHM 2009

Outline Definitions Backgrounds Approaches Summary

Definitions Software Health Management: A branch of System Health Management that applies health management techniques to the controlling software of a system. SHM goes beyond classical fault tolerance Software Fault Tolerance Fault detected Functionality restored Software Health Management Anomaly detected Fault source isolated Fault mitigated Fault prognosticated When the fault happens, SFT reacts Software and system health is managed

Software Health Management Goals: To prevent a (software) fault from becoming a (system) failure Manage health of the software Sense, analyze, and act upon health indicators Provide (relevant) information to operator, maintainer, designer Assumption: Software health is a measurable, non-binary property

Software Health Management Characteristics Performed at run-time, on the running system Includes all phases of health management: Detection: detect anomalous behavior Isolation: isolate source of fault (component, failure mode) Mitigation: take action to reduce/eliminate impact of fault Prognostics: predict impending faults and failures Can be highly mode- and mission/goal-dependent

Backgrounds: Basics of Software Fault Tolerance Definition: Software Fault Tolerance: Methods and techniques to implement software that can tolerate faults in itself, in the platform it is running on, in the hardware system it is connected to, in the environment

Backgrounds: Basics of Software Fault Tolerance Why? Serves as a foundation for SHM See Fault-Tolerance vs. System Health Management What? Follows the (HW) Fault Tolerance principles in SW Literature: Wilfredo Torres-Pomales: Software Fault Tolerance: A Tutorial, NASA/TM-2000-210616, Langley Research Center, 2000 CREDIT Software Fault Tolerance, Edited by Michael R. Lyu, Published by John Wiley & Sons Ltd. Google: Software Fault Tolerance

Basics of Software Fault Tolerance Single version Definition: FT for a software component (module, application, service, ) one version of the component (code) is used Architectural issues Foundation for SFT: the architecture Component-oriented architecture Modularization horizontal partitioning Layering vertical partitioning Common thread: prevent propagation of failures (H + V)

Basics of Software Fault Tolerance Single version Detection Requires: Self protection: component protects itself from outside effects Self checking: component detects its own faults and prevents their propagation Concepts / Techniques: Replication checks: components replicated and results compared Timing checks: deadlines, response times, Reversal checks: inverse function: output input Coding checks: use redundancy in representations, e.g. CRC Reasonableness checks: value/range/rate/sequence of data Structural checks : verify data structure integrity

Basics of Software Fault Tolerance Single version Exceptions and their management Language-based mechanisms C++, Java, Ada, not in C! Hierarchical nesting (per control flow) Incorrect requirements/design can lead to major problems (Ariane 5) Categories: Interface exceptions: self-protecting component raises it Local exceptions: generated and contained w/in component Failure exceptions: local management failed, global actions is needed

Basics of Software Fault Tolerance Single version Checkpoints and restarts Detect and restart Categories: Static : reset to an initial state Dynamic : checkpoint state, restore previous one upon failure Problems: non-invertible actions Process pairs Identical versions Separate processors State checkpointed On fault, backup takes over Component Checkpoint Checkpointed state Primary State copy Backup Restart Selector Error detector Error detector

Basics of Software Fault Tolerance Multi version Definition: FT for software system multiple versions of component/s (code) are used Multiple versions: Issues Same spec Diversity: in design, implementation, language, compiler, processor, etc. + independent teams Specification errors (e.g. omissions) could be a common source of faults Experimental result: faults are not really independently distributed over the input space underlying similarities in design/implementation/etc. and faults?

Basics of Software Fault Tolerance Multi version Recovery blocks Create checkpoint before start If version fails, try another one (use checkpointed state) Alternatives can provide graceful degradation Checkpoint state State copy Primary Alt 1 Alt 2 Alt n Selector Acceptance test N-version programming Independent alternatives Generic voter selects Alt 1 Alt 2 Voter Alt n

Basics of Software Fault Tolerance Multi version N self-checking Alt 1 Acceptance Test Each alternative is selfchecking Selection logic selects best Alt 2 Acceptance Test Selection Logic Alt n Consensus-based If the selection algorithm fails to find a correct output then an output is chosen that has passed the acceptance test Alt 1 Alt 2 Alt n Acceptance Test Selection Algorithm Error Switch Failure Acceptance Test Switch

Basics of Software Fault Tolerance Multi version Output selection issues Acceptance tests are hard to build Voters may have to work with inexact comparisons Two-step process: Filtering via acceptance tests Arbitration step to choose output Generalized voters: Majority, median, plurality, weighted averaging, Choice must be based on system level issues Reliability, safety, availability, etc.

Software Fault Tolerance vs. Software Health Management Complexity of systems necessitates an additional layer above SFT that manages the Software Health Why? Software is a crucial ingredient in aerospace systems Software as a method for implementing functionality Software as the universal system integrator Software could exhibit faults that lead to system failures Software complexity has progressed to the point that zero-defect systems (containing both hardware and software) are very difficult to build Systems Health Management is an emerging field that addresses precisely this problem: How to manage systems health in case of faults?

Software Health Management and System Health Management What is System Health Management? The on-line view: Detection of anomalies in system or component behavior Identification and isolation of the fault source/s Prognostication of impending faults that could lead to system failures Mitigation of current or impending fault effects while preserving mission objective/s Reports Observations Isolation Corrections Detection Mitigation Prognostics

Design for Software Health Management Component-oriented software architecture Systems are built by composing components via welldefined interfaces and composition principles There is a (highly robust and reliable) component framework that mitigates all component interactions Component framework is built to higher integrity/quality standards than application software (e.g. RTOS vs. app) Beyond classical architecture-based SFT: No single fault assumption multiple faults are possible Cascading fault effects are also possible Software Health Management is a system-level function it must be integrated with System Health Management

Example: Component Model Parameter Subscribe (Event) Provided (Interface) Component Publish (Event) Required (Interface) State Resource Trigger A component is a unit (containing potentially many objects). The component is parameterized, has state, it consumes resources, publishes and subscribes to events, provides interfaces and requires interfaces from other components. Publish/Subscribe: Event-driven, asynchronous communication Required/Provided: Synchronous communication using call/return semantics. Triggering can be periodic or sporadic.

Example: Component Interactions Sampler Component GPS Component Display Component P S S Components can interact via asynchronous/event-triggered and synchronous/call-driven connections. Example: The Sampler component is triggered periodically and it publishes an event upon each activation. The GPS component subscribes to this event and is triggered sporadically to obtain GPS data from the receiver, and when ready it publishes its own output event. The Display component is triggered sporadically via this event and it uses a required interface to retrieve the position data from the GPS component.

Design for Software Health Management Component-level health management Very localized limited capability, yet needed for higher levels Monitor component detect anomalies What to monitor Input and output: pre- and post-conditions on incoming and outgoing synchronous calls and asynchronous events State: invariants over the component state Timing: component operation execution time Execution (response) time Frequency of invocation Resource usage: component resource consumption patterns Memory, resource lock/unlock, etc. How to monitor Momentary values Rates History/trends

Component Monitoring Monitor arriving events Monitor published events Component Monitor incoming calls Monitor outgoing calls Observe state Monitor resource usage Monitor control flow/ triggering

Component-level Health Management A Component Level Health Manager reacts to detected events and takes mitigation actions. It also reports events to higher-level manager/s. Events: detected by monitoring Actions: Basic mitigation: reset, init, shutdown, destroy, checkpoint/restore Intercept related: allow/block call Specialized mitigation: inject event, call method, deallocate memory, release resource, Event or time-triggered activation Reporting Report events/actions to other managers Component Monitor Actions Events Component Framework Manager s behavioral model: - Finite-state machine - Triggers: monitored events, time - Actions: mitigation activities Manager Events Manager is local to component container (for efficiency) but must be protected from the faults of functional components.

Component-level Health Management Manager behavior: Track component state changes via detected events and progression of time Take mitigation actions as needed Design issues: Co-location with component Fault containment Efficiency Local detection may implicate another component Mitigation action may include blocking the call, overriding data Complexity of mitigation actions Verification of mitigation logic Safety conditions Performance issues Manager encapsulates all HM Logic Idle start timeout Exec finish /init /reset Component Monitor inva_violation Actions Events Component Framework WCET InvA Events Manager

Design for Software Health Management System-level health management Multiple components can fail, independently Fault effects cascade through components Anomalies (with cascading effects) and faults propagating through components and assemblies must be correlated and managed Diagnosis: Isolate the fault source component Mitigation: Take (component-)local or global action to mitigate effect of fault/s

Design for Software Health Management A system-level fault model: Timed Failure Propagation Graph Failure Mode FM1 FM2 Unmonitored Discrepancy (OR) 2,3 a,b 1,4 a,c 1,6 a,b,c 1,3 b,c D1 SD1 D2 Monitored Discrepancy (AND) 1,3 b,c 1,4 a,c 1,3 b,c t=3 t=6 2,5 a Current Time = 10 Op. Modes = {a,b,c}, Current Mode = b Alarm sequence: {(3,D2), (6,D3), (8,D4)} Propagation interval t=8 D4 1,6 a,b,c D3 F: set of failure modes D: set of discrepancies Discrepancy attributes: Type: {AND, OR} Condition: {Monitored, unmonitored} M: Set of operating modes E: set of edges Edge attributes: Propagation interval: [t min,t max ] Activation modes Abdelwahed, S., G. Karsai, and G. Biswas, "A Consistency-based Robust Diagnosis Approach for Temporal Causal Systems", 16th International Workshop on Principles of Diagnosis (DX '05), Monterey, CA, June, 2005.

Design for Software Health Management System-level health management Model: Faults (failure modes) and discrepancies (observed anomalies) can be located in different components Fault propagation occurs along component communication links / call chains Diagnosis: Correlate observations across multiple components, deduce fault source Features: modal, robust, ranked results, multiple faults

Design for Software Health Management System-level health management: Multi-component diagnosis Component Fault Model Component Fault Model D FM D FM FM D FM D D FM D D Diagnoser Manager Managed Component Component CHM Managed Component Component CHM Component Platform

Design for Software Health Management System-level health management: Multi-component, hierarchical mitigation Local: reflex reactions Regional: mitigation in an area Global: system-level mitigation Dubey, A., S. Nordstrom, T. Keskinpala, S. Neema, T. Bapty, and G. Karsai, "Towards a verifiable real-time, autonomic, fault mitigation framework for large scale real-time systems", ISSE, vol. 3, pp. 33--52, 2007.

Summary Software Health Management: A branch of System Health Management that applies HM techniques to the controlling software of a larger system. Software Fault Tolerance provides useful techniques for SHM, but SHM reaches beyond SFT as it has a comprehensive approach to anomaly detection, diagnosis, mitigation and prognostics. Initial progress in the area of component-level and system-level software health management shows promise, but it is subject of ongoing research.