GENIVI Lifecycle Webcast, 30th January 2014
David Yates, Continental Automotive GmbH, Lifecycle topic owner and SysArch member
Slide date: 29-Jan-14
Dashboard image reproduced with the permission of Visteon and 3M Corporation. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Scope of Presentation
The aim of this presentation is to provide an overview of the Lifecycle architecture within GENIVI, detailing where we believe the automotive world requires extensions to existing open source solutions. The following topics will be covered:
- Welcome & Introduction
- Lifecycle Domain Overview
- Component Overview
- Startup/Shutdown Concept
- Introduction to NSM (session management)
- Introduction to Resource
- Roadmap
- Call to action
- Location of further information (AMM presentations)
Lifecycle Domain Overview
[Diagram: the Lifecycle Domain, its interfaces and its sub-domains. Recoverable labels:]
- Interfaces: Set events (session information); Request state change; Get node states; Node state change notification; Set last user context
- Internal interfaces: Get internal supply/thermal states; Supply state change notification; Thermal state change notification
- Sub-domains: Supply, Thermal, Node State, Power, Boot, Resource
- Responsibilities: Startup/shutdown management; Resource limitation; Run time observation
- Neighbouring domains: Persistency, Log & Trace
Lifecycle Manifest
[Diagram: packages and components of the Lifecycle domain. Legend: Package, Product Component, Platform Component. Recoverable labels:]
- Packages: Supply, Node State, Boot, Thermal, Power, Resource
- Components: Node State Machine, Node State Manager, Node Startup Controller, Supply Manager, systemd, Thermal Manager, cgroup service, Power Event Collector, Node Resource Mgr, Node Health Monitor
Lifecycle Concept
[Diagram: the Lifecycle sub-domains and how they interact. Recoverable labels:]
- Supply: plug-in for ADC, PMIC; Supply state chart; reaction on conditions (turn off display, drives, mute audio, ...)
- Thermal: plug-in for sensors, devices; Thermal state chart; reaction on conditions (turn on fan, reduce audio volume, ...)
- Power: plug-in for wakeup reason, node / vehicle network
- Node State: Node State state chart; set LUC (Last-User-Context); session handling (Phone, Diag, SWL, ...); node state change protocol; shutdown management
- Boot: boot config; boot of HMI, Phone, SWL/Update, Diagnostics
- Resource: node observing for CPU load, memory, application crash; plug-in for application-specific observing and recovering; Node Resource config
- Events: Good, Poor, Bad, Clamp State, Button WU, Bus WU, Vehicle Network
- *1: state change notifications
- Other labels: Ctrls
Startup/Shutdown
Boot takes care of startup. Node State takes care of shutdown.
Why do we have this split? During its own shutdown, systemd stops and unloads all components. If the shutdown is then cancelled, it takes a lot of time to make them functional again. An IVI system must be able to resume operation without losing any context and without the need for a reboot.
Therefore Node State will only call registered consumers in the shutdown phase. This event notification will drive the components into a stable state and ensure that everything needed for the next startup has been stored.
With this approach alone, components would never actually be shut down, yet a real shutdown is required for certain exceptions such as the flash file system. Therefore the shutdown management concept will additionally include/use the systemd shutdown concept, where appropriate, for legacy/critical components.
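
The consumer notification idea above can be sketched in a few lines of Python. This is a minimal illustration, not the Node State Manager API: the class name `ShutdownRegistry`, its methods and the callback shape are all hypothetical, and the reverse-of-registration notification order is an assumption.

```python
from typing import Callable, List, Tuple

Callback = Callable[[], None]


class ShutdownRegistry:
    """Hypothetical stand-in for the consumer list kept by Node State."""

    def __init__(self) -> None:
        # (name, on_shutdown, on_resume) in registration order.
        self._consumers: List[Tuple[str, Callback, Callback]] = []

    def register(self, name: str, on_shutdown: Callback, on_resume: Callback) -> None:
        self._consumers.append((name, on_shutdown, on_resume))

    def begin_shutdown(self) -> List[str]:
        """Ask every consumer to store its state; nothing is unloaded.

        Assumption: consumers are notified in reverse registration order.
        """
        notified = []
        for name, on_shutdown, _ in reversed(self._consumers):
            on_shutdown()
            notified.append(name)
        return notified

    def cancel_shutdown(self) -> List[str]:
        """Resume consumers in registration order; no reboot is needed."""
        resumed = []
        for name, _, on_resume in self._consumers:
            on_resume()
            resumed.append(name)
        return resumed
```

Because the consumers stay loaded, cancelling the shutdown only costs one callback per consumer rather than a full restart of each service.
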
Node State Manager: Shutdown preparation in Startup Phase
[Diagram: startup sequence with the points at which shutdown consumers (A, B, C ... J) register. Legend: Runlevel replacement, GENIVI extensions. Recoverable labels, in order:]
- Before systemd: kernel, initrd
- Mandatory targets (Base System & Early Features); start NSM via systemd
- BASE_RUNNING (during Node Startup Controller init)
- focussed.target (last user context)
- LUC_RUNNING
- unfocussed.target(s)
- FULLY_RUNNING
- FULLY_OPERATIONAL
- lazy.target
Node State Manager: Shutdown Execution
[Diagram: shutdown sequence. Recoverable labels, in order:]
- Consumer J
- Consumer I: Writing LUC
- Consumer H: Node Startup Controller, systemd, app1.service
- Consumer G
- Consumer F
- Consumer E: Node Startup Controller, systemd, app2.service
- Consumer D: Writing LUC
- Consumer C
- Consumer B
- Consumer A: Node Startup Controller, systemd, Shutdown.target (flash file systems)
Enables:
1. Shutdown activities are triggerable without unloading the components.
2. Legacy components can be shut down in their traditional way.
3. Full flexibility on where to integrate systemd based shutdown units.
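NSM Session
[Diagram: session handling between applications, the Node State Manager and the Node State Machine. Recoverable labels:]
- Applications: Phone, SWL, Audio, HMI, Navigation (Navi)
- Node State Manager: holds Node State, PhoneSession, SWLSession, ...
- Node State Machine: receives events/data (Node, Session State, PhoneSession, SWLSession, ...)
- Interactions: Set method; Signal; Lifecycle Requests; Request system restart; Shutdown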
Resource - Goals
Resource management contains the functionality to ensure that the node runs in a stable and defined manner. To do this, it will monitor and limit different aspects of SW component behavior, including system resources (e.g. CPU load and memory) and critical run-time observation.
Resource allocation will be configurable on a component basis through the use of cgroups.
Health management will provide a configurable escalation strategy defining actions to be taken in the case of system failures.
Note: security handling for resources (i.e. restricted access to resources) is not included.
Health
Health will ensure that the node runs in a stable and defined manner. To do this, it is planned to have the following multi-layered observation system and escalation strategy:
[Diagram. Recoverable labels:]
- Applications: read/write data (Persistence); notify alive
- systemd: start/restart applications; execute recovery (Recovery Apps)
- Recovery Client: delete app data; request app restart; request node restart
- NHM: monitoring of userland; request node restart
- NSM: notify alive and monitor node status
- /dev/watchdog: notify alive; forward NHM heartbeat externally or to internal HW watchdog
Concepts for the System Health - NHM
The Node Health Monitor will work in conjunction with systemd to monitor component failures in the system. It will be responsible for:
- Monitoring systemd to automatically record and track application failures
- Providing an interface with which components can register failures when not using the systemd watchdog mechanism
- Maintaining failure statistics over multiple lifecycles for the system and components
  - the service name will be used to identify and track component failures
  - statistics on the number of failures in a number of lifecycles will be maintained (e.g. 3 failures in the last 32 lifecycles)
- Monitoring the wakeup and shutdown events to catch unexpected system restarts
- Providing an interface for components to read system and component error counts
- Providing an interface for recovery applications to request a node restart
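
The "failures in the last N lifecycles" statistic can be sketched as a sliding window keyed by service name. This is an illustrative sketch only: `FailureStats` and its methods are hypothetical names, not the Node Health Monitor interface, and persistence across reboots is left out.

```python
from collections import deque
from typing import Deque, Dict


class FailureStats:
    """Counts failures per service over the most recent `window` lifecycles."""

    def __init__(self, window: int = 32) -> None:
        self.window = window
        # One counter per lifecycle; the last entry is the current lifecycle.
        self._history: Dict[str, Deque[int]] = {}

    def record_failure(self, service: str) -> None:
        history = self._history.setdefault(service, deque([0], maxlen=self.window))
        history[-1] += 1

    def new_lifecycle(self) -> None:
        # Appending to a full deque drops the oldest lifecycle automatically.
        for history in self._history.values():
            history.append(0)

    def failure_count(self, service: str) -> int:
        return sum(self._history.get(service, ()))
```

A recovery application could then ask for `failure_count("app1.service")` and compare it with a product-defined limit.
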
Concepts for the System Health - NHM (cont.)
Additionally, the Node Health Monitor will test a number of product-defined criteria with the aim of ensuring that userland is stable and functional. For instance, it will be able to validate that:
- there is enough free system memory
- the CPU is not reporting an excessively high load for a sustained period
- defined file accessibility is possible
- defined processes are still running
- communication is possible (DBUS)
- a user-defined process can be executed with an expected result
If the NHM believes that there is an issue with userland, it will be able to initiate a system restart.
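
A few of the userland criteria above can be sketched as a pure evaluation over a snapshot of node measurements. All names and thresholds here (`NodeSnapshot`, `Criteria`, `failed_checks`) are hypothetical; how the real NHM gathers its measurements is not shown in the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSnapshot:
    free_memory_kb: int
    load_samples: List[float]      # recent CPU load readings, oldest first
    missing_files: List[str]       # defined files that could not be accessed
    missing_processes: List[str]   # defined processes that are not running


@dataclass
class Criteria:
    min_free_memory_kb: int
    max_load: float
    sustained_samples: int         # must be >= 1


def failed_checks(snapshot: NodeSnapshot, criteria: Criteria) -> List[str]:
    """Return the names of the product-defined criteria that failed."""
    failures = []
    if snapshot.free_memory_kb < criteria.min_free_memory_kb:
        failures.append("memory")
    # Load only counts as a failure when every recent sample is too high.
    recent = snapshot.load_samples[-criteria.sustained_samples:]
    if len(recent) == criteria.sustained_samples and all(
        sample > criteria.max_load for sample in recent
    ):
        failures.append("cpu")
    if snapshot.missing_files:
        failures.append("files")
    if snapshot.missing_processes:
        failures.append("processes")
    return failures
```

A non-empty result would be the trigger for the escalation described above, up to a request for a system restart.
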
Concepts for the System Health - Node Watchdog
It is proposed to use, where supported, a low-level HW watchdog to validate that systemd is running correctly. A typical watchdog implementation is capable of initiating an emergency shutdown process when it believes that a failure has occurred:
- idle init, so nothing new can be started
- kill all processes
- write a reboot record to wtmp
- turn off accounting
- turn off quota
- turn off swap
- unmount all mounted partitions
NOTE: In this scenario a normal system shutdown will not be completed, therefore cached persistent data from that lifecycle will be lost.
Concepts for the System Health - systemd
systemd provides watchdog functionality for monitoring and restarting failing services in the system, and for sending heartbeats itself to a HW watchdog.
Within a service unit file, systemd can be configured to expect a heartbeat from the service within a particular time interval (WatchdogSec=). If this heartbeat is not received, options in the application's unit file tell systemd how to behave. Typically this will result in the application being automatically restarted (Restart=).
The problem is that this can result in a cyclic restart scenario, with only limited options (StartLimitInterval=, StartLimitBurst=) to influence the restart behavior. Therefore, it is proposed that recovery applications are started automatically by systemd (OnFailure=) in case of failures.
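
On the service side, the heartbeat is a `WATCHDOG=1` datagram sent to the socket that systemd names in the `NOTIFY_SOCKET` environment variable; systemd also exports the configured interval as `WATCHDOG_USEC`. The sketch below shows this from Python without any bindings. The unit fragment in the docstring is illustrative only: the service name, path, interval and recovery unit are assumptions, not taken from a GENIVI component.

```python
"""Watchdog heartbeat sketch.

Hypothetical unit file for the service running this code:

    [Unit]
    OnFailure=recovery-app1.service

    [Service]
    Type=notify
    ExecStart=/usr/bin/app1
    WatchdogSec=30
    Restart=on-failure
"""
import os
import socket
from typing import Optional


def sd_notify(message: str) -> bool:
    """Send a notification to systemd; return False when not supervised."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading '@' denotes a Linux abstract socket address.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), path)
    return True


def heartbeat_interval_seconds() -> Optional[float]:
    """Half of the configured watchdog interval, or None if no watchdog is set."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2
```

A service would call `sd_notify("READY=1")` once it has started and then `sd_notify("WATCHDOG=1")` every `heartbeat_interval_seconds()` from its main loop, so that a hung loop stops the heartbeat.
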
Concepts for the System Health - Recovery Client
A Recovery Client is a component that is executed when a failure has been detected in the system. There can be a one-to-one relationship between apps and recovery clients, or one client can handle multiple apps. It should contain enough functionality to be able to:
- request the error status count from the NHM, providing the name of the failing service file
- based on the error count, attempt recovery, for instance:
  - if a file system fails to mount, the recovery action could be to format the file system and request a node restart
  - if an application has failed multiple times, we may want to delete that application's persistency data and restart the application
- when possible, request that the SW is uninstalled or rolled back
- request systemd to restart the application
- request a node restart via the NHM
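
One way to picture the decision logic of a Recovery Client is as a small escalation function over the error count. The thresholds, the ordering of the steps and all names (`Action`, `choose_action`) are assumptions made for illustration; the slides leave the actual strategy to the product.

```python
from enum import Enum


class Action(Enum):
    RESTART_APP = "request systemd to restart the application"
    DELETE_DATA_AND_RESTART_APP = "delete persistency data, then restart the application"
    ROLLBACK_SOFTWARE = "request uninstall or rollback of the SW"
    RESTART_NODE = "request a node restart via the NHM"
    FORMAT_AND_RESTART_NODE = "format the file system, then request a node restart"


def choose_action(failure_count: int, kind: str = "application") -> Action:
    """Pick a recovery action from the failure count reported by the NHM.

    `kind` is "filesystem" for a failed mount, anything else for an application.
    The numeric thresholds below are hypothetical.
    """
    if kind == "filesystem":
        return Action.FORMAT_AND_RESTART_NODE
    if failure_count <= 1:
        return Action.RESTART_APP
    if failure_count <= 3:
        return Action.DELETE_DATA_AND_RESTART_APP
    if failure_count <= 5:
        return Action.ROLLBACK_SOFTWARE
    return Action.RESTART_NODE
```

Keeping the decision in one pure function makes the escalation strategy easy to configure and to test separately from the code that talks to systemd or the NHM.
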
Resource
[Diagram: Resource and Node State Mgmt. Recoverable labels:]
- systemd: starts services; configures cgroups
- cgroups: control system resources
- Node Resource Manager: report/handle resource allocation errors; monitor system resources; kill resource abusers; evaluate node restart requests
- Node State Manager: handle node restart requests
- Other labels: Application Component, P3, Supply, Control Logic
Example cgroup configuration (CPU)
[Diagram: cgroup hierarchy with CPU settings. Recoverable labels:]
- ROOT: unlimited
- AUTOMOTIVE: cpu.shares = 50, runtime = 100, period = 1000
- APPS: cpu.shares = 20, runtime = 500, period = 2000
- BGND: cpu.shares = 1
- Components shown: Radio, NAV, Speech, Weather, 3rd party APPS, Media, Phone, Browser, Diagnostics, Safety Cameras, Positioning, Comm Stacks, Background tasks, Infrastructure Services, SW Loading, Vehicle Network, PDC, Kernel
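
To get a feel for these numbers: in the cgroup CPU controller, cpu.shares is a relative weight that only matters when groups compete for the CPU. The sketch below works out the resulting split, assuming the three groups are siblings and all are busy at the same time. It also treats runtime/period as a bandwidth ceiling in the same time unit, which is an interpretation of the slide rather than something it states; the function names are hypothetical.

```python
from typing import Dict


def contended_cpu_fractions(shares: Dict[str, int]) -> Dict[str, float]:
    """CPU fraction each sibling group gets when every group is busy."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


def bandwidth_ceiling(runtime: int, period: int) -> float:
    """Upper bound on CPU time per period (assumed meaning of runtime/period)."""
    return runtime / period


if __name__ == "__main__":
    split = contended_cpu_fractions({"AUTOMOTIVE": 50, "APPS": 20, "BGND": 1})
    for name, fraction in split.items():
        print(f"{name}: {fraction:.1%} under full contention")
    print(f"AUTOMOTIVE ceiling: {bandwidth_ceiling(100, 1000):.0%}")
    print(f"APPS ceiling: {bandwidth_ceiling(500, 2000):.0%}")
```

When a group is idle, its unused share is available to the others, so the weights describe the worst case rather than a fixed partition.
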
Example cgroup configuration (Memory)
[Diagram: cgroup hierarchy with memory settings. Recoverable labels:]
- ROOT: unlimited
- APPS: memory.limit_in_bytes = 200M..
- BGND: memory.limit_in_bytes = 10M
- Components shown: Radio, NAV, Speech, Weather, 3rd party APPS, Media, Phone, Comm Stacks, PDC, Browser, Diagnostics, Safety Cameras, Positioning, Background tasks, Infrastructure Services, SW Loading, Vehicle Network, Kernel
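
The memory limits are written with size suffixes, which the cgroup memory controller accepts as powers of 1024 (K, M, G). A small helper, with the hypothetical name `parse_memory_limit`, shows what values such as 10M amount to in bytes:

```python
_SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_memory_limit(text: str) -> int:
    """Convert a limit such as '10M' or '4096' into a number of bytes."""
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[suffix]
    return int(text)


if __name__ == "__main__":
    for limit in ("200M", "10M"):
        print(limit, "=", parse_memory_limit(limit), "bytes")
```
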
Lifecycle Roadmap
[Roadmap table with the columns Gemini, Horizon and Roadmap. Recoverable entries:]
- systemd, cgroup service: adopted components from the OSS community (specific)
- Node Startup Controller: owned component, funded by GENIVI, implemented by Codethink (specific)
- Node State Manager: owned component, implemented by Continental (abstract, specific)
- Node State Machine: product-specific extension to the Node State Manager (placeholder, placeholder, placeholder)
- Node Resource Mgr: owned component, to be implemented by Continental (specific)
- Node Health Monitor: owned component, implemented by Continental (abstract, specific)
Call to action
We hope today's presentation has interested you in what we are working on within GENIVI, and in the Open Source Software that we have already released and plan to release in the future.
The components described today have been defined and created within the GENIVI consortium as Open Source Software under the MPLv2 licence. For that reason the code is freely available in a public git repository outside of GENIVI.
If you are interested in the components and can see other potential uses in your domains, then please check out the links on the following slides. We are very open to input and requirements from all interested parties, so please ask questions and get involved.
http://projects.genivi.org/node-state-manager/about
Call to action (continued)
For those already working inside GENIVI who wish to contribute directly in the System Infrastructure group: we are always looking for more participants and have many ongoing topics which might interest you:
- Persistency
- User
- SW
- IPC
- Diagnostics
Please feel free to check out the GENIVI Wiki page, where you can find more information about the above topics and how to participate in our weekly telephone conference calls.
https://collab.genivi.org/wiki/display/genivi/system+infrastructure+expert+group
Further Information
If you are interested in further information regarding the GENIVI Lifecycle concept, you can find it within the GENIVI Wiki and the public project page:
https://collab.genivi.org/wiki/display/genivi/sysinfraeglifecycledef (restricted)
http://projects.genivi.org/node-state-manager/about (open)
All presentations of the concepts can be found using this link: Lifecycle Presentations
The code for the Node State Manager, the Node Startup Controller and the Node Health Monitor can be found in the GENIVI git:
http://git.projects.genivi.org/?p=lifecycle/node-state-manager.git
http://git.projects.genivi.org/?p=lifecycle/node-startup-controller.git
http://git.projects.genivi.org/?p=lifecycle/node-health-monitor.git
You can also contact me directly (David.Yates@continental-corporation.com).
Copyright GENIVI Alliance 2012