Computer Lab Software Fault-tolerance: Task Process Pairs
Systems Engineering Group
Dresden University of Technology
http://wwwse.inf.tu-dresden.de/
January 25, 2013

One task less!
- Too many tasks to present!
- Task state correction removed
- Do you still need tasks? Talk to me at the end of the session...

Outline
Task Process Pair:
1. Process Pairs
2. Scenario
3. Task

Process Pairs
- a watch dog process monitors an unreliable worker process
- the watch dog spawns a new worker as soon as the old one crashes

Implementation
- the watch dog uses fork to spawn a new child process
- the child process starts offering its service right after the fork
- the parent process uses waitpid to block as long as the child process is running
- when waitpid returns, the child has crashed, so the watch dog spawns a new child

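The fork/waitpid loop described above can be sketched as follows. This is a minimal illustration, not the code of watch_dog.cc: the helper name `supervise`, the `attempt` parameter, and the `max_restarts` limit are assumptions made for the example, and the sketch treats only a clean exit with code 0 as "worker finished".

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <csignal>
#include <functional>

// Hypothetical helper, not part of watch_dog.cc.
// Runs `worker` in a forked child and restarts it after every crash.
// `attempt` tells the worker how many predecessors crashed before it.
// Returns the number of restarts that were needed, or -1 if fork/waitpid
// failed or the worker still crashed after `max_restarts` restarts.
int supervise(const std::function<void(int)>& worker, int max_restarts) {
    assert(max_restarts >= 0);
    for (int attempt = 0; attempt <= max_restarts; ++attempt) {
        const pid_t pid = fork();
        if (pid < 0) {
            return -1;  // fork failed; this sketch does not degrade gracefully
        }
        if (pid == 0) {
            // Child: start offering the service right after the fork.
            worker(attempt);
            _exit(0);  // leave without flushing stdio buffers copied from the parent
        }
        // Parent (watch dog): block for as long as the child is running.
        int status = 0;
        pid_t waited;
        do {
            waited = waitpid(pid, &status, 0);
        } while (waited < 0 && errno == EINTR);
        if (waited < 0) {
            return -1;
        }
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            return attempt;  // worker finished cleanly
        }
        // Otherwise the worker crashed: loop and spawn a replacement.
    }
    return -1;
}
```

A worker that kills itself on its first two attempts, for example with `raise(SIGKILL)`, makes `supervise` return 2, because two replacements had to be spawned.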
Extensions
Graceful degradation for fork:
- the watch dog process becomes the worker if fork fails
- in case of a crash the watch dog process is not restarted
Check-pointing the worker:
- the worker saves its current state on HDD in reasonable intervals
- before the worker starts with its job, it restores the state of its crashed predecessor

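Check-pointing as described above could look roughly like the sketch below. It assumes that the only state worth saving is the index of the next URL to process; the function names, the file format, and the temporary-file naming are hypothetical and not taken from watch_dog.cc. Writing to a temporary file and renaming it keeps a crash in the middle of a save from leaving a half-written check point behind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical check point: the index of the next URL to test.
// Writes to a temporary file first and then renames it, so a crash
// during the write never replaces a good check point with a partial one.
bool save_checkpoint(const std::string& path, std::size_t next_index) {
    assert(!path.empty());
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        out << next_index << '\n';
        out.flush();
        if (!out) {
            return false;
        }
    }  // the stream is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Returns the saved index, or 0 if no usable check point exists,
// which means a fresh worker starts from the beginning.
std::size_t restore_checkpoint(const std::string& path) {
    std::ifstream in(path);
    std::size_t next_index = 0;
    if (!(in >> next_index)) {
        return 0;
    }
    return next_index;
}
```

A worker would call `restore_checkpoint` once at start-up and `save_checkpoint` after each completed unit of work. The sketch does not force the data to disk, so it protects against process crashes rather than power loss.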
Scenario: Watching the Watch Dog
The environment of the watch dog is unreliable!

HTTP-Watch Dog
Details:
- implemented in C++
    Makefile
    watch_dog.cc
    url.lst
- command line help:

    Error: Wrong number of command line arguments
    Usage: watch_dog <url_file> <timeout> <pause>
      <url_file>: file with URLs to monitor (one per line)
      <timeout>: timeout in ms for requests to the server
      <pause>: pause in ms before starting the next request

Example: url.lst
Content of url.lst:

    http://www.heise.de/security/dienste/browsercheck/tests/activex.shtml
    http://wwwse.inf.tu-dresden.de/
    http://wwwse.inf.tu-dresden.de/does_not_exist.html
    http://www.does.not.exist/

Example

    #> ./watch_dog url.lst 2000 5000
    URL file = url.lst
    timeout = 2000 ms
    pause = 5000 ms
    host = www.heise.de; uri = /security/dienste/browsercheck/tests/activex.shtml
    host = wwwse.inf.tu-dresden.de; uri = /
    host = wwwse.inf.tu-dresden.de; uri = /does_not_exist.html
    host = www.does.not.exist; uri = /
    ===> Successful response from host www.heise.de (193.99.144.85): HTTP/1.1 200 OK
    ===> Successful response from host wwwse.inf.tu-dresden.de (141.76.44.180): HTTP/1.1 200 OK
    ===> Host wwwse.inf.tu-dresden.de (141.76.44.180) does not respond with "success"
    ===> response line: HTTP/1.1 404 Not Found
    ===> Could not find host www.does.not.exist
    ===> Reason: Invalid argument

Environment unreliability
- HTTP Watch Dog crashes nondeterministically within an ongoing request
- reasons are unknown, could be:
    - hardware failure (e.g. in the network interface)
    - software bugs in OS or libraries
    - software bugs in HTTP Watch Dog
- searching for the error would be too expensive and too time-consuming

Task Overview
Increase the reliability of HTTP Watch Dog using the process pair approach.
Extend watch_dog.cc with the following features:
- process pairs to protect the execution of
  void test_server(const URL& url, int timeout)
- graceful degradation in the presence of fork failures
- the worker process saves a check point after each completed HTTP request
- a started worker restores its state from an existing check point
Attention: Do not change the output of function test_server

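Graceful degradation in the presence of fork failures might be structured as in the sketch below. The names `run_protected`, `RunMode`, and the injectable `fork_fn` parameter are assumptions for illustration only; passing the fork call in as a parameter is one possible way to simulate a fork failure without exhausting system resources. The sketch restarts a crashing job indefinitely, which a real solution would combine with check-pointing.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical result type: did the job run in a protected child,
// or did the watch dog have to do the work itself?
enum class RunMode { Protected, Degraded };

// Hypothetical helper, not part of watch_dog.cc.
// `fork_fn` defaults to the real fork; a test can pass a fake that returns -1.
RunMode run_protected(const std::function<void()>& job,
                      const std::function<pid_t()>& fork_fn = [] { return fork(); }) {
    assert(static_cast<bool>(job));
    for (;;) {
        const pid_t pid = fork_fn();
        if (pid < 0) {
            // Graceful degradation: no child could be created, so the
            // watch dog becomes the worker. A crash is now unprotected.
            job();
            return RunMode::Degraded;
        }
        if (pid == 0) {
            job();     // child: do the work
            _exit(0);  // and leave without running the parent's cleanup
        }
        int status = 0;
        pid_t waited;
        do {
            waited = waitpid(pid, &status, 0);
        } while (waited < 0 && errno == EINTR);
        if (waited == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            return RunMode::Protected;
        }
        // The child crashed: loop and fork a replacement.
    }
}
```

Calling `run_protected(job, [] { return static_cast<pid_t>(-1); })` exercises the degraded path: the job then runs inside the calling process.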
Hints
- you do not need to change the functions test_server and read_url_list
- keep the check point as small as possible
- fork as rarely as possible
- do not change the command line usage
- do not change the format of url.lst
- read the man pages of fork and waitpid
- do not add or change any output statement that starts with TEST_PREFIX

Testing your solution
Attention: test your solution before sending it in.
Consider appropriate test strategies:
- simulate several crashes in test_server:
    - use our fault injector (in the initial checkout)
      run: LD_PRELOAD=fault_injector/fault_injector.so ./watch_dog ...
    - or kill the worker from a second terminal (e.g. with kill or Windows Task-Manager)
- simulate fork failures
- test your check pointing solution

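When a worker is killed from a second terminal, the watch dog can see this in the status value filled in by waitpid. The sketch below shows one way to inspect it while testing; `run_and_wait` and `describe_exit` are hypothetical helpers, not functions of watch_dog.cc, and any diagnostic output built from them should stay clear of the TEST_PREFIX statements.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <csignal>
#include <string>

// Hypothetical helper: runs `child_body` in a forked child and returns
// the raw wait status, or -1 if fork or waitpid failed.
int run_and_wait(void (*child_body)()) {
    assert(child_body != nullptr);
    const pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        child_body();
        _exit(0);
    }
    int status = 0;
    pid_t waited;
    do {
        waited = waitpid(pid, &status, 0);
    } while (waited < 0 && errno == EINTR);
    return waited == pid ? status : -1;
}

// Turns a wait status into a short description for debugging.
std::string describe_exit(int status) {
    if (WIFEXITED(status)) {
        return "exited with code " + std::to_string(WEXITSTATUS(status));
    }
    if (WIFSIGNALED(status)) {
        return "killed by signal " + std::to_string(WTERMSIG(status));
    }
    return "stopped, continued, or unknown";
}
```

A child that calls `raise(SIGKILL)` behaves like a worker hit by `kill -9` from another terminal: `WIFSIGNALED` is true and `WTERMSIG` reports `SIGKILL`.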
Conclusion
- add process pairs, check pointing, and graceful degradation
- test your solution with fault injectors
- check in: watch_dog.cc
- meet the deadline to get the certificate

Deadline
Deadline: Feb 15th 2012