Replication on Virtual Machines Siggi Cherem CS 717 November 23rd, 2004
Outline. 1 Introduction: The Java Virtual Machine. 2 Napper, Alvisi, Vin - DSN 2003: Introduction; JVM as state machine; Addressing non-determinism; Implementation; Experiments. 3 Friedman, Kama - SRDS 2003: Introduction; Non-determinism; Design and implementation; Experimentation.
JVM philosophy. Compile once, run everywhere. Java bytecodes: bytecode = instruction set of the Java Virtual Machine; one JVM for each architecture. High-level support: memory management (garbage collection); multithreading support (monitors).
JVM to real machines. Internal components: dynamic class loader; interpreters vs. just-in-time (JIT) compiling; native methods (JNI). Provided libraries: allocation and garbage collection; user-level vs. native threads.
Random technical details. A few characteristics: compact bytecodes (202 instructions); types are preserved for safety and precise GC. Objects are accessible through references: strong, soft, weak, and phantom references. Objects can be shared: passed to a new thread's constructor, or stored in static fields.
2 Napper, Alvisi, Vin - DSN 2003
General idea. Their work: modify the JVM to tolerate fail-stop failures; extends hypervisor-based fault tolerance. Hypervisor model: implement a virtual state machine over the underlying hardware; perform replica coordination in the hypervisor.
State machine approach. Requirements: 1 Determinism: defining replicas. 2 Independence: implementing replicas. 3 Choice replication: ensuring replication. 4 Transparency: guaranteeing single output. A state machine: read set -> [deterministic command] -> write set, plus output to environment.
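The state machine picture can be illustrated with a minimal Java sketch. The names `DetCommand`, `SmReplica`, and `StateMachineDemo` are hypothetical and not taken from the paper; the only point is that two replicas applying the same sequence of deterministic commands end in the same state with the same output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical deterministic command: its writes depend only on the values it reads. */
interface DetCommand {
    void apply(Map<String, Long> state, List<String> output);
}

/** Hypothetical replica: its state plus the output it would send to the environment. */
class SmReplica {
    final Map<String, Long> state = new HashMap<>();
    final List<String> output = new ArrayList<>();

    void run(List<DetCommand> commands) {
        for (DetCommand c : commands) {
            c.apply(state, output);
        }
    }
}

class StateMachineDemo {
    static List<DetCommand> program() {
        List<DetCommand> cmds = new ArrayList<>();
        cmds.add((s, out) -> s.put("x", 2L));               // write set {x}
        cmds.add((s, out) -> s.put("y", s.get("x") * 21));  // read set {x}, write set {y}
        cmds.add((s, out) -> out.add("y=" + s.get("y")));   // output to environment
        return cmds;
    }

    public static void main(String[] args) {
        SmReplica primary = new SmReplica();
        SmReplica backup = new SmReplica();
        primary.run(program());
        backup.run(program());
        System.out.println("same state: " + primary.state.equals(backup.state));
        System.out.println("output: " + primary.output);
    }
}
```

Replacing any of the commands with one that reads the clock would break the equality check, which is the problem the following slides address.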
A state machine for the JVM. Challenges: non-determinism of commands; replication of the sequence of commands; copying read sets; multithreading. Their approach: bytecode execution engines (BEE). A BEE is a state machine; JVM = set of BEEs, one for each application thread; replication happens at the BEE level.
Sources of non-determinism. Some causes of non-determinism: asynchronous commands; non-deterministic commands; non-deterministic read sets; output to environment.
Asynchronous commands. Definition: a command is asynchronous if it can appear anywhere in the BEE's sequence of commands. Examples: hardware interrupts (not an issue for the JVM); asynchronous Java exceptions, i.e. fatal errors (e.g. no resources, deadlocks) and killing another thread via Thread.stop(). Added restrictions: 1 Fatal exceptions are not replicated. 2 Threads must not call Thread.stop().
Non-deterministic commands. Definition: a command is non-deterministic if its write set or output is not uniquely determined by the read-set values. Example: native calls (I/O, clock). Solution: agreement between replicas on the input environment and read set? Not possible! Input is outside the JVM's control. Instead, the backup must adopt the primary's write set, and output from native methods is restricted.
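Non-deterministic commands. Added restrictions: 3 Native methods produce deterministic output to the environment. 4 Native methods invoke other methods deterministically. Handling these conditions: splitting methods. Before the split, one native method both reads non-deterministic input and writes output:
    /* Reads the clock (non-deterministic) and prints it (output) in one step. */
    void non_determ_read_write() {
        long r = read_clock();
        printf("%ld\n", r);
    }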
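Non-deterministic commands. Added restrictions: 3 Native methods produce deterministic output to the environment. 4 Native methods invoke other methods deterministically. Handling these conditions: splitting methods. After the split, the non-deterministic read and the deterministic write are separate native methods:
    /* Non-deterministic part: only reads; its result can be logged for the backup. */
    long non_determ_read() {
        return read_clock();
    }
    /* Deterministic part: output depends only on the argument. */
    void determ_write(long r) {
        printf("%ld\n", r);
    }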
Non-deterministic read sets. Definition: a read set is non-deterministic if it contains a shared variable. Examples: invoking methods on shared objects; storing objects in static references. Alternatives: bookkeeping of shared data (an order of magnitude overhead); lock acquisition ordering (needs data-race elimination); exclusive access to all variables while a thread is scheduled.
Non-deterministic read sets. Added restrictions: 5 Include one of the following: (1) no data-races (protect any shared data with monitors); (2) exclusive access to all shared variables. Example: class Example { static F shared = null; String tostring() { if (null == shared) { shared = new F(); synchronized_call();...
Output to the environment. Definition: an output is idempotent if it is independent of the number of times a command is executed. An output is testable when the environment can be tested for the occurrence of the output. Examples: "cd /home/siggi" is idempotent, "cd .." is not; "cd .." is testable if "pwd" is available. Definition: a state is volatile if it does not survive failure of the state machine. Otherwise, it is stable.
Output to the environment. Solution: only support idempotent and testable output. Volatile data might be necessary for correct operation, so side effect handlers replicate the lost volatile state of the primary. Added restrictions: 6 Output of native methods is idempotent or testable. 7 Native methods are annotated for volatile output.
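The difference between idempotent and testable output can be sketched with a simulated working directory, following the cd and pwd example from the previous slide. `DirEnvironment` and `OutputReplayDemo` are hypothetical names, and the environment is an in-memory stand-in rather than a real shell.

```java
/** Simulated environment: a current working directory (assumption, not a real shell). */
class DirEnvironment {
    private String cwd;

    DirEnvironment(String start) { cwd = start; }

    /** Idempotent output: repeating it leaves the environment unchanged. */
    void cdAbsolute(String path) { cwd = path; }

    /** Not idempotent: every execution moves one level further up. */
    void cdUp() {
        int slash = cwd.lastIndexOf('/');
        cwd = (slash <= 0) ? "/" : cwd.substring(0, slash);
    }

    /** Lets a replica test whether cdUp() already took effect. */
    String pwd() { return cwd; }
}

class OutputReplayDemo {
    /** Backup re-issues "cd .." only if pwd shows the primary had not performed it. */
    static void replayCdUp(DirEnvironment env, String cwdBeforeOutput) {
        if (env.pwd().equals(cwdBeforeOutput)) {
            env.cdUp();
        }
    }

    public static void main(String[] args) {
        DirEnvironment env = new DirEnvironment("/home/siggi/work");
        String before = env.pwd();
        env.cdUp();               // primary performs the output, then fails
        replayCdUp(env, before);  // backup tests first, so the output is not repeated
        System.out.println(env.pwd());
    }
}
```

An idempotent command can simply be replayed by the backup, whereas a testable one needs the extra check to keep exactly-once semantics.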
Implementation details. Extended JVM: new threads for failure detection and backup initiation, and for transfer of logging information; these interact with other system threads (GC, finalization). Threading modification: restriction (5.2) requires modifying the multithreading libraries. Sun's JVM provides both native and green threads: native threads are desired to run applications on SMP; green threads are desired for portability.
Implementation for non-deterministic commands. Initial work: inspected and categorized all native methods by hand! Found only 100 non-deterministic ones. Runtime support algorithm: create a table with each (non-deterministic) method's unique signature; a native call on the primary triggers a message to the backup; on recovery, the backup uses the same values. Side effect handlers are used for volatile state (e.g. the file descriptor from an open command).
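The runtime support algorithm can be sketched as a log of native results keyed by method signature. This is a simplified illustration, not the paper's implementation: `NativeResultLog` is a hypothetical class, the in-memory map stands in for the messages sent from primary to backup, and the signature strings are made up.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

class NativeResultLog {
    // One FIFO queue of logged results per (hypothetical) method signature.
    private final Map<String, Deque<Long>> results = new HashMap<>();

    /** Primary: execute the native call and log its result for the backup. */
    long callOnPrimary(String signature, LongSupplier nativeCall) {
        long value = nativeCall.getAsLong();
        results.computeIfAbsent(signature, k -> new ArrayDeque<>()).addLast(value);
        return value;
    }

    /** Backup during recovery: adopt the logged value; once drained, run the call live. */
    long callOnBackup(String signature, LongSupplier nativeCall) {
        Deque<Long> queue = results.get(signature);
        if (queue != null && !queue.isEmpty()) {
            return queue.removeFirst();
        }
        return nativeCall.getAsLong();
    }
}

class NativeResultLogDemo {
    public static void main(String[] args) {
        NativeResultLog log = new NativeResultLog();
        long onPrimary = log.callOnPrimary("clock()J", System::nanoTime);
        long onBackup = log.callOnBackup("clock()J", System::nanoTime);
        System.out.println("backup adopted primary value: " + (onPrimary == onBackup));
    }
}
```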
Non-deterministic read sets: first approach. Replicated lock synchronization: assumes (5.1), ensuring mutual exclusion. Defines the lock acquisition record $(t_i, l_j, t_i^{\#}, l_j^{\#})$: locking thread ($t_i$); lock ($l_j$); relative order of lock acquisition, given by the thread acquire sequence number ($t_i^{\#}$) and the lock acquire sequence number ($l_j^{\#}$). The primary creates $(t_i, l_j, t_i^{\#}, l_j^{\#})$; the backup uses $(t_i, l_j, t_i^{\#}, l_j^{\#})$ to repeat the ordering.
Computing lock acquisition record. Defining record values is not trivial: an object address for $l_j$ is meaningless at replicas; the order of events might differ in primary and backup. Recursive definition for $t_i = (t_p, k)$: $t_p$ is the parent of $t_i$; $t_i$ is the $k$-th thread created w.r.t. its siblings. Use thread determinism for $l_j$: $l_j$ is assigned the first time the lock is used; log the map $l_j \mapsto (t_i, t_i^{\#})$.
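The recursive thread identifier $t_i = (t_p, k)$ can be sketched as follows. `ReplicaThreadId` and its dotted `path()` encoding are hypothetical; the sketch only shows that an ID built from the parent and the creation index among siblings is the same on every replica, provided each thread creates its children in the same order.

```java
/** Replica-independent thread id: (parent id, index k among the parent's children). */
final class ReplicaThreadId {
    private final ReplicaThreadId parent; // null for the root thread
    private final int k;                  // this thread is the k-th child of its parent
    private int childrenCreated = 0;      // only touched by the thread that owns this id

    private ReplicaThreadId(ReplicaThreadId parent, int k) {
        this.parent = parent;
        this.k = k;
    }

    static ReplicaThreadId root() {
        return new ReplicaThreadId(null, 0);
    }

    /** Called by the owning thread each time it creates a new thread. */
    ReplicaThreadId spawnChild() {
        childrenCreated++;
        return new ReplicaThreadId(this, childrenCreated);
    }

    /** Flattened form of the recursive id, e.g. "0.2.1" (encoding is an assumption). */
    String path() {
        return parent == null ? "0" : parent.path() + "." + k;
    }
}

class ThreadIdDemo {
    public static void main(String[] args) {
        ReplicaThreadId main = ReplicaThreadId.root();
        ReplicaThreadId first = main.spawnChild();
        ReplicaThreadId second = main.spawnChild();
        ReplicaThreadId grandchild = second.spawnChild();
        System.out.println(first.path() + " " + second.path() + " " + grandchild.path());
    }
}
```

Because the counter is local to the creating thread, the ID does not depend on how threads are interleaved, unlike a global creation counter or an object address.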
Using lock acquisition record. Recovery algorithm. Case 1: backup thread $t_i$ tries to acquire $l_j$, and the log contains $r = (t_i, l_j, t_i^{\#}, l_j^{\#})$: wait until we reach $l_j^{\#}$, then remove $r$ from the log. Case 2: backup thread $t_i$ tries to acquire $l_j$, and the log doesn't contain $(t_i, l_j, t_i^{\#}, l_j^{\#})$: wait until the log is empty (end of the recovery protocol).
Using lock acquisition record. Recovery algorithm. Case 3: backup thread $t_i$ tries to acquire a lock with no id, and the log contains the map $l_j \mapsto (t_i, t_i^{\#})$: assign the lock the primary's $l_j$ and remove the map entry. Case 4: backup thread $t_i$ tries to acquire a lock with no id, and the log doesn't contain a map $l_j \mapsto (t_i, t_i^{\#})$: wait until either a thread assigns $l_j$ to the lock, or the log contains no more maps (then assign a fresh $l_j$).
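Cases 1 and 2 of the recovery algorithm can be sketched as a decision function that the backup consults before every lock acquisition. This is a single-threaded illustration under stated assumptions: `LockReplayLog` is a hypothetical class, lock sequence numbers are assumed to start at 1 and be consecutive per lock, and "wait" is modeled as a returned value rather than real blocking. Cases 3 and 4 (locks without an id) are left out.

```java
import java.util.HashMap;
import java.util.Map;

class LockReplayLog {
    enum Decision { PROCEED, WAIT }

    // Pending records from the primary: (t_i, l_j, t_i#) -> l_j#.
    private final Map<String, Long> pending = new HashMap<>();
    // Highest lock sequence number already replayed, per lock id.
    private final Map<Integer, Long> replayed = new HashMap<>();

    private static String key(String thread, int lock, long threadSeq) {
        return thread + "|" + lock + "|" + threadSeq;
    }

    void addRecord(String thread, int lock, long threadSeq, long lockSeq) {
        pending.put(key(thread, lock, threadSeq), lockSeq);
    }

    Decision tryAcquire(String thread, int lock, long threadSeq) {
        String k = key(thread, lock, threadSeq);
        Long lockSeq = pending.get(k);
        if (lockSeq == null) {
            // Case 2: no record, so wait until the log is empty (recovery is over).
            return pending.isEmpty() ? Decision.PROCEED : Decision.WAIT;
        }
        // Case 1: wait until the lock has reached this record's sequence number.
        long next = replayed.getOrDefault(lock, 0L) + 1;
        if (lockSeq != next) {
            return Decision.WAIT;
        }
        pending.remove(k);
        replayed.put(lock, lockSeq);
        return Decision.PROCEED;
    }
}

class LockReplayDemo {
    public static void main(String[] args) {
        LockReplayLog log = new LockReplayLog();
        log.addRecord("A", 1, 1, 1); // on the primary, A acquired lock 1 first
        log.addRecord("B", 1, 1, 2); // then B acquired lock 1
        System.out.println(log.tryAcquire("B", 1, 1)); // B arrives early on the backup
        System.out.println(log.tryAcquire("A", 1, 1));
        System.out.println(log.tryAcquire("B", 1, 1));
    }
}
```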
Non-deterministic read sets: second approach. Replicated thread scheduling: assumes (5.2), all shared data is protected. Defines the thread scheduling record $(bn, pc, m, l^{\#}, t_n)$. Code executed: current program counter ($pc$); trace summary to get there ($bn$). Monitor uses ($m$). Thread was waiting on a lock ($l^{\#}$). Next scheduled thread ($t_n$). A record is logged on each context switch.
Computing thread scheduling record. Defining program position: how many statements were executed? Avoid counting each instruction: $bn$ counts the branches, jumps, and invocations taken; $pc$ is the program counter offset (not an absolute address), which is updated on every instruction anyway.
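Why the pair $(bn, pc)$ identifies a unique program position can be seen on a toy interpreter. The sketch below is an assumption-laden illustration, not the paper's JVM changes: `TinyMachine` runs a hypothetical four-instruction loop, counts taken branches in `bn`, and the backup replays by running until it reaches the recorded pair. The `steps` counter exists only to check the result; the replay itself never uses it.

```java
class TinyMachine {
    static final int ADD = 0, LOOP = 1, HALT = 2;
    final int[] code = {ADD, ADD, LOOP, HALT}; // hypothetical program
    int pc = 0;      // program counter offset
    long bn = 0;     // number of taken branches so far
    long steps = 0;  // instructions executed (for checking only)
    int counter;     // remaining loop iterations
    long acc = 0;

    TinyMachine(int iterations) { counter = iterations; }

    boolean halted() { return code[pc] == HALT; }

    void step() {
        switch (code[pc]) {
            case ADD:
                acc++;
                pc++;
                break;
            case LOOP:
                counter--;
                if (counter > 0) { pc = 0; bn++; } else { pc++; }
                break;
            default:
                return; // HALT: nothing to execute
        }
        steps++;
    }
}

class PositionReplayDemo {
    /** Primary: preempted after a quantum; returns the recorded position {bn, pc}. */
    static long[] runPrimary(TinyMachine m, long quantum) {
        while (m.steps < quantum && !m.halted()) m.step();
        return new long[] {m.bn, m.pc};
    }

    /** Backup: runs until it reaches the same (bn, pc), then preempts there. */
    static void runBackup(TinyMachine m, long[] position) {
        while (!m.halted() && !(m.bn == position[0] && m.pc == position[1])) m.step();
    }

    public static void main(String[] args) {
        TinyMachine primary = new TinyMachine(5);
        long[] position = runPrimary(primary, 7);
        TinyMachine backup = new TinyMachine(5);
        runBackup(backup, position);
        System.out.println("same point: " + (primary.steps == backup.steps));
    }
}
```

Between two taken branches the code is straight-line, so `pc` only increases; together with the monotone `bn` this makes each pair occur at most once.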
Computing thread scheduling record. Defining program position: what if preemption occurred inside a native method? We can't control preemption outside the JVM. On recovery, preempt before the native call? Need to keep track of the locks acquired: locking is done in JVM monitors, so on recovery, preempt when $m$ is reached.
Interaction with system threads. Example: the heap is shared with the GC, and the GC is not written in Java! Problems: (1) $t_i$ acquires a lock at the primary with no contention, but $t_i$ waits at the backup, so $t_n$ can enter at the backup before it should! Use user-level threads to force $t_i$ to stay. (2) $t_i$ acquires a lock at the backup with no contention, but $t_i$ waits at the primary: use $m$ also to force rescheduling at the backup.
Replicated scheduling: final details. wait() and notifyAll(): multiple threads are awakened; store the $l^{\#}$ to preserve the order in the backup. Finishing recovery: the log becomes empty and the last entry contains $t_n$; the backup must schedule $t_n$ to reproduce the interaction with the environment.
Garbage Collection. Common problems. Soft/weak references: primary and backup may diverge; convert them to strong references. Finalizers should be no source of non-determinism: replicate as before.
Output to the environment. Side effect handlers: store and recover volatile state; ensure exactly-once semantics for output. Composed of 5 methods: register, test, log, receive, restore.
Components of SE handlers. Method register provides: the method signature; a non-determinism flag; an output command flag; the arguments used for output. Method test: used by the backup; true if the output command was successfully executed; only defined for testable commands (idempotent commands are replayed).
Components of SE handlers. Method log: used by the primary after an output command; saves arguments, return value, and internal state; produces a message with recovery information. Method receive: used by the backup to retrieve the result of log; can perform compaction of messages. Method restore: used by the backup only once to recover volatile state; uses the received messages.
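The five handler methods could be expressed as a Java interface along the following lines. All signatures here are assumptions made for illustration (the slides name the methods but not their parameters), and `OpenFileHandler` uses a simulated descriptor table instead of real files.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical shape of a side effect handler; parameter types are assumptions. */
interface SideEffectHandler {
    String register();                       // signature of the native method covered
    String log(String[] args, long result);  // primary: build the recovery message
    void receive(String message);            // backup: store (or compact) a message
    void restore();                          // backup, once: rebuild volatile state

    /** Only meaningful for testable output commands. */
    default boolean test(String[] args) {
        throw new UnsupportedOperationException("command is not testable");
    }
}

/** Example: volatile state is the table of open file descriptors. */
class OpenFileHandler implements SideEffectHandler {
    private final Map<Long, String> received = new HashMap<>(); // fd -> path from primary
    final Map<Long, String> restoredFds = new HashMap<>();      // simulated fd table

    public String register() { return "open(Ljava/lang/String;)J"; } // made-up signature

    public String log(String[] args, long result) {
        return result + "=" + args[0];
    }

    public void receive(String message) {
        int eq = message.indexOf('=');
        long fd = Long.parseLong(message.substring(0, eq));
        received.put(fd, message.substring(eq + 1));
    }

    public void restore() {
        restoredFds.putAll(received); // a real handler would reopen the files here
    }
}

class SeHandlerDemo {
    public static void main(String[] args) {
        OpenFileHandler onPrimary = new OpenFileHandler();
        OpenFileHandler onBackup = new OpenFileHandler();
        onBackup.receive(onPrimary.log(new String[] {"/tmp/a"}, 3L));
        onBackup.restore();
        System.out.println(onBackup.restoredFds);
    }
}
```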
Experimental setting. Architecture and settings: Sun E5000 servers, 15 400MHz UltraSPARC II CPUs, 2GB memory, 100Mbps Ethernet. Primary and backup run on different machines. The log is kept at the backup in volatile memory. Synchronization on each output (acks). Interpreted mode, no JIT. 3 scenarios: AL, TS, NoFT. Only green threads (native threads on SMP yield similar results).
Experimental setting. Benchmarks: SPEC JVM98 benchmarks. Results are shown for 6 benchmarks, among them compress (CPU intensive), db (database, heavy on locking), and mtrt (the only multithreaded one).
Algorithms comparison. [Figure: running times under the two algorithms]
Overhead. [Figure: overhead of the lock acquisition algorithm]
Overhead. [Figure: overhead of the thread scheduling algorithm]
3 Friedman, Kama - SRDS 2003
Introduction. Another hypervisor: built on top of Jikes RVM; ignores native code; supports JIT. Jikes RVM: almost all written in Java; uses yield points and time slices.
Sources of non-determinism. Multithreading: use a deterministic scheduler (yield points); deterministic dequeuing. Data-races on SMPs: assume no data-races; enforce lock ordering.
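A deterministic scheduler driven by yield points, rather than by a timer, can be sketched as a simulation. `DeterministicSchedulerDemo` is hypothetical: each thread is reduced to the number of yield points it will pass, and a context switch happens after a fixed number of them, so the resulting schedule depends only on the program and not on wall-clock time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class DeterministicSchedulerDemo {
    /**
     * Round-robin over threads; a slice ends after `slice` yield points, not after a
     * timer interrupt. Returns the trace "name:yieldPointsRun" of every slice.
     */
    static List<String> run(String[] names, int[] yieldPoints, int slice) {
        Deque<Integer> ready = new ArrayDeque<>();
        int[] remaining = yieldPoints.clone();
        for (int i = 0; i < names.length; i++) {
            ready.addLast(i);
        }
        List<String> trace = new ArrayList<>();
        while (!ready.isEmpty()) {
            int t = ready.removeFirst();       // deterministic dequeuing: FIFO order
            int ran = Math.min(slice, remaining[t]);
            remaining[t] -= ran;
            trace.add(names[t] + ":" + ran);
            if (remaining[t] > 0) {
                ready.addLast(t);              // not finished: back of the queue
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        String[] names = {"A", "B"};
        int[] work = {5, 2};
        // Two runs (think: primary and backup) produce the same schedule.
        System.out.println(run(names, work, 2));
        System.out.println(run(names, work, 2));
    }
}
```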
Design decisions. Frames: one frame lag between primary and backup; synchronize with replicas before starting a new frame; send all I/O results to replicas at the start point; send locks (on SMP) anywhere; send non-deterministic read sets.
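One possible reading of the frame mechanism is sketched below, with the caveat that the slides do not spell out the protocol: the primary hands over the I/O results of a frame before it starts the next one, and the backup executes a frame only once those results have arrived, which gives the one-frame lag. `FrameLog` and its method names are hypothetical, and the in-memory map stands in for network messages.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrameLog {
    private final Map<Integer, List<String>> ioByFrame = new HashMap<>();
    private int lastClosedFrame = -1;

    /** Primary: called at the synchronization point before starting the next frame. */
    void closeFrame(int frame, List<String> ioResults) {
        ioByFrame.put(frame, new ArrayList<>(ioResults));
        lastClosedFrame = frame;
    }

    /** Backup: a frame may run only after the primary has closed it. */
    boolean canExecute(int frame) {
        return frame <= lastClosedFrame;
    }

    /** Backup: the I/O results to feed into the replayed frame. */
    List<String> inputsFor(int frame) {
        List<String> io = ioByFrame.get(frame);
        return io == null ? new ArrayList<>() : io;
    }
}

class FrameDemo {
    public static void main(String[] args) {
        FrameLog log = new FrameLog();
        System.out.println("backup may run frame 0: " + log.canExecute(0));
        List<String> io = new ArrayList<>();
        io.add("read=7");
        log.closeFrame(0, io); // primary finished frame 0 and synchronizes
        System.out.println("backup may run frame 0: " + log.canExecute(0));
        System.out.println("backup may run frame 1: " + log.canExecute(1));
    }
}
```

A larger frame means fewer synchronization points but more work for the backup to catch up on after a failure, which is presumably why the experiments vary the frame size.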
Frames example Framing...
Implementation details. Replication engine: an additional module in Jikes RVM, responsible for communication between primary and backup, detection of failures, and election of a new primary.
Implementation details. Hurdles. JIT compilation: saving the thread switch counter. Non-deterministic number of statements: also disable preemption. Garbage collection on SMP: cooperative threads; GC is non-preemptive until all are done.
Experimental setting. Benchmarks: some SPEC JVM98 benchmarks and SciMark (scimark, compress, db, raytrace, mtrt). Variations on frame size (number of context switches).
Compress. [Figure: compress benchmark]
Database. [Figure: db benchmark]
Raytrace. [Figure: raytrace benchmark]
Multithreaded Raytrace. [Figure: mtrt benchmark]
Replication Overhead. [Figure: overhead]
Final remarks. Summary: common technique (hypervisor model); restrictions to solve non-determinism; support for SMPs. First paper main features: SE handlers for native methods. Second paper main features: frames; lower synchronization; faster recovery.