Reducing Dynamic Compilation Latency

Transcription

1 LLVM 12 - European Conference, London Reducing Dynamic Compilation Latency Igor Böhm Processor Automated Synthesis by iterative Analysis The University of Edinburgh

2 Concurrent and Parallel Dynamic Compilation

7 Dynamic Compilation: What do we want to improve? Initially, code is interpreted (slow). Frequently executed code is compiled on the fly. Execution switches from interpretation to native code execution (fast) as soon as dynamically compiled code is available. [Figure: execution timeline; legend: Interpretation, Code Execution.]

8 Dynamic Compilation: What do we want to improve? An earlier transition from interpretive to native execution. [Figure: execution timeline; legend: Interpretation (slow), Code Execution (fast).]

17 Dynamic compilation timelines. Legend: Interpretation, Interpretation with Profiling, Dynamic Compilation, Code Execution. (1) Dynamic compilation using one concurrent JIT Compiler thread: the main thread interprets and profiles while JIT Compiler Thread 1 compiles; the critical path is marked on JIT Compiler Thread 1. (2) Dynamic compilation using a concurrent and parallel JIT Compiler task farm: the main thread profiles while JIT Compiler Threads 1, 2 and 3 compile; the critical path is marked on JIT Compiler Thread 1.

20 Solution to the Dynamic Compilation Latency Problem: (a) improve code discovery/profiling on the main thread; (b) improve dynamic compilation workload throughput through concurrent and parallel dynamic compilation on JIT Compiler Threads 1, 2 and 3.

23 How hard can Code Discovery be? Static: Java byte-code, CIL, JavaScript. Dynamic: ARCompact binary, x86 binary, ARM binary.

24 How hard can Code Discovery be? "A crucial problem in the decompilation or disassembly of computer programs is the identification of executable code, i.e. the separation of instructions from data. This problem, for most computer architectures, is equivalent to the Halting Problem and is therefore unsolvable in general." [Horspool and Marovac]

29 Incremental Code Discovery. Trace interval (sequence of interpreted basic blocks): A B C D E F G A F G A B I F G. [Figure: the region grows at three points in the interval. Region after t': blocks A, B, C, D. Region after t'': blocks A, B, C, D, E, F, G. Region after t''': blocks A, B, C, D, E, F, G, H, I. Region == Dynamic CFG. Legend: basic block, CFG edges, return to interpreter.]
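The incremental discovery shown on this slide can be sketched as a region that accumulates basic blocks and control-flow edges as the interpreter reports each executed block. This is a minimal sketch, not the simulator's actual data structure: the `Region` type, its members, and the single-character block names are all hypothetical, and the points at which the trace is split below are assumptions chosen to mirror t', t'' and t'''.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical region type: a dynamic CFG built from interpreted blocks.
struct Region {
  std::set<char> blocks;                 // basic blocks discovered so far
  std::map<char, std::set<char>> edges;  // observed control-flow edges
  char last = 0;                         // previously interpreted block, 0 if none

  // Record one interpreted basic block and the edge that led to it.
  void record(char block) {
    blocks.insert(block);
    if (last != 0) edges[last].insert(block);
    last = block;
  }

  // Record a run of interpreted blocks, one character per block.
  void record_all(const std::string& trace) {
    for (char b : trace) record(b);
  }
};
```

Feeding the slide's trace in three pieces ("ABCD", "EFGAFG", "ABIFG") grows the region step by step; because the edge set only ever gains entries, the region can be inspected at any point during the interval.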

36 Concurrent and Parallel JIT Compilation in Action (reducing the critical path). Simulation is traced right from the start, and tracing is broken into intervals (Interval 1 to Interval 5). Dynamic code discovery groups the traced code into regions, one per page (Region 1 to Region 12); a page is a fixed size container for translation (size variable). Example page contents: ext r2,r9; xor r3,r12,r2; and r3,r3,0xf; asl r3,r3,0x3; and r2,r2,0x7; or r3,r3,r2; asl r4,r3,0x8; brcc.d r10,r13,0x2c; or r4,r4,r3. (1) Hide dynamic compilation latency: worker threads compile regions while simulation continues, with asynchronous registration of compiled regions. (2) Exploit task parallelism: Dynamic Compilation Worker Thread 1 compiles Regions 1, 3, 5, 9, 11; Worker Thread 2 compiles Regions 2, 6, 8, 12; Worker Thread 3 compiles Regions 4, 7, 10. (3) Temporal region partitioning (by trace interval). (4) Spatial region partitioning (by page).
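Spatial region partitioning can be illustrated by grouping the basic blocks traced in one interval according to the page that contains them. The sketch below is an assumption-laden illustration: the page size `kPageSize`, the `Regions` alias, and `partition_by_page` are hypothetical names, and blocks are reduced to their start addresses.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

constexpr std::uint32_t kPageSize = 8 * 1024;  // assumed page size in bytes

// Page number -> start addresses of the traced blocks inside that page.
using Regions = std::map<std::uint32_t, std::vector<std::uint32_t>>;

// Group the block start addresses traced in one interval by containing page,
// so that each page with traced code yields one candidate region.
Regions partition_by_page(const std::vector<std::uint32_t>& block_addrs) {
  Regions regions;
  for (std::uint32_t addr : block_addrs) {
    regions[addr / kPageSize].push_back(addr);
  }
  return regions;
}
```

Calling this once per trace interval gives the temporal partitioning as well: each interval produces its own set of per-page regions, which can then be dispatched independently.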

41 Concurrent and Parallel JIT Compiler Design. Execution loop: take the block at the current PC address. If the block is translated, execute its native code. Otherwise, if it is a new block, record it in a region, then simulate the block interpretively. At the end of a trace interval, analyse the recorded regions; if hot regions are present, (1) enqueue them and (2) continue. (3) The translation priority queue (Region 1, Region 2, ... Region N) is a concurrent shared data structure. (4) The concurrent and parallel dynamic compilation task farm dequeues regions and farms them out to JIT Compilation Threads 1 to N; each thread creates LLVM IR, optimises it, compiles it and links the code. Annotations: dynamic work scheduling; adaptive hotspot selection.
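The flowchart's execution loop can be approximated in a few lines. This is only a sketch under stated assumptions: the `Simulator` type and its fields are hypothetical, blocks are identified by integer PCs, hotness is a fixed per-interval execution count rather than the adaptive threshold the slide mentions, and translation is treated as finishing instantly, whereas in the design above it happens asynchronously on worker threads.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// All names are hypothetical; this mirrors the flowchart, not real simulator code.
struct Simulator {
  std::size_t interval_length;       // blocks per trace interval (assumed)
  unsigned hot_threshold;            // executions per interval to count as hot (assumed)
  std::set<int> translated;          // blocks with native code available
  std::map<int, unsigned> recorded;  // interpreted blocks and their counts this interval
  std::vector<int> enqueued;         // stands in for the translation priority queue
  std::size_t native_runs = 0;
  std::size_t interpreted_runs = 0;

  Simulator(std::size_t len, unsigned hot)
      : interval_length(len), hot_threshold(hot) {}

  void run(const std::vector<int>& pcs) {
    std::size_t in_interval = 0;
    for (int pc : pcs) {
      if (translated.count(pc)) {
        ++native_runs;               // block translated: native code execution
      } else {
        ++recorded[pc];              // record block, then simulate interpretively
        ++interpreted_runs;
      }
      if (++in_interval == interval_length) {  // end of trace interval
        analyse();
        in_interval = 0;
      }
    }
  }

  // Enqueue hot blocks. Marking them translated immediately is a simplification:
  // a concurrent JIT would register the translation some time later.
  void analyse() {
    for (const auto& entry : recorded) {
      if (entry.second >= hot_threshold) {
        enqueued.push_back(entry.first);
        translated.insert(entry.first);
      }
    }
    recorded.clear();
  }
};
```

With an interval length of 4 and a threshold of 2, the PC sequence 1, 1, 2, 1, 1, 2, 1, 3 interprets block 1 three times in the first interval, enqueues it at the interval boundary, and executes it natively afterwards.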

42 Concurrent and Parallel JIT Compiler Design Based on LLVM. Key components: llvm::LLVMContext owns and manages the core global data of LLVM's core infrastructure; llvm::ExecutionEngine is an abstract, easy to use interface for implementation execution of LLVM modules; a state-of-the-art set of optimisation passes.

43 Concurrent and Parallel JIT Compiler Design Based on LLVM. Key concepts: dispatch of compilation units via a thread-safe priority queue abstraction; each JIT compiler thread owns a private llvm::ExecutionEngine instance, enabling parallel JIT compilation without explicit synchronisation; asynchronous registration of compiled native code.
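A thread-safe priority queue of the kind described here can be built from the C++ standard library alone. The following is a minimal sketch, not the presenter's implementation: `WorkUnit`, `TranslationQueue`, and the "larger priority value compiles sooner" policy are assumptions made for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>

// Hypothetical work unit: a region id with a scheduling priority.
struct WorkUnit {
  int region;
  int priority;  // larger value = compile sooner (an assumed policy)
  bool operator<(const WorkUnit& other) const {
    return priority < other.priority;
  }
};

class TranslationQueue {
 public:
  // Called by the simulation thread to dispatch a compilation unit.
  void push(WorkUnit u) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(u);
    }
    ready_.notify_one();  // wake one waiting JIT thread
  }

  // Called by JIT threads: blocks until work is available, then returns
  // the highest priority unit.
  WorkUnit pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !queue_.empty(); });
    WorkUnit u = queue_.top();
    queue_.pop();
    return u;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::priority_queue<WorkUnit> queue_;
};
```

The queue is the only point where the simulation thread and the JIT threads synchronise; everything a worker does after `pop()` returns touches only that worker's own state.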

46 Concurrent and Parallel JIT Compiler Design Based on LLVM

    class JITThread : public Thread {
    private:
      llvm::LLVMContext*     CTX_;  // per thread LLVMContext
      llvm::Module*          MOD_;  // per thread main Module
      llvm::ExecutionEngine* ENG_;  // per thread ExecutionEngine
      ...
    public:
      void create() {
        CTX_ = new llvm::LLVMContext();
        MOD_ = new llvm::Module("module", *CTX_);
        ENG_ = llvm::EngineBuilder(MOD_)
                   .setEngineKind(llvm::EngineKind::JIT)
                   .create();
        ...
      }

      void run() {
        for ( ; /* ever */ ; ) {
          queue.mutex.acquire();
          while (queue.empty()) {            // wait for work if queue is empty
            queue.condvar.wait(queue.mutex);
          }
          WorkUnit* u = queue.top();         // retrieve compilation unit
          queue.pop();
          queue.mutex.release();
          llvm::Function* f = Codegen(u);    // generate IR
          void* native = ENG_->getPointerToFunction(f);  // run JIT
          // register native translation for execution
          ...
        }
      }
    };
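The task farm around this worker loop can be sketched without LLVM by replacing code generation with a stub. Everything here is hypothetical scaffolding: `compile_region` stands in for IR generation plus JIT compilation, `run_task_farm` is an invented helper, and, unlike the endless loop on the slide, these workers exit once the queue is drained so that the example terminates.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Stand-in for code generation plus JIT compilation (hypothetical).
std::string compile_region(int region) {
  return "native code for region " + std::to_string(region);
}

// Compile all pending regions on num_threads workers and return the
// registered translations, keyed by region id.
std::map<int, std::string> run_task_farm(std::queue<int> pending,
                                         unsigned num_threads) {
  std::mutex queue_mutex;
  std::mutex cache_mutex;
  std::map<int, std::string> translation_cache;

  auto worker = [&] {
    for (;;) {
      int region;
      {
        std::lock_guard<std::mutex> lock(queue_mutex);
        if (pending.empty()) return;  // no work left: worker exits
        region = pending.front();
        pending.pop();
      }
      // Compilation runs outside any lock, so workers proceed in parallel.
      std::string native = compile_region(region);
      // Registration is the only other synchronised step.
      std::lock_guard<std::mutex> lock(cache_mutex);
      translation_cache[region] = native;
    }
  };

  std::vector<std::thread> workers;
  for (unsigned i = 0; i < num_threads; ++i) workers.emplace_back(worker);
  for (auto& t : workers) t.join();
  return translation_cache;
}
```

Which worker compiles which region depends on timing, which is the dynamic work scheduling idea: a thread that finishes early simply takes the next region from the queue.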

47 Evaluation. Extensive evaluation using over 60 industry standard benchmarks built for the ARCompact RISC platform: BioPerf, SPEC CPU 2006, EEMBC and CoreMark. Target platform: ARCompact RISC ISA targeting the ARC 700 processor. Simulation platform: standard x86 Dell Intel Xeon quad-core machine.

52 [Figure: BioPerf speedup relative to the baseline (1.0) for clustalw, fasta-ssearch, promlk, grappa, hmmsearch, hmmpfam, tcoffee, blastp, glimmer, ce and the average. Series: interpreted-only execution; execution using a concurrent JIT Compiler; execution using a concurrent and parallel JIT Compiler. Labels legible in the transcript: speedups 2.08, 1.43, 1.38 and 1.01; 217 MIPS and 50 MIPS. Measured on a standard x86 quad-core machine.]

57 [Figure: SPEC CPU 2006 speedup relative to the baseline (1.0); very long running CPU intensive benchmarks (worst-case scenario): perlbench, bzip2, gcc, mcf, milc, gobmk, soplex, povray, hmmer, sjeng, libquantum, h264ref, lbm, omnetpp, astar, sphinx3, xalancbmk and the average. Series: interpreted-only execution; execution using a concurrent JIT Compiler; execution using a concurrent and parallel JIT Compiler. Labels legible in the transcript: speedups 1.47, 1.15 and 1.06; 201 MIPS and 378 MIPS. Measured on a standard x86 quad-core machine.]

62 [Figure: EEMBC and CoreMark speedup relative to the baseline; very short running embedded benchmarks (worst-case scenario): a2time01, aifftr01, aifirf01, aiifft01, autcor00, basefp01, bezier01, bitmnp01, cacheb01, coremark, cjpeg, canrdr01, conven00, dither01, djpeg, fbital00, fft00, idctrn01, iirflt01, matrix01, ospf, pktflow, pntrch01, puwmod01, rgbcmy01, rgbhpg01, rgbyiq01, rotate01, routelookup, rspeed01, tblook01, text01, ttsprk01, viterb00 and the average. Series: interpreted-only execution; execution using a concurrent JIT Compiler; execution using a concurrent and parallel JIT Compiler. Measured on a standard x86 quad-core machine.]

63 How far does it scale? What is a sensible number of JIT compilation threads? [Figure: speedup versus number of JIT compilation threads for blastp, tcoffee, gcc, perlbench and bzip, plus a cumulative histogram (0% to 90%) of the percentage of benchmarks benefiting from n JIT compilation threads. Measured on a 16-core machine.]

72 Effect of Concurrent and Parallel JIT Compilation on Throughput. [Figure: two plots for 403.gcc against trace interval. Regions compiled (series include T1): the parallel configuration shows higher throughput. Average queue length (series T1 and T3): the parallel configuration shows a smaller queue length.]

73 Does this scale for multi-threaded/core applications?

77 Concurrent and Parallel JIT Compilation in Action (trace sharing). Two simulated threads, T1 and T2, are each traced over Intervals 1 to 5 and each discover their own regions (T1: Regions 1 to 12; T2: Regions 1 to 10). Some regions (A, B, C, D) are shared between the two threads. Dynamic Compilation Worker Threads 1, 2 and 3 compile regions from both threads. (1) Region in translation: tag the existing entry; a region tagged for multiple threads has its translation registered for T1 and T2. (2) Region already translated: retrieve it from the translation cache.
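The two trace-sharing cases on this slide can be sketched as a small registry keyed by region. This is an illustrative guess at the bookkeeping, not the simulator's code: `SharedRegions`, `Entry`, `State`, and the use of single characters as region keys are all hypothetical, and locking is omitted to keep the sketch short.

```cpp
#include <cassert>
#include <map>
#include <set>

enum class State { InTranslation, Translated };

struct Entry {
  State state;
  std::set<int> threads;  // simulated threads tagged on this region
};

// Hypothetical registry of regions shared between simulated threads.
class SharedRegions {
 public:
  // Returns true if the caller must enqueue the region for compilation.
  bool request(char region, int thread) {
    auto it = entries_.find(region);
    if (it == entries_.end()) {
      entries_.emplace(region, Entry{State::InTranslation, {thread}});
      return true;                      // first request: compile it
    }
    it->second.threads.insert(thread);  // in translation or translated: tag only
    return false;
  }

  // Called when compilation finishes; returns the threads for which the
  // translation is registered.
  std::set<int> complete(char region) {
    Entry& e = entries_.at(region);
    e.state = State::Translated;
    return e.threads;
  }

  bool translated(char region) const {
    auto it = entries_.find(region);
    return it != entries_.end() && it->second.state == State::Translated;
  }

 private:
  std::map<char, Entry> entries_;
};
```

If T1 requests region A first and T2 requests it while it is still being compiled, only one compilation is enqueued and `complete('A')` reports both threads, so the single translation is registered for T1 and T2.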

78 Conclusions. A novel interval-based region code discovery scheme enables concurrent and parallel JIT compilation and delivers an average reduction in execution time of 11.5%, and up to 51.9%, across 60 industry standard benchmarks. We minimise JIT compilation overhead and effectively hide compilation latency by combining: light-weight interval-based tracing; dynamic work scheduling; adaptive hotspot threshold selection; concurrent and parallel JIT compilation.
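The execution-time reductions quoted here and the speedups plotted in the evaluation charts are two views of the same ratio. The short derivation below makes the conversion explicit; the symbols $T_0$, $T_1$, $S$ and $R$ are introduced only for this illustration, and pairing the 51.9% figure with the 2.08 speedup label from the BioPerf chart is an inference from the arithmetic, not something the slides state.

```latex
% T_0: execution time of the baseline, T_1: execution time of the improved system
S = \frac{T_0}{T_1}, \qquad
R = \frac{T_0 - T_1}{T_0} = 1 - \frac{1}{S}

% A speedup of 2.08 corresponds to a 51.9% reduction in execution time:
R = 1 - \frac{1}{2.08} \approx 0.519

% Conversely, an 11.5% reduction corresponds to a speedup of about 1.13:
S = \frac{1}{1 - 0.115} \approx 1.13
```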

80 Demos: Video Decoding and Playback; Full System OS Simulation

81 Thank You