Reducing Dynamic Compilation Latency

Size: px
Start display at page:

Download "Reducing Dynamic Compilation Latency"

Transcription

1 LLVM 12 - European Conference, London Reducing Dynamic Compilation Latency Igor Böhm Processor Automated Synthesis by iterative Analysis The University of Edinburgh

2 LLVM 12 - European Conference, London Concurrent and Parallel Dynamic Compilation Igor Böhm Processor Automated Synthesis by iterative Analysis The University of Edinburgh

3 Dynamic Compilation What do we want to improve? Interp Interpretation Code Execution Interp Interp 2

4 Dynamic Compilation What do we want to improve? Interp Interpretation Code Execution Interp Interp initially code is interpreted 2

5 Dynamic Compilation What do we want to improve? Interp Interpretation Code Execution Interp Interp initially code is interpreted frequently executed code is compiled on-the-fly 2

6 Dynamic Compilation What do we want to improve? Interp Interpretation Code Execution Interp Interp initially code is interpreted frequently executed code is compiled on-the-fly switch from interpretive to native code execution as soon as dynamically compiled code is available 2

7 Dynamic Compilation What do we want to improve? Interp Interpretation Code Execution slow fast slow fast Interp Interp initially code is interpreted frequently executed code is compiled on-the-fly switch from interpretive to native code execution as soon as dynamically compiled code is available 2

8 Dynamic Compilation What do we want to improve? Interp Interpretation Code Execution slow fast slow fast Interp Interp Interp Earlier transition from interpretive to native execution 2

9 Interp Interpretation Dynamic Compilation Profile Interpretation with Profiling Code Execution 1 Dynamic Compilation using one Concurrent JIT r Main Thread Interp Profile Interp Interp Profile Interp Profile r Thread Thread 1 3

10 Interp Interpretation Dynamic Compilation Profile Interpretation with Profiling Code Execution 1 Dynamic Compilation using one Concurrent JIT r Main Thread Interp Profile Interp Interp Profile Interp Profile r Thread Thread 1 3

11 Interp Interpretation Dynamic Compilation Profile Interpretation with Profiling Code Execution 1 Dynamic Compilation using one Concurrent JIT r Main Thread Interp Profile Interp Interp Profile Interp Profile r Thread Thread 1 3

12 Interp Interpretation Dynamic Compilation Profile Interpretation with Profiling Code Execution 1 Dynamic Compilation using one Concurrent JIT r Main Thread Interp Profile Interp Interp Profile Interp Profile r Thread critical path Thread 1 3

13 Interp Interpretation Dynamic Compilation Profile Interpretation with Profiling Code Execution 1 Dynamic Compilation using one Concurrent JIT r Main Thread Interp Profile Interp Interp Profile Interp Profile r Thread critical path Thread 1 2 Dynamic Compilation using Concurrent and Parallel JIT r Task Farm Main Thread Profile Profile Profile 3

14 Interp Interpretation Dynamic Compilation Profile Interpretation with Profiling Code Execution 1 Dynamic Compilation using one Concurrent JIT r Main Thread Interp Profile Interp Interp Profile Interp Profile r Thread critical path Thread 1 2 Dynamic Compilation using Concurrent and Parallel JIT r Task Farm Main Thread Profile Profile Profile 3

15 Interp Interpretation Dynamic Compilation Profile Interpretation with Profiling Code Execution 1 Dynamic Compilation using one Concurrent JIT r Main Thread Interp Profile Interp Interp Profile Interp Profile r Thread critical path Thread 1 2 Dynamic Compilation using Concurrent and Parallel JIT r Task Farm Main Thread Profile Profile Profile r Thread Thread 1 r Thread Thread 2 r Thread Thread 3 3

16 Interp Interpretation Dynamic Compilation Profile Interpretation with Profiling Code Execution 1 Dynamic Compilation using one Concurrent JIT r Main Thread Interp Profile Interp Interp Profile Interp Profile r Thread critical path Thread 1 2 Dynamic Compilation using Concurrent and Parallel JIT r Task Farm Main Thread Profile Profile Profile r Thread Thread 1 r Thread Thread 2 r Thread Thread 3 3

17 Interp Interpretation Dynamic Compilation Profile Interpretation with Profiling Code Execution 1 Dynamic Compilation using one Concurrent JIT r Main Thread Interp Profile Interp Interp Profile Interp Profile r Thread critical path Thread 1 2 Dynamic Compilation using Concurrent and Parallel JIT r Task Farm Main Thread Profile Profile Profile critical r Thread path Thread 1 r Thread Thread 2 r Thread Thread 3 3

18 Solution To Dynamic Compilation Latency Problem Main Thread Concurrent and Parallel Dynamic Compilation Profile Profile Profile r Thread Thread 1 r Thread Thread 2 r Thread Thread 3 4

19 Solution To Dynamic Compilation Latency Problem improve code discovery/profiling Main Thread Concurrent and Parallel Dynamic Compilation Profile Profile Profile r Thread Thread 1 r Thread Thread 2 r Thread Thread 3 4

20 Solution To Dynamic Compilation Latency Problem improve code discovery/profiling improve dynamic compilation workload throughput Main Thread Concurrent and Parallel Dynamic Compilation Profile Profile Profile r Thread Thread 1 r Thread Thread 2 r Thread Thread 3 4

21 How hard can Code Discovery be? 5

22 How hard can Code Static Discovery be? Java byte-code CIL JavaScript 5

23 How hard can Code Discovery be? Static Dynamic Java byte-code ARCompact JavaScript x86 binary binary ARM CIL binary 5

24 How hard can Code Discovery be? A crucial problem in the decompilation or disassembly of computer programs is the identification of executable code, i.e. the separation of instructions from data. This problem, for most computer architectures, is equivalent to the Halting Problem and is therefore unsolvable in general. [Horspool and Marovac ] 5

25 6 Incremental Code Discovery Trace Interval A B C D E F G A F G A B I F G Sequence of interpreted basic blocks A Basic Block CFG Edges Return to Interpreter

26 6 Incremental Code Discovery Trace Interval A B C D E F G A F G A B I F G 1 t' 1 Region after t' A B C D A Basic Block CFG Edges Return to Interpreter

27 6 Incremental Code Discovery Trace Interval A B C D E F G A F G A B I F G 1 t' 2 t'' 1 Region after t' 2 Region after t'' A A B B C C D D E A Basic Block CFG Edges Return to Interpreter F G

28 6 Incremental Code Discovery Trace Interval A B C D E F G A F G A B I F G 1 t' 2 t'' 3 t''' 1 Region after t' 2 Region after t'' 3 Region after t''' A A A B B H B C C C D D D E E I A Basic Block F F CFG Edges Return to Interpreter G G

29 6 Incremental Code Discovery Trace Interval A B C D E F G A F G A B I F G 3 t''' 3 Region after t''' A H B Region == Dynamic CFG C D A Basic Block CFG Edges Return to Interpreter E F G I

30 Concurrent and Parallel JIT Compilation in Action (reducing the critical path) Trace Dynamic Code Discovery Code Execution Simulation Trace Trace Trace 7

31 Concurrent and Parallel JIT Compilation in Action (reducing the critical path) Trace Dynamic Code Discovery Code Execution Simulation Trace Trace Trace trace right from the start 7

32 Concurrent and Parallel JIT Compilation in Action (reducing the critical path) Trace Dynamic Code Discovery Code Execution Regions Page Page Page Page Region 1 Region 2 Region 3 Region 4 page is a fixed size container for translation Page (size variable) ext r2,r9 xor r3,r12,r2 and r3,r3,0xf asl r3,r3,0x3 and r2,r2,0x7 or r3,r3,r2 asl r4,r3,0x8 brcc.d r10,r13,0x2c or r4,r4,r3 Interval 1 Interval 2 Interval 3 Simulation Trace Trace Trace break tracing into intervals 7

33 Concurrent and Parallel JIT Compilation in Action (reducing the critical path) Trace Dynamic Code Discovery Code Execution Regions Page Page Page Page Region 1 Region 2 Region 3 Region 4 page is a fixed size container for translation Page (size variable) ext r2,r9 xor r3,r12,r2 and r3,r3,0xf asl r3,r3,0x3 and r2,r2,0x7 or r3,r3,r2 asl r4,r3,0x8 brcc.d r10,r13,0x2c or r4,r4,r3 Interval 1 Interval 2 Interval 3 Simulation Trace Trace Trace 1 Hide Dynamic Compilation Latency Dynamic Compilation Worker Thread 1 Region 1 Region 3 Dynamic Compilation Worker Thread 2 Region 2 Dynamic Compilation Worker Thread 3 Region 4 hide compilation latency 7

34 Concurrent and Parallel JIT Compilation in Action (reducing the critical path) Trace Dynamic Code Discovery Code Execution Regions Page Page Page Page Region 1 Region 2 Region 3 Region 4 page is a fixed size container for translation Page (size variable) ext r2,r9 xor r3,r12,r2 and r3,r3,0xf asl r3,r3,0x3 and r2,r2,0x7 or r3,r3,r2 asl r4,r3,0x8 brcc.d r10,r13,0x2c or r4,r4,r3 Interval 1 Interval 2 Interval 3 Simulation Trace Trace Trace Dynamic Compilation Worker Thread 1 1 Hide Dynamic Compilation Latency Region 1 Region 3 async registration of Dynamic Compilation Worker Thread 2 Region 2 compiled regions Dynamic Compilation Worker Thread 3 Region 4 hide compilation latency 7

35 Concurrent and Parallel JIT Compilation in Action (reducing the critical path) Trace Dynamic Code Discovery Code Execution Regions Page Page Page Page Region 1 Region 2 Region 3 Region 4 Page Page Page Region 5 Region 6 Region7 Page Page Page Page Page Region 9 Region8 Region10 Region 11 Region 12 Page (size variable) ext r2,r9 xor r3,r12,r2 and r3,r3,0xf asl r3,r3,0x3 and r2,r2,0x7 or r3,r3,r2 asl r4,r3,0x8 brcc.d r10,r13,0x2c or r4,r4,r3 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Simulation Trace Trace Trace 1 Hide Dynamic Compilation Latency 2 Exploit Task Parallelism Dynamic Compilation Worker Thread 1 Region 1 Region 3 Region 5 Region 9 Region 11 Dynamic Compilation Worker Thread 2 Region 2 Region 6 Region 8 Region 12 Dynamic Compilation Worker Thread 3 Region 4 Region 7 Region 10 hide compilation latency exploit task parallelism 7

36 Concurrent and Parallel JIT Compilation in Action (reducing the critical path) Spatial Region Partitioning 4 Trace Dynamic Code Discovery Code Execution Regions Page Page Page Page Region 1 Region 2 Region 3 Region 4 Page Page Page Region 5 Region 6 Region7 Page Page Page Page Page Region 9 Region8 Region10 Region 11 Region 12 Page (size variable) ext r2,r9 xor r3,r12,r2 and r3,r3,0xf asl r3,r3,0x3 and r2,r2,0x7 or r3,r3,r2 asl r4,r3,0x8 brcc.d r10,r13,0x2c or r4,r4,r3 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Temporal Region Partitioning 3 Simulation Trace Trace Trace 1 Hide Dynamic Compilation Latency 2 Exploit Task Parallelism Dynamic Compilation Worker Thread 1 Region 1 Region 3 Region 5 Region 9 Region 11 Dynamic Compilation Worker Thread 2 Region 2 Region 6 Region 8 Region 12 Dynamic Compilation Worker Thread 3 Region 4 Region 7 Region 10 hide compilation latency exploit task parallelism 7

37 Concurrent and Parallel JIT r Design PC address Block Translated Yes Code Execution No New Block Yes Record Block in Region No Interpretive Block Simulation No End of Trace Interval Analyse Recorded Regions Yes Execution Loop No Hot Yes Enqueue Hot Regions Regions and Present Continue 8

38 Concurrent and Parallel JIT r Design PC address Block Translated No New Block Yes Yes Code Execution Record Block in Region Region 1 Region 2 Translation Priority Queue Region 6 Region 7 Region N 3 Concurrent Shared Data- Structure No Interpretive Block Simulation No End of Trace Interval Analyse Recorded Regions Yes Execution Loop 1 Enqueue No Hot Yes Enqueue Hot Regions Regions and Present Continue 2 Continue JIT Compilation Task Farm 8

39 Concurrent and Parallel JIT r Design PC address Block Translated No New Block Yes Yes Code Execution Record Block in Region Region 1 Region 2 Translation Priority Queue Region 6 Region 7 Region N 3 4 Concurrent Shared Data- Structure Dequeue and Farm Out No No Interpretive Block Simulation End of Trace Interval Analyse Recorded Regions Yes Execution Loop 1 Enqueue Concurrent and Parallel Dynamic Compilation Task Farm JIT Compilation Thread 1 Create LLVM IR Optimise JIT Compilation Thread 2 Create LLVM IR Optimise JIT Compilation Thread N Create LLVM IR Optimise No Hot Yes Enqueue Hot Regions Regions and Present Continue 2 Continue Link Code Link Code Link Code JIT Compilation Task Farm 8

40 Concurrent and Parallel JIT r Design PC address Block Translated No New Block Yes Yes Code Execution Record Block in Region dynamic work scheduling Region 1 Region 2 Translation Priority Queue Region 6 Region 7 Region N 3 4 Concurrent Shared Data- Structure Dequeue and Farm Out No No Interpretive Block Simulation End of Trace Interval Analyse Recorded Regions Yes Execution Loop 1 Enqueue Concurrent and Parallel Dynamic Compilation Task Farm JIT Compilation Thread 1 Create LLVM IR Optimise JIT Compilation Thread 2 Create LLVM IR Optimise JIT Compilation Thread N Create LLVM IR Optimise No Hot Yes Enqueue Hot Regions Regions and Present Continue 2 Continue Link Code Link Code Link Code JIT Compilation Task Farm 8

41 Concurrent and Parallel JIT r Design PC address No Block Translated No New Block No Interpretive Block Simulation End of Trace Interval Analyse Recorded Regions Yes Yes Yes Code Execution Record Block in Region Execution Loop 1 Enqueue dynamic work scheduling adaptive hotspot selection Region 1 Region 2 Translation Priority Queue Region 6 Region 7 Region N 3 Concurrent Shared Data- Structure Concurrent and Parallel Dynamic Compilation Task Farm JIT Compilation Thread 1 Create LLVM IR Optimise JIT Compilation Thread 2 Create LLVM IR Optimise 4 Dequeue and Farm Out JIT Compilation Thread N Create LLVM IR Optimise No Hot Yes Enqueue Hot Regions Regions and Present Continue 2 Continue Link Code Link Code Link Code JIT Compilation Task Farm 8

42 Concurrent and Parallel JIT r Design Based on LLVM Key Components: llvm::llvmcontext - owns and manages core global data of LLVM s core infrastructure llvm::executionengine - abstract, easy to use interface for implementation execution of LLVM modules state-of-the-art set of optimisation passes 9

43 Concurrent and Parallel JIT r Design Based on LLVM Key Concepts: dispatch of compilation units via threadsafe priority queue abstraction each JIT compiler thread owns private llvm::executionengine instance enabling parallel JIT compilation without explicit synchronisation asynchronous registration of compiled native code 10

44 Concurrent and Parallel JIT r Design Based on LLVM class JITThread : public Thread { private: llvm::llvmcontext* CTX_; // per thread LLVMContext llvm::module* MOD_; // per thread main Module llvm::executionengine* ENG_; // per thread ExecutionEngine... public: } 11

45 Concurrent and Parallel JIT r Design Based on LLVM class JITThread : public Thread { private: llvm::llvmcontext* CTX_; // per thread LLVMContext llvm::module* MOD_; // per thread main Module llvm::executionengine* ENG_; // per thread ExecutionEngine... public: void create() { CTX_ = new llvm::llvmcontext(); MOD_ = new llvm::module("module", *CTX_); ENG_ = llvm::enginebuilder(mod_).setenginekind(llvm::enginekind::jit).create();... } } 11

46 Concurrent and Parallel JIT r Design Based on LLVM class JITThread : public Thread { private: llvm::llvmcontext* CTX_; // per thread LLVMContext llvm::module* MOD_; // per thread main Module llvm::executionengine* ENG_; // per thread ExecutionEngine... public: void create() { CTX_ = new llvm::llvmcontext(); MOD_ = new llvm::module("module", *CTX_); ENG_ = llvm::enginebuilder(mod_).setenginekind(llvm::enginekind::jit).create();... } } void run() { for ( ; /* ever */ ; ) { queue.mutex.acquire(); while (queue.empty()) { // wait for work if queue is empty queue.condvar.wait(queue.mutex); } WorkUnit* u = queue.top(); // retrieve compilation unit queue.pop(); queue.mutex.release(); llvm::function* f = Codegen(u); // generate IR void* native = ENG_->getPointerToFunction(f); // run JIT // register native translation for execution... } } 11

47 Evaluation Extensive evaluation using over 60 industry standard benchmarks built for ARCompact RISC platform: BioPERF SPEC CPU 2006 EEMBC and CoreMark Target Platform: ARCompact RISC ISA targeting ARC 700 processor Simulation Platform: standard x86 Dell Intel Xeon quad-core machine 12

48 2.5 Speedup BioPerf Speedup 1.0 Baseline 0.5 clustalw fasta(ssearch promlk grappa hmmsearch hmmpfam tcoffee blastp glimmer ce average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 13 Measured on standard x86 quad-core machine

49 2.5 Speedup BioPerf Speedup Baseline clustalw fasta(ssearch promlk grappa hmmsearch hmmpfam tcoffee blastp glimmer ce average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 13 Measured on standard x86 quad-core machine

50 2.5 Speedup BioPerf Speedup Baseline clustalw fasta(ssearch promlk grappa hmmsearch hmmpfam tcoffee blastp glimmer ce average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 13 Measured on standard x86 quad-core machine

51 2.5 Speedup BioPerf Speedup Baseline clustalw fasta(ssearch promlk grappa hmmsearch hmmpfam tcoffee blastp glimmer ce average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 13 Measured on standard x86 quad-core machine

52 2.5 Speedup BioPerf ,MIPS ,MIPS 2.08,,217,MIPS 50,MIPS ,MIPS ,MIPS ,MIPS 1.43,1.38,x, ,MIPS Speedup ,MIPS ,MIPS ,MIPS 1.01 Baseline clustalw fasta(ssearch promlk grappa hmmsearch hmmpfam tcoffee blastp glimmer ce average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 13 Measured on standard x86 quad-core machine

53 Speedup SPEC CPU 2006 very long running CPU intensive benchmarks [worst-case scenario] Speedup 1.0 Baseline perlbench bzip2 gcc mcf milc gobmk soplex povray hmmer sjeng libquantum h264ref lbm omnetpp astar sphinx3 xalancbmk average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 14 Measured on standard x86 quad-core machine

54 Speedup SPEC CPU 2006 very long running CPU intensive benchmarks [worst-case scenario] Speedup Baseline perlbench bzip2 gcc mcf milc gobmk soplex povray hmmer sjeng libquantum h264ref lbm omnetpp astar sphinx3 xalancbmk average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 14 Measured on standard x86 quad-core machine

55 Speedup SPEC CPU 2006 very long running CPU intensive benchmarks [worst-case scenario] Speedup Baseline perlbench bzip2 gcc mcf milc gobmk soplex povray hmmer sjeng libquantum h264ref lbm omnetpp astar sphinx3 xalancbmk average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 14 Measured on standard x86 quad-core machine

56 Speedup SPEC CPU 2006 very long running CPU intensive benchmarks [worst-case scenario] 1.47 Speedup Baseline perlbench bzip2 gcc mcf milc gobmk soplex povray hmmer sjeng libquantum h264ref lbm omnetpp astar sphinx3 xalancbmk average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 14 Measured on standard x86 quad-core machine

57 Speedup ,MIPS ,MIPS ,MIPS 1.06 Speedup SPEC CPU 2006 very long running CPU intensive benchmarks [worst-case scenario] 201,MIPS 378,MIPS ,MIPS ,MIPS ,MIPS ,MIPS ,MIPS ,MIPS ,MIPS ,MIPS ,MIPS ,MIPS ,MIPS ,MIPS ,MIPS, 1.15,x 1.15 Baseline perlbench bzip2 gcc mcf milc gobmk soplex povray hmmer sjeng libquantum h264ref lbm omnetpp astar sphinx3 xalancbmk average Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, 14 Measured on standard x86 quad-core machine

58 Speedup Speedup EEMBC CoreMark very short running embedded benchmarks [worst-case scenario] Baseline a2time01 aifftr01 aifirf01 aiifft01 autcor00 basefp01 bezier01 bitmnp01 cacheb01 coremark cjpeg canrdr01 Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, conven00 dither01 djpeg fbital00 fft00 15 idctrn01 iirflt01 matrix01 ospf pktflow pntrch01 puwmod01 rgbcmy01 rgbhpg01 rgbyiq01 rotate01 routelookup rspeed01 Measured on standard x86 quad-core machine tblook01 text01 ttsprk01 viterb00 average

59 Speedup Speedup EEMBC CoreMark very short running embedded benchmarks [worst-case scenario] Baseline a2time01 aifftr01 aifirf01 aiifft01 autcor00 basefp01 bezier01 bitmnp01 cacheb01 coremark cjpeg canrdr01 Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, conven00 dither01 djpeg fbital00 fft00 15 idctrn01 iirflt01 matrix01 ospf pktflow pntrch01 puwmod01 rgbcmy01 rgbhpg01 rgbyiq01 rotate01 routelookup rspeed01 Measured on standard x86 quad-core machine tblook01 text01 ttsprk01 viterb00 average

60 Speedup Speedup EEMBC CoreMark very short running embedded benchmarks [worst-case scenario] Baseline a2time01 aifftr01 aifirf01 aiifft01 autcor00 basefp01 bezier01 bitmnp01 cacheb01 coremark cjpeg canrdr01 Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, conven00 dither01 djpeg fbital00 fft00 15 idctrn01 iirflt01 matrix01 ospf pktflow pntrch01 puwmod01 rgbcmy01 rgbhpg01 rgbyiq01 rotate01 routelookup rspeed01 Measured on standard x86 quad-core machine tblook01 text01 ttsprk01 viterb00 average

61 Speedup Speedup EEMBC CoreMark very short running embedded benchmarks [worst-case scenario] Baseline a2time01 aifftr01 aifirf01 aiifft01 autcor00 basefp01 bezier01 bitmnp01 cacheb01 coremark cjpeg canrdr01 Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, conven00 dither01 djpeg fbital00 fft00 15 idctrn01 iirflt01 matrix01 ospf pktflow pntrch01 puwmod01 rgbcmy01 rgbhpg01 rgbyiq01 rotate01 routelookup rspeed01 Measured on standard x86 quad-core machine tblook01 text01 ttsprk01 viterb00 average

62 Speedup Speedup EEMBC CoreMark very short running embedded benchmarks [worst-case scenario] ,MIPS, ,x Baseline a2time01 aifftr01 aifirf01 aiifft01 autcor00 basefp01 bezier01 bitmnp01 cacheb01 coremark cjpeg canrdr01 Interpreted(only,Execu1on Execu1on,using,concurrent,JIT,r Execu1on,using,concurrent,and,parallel,JIT,r, conven00 dither01 djpeg fbital00 fft00 15 idctrn01 iirflt01 matrix01 ospf pktflow pntrch01 puwmod01 rgbcmy01 rgbhpg01 rgbyiq01 rotate01 routelookup rspeed01 Measured on standard x86 quad-core machine tblook01 text01 ttsprk01 viterb00 average

63 How far does it scale? What is a sensible number of JIT compilation threads? blastp tcoffee %,of,benchmarks,benefi1ng,from,n,jit,compila1on,threads Speedup Speedup % 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Cumula1ve,Histogram Number,of,JIT,Compila1on,Threads Number,of,JIT,Compila1on,Threads Number,of,JIT,Compila1on,Threads gcc perlbench bzip Speedup Speedup Speedup Number,of,JIT,Compila1on,Threads Number,of,JIT,Compila1on,Threads Number,of,JIT,Compila1on,Threads 16 Measured on a 16-core machine

64 Effect of Concurrent and Parallel JIT Compilation on Throughput 17

65 Effect of Concurrent and Parallel JIT Compilation on Throughput gcc - Regions d Regions compiled T1 Regions compiled T ,Interval 17

66 Effect of Concurrent and Parallel JIT Compilation on Throughput gcc - Regions d Regions compiled T1 Regions compiled T ,Interval 17

67 Effect of Concurrent and Parallel JIT Compilation on Throughput gcc - Regions d Regions compiled T1 Regions compiled T ,Interval 17

68 Effect of Concurrent and Parallel JIT Compilation on Throughput gcc - Regions d Regions compiled T1 Regions compiled T higher throughput ,Interval 17

69 Effect of Concurrent and Parallel JIT Compilation on Throughput 403.gcc - Regions d 403.gcc - Average Queue Length Regions compiled T1 Regions compiled T Avg Queue Length T1 Avg Queue Length T3 80 higher throughput ,Interval,Interval 17

70 Effect of Concurrent and Parallel JIT Compilation on Throughput 403.gcc - Regions d 403.gcc - Average Queue Length Regions compiled T1 Regions compiled T Avg Queue Length T1 Avg Queue Length T3 80 higher throughput ,Interval,Interval 17

71 Effect of Concurrent and Parallel JIT Compilation on Throughput 403.gcc - Regions d 403.gcc - Average Queue Length Regions compiled T1 Regions compiled T Avg Queue Length T1 Avg Queue Length T3 80 higher throughput ,Interval,Interval 17

72 Effect of Concurrent and Parallel JIT Compilation on Throughput 403.gcc - Regions d 403.gcc - Average Queue Length Regions compiled T1 Regions compiled T Avg Queue Length T1 Avg Queue Length T3 80 higher throughput 800 smaller queue length ,Interval,Interval 17

73 Does this scale for multi-threaded/core applications? 18

74 19 Concurrent and Parallel JIT Compilation in Action (trace sharing) Regions Thread1 T1 Region 1 Region 1 Region 2 A Region 2 Region 4 Region 3 C Region 3 Region 4 Region 7 Region 6 Region 5 B Region 5 Region 6 Region 7 Region 12 Region 11 Region 10 Region 9 Region 8 D Region 8 Region 9 Region 10 Region 11 Region 12 Code Execution Thread 1 T1 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Tracing Tracing Regions Thread 2 T2 Region 2 Region 1 A Region 1 Region 2 Region 4 Region 3 B Region 3 Region 4 Region 7 Region 6 Region 5 C Region 5 Region 6 Region 7 Region 9 Region 8 Region 8 Region 9 Region 10 D Region 10 Thread 2 T2 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Tracing Tracing Tracing

75 19 Concurrent and Parallel JIT Compilation in Action (trace sharing) Regions Thread1 T1 Region 1 Region 1 Region 2 A Region 2 Region 4 Region 3 C Region 3 Region 4 Region 7 Region 6 Region 5 B Region 5 Region 6 Region 7 Region 12 Region 11 Region 10 Region 9 Region 8 D Region 8 Region 9 Region 10 Region 11 Region 12 Code Execution Thread 1 T1 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Tracing Tracing Regions Thread 2 T2 Region 2 Region 1 A Region 1 Region 2 Region 4 Region 3 B Region 3 Region 4 Region 7 Region 6 Region 5 C Region 5 Region 6 Region 7 Region 9 Region 8 Region 8 Region 9 Region 10 D Region 10 Thread 2 T2 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Tracing Tracing Tracing Shared Regions A B C D Dynamic Compilation Worker Thread 1 Dynamic Compilation Worker Thread 2 Dynamic Compilation Worker Thread 3

76 19 Concurrent and Parallel JIT Compilation in Action (trace sharing) Regions Thread1 T1 Region 1 Region 1 Region 2 A Region 2 Region 4 Region 3 C Region 3 Region 4 Region 7 Region 6 Region 5 B Region 5 Region 6 Region 7 Region 12 Region 11 Region 10 Region 9 Region 8 D Region 8 Region 9 Region 10 Region 11 Region 12 Code Execution Thread 1 T1 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Tracing Tracing Regions Thread 2 T2 Region 2 Region 1 A Region 1 Region 2 Region 4 Region 3 B Region 3 Region 4 Region 7 Region 6 Region 5 C Region 5 Region 6 Region 7 Region 9 Region 8 Region 8 Region 9 Region 10 D Region 10 Thread 2 T2 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Tracing Tracing Tracing Shared Regions A B C D Dynamic Compilation Worker Thread 1 1 Tag Entry Region In Translation Tag Existing Entry A T1 Region 2 Region Tagged for Multiple Threads Register Translation for T1 and T2 B T1 T2 T1 Region 1 Region 3 Region 4 Dynamic Compilation Worker Thread 2 A T2 Region 1 T2 Region 4 Dynamic Compilation Worker Thread 3 T2 Region 2 C Region 3 T1

77 19 Concurrent and Parallel JIT Compilation in Action (trace sharing) Regions Thread1 T1 Region 1 Region 1 Region 2 A Region 2 Region 4 Region 3 C Region 3 Region 4 Region 7 Region 6 Region 5 B Region 5 Region 6 Region 7 Region 12 Region 11 Region 10 Region 9 Region 8 D Region 8 Region 9 Region 10 Region 11 Region 12 Code Execution Thread 1 T1 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Tracing Tracing Regions Thread 2 T2 Region 2 Region 1 A Region 1 Region 2 Region 4 Region 3 B Region 3 Region 4 Region 7 Region 6 Region 5 C Region 5 Region 6 Region 7 Region 9 Region 8 Region 8 Region 9 Region 10 D Region 10 Thread 2 T2 Interval 1 Interval 2 Interval 3 Interval 4 Interval 5 Tracing Tracing Tracing 1 Region In Translation Tag Existing Entry 2 Regions Already Translated Retrieve from Translation Cache Shared Regions A B C D Dynamic Compilation Worker Thread 1 Tag Entry A T1 Region 2 Region Tagged for Multiple Threads Register Translation for T1 and T2 B T1 T2 T1 Region 1 Region 3 Region 4 C T2 Region 5 T2 Region 6 B T1 Region 5 D T1 T1 T1 Region 7 Region 8 Region 11 D T2 Region 10 Dynamic Compilation Worker Thread 2 A T2 Region 1 T2 Region 4 T2 T2 T1 T1 Region 7 Region8Region 12 Region 10 Dynamic Compilation Worker Thread 3 T2 Region 2 C Region 3 T1 T1 T2 Region 6 Region 9 T1 Region 9

78 Conclusions Novel interval based region code discovery scheme enables concurrent and parallel JIT compilation and is able to deliver: average reduction of execution time of 11.5% - and up to 51.9% across 60 industry standard benchmarks we minimise JIT compilation overhead and effectively hide compilation latency by combining: light-weight interval based tracing dynamic work scheduling adaptive hotspot threshold selection concurrent and parallel JIT compilation 20

79 Demos Video Decoding and Playback 21

80 Demos Video Decoding and Playback Full System OS Simulation 21

81 Thank You 22

Achieving QoS in Server Virtualization

Achieving QoS in Server Virtualization Achieving QoS in Server Virtualization Intel Platform Shared Resource Monitoring/Control in Xen Chao Peng (chao.p.peng@intel.com) 1 Increasing QoS demand in Server Virtualization Data center & Cloud infrastructure

More information

Compiler-Assisted Binary Parsing

Compiler-Assisted Binary Parsing Compiler-Assisted Binary Parsing Tugrul Ince tugrul@cs.umd.edu PD Week 2012 26 27 March 2012 Parsing Binary Files Binary analysis is common for o Performance modeling o Computer security o Maintenance

More information

Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 64 Architecture

Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 64 Architecture Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 64 Architecture Dong Ye David Kaeli Northeastern University Joydeep Ray Christophe Harle AMD Inc. IISWC 2006 1 Outline Motivation

More information

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Kathlene Hurt and Eugene John Department of Electrical and Computer Engineering University of Texas at San Antonio

More information

An OS-oriented performance monitoring tool for multicore systems

An OS-oriented performance monitoring tool for multicore systems An OS-oriented performance monitoring tool for multicore systems J.C. Sáez, J. Casas, A. Serrano, R. Rodríguez-Rodríguez, F. Castro, D. Chaver, M. Prieto-Matias Department of Computer Architecture Complutense

More information

How Much Power Oversubscription is Safe and Allowed in Data Centers?

How Much Power Oversubscription is Safe and Allowed in Data Centers? How Much Power Oversubscription is Safe and Allowed in Data Centers? Xing Fu 1,2, Xiaorui Wang 1,2, Charles Lefurgy 3 1 EECS @ University of Tennessee, Knoxville 2 ECE @ The Ohio State University 3 IBM

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 6 Fundamentals in Performance Evaluation Computer Architecture Part 6 page 1 of 22 Prof. Dr. Uwe Brinkschulte,

More information

Secure Cloud Computing: The Monitoring Perspective

Secure Cloud Computing: The Monitoring Perspective Secure Cloud Computing: The Monitoring Perspective Peng Liu Penn State University 1 Cloud Computing is Less about Computer Design More about Use of Computing (UoC) CPU, OS, VMM, PL, Parallel computing

More information

Online Adaptation for Application Performance and Efficiency

Online Adaptation for Application Performance and Efficiency Online Adaptation for Application Performance and Efficiency A Dissertation Proposal by Jason Mars 20 November 2009 Submitted to the graduate faculty of the Department of Computer Science at the University

More information

Compilers and Tools for Software Stack Optimisation

Compilers and Tools for Software Stack Optimisation Compilers and Tools for Software Stack Optimisation EJCP 2014 2014/06/20 christophe.guillon@st.com Outline Compilers for a Set-Top-Box Compilers Potential Auto Tuning Tools Dynamic Program instrumentation

More information

An Event-Driven Multithreaded Dynamic Optimization Framework

An Event-Driven Multithreaded Dynamic Optimization Framework In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Sept. 2005. An Event-Driven Multithreaded Dynamic Optimization Framework Weifeng Zhang Brad Calder

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Department of Electrical Engineering Princeton University {carolewu, mrm}@princeton.edu Abstract

More information

FACT: a Framework for Adaptive Contention-aware Thread migrations

FACT: a Framework for Adaptive Contention-aware Thread migrations FACT: a Framework for Adaptive Contention-aware Thread migrations Kishore Kumar Pusukuri Department of Computer Science and Engineering University of California, Riverside, CA 92507. kishore@cs.ucr.edu

More information

MCCCSim A Highly Configurable Multi Core Cache Contention Simulator

MCCCSim A Highly Configurable Multi Core Cache Contention Simulator MCCCSim A Highly Configurable Multi Core Cache Contention Simulator Michael Zwick, Marko Durkovic, Florian Obermeier, Walter Bamberger and Klaus Diepold Lehrstuhl für Datenverarbeitung Technische Universität

More information

Memory Access Control in Multiprocessor for Real-time Systems with Mixed Criticality

Memory Access Control in Multiprocessor for Real-time Systems with Mixed Criticality Memory Access Control in Multiprocessor for Real-time Systems with Mixed Criticality Heechul Yun +, Gang Yao +, Rodolfo Pellizzoni *, Marco Caccamo +, Lui Sha + University of Illinois at Urbana and Champaign

More information

Protean Code: Achieving Near-free Online Code Transformations for Warehouse Scale Computers

Protean Code: Achieving Near-free Online Code Transformations for Warehouse Scale Computers Protean Code: Achieving Near-free Online Code Transformations for Warehouse Scale Computers Michael A. Laurenzano, Yunqi Zhang, Lingjia Tang, Jason Mars Advanced Computer Architecture Laboratory, University

More information

Wiggins/Redstone: An On-line Program Specializer

Wiggins/Redstone: An On-line Program Specializer Wiggins/Redstone: An On-line Program Specializer Dean Deaver Rick Gorton Norm Rubin {dean.deaver,rick.gorton,norm.rubin}@compaq.com Hot Chips 11 Wiggins/Redstone 1 W/R is a Software System That: u Makes

More information

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Fine-Grained User-Space Security Through Virtualization. Mathias Payer and Thomas R. Gross ETH Zurich

Fine-Grained User-Space Security Through Virtualization. Mathias Payer and Thomas R. Gross ETH Zurich Fine-Grained User-Space Security Through Virtualization Mathias Payer and Thomas R. Gross ETH Zurich Motivation Applications often vulnerable to security exploits Solution: restrict application access

More information

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

More information

Comparative Performance Review of SHA-3 Candidates

Comparative Performance Review of SHA-3 Candidates Comparative Performance Review of the SHA-3 Second-Round Candidates Cryptolog International Second SHA-3 Candidate Conference Outline sphlib sphlib sphlib is an open-source implementation of many hash

More information

When Prefetching Works, When It Doesn t, and Why

When Prefetching Works, When It Doesn t, and Why When Prefetching Works, When It Doesn t, and Why JAEKYU LEE, HYESOON KIM, and RICHARD VUDUC, Georgia Institute of Technology In emerging and future high-end processor systems, tolerating increasing cache

More information

The Effect of Input Data on Program Vulnerability

The Effect of Input Data on Program Vulnerability The Effect of Input Data on Program Vulnerability Vilas Sridharan and David R. Kaeli Department of Electrical and Computer Engineering Northeastern University {vilas, kaeli}@ece.neu.edu I. INTRODUCTION

More information

secubt : Hacking the Hackers with User-Space Virtualization

secubt : Hacking the Hackers with User-Space Virtualization secubt : Hacking the Hackers with User-Space Virtualization Mathias Payer Department of Computer Science ETH Zurich Abstract In the age of coordinated malware distribution and zero-day exploits security

More information

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: Clock cycle where: Clock rate = 1 / clock cycle f = 1 /C

More information

Thwarting Cache Side-Channel Attacks Through Dynamic Software Diversity

Thwarting Cache Side-Channel Attacks Through Dynamic Software Diversity Thwarting Cache Side-Channel Attacks Through Dynamic Software Diversity Stephen Crane, Andrei Homescu, Stefan Brunthaler, Per Larsen, and Michael Franz University of California, Irvine {sjcrane, ahomescu,

More information

VLIW Processors. VLIW Processors

VLIW Processors. VLIW Processors 1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW

More information

Schedulability Analysis for Memory Bandwidth Regulated Multicore Real-Time Systems

Schedulability Analysis for Memory Bandwidth Regulated Multicore Real-Time Systems Schedulability for Memory Bandwidth Regulated Multicore Real-Time Systems Gang Yao, Heechul Yun, Zheng Pei Wu, Rodolfo Pellizzoni, Marco Caccamo, Lui Sha University of Illinois at Urbana-Champaign, USA.

More information

AppDynamics Lite Performance Benchmark. For KonaKart E-commerce Server (Tomcat/JSP/Struts)

AppDynamics Lite Performance Benchmark. For KonaKart E-commerce Server (Tomcat/JSP/Struts) AppDynamics Lite Performance Benchmark For KonaKart E-commerce Server (Tomcat/JSP/Struts) At AppDynamics, we constantly run a lot of performance overhead tests and benchmarks with all kinds of Java/J2EE

More information

Selective Hardware/Software Memory Virtualization

Selective Hardware/Software Memory Virtualization Selective Hardware/Software Memory Virtualization Xiaolin Wang Dept. of Computer Science and Technology, Peking University, Beijing, China, 100871 wxl@pku.edu.cn Jiarui Zang Dept. of Computer Science and

More information

Introduction to the NI Real-Time Hypervisor

Introduction to the NI Real-Time Hypervisor Introduction to the NI Real-Time Hypervisor 1 Agenda 1) NI Real-Time Hypervisor overview 2) Basics of virtualization technology 3) Configuring and using Real-Time Hypervisor systems 4) Performance and

More information

Virtualizing Performance Asymmetric Multi-core Systems

Virtualizing Performance Asymmetric Multi-core Systems Virtualizing Performance Asymmetric Multi- Systems Youngjin Kwon, Changdae Kim, Seungryoul Maeng, and Jaehyuk Huh Computer Science Department, KAIST {yjkwon and cdkim}@calab.kaist.ac.kr, {maeng and jhhuh}@kaist.ac.kr

More information

PROBLEMS. which was discussed in Section 1.6.3.

PROBLEMS. which was discussed in Section 1.6.3. 22 CHAPTER 1 BASIC STRUCTURE OF COMPUTERS (Corrisponde al cap. 1 - Introduzione al calcolatore) PROBLEMS 1.1 List the steps needed to execute the machine instruction LOCA,R0 in terms of transfers between

More information

STAILIZER and Its Effectiveness

STAILIZER and Its Effectiveness STABILIZER: Statistically Sound Performance Evaluation Charlie Curtsinger Emery D. Berger Department of Computer Science University of Massachusetts Amherst Amherst, MA 01003 {charlie,emery}@cs.umass.edu

More information

Validating Java for Safety-Critical Applications

Validating Java for Safety-Critical Applications Validating Java for Safety-Critical Applications Jean-Marie Dautelle * Raytheon Company, Marlborough, MA, 01752 With the real-time extensions, Java can now be used for safety critical systems. It is therefore

More information

HOTPATH VM. An Effective JIT Compiler for Resource-constrained Devices

HOTPATH VM. An Effective JIT Compiler for Resource-constrained Devices HOTPATH VM An Effective JIT Compiler for Resource-constrained Devices based on the paper by Andreas Gal, Christian W. Probst and Michael Franz, VEE 06 INTRODUCTION INTRODUCTION Most important factor: speed

More information

Using the LLVM Compiler Infrastructure for Optimised, Asynchronous Dynamic Translation in Qemu

Using the LLVM Compiler Infrastructure for Optimised, Asynchronous Dynamic Translation in Qemu Using the LLVM Compiler Infrastructure for Optimised, Asynchronous Dynamic Translation in Qemu By Andrew Jeffery A thesis submitted to University of Adelaide in partial fulfilment of the Degree of Bachelor

More information

Static Analysis of Virtualization- Obfuscated Binaries

Static Analysis of Virtualization- Obfuscated Binaries Static Analysis of Virtualization- Obfuscated Binaries Johannes Kinder School of Computer and Communication Sciences École Polytechnique Fédérale de Lausanne (EPFL), Switzerland Virtualization Obfuscation

More information

Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux

Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux Third Workshop on Computer Architecture and Operating System co-design Paris, 25.01.2012 Tobias Beisel, Tobias Wiersema,

More information

Characterizing the Unique and Diverse Behaviors in Existing and Emerging General-Purpose and Domain-Specific Benchmark Suites

Characterizing the Unique and Diverse Behaviors in Existing and Emerging General-Purpose and Domain-Specific Benchmark Suites Characterizing the Unique and Diverse Behaviors in Existing and Emerging General-Purpose and Domain-Specific Benchmark Suites Kenneth Hoste Lieven Eeckhout ELIS Department, Ghent University Sint-Pietersnieuwstraat

More information

Building a More Efficient Data Center from Servers to Software. Aman Kansal

Building a More Efficient Data Center from Servers to Software. Aman Kansal Building a More Efficient Data Center from Servers to Software Aman Kansal Data centers growing in number, Microsoft has more than 10 and less than 100 DCs worldwide Quincy Chicago Dublin Amsterdam Japan

More information

Thread level parallelism

Thread level parallelism Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process

More information

Dynamic Virtual Machine Scheduling in Clouds for Architectural Shared Resources

Dynamic Virtual Machine Scheduling in Clouds for Architectural Shared Resources Dynamic Virtual Machine Scheduling in Clouds for Architectural Shared Resources JeongseobAhn,Changdae Kim, JaeungHan,Young-ri Choi,and JaehyukHuh KAIST UNIST {jeongseob, cdkim, juhan, and jhuh}@calab.kaist.ac.kr

More information

Application Performance Monitoring: Trade-Off between Overhead Reduction and Maintainability

Application Performance Monitoring: Trade-Off between Overhead Reduction and Maintainability Application Performance Monitoring: Trade-Off between Overhead Reduction and Maintainability Jan Waller, Florian Fittkau, and Wilhelm Hasselbring 2014-11-27 Waller, Fittkau, Hasselbring Application Performance

More information

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle? Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

More information

Intel Pentium 4 Processor on 90nm Technology

Intel Pentium 4 Processor on 90nm Technology Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended

More information

A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters

A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters Hui Wang, Canturk Isci, Lavanya Subramanian, Jongmoo Choi, Depei Qian, Onur Mutlu Beihang University, IBM Thomas J. Watson

More information

Introducing EEMBC Cloud and Big Data Server Benchmarks

Introducing EEMBC Cloud and Big Data Server Benchmarks Introducing EEMBC Cloud and Big Data Server Benchmarks Quick Background: Industry-Standard Benchmarks for the Embedded Industry EEMBC formed in 1997 as non-profit consortium Defining and developing application-specific

More information

HQEMU: A Multi-Threaded and Retargetable Dynamic Binary Translator on Multicores

HQEMU: A Multi-Threaded and Retargetable Dynamic Binary Translator on Multicores H: A Multi-Threaded and Retargetable Dynamic Binary Translator on Multicores Ding-Yong Hong National Tsing Hua University Institute of Information Science Academia Sinica, Taiwan dyhong@sslab.cs.nthu.edu.tw

More information

THeME: A System for Testing by Hardware Monitoring Events

THeME: A System for Testing by Hardware Monitoring Events THeME: A System for Testing by Hardware Monitoring Events Kristen Walcott-Justice University of Virginia Charlottesville, VA USA walcott@cs.virginia.edu Jason Mars University of Virginia Charlottesville,

More information

Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

More information

Cache Capacity and Memory Bandwidth Scaling Limits of Highly Threaded Processors

Cache Capacity and Memory Bandwidth Scaling Limits of Highly Threaded Processors Cache Capacity and Memory Bandwidth Scaling Limits of Highly Threaded Processors Jeff Stuecheli 12 Lizy Kurian John 1 1 Department of Electrical and Computer Engineering, University of Texas at Austin

More information

Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches

Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches Manu Awasthi, Kshitij Sudan, Rajeev Balasubramonian, John Carter School of Computing,

More information

A Dynamic Binary Translation System in a Client/Server Environment

A Dynamic Binary Translation System in a Client/Server Environment A Dynamic Binary Translation System in a Client/Server Environment Chun-Chen Hsu a, Ding-Yong Hong b,, Wei-Chung Hsu a, Pangfeng Liu a, Jan-Jan Wu b a Department of Computer Science and Information Engineering,

More information

picojava TM : A Hardware Implementation of the Java Virtual Machine

picojava TM : A Hardware Implementation of the Java Virtual Machine picojava TM : A Hardware Implementation of the Java Virtual Machine Marc Tremblay and Michael O Connor Sun Microelectronics Slide 1 The Java picojava Synergy Java s origins lie in improving the consumer

More information

Performance Tuning and Optimizing SQL Databases 2016

Performance Tuning and Optimizing SQL Databases 2016 Performance Tuning and Optimizing SQL Databases 2016 http://www.homnick.com marketing@homnick.com +1.561.988.0567 Boca Raton, Fl USA About this course This four-day instructor-led course provides students

More information

Carlos Villavieja, Nacho Navarro {cvillavi,nacho}@ac.upc.edu. Arati Baliga, Liviu Iftode {aratib,liviu}@cs.rutgers.edu

Carlos Villavieja, Nacho Navarro {cvillavi,nacho}@ac.upc.edu. Arati Baliga, Liviu Iftode {aratib,liviu}@cs.rutgers.edu Continuous Monitoring using MultiCores Carlos Villavieja, Nacho Navarro {cvillavi,nacho}@ac.upc.edu Arati Baliga, Liviu Iftode {aratib,liviu}@cs.rutgers.edu Motivation Intrusion detection Intruder gets

More information

Managing GPU Concurrency in Heterogeneous Architectures

Managing GPU Concurrency in Heterogeneous Architectures Managing Concurrency in Heterogeneous Architectures Onur Kayıran Nachiappan Chidambaram Nachiappan Adwait Jog Rachata Ausavarungnirun Mahmut T. Kandemir Gabriel H. Loh Onur Mutlu Chita R. Das The Pennsylvania

More information

Performance Analysis and Optimization Tool

Performance Analysis and Optimization Tool Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop

More information

15-418 Final Project Report. Trading Platform Server

15-418 Final Project Report. Trading Platform Server 15-418 Final Project Report Yinghao Wang yinghaow@andrew.cmu.edu May 8, 214 Trading Platform Server Executive Summary The final project will implement a trading platform server that provides back-end support

More information

Safe Software Updates via Multi-version Execution

Safe Software Updates via Multi-version Execution Safe Software Updates via Multi-version Execution Petr Hosek Cristian Cadar Department of Computing Imperial College London {p.hosek, c.cadar}@imperial.ac.uk Abstract Software systems are constantly evolving,

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine

More information

Java Virtual Machine: the key for accurated memory prefetching

Java Virtual Machine: the key for accurated memory prefetching Java Virtual Machine: the key for accurated memory prefetching Yolanda Becerra Jordi Garcia Toni Cortes Nacho Navarro Computer Architecture Department Universitat Politècnica de Catalunya Barcelona, Spain

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Benchmarking the Amazon Elastic Compute Cloud (EC2)

Benchmarking the Amazon Elastic Compute Cloud (EC2) Benchmarking the Amazon Elastic Compute Cloud (EC2) A Major Qualifying Project Report submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the

More information

Armed E-Bunny: A Selective Dynamic Compiler for Embedded Java Virtual Machine Targeting ARM Processors

Armed E-Bunny: A Selective Dynamic Compiler for Embedded Java Virtual Machine Targeting ARM Processors 2005 ACM Symposium on Applied Computing Armed E-Bunny: A Selective Dynamic Compiler for Embedded Java Virtual Machine Targeting ARM Processors Mourad Debbabi Computer Security Research Group CIISE, Concordia

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Operating System Impact on SMT Architecture

Operating System Impact on SMT Architecture Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th

More information

Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems

Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems A. Carbon, Y. Lhuillier, H.-P. Charles CEA LIST DACLE division Embedded Computing Embedded Software Laboratories France

More information

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based

More information

Pydgin: Generating Fast Instruction Set Simulators from Simple Architecture Descriptions with Meta-Tracing JIT Compilers

Pydgin: Generating Fast Instruction Set Simulators from Simple Architecture Descriptions with Meta-Tracing JIT Compilers Appears in the Proceedings of the Int l Symp. on Performance Analysis of Systems and Software (ISPASS-2015), March 2015 Pydgin: Generating Fast Instruction Set Simulators from Simple Architecture Descriptions

More information

How Much Power Oversubscription is Safe and Allowed in Data Centers?

How Much Power Oversubscription is Safe and Allowed in Data Centers? How Much Power Oversubscription is Safe and Allowed in Data Centers? Xing Fu, Xiaorui Wang University of Tennessee, Knoxville, TN 37996 The Ohio State University, Columbus, OH 43210 {xfu1, xwang}@eecs.utk.edu

More information

A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors

A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors Leonid Domnitser, Nael Abu-Ghazaleh and Dmitry Ponomarev Department of Computer Science SUNY-Binghamton {lenny,

More information

Lua as a business logic language in high load application. Ilya Martynov ilya@iponweb.net CTO at IPONWEB

Lua as a business logic language in high load application. Ilya Martynov ilya@iponweb.net CTO at IPONWEB Lua as a business logic language in high load application Ilya Martynov ilya@iponweb.net CTO at IPONWEB Company background Ad industry Custom development Technical platform with multiple components Custom

More information

Practical taint analysis for protecting buggy binaries

Practical taint analysis for protecting buggy binaries Practical taint analysis for protecting buggy binaries So your exploit beats ASLR/DEP? I don't care Erik Bosman Traditional Stack Smashing buf[16] GET / HTTP/1.100baseretnarg1arg2 Traditional

More information

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment Nuno A. Carvalho, João Bordalo, Filipe Campos and José Pereira HASLab / INESC TEC Universidade do Minho MW4SOC 11 December

More information

Performance And Scalability In Oracle9i And SQL Server 2000

Performance And Scalability In Oracle9i And SQL Server 2000 Performance And Scalability In Oracle9i And SQL Server 2000 Presented By : Phathisile Sibanda Supervisor : John Ebden 1 Presentation Overview Project Objectives Motivation -Why performance & Scalability

More information

Replication on Virtual Machines

Replication on Virtual Machines Replication on Virtual Machines Siggi Cherem CS 717 November 23rd, 2004 Outline 1 Introduction The Java Virtual Machine 2 Napper, Alvisi, Vin - DSN 2003 Introduction JVM as state machine Addressing non-determinism

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

Enterprise Applications

Enterprise Applications Enterprise Applications Chi Ho Yue Sorav Bansal Shivnath Babu Amin Firoozshahian EE392C Emerging Applications Study Spring 2003 Functionality Online Transaction Processing (OLTP) Users/apps interacting

More information

A Comparison of Task Pools for Dynamic Load Balancing of Irregular Algorithms

A Comparison of Task Pools for Dynamic Load Balancing of Irregular Algorithms A Comparison of Task Pools for Dynamic Load Balancing of Irregular Algorithms Matthias Korch Thomas Rauber Universität Bayreuth Fakultät für Mathematik, Physik und Informatik Fachgruppe Informatik {matthias.korch,

More information

Contributions to Gang Scheduling

Contributions to Gang Scheduling CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

Multi-core Programming System Overview

Multi-core Programming System Overview Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

Data Structure Oriented Monitoring for OpenMP Programs

Data Structure Oriented Monitoring for OpenMP Programs A Data Structure Oriented Monitoring Environment for Fortran OpenMP Programs Edmond Kereku, Tianchao Li, Michael Gerndt, and Josef Weidendorfer Institut für Informatik, Technische Universität München,

More information

Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays

Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays Database Solutions Engineering By Murali Krishnan.K Dell Product Group October 2009

More information

LSN 2 Computer Processors

LSN 2 Computer Processors LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2

More information

General Introduction

General Introduction Managed Runtime Technology: General Introduction Xiao-Feng Li (xiaofeng.li@gmail.com) 2012-10-10 Agenda Virtual machines Managed runtime systems EE and MM (JIT and GC) Summary 10/10/2012 Managed Runtime

More information

Agile Performance Testing

Agile Performance Testing Agile Performance Testing Cesario Ramos Independent Consultant AgiliX Agile Development Consulting Overview Why Agile performance testing? Nature of performance testing Agile performance testing Why Agile

More information

HVM TP : A Time Predictable and Portable Java Virtual Machine for Hard Real-Time Embedded Systems JTRES 2014

HVM TP : A Time Predictable and Portable Java Virtual Machine for Hard Real-Time Embedded Systems JTRES 2014 : A Time Predictable and Portable Java Virtual Machine for Hard Real-Time Embedded Systems JTRES 2014 Kasper Søe Luckow 1 Bent Thomsen 1 Stephan Erbs Korsholm 2 1 Department of Computer Science Aalborg

More information

DELL. Virtual Desktop Infrastructure Study END-TO-END COMPUTING. Dell Enterprise Solutions Engineering

DELL. Virtual Desktop Infrastructure Study END-TO-END COMPUTING. Dell Enterprise Solutions Engineering DELL Virtual Desktop Infrastructure Study END-TO-END COMPUTING Dell Enterprise Solutions Engineering 1 THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL

More information

Stock Trader System. Architecture Description

Stock Trader System. Architecture Description Stock Trader System Architecture Description Michael Stevens mike@mestevens.com http://www.mestevens.com Table of Contents 1. Purpose of Document 2 2. System Synopsis 2 3. Current Situation and Environment

More information

XtratuM hypervisor redesign for LEON4 multicore processor

XtratuM hypervisor redesign for LEON4 multicore processor XtratuM hypervisor redesign for LEON4 multicore processor E.Carrascosa, M.Masmano, P.Balbastre and A.Crespo Universidad Politécnica de Valencia, Spain Outline Motivation/Introduction XtratuM hypervisor

More information

Driving force. What future software needs. Potential research topics

Driving force. What future software needs. Potential research topics Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #

More information

Architectural Support for Software-Defined Metadata Processing

Architectural Support for Software-Defined Metadata Processing Architectural Support for Software-Defined Metadata Processing Udit Dhawan 1 Cătălin Hriţcu 2 Raphael Rubin 1 Nikos Vasilakis 1 Silviu Chiricescu 3 Jonathan M. Smith 1 Thomas F. Knight Jr. 4 Benjamin C.

More information

The Design of the Inferno Virtual Machine. Introduction

The Design of the Inferno Virtual Machine. Introduction The Design of the Inferno Virtual Machine Phil Winterbottom Rob Pike Bell Labs, Lucent Technologies {philw, rob}@plan9.bell-labs.com http://www.lucent.com/inferno Introduction Virtual Machine are topical

More information

PROBLEMS #20,R0,R1 #$3A,R2,R4

PROBLEMS #20,R0,R1 #$3A,R2,R4 506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,

More information