Improving the performance of data servers on multicore architectures. Fabien Gaud

Transcription

1 Improving the performance of data servers on multicore architectures Fabien Gaud Grenoble University Advisors: Jean-Bernard Stefani, Renaud Lachaize and Vivien Quéma Sardes (INRIA/LIG) December 2, / 50

2 Processor evolution Before 2006: One core Regular increase of clock frequency Since then: Almost no increase of clock frequency Increasing number of cores: Multicore architectures NUMA architectures Manycore architectures 2 / 50

3 Multicore is a hot topic Legacy applications do not efficiently leverage multicore hardware Research topics: Programming models/languages Operating systems abstractions/internals Runtime/libraries Applications Active research field: Corey (OSDI 08) Barrelfish (SOSP 09), Helios (SOSP 09) PK (OSDI 10) 3 / 50

4 This thesis Application domain: data servers, a.k.a. networked services Goal: Improve the performance of data servers on multicore architectures Contributions: Efficient multicore event-driven programming Scaling the Apache Web server on NUMA multicore systems 4 / 50

5 #1: Efficient multicore event-driven programming CFSE 2009 (best paper award) ICDCS / 50

6 Event-driven programming Application is structured as a set of handlers processing events An event can be: Triggered by an I/O operation Produced internally by the application Events are stored in a queue and processed by a single thread Handler 1 Control Loop Handler 2 Event Handler 3 Handler 4 6 / 50

7 Multicore event-driven programming Goal: concurrently execute multiple handlers Challenges: Concurrency management Balancing load on cores Solutions: N-Copy 1-Copy with synchronization 7 / 50

8 N-Copy Principle: running one instance of the application per core App1 App2 App3 App4 Core 1 Core 2 Core 3 Core 4 Control loop Event queue Event 8 / 50

9 N-Copy (2) Advantages: No concurrency management needed No application modification needed Drawbacks: Not applicable to all applications Multiple copies of data Requires external load balancing 9 / 50

10 1-copy with synchronization Principle: 1 instance on multiple cores Concurrency can be managed using: Locks STM Annotations Load balancing can be achieved with: Static placement Workgiving Workstealing Chosen approach is implemented in Libasync-SMP (Usenix 03) 10 / 50

11 Libasync-SMP Concurrency management Annotations (colors) set on events App1 Core 1 Core 2 Core 3 Core 4 Control loop Event queue Events with color 0 Events with color 1 Events with color 2 Events with color 3 11 / 50

12 Libasync-SMP Load balancing Load balancing is done through workstealing App1 Core 1 Core 2 Core 3 Core 4 Control loop Event queue Events with color 0 Events with color 1 Events with color 2 Events with color 3 12 / 50

13 1-Copy with synchronization Advantages: Allows sharing between cores Allows load balancing between cores Drawbacks: Need to modify the application Efficient load balancing is difficult 13 / 50

14 Workstealing performance: SFS Libasync-smp Libasync-smp - WS Throughput (MB/sec) % throughput increase 14 / 50

15 Workstealing performance: Web server Libasync-smp Libasync-smp - WS Throughput (MB/sec) % throughput decrease 15 / 50

16 What is the problem? Fine grain events: Stealing time (197 Kcycles) stolen processing time (20 Kcycles) Inefficient cache usage: +146% L2 cache misses Inefficient workstealing implementation O(n) complexity 16 / 50

17 Contributions New: Workstealing algorithm Runtime implementation Fine grain events: Algorithm: steal events with high execution time Inefficient cache usage: Algorithm: steal cache-friendly events Algorithm: take cache hierarchy into account Inefficient workstealing implementation Runtime: mitigate stealing costs 17 / 50

18 Idea #1: Take into account execution time Problem: stealing cost is not always amortized Many event handlers are relatively fine grain Workstealing may have a significant cost Solution: Time-left stealing Know at any time which colors are worthy (Handler execution time is set by the programmer) 18 / 50

19 Idea #2: Take into account caches Problem: Workstealing can reduce cache efficiency Stealing events increases cache misses Example: event handlers accessing large, long-lived, data sets Solution 1: Penalty-aware stealing Set penalties on handlers based on their cache access pattern (Penalties are set manually based on preliminary profiling) Solution 2: Locality-aware stealing Give priority to a neighbor when stealing 19 / 50

20 Runtime implementation Core X color-queue Control loop Color 0 Color 1 Color 2 Color 3 core-queue stealing-queue One color-queue per color One core-queue per core that links color-queues One stealing-queue per core 20 / 50

21 Performance evaluation: SFS Libasync-smp Libasync-smp - WS Mely - WS Throughput (MB/sec) No throughput degradation 21 / 50

22 Performance evaluation: Web server Libasync-smp Libasync-smp - WS Mely - WS Throughput (MB/sec) % throughput improvement 22 / 50

23 Web server profiling Web server configuration Stealing time Stolen time Cache misses/event Libasync-SMP - WS 197 Kcycles 20 Kcycles 21 Mely - WS 6 Kcycles 23 Kcycles 9 Stealing time (6 Kcycles) < stolen processing time (23 Kcycles) Improved cache efficiency: -57% L2 cache misses 23 / 50

24 Summary Goal: efficient runtime for multicore event-driven systems Problem: workstealing sometimes degrades performance Contributions: New workstealing algorithm New runtime implementation Results: improve throughput by up to 73% 24 / 50

25 #2: Scaling the Apache Web server on NUMA multicore systems Under submission 25 / 50

26 Problem % # of clients per die Ideal scalability Apache # of dies The Apache web server do not scale on NUMA architectures 26 / 50

27 What can we do? Address scalability issues at the OS level Corey (OSDI 08) Barrelfish (SOSP 09) PK (OSDI 10) 27 / 50

28 Apache on PK % # of clients per die Ideal scalability Apache on PK Apache # of dies Does not solve scalability issues 28 / 50

29 What do we propose? Addressing scalability issues at the OS level is not sufficient Application-level issues Some issues are difficult to handle (e.g. scheduling) Approach: address scalability issues at the application level 29 / 50

30 Methodology Consider both hardware and software bottlenecks Hardware bottlenecks: Processor interconnect Distant memory accesses Software bottlenecks: Synchronization primitives 30 / 50

31 Hardware testbed 61 Gb/s C0 C4 C8 C12 C3 C7 C11 C15 I/O L1 L1 L1 L1 24 Gb/s L2 L2 L2 L2 L3 Die 0 Die 1 Die 2 C2 C6 C10 C14 Die 3 C1 C5 C9 C13 I/O 24 Gb/s 4 processors / 16 cores 31 / 50

32 Hardware testbed 61 Gb/s C0 C4 C8 C12 C3 C7 C11 C15 I/O L1 L1 L1 L1 24 Gb/s L2 L2 L2 L2 L3 Die 0 Die 1 Die 2 C2 C6 C10 C14 Die 3 C1 C5 C9 C13 I/O 24 Gb/s 4 processors / 16 cores 31 / 50

33 Hardware bottlenecks Memory efficiency (IPC) Configuration Average IPC 1 die dies % IPC decrease 32 / 50

34 Hardware bottlenecks (2) IPC decrease: Reduced cache efficiency Configuration L3 cache miss ratio (%) 1 die 14 4 dies / 50

35 Hardware bottlenecks (2) IPC decrease: Reduced cache efficiency HyperTransport link saturation Configuration Max HT usage (%) 1 die 25 4 dies / 50

36 Hardware bottlenecks (2) IPC decrease: Reduced cache efficiency HyperTransport link saturation Increased number of distant memory accesses Configuration Distant accesses/kb 1 die 4 4 dies / 50

37 Request processing C0 C5 Die 0 Die 1 Die 2 Die 3 C10 NIC Request Receiving a TCP request 34 / 50

38 Request processing C0 C5 Die 0 Die 1 Die 2 Die 3 C10 NIC HTTP request processing 34 / 50

39 Request processing C0 C5 Die 0 Die 1 Die 2 Die 3 C10 NIC PHP processing 34 / 50

40 Request processing C0 C5 Die 0 Die 1 Die 2 Die 3 C10 NIC Sending the response (1) 34 / 50

41 Request processing C0 C5 Die 0 Die 1 Die 2 Die 3 Reply C10 NIC Sending the response (2) 34 / 50

42 Proposal #1 Solution: co-localizing TCP, Apache and PHP processing Implementation: use one instance of the Apache/PHP stack per die (N-Copy) One node manages 5 network interfaces 35 / 50

43 N-Copy: request processing Die 0 Die 1 Die 2 Die 3 C2 NIC Request Receiving a TCP request 36 / 50

44 N-Copy: request processing Die 0 Die 1 Die 2 Die 3 C2 C6 C10 NIC HTTP request processing 36 / 50

45 N-Copy: request processing Die 0 Die 1 Die 2 Die 3 C2 C6 C10 NIC PHP processing 36 / 50

46 N-Copy: request processing Die 0 Die 1 Die 2 Die 3 C2 C6 C10 NIC Sending the response (1) 36 / 50

47 N-Copy: request processing Die 0 Die 1 Die 2 Die 3 Reply C2 C6 C10 NIC Sending the response (2) 36 / 50

48 N-Copy: performance % # of clients per die Ideal scalability Apache NCopy Apache # of dies 9.1% performance improvement compared to stock Apache 37 / 50

49 N-Copy: performance (2) Configuration Average IPC Distant accesses/kb 1 die dies (Stock Apache) dies (N-Copy) Memory efficiency improved by 20% 38 / 50

50 N-Copy: can we do better? Die Average CPU usage Die Die 1 85 Die 2 85 Die Problem: Dies are not equally efficient Load is not properly balanced on dies 39 / 50

51 N-Copy: load balancing Solution: balance load on dies proportionally to their efficiency Implementation: use an external load balancing mechanism Currently implemented at client-side Could be integrated in a more global solution 40 / 50

52 N-Copy: final performance % # of clients per die Ideal scalability Apache NCopy LB Apache NCopy Apache # of dies 21.2% performance improvement compared to stock Apache 41 / 50

53 Software bottlenecks Goal: find functions that Do not scale Represent a significant execution time Example: Function f accounts for 1 cycle/byte at 1 die 10 cycles/byte at 4 dies 20% of the total execution time 18% potential performance gain 42 / 50

54 Software bottlenecks (2) Function Potential performance gain (%) d lookup 2.49% atomic dec and lock 2.32% lookup mnt 1.41% copy user generic string 0.83% memcpy 0.76% Problem: the VFS layer does not scale Aggregated potential performance gain: 6 % Most of the calls are issued by the stat function 43 / 50

55 Proposal #2 Solution: use an application-level cache to reduce the number of calls to stat Implementation: Modified the Apache ap directory walk function Using inotify for file updates 44 / 50

56 Stat cache: performance % # of clients per die Ideal scalability 2000 Apache with all optims Apache NCopy LB Apache NCopy Apache # of dies 33% performance improvement compared to stock Apache 45 / 50

57 Summary Problem: Apache does not scale on NUMA architectures Contribution: application-level optimizations considering NUMA aspects and Linux scalability issues Results: +33% performance improvement 46 / 50

58 Conclusion 47 / 50

59 Conclusion Application domain: data servers Goal: Improve the performance of data servers on multicore architectures Contributions: Efficient multicore event-driven programming Scaling the Apache Web server on NUMA multicore systems 48 / 50

60 Future work Short term: Workstealing: automate profiling and decisions Apache: study other workloads Long term: Study the impact of distant memory accesses on other servers Study the impact of programming models on multicore performance Study the scalability of the Java virtual machine 49 / 50

61 Questions? 50 / 50

62 Backup Slides 2 / 50

63 Web server Returns static page content (1KB files requested) Closed-loop injection 5 load injectors simulating between 200 and 2000 clients Architecture is based on legacy design Per-connection coloring RegisterFd InEpoll Close Dec Accepted Clients Accept Read Request Parse Request GetFrom Cache Write Response Epoll 3 / 50

64 Web server evaluation Throughput (KRequests/s) Mely - WS Libasync-smp Mely Libasync-smp - WS Number of Clients Up to 73% improvement over the Libasync-SMP workstealing mechanism 4 / 50

65 Mely - Other web server evaluation (2) 200 Mely - WS Userver Apache Throughput (KRequests/s) Number of Clients Performance better than other real world Web servers 5 / 50

66 Apache Workload description SPECWeb2005 Support benchmark Vendor site Mostly static / PHP for dynamic pages Back-end Simulator (BeSim) Closed-loop injection with think times Defined QoS: 99% of clients served within 5s 95% of clients served within 3s Throughput constraints Modified to fit in main memory: 12GB 6 / 50

67 Software configuration Apache Worker version using both threads and processes Sendfile enabled to improve performance PHP Tuned number of PHP processes With eaccelerator Linux NUMA support IRQ processing manually balanced Responsible for dispatching thread and processes 7 / 50