Efficient Workstealing for Multicore Event-Driven Systems

Similar documents
Improving the performance of data servers on multicore architectures. Fabien Gaud

The Design and Implementation of Scalable Parallel Haskell

CHAPTER 1 INTRODUCTION

ORACLE INSTANCE ARCHITECTURE

Control 2004, University of Bath, UK, September 2004

Dynamic Thread Pool based Service Tracking Manager

A Locality Approach to Architecture-aware Task-scheduling in OpenMP

MAGENTO HOSTING Progressive Server Performance Improvements

Thesis Proposal: Improving the Performance of Synchronization in Concurrent Haskell

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

Between Mutual Trust and Mutual Distrust: Practical Fine-grained Privilege Separation in Multithreaded Applications

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Recommendations for Performance Benchmarking

How To Monitor Performance On A Microsoft Powerbook (Powerbook) On A Network (Powerbus) On An Uniden (Powergen) With A Microsatellite) On The Microsonde (Powerstation) On Your Computer (Power

Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

Resource Utilization of Middleware Components in Embedded Systems

Petascale Software Challenges. Piyush Chaudhary High Performance Computing

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Software and the Concurrency Revolution

Weighted Total Mark. Weighted Exam Mark

Web Server Architectures

Parallel Computing. Benson Muite. benson.

Optimizing Shared Resource Contention in HPC Clusters

Shared Parallel File System

RAID 5 rebuild performance in ProLiant

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers

LOAD BALANCING TECHNIQUES FOR RELEASE 11i AND RELEASE 12 E-BUSINESS ENVIRONMENTS

TRACE PERFORMANCE TESTING APPROACH. Overview. Approach. Flow. Attributes

Capriccio: Scalable Threads for Internet Services

A Software Approach to Unifying Multicore Caches

Software design ideas for SoLID

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Safe Kernel Scheduler Development with Bossa

The Advantages of Network Interface Card (NIC) Web Server Configuration Strategy

Data Center Performance Insurance

Improving Compute Farm Throughput in Electronic Design Automation (EDA) Solutions

Carlos Villavieja, Nacho Navarro Arati Baliga, Liviu Iftode

Multiprocessor Support for Event-Driven Programs

On some Potential Research Contributions to the Multi-Core Enterprise

Performance Analysis and Optimization Tool

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

Optimization of Cluster Web Server Scheduling from Site Access Statistics

Improving System Scalability of OpenMP Applications Using Large Page Support

Web Caching With Dynamic Content Abstract When caching is a good idea

Real-Time Monitoring Framework for Parallel Processes

MAQAO Performance Analysis and Optimization Tool

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Multi-GPU Load Balancing for Simulation and Rendering

JBoss Seam Performance and Scalability on Dell PowerEdge 1855 Blade Servers

The CPU Scheduler in VMware vsphere 5.1

Load Balancing MPI Algorithm for High Throughput Applications

Operating Systems, 6 th ed. Test Bank Chapter 7

Petascale Software Challenges. William Gropp

System Software and TinyAUTOSAR

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Hardware Supported Flexible Monitoring: Early Results

Microsoft Robotics Studio

The International Journal Of Science & Technoledge (ISSN X)

Performance measurements of syslog-ng Premium Edition 4 F1

Microsoft SQL Server for Oracle DBAs Course 40045; 4 Days, Instructor-led

Testing Automation for Distributed Applications By Isabel Drost-Fromm, Software Engineer, Elastic

Scalability and Programmability in the Manycore Era

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service

Dynamic Load Balancing. Using Work-Stealing 35.1 INTRODUCTION CHAPTER. Daniel Cederman and Philippas Tsigas

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Resource-aware task scheduling

Parallel Processing and Software Performance. Lukáš Marek

Wisdom from Crowds of Machines

MapReduce and Hadoop Distributed File System

Module 3: Instance Architecture Part 1

Proposal of Dynamic Load Balancing Algorithm in Grid System

MPI and Hybrid Programming Models. William Gropp

Enterprise Applications

What can DDS do for You? Learn how dynamic publish-subscribe messaging can improve the flexibility and scalability of your applications.

Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications

Parallel Computing 37 (2011) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

Evolution of Web Application Architecture International PHP Conference. Kore Nordmann / <kore@qafoo.com> June 9th, 2015

Transactional Memory

HDMQ :Towards In-Order and Exactly-Once Delivery using Hierarchical Distributed Message Queues. Dharmit Patel Faraj Khasib Shiva Srivastava

Comparing the Performance of Web Server Architectures

Design Issues in a Bare PC Web Server

Clustering Versus Shared Nothing: A Case Study

One Server Per City: C Using TCP for Very Large SIP Servers. Kumiko Ono Henning Schulzrinne {kumiko, hgs}@cs.columbia.edu

WIND RIVER SECURE ANDROID CAPABILITY

Efficient Parallel Processing on Public Cloud Servers Using Load Balancing

Useful metrics for Interpreting.NET performance. UKCMG Free Forum 2011 Tuesday 22nd May Session C3a. Introduction

PART III. OPS-based wide area networks

A Locality Approach to Task-scheduling in OpenMP for Tiled Multicore Architectures

Dynamic Resource allocation in Cloud

Rackspace Cloud Databases and Container-based Virtualization

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

NAND Flash Architecture and Specification Trends

Cognos8 Deployment Best Practices for Performance/Scalability. Barnaby Cole Practice Lead, Technical Services

Multi-core and Linux* Kernel

Locating Cache Performance Bottlenecks Using Data Profiling

Chapter 1 Computer System Overview

The Impact of Memory Subsystem Resource Sharing on Datacenter Applications. Lingia Tang Jason Mars Neil Vachharajani Robert Hundt Mary Lou Soffa

Transcription:

Efficient Workstealing for Multicore Event-Driven Systems Fabien Gaud 1, Sylvain Genevès 1, Renaud Lachaize 1, Baptiste Lepers 2, Fabien Mottet 2, Gilles Muller 2, Vivien Quéma 3 1 University of Grenoble 2 INRIA 3 CNRS International Conference on Distributed Computing Systems 2010 1 / 27

Outline 1 Context 2 Evaluation of Libasync-SMP workstealing 3 Contributions 4 Performance evaluation 5 Conclusion 2 / 27

Objectives Application domain : data servers Focus on event-driven programming Multicore architectures are mainstream Exploiting the available hardware parallelism becomes crucial for data server performance Our goal is to provide an efficient multicore runtime for event-driven programming 3 / 27

Event-driven runtime basics Application is structured as a set of handlers processing events. An event can be triggered by an I/O or produced internally The runtime engine repeatedly processes events from its queue Get an event from the runtime s queue Call the associated handler which may produce new events 4 / 27

Multicore event-driven runtime Challenges Helping programmers dealing with concurrency Locks STM Annotations Efficiently dispatching events on cores Static placement Load balancing through workgiving Load balancing through workstealing Libasync-SMP is an annotation-based multicore event-driven runtime 5 / 27

Libasync-SMP [Zeldovich03] One event queue per core Mutual exclusion ensured by annotations on events (colors) Event dispatching on cores Colors are initially dispatched in a round robin manner Load balancing is readjusted through workstealing Core 1 Core 2 Core 3 Core 4 Thread Event queue Color 0 Color 1 Color 2 Color 3 Evaluation on two network servers Workstealing is only evaluated on micro-benchmarks 6 / 27

Outline 1 Context 2 Evaluation of Libasync-SMP workstealing 3 Contributions Improving the workstealing algorithm Making runtime internals workstealing friendly 4 Performance evaluation 5 Conclusion 7 / 27

Expected behavior : the SFS case Many expensive cryptographic operations Good case for workstealing algorithm Example : clients accessing a 200MB file 140 Libasync-smp Libasync-smp - WS 120 Throughput (MB/sec) 100 80 60 40 20 0 35% throughput increase thanks to workstealing 8 / 27

Unwanted behavior : the Web server case Web server serving static content Workstealing costs are noticeable Example : clients accessing 1KB files 200 Libasync-smp Libasync-smp - WS Throughput (KRequests/s) 150 100 50 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Number of Clients 33% throughput decrease due to the workstealing mechanism 9 / 27

Unwanted behavior : the Web server case (2) Web server configuration Stealing time Stolen time Cache misses / event Libasync-SMP without workstealing - - 9 Libasync-SMP with workstealing 197 Kcycles 20 Kcycles 21 Very high stealing costs stolen computing time Very low cache efficiency : +146% L2 cache misses over Libasync-smp without workstealing 10 / 27

Problem statement Naive workstealing can hurt system performance This paper improves workstealing performance for multicore event-driven runtimes Majors differences with workstealing for thread-based runtimes Tasks are more fine grained Sensitivity to stealing costs One core can post tasks to another core Cannot use efficient DEqueue structures [Chase05] Stealing is constrained by colors O(n) workstealing algorithm 11 / 27

Workstealing main steps core set = construct core set(); (1) foreach ( core c in core_set ) { LOCK (c); if(can be stolen(c)) { (2) color = choose colors to steal(c); (3) event set = construct event set(c, color); } UNLOCK (c); if (! is_empty ( event_set )) { LOCK ( myself ); migrate(event set); UNLOCK ( myself ); exit ; } } 12 / 27

Outline 1 Context 2 Evaluation of Libasync-SMP workstealing 3 Contributions Improving the workstealing algorithm Making runtime internals workstealing friendly 4 Performance evaluation 5 Conclusion 13 / 27

Idea #1 : Taking hardware topology into account core_set = construct_core_set (); (1) In a multicore system, some cores usually share caches Time needed to access cached data is significantly faster than accessing them in main memory Idea : Take the cache hierarchy into consideration when stealing Locality-aware stealing Give priority to a neighbor when stealing 14 / 27

Idea #2 : Taking into account computation length if( can_be_stolen (c)) { (2) Many event handlers are relatively fine grain In our context, workstealing may have a significant cost Idea : Stealing some type of events is not beneficial Time-left stealing : know at any time which colors are worthy Handler execution time is currently set by the programmer but could be discovered at runtime 15 / 27

Idea #3 : Taking cache footprint into consideration color = choose_colors_to_steal ( c); (3) Sometime events can be stolen but are not the best candidates For example, event handlers accessing large, long-lived, data sets Penalty-aware stealing : giving penalty to events handlers based on their behavior Penalties are set by the programmer based on preliminary profiling and/or using application behavior knowledge 16 / 27

Outline 1 Context 2 Evaluation of Libasync-SMP workstealing 3 Contributions Improving the workstealing algorithm Making runtime internals workstealing friendly 4 Performance evaluation 5 Conclusion 17 / 27

The Mely runtime Core X color-queue Thread Color 0 Color 1 Color 2 Color 3 core-queue stealing-queue Backward compatible with Libasync-SMP One thread per core One color-queue per color One core-queue per core that links color-queues One stealing-queue per core that allows to efficiently implement Time-left and Penalty-aware stealing strategies 18 / 27

Outline 1 Context 2 Evaluation of Libasync-SMP workstealing 3 Contributions Improving the workstealing algorithm Making runtime internals workstealing friendly 4 Performance evaluation 5 Conclusion 19 / 27

SFS 15 clients repeatedly request a 200MB file 60% time spent in cryptographic operations only color cryptographic operations 160 Libasync-smp Libasync-smp - WS Mely - WS 140 120 100 MB/sec 80 60 40 20 0 as expected same throughput as the legacy workstealing mechanism 20 / 27

Web server Returns static page content (1KB files requested) Closed-loop injection 5 load injectors simulating between 200 and 2000 clients Architecture is based on legacy design Per-connection coloring RegisterFd InEpoll Close Dec Accepted Clients Accept Read Request Parse Request GetFrom Cache Write Response Epoll 21 / 27

Web server evaluation Throughput (KRequests/s) 200 150 100 50 Mely - WS Libasync-smp Mely Libasync-smp - WS 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Number of Clients Up to 73% improvement over the libasync-smp workstealing mechanism 22 / 27

Web server evaluation (2) 200 Mely - WS Userver Apache Throughput (KRequests/s) 150 100 50 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Number of Clients Performances better than other real world Web servers 23 / 27

Web server profiling Web server configuration Stealing time Stolen time Cache misses / event Libasync-SMP without workstealing - - 9 Libasync-SMP with workstealing 197 Kcycles 20 Kcycles 21 Mely with workstealing 6 Kcycles 23 Kcycles 9 Low stealing overhead : 6 Kcycles < stolen computing time Much more cache-efficient than Libasync-SMP Locality and penalty aware heuristics decrease the number of L2 cache misses by 24% 24 / 27

Outline 1 Context 2 Evaluation of Libasync-SMP workstealing 3 Contributions 4 Performance evaluation 5 Conclusion 25 / 27

Conclusion Context Event driven programming for system services on multicore architectures Workstealing sometimes degrades performances in such systems Contributions New heuristics to improve workstealing efficiency Revised runtime internals to reduce workstealing costs Improved Web server performance by 73% compared to the legacy workstealing mechanism. Future work : Automating runtime profiling and decision 26 / 27

Thank You! Questions? 27 / 27