Understanding Hardware Transactional Memory

Similar documents
SEER PROBABILISTIC SCHEDULING FOR COMMODITY HARDWARE TRANSACTIONAL MEMORY. 27 th Symposium on Parallel Architectures and Algorithms

Parallel Algorithm Engineering

Azul Compute Appliances

Software and the Concurrency Revolution

Multi-core Programming System Overview

Low level Java programming With examples from OpenHFT

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

FPGA-based Multithreading for In-Memory Hash Joins

Chapter 18: Database System Architectures. Centralized Systems

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

Intel TSX (Transactional Synchronization Extensions) Mike Dai Wang and Mihai Burcea

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization

Improving In-Memory Database Index Performance with Intel R Transactional Synchronization Extensions

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Parallel Replication for MySQL in 5 Minutes or Less

CSC 2405: Computer Systems II

DISTRIBUTED AND PARALLELL DATABASE

JBoss Data Grid Performance Study Comparing Java HotSpot to Azul Zing

Rethinking SIMD Vectorization for In-Memory Databases

Garbage Collection in the Java HotSpot Virtual Machine

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

CHAPTER 1 INTRODUCTION

XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines. A.Zydroń 18 April Page 1 of 12

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

Azul's Zulu JVM could prove an awkward challenge to Oracle's Java ambitions

Datacenter Operating Systems

Multi-Threading Performance on Commodity Multi-Core Processors

COS 318: Operating Systems

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

A Deduplication File System & Course Review

An Oracle White Paper September Advanced Java Diagnostics and Monitoring Without Performance Overhead

Informatica Ultra Messaging SMX Shared-Memory Transport

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

TheraDoc v4.6.1 Hardware and Software Requirements

Performance Characteristics of Large SMP Machines

Practical Performance Understanding the Performance of Your Application

Precise and Accurate Processor Simulation

Principles and characteristics of distributed systems and environments

Virtuoso and Database Scalability

Zing Vision. Answering your toughest production Java performance questions

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Real-time reporting at 10,000 inserts per second. Wesley Biggs CTO 25 October 2011 Percona Live

Real Time Programming: Concepts

Parallelism and Cloud Computing

Distribution transparency. Degree of transparency. Openness of distributed systems

Java Performance. Adrian Dozsa TM-JUG

Java DB Performance. Olav Sandstå Sun Microsystems, Trondheim, Norway Submission ID: 860

Shared Address Space Computing: Programming

Performance Evaluation and Optimization of A Custom Native Linux Threads Library

What s Cool in the SAP JVM (CON3243)

Performance Management for Cloudbased STC 2012

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Introduction to Parallel and Distributed Databases

Exploiting Hardware Transactional Memory in Main-Memory Databases

Eloquence Training What s new in Eloquence B.08.00

Avoid Paying The Virtualization Tax: Deploying Virtualized BI 4.0 The Right Way. Ashish C. Morzaria, SAP

Lecture 6: Semaphores and Monitors

Configuring Apache Derby for Performance and Durability Olav Sandstå

Understanding the Benefits of IBM SPSS Statistics Server

Definition of SOA. Capgemini University Technology Services School Capgemini - All rights reserved November 2006 SOA for Software Architects/ 2

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Extreme Performance with Java

Control 2004, University of Bath, UK, September 2004

Bigdata High Availability (HA) Architecture

An Oracle Benchmarking Study February Oracle Insurance Insbridge Enterprise Rating: Performance Assessment

Java Monitoring. Stuff You Can Get For Free (And Stuff You Can t) Paul Jasek Sales Engineer

Muse Server Sizing. 18 June Document Version Muse

Overview: X5 Generation Database Machines

Introduction 1 Performance on Hosted Server 1. Benchmarks 2. System Requirements 7 Load Balancing 7

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

Java Environment for Parallel Realtime Development Platform Independent Software Development for Multicore Systems

JBoss Seam Performance and Scalability on Dell PowerEdge 1855 Blade Servers

Virtual Machine Learning: Thinking Like a Computer Architect

Liferay Performance Tuning

Portable Scale-Out Benchmarks for MySQL. MySQL User Conference 2008 Robert Hodges CTO Continuent, Inc.

How To Test For Performance And Scalability On A Server With A Multi-Core Computer (For A Large Server)

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification

Enabling Java in Latency Sensitive Environments

Introduction to GPU hardware and to CUDA

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Objectives. Chapter 2: Operating-System Structures. Operating System Services (Cont.) Operating System Services. Operating System Services (Cont.

Delivering Quality in Software Performance and Scalability Testing

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

Full and Para Virtualization

Binary search tree with SIMD bandwidth optimization using SSE

How To Write A Multi Threaded Software On A Single Core (Or Multi Threaded) System

Replication on Virtual Machines

Transcription:

Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc.

Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence basics & how HTM works What it looks like from a runtime point of view Interesting coding considerations 2015 Azul Systems, Inc.

About me: Gil Tene co-founder, CTO @Azul Systems Have been working on think different GC and runtime approaches since 2002 Built world s 1st commercially shipping HTM system, along with JVM support for HTM A Long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc... I also depress people by demonstrating how terribly wrong their latency measurements are 2015 Azul Systems, Inc. * working on real-world trash compaction issues, circa 2004

As far as I am concerned GC is a solved problem 2015 Azul Systems, Inc.

Why does HTM matter now? Because it is (finally) here! HTM already available in the past e.g. Azul Vega, since 2004 e.g. some later variants of Power architecture e.g. some designs for SPARC But it is now here in commodity server chips Intel TSX, on modern Intel Xeons Already in E7-V3 (4+ sockets), E3-V4, E3-V5 (1 socket) Coming (1H2016) in E5-26xx V4 ( Broadwell") chips Most new servers will have HTM by 2H2016 2015 Azul Systems, Inc.

2015 Azul Systems, Inc. What is HTM?

What is HTM? (For the kind of HTM I will be talking about ) Can be thought of as Speculative Multi-Address Atomicity Transaction starts and ends with explicit instructions No special load or store instructions All memory operations in a successfully completed transaction appear to execute atomically (to other threads) Transactions may abort. All memory operations in an aborted transaction appear to have never happened 2015 Azul Systems, Inc.

Cache Coherence Protocols can be messy 2015 Azul Systems, Inc.

2015 Azul Systems, Inc.

2015 Azul Systems, Inc.

2015 Azul Systems, Inc.

2015 Azul Systems, Inc. But conceptually it s not that messy

Cache line state from an individual CPU s point of view I S E M I don t have it (Invalid) I have a copy (and someone else may, too) (Shared) I have the only copy (Exclusive) I have the only copy, and I ve changed it (Modified) 2015 Azul Systems, Inc.

Cache line state from an individual CPU s point of view M E S I I have the only copy, and I ve changed it (Modified) I have the only copy (Exclusive) I have a copy (and someone else may, too) (Shared) I don t have it (Invalid) 2015 Azul Systems, Inc.

2015 Azul Systems, Inc. HTM builds on existing cache coherence

Conceptual cache line state additions for HTM Line was accessed during speculation: Line was read from during speculation Line was modified during speculation When transaction completes, clear all speculation tracking state Losing track of a line that was accessed during speculation aborts the transaction Aborts invalidate speculatively modified lines 2015 Azul Systems, Inc.

What can make the cache lose track of a line? Another CPU wants to write to it Other CPU would need it exclusive It would first need to invalidate it in this cache Another CPU wants to read from it Cache line would need to be in shared state here If it were in speculatively modified state: abort Capacity related self-eviction E.g. current Xeons: 32KB, 8 way set associative 2015 Azul Systems, Inc.

2015 Azul Systems, Inc. That s it. For memory.

CPU state. E.g. Intel TSX In addition to memory transactionality, CPU architectural state is maintained On abort, PC moves to location provided in XBEGIN EAX is changed to indicate abort information All other architecture state remains the same as it was before XBEGIN was executed 2015 Azul Systems, Inc.

So when can you do with HTM? Well, transact on memory, of course 2015 Azul Systems, Inc.

Speculative Lock Elision 2001 PhD. thesis by Ravi Rajwar ** An independent work on a somewhat similar lock serialization avoidance concept by Jose F. Martinez and Josep Torrellas, UIUC, also published in 2001 2015 Azul Systems, Inc.

2015 Azul Systems, Inc.

Using HTM under the hood in a JVM A trip down transactional memory lane 2015 Azul Systems, Inc.

Speculative Locking: Breaking the Scale Barrier Gil Tene, VP Technology, CTO Ivan Posva, Senior Staff Engineer Azul Systems 2005 Azul Systems, Inc.

Multi-threaded Java Apps can Scale New JVM capabilities improve multi-threaded application scalability. How can this affect the way you code? Speculative locking reduces effects of Amdahl's law 25 2005 Azul Systems, Inc.

Agenda Why do we care? Lock contention vs. Data contention Transactional execution of synchronized { } Measurements Effects on how you code for contention Summary 26 2005 Azul Systems, Inc.

Amdahl s Law Serialized portions of program limit scale efficiency = 1/(N*q + (1-q)) N = # of concurrent threads q = fraction of serialized code 1.0 fraction of serialized code efficiency 0.8 0.5 0.10% 0.50% 1.00% 2.00% 5.00% 20.00% 0.3 27 0.0 1 11 21 31 41 51 61 71 81 91 processors (N) 2005 Azul Systems, Inc.

Amdahl s Law Effect on Throughput 100 throughput scale factor 75 50 25 0.10% 0.50% 1.00% 2.00% 5.00% 20.00% ideal 0 1 11 21 31 41 51 61 71 81 91 processors 28 2005 Azul Systems, Inc.

Amdahl s Law Example The theoretical limit is usually intuitive Assume 10% serialization At best you can do 10x the work of 1 CPU Efficiency drops are dramatic and may be less intuitive Assume 10% Serialization 10 CPUs will not scale past a speedup of 5.3x (Eff. 0.53) 16 CPUs will not scale past a speedup of 6.4x (Eff. 0.48) 64 CPUs will not scale past a speedup of 8.8x (Eff. 0.14) 99 CPUs will not scale past a speedup of 9.2x (Eff. 0.09) It will take a whole lot of inefficient CPUs to [never] reach a 10x 29 2005 Azul Systems, Inc.

Agenda Why do we care? Lock contention vs. Data contention Transactional execution of synchronized { } Measurements Effects on how you code Summary 30 2005 Azul Systems, Inc.

Lock Contention vs. Data Contention Lock contention: An attempt by one thread to acquire a lock when another thread is holding it Data contention: An attempt by one thread to atomically access data when another thread expects to manipulate the same data atomically 31 2005 Azul Systems, Inc.

Data Contention in a Shared Data Structure Readers do not contend Readers and writers don t always contend Even writers may not contend with other writers 32 2005 Azul Systems, Inc.

Synchronization and Locking Locks are typically very conservative Need synchronization for correct execution Critical sections, shared data structures Intent is to protect against data contention Can t easily tell in advance That s why we lock Lock contention >= Data contention In reality: lock contention >>= Data contention 33 2005 Azul Systems, Inc.

Database Transactions The industry has already solved a similar problem Semantics of potential failure exposed to the application Transactions: atomic group of DB commands All or nothing From BEGIN TRANSACTION to COMMIT Data contention results in a rollback Leaves no trace Application can re-execute until successful Optimistic concurrency does scale 34 2005 Azul Systems, Inc.

Agenda Why do we care? Lock contention vs. Data contention Transactional execution of synchronized { } Measurements Effects on how you code for contention Summary 35 2005 Azul Systems, Inc.

There is no spoon.

What does synchronized mean? It does not actually mean: grab lock, execute block, release lock It does mean: execute block atomically in relation to other blocks synchronizing on the same object It can be satisfied by the more conservative: execute block atomically in relation to all other threads That looks a lot like a transaction The Java Language Specification, The Java Virtual Machine Specification, JSR133 37 2005 Azul Systems, Inc.

Transactional execution of synchronized { } Two basic requirements Detect data contention within the block Roll back synchronized block on data contention synchronized can run concurrently JVM can use hardware transactional memory to detect data contention JVM must rolls back synchronized blocks that encounter data contention 38 2005 Azul Systems, Inc.

Transactional execution of synchronized { } The JVM maintains the semantic meaning of: execute block atomically in relation to all other threads Uncontended synchronized blocks run just as fast as before Data contended synchronized blocks still serialize execution synchronized blocks without data contention can execute in parallel 39 2005 Azul Systems, Inc.

Transactional execution of synchronized { } It s all transparent! No changes to Java code The VM handles everything Nested synchronized blocks Roll back to outermost transactional synchronized Reduces serialization Amdahl s Law now only reflects data contention Desire to reduce data contention 40 2005 Azul Systems, Inc.

Implementation in a JVM How does it fit in the current locking schemes? Thin locks handle uncontended synchronized blocks Most common case Uses CAS, no OS interaction Thick locks handle data contended synchronized blocks Blocks in the OS Transactional monitors handle contended synchronized blocks that have no data contention Execute synchronized blocks in parallel Uses HW transactional memory support 41 2005 Azul Systems, Inc.

Agenda Why do we care? Lock contention vs. Data contention Transactional synchronized { } Measurements Effects on how you code for contention Summary 42 2005 Azul Systems, Inc.

Data Contention and Hashtables Examples of no data contention in a Hashtable 2 readers 1 reader, 1 writer, different hash buckets 2 writers, different hash buckets Examples of data contention in a Hashtable 1 reader, 1 writer in same hash bucket 2 writers in same hash bucket 43 2005 Azul Systems, Inc.

Measurements: Hashtable (0% writes) 100000 10000 Locking Spec. Locking 1000 100 10 1 4 Threads 16 Threads 64 Threads 44 2005 Azul Systems, Inc.

Measurements: Hashtable (5% writes) 100000 10000 Locking Spec. Locking 1000 100 10 4 Threads 16 Threads 64 Threads 45 2005 Azul Systems, Inc.

Agenda Why do we care? Lock contention vs. Data contention Transactional synchronized { } Measurements Effects on how you code for contention Summary 46 2005 Azul Systems, Inc.

Minimizing Data Contention 1 private Object table[]; private int size; public synchronized void put(object key, Object val) { // missed, insert into table table[idx] = new HashEntry(key, val, table[idx]); size++; // writer data contention } public synchronized int size() { return size; } 47 2005 Azul Systems, Inc.

Minimizing Data Contention 2 private Object table[]; private int sizes[]; public synchronized void put(object key, Object val) { // missed, insert into table table[idx] = new HashEntry(key, val, table[idx]); sizes[idx]++; // reduced writer data contention } public synchronized int size() { int size = 0; for (int i=0; i<sizes.length; i++) size += sizes[i]; return size; } 48 2005 Azul Systems, Inc.

Minimizing Data Contention 3 private Object table[]; private int sizes[]; private int cachedsize; public synchronized void put(object key, Object val) { // missed, insert into table table[idx] = new HashEntry(key, val, table[idx]); sizes[idx]++; cachedsize = -1; // clear the cache } public synchronized int size() { if (cachedsize < 0) { // reduce size recalculation cachedsize = 0; for (int i=0; i<sizes.length; i++) cachedsize += sizes[i]; } return cachedsize; } 49 2005 Azul Systems, Inc.

Minimizing Data Contention 4 private Object table[]; private int sizes[]; private int cachedsize; public synchronized void put(object key, Object val) { // missed, insert into table table[idx] = new HashEntry(key, val, table[idx]); sizes[idx]++; if (cachedsize >= 0) cachedsize = -1; // avoid contention } public synchronized int size() { if (cachedsize < 0) { cachedsize = 0; for (int i=0; i<sizes.length; i++) cachedsize += sizes[i]; } return cachedsize; } 50 2005 Azul Systems, Inc.

Agenda Why do we care? Lock contention vs. Data contention Transactional synchronized { } Measurements Effects on how you code Summary 51 2005 Azul Systems, Inc.

Summary Hardware Transactional Memory is here! Expect speculative use for locking and synchronization in libraries and runtimes JVMs, CLR, POSIX mutexes, semaphores, etc. It may be useful for some other cool stuff 2015 Azul Systems, Inc.

2015 Azul Systems, Inc. Q&A