CS 33. Multithreaded Programming V. CS33 Intro to Computer Systems XXXVII 1 Copyright 2015 Thomas W. Doeppner. All rights reserved.



Similar documents
Systemnahe Programmierung KU

Shared Address Space Computing: Programming

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin

Threads Scheduling on Linux Operating Systems

Middleware. Peter Marwedel TU Dortmund, Informatik 12 Germany. technische universität dortmund. fakultät für informatik informatik 12

W4118 Operating Systems. Instructor: Junfeng Yang

Lecture 7: Machine-Level Programming I: Basics Mohamed Zahran (aka Z)

Win32 API Emulation on UNIX for Software DSM

Lecture 6: Semaphores and Monitors

Tail call elimination. Michel Schinz

Project No. 2: Process Scheduling in Linux Submission due: April 28, 2014, 11:59pm

The Plan Today... System Calls and API's Basics of OS design Virtual Machines

Dynamic Behavior Analysis Using Binary Instrumentation

Multi-core Programming System Overview

Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY Operating System Engineering: Fall 2005

Performance Evaluation and Optimization of A Custom Native Linux Threads Library

CS11 Java. Fall Lecture 7

Last Class: Semaphores

Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc()

A Survey of Parallel Processing in Linux

Virtualization Technologies

Intel TSX (Transactional Synchronization Extensions) Mike Dai Wang and Mihai Burcea

Topics. Producing Production Quality Software. Concurrent Environments. Why Use Concurrency? Models of concurrency Concurrency in Java

Linux Scheduler. Linux Scheduler

Machine Programming II: Instruc8ons

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015

Overview. CMSC 330: Organization of Programming Languages. Synchronization Example (Java 1.5) Lock Interface (Java 1.5) ReentrantLock Class (Java 1.

An Implementation Of Multiprocessor Linux

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

SYSTEM ecos Embedded Configurable Operating System

Applying Clang Static Analyzer to Linux Kernel

Assembly Language: Function Calls" Jennifer Rexford!

For a 64-bit system. I - Presentation Of The Shellcode

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

CS414 SP 2007 Assignment 1

Operating Systems and Networks

Thread Synchronization and the Java Monitor

10 Things Every Linux Programmer Should Know Linux Misconceptions in 30 Minutes

Frysk The Systems Monitoring and Debugging Tool. Andrew Cagney

Writing a C-based Client/Server

MatrixSSL Developer s Guide

First-class User Level Threads

Monitors, Java, Threads and Processes

An Introduction To Simple Scheduling (Primarily targeted at Arduino Platform)

Korset: Code-based Intrusion Detection for Linux

System Calls and Standard I/O

Virtuozzo Virtualization SDK

LLVMLinux: Embracing the Dragon

A Comparison of Task Pools for Dynamic Load Balancing of Irregular Algorithms

OMPT: OpenMP Tools Application Programming Interfaces for Performance Analysis

Instruction Set Architecture

Free Java textbook available online. Introduction to the Java programming language. Compilation. A simple java program

Semaphores and Monitors: High-level Synchronization Constructs

Free Java textbook available online. Introduction to the Java programming language. Compilation. A simple java program

Understanding Hardware Transactional Memory

Distributed Systems [Fall 2012]

SSC - Concurrency and Multi-threading Java multithreading programming - Synchronisation (I)

Jorix kernel: real-time scheduling

8.5. <summary> Cppcheck addons Using Cppcheck addons Where to find some Cppcheck addons

Embedded Programming in C/C++: Lesson-1: Programming Elements and Programming in C

3 - Lift with Monitors

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Multi-Threading Performance on Commodity Multi-Core Processors

1 Posix API vs Windows API

Why Alerts Suck and Monitoring Solutions need to become Smarter

Tasks Schedule Analysis in RTAI/Linux-GPL

Threads 1. When writing games you need to do more than one thing at once.

OMPT and OMPD: OpenMP Tools Application Programming Interfaces for Performance Analysis and Debugging

Effective Computing with SMP Linux

Process Scheduling CS 241. February 24, Copyright University of Illinois CS 241 Staff

Introduction. Reading. Today MPI & OpenMP papers Tuesday Commutativity Analysis & HPF. CMSC 818Z - S99 (lect 5)

MatrixSSL Porting Guide

Embedded Systems. 6. Real-Time Operating Systems

Using Power to Improve C Programming Education

Java Memory Model: Content

Performance Tools for Parallel Java Environments

! Past few lectures: Ø Locks: provide mutual exclusion Ø Condition variables: provide conditional synchronization

Introduction to Multicore Programming

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Kernel Synchronization and Interrupt Handling

MPLAB Harmony System Service Libraries Help

Using the Intel Inspector XE

Hacking Techniques & Intrusion Detection. Ali Al-Shemery arabnix [at] gmail

Introduction. Application Security. Reasons For Reverse Engineering. This lecture. Java Byte Code

Grundlagen der Betriebssystemprogrammierung

Operating System Manual. Realtime Communication System for netx. Kernel API Function Reference.

Chapter 2 Process, Thread and Scheduling

Whitepaper: performance of SqlBulkCopy

X86-64 Architecture Guide

Course Development of Programming for General-Purpose Multicore Processors

Transcription:

CS 33 Multithreaded Programming V CS33 Intro to Computer Systems XXXVII 1 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Alternatives to Mutexes: Atomic Instructions Read-modify-write performed atomically Lock prefix may be used with certain IA32 and x86-64 instructions to make this happen lock incr x lock add %2, x It s expensive It s not portable no POSIX-threads way of doing it Windows supports» InterlockedIncrement» InterlockedDecrement CS33 Intro to Computer Systems XXXVII 2 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Alternatives to Mutexes: Spin Locks Consider pthread_mutex_lock(&mutex); x = x+1; pthread_mutex_unlock(&mutex); A lot of overhead is required to put thread to sleep, then wake it up Rather than do that, repeatedly test mutex until it s unlocked, then lock it makes sense only on multiprocessor system CS33 Intro to Computer Systems XXXVII 3 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Compare and Exchange cmpxchg src, dest compare contents of %eax with contents of dest» if equal, then dest = src (and ZF = 1)» otherwise %eax = dest (and ZF = 0) CS33 Intro to Computer Systems XXXVII 4 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Spin Lock the spin lock is pointed to by the first arg (%edi) locked is 1, unlocked is 0 slock: loop:.text.globl slock, sunlock movl $0, %eax movq $1, %r10 lock cmpxchg %r10, 0(%edi) jne loop ret sunlock: movl $0, 0(%edi) ret CS33 Intro to Computer Systems XXXVII 5 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Improved Spin Lock slock: loop:.text.globl slock, sunlock cmp $0, 0(%edi) # compare using normal instructions jne loop movl $0, %eax movq $1, %r10 lock cmpxchg %r10, 0(%edi) # verify w/ cmpxchg jne loop ret sunlock: movl $0, 0(%edi) ret CS33 Intro to Computer Systems XXXVII 6 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Yet More From POSIX int pthread_spin_init(pthread_spin_t *s, int pshared); int pthread_spin_destroy(pthread_spin_t *s); int pthread_spin_lock(pthread_spin_t *s); int pthread_spin_trylock(pthread_spin_t *s); int pthread_spin_unlock(pthread_spin_t *s); CS33 Intro to Computer Systems XXXVII 7 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Implementing Mutexes (1) Strategy make the usual case (no waiting) very fast can afford to take more time for the other case (waiting for the mutex) CS33 Intro to Computer Systems XXXVII 8 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Implementing Mutexes (2) Mutex has three states unlocked locked, no waiters locked, waiting threads Locking the mutex use cmpxchg (with lock prefix) if unlocked, lock it and we re done» state changed to locked, no waiters otherwise, make futex system call to wait till it s unlocked» state changed to locked, waiting threads CS33 Intro to Computer Systems XXXVII 9 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Implementing Mutexes (3) Unlocking the mutex if locked, but no waiters» state changed to unlocked if locked, but waiting threads» futex system call made to wake up a waiting thread CS33 Intro to Computer Systems XXXVII 10 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Memory Allocation Multiple threads One heap Bottleneck? CS33 Intro to Computer Systems XXXVII 11 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Solution 1 Divvy up the heap among the threads each thread has its own heap no mutexes required no bottleneck How much heap does each thread get? CS33 Intro to Computer Systems XXXVII 12 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Solution 2 Multiple arenas each with its own mutex thread allocates from the first one it can find whose mutex was unlocked» if none, then creates new one deallocations go back to original arena CS33 Intro to Computer Systems XXXVII 13 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Solution 3 Global heap plus per-thread heaps threads pull storage from global heap freed storage goes to per-thread heap» unless things are imbalanced then thread moves storage back to global heap mutex on only the global heap What if one thread allocates and another frees storage? CS33 Intro to Computer Systems XXXVII 14 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Malloc/Free Implementations ptmalloc based on solution 2 in glibc (i.e., used by default) tcmalloc based on solution 3 from Google Which is best? CS33 Intro to Computer Systems XXXVII 15 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Test Program const unsigned int N=64, nthreads=32, iters=10000000; int main() { } void *tfunc(void *); pthread_t thread[nthreads]; for (int i=0; i<nthreads; i++) { } pthread_create(&thread[i], 0, tfunc, (void *)i); pthread_detach(thread[i]); pthread_exit(0); void *tfunc(void *arg) { } long i; for (i=0; i<iters; i++) { } long *p = (long *)malloc(sizeof(long)*((i%n)+1)); free(p); return 0; CS33 Intro to Computer Systems XXXVII 16 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Compiling It % gcc -o ptalloc alloc.cc lpthread % gcc -o tcalloc alloc.cc lpthread -ltcmalloc CS33 Intro to Computer Systems XXXVII 17 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Running It $ time./ptalloc real 0m5.142s user sys 0m20.501s 0m0.024s $ time./tcalloc real 0m1.889s user sys 0m7.492s 0m0.008s CS33 Intro to Computer Systems XXXVII 18 Copyright 2015 Thomas W. Doeppner. All rights reserved.

What s Going On? $ strace c f./ptalloc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 0.040002 13 3007 520 futex $ strace c f./tcalloc % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 0.00 0.000000 0 59 13 futex CS33 Intro to Computer Systems XXXVII 19 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Test Program 2, part 1 #define N 64 #define npairs 16 #define allocsperiter 1024 const long iters = 8*1024*1024/allocsPerIter; #define BufSize 10240 typedef struct buffer { int *buf[bufsize]; unsigned int nextin; unsigned int nextout; sem_t empty; sem_t occupied; pthread_t pthread; pthread_t cthread; } buffer_t; CS33 Intro to Computer Systems XXXVII 20 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Test Program 2, part 2 int main() { long i; } buffer_t b[npairs]; for (i=0; i<npairs; i++) { } b[i].nextin = 0; b[i].nextout = 0; sem_init(&b[i].empty, 0, BufSize/allocsPerIter); sem_init(&b[i].occupied, 0, 0); pthread_create(&b[i].pthread, 0, prod, &b[i]); pthread_create(&b[i].cthread, 0, cons, &b[i]); for (i=0; i<npairs; i++) { } pthread_join(b[i].pthread, 0); pthread_join(b[i].cthread, 0); return 0; CS33 Intro to Computer Systems XXXVII 21 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Test Program 2, part 3 void *prod(void *arg) { long i, j; } buffer_t *b = (buffer_t *)arg; for (i = 0; i<iters; i++) { } sem_wait(&b->empty); for (j = 0; j<allocsperiter; j++) { } b->buf[b->nextin] = malloc(sizeof(int)*((j%n)+1)); if (++b->nextin >= BufSize) b->nextin = 0; sem_post(&b->occupied); return 0; CS33 Intro to Computer Systems XXXVII 22 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Test Program 2, part 4 void *cons(void *arg) { long i, j; } buffer_t *b = (buffer_t *)arg; for (i = 0; i<iters; i++) { } sem_wait(&b->occupied); for (j = 0; j<allocsperiter; j++) { } free(b->buf[b->nextout]); if (++b->nextout >= BufSize) b->nextout = 0; sem_post(&b->empty); return 0; CS33 Intro to Computer Systems XXXVII 23 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Running It $ time./ptalloc2 real 0m1.087s user sys 0m3.744s 0m0.204s $ time./tcalloc2 real 0m3.535s user sys 0m11.361s 0m2.112s CS33 Intro to Computer Systems XXXVII 24 Copyright 2015 Thomas W. Doeppner. All rights reserved.

What s Going On? $ strace c f./ptalloc2 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 94.96 2.347314 44 53653 14030 futex $ strace c f./tcalloc2 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 93.86 6.604632 36 185731 45222 futex CS33 Intro to Computer Systems XXXVII 25 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Running it Today... sphere $ time./ptalloc real 0m2.373s user 0m9.152s sys 0m0.008s sphere $ time./tcalloc real user sys 0m4.868s 0m19.444s 0m0.020s CS33 Intro to Computer Systems XXXVII 26 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Running it Today... kui $ time./ptalloc real 0m2.787s user 0m11.045s sys 0m0.004s kui $ time./tcalloc real user sys 0m1.701s 0m6.584s 0m0.004s CS33 Intro to Computer Systems XXXVII 27 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Running it Today... cslab0a $ time./ptalloc real 0m2.234s user 0m8.468s sys 0m0.000s cslab0a $ time./tcalloc real user sys 0m4.938s 0m19.584s 0m0.000s CS33 Intro to Computer Systems XXXVII 28 Copyright 2015 Thomas W. Doeppner. All rights reserved.

What s Going On? On kui: libtcmalloc.so -> libtcmalloc.so.4.1.0 On other machines: libtcmalloc.so -> libtcmalloc.so.4.2.2 CS33 Intro to Computer Systems XXXVII 29 Copyright 2015 Thomas W. Doeppner. All rights reserved.

However... cslab0a $ time./ptalloc2 real 0m0.466s user 0m1.504s sys 0m0.212s cslab0a $ time./tcalloc2 real user sys 0m1.516s 0m5.212s 0m0.328s CS33 Intro to Computer Systems XXXVII 30 Copyright 2015 Thomas W. Doeppner. All rights reserved.

You ll Soon Finish CS 33 You might celebrate take another systems course» 32» 138» 166» 167 become a 33 TA CS33 Intro to Computer Systems XXXVII 31 Copyright 2015 Thomas W. Doeppner. All rights reserved.

Systems Courses Next Semester CS 32 you ve mastered low-level systems programming now do things at a higher level learn software-engineering techniques using Java, XML, etc. CS 138 you now know how things work on one computer what if you ve got lots of computers? some may have crashed, others may have been taken over by your worst (and brightest) enemy CS 166 liked buffer? you ll really like 166 CS 167/169 still mystified about what the OS does? write your own! CS33 Intro to Computer Systems XXXVII 32 Copyright 2015 Thomas W. Doeppner. All rights reserved.

The End Well, not quite Database is due on 12/16. We ll do our best to get everything else back this week. Happy coding and happy holidays! CS33 Intro to Computer Systems XXXVII 33 Copyright 2015 Thomas W. Doeppner. All rights reserved.