Understanding the Python GIL




David Beazley
http://www.dabeaz.com
Presented at PyCon 2010, Atlanta, Georgia

Introduction

As a few of you might know, CPython has a Global Interpreter Lock (GIL).

    >>> import that
    The Unwritten Rules of Python
    1. You do not talk about the GIL.
    2. You do NOT talk about the GIL.
    3. Don't even mention the GIL. No, seriously...

It limits thread performance, and is thus a source of occasional "contention".

An Experiment

Consider this trivial CPU-bound function:

    def countdown(n):
        while n > 0:
            n -= 1

Run it once with a lot of work:

    COUNT = 100000000    # 100 million
    countdown(COUNT)

Now, subdivide the work across two threads:

    t1 = Thread(target=countdown, args=(COUNT//2,))
    t2 = Thread(target=countdown, args=(COUNT//2,))
    t1.start(); t2.start()
    t1.join(); t2.join()
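The experiment can be reproduced as a complete script. This is my own Python 3 rendition (with a smaller COUNT so it finishes quickly); the exact timings will of course vary by machine and interpreter version:

```python
import time
from threading import Thread

def countdown(n):
    # Trivial CPU-bound loop: no I/O, so the GIL is only released
    # by the interpreter's periodic switching machinery.
    while n > 0:
        n -= 1

COUNT = 10_000_000  # smaller than the talk's 100 million, for a quick run

# Sequential baseline
start = time.time()
countdown(COUNT)
sequential = time.time() - start

# Same total work, split across two threads
t1 = Thread(target=countdown, args=(COUNT // 2,))
t2 = Thread(target=countdown, args=(COUNT // 2,))
start = time.time()
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.time() - start

print(f"sequential: {sequential:.2f}s  threaded: {threaded:.2f}s")
```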

A Mystery

Performance on a quad-core MacPro:

    Sequential           : 7.8s
    Threaded (2 threads) : 15.4s  (2X slower!)

Performance if the work is divided across 4 threads:

    Threaded (4 threads) : 15.7s  (about the same)

Performance if all but one CPU is disabled:

    Threaded (2 threads) : 11.3s
    Threaded (4 threads) : 11.6s  (~35% faster than running with all 4 cores!)

Think about it...

This Talk

An in-depth look at threads and the GIL that will explain that mystery, and much more. Some cool pictures. A look at the new GIL in Python 3.2.

Disclaimers

I gave an earlier talk on this topic at the Chicago Python Users Group (chipy): http://www.dabeaz.com/python/gil.pdf. That is a different, but related, talk. I'm going to go pretty fast... please hang on.

Part I: Threads and the GIL

Python Threads

Python threads are real system threads: POSIX threads (pthreads) or Windows threads, fully managed by the host operating system. They represent threaded execution of the Python interpreter process (written in C).

Alas, the GIL

Parallel execution is forbidden: there is a "global interpreter lock". The GIL ensures that only one thread runs in the interpreter at once. This simplifies many low-level details (memory management, callouts to C extensions, etc.).

Thread Execution Model

With the GIL, you get cooperative multitasking.

[Diagram: three threads take turns; each runs while holding the GIL, releases it when it performs I/O, and another thread acquires it and runs]

When a thread is running, it holds the GIL. The GIL is released on I/O (read, write, send, recv, etc.).
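The "GIL released on blocking calls" behavior can be observed directly. Here is a small illustrative sketch (my example, not from the talk) in which two threads blocked in time.sleep() overlap instead of serializing, because sleeping releases the GIL:

```python
import time
from threading import Thread

def wait():
    # time.sleep() blocks outside the interpreter, so the GIL is
    # released and other threads can run in the meantime.
    time.sleep(0.5)

start = time.time()
threads = [Thread(target=wait) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Two 0.5s sleeps overlap: total time is roughly 0.5s, not 1.0s
print(f"elapsed: {elapsed:.2f}s")
```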

CPU-Bound Tasks

CPU-bound threads that never perform I/O are handled as a special case: a "check" occurs every 100 "ticks".

[Diagram: a CPU-bound thread runs for 100 ticks, then a check, over and over]

Change the interval using sys.setcheckinterval().
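For readers following along today: the tick-based check interval is a Python 2 era API. It survived, deprecated, into early Python 3 releases and was later removed once the new time-based GIL made it meaningless. A defensive sketch:

```python
import sys

# The check interval API exists on Python 2 and early Python 3;
# modern Python 3 replaced it with a time-based switch interval.
if hasattr(sys, "setcheckinterval"):
    print(sys.getcheckinterval())   # default: 100 ticks
    sys.setcheckinterval(1000)      # run the periodic check less often
else:
    # Modern Python: a time-based switch interval instead (see Part 4)
    print(sys.getswitchinterval())  # default: 0.005 seconds
```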

What is a "Tick?"

Ticks loosely map to interpreter instructions:

    def countdown(n):
        while n > 0:
            print n
            n -= 1

    >>> import dis
    >>> dis.dis(countdown)
      0 SETUP_LOOP              33 (to 36)
      3 LOAD_FAST                0 (n)
      6 LOAD_CONST               1 (0)
      9 COMPARE_OP               4 (>)
     12 JUMP_IF_FALSE           19 (to 34)
     15 POP_TOP
     16 LOAD_FAST                0 (n)
     19 PRINT_ITEM
     20 PRINT_NEWLINE
     21 LOAD_FAST                0 (n)
     24 LOAD_CONST               2 (1)
     27 INPLACE_SUBTRACT
     28 STORE_FAST               0 (n)
     31 JUMP_ABSOLUTE            3
    ...

(The original slide annotates this listing with four tick boundaries, Tick 1 through Tick 4.) Ticks are instructions in the Python VM. They are not related to timing (a tick might be long).

The Periodic "Check"

The periodic check is really simple. The currently running thread:

1. Resets the tick counter
2. Runs signal handlers (if it is the main thread)
3. Releases the GIL
4. Reacquires the GIL

That's it.

Implementation (C)

    /* Python/ceval.c */
    ...
    if (--_Py_Ticker < 0) {                  /* decrement ticks */
        ...
        _Py_Ticker = _Py_CheckInterval;      /* reset ticks */
        ...
        if (things_to_do) {                  /* run signal handlers */
            if (Py_MakePendingCalls() < 0) {
                ...
            }
        }
        if (interpreter_lock) {
            /* Give another thread a chance */
            PyThread_release_lock(interpreter_lock);
            ...
            /* Other threads may run now */
            PyThread_acquire_lock(interpreter_lock, 1);
        }
    }

Note: Each thread is running this same code.

Big Question

What is the source of that large CPU-bound thread performance penalty? There's just not much code to look at. Is GIL acquire/release solely responsible? How would you find out?

Part 2: The GIL and Thread Switching Deconstructed

Python Locks

The Python interpreter only provides a single lock type (in C) that is used to build all other thread synchronization primitives. It's not a simple mutex lock: it's a binary semaphore constructed from a pthreads mutex and a condition variable. The GIL is an instance of this lock.

Locks Deconstructed

Locks consist of three parts (pseudocode):

    locked = 0               # Lock status
    mutex = pthreads_mutex() # Lock for the status
    cond = pthreads_cond()   # Used for waiting/wakeup

Here's how acquire() and release() work:

    acquire() {
        mutex.acquire()
        while (locked) {
            cond.wait(mutex)
        }
        locked = 1
        mutex.release()
    }

    release() {
        mutex.acquire()
        locked = 0
        mutex.release()
        cond.signal()
    }

A critical aspect concerns this signaling between threads.
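The same construction can be sketched in Python itself, using threading.Lock and threading.Condition as stand-ins for the pthreads primitives. This is a didactic sketch mirroring the pseudocode, not CPython's actual C code:

```python
import threading

class BinarySemaphore:
    """A binary semaphore built from a mutex + condition variable,
    mirroring the structure of CPython's internal lock type."""

    def __init__(self):
        self._locked = 0
        self._mutex = threading.Lock()                  # protects _locked
        self._cond = threading.Condition(self._mutex)   # waiting/wakeup

    def acquire(self):
        with self._mutex:
            while self._locked:
                self._cond.wait()   # releases the mutex while waiting
            self._locked = 1

    def release(self):
        with self._mutex:
            self._locked = 0
            # Unlike the C pseudocode, Python requires notify() to be
            # called with the lock held; which waiter runs next is
            # still up to the OS scheduler.
            self._cond.notify()
```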

Thread Switching

Suppose you have two threads:

    Thread 1 : Running
    Thread 2 : Ready (waiting for the GIL)

Thread Switching

Easy case: Thread 1 performs I/O (read/write). Thread 1 might block, so it releases the GIL.

[Diagram: Thread 1 running, releases the GIL on I/O; Thread 2 ready]

Thread Switching

Easy case, continued: the release of the GIL results in a signaling operation, handled by the thread library and operating system.

[Diagram: Thread 1 releases the GIL on I/O and signals; the pthreads/OS layer context-switches to Thread 2, which acquires the GIL and runs]

Thread Switching

Tricky case: Thread 1 runs until the check. Which thread runs now? Either thread is able to run. So, which is it?

[Diagram: Thread 1 runs 100 ticks, hits the check, releases the GIL, and signals; Thread 2 is ready; ??? — who gets the GIL?]

pthreads Undercover

Condition variables have an internal wait queue: cv.wait() enqueues the calling thread on a queue of threads waiting on the cv (often FIFO), and cv.signal() dequeues one. Signaling pops a thread off of the queue. However, what happens after that?

OS Scheduling

The operating system has a priority queue of threads/processes ready to run. Signaled threads simply enter that queue. The operating system then runs the process or thread with the highest priority. It may or may not be the signaled thread.

Thread Switching

Thread 1 (high priority) might keep going: after the check, it releases and immediately reacquires the GIL. Thread 2 (low priority) moves to the OS "ready" queue and executes at some later time.

Thread Switching

Or Thread 2 (high priority) might immediately take over: the signal triggers a context switch, and Thread 2 acquires the GIL. Again, highest priority wins.

Part 3: What Can Go Wrong?

GIL Instrumentation

To study thread scheduling in more detail, I instrumented Python with some logging and recorded a large trace of all GIL acquisitions, releases, conflicts, retries, etc. The goal was to get a better idea of how threads were scheduled, interactions between threads, internal GIL behavior, etc.

GIL Logging

An extra tick counter was added to record the number of cycles of the check interval. Locks were modified to log GIL events (pseudocode):

    acquire() {
        mutex.acquire()
        if locked and gil: log("busy")
        while locked:
            cv.wait(mutex)
            if locked and gil: log("retry")
        locked = 1
        if gil: log("acquire")
        mutex.release()
    }

    release() {
        mutex.acquire()
        locked = 0
        if gil: log("release")
        mutex.release()
        cv.signal()
    }

Note: Actual code is in C; event logs are stored entirely in memory until exit (no I/O).

A Sample Trace

    thread  tick       total number of
    id      countdown  "checks" executed
    t2      100        5351    ACQUIRE
    t2      100        5352    RELEASE
    t2      100        5352    ACQUIRE
    t2      100        5353    RELEASE
    t1      100        5353    ACQUIRE
    t2       38        5353    BUSY
    t1      100        5354    RELEASE
    t1      100        5354    ACQUIRE
    t2       79        5354    RETRY
    t1      100        5355    RELEASE
    t1      100        5355    ACQUIRE
    t2       73        5355    RETRY
    t1      100        5356    RELEASE
    t2      100        5356    ACQUIRE
    t1       24        5356    BUSY
    t2      100        5357    RELEASE

    ACQUIRE : GIL acquired
    RELEASE : GIL released
    BUSY    : Attempted to acquire GIL, but it was already in use
    RETRY   : Repeated attempt to acquire the GIL, but it was still in use

Trace files were large (>20MB for 1s of running).

Logging Results

The logs were quite revealing: interesting behavior on one CPU, diabolical behavior on multiple CPUs. I will briefly summarize the findings, followed by an interactive visualization that shows the details.

Single CPU Threading

Threads alternate execution, but switch far less frequently than you might imagine: hundreds to thousands of checks might occur before a thread context switch (this is good).

[Diagram: Thread 1 runs many 100-tick check cycles, signaling each time, before the OS finally context-switches to Thread 2]

Multicore GIL War

With multiple cores, runnable threads get scheduled simultaneously (on different cores) and battle over the GIL. Thread 2 is repeatedly signaled, but by the time it wakes up, the GIL is already gone (reacquired).

[Diagram: Thread 1 on CPU 1 releases and reacquires the GIL as it runs; Thread 2 on CPU 2 wakes on each signal, but its GIL acquisition fails every time]

Multicore Event Handling

CPU-bound threads make GIL acquisition difficult for threads that want to handle events. When an event arrives, the handling thread may fail to acquire the GIL hundreds to thousands of times before it finally succeeds.

[Diagram: Thread 1 on CPU 1 runs and signals repeatedly; Thread 2 on CPU 2 keeps failing to acquire the GIL until it finally succeeds; this might repeat 100s-1000s of times]

Behavior of I/O Handling

I/O operations often do not block: due to buffering, the OS is able to fulfill I/O requests immediately and keep a thread running. However, the GIL is always released, which results in GIL thrashing under heavy load.

[Diagram: a thread performs a burst of read/write operations, signaling on every one, without ever actually blocking]

GIL Visualization (Demo)

Let's look at all of these effects: http://www.dabeaz.com/gil

Some facts about the plots: generated from ~2GB of log data, rendered into ~2 million PNG image tiles, created using custom scripts/tools. I used the multiprocessing module.

Part 4: A Better GIL?

The New GIL

Python 3.2 has a new GIL implementation (only available by svn checkout). It is the work of Antoine Pitrou (applause). It aims to solve all that GIL thrashing, and it is the first major change to the GIL since the inception of Python threads in 1992. Let's go take a look.

New Thread Switching

Instead of ticks, there is now a global variable:

    /* Python/ceval.c */
    ...
    static volatile int gil_drop_request = 0;

A thread runs until the value gets set to 1, at which point the thread must drop the GIL. Big question: how does that happen?

New GIL Illustrated

Suppose that there is just one thread. It just runs and runs and runs... It never releases the GIL and never sends any signals. Life is great!

New GIL Illustrated

Suppose a second thread appears. It is suspended because it doesn't have the GIL. Somehow, it has to get it from Thread 1.

New GIL Illustrated

The waiting thread does a timed cv_wait on the GIL:

    cv_wait(gil, TIMEOUT)

By default TIMEOUT is 5 milliseconds, but it can be changed. The idea: Thread 2 waits to see if the GIL gets released voluntarily by Thread 1 (e.g., if there is I/O, or it goes to sleep for some reason).
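In the released Python 3.2 (and every Python 3 since), this timeout is exposed as the "switch interval" in the sys module, so the "it can be changed" remark maps onto a real API:

```python
import sys

# The new GIL's wait timeout, in seconds
print(sys.getswitchinterval())    # default: 0.005 (5 milliseconds)

# Ask the interpreter to consider switching threads more often
sys.setswitchinterval(0.001)
```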

New GIL Illustrated

Voluntary GIL release: this is the easy case. Thread 1 releases the GIL on I/O, the second thread is signaled, and it grabs the GIL.

New GIL Illustrated

If the wait times out, Thread 2 sets gil_drop_request = 1 and then repeats its wait on the GIL:

    cv_wait(gil, TIMEOUT)

New GIL Illustrated

Thread 1 suspends after the current instruction, and a signal is sent to indicate the release of the GIL. Thread 2 then acquires it and runs.

New GIL Illustrated

On a forced release, a thread waits for an acknowledgement:

    cv_wait(gotgil)

The ack ensures that the other thread successfully got the GIL and is now running. This eliminates the "GIL battle."

New GIL Illustrated

The process now repeats itself for Thread 1: once Thread 2 is running, Thread 1 does its own timed cv_wait on the GIL, gil_drop_request is reset to 0, and so on. The timeout sequence happens over and over again as CPU-bound threads execute.

Does it Work?

Yes, apparently (4-core MacPro, OS X 10.6.2):

    Sequential           : 11.53s
    Threaded (2 threads) : 11.93s
    Threaded (4 threads) : 12.32s

Keep in mind, Python is still limited by the GIL in all of the usual ways (threads still provide no performance boost). But, otherwise, it looks promising!

Part 5: Die GIL Die!!!

Alas, It Doesn't Work

The new GIL impacts I/O performance. Here is a fragment of network code:

    # Thread 1
    def spin():
        while True:
            # some work
            pass

    # Thread 2
    def echo_server(s):
        while True:
            data = s.recv(8192)
            if not data:
                break
            s.sendall(data)

One thread is working (CPU-bound); one thread receives and echoes data on a socket.
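The fragment above is not runnable on its own. Here is a self-contained sketch of the echo side, using a socketpair in place of a real network connection and omitting the competing spin() thread so that it terminates (add spin() in a second thread to reproduce the contention the talk measures):

```python
import socket
from threading import Thread

def echo_server(conn):
    # Echo loop from the slide, unchanged
    while True:
        data = conn.recv(8192)
        if not data:
            break
        conn.sendall(data)

# A connected pair of sockets stands in for a real client/server link
a, b = socket.socketpair()
Thread(target=echo_server, args=(b,), daemon=True).start()

a.sendall(b"ping")
reply = a.recv(8192)
print(reply)   # b'ping'
```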

Response Time

The new GIL increases response time. To handle I/O, a thread must go through the entire timeout sequence to get control, even when data has already arrived. This ignores the high priority of I/O or events.

[Diagram: data arrives while Thread 2 is in cv_wait(gil, TIMEOUT); it still has to time out and set gil_drop_request before it runs]

Unfair Wakeup/Starvation

The most deserving thread may not get the GIL. This is caused by internal condition variable queuing: when the GIL is dropped, the thread that wakes up may not be the one with pending I/O. This further increases the response time.

[Diagram: data arrives for Thread 2, but Thread 3 is the one woken up and runs instead]

Convoy Effect

I/O operations that don't block cause stalls. Since I/O operations always release the GIL, CPU-bound threads will always try to restart. On I/O completion (almost immediately), the GIL is gone, so the timeout has to repeat.

[Diagram: Thread 2's sends execute immediately, but each one surrenders the GIL to Thread 1, forcing another full timeout cycle]

An Experiment

Send 10MB of data to an echo server thread that's competing with a CPU-bound thread:

    Python 2.6.4 (2 CPU) : 0.57s   (10-sample average)
    Python 3.2   (2 CPU) : 12.4s   (20x slower)

What if the echo thread competes with 2 CPU-bound threads?

    Python 2.6.4 (2 CPU) : 0.25s   (better performance?)
    Python 3.2   (2 CPU) : 46.9s   (4x slower than before)
    Python 3.2   (1 CPU) : 0.14s   (330x faster than with 2 cores?)

Arg! Enough already!

Part 6: Score: Multicore 2, GIL 0

Fixing the GIL

Can the GIL's erratic behavior be fixed? My opinion: yes, maybe. The new GIL is already 90% there; it just needs a few extra bits.

The Missing Bits

Priorities: there must be some way to separate CPU-bound (low priority) and I/O-bound (high priority) threads.

Preemption: high-priority threads must be able to immediately preempt low-priority threads.

A Possible Solution

Operating systems use timeouts to automatically adjust task priorities (multilevel feedback queuing). If a thread is preempted by a timeout, it is penalized with lowered priority (bad thread). If a thread suspends early, it is rewarded with raised priority (good thread). High-priority threads always preempt low-priority threads. Maybe it could be applied to the new GIL?
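The penalty/reward rule can be sketched in a few lines. This is a hypothetical illustration of the multilevel-feedback idea only, not CPython code; the thread object, its priority attribute, and the bounds are all invented for the example:

```python
def adjust_priority(thread, preempted_by_timeout, prio_min=0, prio_max=10):
    """Multilevel-feedback sketch: punish threads that exhaust their
    timeout, reward threads that yield early (hypothetical model)."""
    if preempted_by_timeout:
        # CPU hog: lower its priority so I/O threads can preempt it
        thread.priority = max(prio_min, thread.priority - 1)
    else:
        # Yielded voluntarily (I/O, sleep): raise its priority
        thread.priority = min(prio_max, thread.priority + 1)
```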

Remove the GIL?

This entire talk has been about the problem of implementing one tiny little itty-bitty lock. Fixing Python to remove the GIL entirely is an exponentially more difficult project. If there is one thing to take away: there are practical reasons why the GIL remains.

Final Thoughts

Don't use this talk to justify not using threads. Threads are a very useful programming tool for many kinds of concurrency problems, and they can offer excellent performance even with the GIL (you need to study it). However, you should know about the tricky corner cases.

Final Thoughts

Improving the GIL is something that all Python programmers should care about. Multicore is not going away. You might not use threads yourself, but they are used for a variety of low-level purposes in frameworks and libraries you might be using. More predictable thread behavior is good.

Thread Terminated

That's it! Thanks for listening! Questions?