Thousands of Threads and Blocking I/O

The old way to write Java servers is new again (and way better)
Paul Tyma
paul.tyma@gmail.com
paultyma.blogspot.com

Who am I
- Day job: Senior Engineer, Google
- Founder, Preemptive Solutions, Inc.
  - Dasho, Dotfuscator
- Founder/CEO, ManyBrain, Inc.
  - Mailinator
  - empirical max rate seen: ~1250 emails/sec
  - extrapolates to 108 million emails/day
- Ph.D., Computer Engineering, Syracuse University

About This Talk
- Comparison of server models
  - IO: synchronous
  - NIO: asynchronous
- Empirical results
- Debunk myths of multithreading (and NIO at times)
- Every server is different
  - compare 2 multithreaded server applications

An argument over server design
- Synchronous I/O, single connection per thread model
  - high concurrency
  - many threads (thousands)
- Asynchronous I/O, single thread per server!
  - in practice, more threads often come into play
  - application handles context switching between clients

Evolution
- We started with simple threading-modeled servers
  - one thread per connection
  - synchronization issues came to light quickly
  - turned out synchronization was hard
    - pretty much everyone got it wrong
    - resulting in nice, intermittent, untraceable, unreproducible, occasional server crashes
    - turns out, whether we admit it or not, "occasional" is workable

Evolution
- The bigger problem was that scaling was limited
- After a few hundred threads things started to break down
  - i.e., a few hundred clients
- In comes Java NIO
  - a pre-existing model elsewhere (C++, Linux, etc.)
- Asynchronous I/O
  - I/O becomes event based

Evolution - NIO
- The server goes about its business and is notified when some I/O event is ready to be processed
- We must keep track of where each client is within an I/O transaction
  - telephone: "For client XYZ, I just picked up the phone and said 'Hello'. I'm now waiting for a response."
- In other words, we must explicitly save the state of each client

Evolution - NIO
- In theory, one thread for the entire server
  - no synchronization
  - no given task can monopolize anything
- Rarely, if ever, works that way in practice
  - small pools of threads handle several stages
  - multiple back-end communication threads
  - worker threads
  - DB threads

Evolution
- SEDA
  - Matt Welsh's Ph.D. thesis
  - http://www.eecs.harvard.edu/~mdw/proj/seda
  - Gold standard in server design
    - dynamic thread pool sizing at different tiers
    - somewhat of an evolutionary algorithm: try another thread, see what happens
  - Built on Java NBIO (another NIO lib)

Evolution
- Asynchronous I/O
  - Limited by CPU, bandwidth, file descriptors (not threads)
- Common knowledge that NIO >> IO
  - java.nio
  - java.io
- Somewhere along the line, someone got "scalable" and "fast" mixed up, and it stuck

NIO vs. IO (asynchronous vs. synchronous I/O)

An attempt to examine both paradigms
- If you were writing a server today, which would you choose?
  - why?

Reasons I've heard to choose NIO (note: all of these are up to debate)
- Asynchronous I/O is faster
- Thread context switching is slow
- Threads take up too much memory
- Synchronization among threads will kill you
- Thread-per-connection does not scale

Reasons to choose thread-per-connection IO (again, maybe, maybe not, we'll see)
- Synchronous I/O is faster
- Coding is much simpler
  - You code as if you only have one client at a time
- Make better use of multi-core machines

All the reasons (for reference)
- nio: Asynchronous I/O is faster
- io: Synchronous I/O is faster
- nio: Thread context switching is slow
- io: Coding is much simpler
- nio: Threads take up too much memory
- io: Make better use of multi-cores
- nio: Synchronization among threads will kill you
- nio: Thread-per-connection does not scale
© 2008 Paul Tyma

NIO vs. IO: which is faster?
- Forget threading a moment
- single sender, single receiver
- For a tangential purpose I was benchmarking NIO and IO
  - simple "blast data" benchmark
- I could only get NIO to transfer data at up to about 75% of IO's speed
  - asynchronous (blocking NIO was just as fast)
  - I blamed myself because we all know asynchronous is faster than synchronous, right?

NIO vs. IO
- Started doing some emailing and googling
  - looking for benchmarks, experiential reports
  - Yes, everyone knew NIO was faster
  - No, no one had actually personally tested it
- Started formalizing my benchmark, then found
  - http://www.theserverside.com/discussions/thread.tss?thread_id=2
  - http://www.realityinteractive.com/rgrzywinski/archives/000096.ht

Excerpts
"Blocking model was consistently 25-35% faster than using NIO selectors. Lot of techniques suggested by EmberIO folks were employed - using multiple selectors, doing multiple (2) reads if the first read returned EAGAIN equivalent in Java. Yet we couldn't beat the plain thread per connection model with Linux NPTL."
"To work around not so performant/scalable poll() implementation on Linux's we tried using epoll with Blackwidow JVM on a 2.6.5 kernel. while epoll improved the over scalability, the performance still remained 25% below the vanilla thread per connection model. With epoll we needed lot fewer threads to get to the best performance mark that we could get out of NIO."
Rahul Bhargava, CTO Rascal Systems

Asynchronous (simplified)
1) Make system call to selector
2) If nothing to do, goto 1
3) Loop through tasks
   a) If it's an OP_ACCEPT, system call to accept the connection, save the key
   b) If it's an OP_READ, find the key, system call to read the data
   c) If more tasks, goto 3
4) goto 1
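The asynchronous steps above map fairly directly onto java.nio's Selector. The following is a minimal sketch, not the benchmark code from the talk: the class name `SelectorLoopSketch`, the `serve` method, and the stop-after-N-bytes condition are assumptions added so the example terminates.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Hypothetical sketch of a single-threaded select loop.
final class SelectorLoopSketch {
    /** Runs the select loop until totalExpected bytes have been read; returns bytes read. */
    static long serve(ServerSocketChannel server, long totalExpected) throws IOException {
        long total = 0;
        try (Selector selector = Selector.open()) {
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            ByteBuffer buf = ByteBuffer.allocate(8192);
            while (total < totalExpected) {
                selector.select();                      // step 1: system call to the selector
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {                  // step 3: loop through ready tasks
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {           // 3a: accept, register the new client
                        SocketChannel client = server.accept();
                        if (client != null) {
                            client.configureBlocking(false);
                            client.register(selector, SelectionKey.OP_READ);
                        }
                    } else if (key.isReadable()) {      // 3b: find the client, read its data
                        SocketChannel client = (SocketChannel) key.channel();
                        buf.clear();
                        int n = client.read(buf);
                        if (n < 0) {                    // client closed the connection
                            key.cancel();
                            client.close();
                        } else {
                            total += n;
                        }
                    }
                }
            }
        }
        return total;
    }
}
```

Note that every accept and every read is preceded by a trip through `select()`, which is the extra work the simplified step list points at.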

Synchronous (simplified)
1) Make system call to accept connection (thread blocks there until we have one)
2) Make system call to read data (thread blocks there until it gets some)
3) goto 2
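For comparison, the synchronous steps can be sketched with plain java.io and one thread per connection. This is an illustrative sketch only; `BlockingServerSketch`, `serve`, and the fixed connection count are assumptions made so the example finishes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a thread-per-connection server with blocking reads.
final class BlockingServerSketch {
    /** Accepts `connections` clients, one thread each; returns total bytes read. */
    static long serve(ServerSocket server, int connections)
            throws IOException, InterruptedException {
        AtomicLong total = new AtomicLong();
        Thread[] handlers = new Thread[connections];
        for (int i = 0; i < connections; i++) {
            Socket client = server.accept();            // step 1: blocks until a client connects
            handlers[i] = new Thread(() -> {
                byte[] buf = new byte[8192];
                try (Socket s = client; InputStream in = s.getInputStream()) {
                    int n;
                    while ((n = in.read(buf)) != -1) {  // step 2: blocks until data arrives
                        total.addAndGet(n);             // step 3: go back and read again
                    }
                } catch (IOException e) {
                    // a real server would log this; the sketch just drops the connection
                }
            });
            handlers[i].start();
        }
        for (Thread t : handlers) {
            t.join();
        }
        return total.get();
    }
}
```

Each handler thread reads as if it had the only client, and the operating system does the switching between clients.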

Straight Throughput (* do not compare charts against each other)
[Two bar charts: "Linux 2.6 throughput" and "Windows XP throughput", each comparing the nio server and the io server in MB/s.]

All the reasons
- nio: Asynchronous I/O is faster
- io: Synchronous I/O is faster
- nio: Thread context switching is slow
- io: Coding is much simpler
- nio: Threads take up too much memory
- io: Make better use of multi-cores
- nio: Synchronization among threads will kill you
- nio: Thread-per-connection does not scale

Multithreading

Multithreading
- Hard
  - You might be saying "nah, it's not bad"
- Let me rephrase
  - Hard, because everyone thinks it isn't
  - http://today.java.net/pub/a/today/2007/06/28/extending-reentrantreadwritelock.html
  - http://www.ddj.com/java/199902669?pgno=3
- Much like generics, however, you can't avoid it
  - good news is that the rewards are significant
  - Inherently takes advantage of multi-core systems

Multithreading
- Linux 2.4 and before
  - Threading was quite abysmal
  - few hundred threads max
- Windows actually quite good
  - up to 16000 threads
  - JVM limitations

In walks NPTL
- Threading library standard in Linux 2.6
  - Linux 2.4 by option
- Idle thread cost is near zero
- Context-switching is much, much faster
- Possible to run many (many) threads
- http://en.wikipedia.org/wiki/native_posix_thread_library

Thread context-switching is expensive
- Lower lines are faster
- JDK 1.6, Core Duo
- Blue line represents up to 1000 threads competing for the CPUs (Core Duo)
  - notice behavior between 1 and 2 threads
  - 1 or 1000 all fighting for the CPU: context switching doesn't cost much at all
[Chart: "HashMap Gets, 0% Writes, Core Duo"; y-axis: milliseconds; x-axis: number of threads (1, 2, 5, 100, 500, 1000); legend: HashMap, ConcHashMap, Cliff Tricky.]

Synchronization is Expensive
- With only one thread, uncontended synchronization is cheap
- Note how even just 2 threads magnifies synch cost
[Chart: "HashMap Gets, 0% Writes, Core Duo" (smaller is faster); y-axis: milliseconds; x-axis: number of threads (1, 2, 5, 100, 500, 1000); legend: HashMap, SyncHashMap, Hashtable, ConcHashMap, Cliff Tricky.]

4 core Opteron
- Synchronization magnifies method call overhead
- more cores, more sink in line
- non-blocking data structures do very well: we don't always need explicit synchronization
[Chart: "HashMap Gets, 0% writes" (smaller is faster); y-axis: ms to completion; x-axis: number of threads (1, 2, 5, 100, 500, 1000); legend: HashMap, SyncHashMap, Hashtable, ConcHashMap, Cliff Tricky.]

How about a lot more cores? (guess: how many cores standard in 3 years?)
[Chart: "Azul 768core, 0% writes"; x-axis: number of threads (1, 2, 5, 100, 500, 1000); legend: HashMap, SyncHashMap, Hashtable, ConcHashMap, Cliff Tricky.]

Threading Summary
- Uncontended synchronization is cheap
  - and can often be free
- Contended synch gets more expensive
- Nonblocking data structures scale well
- Multithreaded programming style is quite viable even on single core

All the reasons
- io: Synchronous I/O is faster
- nio: Thread context switching is slow
- io: Coding is much simpler
- nio: Threads take up too much memory
- io: Make better use of multi-cores
- nio: Synchronization among threads will kill you
- nio: Thread-per-connection does not scale

Many possible NIO server configurations
- True single thread
  - Does selects
  - reads
  - writes
  - and backend data retrieval/writing
    - from/to db
    - from/to disk
    - from/to cache

Many possible NIO server configurations
- True single thread
  - This doesn't take advantage of multicores
- Usually just one selector
  - Usually use a threadpool to do backend work
- NIO must lock to give writes back to net thread
- Interview question:
  - What's harder, synchronizing 2 threads or synchronizing 1000 threads?

All the reasons
- io: Synchronous I/O is faster
- io: Coding is much simpler
- nio: Threads take up too much memory
CHOOSE
- io: Make better use of multi-cores
- nio: Synchronization among threads will kill you

The story of Rob von Behren (as remembered by me from a lunch with Rob)
- Set out to write a high-performance asynchronous server system
- Found that when switching between clients, the code for saving and restoring values/state was difficult
- Took a step back and wrote a finely-tuned, organized system for saving and restoring state between clients
- When he was done, he sat back and realized he had written the foundation for a threading package

Synchronous I/O: state is kept in control flow

    // exceptions and such left out
    // readUTF() is a DataInputStream method, so the socket stream is wrapped in one
    DataInputStream inputStream = new DataInputStream(socket.getInputStream());
    OutputStream outputStream = socket.getOutputStream();
    String command = null;

    // each loop repeats until the expected command arrives; sendError()
    // (defined elsewhere) reports the problem and returns true to keep waiting
    do {
        command = inputStream.readUTF();
    } while (!command.equals("HELO") && sendError());

    do {
        command = inputStream.readUTF();
    } while (!command.startsWith("MAIL FROM:") && sendError());
    handleMailFrom(command);

    do {
        command = inputStream.readUTF();
    } while (!command.startsWith("RCPT TO:") && sendError());
    handleRcptTo(command);

    do {
        command = inputStream.readUTF();
    } while (!command.equals("DATA") && sendError());
    handleData();
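For contrast, an event-driven server cannot park a thread in the middle of that conversation, so the position of each client has to be stored explicitly. The sketch below is one hypothetical way to do that; `SmtpStateSketch`, the `State` enum, and the integer client ids are illustrative assumptions, not code from the talk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-client protocol state kept in a map instead of in control flow.
final class SmtpStateSketch {
    /** Where one client currently is in the conversation. */
    enum State { EXPECT_HELO, EXPECT_MAIL_FROM, EXPECT_RCPT_TO, EXPECT_DATA, DONE }

    private final Map<Integer, State> stateByClient = new HashMap<>();

    /** Returns the next state for a command, or the same state if the command is out of order. */
    static State next(State current, String command) {
        switch (current) {
            case EXPECT_HELO:
                return command.equals("HELO") ? State.EXPECT_MAIL_FROM : current;
            case EXPECT_MAIL_FROM:
                return command.startsWith("MAIL FROM:") ? State.EXPECT_RCPT_TO : current;
            case EXPECT_RCPT_TO:
                return command.startsWith("RCPT TO:") ? State.EXPECT_DATA : current;
            case EXPECT_DATA:
                return command.equals("DATA") ? State.DONE : current;
            default:
                return current;
        }
    }

    /** Called whenever any client has a complete command ready; clients may interleave freely. */
    State onCommand(int clientId, String command) {
        State updated = next(stateByClient.getOrDefault(clientId, State.EXPECT_HELO), command);
        stateByClient.put(clientId, updated);
        return updated;
    }
}
```

The map lookup and the switch do by hand what the program counter and the stack do for free in the blocking version.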

Server in Action

Our 2 server designs (synchronous multithreaded)
- One thread per connection
- All code is written as if your server can only handle one connection
- And all data structures can be manipulated by many other threads
  - Synchronization can be tricky
- Reads are blocking
  - Thus when waiting for a read, a thread is completely idle
  - Writing is blocking too, but is not typically significant
- Hey, what about scaling?

Synchronous Multithreaded
- Mailinator server:
  - Quad Opteron, 2GB of RAM, 100Mbps ethernet, Linux 2.6, Java 1.6
  - Runs both HTTP and SMTP servers
- Quiz:
  - How many threads can a computer run?
  - How many threads should a computer run?

Synchronous Multithreaded
- Mailinator server:
  - Quad Opteron, 2GB of RAM, 100Mbps ethernet, Linux 2.6, Java 1.6
- How many threads can a computer run?
  - Each thread in 32-bit Java requires 48k of stack space
    - Linux Java limitation, not OS (Windows = 2k?)
    - option -Xss48k (512k by default)
  - ~41,666 thread stacks (2GB / 48k)
    - not counting room for OS, Java, other data, etc.
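The figure on this slide is plain division of memory by per-thread stack size, taking 2GB as 2,000,000k. A small sketch of that arithmetic follows, together with the standard `Thread` constructor that requests a stack size for a single thread; the class and method names are hypothetical, and the JVM is free to round the requested size up or ignore it.

```java
// Hypothetical sketch: the stack-budget arithmetic and a per-thread stack size request.
final class StackBudgetSketch {
    /** Upper bound on thread stacks that fit in a memory budget (same unit for both arguments). */
    static long maxThreadStacks(long memoryKb, long stackKb) {
        return memoryKb / stackKb;
    }

    /** Runs a task on a thread that asks for the given stack size; the size is only a hint. */
    static void runWithStack(Runnable task, long stackBytes) throws InterruptedException {
        Thread t = new Thread(null, task, "small-stack", stackBytes);
        t.start();
        t.join();
    }
}
```

With the slide's numbers, `maxThreadStacks(2_000_000, 48)` gives 41666; the JVM-wide default is set with the `-Xss` option instead of per thread.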

Synchronous Multithreaded
- Mailinator server:
  - Quad Opteron, 2GB of RAM, 100Mbps ethernet, Linux 2.6, Java 1.6
- How many threads should a computer run?
  - Similar to asking, how much should you eat for dinner?
    - usual answer: until you've had enough
    - usual answer: until you're full
    - fair answer: until you're sick
    - odd answer: until you're dead

Synchronous Multithreaded
- Mailinator server:
  - Quad Opteron, 2GB of RAM, 100Mbps ethernet, Linux 2.6, Java 1.6
- How many threads should a computer run?
  - Just enough to max out your CPU(s)
    - replace CPU with any other resource
- How many threads should you run for 100% CPU-bound tasks?
  - maybe #ofcores+1 or so
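The "#ofcores+1" rule of thumb for CPU-bound work can be sketched with a fixed-size pool. The names `CpuBoundPoolSketch`, `poolSize`, and `parallelSum` are hypothetical, and the summation merely stands in for any purely CPU-bound task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: size a pool for CPU-bound work at cores + 1.
final class CpuBoundPoolSketch {
    static int poolSize() {
        return Runtime.getRuntime().availableProcessors() + 1;
    }

    /** Sums 1..n using at most one chunk of the range per pool thread. */
    static long parallelSum(long n) throws InterruptedException, ExecutionException {
        int threads = poolSize();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            long chunk = n / threads + 1;
            for (long start = 1; start <= n; start += chunk) {
                final long lo = start;
                final long hi = Math.min(n, start + chunk - 1);
                parts.add(pool.submit(() -> {
                    long sum = 0;
                    for (long i = lo; i <= hi; i++) {
                        sum += i;
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();    // blocks until that chunk is finished
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

For tasks that spend most of their time blocked on I/O, as in the servers discussed here, the right count is usually far higher than this.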

Synchronous Multithreaded
- Note that servers must pick a saturation point
- That's either CPU or network or memory or something
- Typically, serving a lot of users well is better than serving a lot more very poorly (or not at all)
- You often need "push back", such that too much traffic comes all the way back to the front

[Diagram: Acceptor Threads, put, Task Queue, fetch, Worker Threads]
- how big should the queue be?
- should it be unbounded?
- how many worker threads?
- what happens if it fills?
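One possible set of answers to those questions is a bounded queue whose `offer` fails when it is full, so the acceptor learns immediately that the server is saturated and can push back. The sketch below is hand-rolled for illustration; `BoundedHandoffSketch`, `submit`, and `shutdown` are hypothetical names, and the capacities are whatever the caller chooses.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: acceptor threads put, worker threads fetch, and the queue is bounded.
final class BoundedHandoffSketch {
    private final BlockingQueue<Runnable> queue;
    private final Thread[] workers;

    BoundedHandoffSketch(int queueCapacity, int workerCount) {
        queue = new ArrayBlockingQueue<>(queueCapacity);
        workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        queue.take().run();     // fetch: blocks while the queue is empty
                    }
                } catch (InterruptedException e) {
                    // interrupted by shutdown(): let the worker exit
                }
            });
            workers[i].start();
        }
    }

    /** put: returns false instead of blocking when the queue is full (push back). */
    boolean submit(Runnable task) {
        return queue.offer(task);
    }

    void shutdown() throws InterruptedException {
        for (Thread w : workers) {
            w.interrupt();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```

An acceptor that gets `false` back might close the new connection or answer with a "busy" response; a task that throws would kill its worker here, which a real server would have to handle.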

Design Decisions

Thread Pools
- Executors.newCachedThreadPool = evil, die die die
- Tasks go to a SynchronousQueue
  - i.e., direct hand-off from task-giver to new thread
- unused threads eventually die (default: 60 sec)
- new threads are created if all existing are busy
  - but only up to Integer.MAX_VALUE threads (snicker)
- Scenario: CPU is pegged with work so threads aren't finishing fast, more tasks arrive, more threads are created to handle new work
  - when CPU is pegged, more threads is not what you need
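A bounded alternative can be built directly with `ThreadPoolExecutor`. The sketch below caps both the thread count and the queue length and rejects work beyond that; the class name `BoundedPoolSketch` and the choice of `AbortPolicy` are illustrative assumptions, not a recommendation from the talk.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a pool that cannot grow without limit when the CPU is pegged.
final class BoundedPoolSketch {
    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads,                                  // core pool size
                threads,                                  // maximum pool size: never more than this
                60L, TimeUnit.SECONDS,                    // keep-alive for idle threads above core size
                new ArrayBlockingQueue<>(queueCapacity),  // bounded task queue
                new ThreadPoolExecutor.AbortPolicy());    // full queue: throw RejectedExecutionException
    }
}
```

When every thread is busy and the queue is full, `execute` throws `RejectedExecutionException`, which gives the caller a place to shed load.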

Blocking Data Structures
- BlockingQueue
  - canonical "hand-off" structure
  - embedded within Executors
  - Rarely want LinkedBlockingQueue
    - i.e., more commonly use ArrayBlockingQueue
- removals can be blocking
- insertions can wake up sleeping threads
- In IO we can hand the worker threads the socket

NonBlocking Data Structures
- ConcurrentLinkedQueue
  - concurrent linked list based on CAS
  - elegance is downright fun
  - No data corruption or blocking happens regardless of the number of threads adding or deleting or iterating
  - Note that iteration is "fuzzy"
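The "fuzzy" iteration can be seen in a small experiment: several threads add to a `ConcurrentLinkedQueue` while another thread iterates over it. This is a sketch under assumed names (`LockFreeQueueSketch`, `fill`); how many elements the iterator happens to see varies from run to run.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: concurrent offers plus a concurrent, weakly consistent iteration.
final class LockFreeQueueSketch {
    /** Returns {elements seen while producers were running, final queue size}. */
    static int[] fill(int producers, int perProducer) throws InterruptedException {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            threads[p] = new Thread(() -> {
                for (int i = 0; i < perProducer; i++) {
                    queue.offer(i);             // CAS-based insert, never blocks
                }
            });
            threads[p].start();
        }
        int seen = 0;
        for (Integer ignored : queue) {         // may miss concurrent adds, but never throws
            seen++;
        }
        for (Thread t : threads) {
            t.join();
        }
        return new int[] { seen, queue.size() };    // size() walks the whole list
    }
}
```

Unlike iterating an `ArrayList` during modification, no `ConcurrentModificationException` is thrown; the iterator simply reflects some state of the queue at or after its creation.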

NonBlocking Data Structures
- ConcurrentHashMap
  - Not quite as concurrent
  - non-blocking reads
  - stripe-locked writes
  - Can increase parallelism in constructor
- Cliff Click's NonBlockingHashMap
  - Fully non-blocking
  - Does surprisingly well with many writes
  - Also has NonBlockingLongHashMap (thanks Cliff!)
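The constructor parameter mentioned above is `concurrencyLevel`, the third argument of the three-argument `ConcurrentHashMap` constructor. A minimal sketch with hypothetical names follows; note that since Java 8 the implementation no longer uses fixed lock stripes and treats this value only as a sizing hint.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: many writer threads updating shared counters without explicit locks.
final class StripedMapSketch {
    static ConcurrentHashMap<String, Integer> countConcurrently(int threads, int perThread)
            throws InterruptedException {
        // initialCapacity 16, loadFactor 0.75, concurrencyLevel = expected number of writers
        ConcurrentHashMap<String, Integer> hits = new ConcurrentHashMap<>(16, 0.75f, threads);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    hits.merge("key-" + (i % 4), 1, Integer::sum);  // atomic read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return hits;
    }
}
```

Using `merge` rather than a separate `get` and `put` keeps each increment atomic, so no updates are lost even though the caller never synchronizes.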

BufferedStreams
- BufferedOutputStream
  - simply creates an 8k buffer
  - requires flushing
  - can immensely improve performance if you keep sending small sets of bytes
  - The native call to do a send of a byte wraps it in an array and then sends that
- Look at BufferedStreams like adding a memory copy between your code and the send

BufferedStreams
- BufferedOutputStream
  - If your sends are broken up, use a buffered stream
  - if you already package your whole message into a byte array, don't buffer
    - i.e., you already did
  - Default buffer in BufferedOutputStream is 8k: what if your message is bigger?
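The effect of buffering on small sends can be made visible by counting how many write calls reach the underlying stream. In the sketch below, `CountingStream` is a hypothetical stand-in for a socket's output stream, and `callsForSmallWrites` is an illustrative helper.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: count the writes that reach the "socket" with and without buffering.
final class BufferingSketch {
    /** Stand-in for a socket stream: counts how many write calls reach it. */
    static final class CountingStream extends OutputStream {
        int calls;

        @Override
        public void write(int b) {
            calls++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            calls++;
        }
    }

    static int callsForSmallWrites(boolean buffered, int count) throws IOException {
        CountingStream sink = new CountingStream();
        OutputStream out = buffered ? new BufferedOutputStream(sink) : sink;
        for (int i = 0; i < count; i++) {
            out.write(i);       // one byte at a time
        }
        out.flush();            // buffered data is not sent until the buffer fills or is flushed
        return sink.calls;
    }
}
```

A thousand single-byte writes reach the unbuffered sink as a thousand calls, while the buffered stream passes them on as one array write at flush time, since they fit in the default 8k buffer.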

Bytes
- Java programmers seem to have a fascination with Strings
  - fair enough, they are a nice human abstraction
- Can you keep your server's data solely in bytes?
- Object allocation is cheap, but a few million objects are probably measurable
- String manipulation is cumbersome internally
- If you can stay with byte arrays, it won't hurt
- Careful with autoboxing too
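Staying in bytes might look like the following sketch, which checks a received buffer for a command prefix without building a `String`. The class `ByteCommandSketch` and its helpers are hypothetical utility methods of the kind a bytes-only server ends up writing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: recognize a protocol command directly in a byte buffer.
final class ByteCommandSketch {
    private static final byte[] MAIL_FROM = "MAIL FROM:".getBytes(StandardCharsets.US_ASCII);

    /** True if data[0..length) begins with prefix; nothing is allocated per call. */
    static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    static boolean isMailFrom(byte[] data, int length) {
        return startsWith(data, length, MAIL_FROM);
    }
}
```

The `length` argument lets the check run against a reused read buffer that is only partly filled.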

Papers (indirect references)
- "Why Events are a Bad Idea (For high-concurrency servers)", Rob von Behren
- Dan Kegel's C10K problem
  - somewhat dated now but still great depth

Designing Servers

A static Web Server
- HTTP is stateless
  - very nice: one request, one response
  - Many clients, many short connections
- Interestingly, for normal GET operations (ajax too) we only block as they call in. After that, they are blocked waiting for our reply
- Threads must cooperate on cache

Thread per request
[Diagram: Client, Worker threads (x3), cache, disk]
- Create a fixed number of threads in the beginning (not in a thread pool)
- All threads block at accept(), then go process the client
- Watch socket timeouts
- Catch all exceptions

Thread per request
[Diagram: Clients, Acceptor (create new Thread), Worker threads (x3), cache, disk]
- Thread creation cost is historically expensive
- Load tempered by keeping track of the number of threads alive
  - not terribly precise

Thread per request
[Diagram: Clients, Acceptor, Queue, Worker threads (x3), cache, disk]
- Executors provide many knobs for thread tuning
- Handle dead threads
- Queue size can indicate load (to a degree)
- Acceptor threads can help out easily

A static Web Server
- Given HTTP is request-reply
  - The longer transactions take, the more threads you need (because more are waiting on the backend)
- For a smartly cached static web server on a 2 core machine, 15-20 threads saturated the system
- Add a database backend or other complex processing per transaction and the number of threads increases linearly

SMTP Server
a much more chatty protocol

An SMTP server
- Very conversational protocol
- Server spends a lot of time waiting for the next command (like many milliseconds)
- Many threads simply asleep
- A few hundred threads is very common
- A few thousand is not uncommon
  - (JVM max allowed: 32k)

An SMTP server
- All threads must cooperate on storage of messages
  - (keep in mind the concurrent data structures!)
- Possible to saturate the bandwidth after a few thousand threads
  - or the disk

Coding
- Multithreaded server coding is more intuitive
  - You simply follow the flow of what's going to happen to one client
- Non-blocking data structures are very fast
- Immutable data is fast
- Sticking with bytes-in, bytes-out is nice
  - might need more utility methods

Questions?