
Datacenter Computers: modern challenges in CPU design
Dick Sites, Google Inc.
February 2015

Thesis: Servers and desktops require different design emphasis.

Goals: Draw a vivid picture of a computing environment that may be foreign to your experience so far; expose some ongoing research problems; perhaps inspire contributions in this area.

Analogy (S.A.T. pre-2005): ___ is to ___ as ___ is to ___ (picture-only slides).

Datacenter Servers are Different: Move data, big and small. Real-time transactions, 1000s per second. Isolation between programs. Measurement underpinnings.

Move data: big and small

Move data, big and small: move lots of data, disk to/from RAM, network to/from RAM, SSD to/from RAM, and within RAM. Bulk data, and short data (variable-length items). Compress, encrypt, checksum, hash, sort.

Lots of memory. 4004: no memory; i7: 12MB L3 cache.

Little brain, LOTS of memory: a server has 64GB-1TB of RAM, with drinking-straw access to it.

High-bandwidth cache structure: a hypothetical 64-CPU-thread server, 16 cores of 4 hardware threads each (threads 0-3 on core 0 through threads 60-63 on core 15), with per-core L1 caches, shared L2 caches, and a single shared L3 cache.

Forcing function: move 16 bytes every CPU cycle. What are the direct consequences of this goal?

Move 16 bytes every CPU cycle: load 16B, store 16B, test, branch, all in one cycle, so 4-way issue minimum, and we need some 16-byte registers. At 3.2GHz that is 50 GB/s read + 50 GB/s write = 100 GB/s through one L1 D-cache. Must avoid reading a cache line before the write if it is to be fully written (else 150 GB/s).
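
A minimal sketch of such an inner loop on x86 (assuming SSE2, 16-byte-aligned pointers, and a length that is a multiple of 16; the non-temporal store is one way to avoid reading a destination line that will be fully overwritten):

    #include <emmintrin.h>   // SSE2: 16-byte loads/stores
    #include <stddef.h>

    // Copy n bytes, 16 per iteration. Assumes both pointers are 16-byte
    // aligned and n is a multiple of 16. _mm_stream_si128 writes around the
    // cache, so the destination line is not read before being overwritten.
    void copy16_per_cycle(const char* src, char* dst, size_t n) {
      for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
      }
      _mm_sfence();  // make the streaming stores globally visible
    }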

Short strings: parsing, words, packet headers, scatter-gather, checksums. Examples: the words "the quick brown fox", or a packet's VLAN, IP header, TCP header, and checksum fields.

Generic move, 37 bytes: "thequickbrownfoxjumpsoverthelazydog.." is copied as 5 bytes + 16 bytes + 16 bytes.

Generic move, 37 bytes, with 16B-aligned cache accesses: the same copy, again 5 bytes + 16 bytes + 16 bytes.

Generic move, 37 bytes, aligned target, with 16B-aligned cache accesses: the target is written as 10 bytes + 16 bytes + 11 bytes, with the unaligned source handled by a 32B shift. A 32B shift is helpful: LD, >>, ST.

Useful instructions: Load Partial R1, R2, R3 and Store Partial R1, R2, R3. These load/store the low bytes of R1 to/from 0(R2), with length 0..15 taken from the low 4 bits of R3, zero-padding the high bytes; R1 is a 16-byte register; 0(R2) can be unaligned; length zero never segfaults. Similar to the Power architecture Move Assist instructions. This handles all short moves, so the remaining length is always a multiple of 16.
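
Load Partial and Store Partial are proposed instructions, not existing x86 ones; a software model of the intended semantics might look like this (the function names and memcpy-based bodies are mine, for illustration only):

    #include <stdint.h>
    #include <string.h>

    // Model of Load Partial: read (len mod 16) bytes from an arbitrarily
    // aligned address into the low bytes of a 16-byte register image,
    // zero-padding the high bytes. Length zero touches no memory, so it
    // can never segfault.
    void load_partial(uint8_t reg[16], const void* addr, uint64_t len) {
      memset(reg, 0, 16);
      memcpy(reg, addr, len & 0xF);  // hardware would do this in one instruction
    }

    // Model of Store Partial: write the low (len mod 16) bytes of the
    // 16-byte register image to an arbitrarily aligned address.
    void store_partial(const uint8_t reg[16], void* addr, uint64_t len) {
      memcpy(addr, reg, len & 0xF);
    }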

Move 16 bytes every CPU cycle. L1 data cache: 2 aligned LD/ST accesses per cycle = 100 GB/s, plus 100 GB/s of fill.

Move 16 bytes every CPU cycle, sustained throughout L2, L3, and RAM, with perhaps four CPUs doing so simultaneously: 100 GB/s at each of 16 L1 caches, 200 GB/s at each of 4 L2 caches, 400 GB/s at the single L3 cache.

Top 20 Stream Copy Bandwidth (April 2014) -- rank, date, machine ID, ncpus, COPY:
1.  2012.08.14  SGI_Altix_UV_2000            2048   6591 GB/s
2.  2011.04.05  SGI_Altix_UV_1000            2048   5321
3.  2006.07.10  SGI_Altix_4700               1024   3661
4.  2013.03.26  Fujitsu_SPARC_M10-4S         1024   3474
5.  2011.06.06  ScaleMP_Xeon_X6560_64B        768   1493
6.  2004.12.22  SGI_Altix_3700_Bx2            512    906
7.  2003.11.13  SGI_Altix_3000                512    854
8.  2003.10.02  NEC_SX-7                       32    876
9.  2008.04.07  IBM_Power_595                  64    679
10. 2013.09.12  Oracle_SPARC_T5-8             128    604
11. 1999.12.07  NEC_SX-5-16A                   16    607
12. 2009.08.10  ScaleMP_XeonX5570_vSMP_16B    128    437
13. 1997.06.10  NEC_SX-4                       32    434
14. 2004.08.11  HP_AlphaServer_GS1280-1300     64    407
    -- Our forcing function: 1 CPU, 400 GB/s --
15. 1996.11.21  Cray_T932_321024-3E            32    310
16. 2014.04.24  Oracle_Sun_Server_X4-4         60    221
17. 2007.04.17  Fujitsu/Sun_Enterprise_M9000  128    224
18. 2002.10.16  NEC_SX-6                        8    202
19. 2006.07.23  IBM_System_p5_595              64    186
20. 2013.09.17  Intel_XeonPhi_SE10P            61    169
https://www.cs.virginia.edu/stream/top20/bandwidth.html
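
For a rough sense of where one machine lands on such a list, a crude copy-bandwidth probe can be written in a few lines (this is not the official STREAM benchmark; the array size and single untimed-warmup-free memcpy pass are arbitrary choices):

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
      const size_t kBytes = size_t{1} << 30;        // 1 GiB per array
      std::vector<char> src(kBytes, 1), dst(kBytes, 0);

      auto t0 = std::chrono::steady_clock::now();
      std::memcpy(dst.data(), src.data(), kBytes);  // the copy being timed
      auto t1 = std::chrono::steady_clock::now();

      double sec = std::chrono::duration<double>(t1 - t0).count();
      // Count bytes read plus bytes written, as STREAM's COPY figure does.
      std::printf("copy bandwidth: %.1f GB/s\n", 2.0 * kBytes / sec / 1e9);
      return 0;
    }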

Latency: you buy 256GB of RAM only if you use it, which means higher cache miss rates, and main memory is ~200 cycles away.

Cache hits vs. misses, animated at 4 Hz: a 1-cycle hit vs. a 200-cycle miss.

Cache miss to main memory, animated at 4 Hz: a 200-cycle miss.

What constructive work could you do during this time?

Latency: you buy 256GB of RAM only if you use it, which means higher cache miss rates, and main memory is ~200 cycles away. For starters, we need to prefetch about 16B * 200 cycles = 3.2KB ahead to meet our forcing function; call it 4KB.
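
A hedged sketch of prefetching that far ahead in software, assuming GCC/Clang's __builtin_prefetch and a 64-byte cache line:

    #include <stddef.h>
    #include <string.h>

    // Copy with software prefetch about 4KB ahead of the read pointer
    // (roughly 200 cycles of demand at 16 bytes/cycle), one prefetch per
    // 64-byte cache line.
    void copy_with_prefetch(const char* src, char* dst, size_t n) {
      const size_t kAhead = 4096;
      for (size_t i = 0; i < n; i += 64) {
        if (i + kAhead < n) __builtin_prefetch(src + i + kAhead, 0 /*read*/);
        size_t chunk = (n - i < 64) ? (n - i) : 64;
        memcpy(dst + i, src + i, chunk);
      }
    }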

Additional considerations: L1 cache size = associativity * page size, so we need bigger than 4KB pages. A translation buffer of 256 entries x 4KB covers only 1MB of memory, much less than the on-chip caches, and TB misses can consume 15% of total CPU time: need bigger than 4KB pages. With 256GB of RAM at 4KB, that is 64M pages: need bigger than 4KB pages.
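
The page-count arithmetic, with 2MB used only as an example of a larger page size (2MB is the common x86-64 huge-page size; the point holds for any larger page):

    #include <cstdio>

    int main() {
      const double KB = 1024.0, MB = 1024 * KB, GB = 1024 * MB;
      const double ram = 256 * GB;
      // A 256-entry TLB covers entries * page_size; page count = RAM / page_size.
      std::printf("4KB pages: TLB covers %.0f MB, %.0fM pages in 256GB RAM\n",
                  256 * 4 * KB / MB, ram / (4 * KB) / (1024 * 1024));
      std::printf("2MB pages: TLB covers %.0f MB, %.0fK pages in 256GB RAM\n",
                  256 * 2 * MB / MB, ram / (2 * MB) / 1024);
      return 0;
    }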

Move data, big and small: it's still the memory. We need a coordinated design of instruction architecture and memory implementation to achieve high bandwidth with low delay.

Modern challenges in CPU design: lots of memory; multi-issue CPU instructions every cycle that drive full bandwidth; full bandwidth all the way to RAM, not just to the L1 cache; more prefetching in software; bigger page size(s). Topic: do better.

Real-time transactions: 1000s per second

A single transaction across ~40 racks of ~60 servers each; each arc is a related client-server RPC (remote procedure call).

A single transaction, RPC tree vs. time: the client and 93 servers.

Single-transaction tail latency: one slow response out of 93 parallel RPCs slows the entire transaction.

One server, one user-mode thread vs. time: eleven sequential transactions (circa 2004). Normal: 60-70 msec. Long: 1800 msec.

One server, four CPUs: user/kernel transitions on every CPU, every nanosecond (Ktrace); the trace shows CPUs 0-3 over about 200 usec.

16 CPUs, 600us, many RPCs: CPUs 0..15, RPCs 0..39.

16 CPUs, 600us, many RPCs: locks 0..22, threads 0..46.

That is A LOT going on at once. Let's look at just one long-tail RPC in context.

16 CPUs, 600us, one RPC: CPUs 0..15, RPCs 0..39.

16 CPUs, 600us, one RPC: locks 0..22, threads 0..46.

Wakeup detail: thread 19 frees a lock and sends a wakeup to waiting thread 25, but thread 25 actually runs 50us later. A 50us wakeup delay??

Wakeup detail: the target CPU was busy, so the kernel waited 50us. CPU 9 was busy, but CPU 3 was idle.

CPU scheduling, two designs. (1) Re-dispatch on any idle CPU core; but if the idle core is in deep sleep, it can take 75-100us to wake up. (2) Wait to re-dispatch on the previous CPU core, to get cache hits: saves squat if only the same L1 cache could be reused, saves ~10us if the same L2 cache could be reused, saves ~100us if the same L3 cache could be reused, and is expensive if it causes cross-socket cache refills. Don't wait too long.
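
A toy version of that dispatch decision, using the rough numbers from this slide (the structure and names are mine; a real scheduler has far more inputs):

    #include <cstdint>

    // Decide whether to wait for the thread's previous core or re-dispatch on
    // an idle one. shared_cache_level is the deepest cache the previous core
    // still shares with the thread's warm data: 1, 2, or 3.
    bool wait_for_previous_core(uint32_t expected_wait_us,
                                int shared_cache_level,
                                bool idle_core_in_deep_sleep) {
      uint32_t refill_savings_us = 0;                          // ~nothing for L1 reuse
      if (shared_cache_level == 2) refill_savings_us = 10;     // ~10us for L2 reuse
      if (shared_cache_level == 3) refill_savings_us = 100;    // ~100us for L3 reuse
      uint32_t elsewhere_cost_us = idle_core_in_deep_sleep ? 100 : 0;  // 75-100us wakeup
      // Wait only while it is cheaper than refilling caches or waking a core.
      return expected_wait_us < refill_savings_us + elsewhere_cost_us;
    }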

Real-time transactions, 1000s per second: not your father's SPECmarks. To understand delays, we need to track simultaneous transactions across servers, CPU cores, threads, queues, and locks.

Modern challenges in CPU design: a single transaction can touch thousands of servers in parallel; the slowest parallel path dominates; tail latency is the enemy. We must control lock-holding times, scheduler delays, and interference via shared resources. Topic: do better.

Isolation between programs

Histogram: disk server latency, with a long tail; 99th-percentile latency = 696 msec.

Non-repeatable tail latency comes from unknown interference. Isolation of programs reduces tail latency; reduced tail latency = higher utilization; higher utilization = $$$.

Many sources of interference: most interference comes from software, but a bit from the hardware underpinnings. In a shared apartment building, most interference comes from jerky neighbors, but thin walls and bad kitchen venting can be the hardware underpinnings.

Isolation issue, cache interference: CPU thread 0 is moving 16B/cycle flat out, filling the L1, L2, and L3 caches with its own lines and hurting threads 3, 7, and 63.

Isolation issue, cache interference: CPU thread 0 is moving 16B/cycle flat out, hurting threads 3, 7, 63. Today, thread 0's lines crowd out everyone else's in the shared caches; desired is a roughly even share per thread (each L1 split ~equally among its four threads, the L3 split ~equally among all threads).

Cache interference: how to get there? Partitioning by ways is no good if there are 16 threads and 8 ways, no good if the result is direct-mapped, and it underutilizes the cache. Selective allocation: give each thread a target cache size; allocate lines freely if under target; replace only the thread's own lines if over target; allow over-budget slop to avoid underutilization.

Cache interference: each thread keeps a target and a current occupancy, and replaces only its own lines when over target (e.g., per-thread targets of 25 lines with currents of 27, 24, 24, 24; per-core targets of 100 with currents of 105 and 99).
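
A simplified model of that replacement rule for one cache set (the data layout, LRU scheme, and fallback are illustrative, not a real L3 controller design):

    #include <cstdint>
    #include <vector>

    struct CacheLine { bool valid; uint8_t owner; uint32_t lru; };

    // Per-owner occupancy bookkeeping for the whole cache.
    struct Occupancy { std::vector<uint32_t> target, current; };

    // Choose a victim way in one set for an allocation by `owner`:
    // under target, evict the set's overall LRU line; over target, prefer
    // evicting one of the owner's own lines, falling back to global LRU if
    // it has none in this set (the associativity issue the next slide raises).
    int pick_victim(const std::vector<CacheLine>& set,
                    const Occupancy& occ, uint8_t owner) {
      bool over = occ.current[owner] > occ.target[owner];
      int own_lru = -1, any_lru = -1;
      for (int w = 0; w < static_cast<int>(set.size()); ++w) {
        if (!set[w].valid) return w;                      // free way: no eviction
        if (any_lru < 0 || set[w].lru < set[any_lru].lru) any_lru = w;
        if (set[w].owner == owner &&
            (own_lru < 0 || set[w].lru < set[own_lru].lru)) own_lru = w;
      }
      return (over && own_lru >= 0) ? own_lru : any_lru;
    }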

Cache interference, continued: each thread has target:current and replaces only its own lines when over target. But this requires owner bits per cache line (expensive bits) and 64 target/current pairs at the L3, and it fails if the L3 is not at least 64-way associative, because a thread can rarely find its own lines in a set.

Design improvement: track ownership just by incoming paths, plus a separate target for kernel accesses and a separate target for over-target accesses. Fewer bits, and 8-way associativity is OK. Each L2 keeps target/current counts for OV (over-target), K (kernel), and its threads 0-3 or 4-7; the L3 keeps them for OV, K, and thread groups 0-3, 4-7, 8-....

Isolation between programs: good fences make good neighbors. We need better hardware support for program isolation in shared-memory systems.

Modern challenges in CPU design: isolating programs from each other on a shared server is hard, and as an industry we do it poorly: shared CPU scheduling, shared caches, shared network links, shared disks. More hardware support needed; more innovation needed.

Measurement underpinnings

Profiles: what, not why. Samples of 1000s of transactions, merging the results. Pro: understanding average performance. Blind spots: outliers, idle time. (Illustrated with a Chuck Close mosaic portrait.)

Traces: why. Full detail of individual transactions. Pro: understanding outlier performance. Blind spot: tracing overhead.

Histogram: disk server latency, long tail (x-axis 0-1000 msec). A CPU profile shows the 99% common cases; the 99th-percentile latency is 696 msec; the 1% long tail never shows up in the profile.

Trace: disk server, event-by-event. Reads show RPC + disk time; writes show hit, hit, miss; a 700ms mixture. 13 disks, three normal seconds.

Trace: disk server, 13 disks, 1.5 sec: a phase transition to latencies on 250ms boundaries, exactly (legend: read, write, read hit, write hit).

Trace: disk server, 13 disks, 1.5 sec: latencies of 250ms, 500ms, ... for nine minutes (legend: read, write, read hit, write hit).

Why? Probably not on your guessing radar: the kernel was throttling the CPU use of any process that was over its purchased quota. It only happened on old, slow servers.

Disk server, CPU quota bug: understanding this "why" sped up 25% of the entire disk fleet worldwide! It had been going on for three years; the savings paid my salary for 10 years. Hanlon's razor: never attribute to malice that which is adequately explained by stupidity. Sites' corollary: never attribute to stupidity that which is adequately explained by software complexity.

Measurement underpinnings: all performance mysteries are simple once they are understood. Mystery means that the picture in your head is wrong, and software engineers are singularly inept at guessing how their view differs from reality.

Modern challenges in CPU design: we need low-overhead tools to observe the dynamics of performance anomalies: transaction IDs, RPC trees, timestamped transaction begin/end, and traces (CPU kernel+user, RPC, lock, and thread traces; disk, network, and power-consumption traces). Topic: do better.
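
As a flavor of what such low-overhead instrumentation records, one possible trace-event layout (field names and sizes are illustrative, not any particular tracing system's format):

    #include <cstdint>

    // One timestamped event in a binary trace buffer. A transaction is tied
    // together across servers by trace_id; parent_span links the RPC tree.
    struct TraceEvent {
      uint64_t timestamp_ns;   // time of the event
      uint64_t trace_id;       // whole-transaction ID, carried on every RPC
      uint64_t span_id;        // this RPC / lock / thread activity
      uint64_t parent_span;    // caller's span, forming the RPC tree
      uint16_t cpu;            // CPU where the event occurred
      uint16_t kind;           // e.g. RPC_BEGIN, RPC_END, LOCK_WAIT, WAKEUP
      uint32_t arg;            // kind-specific payload (lock ID, byte count, ...)
    };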

Summary: Datacenter Servers are Different

Datacenter Servers are Different: Move data, big and small. Real-time transactions, 1000s per second. Isolation between programs. Measurement underpinnings.

References:
Claude Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July/October 1948. http://cs.ucf.edu/~dcm/teaching/cop5611-spring2013/shannon48-maththeorycomm.pdf
Richard Sites, "It's the Memory, Stupid!," Microprocessor Report, August 5, 1996. http://cva.stanford.edu/classes/cs99s/papers/architects_look_to_future.pdf
Ravi Iyer, "CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms," ACM International Conference on Supercomputing, 2004. http://cs.binghamton.edu/~apatel/cache_sharing/cqos_iyer.pdf
M. Bligh, M. Desnoyers, R. Schultz, "Linux Kernel Debugging on Google-sized Clusters," Linux Symposium, 2007. https://www.kernel.org/doc/mirror/ols2007v1.pdf#page=29

References (continued):
Daniel Sanchez and Christos Kozyrakis, "The ZCache: Decoupling Ways and Associativity," IEEE/ACM Symposium on Microarchitecture (MICRO-43), 2010. http://people.csail.mit.edu/sanchez/papers/2010.zcache.micro.pdf
Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag, "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure," Google Technical Report dapper-2010-1, April 2010. http://scholar.google.com/scholar?q=google+technical+report+dapper-2010-1&btng=&hl=en&as_sdt=0%2c5&as_vis=1
Daniel Sanchez and Christos Kozyrakis, "Vantage: Scalable and Efficient Fine-Grained Cache Partitioning," International Symposium on Computer Architecture (ISCA), 2011. http://ppl.stanford.edu/papers/isca11-sanchez.pdf
Luiz André Barroso and Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2nd Edition, 2013. http://www.morganclaypool.com/doi/pdf/10.2200/s00516ed2v01y201306cac024

Thank you. Questions?

Thank you.