T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing


Robert Golla, Senior Hardware Architect
Paul Jordan, Senior Principal Hardware Engineer
Oracle

Agenda
- T4 Overview
- S3 Core
  - Overview
  - Block Diagram
  - Pipeline
- T4 Performance
- Dynamic Threading
- Crypto
  - Overview
  - Block Diagram
  - Performance
- Summary

T4 Design Objectives
- Optimize software and hardware for Oracle workloads and engineered systems
  - Extend SPARC ISA
- Performance
  - Much better single-thread performance vs. T3
  - Double T3's per-thread throughput performance
  - Enhance overall crypto performance vs. T3
- Compatibility
  - Maintain SPARC V9 and CMT model compatibility
  - Maintain current T3 system scalability
- Reliability
  - Extend T3's RAS capabilities

T4 Chip Overview
[Die diagram: eight S3 cores and two L3 blocks arranged around the CCX crossbar]
- 8 SPARC S3 cores, 8 threads each
- Shared 4 MB L3 cache
  - 8 banks
  - 16-way associative
- Two dual-channel DDR3-1066 memory controllers
- Two PCI-Express 2.0 x8 ports
- Two 10G Ethernet ports
- TSMC 40 nm process
- ~855 million transistors
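
The figures on this slide imply a couple of derived quantities: the total hardware thread count and the number of sets in each L3 bank. A minimal Python sketch of that arithmetic follows. The 64-byte L3 line size is an assumption made only for illustration (the slide does not state it), and `cache_sets` is a hypothetical helper name.

```python
def cache_sets(size_bytes, ways, line_bytes, banks=1):
    """Return the number of sets per bank of a set-associative cache."""
    per_bank_bytes = size_bytes // banks
    return per_bank_bytes // (ways * line_bytes)

# Values taken from the slide.
CORES = 8
THREADS_PER_CORE = 8
L3_BYTES = 4 * 1024 * 1024
L3_WAYS = 16
L3_BANKS = 8

# ASSUMPTION: the line size is not given on the slide; 64 bytes is illustrative.
L3_LINE_BYTES = 64

total_threads = CORES * THREADS_PER_CORE
l3_sets_per_bank = cache_sets(L3_BYTES, L3_WAYS, L3_LINE_BYTES, banks=L3_BANKS)
```

Changing `L3_LINE_BYTES` scales the set count inversely, so the result should be read as conditional on that assumed line size.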

S3 Core Overview
- Out-of-order
- Dual-issue
- Dynamically threaded
- Balanced pipeline design
- Single-thread performance
  - Estimated ~5X S2's SPECint2006* performance
  - Estimated ~7X S2's SPECfp2006* performance
- Throughput performance
  - ~2X S2's per-thread throughput performance
- High frequency, deep pipeline
  - 16-stage integer pipe
  - 3+ GHz

*SPEC, SPECfp, SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Estimates as of 8/18/2011; see www.spec.org for more about SPEC.

S3 Core Overview
- Extensive branch prediction
  - Perceptron direction predictor
  - Return stack to predict return addresses
  - Far and indirect target predictors
  - BTC to reduce taken-branch penalty
- Hardware/software optimizations for Oracle applications
  - User-level crypto instructions
  - PAUSE instruction
  - Fused compare-branch instruction
- HW prefetchers
  - Instruction cache sequential line prefetcher
  - Data cache stride-based prefetcher
- 16 KB, 4-way L1 instruction and data caches
- 128 KB, 8-way unified private L2 cache
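
The slide names a perceptron direction predictor but gives no parameters. As a rough illustration of the general technique, and not of the S3 implementation, the sketch below follows the textbook perceptron predictor: a table of weight vectors indexed by branch address, a dot product with the global history, and training on a misprediction or a low-confidence output. The history length, table size, threshold formula, and class name are all assumptions, and weight saturation is omitted for brevity.

```python
class PerceptronPredictor:
    """Textbook perceptron branch direction predictor (illustrative only)."""

    def __init__(self, history_len=8, table_size=64):
        self.table_size = table_size
        # ASSUMPTION: commonly cited training threshold, floor(1.93*h + 14).
        self.threshold = int(1.93 * history_len + 14)
        # One weight vector per table entry; index 0 is the bias weight.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history: +1 for taken, -1 for not taken.
        self.history = [1] * history_len

    def _output(self, pc):
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        """Predict taken when the perceptron output is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on the actual outcome, then shift it into the history."""
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.weights[pc % self.table_size]
        # Train on a misprediction or when confidence is below the threshold.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w[0] += t
            for i, h in enumerate(self.history):
                w[i + 1] += t * h
        self.history = self.history[1:] + [t]
```

A caller would invoke `predict(pc)` at fetch and `update(pc, taken)` once the branch resolves. A hardware design would also bound the weights to a fixed bit width, which this sketch does not model.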

S3 Core Block Diagram
[Figure: S3 core block diagram]

S3 Core Pipeline
[Figure: S3 core pipeline]

Threaded S3 Core Pipeline View
- Before Pick: only 1 thread per pipe stage
- Pick to Commit: multiple threads per pipe stage
- Commit: only 1 thread per pipe stage

T4 Relative Performance
[Figure: Relative Performance, T4 vs. T1-T3*. Y-axis: single-thread performance, integer workloads (0 to 6). X-axis: multithread performance, OLTP workload (0 to 4).]
Chips plotted:
- T1: 8 S1 cores, 32 threads
- T2: 8 S2 cores, 64 threads
- T3: 16 S2 cores, 128 threads
- T4: 8 S3 cores, 64 threads

*All performance estimates are relative to T1 performance.

T4 Relative Performance
[Figure: Estimated SPECint2006* relative performance, T4/T3 (axis 0 to 14)]
[Figure: Estimated SPECfp2006* relative performance, T4/T3 (axis 0 to 20)]

*SPEC, SPECfp, SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Estimates as of 8/18/2011; see www.spec.org for more about SPEC.

Dynamic Threading
- Many of the resources on S3 are shared between threads
  - Load buffers, store buffers, pick queue, working register file, reorder buffer, etc.
- Thread sharing of resources
  - Static resource allocation
    - Not optimal for heterogeneous workloads
  - Dynamic resource allocation
    - Better for heterogeneous workloads
    - Improves overall application scaling
    - Resources are dynamically configured between threads each cycle
    - No synchronization required
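
To see why static partitioning is a poor fit for heterogeneous workloads, the toy model below compares a fixed equal split of a shared buffer with a pooled allocation that hands out entries on demand. It is a sketch under stated assumptions, not a description of how S3 arbitrates: the round-robin policy, the 64-entry buffer, and the per-thread demands are all hypothetical.

```python
def static_allocation(demands, total_entries):
    """Each thread gets a fixed equal share, whether or not it needs it."""
    share = total_entries // len(demands)
    return [min(d, share) for d in demands]

def dynamic_allocation(demands, total_entries):
    """Entries come from one shared pool, granted round-robin on demand."""
    granted = [0] * len(demands)
    remaining = total_entries
    progressed = True
    while remaining > 0 and progressed:
        progressed = False
        for i, demand in enumerate(demands):
            if remaining > 0 and granted[i] < demand:
                granted[i] += 1
                remaining -= 1
                progressed = True
    return granted

# HYPOTHETICAL workload: one demanding thread and three light ones.
demands = [40, 4, 4, 4]
static_result = static_allocation(demands, total_entries=64)
dynamic_result = dynamic_allocation(demands, total_entries=64)
```

With these illustrative numbers the static split caps the demanding thread at its 16-entry share and leaves most of the buffer idle, while the pooled scheme satisfies every thread's demand.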

Dynamic Threading: Thread Hogs
- Thread hog definition
  - A thread that fails to release its shared resources in a timely fashion
- Thread hog mitigation using watermarks
  - High and low watermarks are defined for each shared resource
    - High watermark is reached by allocation
    - Low watermark is reached by deallocation
  - Upon reaching the high watermark, the thread's resource allocation stalls
  - Allocation remains stalled until the low watermark is reached
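
The watermark scheme above is a hysteresis loop: allocation stops at the high mark and resumes only once usage has drained to the low mark. A minimal per-thread, per-resource sketch follows. The class name and the watermark values in the usage note are hypothetical, since the slide gives no concrete numbers.

```python
class WatermarkGate:
    """Tracks one thread's use of one shared resource with hysteresis."""

    def __init__(self, high, low):
        if not 0 <= low < high:
            raise ValueError("require 0 <= low < high")
        self.high = high
        self.low = low
        self.in_use = 0
        self.stalled = False

    def try_allocate(self):
        """Allocate one entry unless the thread is currently stalled."""
        if self.stalled:
            return False
        self.in_use += 1
        if self.in_use >= self.high:
            # High watermark reached by allocation: stall further allocation.
            self.stalled = True
        return True

    def deallocate(self):
        """Release one entry; resume allocation at the low watermark."""
        if self.in_use == 0:
            raise RuntimeError("nothing to deallocate")
        self.in_use -= 1
        if self.stalled and self.in_use <= self.low:
            # Low watermark reached by deallocation: allocation resumes.
            self.stalled = False
```

For example, with `WatermarkGate(high=4, low=2)` the fifth allocation attempt is refused, and allocation resumes only after two entries have been released. The gap between the two marks keeps the thread from oscillating around a single limit.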

Dynamic Threading: Thread Hogs
- Thread hog mitigation using rate of deallocation
  - Pick Queue (PQ) mitigation technique
    - PQ is the most critical shared resource on S3
    - Threads are expected to deallocate PQ entries in a timely fashion
      - If not, the thread is considered a PQ thread hog
    - If a thread is a PQ thread hog
      - The PQ resource available to the thread is reduced and made available to other threads
      - If hogging behavior continues, the PQ resource is further reduced and made available to other threads
    - If the thread deallocates PQ entries in a timely fashion, its PQ resource is increased
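
The rate-based policy above can be read as a per-thread limit that shrinks while the thread deallocates too slowly and grows back once it behaves. The sketch below models that feedback loop. The evaluation window, step size, bounds, and class name are assumptions chosen for illustration; the slide does not say how "timely" is measured or by how much the limit changes.

```python
class PickQueueLimiter:
    """Adjusts one thread's pick-queue entry limit once per window."""

    def __init__(self, max_entries=16, min_entries=4, step=4,
                 min_deallocs_per_window=1):
        # ASSUMPTION: every numeric default here is hypothetical.
        self.max_entries = max_entries
        self.min_entries = min_entries
        self.step = step
        self.min_deallocs = min_deallocs_per_window
        self.limit = max_entries

    def end_of_window(self, deallocs_in_window):
        """Update and return the limit given this window's deallocations."""
        if deallocs_in_window < self.min_deallocs:
            # Thread is hogging: shrink its share, freeing entries for others.
            self.limit = max(self.min_entries, self.limit - self.step)
        else:
            # Thread is deallocating in a timely fashion: grow its share back.
            self.limit = min(self.max_entries, self.limit + self.step)
        return self.limit
```

Repeated slow windows keep lowering the limit until it reaches the floor, which mirrors the "further reduced" step, and a single well-behaved window starts raising it again.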

Dynamic Threading: Thread Hogs
- Thread hog mitigation using flushes
  - Flush on L3 miss
    - When a load misses the L3, flush the thread
    - Flushing releases any allocated shared resources for that thread
  - Load/store timeout
    - Some events are not covered by other thread hog mitigation policies
      - Example: a load with a read-after-write (RAW) dependence on a previous store that misses the L3
    - Flush any load/store that is the oldest and does not commit for N cycles
  - Flush after IO
    - Flush the thread after an IO access reaches Commit
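
The load/store timeout rule amounts to a watchdog counter on the oldest memory operation. The sketch below shows one plausible shape for it, assuming the counter resets whenever the oldest operation commits and again after a flush. The class name and the timeout value in the usage note are hypothetical; the slide leaves N unspecified.

```python
class CommitWatchdog:
    """Requests a flush when the oldest load/store fails to commit for N cycles."""

    def __init__(self, timeout_cycles):
        if timeout_cycles <= 0:
            raise ValueError("timeout_cycles must be positive")
        self.timeout_cycles = timeout_cycles
        self.stalled_cycles = 0

    def tick(self, oldest_committed):
        """Advance one cycle; return True if the thread should be flushed."""
        if oldest_committed:
            self.stalled_cycles = 0
            return False
        self.stalled_cycles += 1
        if self.stalled_cycles >= self.timeout_cycles:
            # The flush releases the thread's shared resources; start over.
            self.stalled_cycles = 0
            return True
        return False
```

With `CommitWatchdog(timeout_cycles=3)`, three consecutive cycles without a commit trigger a flush on the third cycle, and any commit in between restarts the count.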

Crypto Overview
- Crypto is programmed via user instructions
- Instructions are either in-pipe or out-of-pipe
  - The in-pipe set supports 3-cycle internal latency
    - AES, DES, Kasumi, Camellia, CRC32c
  - The out-of-pipe set has long latency
    - MD5, SHA-1, SHA-256, SHA-512, MPMUL, MONTMUL, MONTSQR
    - MPMUL, MONTMUL, MONTSQR have a separate state machine
      - Stall the Pick Queue to inject crypto multiplies
      - Fairness heuristic between crypto and non-crypto threads
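
CRC32c is one of the operations listed in the in-pipe set. For readers unfamiliar with it, the sketch below is a plain software reference for the CRC-32C (Castagnoli) checksum, processing one bit at a time. It shows what the checksum computes, not how the T4 instruction is encoded or invoked, which the slide does not describe. The function name and signature are this example's own.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli) using the reflected polynomial 0x82F63B78.

    Pass a previous result as `crc` to continue a running checksum.
    """
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial if a 1 was shifted out.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

The standard check input `b"123456789"` yields `0xE3069283`. The bit-serial loop is slow in software, which is part of why a hardware instruction for it is attractive.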

Crypto Unit Block Diagram
[Figure: crypto unit block diagram]

Crypto Relative Performance
[Figure: relative core performance, T4/T3 (axis 0 to 4)]

Summary
- Next-generation processor for Oracle servers
  - Significantly improved per-thread throughput performance
  - Much better single-thread performance
- Dynamic threading
  - Better for heterogeneous workloads
  - Improves overall application scaling
- Excellent overall crypto performance
  - Enables transparent encryption across the Oracle software stack


Glossary
- SPC: SPARC core
- CCX: crossbar
- BTC: branch target cache
- IRF: integer register file
- WRF: working register file
- FRF: floating-point register file
- FGU: floating-point / graphics unit
- IB: instruction buffer