Energy-Efficient, High-Performance Heterogeneous Core Design


Energy-Efficient, High-Performance Heterogeneous Core Design
Raj Parihar
Core Design Session, MICRO 2012
Advanced Computer Architecture Lab, UofR, Rochester
April 18, 2013

References
- Composite Cores: Pushing Heterogeneity into a Core. A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke. University of Michigan, Ann Arbor.
- MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP. Khubaib, M. A. Suleman, M. Hashemi, C. Wilkerson, Y. N. Patt. UT Austin, HPS Lab; Intel Labs, Hillsboro.

Motivation
- Workloads and applications exhibit different phases
- Some phases are constrained by a fundamental ILP limit
  - In an inherently low-ILP phase, a simple in-order core can be used instead of an out-of-order core
  - The in-order core saves energy without degrading overall performance
- Phases also have varying degrees of exploitable ILP and TLP
  - An out-of-order engine is more efficient in high-ILP phases
  - A highly threaded in-order SMT core is more beneficial in TLP phases
- Overall idea: identify the phase behavior and change the architecture on the fly to suit the need

Outline
- Motivation

within a Single Core
- Heterogeneous multicore systems, capable of achieving either high performance or energy efficiency, are quite prominent
  - They often migrate applications or phases to the specific core that favors them
- Issues with conventional heterogeneous systems
  - Slow migrations, which require large phases (100s of millions of instructions)
  - Often coarse-grained, so the fine-grained opportunities are lost
  - Switching and migration have significant performance overhead
- Proposed solution: a single-core microarchitecture that integrates big and little compute µengines
  - An online controller can map about 25% of the code to the little µengine
  - Achieves 18% energy savings at a 5% performance loss

Conventional Heterogeneous CMP: ARM's big.LITTLE
- Incorporates two different kinds of cores on the same chip
  - big: Cortex-A15 (3-way OoO), deeply pipelined (15-25 stages)
  - LITTLE: Cortex-A7 (2-way in-order), short pipeline (8-10 stages)
- How do these fare against each other?
  - Performance: Cortex-A15 is 2-3x faster than Cortex-A7
  - Energy: Cortex-A7 is 3-4x more energy-efficient than Cortex-A15
- The two kinds of cores are utilized, through migration, when an appropriate phase arrives
  - Migration happens through coherent L2 caches and costs about 20 µs
  - Requires large phases to amortize the cost of slow migration
- Composite cores: modify a single core to suit both needs
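To make the amortization argument concrete, the following minimal sketch computes what fraction of time goes to migration for phases of different lengths. Only the 20 µs migration cost comes from the slide; the 1 GHz clock, the IPC of 1.0, and the function name `migration_overhead` are illustrative assumptions.

```python
def migration_overhead(phase_insts, ipc=1.0, freq_hz=1.0e9, migration_s=20e-6):
    """Fraction of total time spent migrating for one phase.

    ipc and freq_hz are assumed values for illustration only;
    migration_s is the ~20 us cost quoted for big.LITTLE.
    """
    phase_s = phase_insts / (ipc * freq_hz)      # time spent doing useful work
    return migration_s / (phase_s + migration_s)


if __name__ == "__main__":
    for insts in (10_000, 1_000_000, 100_000_000):
        share = migration_overhead(insts)
        print(f"{insts:>11,} insts per phase: {share:.2%} of time migrating")
```

Under these assumptions, short phases are dominated by the migration cost, which is why coarse-grained switching is the only practical option in a conventional heterogeneous CMP.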

Fine-Grain Switching Interval
- A conventional heterogeneous CMP requires large phases
  - To amortize the cost of switching, typically a few million instructions
- The migration overhead precludes fine-grained switching in traditional heterogeneous core designs

Composite Cores: Architecture
- Each core consists of two tightly coupled compute µengines
- Achieves high performance and energy efficiency by switching between the µengines in response to changes in application performance
- Shared: front-end, branch predictor, data and instruction caches
- Extra component: a reactive online controller to perform switching
- Switching requires only the register file transfer and some stalling

Reactive Online Controller
- The online controller tries to maximize the energy savings subject to a configurable maximum performance degradation, or slowdown
- Estimates the dynamic performance loss using a linear model
- Switching happens when the loss is more than the acceptable threshold
- The performance estimator is the most crucial, complex, and tricky component, and involves many approximations
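The switching rule described above can be sketched as a per-quantum decision. This is a simplified illustration, not the controller from the paper: the function name `choose_engine`, the cycle-count inputs, and the per-quantum (rather than cumulative) comparison are all assumptions.

```python
def choose_engine(est_big_cycles, est_little_cycles, max_slowdown=0.05):
    """Pick the engine for the next quantum.

    est_big_cycles / est_little_cycles: estimated cycles the quantum would
    take on each engine (hypothetical inputs from a performance estimator).
    max_slowdown: configurable bound on acceptable performance loss.
    """
    slowdown = est_little_cycles / est_big_cycles - 1.0
    # Run on the little engine only while the estimated loss stays in bounds.
    return "little" if slowdown <= max_slowdown else "big"


if __name__ == "__main__":
    # Made-up estimates for three consecutive quanta.
    for big, little in [(1000, 1030), (1000, 1400), (1000, 1010)]:
        print(big, little, "->", choose_engine(big, little))
```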

Performance Estimator
- The goal of this module is to provide an estimate of the performance of both µengines in the previous quantum and overall
- Performance estimation for the non-active µengine is challenging
- Uses a linear performance-estimating model: $y = a_0 + \sum_i a_i x_i$
- Various stats are collected: L2 misses, ILP, L2 hits, MLP, etc.
- Ridge regression analysis is used to determine the coefficients
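As a rough illustration of how coefficients for a model of the form y = a_0 + Σ a_i x_i can be obtained with ridge regression, here is a small dependency-free sketch. The helper names `ridge_fit` and `predict`, the choice to leave the intercept unpenalized, and the sample data are assumptions made for this example; they are not taken from the paper.

```python
def ridge_fit(X, y, lam=1.0):
    """Fit y ~ a0 + sum(a_i * x_i) by ridge regression.

    Solves (A^T A + lam * I') a = A^T y, where A is X with a leading
    column of ones and I' is the identity with a zero in the intercept slot.
    """
    d = len(X[0])
    rows = [[1.0] + list(r) for r in X]
    M = [[sum(r[i] * r[j] for r in rows) + (lam if i == j and i > 0 else 0.0)
          for j in range(d + 1)] for i in range(d + 1)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d + 1)]
    # Gaussian elimination with partial pivoting.
    for c in range(d + 1):
        p = max(range(c, d + 1), key=lambda k: abs(M[k][c]))
        M[c], M[p] = M[p], M[c]
        b[c], b[p] = b[p], b[c]
        for k in range(c + 1, d + 1):
            f = M[k][c] / M[c][c]
            for j in range(c, d + 1):
                M[k][j] -= f * M[c][j]
            b[k] -= f * b[c]
    # Back substitution.
    a = [0.0] * (d + 1)
    for c in range(d, -1, -1):
        tail = sum(M[c][j] * a[j] for j in range(c + 1, d + 1))
        a[c] = (b[c] - tail) / M[c][c]
    return a


def predict(a, x):
    """Evaluate the linear model a0 + sum(a_i * x_i)."""
    return a[0] + sum(ai * xi for ai, xi in zip(a[1:], x))


if __name__ == "__main__":
    # Made-up samples: two counters per quantum and an observed CPI-like value.
    X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
    y = [2, 5, 1, 4, 7]
    coeffs = ridge_fit(X, y, lam=0.1)
    print(coeffs, predict(coeffs, [1, 2]))
```

The penalty term `lam` shrinks the coefficients, which helps when the collected counters are correlated with one another.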

Overall Energy Savings
- The implementable regression model saves about 18% energy
- The reduction in energy-delay product is 21%

Switching Impact on Performance
- Subject to a 5% slowdown, the acceptable margin in performance
- mcf is memory bound; the decrease in branch misprediction latency actually causes a small performance improvement

Little Core Utilization
- On average, about 25% of the code can be mapped to the little core
- With oracle knowledge, about 37% of the code can be mapped
- Applications like mcf can be completely mapped to the little core

Average Little Core Power
- The little µengine consumes slightly more power than a standalone little core because of over-provisioned shared resources

Performance Energy Sensitivity
- Allowing only a 1% slowdown saves up to 4% of the energy
- A 20% performance drop can save up to 44% of the energy
- A good feature to have where maintaining usability is essential
  - Low battery levels in laptops and cell phones

MorphCore: Motivation
- In general, industry builds two types of cores:
  - Large out-of-order cores: Intel's Sandy Bridge, IBM's POWER7
  - Small cores: Intel's Larrabee, Sun's Niagara, ARM's A15
- OoO cores provide high single-thread performance by exploiting ILP but are power-inefficient for multi-threaded programs
- Key insight: a highly threaded in-order SMT core can achieve instruction issue throughput similar to an OoO core (Hily, Seznec)
- MorphCore is built on two key insights: the observation above, and the fact that an in-order SMT core can be built using a subset of the OoO hardware
- MorphCore: start with a traditional OoO core and make minimal changes to transform it into a highly threaded in-order SMT core

In-order SMT vs Out-of-order Superscalar
- Hily & Seznec: a highly threaded in-order core can achieve throughput similar to an OoO core on multi-threaded apps (HPCA '99)
- In high-TLP applications, high performance and low energy consumption can be achieved with in-order SMT execution

MorphCore Microarchitecture
- Two modes of execution: OutOfOrder and InOrder
- Based on a traditional OoO core, and also supports:
  - Additional in-order SMT threads
  - In-order scheduling, execution, and commit of simultaneously running threads

Details of Microarchitecture
- Fetch: using hardware muxes, 2 front-ends can be configured
  - InOrder SMT mode: 8 threads
  - OutOfOrder mode: 2 threads

Real Details: Too Specific
- Hardware muxes and reconfigurable logic transform the OoO core into an in-order SMT core
- Modified rename stage: details are too involved!

Wakeup and Selection Logic
- After all these modifications, the authors claim that only 2.5% extra critical-path delay is added to the design, i.e., a 2.5% lower frequency

MorphCore Mode Switching
- No switching overhead on the OS; the hardware does it itself
  - Not mentioned clearly (most of it is future work!)
- General idea: when the OS schedules more threads, execution is in a parallel region, so enable in-order SMT
  - Threshold: more than 2 threads
  - When the number of active threads is 2 or fewer, enable the OoO engine
- Assumes the thread library uses MONITOR/MWAIT instructions so that MorphCore hardware can detect a thread becoming inactive
- Claims that since no migration of instructions and data needs to happen on mode switches, the penalty is minimal
  - Pipeline flushing and stalling
  - Register and mux reconfiguration
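The thread-count policy above can be sketched in a few lines. This is only an illustration of the threshold rule as summarized on the slide: the names `select_mode` and `count_mode_switches`, the starting mode, and the example trace are hypothetical, and the real decision is made in hardware.

```python
def select_mode(active_threads, threshold=2):
    """Return the execution mode for the given number of active threads.

    More than `threshold` active threads suggests a parallel region, so the
    in-order SMT mode is chosen; otherwise the OoO engine is used.
    """
    return "InOrder" if active_threads > threshold else "OutOfOrder"


def count_mode_switches(thread_trace, threshold=2):
    """Count mode changes over a trace of active-thread counts."""
    switches = 0
    mode = "OutOfOrder"  # assumed starting mode for this sketch
    for n in thread_trace:
        new_mode = select_mode(n, threshold)
        if new_mode != mode:
            switches += 1   # each switch implies a pipeline flush and stall
            mode = new_mode
    return switches


if __name__ == "__main__":
    trace = [1, 1, 8, 8, 6, 2, 1, 4]   # made-up active-thread counts
    print([select_mode(n) for n in trace])
    print("mode switches:", count_mode_switches(trace))
```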

Performance Results
- ST apps: MorphCore achieves performance very close to the OoO 2-way SMT core
- MT apps: achieves performance close to the 6-thread in-order SMT core (SMALL)

Overall Speedup, Power and Energy
- With performance and energy combined, MorphCore does better than all other alternatives

Comparison with CoreFusion
- Opposite approach: instead of building a larger core from small cores (CoreFusion), MorphCore scales down the OoO design to implement a simple in-order SMT core

Other Metrics compared to CoreFusion
- Reduces power by 19%, energy by 29%, and energy-delay-squared product by 29%

Both ideas are quite similar to each other
- Both proposals bring the notion of heterogeneity within a core
- Both designs try to leverage fine-grained phases at runtime
- They also try to reuse (share) as much hardware as possible
- Both designs also try to minimize the migration overhead
- Both designs require significant modifications to the core microarchitecture
- The savings/benefits are only a few percent
- Complexity is quite high for these new core designs