Energy-Efficient, High-Performance Heterogeneous Core Design
Raj Parihar
Core Design Session, MICRO 2012
Advanced Computer Architecture Lab, UofR, Rochester
April 18, 2013

References
- Composite Cores: Pushing Heterogeneity into a Core. A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke. University of Michigan, Ann Arbor.
- MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP. Khubaib, M. A. Suleman, M. Hashemi, C. Wilkerson, and Y. N. Patt. UT Austin, HPS Lab; Intel Labs, Hillsboro.

Motivation
- Workloads and applications exhibit different phases
- Some phases are constrained by a fundamental ILP limit
  - In an inherently low-ILP phase, a simple in-order core can be used instead of an out-of-order core
  - The in-order core saves energy without degrading overall performance
- Phases also have varying degrees of exploitable ILP and TLP
  - An out-of-order engine is more efficient in high-ILP phases
  - A highly threaded in-order SMT core is more beneficial in TLP phases
- Overall idea: identify the phase behavior and change the architecture on the fly to suit the need

Outline
- Motivation

within a Single Core
- Heterogeneous multicore systems, capable of achieving either high performance or energy efficiency, are quite prominent
  - They often migrate applications/phases to the specific core that favors them
- Issues with conventional heterogeneous systems
  - Slow migration requires large phases (100s of millions of instructions)
  - Switching is often coarse-grained, so fine-grained opportunities are lost
  - Switching and migration have significant performance overhead
- Proposed solution: a single-core microarchitecture that integrates big and little compute µengines
  - An online controller can map 25% of the code to the little µengine
  - Achieves 18% energy savings at a 5% performance loss

Conventional Heterogeneous CMP: ARM's big.LITTLE
- Incorporates two different kinds of cores on the same chip
  - big: Cortex-A15 (3-way OoO), deeply pipelined (15-25 stages)
  - LITTLE: Cortex-A7 (2-way in-order), short pipeline (8-10 stages)
- How do these fare against each other?
  - Performance: Cortex-A15 is 2-3x faster than Cortex-A7
  - Energy: Cortex-A7 is 3-4x more energy-efficient than Cortex-A15
- The two kinds of cores are used, through migration, when an appropriate phase arrives
  - Migration happens through coherent L2 caches and costs about 20 µs
  - Requires large phases to amortize the cost of slow migration
- Composite Cores: modify a single core to suit both needs

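To make the amortization argument concrete, here is a rough back-of-the-envelope sketch of how much of a phase is lost to a 20 µs migration. Only the 20 µs figure comes from the slide; the 1 GHz clock, the IPC of 1.0, and the function name `migration_overhead` are assumptions chosen for illustration.

```python
def migration_overhead(phase_insts, migration_us=20.0, freq_ghz=1.0, ipc=1.0):
    """Fraction of total time spent migrating, for one migration per phase.

    migration_us: migration cost in microseconds (about 20 us per the slide).
    freq_ghz, ipc: assumed clock frequency and instructions per cycle.
    """
    # Instructions retired per microsecond = ipc * cycles per microsecond.
    insts_per_us = ipc * freq_ghz * 1000.0
    phase_us = phase_insts / insts_per_us
    return migration_us / (migration_us + phase_us)


# Under these assumptions, a 20,000-instruction phase takes as long as the migration itself.
short_phase = migration_overhead(20_000)
# A 10-million-instruction phase hides the migration almost entirely.
long_phase = migration_overhead(10_000_000)
```

Under these assumed numbers, short phases are dominated by the migration cost, which is why fine-grained switching is impractical when every switch is a core-to-core migration.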
Fine-Grain Switching Interval
- A conventional heterogeneous CMP requires large phases
  - To amortize the cost of switching, typically a few million instructions
- The migration overhead precludes fine-grained switching in traditional heterogeneous core designs

Composite Cores: Architecture
- Each core consists of two tightly coupled compute µengines
- Achieves high performance and energy efficiency by switching between the µengines in response to changes in application performance
- Shared: front-end, branch predictor, data and instruction caches
- Extra component: a reactive online controller to perform switching
- Switching requires only the register file transfer and some stalling

Reactive Online Controller
- The online controller tries to maximize energy savings subject to a configurable maximum performance degradation, or slowdown
- Estimates dynamic performance loss using a linear model
- Switching happens when the loss is more than the acceptable threshold
- The performance estimator is the most crucial, complex, and tricky component, and it involves many approximations

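The switching rule can be sketched as a per-quantum decision. This is a deliberately simplified illustration, not the controller from the paper: it assumes the estimator supplies an IPC estimate for each µengine, and the names `choose_engine` and `max_slowdown` are hypothetical.

```python
def choose_engine(est_big_ipc, est_little_ipc, max_slowdown=0.05):
    """Pick the µengine for the next quantum.

    est_big_ipc, est_little_ipc: estimated performance of each µengine
    (assumed to come from the performance estimator).
    max_slowdown: configurable acceptable performance loss (5% by default).
    """
    # Estimated relative performance loss of running on the little µengine.
    loss = 1.0 - est_little_ipc / est_big_ipc
    # Stay on (or switch to) little only while the loss is acceptable.
    return "little" if loss <= max_slowdown else "big"


# A low-ILP quantum: little is nearly as fast, so use it to save energy.
low_ilp = choose_engine(est_big_ipc=2.0, est_little_ipc=1.95)
# A high-ILP quantum: little would be far too slow, so use big.
high_ilp = choose_engine(est_big_ipc=2.0, est_little_ipc=1.0)
```

A real controller would also have to track the accumulated slowdown over time and account for the switching cost; this sketch ignores both.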
Performance Estimator
- The goal of this module is to provide an estimate of the performance of both µengines in the previous quantum and overall
- Performance estimation of the non-active core is challenging
- Uses a linear performance estimation model: $y = a_0 + \sum_i a_i x_i$
- Various statistics are collected: L2 misses, ILP, L2 hits, MLP, etc.
- Uses ridge regression analysis to determine the coefficients

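As an illustration of how coefficients for a linear model of the form y = a_0 + sum of a_i x_i might be obtained, here is a minimal pure-Python ridge regression sketch. It is a textbook formulation, not the paper's implementation: the function names, the unpenalized intercept, and the regularization weight `lam` (assumed positive) are all assumptions.

```python
def ridge_fit(xs, ys, lam=1.0):
    """Fit y ~ a0 + sum(a[i] * x[i]) by ridge regression; requires lam > 0."""
    n, d = len(xs), len(xs[0])
    # Center features and targets so the intercept is not penalized.
    x_mean = [sum(row[j] for row in xs) / n for j in range(d)]
    y_mean = sum(ys) / n
    xc = [[row[j] - x_mean[j] for j in range(d)] for row in xs]
    yc = [y - y_mean for y in ys]
    # Augmented normal equations: (Xc^T Xc + lam * I) a = Xc^T yc.
    m = [[sum(r[i] * r[j] for r in xc) + (lam if i == j else 0.0)
          for j in range(d)]
         + [sum(r[i] * t for r, t in zip(xc, yc))]
         for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(d):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    a = [m[i][d] / m[i][i] for i in range(d)]
    a0 = y_mean - sum(ai * xm for ai, xm in zip(a, x_mean))
    return a0, a


def estimate(a0, a, x):
    """Evaluate the linear model for one vector of collected statistics."""
    return a0 + sum(ai * xi for ai, xi in zip(a, x))


# Made-up samples of two statistics per quantum and the observed performance.
samples = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
observed = [1, 3, 4, 6, 8]
a0, a = ridge_fit(samples, observed, lam=1e-9)
```

The penalty `lam` shrinks the coefficients toward zero, which trades a little accuracy on the training samples for more stable estimates when the collected statistics are correlated.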
Overall Energy Savings
- The implementable regression model saves about 18% energy
- The reduction in energy-delay product is 21%

Switching Impact on Performance
- Subject to a 5% slowdown, the acceptable margin in performance
- mcf is memory bound; the decrease in branch misprediction latency actually causes a small performance improvement

Little Core Utilization
- On average, about 25% of the code can be mapped to the little core
- Given oracle knowledge, about 37% of the code can be mapped
- Applications like mcf can be completely mapped to the little core

Average Little Core Power
- The little µengine consumes slightly more power than the little core because of over-provisioned shared resources

Performance Energy Sensitivity
- Allowing only a 1% slowdown saves up to 4% of the energy
- A 20% performance drop can save up to 44% of the energy
- A good feature to have where maintaining usability is essential
  - Low battery levels in laptops and cell phones

MorphCore: Motivation
- In general, industry builds two types of cores:
  - Large out-of-order cores: Intel's Sandy Bridge, IBM's POWER7
  - Small cores: Intel's Larrabee, Sun's Niagara, ARM's A15
- OoO cores provide high single-thread performance by exploiting ILP but are power inefficient for multi-threaded programs
- Key insight: a highly threaded in-order SMT core can achieve instruction issue throughput similar to an OoO core (Hily, Seznec)
- MorphCore is built on two key insights: the above observation, and that an in-order SMT core can be built using a subset of the OoO hardware
- MorphCore: start with a traditional OoO core and make minimal changes to transform it into a highly threaded in-order SMT core

In-order SMT vs Out-of-order Superscalar
- Hily & Seznec: a highly threaded in-order core can achieve throughput similar to an OoO core on multi-threaded apps (HPCA '99)
- In high-TLP applications, high performance and low energy consumption can be achieved with in-order SMT execution

MorphCore Microarchitecture
- Two modes of execution: OutOfOrder and InOrder
- Based on a traditional OoO core, and also supports
  - Additional in-order SMT threads
  - In-order scheduling, execution, and commit of simultaneously running threads

Details of Microarchitecture
- Fetch: two front-ends can be configured using hardware muxes
- InOrder SMT mode: 8 threads; OutOfOrder mode: 2 threads

Real Details: Too Specific
- Hardware muxes and reconfigurable logic transform the OoO core into an in-order SMT core
- Modified rename stage: details are too involved!

Wakeup and Selection Logic
- After all these modifications, they claim that only 2.5% of extra critical-path delay is added to the design
  - 2.5% slower frequency

MorphCore Mode Switching
- No switching overhead on the OS
  - Hardware does it itself
  - Not mentioned clearly (most of it is future work!)
- General idea: when the OS schedules more threads, you are in a parallel region, so enable in-order SMT
  - Threshold: >2 threads
  - When the number of active threads is 2, enable the OoO engine
- Assumes the thread library uses MONITOR/MWAIT instructions so that MorphCore hardware can detect a thread becoming inactive
- Claims that since no migration of instructions and data needs to happen on mode switches, the penalty is minimal
  - Pipeline flushing and stalling
  - Register and mux reconfiguration

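The mode-selection policy described above can be sketched as a simple threshold check on the number of active threads. The threshold of 2 comes from the slide; treating a single active thread the same as two, the function names, and the sample trace are assumptions made for illustration.

```python
OUT_OF_ORDER = "OutOfOrder"
IN_ORDER = "InOrder"
SMT_THRESHOLD = 2  # from the slide: more than 2 active threads -> in-order SMT


def select_mode(active_threads):
    """Return the execution mode for a given number of active threads."""
    # Assumption: 1 active thread is handled like 2 (stay in OutOfOrder mode).
    return IN_ORDER if active_threads > SMT_THRESHOLD else OUT_OF_ORDER


def count_mode_switches(active_thread_trace):
    """Count mode switches over a trace of active-thread counts.

    Each switch implies a pipeline flush, stall, and reconfiguration.
    Assumes the core starts in OutOfOrder mode.
    """
    mode = OUT_OF_ORDER
    switches = 0
    for n in active_thread_trace:
        new_mode = select_mode(n)
        if new_mode != mode:
            switches += 1
            mode = new_mode
    return switches


# Hypothetical trace: serial region, parallel region, then threads going idle
# (for example, after waiting via MONITOR/MWAIT).
trace = [1, 2, 4, 8, 8, 2, 1]
switches = count_mode_switches(trace)
```

In this trace the core enters InOrder mode once when the parallel region starts and returns to OutOfOrder mode once when the active-thread count drops back to 2.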
Performance Results
- ST apps: MorphCore achieves performance very close to OoO 2-way SMT
- MT apps: achieves performance close to 6-thread in-order SMT (SMALL)

Overall Speedup, Power and Energy
- Performance and energy combined: MorphCore does better than all other alternatives

Comparison with CoreFusion
- Opposite approach: instead of building a larger core from small cores (CoreFusion), MorphCore scales down the OoO design to implement a simple in-order SMT core

Other Metrics Compared to CoreFusion
- Reduces power by 19%, energy by 29%, and energy-delay-squared product by 29%

Both ideas are quite similar to each other
- Both proposals bring the notion of heterogeneity within a core
- Both designs try to leverage fine-grained phases at runtime
- They also try to reuse (share) as much hardware as possible
- Both designs also try to minimize the migration overhead
- Both designs require significant modifications to the core microarchitecture
- The savings/benefits are only a few percent
- Complexity is quite high for these new core designs