Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications. Yuan Chou, Architecture Technology Group, Microelectronics Division


Slide 1: Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications. Yuan Chou, Architecture Technology Group, Microelectronics Division

Slide 2: Motivation
- Performance of many commercial applications limited by processor stalls due to off-chip cache misses
- Applications characterized by irregular control-flow and complex data access patterns
- Software prefetching and simple stride-based hardware prefetching ineffective
- Hardware correlation prefetching more promising: it can remember complex recurring data access patterns
- Current correlation prefetchers have severe drawbacks, but we think we can overcome them

Slide 3: Talk Outline
- Traditional Correlation Prefetching
- Epoch-Based Correlation Prefetching
- Experimental Results
- Summary

Slide 4: Traditional Correlation Prefetching
- Basic idea: use the current miss address M to predict N future miss addresses F1...FN (where N = prefetch depth)
- Example miss address sequence: A B C D E F G H I (assume N=2): use A to prefetch B C, use D to prefetch E F, use G to prefetch H I (see the sketch below)
- Correlations are recorded in a correlation table
- Correlation table size is proportional to the application working set
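The following is a minimal software sketch of the pair-based correlation idea described above, assuming a dictionary-backed table that maps a miss address to the N miss addresses that followed it last time; it illustrates the concept only and is not the hardware design evaluated in the talk.

```python
# Sketch of traditional correlation prefetching: on each off-chip miss, the current
# miss address looks up the N miss addresses that followed it previously, and the
# table is updated so recurring miss sequences can be replayed as prefetches.
from collections import defaultdict, deque

class CorrelationPrefetcher:
    def __init__(self, depth=2):
        self.depth = depth                      # N = prefetch depth
        self.table = defaultdict(list)          # miss addr -> successor miss addrs
        self.recent = deque(maxlen=depth + 1)   # sliding window of recent misses

    def on_miss(self, addr):
        """Record correlations for this miss and return addresses to prefetch."""
        self.recent.append(addr)
        if len(self.recent) == self.recent.maxlen:
            # The oldest miss in the window is correlated with the next N misses.
            trigger, *successors = self.recent
            self.table[trigger] = successors
        return list(self.table.get(addr, []))   # prefetch candidates for this miss

pf = CorrelationPrefetcher(depth=2)
for a in "ABCDEFGHI" * 2:          # recurring global miss sequence from the slide
    pf.on_miss(a)
print(pf.on_miss("A"))             # -> ['B', 'C']: A predicts the misses that followed it
```

Note that the table needs an entry per recurring miss address, which is why its size tracks the application's working set, the drawback the next slide turns to.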

Slide 5: Correlation Prefetching Drawbacks
- Very large correlation tables needed for commercial apps: impractical to store on-chip
- No attempt to eliminate all naturally overlapped misses. In the miss address sequence A B C D E F G H I, misses C, D and E are naturally overlapped (they issue during the same off-chip access period), so prefetching only C may not improve performance
- Prefetches misses naturally overlapped with the current miss. Since A and B are naturally overlapped, prefetching B does not improve performance but wastes table storage

Slide 6: Epoch-Based Correlation Prefetching (EBCP)

Slide 7: Epoch MLP Model
- At high off-chip latencies, overlappable off-chip accesses appear to issue and complete together
- Program execution separates into recurring periods of on-chip computation followed by off-chip accesses; call each such period an epoch
- Group off-chip accesses by the epoch in which they issue. For the miss sequence A B C D E F G H I: epoch i contains A B, epoch i+1 contains C D E, epoch i+2 contains F G, epoch i+3 contains H I (a toy grouping sketch follows)
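As a toy illustration of the grouping above, the sketch below assumes we can observe each miss's issue and completion time; the hardware instead watches the outstanding off-chip miss count (slide 13). A miss opens a new epoch whenever no earlier miss is still outstanding, so misses that overlap in time fall into the same epoch. Times and addresses here are made up for the example.

```python
def group_into_epochs(misses):
    """misses: list of (addr, issue_time, complete_time) sorted by issue time."""
    epochs = []
    busy_until = 0                      # latest completion time of any issued miss
    for addr, issue, complete in misses:
        if issue >= busy_until:         # no off-chip miss outstanding any more...
            epochs.append([])           # ...so this miss opens a new epoch
        epochs[-1].append(addr)
        busy_until = max(busy_until, complete)
    return epochs

trace = [("A", 0, 500), ("B", 10, 510),                          # epoch i
         ("C", 700, 1200), ("D", 705, 1205), ("E", 710, 1210),   # epoch i+1
         ("F", 1400, 1900), ("G", 1410, 1910),                   # epoch i+2
         ("H", 2100, 2600), ("I", 2110, 2610)]                   # epoch i+3
print(group_into_epochs(trace))
# [['A', 'B'], ['C', 'D', 'E'], ['F', 'G'], ['H', 'I']]
```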

Slide 8: Epoch Model Insights
- Insight #1: target removal of entire epochs instead of individual misses
- Use the first miss in an epoch to prefetch all misses in the next 2 epochs. In the example, the first miss of epoch i prefetches C D E F G, removing epochs i+1 and i+2 entirely (2 epochs removed)

Slide 9: Epoch-Based Correlation Prefetcher
- No prefetching: 4 epochs, with misses A B / C D E / F G / H I
- Epoch-based correlation prefetching (EBCP): prefetches C D E F G, leaving 2 epochs with misses A B / H I
- Traditional correlation prefetching (depth=2): prefetches B C D F G, leaving 3 epochs with misses A B / E / H I
- EBCP achieves better epoch reduction

Slide 10: Epoch Model Insights
- Insight #2: hide the latency of the correlation table access under the previous epoch
- Use a miss in epoch i to prefetch all misses in epochs i+2 and i+3: use epoch i to read the correlation table, and use epoch i+1 to issue the prefetches (F G H I in the example)
- Results in removal of 2 epochs
- The correlation table can therefore be stored in main memory!

Slide 11: EBCP Advantages
- Traditional: store correlation table on-chip. EBCP: store correlation table in main memory (hide table access latency under the previous epoch)
- Traditional: no attempt to eliminate all naturally overlapped misses. EBCP: target removal of entire epochs
- Traditional: prefetch misses naturally overlapped with the current miss. EBCP: avoid prefetching these misses
- EBCP overcomes the drawbacks of traditional correlation prefetchers

Slide 12: EBCP Components
[Block diagram: two processor cores, each with L1-I and L1-D caches, a crossbar, four L2 banks, a Prefetch Control unit, and two memory controllers whose DRAM holds the correlation table]
- Prefetcher control observes all L2 cache requests
- L2 banks notify the prefetcher control which requests are misses

Slide 13: EBCP Prefetcher Control
- Request memory from the OS to store the correlation table
- Detect epochs: observe when the number of outstanding off-chip misses transitions from 0 to 1
- Learn correlations: record correlations in the main-memory correlation table
- Issue prefetches: use the first miss address in an epoch to look up the correlation table, select miss addresses from the correlation table entry, and issue prefetches (at lower priority than demand accesses)
- Return memory to the OS if needed
- EBCP is very simple and requires almost zero on-chip storage! (A software sketch of this control flow follows.)
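Below is a hedged software model of this control flow, not the actual hardware: epoch boundaries come from the 0-to-1 transition of the outstanding-miss count, the first miss of epoch i indexes a table entry holding the misses of epochs i+2 and i+3 (per Insight #2), and that entry's prefetches are issued at the start of epoch i+1. Entry format, table management, and the OS interface are omitted or simplified.

```python
from collections import defaultdict

class EpochPrefetcher:
    """Sketch of EBCP prefetcher control (illustrative, not the hardware design)."""

    def __init__(self):
        self.table = defaultdict(list)  # first miss of epoch i -> misses of epochs i+2, i+3
        self.epoch_heads = []           # first miss address of each epoch seen so far
        self.epoch_misses = []          # all miss addresses of each epoch seen so far
        self.outstanding = 0            # off-chip misses currently in flight

    def on_miss_issue(self, addr):
        """Called when an off-chip miss issues; returns addresses to prefetch."""
        prefetches = []
        if self.outstanding == 0:       # 0 -> 1 transition: a new epoch begins
            self.epoch_heads.append(addr)
            self.epoch_misses.append([])
            self._learn()
            # The entry for the previous epoch's head was (conceptually) read from
            # main memory during that epoch; issue its prefetches now, at lower
            # priority than demand accesses (Insight #2).
            if len(self.epoch_heads) >= 2:
                prefetches = list(self.table.get(self.epoch_heads[-2], []))
        self.outstanding += 1
        self.epoch_misses[-1].append(addr)
        return prefetches

    def on_miss_complete(self):
        """Called when an off-chip miss returns."""
        self.outstanding -= 1

    def _learn(self):
        # Once epochs i+2 and i+3 have completed, record their misses under the
        # first miss address of epoch i (targeting removal of entire epochs).
        if len(self.epoch_misses) >= 5:
            head = self.epoch_heads[-5]
            self.table[head] = self.epoch_misses[-3] + self.epoch_misses[-2]

# Tiny demo with the slide's trace; each epoch's misses complete before the next begins.
pf = EpochPrefetcher()
for epoch in (["A", "B"], ["C", "D", "E"], ["F", "G"], ["H", "I"]) * 2:
    for addr in epoch:
        pf.on_miss_issue(addr)
    for _ in epoch:
        pf.on_miss_complete()
print(pf.table["A"])   # -> ['F', 'G', 'H', 'I'] (misses of epochs i+2 and i+3)
```

Because an entry is looked up a full epoch before its prefetches are needed, the table read itself can tolerate DRAM latency, which is what lets the table live in main memory.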

Slide 14: Experimental Results

Slide 15: Baseline Processor Model
- Moderate out-of-order issue core: single thread, -wide issue, 6 entry issue queue, 12 entry reorder buffer
- KB -way L1 instruction and data caches; 2MB -way L2 cache; prefetches installed into a prefetch buffer
- Memory bandwidth model: 9.6 GB/s read bandwidth, GB/s write bandwidth, 500 cycle unloaded memory latency
- Commercial application benchmarks: OLTP, TPC-W, SPECjbb2005, SPECjAppServer200

Slide 16: Effects of Prefetch Degree
[Chart: % performance improvement vs. prefetch degree for OLTP, TPC-W, SPECjbb and SPECjAppServer, with an infinite correlation table]
- Performance improvement increases with prefetch degree

Slide 17: Coverage vs Accuracy
[Charts: prefetch coverage (%) and prefetch accuracy (%) vs. prefetch degree for OLTP, TPC-W, SPECjbb and SPECjAppServer]
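The slide does not spell out the two metrics, so the sketch below uses the usual definitions as an assumption: coverage is the fraction of baseline off-chip misses removed by prefetching, and accuracy is the fraction of issued prefetches that prove useful.

```python
def coverage(useful_prefetches: int, remaining_misses: int) -> float:
    """Share of would-be misses that prefetching eliminated."""
    return useful_prefetches / (useful_prefetches + remaining_misses)

def accuracy(useful_prefetches: int, issued_prefetches: int) -> float:
    """Share of issued prefetches that were actually used."""
    return useful_prefetches / issued_prefetches

# Example: 300 useful prefetches, 700 misses remain, 1000 prefetches issued.
print(coverage(300, 700), accuracy(300, 1000))   # 0.3 0.3
```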

Slide 18: Memory Bandwidth Sensitivity
[Chart: % performance improvement vs. prefetch degree for OLTP, TPC-W, SPECjbb and SPECjAppServer at read bandwidths of 3.2, 6.4 and 9.6 GB/s]
- Optimal prefetch degree depends on available memory bandwidth

Slide 19: Correlation Table Size
[Chart: % performance improvement vs. number of predictor table entries (64K to 8M) for each benchmark and prefetch degree]
- Storing the table in main memory makes such large sizes practical
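A rough sizing sketch of why tables with this many entries belong in main memory rather than on-chip SRAM; the entry format, address width, and tag size below are illustrative assumptions, not the paper's layout.

```python
def table_bytes(entries, addrs_per_entry, bytes_per_addr=4, tag_bytes=4):
    """Rough storage for a correlation table whose entries keep a tag plus a
    list of correlated miss addresses (all parameters are assumptions)."""
    return entries * (tag_bytes + addrs_per_entry * bytes_per_addr)

for entries in (256 * 1024, 1 << 20, 8 << 20):
    mb = table_bytes(entries, addrs_per_entry=8) / (1 << 20)
    print(f"{entries:>8} entries -> ~{mb:.0f} MB")
# 256K entries -> ~9 MB, 1M -> ~36 MB, 8M -> ~288 MB: far too large for on-chip
# SRAM, but a small fraction of main memory capacity.
```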

Slide 20: Comparison with Other Prefetchers
- Global History Buffer G/AC (GHB): address correlation, unique table storage (small: 256KB, large: MB); sketched below
- Tag Correlating Prefetcher (TCP): tag correlation (small: 256KB, large: MB)
- Stream: traditional stride-based stream prefetcher
- Spatial Memory Streaming (SMS): spatial locality within a region (12KB)
- Solihin: memory-side address correlation prefetcher (6MB)
- Prefetch degree = 6 for all prefetchers (except SMS)
- Prefetches brought into a 6 entry prefetch buffer
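For context, here is a simplified sketch of the GHB G/AC idea referenced in the first bullet: a FIFO of recent global miss addresses plus an index table that chains prior occurrences of the same address, so a miss can prefetch the addresses that followed it before. Buffer size, degree, and replacement details are placeholders rather than the evaluated small and large configurations.

```python
class GHBPrefetcher:
    def __init__(self, ghb_size=256, degree=4):
        self.ghb = []         # entries of (miss_addr, link to previous occurrence)
        self.ghb_size = ghb_size
        self.degree = degree
        self.index = {}       # miss_addr -> position of its most recent GHB entry

    def on_miss(self, addr):
        prefetches = []
        pos = self.index.get(addr)
        # Walk the chain of earlier occurrences of this address (newest first)
        # and collect the addresses that followed each occurrence.
        while pos is not None and len(prefetches) < self.degree:
            for follower, _ in self.ghb[pos + 1: pos + 1 + self.degree]:
                if follower not in prefetches:
                    prefetches.append(follower)
                if len(prefetches) >= self.degree:
                    break
            pos = self.ghb[pos][1]
        # Record this miss: link it to its previous occurrence and update the index.
        self.ghb.append((addr, self.index.get(addr)))
        self.index[addr] = len(self.ghb) - 1
        # A real GHB is a fixed-size circular buffer (self.ghb_size entries) and
        # invalidates stale index/link pointers on wrap; omitted here for brevity.
        return prefetches

# Example: a recurring global miss stream teaches the GHB that A is followed by B, C.
ghb = GHBPrefetcher(degree=2)
for a in "ABCDEFGHI" * 2:
    ghb.on_miss(a)
print(ghb.on_miss("A"))   # -> ['B', 'C']
```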

Slide 21: Comparison with Other Prefetchers
[Chart: % performance improvement on OLTP, TPC-W, SPECjbb and SPECjAppServer for GHB small, GHB large, TCP small, TCP large, Stream, SMS, Solihin 3,2, Solihin 6,1, EBCP minus, and EBCP]
- EBCP outperforms all other prefetchers for all four benchmarks

Slide 22: Summary
- EBCP successfully overcomes the drawbacks of traditional correlation prefetchers:
  - stores the large correlation table in main memory, exploiting unused memory capacity and bandwidth
  - targets removal of entire epochs
  - very simple prefetcher control with almost zero on-chip storage
- EBCP performs very well on all four commercial benchmarks
- Future work: efficient implementation for chip multiprocessors, improved accuracy
- The epoch-based concept can be applied to other microarchitecture techniques!

Slide 23: Yuan Chou

Backup: Prefetch Buffer Size
[Chart: % performance improvement vs. prefetch buffer entries, with a 1 million entry table, for each benchmark and prefetch degree]
- 6 entries sufficient for all four benchmarks
