Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications

Transcription

1 Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications. Lawrence Spracklen, Yuan Chou & Santosh G. Abraham. International Symposium on High-Performance Computer Architecture, Feb 15th 2005. Lawrence Spracklen, Advanced Processor Architecture, Sun Microsystems

2 Outline: Motivation; Limit study; The discontinuity prefetcher; Results; Conclusions

3 Motivation: Cache-missing memory accesses frequently dictate application performance. We typically think in terms of data misses, which dominate for SPEC CPU2000 benchmarks. Commercial applications have large instruction working sets, exemplified by databases, application servers, and web servers. We investigate the performance implications for: a database workload, TPC-W, SPECjAppServer2002, and SPECweb99

4 Instruction Miss Rates: Commercial applications observe significant stalls due to both L1 cache and L2 cache instruction misses. Instruction misses can be more problematic than either load or store misses. [Figure: miss rate (per 100 instructions) for DB, TPC-W, japp, and Web; panels: 32KB 4-way L1$ and 2MB 4-way L2$]

5 Chip Multiprocessors: Recent years witnessed a significant paradigm shift and the emergence of Chip Multiprocessors (CMPs). CMPs are exemplified by multiple cores on a single chip. Advanced CMP cores typically have private Level-1 caches (L1$) and share a modest Level-2 cache (L2$). We investigate how CMPs impact commercial workloads, using a next-generation 4-core CMP with a 2MB shared L2$

6 CMP Instruction Miss Rates: L1$ miss rates are identical, as cores have private I$s. The L2$ is shared between 4 cores, so there are fewer cache resources per core and applications experience more frequent cache misses. HW prefetchers are even more important! [Figure: L2$ miss rate (per 100 instructions), single core vs. 4-core CMP, 2MB 4-way L2$, for DB, TPC-W, japp, Web, and Mix]

7 Miss Classification: Few processors provide support for instruction prefetching, and those that do typically focus on sequential prefetchers: given an access/miss for line L, prefetch line L+1. Irregular control flow in commercial apps means many misses are non-sequential. [Figure: miss breakdown, sequential vs. non-sequential, for DB, TPC-W, japp, Web (32KB 4-way L1$) and DB, TPC-W, japp, Web, Mixed (2MB 4-way L2$)]

8 Non-Sequential Misses: Sequential prefetchers fail to capture up to almost 60% of instruction misses in commercial applications. Non-sequential misses are not attributable to a single cause; they are caused by a variety of control transfer instructions (CTIs). [Figure: non-sequential miss breakdown by CTI type (Trap, Return, Jump, Call, Uncond branch, Cond branch (nt), Cond branch (tb), Cond branch (tf)) for DB, TPC-W, japp, Web (32KB 4-way L1$) and DB, TPC-W, japp, Web, Mixed (2MB 4-way L2$)]

9 Potential Performance Improvements: Failure to address non-sequential instruction misses sacrifices significant performance. The performance gain can be doubled by also targeting non-sequential misses. Non-sequential prefetchers must target all main CTI groups; many prior instruction prefetchers only capture a subset of misses. [Figure: potential performance improvement (X) for Sequential only, Branch only, Function only, Sequential + Branch, Sequential + Function, and Sequential + Branch + Function; panels: single core (DB, TPC-W, japp, Web) and 4-way CMP (DB, TPC-W, japp, Web, Mixed)]

10 Introducing Discontinuities: When a CTI causes a transition to a non-sequential cache line, it causes a 'discontinuity' in the instruction fetch stream. Next-line prefetchers can't capture these transitions. Example fetch stream: L, L+1, L+2, L+4, L+20, L+21, L+23. Misses with no prefetch: 7. Misses with next-line (on miss): 5. Next-line (tagged) L L+1 L+2 L+4 L+20 L+21 L

11 Capturing Discontinuities: Next-N-line sequential prefetchers capture short forward discontinuities, where the target lies within the prefetch-ahead distance. Example fetch stream: L, L+1, L+2, L+4, L+20, L+21, L+23. Misses with next-4-line (tagged): 2. Next-N-line sequential prefetchers represent a simple, low-cost mechanism to prefetch for small discontinuities. An elegant, CTI-independent method for capturing the remaining discontinuities is required. We propose the discontinuity prefetcher
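
The miss counts quoted for the example stream on slides 10 and 11 can be checked with a small model. The sketch below is only an illustration, not the simulator used in the talk: it assumes an unbounded cache that starts empty, prefetches that arrive instantly, and a hypothetical helper name count_misses.

```python
def count_misses(stream, degree=0, tagged=False):
    """Count demand misses for a next-N-line sequential prefetcher.

    degree=0 disables prefetching. With tagged=False, prefetches are issued
    only on a miss; with tagged=True they are also issued on the first hit
    to a previously prefetched line.
    Assumptions: unbounded cache, cold start, zero prefetch latency.
    """
    cache = set()            # lines currently resident
    unused_prefetch = set()  # prefetched lines that have not been used yet
    misses = 0
    for line in stream:
        miss = line not in cache
        first_use = line in unused_prefetch
        unused_prefetch.discard(line)
        if miss:
            misses += 1
            cache.add(line)
        if degree and (miss or (tagged and first_use)):
            for nxt in range(line + 1, line + degree + 1):
                if nxt not in cache:
                    cache.add(nxt)
                    unused_prefetch.add(nxt)
    return misses


L = 1000  # arbitrary base line number
stream = [L + k for k in (0, 1, 2, 4, 20, 21, 23)]

print("no prefetch:         ", count_misses(stream))
print("next-line (on miss): ", count_misses(stream, degree=1))
print("next-line (tagged):  ", count_misses(stream, degree=1, tagged=True))
print("next-4-line (tagged):", count_misses(stream, degree=4, tagged=True))
```

Under these assumptions the model gives 7, 5, and 2 misses for no prefetch, next-line (on miss), and next-4-line (tagged), matching the slides. For next-line (tagged) it gives 4; the slide's own figure for that case is cut off in this transcript, so that value is the model's result only.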

12 The Discontinuity Prefetcher: The discontinuity prefetcher utilizes a history-based predictor to track discontinuities that incur an L1$ miss. It only needs to track large discontinuities that aren't covered by the next-N-line sequential prefetcher, which significantly reduces the size of the required predictor (predictor only needs to cover large discontinuities => small predictor). Example fetch stream: L, L+1, L+2, L+4, L+20, L+21, L+23. Misses with discontinuity + next-4-line (tagged): 0. Small forward discontinuities are covered by the next-4-line sequential prefetcher

13 Prefetcher Implementation: The predictor is implemented as a direct-mapped table, indexed by a portion of the address of the trigger. Each entry is tagged with a portion of the address of the trigger. Only one target is required per entry. [Diagram: request/miss info from the core feeds the next-4-line prefetcher and the predictor table (Tag, Target); both feed the prefetch queue, which issues to the L2$]
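
A direct-mapped, tagged, one-target-per-entry table of this kind can be sketched in a few lines. The slide only says that "a portion of the address" is used for the index and for the tag; the split below (low-order line-address bits as index, all remaining bits as tag) and the overwrite-on-conflict policy are assumptions for illustration, and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    tag: int     # upper bits of the trigger line address (assumed split)
    target: int  # line address the fetch stream jumped to


class DiscontinuityTable:
    """Direct-mapped predictor table: one entry per index, one target per entry."""

    def __init__(self, entries: int = 8192):
        # A power-of-two size lets the index be a simple bit mask.
        assert entries > 0 and entries & (entries - 1) == 0
        self.index_bits = entries.bit_length() - 1
        self.slots: List[Optional[Entry]] = [None] * entries

    def _split(self, trigger_line: int):
        index = trigger_line & ((1 << self.index_bits) - 1)
        tag = trigger_line >> self.index_bits
        return index, tag

    def insert(self, trigger_line: int, target_line: int) -> None:
        # Assumed policy: a new discontinuity overwrites whatever shares the index.
        index, tag = self._split(trigger_line)
        self.slots[index] = Entry(tag, target_line)

    def lookup(self, trigger_line: int) -> Optional[int]:
        index, tag = self._split(trigger_line)
        entry = self.slots[index]
        if entry is not None and entry.tag == tag:
            return entry.target
        return None  # empty slot, or the slot belongs to a different trigger


table = DiscontinuityTable(entries=8)
table.insert(4, 20)       # fetch jumped from line 4 to line 20
print(table.lookup(4))    # target recorded for trigger line 4
print(table.lookup(12))   # same index as 4 but a different tag: no prediction
```

The tag check is what keeps two triggers that alias to the same index from producing a prefetch for the wrong target; real hardware would likely store only a partial tag to save space.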

14 Prefetcher Operation, Allocation: When a discontinuity causes a miss, it is inserted into the table. [Diagram: fetch stream L-128, L, L+1, L+2, L+4, L+20, L+21, L+23; prefetch queue; predictor table (Tag, Target)]

16 Prefetcher Operation, Prediction: The predictor is probed by the sequential prefetcher moving ahead of the demand fetch stream. If a valid entry is located, a prefetch is issued for the potential target. Prefetches are also issued for sequential lines following the target (up to N). [Diagram: fetch stream L-128, L, L+1, L+2, L+4, L+20, L+21, L+23; prefetch queue; predictor table (Tag, Target)]
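
Allocation and prediction can be put together in a toy replay of the example stream. This is a sketch under stated assumptions, not the mechanism as implemented in the talk: the cache is unbounded and cold at the start of each pass, prefetches arrive instantly, a plain dict stands in for the direct-mapped table, and every line the sequential prefetcher covers (including the triggering line itself) probes the predictor. The function name run_pass is hypothetical.

```python
def run_pass(stream, table, degree=4):
    """Replay `stream` from a cold cache and return the lines that missed.

    `table` maps trigger line -> target line. It is a plain dict here, so
    index conflicts in a real direct-mapped table are not modeled.
    """
    cache, unused_prefetch = set(), set()
    missed, prev = [], None

    def prefetch(line):
        if line not in cache:
            cache.add(line)
            unused_prefetch.add(line)

    for line in stream:
        miss = line not in cache
        first_use = line in unused_prefetch
        unused_prefetch.discard(line)
        if miss:
            missed.append(line)
            cache.add(line)
            # Allocation: record a non-sequential transition that missed.
            if prev is not None and line != prev + 1:
                table[prev] = line
        if miss or first_use:
            # Tagged next-N-line prefetch; each covered line probes the table.
            for probe in range(line, line + degree + 1):
                prefetch(probe)  # no-op for the demand line itself
                target = table.get(probe)
                if target is not None:
                    # Prediction: fetch the target and the N lines after it.
                    for t in range(target, target + degree + 1):
                        prefetch(t)
        prev = line
    return missed


L = 1000  # arbitrary base line number
stream = [L + k for k in (-128, 0, 1, 2, 4, 20, 21, 23)]
table = {}
print("training pass misses:", run_pass(stream, table))
print("learned entries:     ", table)
print("second pass misses:  ", run_pass(stream, table))
```

In this model the training pass misses on L-128, L, and L+20 and records the two large discontinuities (L-128 to L, and L+4 to L+20). The small jumps L+2 to L+4 and L+21 to L+23 never miss, so they are never allocated. On the second pass only the initial L-128 access misses, which is consistent with the zero misses shown for L through L+23 on slide 12.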

18 Methodology, Processor overview. Processor: 4-core CMP; 64-entry issue window; 3-wide issue; 64K gshare predictor. Memory hierarchy: 32KB 4-way 64B I$ and D$ (per core); 2MB 4-way 64B L2$ (shared between cores); 400-cycle memory latency; 20GB/s off-chip BW. Discontinuity prefetcher: 8192-entry direct-mapped table (per core); next-4-line sequential prefetcher. Compare the discontinuity prefetcher to: Next-line (on miss): if line L is a miss, prefetch line L+1. Next-line (tagged): if line L is a miss or a previously prefetched line, prefetch line L+1. Next-4-line (tagged): if line L is a miss or a previously prefetched line, prefetch lines L+1, L+2, L+3 and L+4

19 Miss Coverage, single core: Achieve a significant reduction in both the I$ miss rate and the L2$ instruction miss rate. 90% of L1$ misses and 85% of L2$ misses are eliminated for the database workload. The discontinuity prefetcher outperforms the sequential prefetchers. [Figure: miss rate (normalized to no prefetch) for Next-line (on miss), Next-line (tagged), Next-4-line (tagged), and Discontinuity; panels: 32KB 4-way L1$ and 2MB 4-way L2$, each for DB, TPC-W, japp, Web]

20 Miss Coverage, CMP: CMP L1$ miss reduction is identical to the reductions achieved for a single core, since cores have private L1$s. L2$ miss rate reductions are similar to the single-core reductions. Also manage to eliminate 82% of L2$ instruction misses for the mixed workload. [Figure: miss rate (normalized to no prefetch) for Next-line (on miss), Next-line (tagged), Next-4-line (tagged), and Discontinuity; 2MB 4-way L2$; DB, TPC-W, japp, Web, Mixed]

21 Performance Improvements: The instruction prefetchers provide significant performance benefits. Higher performance benefits are observed for the CMP. Given the significant reduction in miss rates, greater performance improvements seemed likely. [Figure: performance improvement (X) for Next-line (on miss), Next-line (tagged), Next-4-line (tagged), and Discontinuity; panels: single core (DB, TPC-W, japp, Web) and 4-way CMP (DB, TPC-W, japp, Web, Mixed)]

22 Data Miss Rates: L2$ data miss rates increase significantly when aggressive instruction prefetching is enabled. The increase in data misses offsets the benefits from the reduction in instruction misses. [Figure: L2$ data miss rate (normalized to no prefetch) for Next-line (on miss), Next-line (tagged), Next-4-line (tagged), and Discontinuity; panels: single core (DB, TPC-W, japp, Web) and 4-way CMP (DB, TPC-W, japp, Web, Mixed)]

23 L2-Bypass Prefetching: Introduce L2-bypass prefetching. Prefetches are initially only installed in the L1$. If the line is utilized during its residence in the L1$, then on eviction the line is installed in the L2$. This eliminates L2$ pollution by instruction prefetchers. Observe the full performance benefits of the instruction prefetchers (up to 1.38X). [Figure: performance improvement (X) for Next-line (on miss), Next-line (tagged), Next-4-line (tagged), and Discontinuity; panels: single core (DB, TPC-W, japp, Web) and 4-way CMP (DB, TPC-W, japp, Web, Mixed)]
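
The install-on-useful-eviction rule can be illustrated with a toy cache model. Everything beyond that rule is an assumption made for the sketch: a tiny fully associative LRU L1, an L2 modeled as a plain set with no capacity limit, demand misses installing in the L2 as usual, and the hypothetical class name BypassL1.

```python
from collections import OrderedDict


class BypassL1:
    """Toy fully associative LRU L1 that models L2-bypass prefetch installation."""

    def __init__(self, capacity, l2):
        self.capacity = capacity
        self.l2 = l2                # set of line addresses resident in the L2
        self.lines = OrderedDict()  # line -> True once a demand fetch has used it

    def _fill(self, line, used):
        if line in self.lines:
            self.lines[line] = self.lines[line] or used
            self.lines.move_to_end(line)  # mark as most recently used
            return
        if len(self.lines) >= self.capacity:
            victim, was_used = self.lines.popitem(last=False)  # evict the LRU line
            if was_used:
                self.l2.add(victim)  # only lines that proved useful reach the L2
        self.lines[line] = used

    def prefetch(self, line):
        # Bypass: the prefetched line goes into the L1 only.
        self._fill(line, used=False)

    def demand(self, line):
        if line not in self.lines:
            self.l2.add(line)  # assumption: demand misses install in the L2 as usual
        self._fill(line, used=True)


l2 = set()
l1 = BypassL1(capacity=2, l2=l2)
l1.prefetch(10)
l1.prefetch(11)
l1.demand(10)     # prefetched line 10 is used while resident in the L1
l1.prefetch(12)   # evicts line 11, which was never used: the L2 is untouched
l1.prefetch(13)   # evicts line 10, which was used: it is installed in the L2
print(sorted(l2))
```

In this run only line 10 ends up in the L2. The useless prefetch of line 11 never displaces anything there, which is the pollution the slide says the scheme eliminates.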

24 Low Cost? A small discontinuity prefetcher still provides appreciable performance increases. Very little additional HW cost for the smaller predictors, yet they achieve significant performance gains over sequential prefetchers. [Figure: miss coverage for 8192-, 4096-, 2048-, 1024-, 512-, and 256-entry predictors vs. Next-4-lines (tagged); panels: 32KB 4-way L1$ and 2MB 4-way L2$ (CMP), each for DB, TPC-W, japp, Web, Mixed]

25 Related Art: Significant prior work on instruction prefetching in addition to the next-line and next-N-line sequential prefetchers: Target prefetching [Hsu, Smith]; Markov prefetching [Joseph, Grunwald]; Branch-history guided prefetching [Tyson, Charney, Srinivasan, Davidson]; Call-graph prefetching [Davidson, Annavaram, Patel]; Fetch directed prefetching [Calder, Reinman, Austin]; Wrong-path prefetching [Pierce, Mudge]. Benefits and drawbacks of these alternative schemes are discussed in more detail in the paper

26 Concluding Remarks: Modern commercial applications have high instruction miss rates at both the L1 and L2 levels. Effective instruction prefetching is imperative to mitigate the performance losses due to these misses. It is necessary to target all types of instruction misses: sequential misses AND non-sequential misses caused by control transfer instructions. Propose the discontinuity prefetcher, which reduces the miss rate by ~90%. Need to consider the pollution effects of aggressive prefetchers (especially in CMPs). Accelerate commercial apps by up to 38% using the discontinuity prefetcher and selective L2$ installation

27 Questions?

28 Prefetch Accuracy: Lower for the more aggressive instruction prefetchers. Accuracy of the discontinuity prefetcher is comparable with the next-4-lines sequential prefetcher, yet the discontinuity prefetcher achieves superior performance. The 2-line discontinuity prefetcher outperforms next-4-lines and has 50% higher accuracy (BW-constrained systems). [Figure: prefetch accuracy for Next-line (on miss), Next-line (tagged), Next-4-line (tagged), Discontinuity, and Discontinuity (2NL); two 4-way CMP panels, each for DB, TPC-W, japp, Web, Mixed]

29 Prefetching for CMPs: Implications for prefetching? Resources per core decrease. Potential for inter-strand pollution increases. Chip real-estate available to support HW prefetchers decreases. May require multiple L1$ prefetchers per chip. HW prefetchers need to be effective, accurate and low-cost