ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors



Jeffrey Dean, James E. Hicks, Carl A. Waldspurger, William E. Weihl, George Chrysos
Digital Equipment Corporation

Abstract

Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor's performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing information about useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software techniques to extract useful profile information from the hardware, and explain several ways in which this information can provide valuable feedback for programmers and optimizers.

1 Introduction

Processors are getting faster, yet application performance is not keeping pace. On large commercial applications, average cycles-per-instruction (CPI) values may be as high as 2.5 or 3. With 4-way instruction issue, a CPI of 3 means that only one issue slot in every 12 is being put to good use! It is common to blame such problems on poor memory performance, and in fact most applications spend many cycles waiting for memory, but other problems, such as branch mispredictions, also waste cycles. To improve the performance of a particular application, we need to know which instructions are stalling and why.
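The issue-slot arithmetic in the paragraph above can be made explicit; the following is a toy calculation written for this rewrite, not part of ProfileMe itself:

```python
# Fraction of issue slots doing useful work, given an average CPI and
# the machine's issue width. A 4-way machine offers 4 issue slots per
# cycle, and at a CPI of 3 each retired instruction consumes 3 cycles'
# worth of slots on average, so utilization is 1 / (3 * 4) = 1/12.

def issue_slot_utilization(cpi: float, issue_width: int) -> float:
    return 1.0 / (cpi * issue_width)

print(issue_slot_utilization(3, 4))  # 0.0833... = one slot in twelve
```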
In this paper, we describe hardware and software support for a sampling-based profiling system that provides detailed instruction-level information on processors that can execute instructions speculatively and out of order. Our approach, called ProfileMe, consists of two parts: an instruction sampling technique, which captures information about individual instructions (e.g., cache miss rates for each instruction), and a paired sampling technique, which captures information about the interactions among instructions (e.g., concurrency levels). ProfileMe has several key advantages over previous techniques such as hardware event counters: (1) it collects complete information about each instruction, rather than sampling a small number of events at a time; (2) it accurately attributes events to instructions; (3) it collects information about all instructions, including instructions in uninterruptible sections of code; and (4) it collects information about useful concurrency, thus helping to pinpoint real bottlenecks.

[Author footnote: All of the authors can be reached at DIGITAL. Dean is at the Western Research Laboratory (jdean@pa.dec.com), Hicks is at the Cambridge Research Laboratory (jamey@crl.dec.com), Waldspurger and Weihl are at the Systems Research Center ({caw,weihl}@pa.dec.com), and Chrysos is with the Advanced Development group of Digital Semiconductor (chrysos@vssad.hlo.dec.com). More information about profiling research at DIGITAL can be found on the Web at http://www.research.digital.com/src/dcpi/.]

[Copyright 1997 IEEE. Published in the Proceedings of Micro-30, December 1-3, 1997 in Research Triangle Park, North Carolina. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.]
Sampling has a number of advantages over other profiling methods, such as simulation or instrumentation: it works on unmodified programs, it profiles complete systems, and it can have very low overhead. Indeed, recent work [2] has shown that low-overhead sampling-based profiling can reveal detailed instruction-level information about pipeline stalls and their causes, and that this sort of information is extremely helpful in diagnosing and fixing performance problems; but that work is limited to in-order processors, and its techniques do not extend to out-of-order processors.

Most modern microprocessors, including the Alpha 21164 [8], Pentium Pro [11], and R10000 [14], provide performance counters that count a variety of events (e.g., branch mispredicts or data cache misses) and deliver an interrupt when the counters overflow. Event counters provide useful aggregate information, such as the total number of branch mispredicts during a program run. However, as discussed in Section 2.2, they do not give accurate information about individual instructions, such as the mispredict rate for a single branch.

ProfileMe is a departure from traditional performance counters. Rather than counting events and sampling the program counter when the event counters overflow, we sample instructions. At random intervals, we select an instruction; as it executes, we record information about its execution in internal registers. Information recorded includes the instruction's PC, the number of cycles spent in each pipeline stage, whether it suffered I-cache or D-cache misses, the effective address of a memory operand or branch target, and whether it retired or why it aborted. After the instruction completes, we generate an interrupt and

deliver the recorded information to software.

Our core instruction sampling technique captures detailed information about a single instruction, and is useful for identifying instructions that remain in the pipeline for a long time. On an in-order machine, this information is sufficient to identify bottlenecks. However, on an out-of-order machine, the concurrency provided by executing instructions out of order masks some stalls. To identify real bottlenecks, instruction-level information must be combined with information about useful concurrency (e.g., while a given instruction is in flight, how many issue slots are used by instructions that ultimately retire). We use paired sampling, a nested form of sampling, to measure useful concurrency: for each profiled instruction, the instructions that may execute concurrently with it are also randomly sampled, forming a sample pair. Paired sampling exposes the interactions among instructions, enabling a wide variety of interesting concurrency and utilization metrics to be computed.

The remainder of this paper describes ProfileMe in more detail. Section 2 explains why performance on out-of-order processors is hard to understand and why event counters are insufficient. Section 3 presents an overview of ProfileMe. Section 4 describes its hardware requirements, while Section 5 discusses how profiling software can collect profiles from this hardware and analyze them to extract useful information. Section 6 discusses alternative metrics for identifying bottlenecks. Section 7 discusses optimizations based on the information produced by ProfileMe. Related work is examined in Section 8. Finally, we summarize our conclusions in Section 9.

2 Problem

The behavior of programs run on out-of-order processors can be subtle and difficult to understand. To motivate our profiling mechanism, we begin this section by reviewing the flow of instructions in out-of-order processors. Using the Alpha 21264 processor as a concrete example, we discuss the myriad ways in which instructions may be delayed.
We then demonstrate the problems with using event counters to understand the performance of programs executed on processors with out-of-order and speculative execution.

2.1 A Superscalar Out-of-Order Architecture

An out-of-order execution processor fetches and retires instructions in order, but may execute them out of order (subject to data dependences). Figure 1 depicts the pipeline of the Alpha 21264 processor [12]. Each cycle, the first stage of the pipeline fetches and decodes a group of instructions from the instruction cache starting at the current PC. Because it takes multiple cycles to resolve the PC of the next instruction to fetch, the current PC is predicted by a branch or jump predictor. If the prediction is incorrect, the processor will abort the mispredicted instructions (the bad path) and will restart fetching instructions on the good path. Because of this use of PC prediction, we refer to the instruction stream followed by the fetcher as the predicted control path.

[Figure 1: Alpha 21264 Processor Pipeline. Stages 0-6: fetch, map, queue, register read, execute, D-cache.]

The decoder determines which instructions in the fetched group are part of the instruction stream. When a block of instructions is fetched from the I-cache, some of the instructions may not be on the predicted control path due to branches or jumps into or out of the middle of the fetch block. To support out-of-order execution, registers are renamed to prevent write-after-read and write-after-write conflicts. This renaming is accomplished by mapping architectural to physical registers.
Thus two instructions that write the same architectural register can safely execute out of order because they will write different physical registers, and consumers of those architectural registers will get the proper values. Instructions are fetched and mapped in order along the predicted path. A mapped instruction resides in the issue queue until its operands have been computed and a functional unit of the appropriate type is available. After an instruction has executed, it is marked as ready to retire and is retired by the processor when all previous ready-to-retire instructions in program order have been retired (and earlier predicted branches have been confirmed). Upon retirement, the processor commits the instruction's changes to the architectural state and releases resources used by the instruction. In some cases, such as when a branch is mispredicted, instructions must be trapped or discarded. When this occurs, the speculative architectural state is rolled back and fetching continues after the most recent untrapped instruction (i.e., the actual branch target).

Numerous events may delay the execution of an instruction. In the front of the pipeline, the fetcher may stall due to an I-cache miss or may fetch bad-path instructions due to a misprediction. The mapper may stall due to a lack of free physical registers or free slots in the issue queue. Instructions in the issue queue may wait for their register dependences to be satisfied or for the availability of functional units. Instructions may stall due to data cache misses. Instructions may trap because they were speculatively issued down a bad path, or because the processor took an interrupt. Many of these events are difficult to predict statically, and

all of them can degrade performance. At the same time, the ability of an out-of-order processor to issue a later instruction while an earlier instruction is stalled can mean that some delays are hidden. Our approach identifies which instructions are delayed and how that affects the running time of a program, enabling programmers or optimization tools to improve performance.

2.2 Event Counter Limitations

As mentioned earlier, many existing processors provide event counters to help measure the performance of programs. Unfortunately, event counters do not accurately attribute events to instructions: the instruction that caused an event resulting in an event-counter overflow is usually earlier, by an unpredictable amount, than the instruction whose PC is delivered to the interrupt handler. Out-of-order and speculative execution amplify this problem, but it is present even on in-order machines. Figure 2 compares the program counter value delivered to the performance counter interrupt handler when monitoring D-cache-reference events on the in-order Alpha 21164 and on the out-of-order Pentium Pro. The example program, shown on the left side of the figure, consists of a loop containing a single (cache hit) memory read instruction, followed by hundreds of instructions. On the in-order Alpha, almost all performance counter events are attributed to the instruction that is executing six cycles after the event, resulting in a large peak of samples on the seventh instruction after the load. This skewed distribution is not ideal, but static analysis can sometimes work backwards from the single large peak to identify the instruction that caused the event. On the Pentium Pro, the event samples are widely distributed over the next 25 instructions. This smeared distribution of samples makes it nearly impossible to attribute an event to the instruction that caused it. Similar behavior occurs when counting other hardware events. This problem is also not specific to the Pentium Pro: we have observed similar behavior with the MIPS R10000's hardware event counters [14].
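The attribution problem described above can be illustrated with a toy model; the delays and counts below are hypothetical, chosen only to mimic the fixed skew versus smeared skid that the text reports:

```python
import random
from collections import Counter

# Toy model of event-counter skid. Every event is caused by the
# instruction at PC 0; the interrupt handler sees the PC executing some
# number of instructions later. A fixed delay (in-order case) yields a
# single sharp peak; a variable delay (out-of-order case) smears the
# samples across many PCs, hiding the true cause.

def attributed_pcs(n_events, delay_fn):
    """Histogram of PCs the interrupt handler would report."""
    return Counter(0 + delay_fn() for _ in range(n_events))

random.seed(0)
in_order = attributed_pcs(10_000, lambda: 7)                      # fixed skew
out_of_order = attributed_pcs(10_000, lambda: random.randint(0, 25))

print(len(in_order))      # 1 distinct PC: cause is recoverable
print(len(out_of_order))  # many distinct PCs: cause is smeared away
```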
Aside from the wide distribution of event samples, hardware event counters suffer from several additional problems. First, performance-counter interrupts may be deferred when running non-interruptible or high-priority system code, such as Alpha PALcode [8]. As a result, event samples will be incorrectly attributed to the instruction following the high-priority code, resulting in undesirable blind spots. In addition, there are typically many more events of interest than there are hardware counters, making it impossible to concurrently monitor all interesting events. The increasing complexity of processors is likely to exacerbate this problem. Moreover, event counters only record the fact that an event occurred; they do not provide any context about the event. For many kinds of events, additional information, such as the latency to service a cache miss, would be extremely useful.

[Figure 2: Histogram of PC values delivered to performance counter interrupt routines on in-order (Alpha 21164) and out-of-order (Pentium Pro) processors.]

ProfileMe avoids these problems on both in-order and out-of-order processors. Profiling instructions instead of events completely eliminates the difficulties with event attribution. Similarly, sampling instructions in hardware eliminates blind spots. Instruction-based profiling also permits a complete set of correlated events to be collected for each instruction, avoiding the need to process and correlate interrupts from multiple event counters.

3 Overview of Approach

To understand a program's performance, we would like to gather information at two levels:

- Aggregate information, summarizing performance statistics over an entire workload, an individual program, a procedure, or a smaller unit such as a loop.
- Instruction-level information, showing the average behavior of each instruction.

We are interested in gathering the key performance metrics needed to identify performance bottlenecks.
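One plausible shape for the software side of this aggregation is sketched below. The record fields and names are invented for illustration — they mirror the kind of per-instruction data ProfileMe delivers (PC, retired/aborted status, cache-miss flags) but are not an actual ProfileMe or DCPI API:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical per-interrupt sample record; field names are invented.
@dataclass
class Sample:
    pc: int           # profiled instruction's PC
    retired: bool     # did the instruction retire (vs. abort)?
    dcache_miss: bool # did it suffer a D-cache miss?

def aggregate(samples):
    """Roll instruction-level samples up into per-PC counts, from which
    both instruction-level rates and coarser summaries follow."""
    db = defaultdict(lambda: {"samples": 0, "retired": 0, "dcache_miss": 0})
    for s in samples:
        entry = db[s.pc]
        entry["samples"] += 1
        entry["retired"] += s.retired
        entry["dcache_miss"] += s.dcache_miss
    return db

db = aggregate([Sample(0x1000, True, True),
                Sample(0x1000, True, False),
                Sample(0x1004, False, False)])
print(db[0x1000]["dcache_miss"])  # 1
```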
As described earlier, ProfileMe involves sampling instructions: randomly choosing instructions to be profiled and recording information about their execution. By aggregating samples from repeated executions of the same instruction, we can estimate many interesting metrics for each instruction. Our approach makes gathering this instruction-level information both possible and relatively inexpensive. Information about individual instructions can easily be aggregated to summarize the behavior of larger units of code.

ProfileMe consists of both hardware and software. Hardware is needed to select instructions to be profiled, to record information about profiled instructions, and to generate an interrupt when a profiled instruction completes so that the recorded information can be delivered to software. Software is needed to sample the instruction stream randomly,

to field the interrupts generated by the hardware so that information about profiled instructions can be saved in a profile database, and to analyze profile data to identify performance problems.

To get a statistically meaningful estimate of program behavior, the profiling software requires a random sample of the instruction stream. Some analyses require random sampling of all fetched instructions, while others need a random sample of only the retired instructions. As described later, our approach involves randomly sampling fetched and decoded instructions. This means that an instruction chosen for profiling may later abort rather than retiring (e.g., due to speculative execution down a bad path). Providing samples of fetched instructions (along with retired/aborted status information) permits an analysis of which instructions are aborting and why, rather than making aborted instructions completely invisible to profiling. Note that this does not impact the random sampling of retired instructions: selecting the retired instructions from a random sampling of fetched instructions yields a random sampling of retired instructions, just as if the hardware were providing a random sampling of retired instructions directly.

As mentioned earlier, sampling individual instructions is important, but is not sufficient to accurately identify bottlenecks in out-of-order processors. We augment our core instruction sampling mechanism with paired sampling, which permits the sampling of multiple instructions that may be in flight concurrently. Paired sampling provides essential information for analyzing interactions between instructions. The idea is to sample the instructions in a relatively small window around each instruction to obtain a statistical estimate of concurrency levels and other utilization measures. Since both samples in a sample pair include retired/aborted status information, it is possible to determine the level of useful concurrency, i.e., the number of concurrent instructions that eventually retire. Paired sampling imposes relatively simple additional requirements.
The hardware for selecting instructions to be profiled and for recording information about profiled instructions must be duplicated. In addition, hardware is needed for measuring the fetch latency between the instructions in a sample pair so that concurrency levels can be estimated.

In the next section, we describe the hardware needed for ProfileMe, including both the core instruction sampling mechanisms and paired sampling. Section 5 discusses in more detail how this hardware can be used to provide useful profiling information for a variety of performance understanding and optimization tasks.

4 ProfileMe Hardware

The hardware required for sampling instruction execution is modest and scales linearly with the number of in-flight instructions that may be sampled simultaneously. By restricting the number of instructions simultaneously profiled to one or two instructions, we limit the hardware overhead. The run-time profiling overhead may be decreased arbitrarily by reducing the sampling rate, although previous work has shown that high-frequency sampling can be implemented with relatively low overhead through careful programming [2]. In the subsections below we describe the hardware needed to sample a single instruction, the additional hardware needed for paired sampling, and how replicating some of the hardware can reduce the software overhead substantially.

4.1 Instruction Sampling

Hardware must perform four tasks to sample individual instructions: instructions to be profiled must be selected, profiled instructions must be tagged in the processor pipeline, data captured about a profiled instruction's execution must be recorded in internal registers, and an interrupt must be generated when a profiled instruction completes so that the recorded information can be captured by profiling software.

4.1.1 Choosing Profiled Instructions

In the front of the pipeline, we need hardware to choose instructions to be profiled. To ensure that instructions are chosen randomly, we add a software-writable Fetched Instruction Counter to the processor's instruction fetcher.
At the beginning of each sampling interval, the profiling software writes a random value to the counter. The counter decrements once for each instruction fetched on the predicted control path; the instruction fetched when the counter reaches zero is selected for profiling. Counting fetched instructions on the predicted control path is actually somewhat complicated, since a variable number of instructions (zero to four on the Alpha 21264) on the predicted control path is fetched each cycle. This complexity can be avoided by instead counting fetch opportunities (four per cycle on the Alpha 21264) and selecting a particular fetch opportunity to be profiled. A given fetch opportunity may contain an instruction on the predicted control path, an instruction not on the predicted control path (but in the same fetch block as instructions that are on the predicted control path), or no instruction at all (e.g., if the fetcher is stalled waiting for an I-cache miss). Choosing instructions to profile based on counting fetch opportunities simplifies the hardware, but may result in a significant number of samples that do not contain instructions on the predicted control path, effectively reducing the useful sampling rate. We are still evaluating the tradeoffs among different methods of selecting instructions to be profiled.

4.1.2 The ProfileMe Tag

We augment the decoded instruction state with a ProfileMe tag that is passed through the processor pipeline with every in-flight instruction. The ProfileMe tag is set for an instruction when it is chosen to be profiled. In the lowest-cost implementation, the tag is set for at most one in-flight instruction at a time, so that a single bit suffices for the

tag. For paired sampling or, in general, N-way sampling, ⌈log(N+1)⌉ bits are needed.

4.1.3 Instruction-Level Data Collection

When the ProfileMe tag is set for an instruction, the profiling hardware records events, latencies, addresses, etc., associated with that instruction, in a set of processor-internal Profile Registers indexed by the tag. The information collected for profiled instructions will vary across processor implementations. This subsection sketches the information that is important for profiling in current out-of-order execution processors and the hardware needed to gather it. It is relatively easy to have the hardware record additional events or other information about the instruction in the Profile Registers.

The Profiled Context Register records the address space number or other identification of the process or thread executing the profiled instruction. The Profiled PC Register records the address of the profiled instruction. The Profiled Address Register records the effective address of load and store instructions and the target address of indirect jump instructions. A Profiled Event Register is a bit-field that records whether various events were experienced by the instruction. Events include: I-cache and D-cache miss, instruction and data TLB miss, branch taken, branch mispredicted, various resource conflicts, memory traps, whether the instruction retired, trap reason, etc. A Profiled Path Register is used to capture recent branch-taken information from the processor's global branch history register. This information can be used to determine the code path taken in reaching the profiled instruction, as described in Section 5.3. A set of Latency Registers records the number of cycles spent by the instruction in each pipeline phase. Table 1 lists some of the latencies of interest on the Alpha 21264, along with descriptions of the problems they help diagnose.

4.1.4 Capturing Profile Data

The ProfileMe tag remains set for a profiled instruction until it retires or aborts.
After all processor activity pertaining to the instruction has completed, an interrupt is generated. Profiling software fields the interrupt, reads the Profile Registers, and resets the Fetched Instruction Counter to a pseudo-random value. Note that even if some of the information to be recorded in the Profile Registers needs to travel a long distance across the chip, this need not impact the cycle time. Latches can be inserted to pipeline the signals to the Profile Registers. If this is done, the interrupt that signals the collection of a ProfileMe sample must be delayed until all the appropriate signals have had time to reach the Profile Registers.

4.2 Paired Sampling

Paired sampling requires the ability to sample two potentially concurrent instructions. We also need information about the overlap between the instructions in a sample pair.

  Measured Latency         | Explanation
  -------------------------|--------------------------------------------------
  Fetch -> Map             | Stalls due to lack of physical registers or
                           | issue queue slots
  Map -> Data ready        | Stalls due to data dependences
  Data ready -> Issue      | Stalls due to execution resource contention
  Issue -> Retire ready    | Execution latency
  Retire ready -> Retire   | Stalls due to prior unretired instructions
  Load issue -> Completion | Memory system latency (Alpha allows loads to
                           | retire before the value returns, so this may
                           | differ from Issue -> Retire ready)

Table 1: Latency Measurements. Pipeline stage latencies are useful for identifying and diagnosing stalls and delays.

To accommodate paired sampling, we make the following extensions to the core instruction sampling mechanisms. To choose the instructions in a sample pair, we specify major and minor sample intervals. The major interval specifies the number of fetched instructions until the first instruction of a pair is chosen. The minor interval specifies the number of fetched instructions between the first and second profiled instructions in a sample pair. The software randomizes both of these intervals. To record information about both instructions in a sample pair, we need two sets of Profile Registers, indexed by the ProfileMe tag, and the signals carrying information to the registers must also carry the tag.
So profiling software can capture the recorded information, an interrupt must not be generated until both sampled instructions have finished executing and all relevant data has been recorded in their Profile Registers. Finally, we need to capture the latency between the two sampled instructions (i.e., the number of cycles between the times when the two sampled instructions were fetched). This latency is required to determine the degree of overlap of the instruction pair in the processor pipeline.

4.3 Amortizing Interrupt Delivery Costs

Previous work has shown that the cost of delivering and processing performance interrupts is one of the most significant sources of overhead in sampling-based profiling systems [2]. ProfileMe makes it possible to reduce this overhead by providing additional hardware copies of the profile registers and by buffering multiple samples before delivering a performance interrupt. Software can then read the data for several samples at once, thereby amortizing the performance interrupt delivery cost.

5 Profiling Software

The hardware mechanisms presented in the previous section can be utilized in various ways. One approach is to gather samples several thousand times per second, logging them in memory or on disk for later processing. Space consumption can be reduced by processing some of the information as the samples are gathered, such as by aggregating samples for the same instruction, as is done for event-counter-based samples in DIGITAL's Continuous Profiling Infrastructure (DCPI) system [2]. Overhead can be further
reduced by ignoring certain fields of the profile information except when gathering data for specific optimizations. Once the profile information has been collected, it can be analyzed to extract useful information. Several analyses are described in the following subsections.

5.1 Estimating Event Frequencies

Samples for individual instructions can be used to estimate various instruction-level event frequencies as follows. Assume an average sampling rate of one sample every S fetched instructions. Suppose that N instructions are fetched and a fraction f of those have a given property P (e.g., "instruction I retired", or "instruction I missed in the D-cache"). We know how many total samples are collected (on average, N/S) and how many of the samples have the property P. Our goal is to estimate fN, i.e., the actual number of fetched instructions with property P. Let the random variable k be the number of samples with property P. We estimate the actual number of fetched instructions with property P as kS. It is easy to show that the expected value of kS is fN, i.e., the actual number of fetches of I with property P. Under simple assumptions, the coefficient of variation of kS is sigma_kS / E[kS] = sqrt((1/N) * (S - f) / f), which is approximately sqrt(S / (fN)) (since S >> f). This latter expression is equivalent to sqrt(1 / E[k]). In other words, relative error decreases with the reciprocal of the square root of the (expected) number of samples with property P. Infrequent events or long sampling intervals require longer runs to get enough samples for accurate estimates. However, for many applications the goal is to identify instructions that exhibit an unusually high value for a particular metric (e.g., D-cache miss count). Such instructions have a high value of fN for that property, so convergence should happen relatively quickly. To explore the issue of convergence, we extended a cycle-accurate simulator of the Alpha 21264 processor to gather ProfileMe samples, using a suite of benchmarks that included COMPRESS, GCC, GO, IJPEG, LI, PERL, POVRAY, and VORTEX.
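The estimator described above can be sketched in a few lines of Python; this is a minimal illustration, and the function name and interface are ours, not part of ProfileMe:

```python
import math

def estimate_event_count(k, S):
    """Estimate the number of fetched instructions with property P.

    k: number of samples observed with property P
    S: average sampling interval (one sample per S fetched instructions)

    Returns (estimate, relative_error). The estimate kS has expectation
    fN; its coefficient of variation is roughly 1/sqrt(E[k]), and we
    substitute the observed k for its expectation.
    """
    estimate = k * S
    rel_error = 1.0 / math.sqrt(k) if k > 0 else float("inf")
    return estimate, rel_error
```

For example, 400 D-cache-miss samples at S = 1000 yield an estimate of 400,000 misses with roughly 5% relative error; quadrupling the number of samples halves the error, as the 1/sqrt(x) envelope predicts.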
We sampled every 10^3 to 10^5 fetched instructions from traces of 10^8 and 10^9 instructions. Figure 3 illustrates how the estimated counts for each PC converge on the actual values as the number of samples increases. The left column shows the results for the retire count for each instruction, while the right column shows the results for D-cache miss counts. In the graphs, each point represents a single static instruction. All graphs show the ratio of the estimated value to the actual value on the y-axis; the top two rows use a log scale, and the bottom row uses a linear scale. In the top row, the x-axis shows the total number of samples for each instruction; this is typically more than the number of samples in which the instruction retired or suffered a D-cache miss (especially for D-cache misses). In the bottom two rows, the x-axis shows the number of samples with the relevant property. The graphs in the bottom row show an expanded view of the same data as in the middle row; they also show the edges of the envelope corresponding to one standard deviation from the actual value (y = 1 +/- 1/sqrt(x)).

Figure 3: Convergence of Retire Count and D-Cache Miss Rate

Two-thirds of the points are expected to be within this envelope, and any envelope that includes a fixed percentage of the points will follow a 1/sqrt(x) curve. Optimization and profiling tools may also find it useful to compute confidence intervals around data derived from sampling.

5.2 Estimating Interaction Frequencies

Paired samples can be used to estimate a wide range of concurrency and utilization metrics.
For example, they can be analyzed to estimate useful concurrency levels, making it possible to find true bottlenecks (see Section 5.2.3). Paired samples can also be used to measure edge frequencies of a program's control-flow and call graphs and can improve the accuracy of sampling-based path-profiling (see Section 5.3). Finally, it may be possible to statistically reconstruct detailed processor pipeline states from paired samples. This section explains in more detail how paired sampling works and how paired samples can be analyzed to derive statistical estimates of concurrency and resource utilization levels while an instruction is in flight.

5.2.1 Nested Sampling

Paired sampling enables ProfileMe records to be collected for two instructions that may be in flight simultaneously.

Figure 4: Nested Sampling Example. Two levels of sampling are depicted: (1) a major inter-pair sampling interval between windows (black regions), and (2) a minor intra-pair sampling interval within each window (expanded views).

A key application of paired sampling hardware is nested sampling: for each profiled instruction, the set of other instructions that can potentially execute concurrently with it is directly sampled. Nested sampling is based on the same statistical arguments that justify ordinary sampling. Because it involves two levels of sampling, it will be most effective for heavily executed code. Figure 4 illustrates an example of nested sampling. The arrow indicates the sequence of instructions that are fetched (in program order) during some dynamic execution. The first level of sampling is represented by the small black regions of fetched instructions; their spacing corresponds to the major sampling interval. The second level of sampling is depicted by the expanded window of instructions shown above each black region. The first labelled instruction in each window represents the instruction selected by the first level of sampling. The second labelled instruction in each window is determined by the minor sampling interval. Denote the size of the window of potentially concurrent instructions by W. For each paired sample ⟨I1, I2⟩, nested sampling is implemented by setting the intra-pair fetch distance to a pseudo-random number uniformly distributed between 1 and W. The window size is conservatively chosen to include any pair of instructions that may be simultaneously in flight. In general, an appropriate value for W depends on the maximum number of in-flight instructions supported by the processor. (On most processors, this is less than one hundred instructions.) The minor intra-pair sampling interval will typically be orders of magnitude smaller than the major inter-pair interval.
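A software-side sketch of how the two intervals might be randomized follows. The distribution of the major interval is our assumption; the text only requires that both intervals be randomized, with the minor interval uniform in [1, W]:

```python
import random

def next_pair_intervals(mean_major, W, rng=random):
    """Pick the fetch-count intervals for the next sample pair.

    mean_major: mean inter-pair interval (fetched instructions)
    W:          window of potentially concurrent instructions

    The major interval is drawn uniformly around its mean to avoid
    correlation with periodic program behavior; the minor interval is
    uniform in [1, W] so the second instruction of the pair is a
    uniform sample of the window after the first.
    """
    major = rng.randint(mean_major // 2, (3 * mean_major) // 2)
    minor = rng.randint(1, W)
    return major, minor
```

In the hardware described earlier, these two counts would be loaded into the Fetched Instruction Counter by the interrupt handler; here they are simply returned.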
5.2.2 Analyzing Sample Pairs

For a given profiled instruction I, the set of potentially concurrent instructions are those that may be co-resident in the processor pipeline with I during any dynamic execution. This includes instructions that may be in various stages of execution before I is fetched, as well as instructions that are fetched after I. Figure 5(a) shows how the sample pairs from Figure 4 can be analyzed to recover information about instructions in a window of W potentially concurrent instructions around I. In this example, we consider all pairs ⟨I1, I2⟩ containing the instruction labelled a. When I1 = a, I2 is a random sample in the window after a; when I2 = a, I1 is a random sample in the window before a. By considering each pair twice, random samples are uniformly distributed over the set of all potentially concurrent instructions.

Figure 5: Paired Sample Analysis. (a) Sample pairs containing instruction a form a random sample of instructions in the window of W potentially concurrent instructions around a. (b) Execution timings for the instructions in each pair enable their temporal overlap to be determined.

The ProfileMe data recorded for each paired sample ⟨I1, I2⟩ includes latency registers that indicate where I1 and I2 were in the processor pipeline at each point in time, as well as the intra-pair fetch latency that allows the two sets of latency registers to be correlated. The ProfileMe records for I1 and I2 also indicate whether they retired or aborted. This information can be used to determine whether or not the two instructions in a sample pair overlapped in time, as illustrated in Figure 5(b). For example, the data associated with sample pairs ⟨d, a⟩ and ⟨c, a⟩ reveal varying degrees of execution overlap, and there was no overlap for ⟨a, d⟩. Similarly, the data for ⟨a, b⟩ indicates that while the executions of a and b overlapped, b was subsequently aborted. The definition of overlap can be altered to focus on particular aspects of concurrent execution.
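As a concrete illustration, the temporal overlap in Figure 5(b) can be computed from the two latency records and the recorded intra-pair fetch latency. This is a hypothetical post-processing step with illustrative field names, not the actual register layout:

```python
def temporal_overlap(retire_ready_1, retire_ready_2, fetch_distance_cycles):
    """Cycles during which both sampled instructions were in progress.

    retire_ready_1, retire_ready_2: fetch-to-retire-ready latency of the
        first and second instruction in the pair (cycles)
    fetch_distance_cycles: cycles between the two fetches (the recorded
        intra-pair latency)

    Each instruction is modeled as in progress over a half-open cycle
    interval starting at its fetch; the overlap is the length of the
    intersection of the two intervals (zero if they are disjoint).
    """
    start1, end1 = 0, retire_ready_1
    start2 = fetch_distance_cycles
    end2 = fetch_distance_cycles + retire_ready_2
    return max(0, min(end1, end2) - max(start1, start2))
```

More elaborate overlap definitions (e.g., "both issued in the same cycle window") would compare other pairs of stage timestamps in the same way.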
The subsection below uses a particular definition to estimate the number of issue slots wasted while a given instruction was in flight. Other useful definitions of overlap include: one instruction issued while the other was stalled in the issue queue; one instruction retired within a fixed number of cycles of the other; or both instructions were using arithmetic units at the same time.

5.2.3 Example Metric: Wasted Issue Slots

To pinpoint bottlenecks, we need to identify instructions with high execution counts, long latencies, and low levels of useful concurrency. One interesting measure of concurrency is the total number of issue slots wasted while an instruction is in progress. To compute this metric, we define useful overlap for a sample pair containing an instruction I to mean that while I is in progress, the instruction paired with it in the sample pair issues and subsequently retires. Here we define "in progress" to mean the time between when I is fetched and when it becomes ready to retire; we do not include time spent waiting to retire, since such delays are purely due to stalls by earlier instructions. Fix an instruction I. To estimate the number of issue slots wasted while I is in progress, we first estimate the number of issue slots used by instructions that exhibit useful overlap with I. We then estimate the total number of issue slots available over all executions of I; the difference between these two quantities is the number of wasted issue slots. Assume an average sampling rate of one sample pair every S fetched instructions, with the second sample in a pair chosen uniformly from the window of the W instructions after the first. Let U_I^F denote the number of samples of the form ⟨I, I2⟩ such that I2 exhibits useful overlap with I; similarly, let U_I^B denote the number of samples of the form ⟨I1, I⟩ such that I1 exhibits useful overlap with I. Let U_I = U_I^F + U_I^B. We estimate the number of useful instructions that issued while I was in progress as U_I * W * S. Now let L_I be the sum over all samples involving I of the sample latency (in cycles) from fetch to ready-to-retire. (We include both samples in every pair in this sum.) Let C be the issue width of the machine, i.e., the number of available issue slots per cycle (e.g., four per cycle sustainable on the Alpha 21264). We estimate the total number of issue slots available over all executions of I as L_I * C * S / 2. Finally, we estimate the total number of wasted issue slots during all executions of I as (L_I * C * S / 2) - (U_I * W * S). An important attribute of our approach is that the components of a metric such as wasted issue slots can be aggregated incrementally, enabling compact storage during data collection (as in the DCPI profiling system [2]).
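The arithmetic above can be collected into a small routine. This is a sketch; the aggregation of U_I and L_I from the raw sample records is assumed to have happened already:

```python
def wasted_issue_slots(U_I, L_I, S, W, C):
    """Estimate total issue slots wasted while instruction I was in progress.

    U_I: samples (forward + backward) showing useful overlap with I
    L_I: summed fetch-to-retire-ready latency, in cycles, over all samples
         of I (both members of every pair are counted, hence the /2 below)
    S:   average fetched instructions per sample pair
    W:   intra-pair window size (instructions)
    C:   issue width of the machine (slots per cycle)
    """
    useful_slots = U_I * W * S             # slots used by usefully overlapping instructions
    available_slots = (L_I * C * S) // 2   # slots available over all executions of I
    return available_slots - useful_slots
```

Because both terms are simple sums scaled by constants, the per-instruction accumulators U_I and L_I can be updated incrementally as samples arrive, matching the compact-storage property noted above.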
5.2.4 Flexible Support for Concurrency Metrics

Many other concurrency metrics can be estimated in a similar manner, such as the number of instructions that retired while I was in flight, or the number of instructions that issued around I. Instructions-per-cycle (IPC) levels in the neighborhood of I can be measured by counting the number of pairs in which both instructions retire within a fixed number of cycles of each other. More detailed information can also be extracted or aggregated, such as the average utilization of a particular functional unit while I was in a given pipeline stage. Per-instruction data may also be used to cluster interesting cases when aggregating concurrency information. For example, it may be useful to compare the average concurrency level when instruction I hits in the cache with the concurrency level when I suffers a cache miss. Other interesting aspects to examine for correlation with concurrency levels include register dependencies, branch-mispredict stalls, and recent branch history. In general, paired sampling provides significant flexibility, allowing a variety of different metrics to be computed statistically by sampling the value of any function that can be expressed as f(I1, I2) over a window of W instructions. In contrast to hardware mechanisms designed to measure a single concurrency metric, this flexibility makes paired sampling an attractive choice for capturing concurrency information on complex, out-of-order processors, because it leaves the door open for the design of new metrics and analysis techniques.

5.3 Path Profiles

Many compiler optimizations, such as trace scheduling [9] and hot-cold optimization [5], rely on predicting the heavily executed paths through a program. Frequently executed paths were conventionally estimated by gathering basic block or control-flow graph edge counts and then using these counts to infer the hot paths. More recently, Ball and Larus [3] and Young et al. [19] proposed more advanced profiling methods to gather detailed path information directly.
Although such techniques yield exact path counts, they require instrumenting the program and are therefore expensive and intrusive. By capturing information about the processor's global branch history and combining this with a static analysis of a program's control flow graph (CFG), we can use ProfileMe hardware to perform statistical profiling of CFG path segments. Most modern microprocessors store the directions of the last N conditional branches in a global branch history register as part of their branch prediction hardware. By capturing the contents of this register at instruction fetch time as part of the profile record, we can analyze the CFG by looking backward from a sampled instruction to find the paths leading up to it that are consistent with the recorded branch-history information. Because of merges in the CFG, there may be multiple such consistent paths, because the history register contains only the directions of the branches and not their PCs. [Footnote 1: Asynchronous events that cause code with branches to be executed, such as interrupts or context switches, also pollute the branch history bits, but these events should be relatively infrequent. Since the goal is to identify high-frequency paths, low-frequency paths generated by noisy branch history bits will be largely ignored.]

To explore the effectiveness of this analysis in identifying the true program path given the PC and global branch history contained in a ProfileMe sample, we traced each of the programs in the SPECint95 benchmark suite. For each instruction in the trace, we computed the value of the branch history bits at that point, and walked backwards through the program's CFG to identify path segments that could have been executed (i.e., where the particular branch directions on the path are consistent with the branch directions indicated in the history bits). Ideally, this analysis would identify just one potential path segment corresponding to the true execution path. We compared three different schemes for constructing paths: "Execution counts", which ignores the branch history bits, using the execution frequencies at each control-flow merge point to identify the most likely path (trace scheduling compilers use a similar technique to construct traces from basic-block execution-count profiles); "History bits", which uses the global branch history bits to restrict the set of paths that are examined; and "History bits + paired sampling", which augments "History bits" by discarding paths that do not contain the other PC from a paired sample (with the intra-pair distance randomly varied between 1 and 50 fetched instructions).

Figure 6: Effectiveness of path reconstruction strategies

The results are shown in Figure 6. The graphs depict the accuracy of each of the three schemes, as a function of the length of the branch history that was examined (the history lengths maintained by current-generation processors are typically between 8 and 12). The vertical axis shows the success rate of the reconstruction over the entire SPECint95 suite (using traces of 33 to 83 million instructions for each benchmark), where success is defined as a case where only one path is produced by the analysis and the path corresponds to the actual execution path. The left graph of Figure 6 depicts an intraprocedural experiment, where we finished a path when either the path had grown backward to include a fixed number of branches corresponding to the length of the available branch history or when the path reached the beginning of the routine (such paths may not contain as many branches as there are bits in the available branch history). This graph corresponds to the kinds of paths that might be used to guide an intraprocedural trace scheduler. The right graph of Figure 6 shows an interprocedural experiment, where the analysis continued through the call-sites of a routine when the beginning of the routine was reached, so that a path was only complete if it contained a number of branches equal to the length of the branch history being examined.
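The backward walk described above can be sketched as follows. This is our simplified reconstruction, not the authors' implementation: the CFG is a toy predecessor map, unconditional edges consume no history bits, the walk stops at blocks with no predecessors (the routine entry), and we assume every cycle in the CFG contains a conditional branch so the recursion terminates:

```python
def consistent_paths(cfg_pred, history, pc):
    """Enumerate path segments ending at `pc` that are consistent with
    the recorded global branch history.

    cfg_pred: block -> list of (predecessor, taken) edges, where taken
              is True/False for a conditional-branch edge and None for
              an unconditional edge (a hypothetical encoding)
    history:  recent conditional-branch directions, newest last
    """
    preds = cfg_pred.get(pc, [])
    if not history or not preds:        # history exhausted, or routine entry
        return [[pc]]
    paths = []
    for pred, taken in preds:
        if taken is None:               # unconditional edge: consume no history bit
            sub = consistent_paths(cfg_pred, history, pred)
        elif taken == history[-1]:      # must match the newest history bit
            sub = consistent_paths(cfg_pred, history[:-1], pred)
        else:
            continue                    # inconsistent with recorded history
        paths.extend(p + [pc] for p in sub)
    return paths
```

On a diamond-shaped CFG (A branches to B or C, both falling through to D), a single history bit already narrows the two candidate paths into D down to one, which is the ideal single-path outcome the experiment measures.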
In general, the accuracy of all three methods decreases as we attempt to infer longer execution path segments, but using the branch history noticeably improves accuracy, and paired sampling improves accuracy further. All three methods are considerably less accurate when trying to construct interprocedural paths, but the results still indicate that branch history bits significantly improve accuracy and that paired sampling further improves accuracy. [Footnote 2: Note that in either case, when a procedure call instruction is encountered during the backwards traversal of the CFG, the analysis continues at the exits of the called procedure and can eventually return to the calling procedure if there is sufficient branch history to work backwards through the entire called routine.] Further study is required to show the degree of improvement of code generation that can be attained using more accurate path profiles.

6 Metrics for Identifying Bottlenecks

When we started this work, we believed that concurrency information would be needed to identify bottlenecks accurately; this motivated us to invent paired sampling. Given that paired sampling imposes some additional costs, one might ask whether it is really necessary. To explore this question, we examined whether the total latency of each instruction (which can be estimated from individual instruction samples, without paired sampling) would pinpoint bottlenecks as effectively as would the total number of issue slots wasted while each instruction was in progress.

Figure 7: Identifying Bottlenecks. Instruction latency alone cannot accurately identify bottlenecks, due to out-of-order execution that masks stalls.

Figure 7 shows results from running a simple program consisting of three separate loops on an Alpha 21264 simulator. Each instruction in the program is represented by a symbol (circle, square, or triangle), with a different symbol for instructions in each of the three separate loops.
In the figure, an instruction's X coordinate gives the total latency from fetch to retire-ready experienced by the instruction over the execution of the program. [Footnote 3: We use this definition of latency instead of the fetch-to-retire latency to avoid penalizing instructions that issue around a stalled instruction and execute quickly but stall waiting to retire because the earlier instruction is not ready to retire; as with other out-of-order processors, the Alpha 21264 retires instructions in order. This is consistent with our definition of wasted issue slots, which considers only slots wasted while an instruction is in progress, i.e., between the time it is fetched and the time it becomes ready to retire.] An instruction's Y coordinate gives the total number of issue slots wasted while the instruction was in progress. The results in the graph show that latency is not well correlated with wasted issue slots, due to varying levels of useful concurrency in the different loops. For example, the instruction with the highest latency (rightmost triangle) actually wastes fewer issue slots than instructions with lower latencies (rightmost circle and squares). However, when concurrency is fairly constant, latency is highly correlated with wasted issue slots. In the figure, intra-loop concurrency is
similar across instructions, as indicated by the slopes of instructions in the different loops. (Though even within individual loops, there are some significant differences.) Data collected for several SPEC95 benchmarks using the same simulator indicate that real applications also exhibit varying levels of useful concurrency. We measured instructions-per-cycle (IPC) levels by counting the number of instructions that retired during a fixed 30-cycle time window. The ratio of the maximum and minimum of these windowed IPC levels ranged from 3 to 30 across the various benchmarks; the standard deviation of the windowed IPC, weighted by retire count, varied from 20-42% of the mean for each of the benchmarks, with an overall value of 31% of the mean. It is not yet clear what the right metric is for pinpointing bottlenecks. However, it seems likely that latency alone will not suffice. As the complexity of processors increases, instruction-level concurrency will only become harder to understand; paired sampling and the analyses it supports will be a useful tool for getting to the root of performance problems.

7 Potential Optimizations

This section briefly outlines some ways in which information collected by ProfileMe could be used in compilers and operating systems to improve performance. We are currently exploring these and other directions.

Guiding traditional compiler optimizations: Execution frequencies, branch mispredict rates, and I-cache miss rates derived from the samples can be used to guide register allocation spilling decisions, inlining decisions, code generation, and the rearrangement of procedures and basic blocks to improve I-cache locality.

Improved instruction scheduling: One important aspect of instruction scheduling is the insertion of prefetches and the scheduling of loads and stores. The lack of information about actual latencies means that compilers schedule loads and stores assuming that they will hit in the data cache.
Abraham and Rau [1] have experimented with using average load latencies to drive compiler optimizations, and more recently Luk and Mowry [13] have explored the use of path information to identify loads whose cache miss behavior is correlated with the execution path taken to reach the load. ProfileMe provides a cheap way of gathering the data needed to drive these optimizations.

Cache and TLB hit rate enhancement: Recent studies have shown that dynamically controlling the operating system's virtual-to-physical mapping policies using information about dynamic reference patterns can reduce conflict misses in large direct-mapped caches [4, 15], lower TLB miss rates through the creation of superpages [16], and decrease the number of remote memory references in NUMA-based multiprocessors through replication and migration of pages [17]. All of these schemes gather reference pattern information through either specialized hardware for gathering cache miss addresses or specialized software schemes (e.g., flushing the TLB and observing the miss pattern that results). By capturing the virtual addresses of memory references that miss in the cache or TLB, ProfileMe provides the information needed to guide these policies, without additional hardware complexity.

8 Related Work

The work most closely related to ProfileMe is a patent by Westcott and White, who also proposed a hardware mechanism for instruction-based sampling in an out-of-order execution machine [18]. Their system allows profiling of an instruction when its execution is assigned a particular internal instruction number, an instruction identifier (IID), in the processor's pipeline. During its execution, information associated with the instruction (such as whether it suffered a data cache miss, and its latency from fetch time to completion time) is recorded in internal processor registers. When the instruction retires, the information is logged to a specific area of memory, and when this memory area fills up, an interrupt is generated. There are three key differences between this approach and ProfileMe.
First, Westcott and White allow an instruction to be profiled only when it is assigned a particular IID. In contrast, ProfileMe allows any instruction to be sampled; this is essential for obtaining a random sample of the entire instruction stream. Second, ProfileMe keeps information for all sampled instructions and provides a bit in the profile record indicating the instruction's retirement status. This allows software to decide how to handle unretired instructions, rather than transparently discarding them in the hardware. Third, ProfileMe supports paired sampling (with inter-sample latency information that shows overlap in the pipeline); this is essential for measuring concurrency levels during the execution of each instruction. The information collected by the Westcott and White mechanism does not provide any support for determining inter-instruction relationships. ProfileMe also collects additional information for each sampled instruction, including branch directions, global branch histories, and branch mispredict information, all of which are useful for identifying the common paths.

More recently, Horowitz et al. [10] proposed a hardware mechanism called informing loads, in which a memory operation can be followed by a conditional branch operation that is taken only if the memory operation misses in the cache. This permits reacting to cache misses at a fine-grained level, such as by branching to code that is scheduled for the case of a cache miss, rather than for the case of a cache hit. ProfileMe provides information about cache misses, but the information is available only for sampled instructions and only after a performance interrupt has been delivered. At the same time, the information provided by ProfileMe is more detailed, since it includes other information such as the latency incurred in servicing a miss and other aspects of an instruction's execution. In many respects, these two designs are complementary: informing
memory operations permit software to gain control very quickly after a cache miss, while a ProfileMe record contains more detailed information about an instruction's execution that can be used for later analysis.

Bershad et al. [4] proposed specialized hardware called a cache miss lookaside (CML) buffer to identify virtual memory pages that suffer from a high L2 cache miss rate. Using the effective addresses and the latency information for loads and stores captured by ProfileMe, we can provide the same information as a CML buffer.

Some processors, such as the Intel Pentium, have software-readable branch target buffers (BTB). Conte et al. [7] showed how to cheaply estimate a program's edge execution frequencies by periodically reading the contents of the BTB. More recently, Conte et al. [6] proposed additional hardware called a profile buffer, which counts the number of times a branch is taken and not-taken. The branch direction information in a ProfileMe record yields similar information; the branch history bits provide additional information about paths.

9 Conclusions

The performance of modern processors is becoming increasingly difficult to understand. The dynamic nature of speculative and out-of-order execution, coupled with the complexity of deep memory hierarchies, makes it impossible to predict program behavior solely through static analysis. Sampled profile information offers an inexpensive, unobtrusive way to collect detailed information for identifying bottlenecks and improving performance. However, this potential cannot be realized using the hardware performance counters found in existing processors, which cannot even accurately attribute events to instructions. ProfileMe enables the collection of accurate, detailed information with modest hardware by sampling instructions instead of events.
A complete record of interesting events, such as cache misses and branch mispredictions, is directly associated with each profiled instruction. A wealth of additional information is also collected, including pipeline stage latencies, branch history data, and effective addresses for memory operations. By additionally allowing a pair of in-flight instructions to be simultaneously profiled, a variety of interesting information can be derived about the interactions between instructions, including useful concurrency levels and pipeline stage utilizations. Together, these mechanisms enable the construction of a powerful, low-overhead profiling system that offers unprecedented instruction-level feedback to programmers and optimization tools.

Acknowledgments

We would like to thank Jennifer Anderson, Luiz Barroso, Susan Eggers, Keith Farkas, Sanjay Ghemawat, Bill Gray, Allan Heydon, Shun-Tak Leung, Sharon Perl, Mark Vandevoorde, and the anonymous referees for their helpful feedback on earlier versions of this paper. Special thanks to Mitch Lichtenberg for performing experiments illustrating how counters behave on the Intel Pentium Pro processor and to Andrei Broder and Michael Mitzenmacher for helping with the stochastic analysis in Section 5. We also had valuable discussions with many people at DIGITAL, including Robert Cohn, Bruce Edwards, Joel Emer, Kourosh Gharachorloo, John Henning, Dan Liebholz, Ed McLellan, Rahul Razdan, and Steve Root.

References

[1] S. G. Abraham and B. R. Rau. Predicting load latencies using cache profiling. Technical Report HPL-94-110, Hewlett-Packard Laboratories, Nov. 1994.
[2] J. Anderson, L. M. Berc, J. Dean, S. Ghemawat, M. R. Henzinger, S.-T. Leung, R. L. Sites, M. T. Vandevoorde, C. A. Waldspurger, and W. E. Weihl. Continuous profiling: Where have all the cycles gone? In Proc. 16th Symp. on Operating System Principles, Oct. 1997.
[3] T. Ball and J. R. Larus. Efficient path profiling. In Proc. 29th Annual Intl. Symp. on Microarchitecture, pages 46-57, Dec. 1996.
[4] B. N. Bershad, D. Lee, T. H. Romer, and J. B. Chen. Avoiding conflict misses dynamically in large direct-mapped caches. In Proc. Sixth Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 158-170, Oct. 1994.
[5] R. Cohn and P. G. Lowney. Hot cold optimization of large Windows/NT applications. In Proc. 29th Annual Intl. Symp. on Microarchitecture, pages 80-89, Dec. 1996.
[6] T. M. Conte, K. N. Menezes, and M. A. Hirsch. Accurate and practical profile-driven compilation using the profile buffer. In Proc. 29th Annual Intl. Symp. on Microarchitecture, pages 36-45, Dec. 1996.
[7] T. M. Conte, B. A. Patel, and J. S. Cox. Using branch handling hardware to support profile-driven optimization. In Proc. 27th Annual Intl. Symp. on Microarchitecture, pages 12-21, Nov. 1994.
[8] Digital Equipment Corporation. Alpha 21164 Microprocessor Hardware Reference Manual. Maynard, MA, 1995. Order Number EC-QAEQB-TE.
[9] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. on Computing, 30(7):478-490, July 1981.
[10] M. Horowitz, M. Martonosi, T. C. Mowry, and M. D. Smith. Informing memory operations: Providing memory performance feedback in modern processors. In Proc. 23rd Annual Intl. Symp. on Computer Architecture, pages 260-270, May 1996.
[11] Intel Corporation. Pentium(R) Pro Processor Developer's Manual. McGraw-Hill, June 1997.
[12] D. Leibholz and R. Razdan. The Alpha 21264: A 500 MHz out-of-order execution microprocessor. In IEEE CompCon '97, Feb. 1997.
[13] C.-K. Luk and T. C. Mowry. Predicting data cache misses in non-numeric applications through correlation profiling. In Proc. 30th Annual Intl. Symp. on Microarchitecture, Dec. 1997.
[14] MIPS Technologies, Inc. MIPS R10000 Microprocessor User's Manual. Mountain View, CA, 1995.
[15] T. H. Romer, D. Lee, B. N. Bershad, and J. B. Chen. Dynamic page mapping policies for cache conflict resolution on standard hardware. In Proc. First Symp. on Operating Systems Design and Implementation, pages 255-266, 1994.
[16] T. H. Romer, W. H. Ohlrich, A. R. Karlin, and B. N. Bershad. Reducing TLB and memory overhead using online superpage promotion. In Proc. 22nd Annual Intl. Symp. on Computer Architecture, pages 176-187, June 1995.
[17] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proc. Seventh Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 279-289, Oct. 1996.
[18] D. W. Westcott and V. White. Instruction sampling instrumentation. U.S. Patent #5,151,981, Sept. 1992. Assigned to International Business Machines Corporation.
[19] C. Young and M. D. Smith. Improving the accuracy of static branch prediction using branch correlation. In Proc. Sixth Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 232-241, Oct. 1994.