Computer Science 146/246 Homework #3
Due 11:59 P.M. Sunday, April 12th, 2015

We played with a Pin-based cache simulator for Homework 2. This homework will prepare you to set up and run a detailed microarchitecture-level simulator, as well as to modify the simulator to facilitate your own research. We will use XIOSim, a Pin-based x86 microarchitecture-level simulator, for this homework. You can find more details about (an older version of) XIOSim in this paper [2], or you can just check it out on GitHub. (Full disclosure: I'm the main author of XIOSim. So, if you have any feedback, suggestions, bug reports, or curse words you want to throw at me, I'd be more than happy to listen.)

1 Download and Configure XIOSim

a. Download XIOSim

You can get XIOSim from GitHub:

    $ git clone https://github.com/s-kanev/xiosim.git

b. Set up your environment

The build and the following scripts rely on these variables to know which XIOSim installation to look for and how to satisfy dependencies. Just execute:

    $ export BOOST_HOME=/home/cs246/boost_1_54_0
    $ export XIOSIM_TREE=/your/path/to/XIOSim
    $ export XIOSIM_INSTALL=${XIOSIM_TREE}/pintool/obj-ia32

You can add these to your ~/.bashrc file so you don't have to type them out every time you log in.

c. Build XIOSim

    $ cd pintool
    $ make
d. Run Your First Test Program

Let's test the simulator with a simple benchmark. There is a script called run.sh under the XIOSim/pintool directory which sets up the simulated architecture configuration and runs the simulation. It looks like this:

    PIN=${PIN_ROOT}/pin.sh
    PINTOOL=./obj-ia32/feeder_zesto.so
    ZESTOCFG=../config/A.cfg
    BENCHMARK_CFG_FILE=benchmarks.cfg

    CMD_LINE="setarch i686 -BR ./obj-ia32/harness \
        -benchmark_cfg ${BENCHMARK_CFG_FILE} \
        -pin ${PIN} \
        -pause_tool 1 \
        -xyzzy \
        -t \
        ${PINTOOL} \
        -num_cores 1 \
        -s \
        -config ${ZESTOCFG}"

    echo ${CMD_LINE}
    ${CMD_LINE}

There are two things to notice. First, benchmarks.cfg chooses which program to simulate. Second, A.cfg at ../config/ sets up the simulation parameters. This particular file models Intel's Atom processor. You should already be familiar with some of the knobs in that file. For example, search for bpred and you can see that the Atom model is set up to simulate a 2-level gshare predictor, very similar to what you implemented in Homework 1.

Now you are ready to run the simulator:

    pintool$ ./run.sh

It will finish in a couple of minutes. The simulation output is in sim.out. You can see simulator statistics about each pipeline stage, instruction breakdowns, as well as various cache and memory stats. You will want to spend some time looking through this file and the config file (A.cfg) to understand the output data, some of which you will need for the rest of the homework.
e. Run SPEC Benchmarks

Now you are ready to run full-blown SPEC benchmarks using XIOSim. Before we modify the script, make a directory outside your XIOSim repository for output files, e.g. mkdir /your/path/to/hw3/spec_out. Switch to the XIOSim/scripts directory. First modify line #7 in spec.py to specdir = "/home/cs246/cpu2006". Then make sure ./run_spec.py knows about your new output directory. I've summarized the changes in that file for you below:

    Line #    Change to
    8         RUN_DIR_ROOT = "/your/path/to/hw3/spec_out"
    9         RESULT_DIR = "/your/path/to/hw3/spec_out"
    10        CONFIG_FILE = "config/a.cfg"

After these edits, you can just execute ./run_spec.py to run the simulation for benchmark 401.bzip2 with input chicken:

    scripts$ nohup ./run_spec.py &

nohup will keep your job running in the background even if you log out of the machine. Currently we simulate 100M instructions, which will take around 30 minutes per run. Note that XIOSim requires 2 threads per run. We strongly recommend running at most 3 jobs (6 threads in total) at a time. Before you start, do make sure there are at least two cores idling. Otherwise, you will grind not only your jobs, but everyone else's on the machine, to a halt. You can use top to check whether there are jobs already running.

After the simulation finishes, you can check the simulation output file (*.sim.out) located at /your/path/to/hw3/spec_out/. It reports that the execution time of the sampled 100M instructions is 62834 us (sim time).

For this homework, you need to present your results for two SPEC benchmarks, 401.bzip2 and 429.mcf. To run 429.mcf, just change the last line in run_spec.py to RunSPECBenchmark("429.mcf.inp").
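Once you start running several configurations, pulling the sim-time stat out of each *.sim.out by hand gets tedious. Here is a minimal Python sketch for extracting it; the "sim_time" label in the pattern is an assumption (the exact stat name may differ between XIOSim versions), so inspect your own output file and adjust it:

```python
import re

def sim_time_us(sim_out_path):
    """Return the simulated execution time (in us) from a XIOSim output file.

    Assumes a stat line somewhere in the file of the hypothetical form
    'sim_time: 62834' -- check your own sim.out and adjust the pattern
    to the actual stat name before relying on this.
    """
    pattern = re.compile(r"sim_time\D*([\d.]+)")
    with open(sim_out_path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                return float(match.group(1))
    return None  # stat not found; the label probably needs adjusting
```

You can then call sim_time_us() on each output file in your sweep and tabulate the results.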
2 Execution Time Decomposition [30 Points]

In this homework, you will first use XIOSim to reproduce the execution time decomposition from Doug Burger's paper [1].

A. Read the paper and understand how to quantify processor time, latency time, and bandwidth time;
B. Run simulations to generate f_P, f_L, and f_B for 401.bzip2 and 429.mcf;
C. Plot the breakdowns similar to Figure 3 in the paper and explain your findings.

You do not need to change or recompile the simulator for this problem. You may need to change certain knobs in the config files to run the simulations with different assumptions, listed below:

    baseline:
        change nothing, just use A.cfg
    every request hits in the L1 data cache:
        core_cfg.exec_cfg.dcache_cfg.magic_hit_rate : "1.0"
    infinite bandwidth between LLC and memory:
        uncore_cfg.fsb_cfg.magic : "true"
        uncore_cfg.dram_cfg.dram_config : "simplesdram-infbw:4:4:35:11.25:11.25:11.25:11.25:64"

Here, the "."-s separate the different sections in A.cfg. The run_spec.py script is set up to take config file replacements in this format (check out line 68), so it's much easier to automate your parameter sweeps. Of course, if you don't trust shady Python scripts, you are welcome to change the config file by hand.

3 Effect of Increasing Frequency [15 Points]

The paper mentions that a faster clock speed will reduce processor time but increase latency and bandwidth times. You can test whether the statement is true or not with the help of the simulator.
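For reference while working on Part B of Section 2: given the sim times of the three runs, one plausible mapping onto the paper's quantities is that the perfect-L1 run measures processor time, the infinite-bandwidth run measures processor plus latency time, and the rest of the baseline is bandwidth time. The sketch below encodes that mapping; it is an assumption on my part, so verify it against the paper's own definitions before using it:

```python
def time_breakdown(t_baseline, t_inf_bw, t_perfect_l1):
    """Decompose total execution time into processor, latency, and
    bandwidth fractions (f_P, f_L, f_B) that sum to 1.

    t_baseline   -- sim time of the unmodified A.cfg run
    t_inf_bw     -- sim time with infinite LLC-memory bandwidth
    t_perfect_l1 -- sim time with magic_hit_rate = 1.0 (perfect L1)

    Assumed mapping (check against the paper): perfect L1 removes all
    memory stalls, so it isolates processor time; infinite bandwidth
    removes only bandwidth stalls, so the gap above perfect L1 is
    latency time; whatever remains of the baseline is bandwidth time.
    """
    f_p = t_perfect_l1 / t_baseline               # processor fraction
    f_l = (t_inf_bw - t_perfect_l1) / t_baseline  # latency fraction
    f_b = (t_baseline - t_inf_bw) / t_baseline    # bandwidth fraction
    return f_p, f_l, f_b
```

For example, sim times of 100, 80, and 50 (in any consistent unit) would give a 50/30/20 split.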
A. Change the knob core_cfg.core_clock in A.cfg from 1600 MHz to 3200 MHz;
B. Repeat the steps in Section 2 to generate the time breakdowns;
C. Plot the breakdowns, compare your results with the ones running at 1.6 GHz, and explain your findings.

4 Power and Energy [5 Points]

Dynamic power can be estimated using Equation 1:

    P = alpha * C * V^2 * f        (1)

where alpha is an activity factor, C is capacitance, V is voltage, and f is frequency. Using Equation 1, and assuming constant activity, capacitance, and voltage, increasing frequency from 1.6 GHz to 3.2 GHz also doubles dynamic power consumption. We ignore static and leakage power for this homework.

Energy is equal to power multiplied by time. Based on our assumption that dynamic power doubles from 1.6 GHz to 3.2 GHz, compare the dynamic energy consumption for the two runs (1.6 GHz and 3.2 GHz) for each benchmark. Explain your findings.

In order to simulate power, you need to add system_cfg.simulate_power : "true" to the list of replacements in your A.cfg file.

5 Dynamic Voltage Frequency Scaling [50 Points]

Dynamic Voltage and Frequency Scaling (DVFS) is a power management technique that dynamically adjusts voltage and frequency based on the runtime behavior of applications to reduce power/energy consumption. Contemporary processors feature a variety of hardware power management mechanisms to adjust voltage and frequency. In this part, you will implement a frequency scaling policy which adapts the core's frequency to program behavior. The intuition behind dynamic frequency scaling is that if the core is stalling due to cache misses, lowering frequency can reduce dynamic power without an effect on performance.

XIOSim provides a modular interface for frequency scaling policies. You can find one simple example, sample.cpp, inside the XIOSim/ZCOMPS-dvfs/ directory. Currently, the scheduler is
really simple: every tick, it compares the dynamic IPC with a constant (0.6 in this case, out of a theoretical maximum of 2.0 on Atom). Higher IPC switches to the maximum frequency (3.2 GHz) and lower to the minimum (1.6 GHz). We already ran this simple policy on a test microbenchmark; results are shown in Figure 1.

[Figure 1: Normalized power and performance running at different frequencies (normalized performance on the x-axis, normalized power on the y-axis; series: Simple-DFS, 1.6G, 3.2G, ideal). Power and performance results running at 1.6 GHz are the baseline for normalization.]

Running at 3.2 GHz doubles power consumption since frequency doubles, but we only get around a 1.8x performance benefit. Using our Simple-DFS policy, the performance benefit is almost linear in the additional power consumption. The ideal case would get the best of both worlds: the performance of running at 3.2 GHz and the power of running at 1.6 GHz. Although that case is idealized, we want to get as close to it as possible.

YOUR JOB: Design and implement your own scaling policy in sample.cpp to get closer to the ideal case. Table 1 shows some stats from XIOSim, which may (or may not) be useful for your policy. You need to add #include "zesto-uncore.h" to XIOSim/zesto-dvfs.cpp if you need LLC stats, or add #include "zesto-fetch.h" and #include "zesto-bpred.h" if you
need branch predictor stats.

    Stats                                Notes
    core->stat.commit_insn               committed instructions
    core->sim_cycle                      simulated core cycles
    uncore->llc->stat.core_lookups[0]    number of lookups in the LLC
    uncore->llc->stat.core_misses[0]     number of misses in the LLC
    core->fetch->bpred->num_updates      number of predictions the branch predictor makes
    core->fetch->bpred->num_hits         number of correct branch predictions

    Table 1: Stats you may need for your scaling policy.

You need to add

    uncore_cfg.dvfs_cfg.config : "sample"
    uncore_cfg.dvfs_cfg.interval : 1000000

to the A.cfg file replacements in your run scripts. dvfs_cfg.config sets the name of the DVFS policy to use; dvfs_cfg.interval sets how frequently we update the frequency (in cycles).

Rebuild the simulator every time you change sample.cpp: do a make under XIOSim followed by a make under XIOSim/pintool. (Yeah, I know this two-step build thing is ugly. Fixing it is on my to-do list, I promise.)

To quickly test your policy, you can use the step micro-benchmark in the XIOSim directory. Just change pintool/benchmarks.cfg to point to ../tests/step, and run pintool/run.sh as in Section 1 (d). For the testing run, you need to set dvfs_cfg.interval to 20000, since the micro-benchmark is relatively short. After you test with the simple benchmark, you can run your DVFS policy with SPEC and a longer DVFS interval.

Run the two SPEC benchmarks in the following three cases:

A. fixed 1.6 GHz
B. fixed 3.2 GHz
C. dynamic frequency using your DFS policy

Generate figures similar to Figure 1 for each benchmark and explain your findings.
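Your actual policy has to live in sample.cpp, but the decision logic is easy to prototype separately before wiring it into C++. Below is a standalone Python sketch (Python purely for illustration; the stat names mirror Table 1, and both thresholds are made-up starting points, not tuned values) of a policy that drops to the low frequency only when an interval looks memory-bound:

```python
FREQ_MIN_MHZ = 1600
FREQ_MAX_MHZ = 3200

def choose_frequency(insn, cycles, llc_lookups, llc_misses,
                     ipc_threshold=0.6, miss_rate_threshold=0.3):
    """Pick a core frequency for the next DVFS interval.

    Inputs are per-interval deltas of Table 1 stats (committed
    instructions, core cycles, LLC lookups, LLC misses). The two
    thresholds are illustrative guesses you would tune on the step
    micro-benchmark. Low IPC combined with a high LLC miss rate
    suggests the core is stalled on memory, so lowering frequency
    should cost little performance while saving dynamic power.
    """
    ipc = insn / cycles if cycles else 0.0
    miss_rate = llc_misses / llc_lookups if llc_lookups else 0.0
    if ipc < ipc_threshold and miss_rate > miss_rate_threshold:
        return FREQ_MIN_MHZ   # memory-bound interval: save power
    return FREQ_MAX_MHZ       # compute-bound interval: run fast
```

Note the difference from Simple-DFS: IPC alone can be low for reasons other than memory stalls (e.g. branch mispredictions), so this sketch only slows down when the LLC miss rate corroborates the low IPC. Translating it to sample.cpp is mostly a matter of reading the Table 1 counters at each DVFS tick and remembering the previous values to form deltas.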
6 Submission Instructions

Please present your results/figures/findings for all the problems in a single PDF file. Send the PDF file along with your frequency scaling policy file (sample.cpp) to skanev@eecs.harvard.edu.

References

[1] Doug Burger, James R. Goodman, and Alain Kägi. Memory bandwidth limitations of future microprocessors. In Computer Architecture (ISCA), 1996.
[2] Svilen Kanev, Gu-Yeon Wei, and David Brooks. XIOSim: power-performance modeling of mobile x86 cores. In Low Power Electronics and Design (ISLPED), 2012.

Updated April 3, 2015, Svilen Kanev