Computer Science 146/246 Homework #3




Due 11:59 P.M. Sunday, April 12th, 2015

We played with a Pin-based cache simulator for Homework 2. This homework will prepare you to set up and run a detailed microarchitecture-level simulator, as well as to modify the simulator to facilitate your own research. We will use XIOSim, a Pin-based x86 microarchitecture-level simulator, for this homework. You can find more details about (an older version of) XIOSim in this paper [2], or you can just check it out on GitHub.(*)

1 Download and Configure XIOSim

a. Download XIOSim

You can get XIOSim from GitHub:

    $ git clone https://github.com/s-kanev/xiosim.git

b. Set up your environment

The build and the following scripts rely on these variables to know which XIOSim installation to look for and how to satisfy dependencies. Just execute:

    $ export BOOST_HOME=/home/cs246/boost_1_54_0
    $ export XIOSIM_TREE=/your/path/to/XIOSim
    $ export XIOSIM_INSTALL=${XIOSIM_TREE}/pintool/obj-ia32

You can add these to your ~/.bashrc file so you don't have to type them out every time you log in.

c. Build XIOSim

    $ cd pintool
    $ make

(*) Full disclosure: I'm the main author of XIOSim. So, if you have any feedback, suggestions, bug reports, or curse words you want to throw at me, I'd be more than happy to listen.

d. Run Your First Test Program

Let's test the simulator with a simple benchmark. There is a script called run.sh under the XIOSim/pintool directory which sets up the simulated architecture configuration and runs the simulation. It looks like this:

    PIN=${PIN_ROOT}/pin.sh
    PINTOOL=./obj-ia32/feeder_zesto.so
    ZESTOCFG=../config/A.cfg
    BENCHMARK_CFG_FILE=benchmarks.cfg

    CMD_LINE="setarch i686 -BR ./obj-ia32/harness \
        -benchmark_cfg ${BENCHMARK_CFG_FILE} \
        -pin ${PIN} \
        -pause_tool 1 \
        -xyzzy \
        -t \
        ${PINTOOL} \
        -num_cores 1 \
        -s \
        -config ${ZESTOCFG}"

    echo ${CMD_LINE}
    ${CMD_LINE}

There are two things to notice. First, benchmarks.cfg chooses which program to simulate. Then, A.cfg at ../config/ sets up the simulation parameters. This particular file models Intel's Atom processor. You should already be familiar with some of the knobs in that file. For example, search for bpred and you can see that the Atom model is set up to simulate a 2-level gshare predictor, very similar to what you implemented in Homework 1.

Now you are ready to run the simulator:

    pintool $ ./run.sh

It will finish in a couple of minutes. The simulation output is in sim.out. You can see the simulator statistics about each pipeline stage, instruction breakdowns, as well as various cache and memory stats. You will want to spend some time looking through this file and the config file (A.cfg) to understand the output data, some of which you will need for the rest of the homework.
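Once you start comparing runs, pulling individual numbers out of sim.out by hand gets tedious. Below is a small, hypothetical helper for that; it only assumes a statistic's name and its numeric value share a line, which you should verify against your own sim.out (the stat names below are placeholders, not guaranteed XIOSim names):

```python
import re

def extract_stat(text, stat_name):
    """Return the first number following stat_name in text, or None.

    Assumes one statistic per line, with the value somewhere after the
    name -- check your own sim.out for the exact stat names before
    relying on this.
    """
    for line in text.splitlines():
        if stat_name in line:
            m = re.search(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?",
                          line.split(stat_name, 1)[1])
            if m:
                return float(m.group(0))
    return None

# Hypothetical sim.out fragment, not real simulator output:
sample = "sim_time 62834 # simulated time (us)"
print(extract_stat(sample, "sim_time"))  # -> 62834.0
```

In practice you would read the whole file once (`open(path).read()`) and call the helper for each statistic you need.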

e. Run SPEC Benchmarks

Now you are ready to run full-blown SPEC benchmarks using XIOSim. Before we modify the scripts, make a directory outside your XIOSim repository for output files, e.g. mkdir /your/path/to/hw3/spec_out. Switch to the XIOSim/scripts directory. First modify line #7 in spec.py to specdir = "/home/cs246/cpu2006". Then make sure ./run_spec.py knows about your new output directory. I've summarized the changes in that file for you below:

    Line #    Change to
    8         RUN_DIR_ROOT = "/your/path/to/hw3/spec_out"
    9         RESULT_DIR = "/your/path/to/hw3/spec_out"
    10        CONFIG_FILE = "config/A.cfg"

After these edits, you can just execute ./run_spec.py to run the simulation for benchmark 401.bzip2 with input chicken:

    scripts $ nohup ./run_spec.py &

nohup will keep your job running in the background even if you log out of the machine. Currently we simulate 100M instructions, which will take around 30 mins per run. Note that XIOSim requires 2 threads per run. We strongly recommend running at most 3 jobs (6 threads in total) at a time. Before you start, do make sure there are at least two cores idling. Otherwise, you will grind not only your jobs, but everyone else's on the machine, to a halt. You can use top to check whether there are jobs already running.

After the simulation finishes, you can check the simulation output file (*.sim.out) located at /your/path/to/hw3/spec_out/. It reports that the execution time of the sampled 100M instructions is 62834 us (sim time).

For this homework, you need to present your results for two SPEC benchmarks, 401.bzip2 and 429.mcf. To run 429.mcf, just change the last line in run_spec.py to RunSPECBenchmark("429.mcf.inp").

2 Execution Time Decomposition [30 Points]

In this homework, you will first use XIOSim to reproduce the execution time decomposition from Doug Burger's paper [1].

A. Read the paper and understand how to quantify processor time, latency time, and bandwidth time;
B. Run simulations to generate f_p, f_L, and f_B for 401.bzip2 and 429.mcf;
C. Plot the breakdowns similar to Figure 3 in the paper and explain your findings.

You do not need to change or recompile the simulator for this problem. You may need to change certain knobs in the config files to run the simulations under the different assumptions, listed below.

    baseline:
        change nothing, just use A.cfg
    every request hits in the L1 data cache:
        core_cfg.exec_cfg.dcache_cfg.magic_hit_rate : "1.0"
    infinite bandwidth between LLC and memory:
        uncore_cfg.fsb_cfg.magic : "true"
        uncore_cfg.dram_cfg.dram_config : "simplesdram-infbw:4:4:35:11.25:11.25:11.25:11.25:64"

Here, the "."-s separate the different sections in A.cfg. The run_spec.py script is set up to take config file replacements in this format (check out line 68), so it's much easier to automate your parameter sweeps. Of course, if you don't trust shady Python scripts, you are welcome to change the config file by hand.

3 Effect of Increasing Frequency [15 Points]

The paper mentions that a faster clock speed will reduce processor time but increase latency and bandwidth times. You can test whether this statement is true with the help of the simulator.
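Both here and in Section 2, the three simulation runs have to be combined into the fractions f_p, f_L, and f_B. A minimal sketch of that arithmetic follows; the mapping of the config tweaks onto Burger's three terms is my reading of the setup, so verify it against the paper's definitions before using it, and the sim times plugged in at the bottom are hypothetical, not real results:

```python
def decompose(t_base, t_magic_l1, t_inf_bw):
    """Split baseline execution time into processor/latency/bandwidth fractions.

    One plausible mapping of the three configurations onto Burger's
    decomposition (verify against the paper):
      t_magic_l1 -- every access hits in the L1, approximating processor time
      t_inf_bw   -- real latencies, but infinite memory bandwidth
      t_base     -- unmodified A.cfg
    """
    f_p = t_magic_l1 / t_base              # processor fraction
    f_l = (t_inf_bw - t_magic_l1) / t_base # latency fraction
    f_b = (t_base - t_inf_bw) / t_base     # bandwidth fraction
    return f_p, f_l, f_b

# Hypothetical sim times in microseconds:
f_p, f_l, f_b = decompose(t_base=62834.0, t_magic_l1=40000.0, t_inf_bw=55000.0)
print(f"f_p={f_p:.2f} f_L={f_l:.2f} f_B={f_b:.2f}")  # -> f_p=0.64 f_L=0.24 f_B=0.12
```

By construction the three fractions sum to 1, which is a handy sanity check on your measured times.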

A. Change the knob core_cfg.core_clock in A.cfg from 1600 MHz to 3200 MHz;
B. Repeat the steps in Section 2 to generate the time breakdowns;
C. Plot the breakdowns, compare your results with the ones running at 1.6 GHz, and explain your findings.

4 Power and Energy [5 Points]

Dynamic power can be estimated using Equation 1:

    P = α C V^2 f    (1)

where α is an activity factor, C is capacitance, V is voltage, and f is frequency. Using Equation 1, and assuming constant activity, capacitance, and voltage, increasing frequency from 1.6 GHz to 3.2 GHz also doubles dynamic power consumption. We ignore static and leakage power for this homework.

Energy is equal to power multiplied by time. Based on our assumption that dynamic power doubles from 1.6 GHz to 3.2 GHz, compare the dynamic energy consumption of the two runs (1.6 GHz and 3.2 GHz) for each benchmark. Explain your findings.

In order to simulate power, you need to add

    system_cfg.simulate_power : "true"

to the list of replacements in your A.cfg file.

5 Dynamic Voltage Frequency Scaling [50 Points]

Dynamic Voltage and Frequency Scaling (DVFS) is a power management technique that dynamically adjusts voltage and frequency based on the runtime behavior of applications to reduce power/energy consumption. Contemporary processors feature a variety of hardware power management mechanisms to adjust voltage and frequency. In this part, you will implement a frequency scaling policy which adapts the core's frequency to program behavior. The intuition behind dynamic frequency scaling is that if the core is stalling due to cache misses, lowering frequency can reduce dynamic power without an effect on performance.

XIOSim provides a modular interface for frequency scaling policies. You can find one simple example, sample.cpp, inside the XIOSim/ZCOMPS-dvfs/ directory. Currently, the scheduler is

really simple: every tick, it compares the dynamic IPC with a constant (0.6 in this case, out of a theoretical maximum of 2.0 on Atom). Higher IPC switches to the maximum frequency (3.2 GHz) and lower to the minimum (1.6 GHz). We already ran this simple policy on a test microbenchmark; results are shown in Figure 1.

[Figure 1 plots normalized power (y-axis) against normalized performance (x-axis) for four points: 1.6G, 3.2G, Simple-DFS, and ideal.]

Figure 1: Normalized power and performance running at different frequencies. Power and performance results running at 1.6 GHz are the baseline for normalization.

Running at 3.2 GHz doubles power consumption since frequency doubles, but we only get around a 1.8x performance benefit. Using our Simple-DFS policy, the performance benefit is almost linear in the additional power consumption. The ideal case would get the best of both worlds: the performance of running at 3.2 GHz and the power of running at 1.6 GHz. Although that case is idealized, we want to get as close to it as possible.

YOUR JOB: Design and implement your own scaling policy in sample.cpp to get closer to the ideal case. Table 1 shows some stats from XIOSim, which may (or may not) be useful for your policy. You need to add #include "zesto-uncore.h" to XIOSim/zesto-dvfs.cpp if you need LLC stats, or add #include "zesto-fetch.h" and #include "zesto-bpred.h" if you

need branch predictor stats.

    Stat                                  Notes
    core->stat.commit_insn                committed instructions
    core->sim_cycle                       simulated core cycles
    uncore->llc->stat.core_lookups[0]     number of lookups in the LLC
    uncore->llc->stat.core_misses[0]      number of misses in the LLC
    core->fetch->bpred->num_updates       number of predictions the branch predictor makes
    core->fetch->bpred->num_hits          number of correct branch predictions

Table 1: Stats you may need for your scaling policy.

You need to add

    uncore_cfg.dvfs_cfg.config : "sample"
    uncore_cfg.dvfs_cfg.interval : 1000000

to the A.cfg file replacements in your run scripts. dvfs_cfg.config sets the name of the DVFS policy to use; dvfs_cfg.interval sets how frequently we update the frequency (in cycles).

Rebuild the simulator every time you change sample.cpp: do a make under XIOSim followed by a make under XIOSim/pintool. (Yeah, I know this two-step build thing is ugly. Fixing it is on my to-do list, I promise.)

To quickly test your policy, you can use the step micro-benchmark in the XIOSim directory. Just change pintool/benchmarks.cfg to point to ../tests/step, and run pintool/run.sh as in Section 1 (d). For the testing run, you need to set dvfs_cfg.interval to 20000, since the micro-benchmark is relatively short.

After you test with the simple benchmark, you can run your DVFS policy with SPEC and a longer DVFS interval. Run the two SPEC benchmarks in the following three cases:

A. fixed 1.6 GHz
B. fixed 3.2 GHz
C. dynamic frequency using your DFS policy

Generate figures similar to Figure 1 for each benchmark and explain your findings.
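Before touching C++, the decision rule of a candidate policy can be prototyped on paper or in a few lines of script. The sketch below is plain Python for clarity only; the real policy must be implemented in sample.cpp against the Table 1 stats (e.g. core->stat.commit_insn, uncore->llc->stat.core_misses[0]), and both thresholds are made-up starting points, not tuned values:

```python
F_MIN, F_MAX = 1600, 3200  # MHz, the two fixed points used in this homework

def next_frequency(d_insn, d_cycles, d_llc_misses):
    """Pick a frequency from per-interval deltas of committed instructions,
    cycles, and LLC misses. A high miss rate means the core is stalling on
    memory, so lowering frequency should cost little performance; otherwise
    run fast. Thresholds (5.0 MPKI, 0.6 IPC) are illustrative, not tuned."""
    if d_insn == 0:
        return F_MIN                       # nothing committing: stay slow
    mpki = 1000.0 * d_llc_misses / d_insn  # LLC misses per kilo-instruction
    ipc = d_insn / d_cycles
    if mpki > 5.0 or ipc < 0.6:            # memory-bound or low-ILP interval
        return F_MIN
    return F_MAX

# A compute-bound interval: high IPC, almost no LLC misses -> run fast.
print(next_frequency(d_insn=1_000_000, d_cycles=800_000, d_llc_misses=200))  # -> 3200
```

When porting this to sample.cpp, remember the stats in Table 1 are cumulative counters, so your policy has to keep the previous interval's values and work with the deltas.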

6 Submission Instructions

Please present your results/figures/findings for all the problems in a single PDF file. Send the PDF file along with your frequency scaling policy file (sample.cpp) to skanev@eecs.harvard.edu.

References

[1] Doug Burger, James R. Goodman, and Alain Kägi. Memory bandwidth limitations of future microprocessors. In Computer Architecture (ISCA), 1996.
[2] Svilen Kanev, Gu-Yeon Wei, and David Brooks. XIOSim: power-performance modeling of mobile x86 cores. In Low Power Electronics and Design (ISLPED), 2012.

Updated April 3, 2015, Svilen Kanev