Precise and Accurate Processor Simulation



Similar documents
Enterprise Applications

Operating System Impact on SMT Architecture

Performance Characterization of SPEC CPU2006 Integer Benchmarks on x Architecture

Intel Pentium 4 Processor on 90nm Technology

A Performance Counter Architecture for Computing Accurate CPI Components


"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

System Virtual Machines

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

x86 ISA Modifications to support Virtual Machines

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Virtualization. Pradipta De

RUNAHEAD EXECUTION: AN EFFECTIVE ALTERNATIVE TO LARGE INSTRUCTION WINDOWS

Virtualization. Clothing the Wolf in Wool. Wednesday, April 17, 13

WAR: Write After Read

Branch Prediction, Instruction-Window Size, and Cache Size: Performance Tradeoffs and Simulation Techniques

Understanding Hardware Transactional Memory

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Categories and Subject Descriptors C.1.1 [Processor Architecture]: Single Data Stream Architectures. General Terms Performance, Design.

The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures

Putting it all together: Intel Nehalem.

Energy-Efficient, High-Performance Heterogeneous Core Design

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

Memory Access Control in Multiprocessor for Real-time Systems with Mixed Criticality

Hardware Based Virtualization Technologies. Elsie Wahlig Platform Software Architect

CS 147: Computer Systems Performance Analysis

Introduction to Virtual Machines

CAVA: Using Checkpoint-Assisted Value Prediction to Hide L2 Misses

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

Microkernels, virtualization, exokernels. Tutorial 1 CSC469

COS 318: Operating Systems. I/O Device and Drivers. Input and Output. Definitions and General Method. Revisit Hardware

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Unified Architectural Support for Soft-Error Protection or Software Bug Detection

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

Performance Impacts of Non-blocking Caches in Out-of-order Processors

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

A Unified View of Virtual Machines

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

CMSC 611: Advanced Computer Architecture

Performance evaluation

Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation

COS 318: Operating Systems. Virtual Memory and Address Translation

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

POWER8 Performance Analysis

OC By Arsene Fansi T. POLIMI

LSI SAS inside 60% of servers. 21 million LSI SAS & MegaRAID solutions shipped over last 3 years. 9 out of 10 top server vendors use MegaRAID

Chapter 3 Operating-System Structures

Thread Level Parallelism II: Multithreading

COS 318: Operating Systems. Virtual Machine Monitors

Architectures and Platforms

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

Thread level parallelism

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

Going Linux on Massive Multicore

State-Machine Replication

Computer Science 146/246 Homework #3

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Multi-/Many-core Modeling at Freescale

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors

Computer Organization and Components

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Programming for GCSE Topic H: Operating Systems

Performance Management for Cloudbased STC 2012

Chapter 2 System Structures

Interval Simulation: Raising the Level of Abstraction in Architectural Simulation

How To Virtualize A Storage Area Network (San) With Virtualization

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Chapter 12: Multiprocessor Architectures. Lesson 09: Cache Coherence Problem and Cache synchronization solutions Part 1

Introducing EEMBC Cloud and Big Data Server Benchmarks

Datacenter Operating Systems

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

IOMMU: A Detailed view

CS 152 Computer Architecture and Engineering. Lecture 22: Virtual Machines

Distributed Systems. Virtualization. Paul Krzyzanowski

ARM Caches: Giving you enough rope... to shoot yourself in the foot. Marc Zyngier KVM Forum 15

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Runtime Hardware Reconfiguration using Machine Learning

Transcription:

Precise and Accurate Processor Simulation Harold Cain, Kevin Lepak, Brandon Schwartz, and Mikko H. Lipasti University of Wisconsin Madison http://www.ece.wisc.edu/~pharm

Performance Modeling Analytical models Queuing models Simulation Trace-driven Execution-driven Full system Why? Most widely used in academic research Perceived accuracy and precision

Precision, Accuracy, Flexibility Precision How closely simulator matches design Latency, bandwidth, resource occupancy, etc. Accuracy How closely simulation matches reality Requires precision Also requires replication of real-world conditions, inputs Flexibility? Enables exploration of broad design space

Uses for Simulation Academic Research Design Space Exploration Quantitative Tradeoff Analysis Functional validation Late design changes Accuracy??? Performance Validation Flexibility Precision High-level Design Microarchitectural Definition Design and Implementation Verification

Causes of Inaccuracy Many possible causes Software differences Hardware differences System effects Time dilation: interaction with physical world Here, we consider: Operating system code DMA traffic (in paper) Wrong-path effects

Validating Accuracy How do we validate? Against real hardware with perf. counters Different input since O/S now present Against HDL Same input as timer model, same error? Without full system simulation, cannot: Replicate runtime environment Cannot really validate accuracy Compensating errors mask inaccuracy Hence: build simulator that does not cheat

PharmSim Overview PharmSim -OOO Core -Gigaplane Ethernet Device simulation, etc. from SimOS-PPC PharmSim replaces functional simulators Full OOO core model, values in rename registers Based on SimpleMP [Rajwar] Adds VM, TLB, exceptions, interrupts, barriers, etc. SimOS-PPC -AIX 4.3.1 -Disk driver -E net driver Block Simple

PharmSim Pipeline Fetch Translate Decode Execute Mem Commit Substantially similar to IBM Power4 Some instructions cracked (1:2 expansion) Others (e.g. lmw) microcode stream Mem Stage Interface to 2-level cache model Sun Gigaplane XB snoopy MP coherence Caches contain values, must remain coherent No cheating! No flat memory model for reference/redirect

Operating System Effects Fairly well-understood for commercial: Must account for O/S references For SPEC? Widely accepted: Safe to ignore O/S paths Most popular tool (Simplescalar) Intercepts system calls Emulates on host, updates flat memory Returns magically with caches intact Is this really OK?

Operating System Effects References Modeled User-mode only User + Shared library User + Sh Lib + O/S User + Sh Lib + O/S + cache control ops Example Atom Simplescalar with static link H/W bus trace PharmSim

Operating System Effects 5.8x Dramatic error (5.8x in mcf, 2-3x commonplace) Note compensating errors (e.g. crafty, gzip, perl) IPC error > 100% (more detail at ISCA)

Wrong-path Execution Multiple effects on unarchitected state Pollute/prefetch I-cache, D-cache, TLB Pollute/train branch predictor (BHR, PHT, RAS) PharmSim: BHR is updated and repaired PHT is not updated speculatively RAS is updated, no repair No speculative TLB fill How can we filter wrong-path instructions? No cheating : don t know branch outcomes 25% - 40% instructions are wrong-path

Wrong-path Memory Stalls %time stalled for memory 50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% exe-driven trace-driven crafty gap gcc gzip mcf parser perlbmk vortex specjbb specweb benchmark Minor effect: better or worse

Wrong-path RAS Accuracy RAS Prediction Accuracy 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% exe-driven trace-driven crafty gap gcc gzip mcf parser perlbmk vortex specjbb specweb benchmark Prediction accuracy degrades up to 29% Could add fixup logic

Wrong-path IPC IPC 3 2.5 2 1.5 1 0.5 0 exe-driven trace-driven crafty gap gcc gzip mcf parser perlbmk vortex specjbb specweb benchmark Negligible effect (0.9%) RAS mispredictions overlapped

Summary PharmSim Simulator that does not cheat Can be used to validate assumptions, simplifications, abstractions Evaluated three effects on accuracy O/S: dramatic error, even for SPECINT DMA: not important for uniprocessors MP, bus-constrained results TBD Wrong path: unimportant

Conclusions Ignoring O/S effects fraught with danger Should always model O/S effects Trace-driven vs. execution-driven Traces with O/S much better Invest in Trace quality vs. Complexity of execution-driven simulation Precision without accuracy? Of questionable value Validation difficult due to compensating errors Hard to know if model is precise or accurate

Wrong-path Instructions %executed inst on wrong path 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% crafty gap gcc gzip mcf parser perlbmk vortex specjbb specweb benchmark Aggressive core model; 25%-40% wrong-path

DMA Traffic How do we support DMA? No flat memory image in simulator Lines may be in caches Invalidate Read Must use existing coherence Everything has to work correctly No subtle coherence bugs How much does this matter? Affects cache miss rates Introduces bus contention

DMA Traffic PharmSim incorporates accurate DMA engine: Issues bus invalidates, snoops Concurrent data transfer: No magic flat memory Bottom line: Unimportant for SPEC Unimportant for SPECWEB, SPECJBB Others in progress Contrived multiprogrammed workload 4.8% of all coherence traffic due to I/O, 1% IPC effect Results understated due to overbuilt MP bus MP workloads likely much more sensitive Additional evaluation in progress