Wiggins/Redstone: An On-line Program Specializer



Similar documents
VLIW Processors. VLIW Processors

MAQAO Performance Analysis and Optimization Tool

Compilers and Tools for Software Stack Optimisation

Sequential Performance Analysis with Callgrind and KCachegrind

Virtual Machine Learning: Thinking Like a Computer Architect

Full and Para Virtualization

General Introduction

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

CS 377: Operating Systems. Outline. A review of what you ve learned, and how it applies to a real operating system. Lecture 25 - Linux Case Study

HOTPATH VM. An Effective JIT Compiler for Resource-constrained Devices

Android Architecture. Alexandra Harrison & Jake Saxton

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

Instruction Set Design

Kernel Types System Calls. Operating Systems. Autumn 2013 CS4023

Operating System Overview. Otto J. Anshus

ELEC 377. Operating Systems. Week 1 Class 3

Chapter 3. Operating Systems

Performance Analysis and Optimization Tool

Replication on Virtual Machines

Sequential Performance Analysis with Callgrind and KCachegrind

Multiprogramming. IT 3123 Hardware and Software Concepts. Program Dispatching. Multiprogramming. Program Dispatching. Program Dispatching

Process Description and Control william stallings, maurizio pizzonia - sistemi operativi

Jonathan Worthington Scarborough Linux User Group

Java Virtual Machine: the key for accurated memory prefetching

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design


Chapter 6, The Operating System Machine Level

PERFORMANCE TOOLS DEVELOPMENTS

picojava TM : A Hardware Implementation of the Java Virtual Machine

Chapter 3 Operating-System Structures

Introduction to Virtual Machines

Chapter 3: Operating-System Structures. Common System Components

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

CS5460: Operating Systems

Virtual Machines. Virtual Machines

Operating System Impact on SMT Architecture

Embedded Software development Process and Tools: Lesson-4 Linking and Locating Software

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Mutual Exclusion using Monitors

Performance Measurement of Dynamically Compiled Java Executions

Fachbereich Informatik und Elektrotechnik SunSPOT. Ubiquitous Computing. Ubiquitous Computing, Helmut Dispert

Oracle Solaris Studio Code Analyzer

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

CIS570 Modern Programming Language Implementation. Office hours: TDB 605 Levine

Eugene Tsyrklevich. Ozone HIPS: Unbreakable Windows

Effective Java Programming. efficient software development

Multi-core Programming System Overview

Virtualization Technology. Zhiming Shen

Perf Tool: Performance Analysis Tool for Linux

An Easier Way for Cross-Platform Data Acquisition Application Development

COS 318: Operating Systems. Virtual Machine Monitors

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda

Levels of Programming Languages. Gerald Penn CSC 324

On Demand Loading of Code in MMUless Embedded System

1 The Java Virtual Machine

Chapter 2 System Structures

CSE 120 Principles of Operating Systems. Modules, Interfaces, Structure

Virtualization. Clothing the Wolf in Wool. Wednesday, April 17, 13

x86 ISA Modifications to support Virtual Machines

11.1 inspectit inspectit

evm Virtualization Platform for Windows

Optimizing compilers. CS Modern Compilers: Theory and Practise. Optimization. Compiler structure. Overview of different optimizations

SDN and Streamlining the Plumbing. Nick McKeown Stanford University

Introduction. What is an Operating System?

Embedded Systems. 6. Real-Time Operating Systems

for Malware Analysis Daniel Quist Lorie Liebrock New Mexico Tech Los Alamos National Laboratory

PROPOSAL: OCP COMMON LINUX SWITCH DISTRIBUTION. Rob Sherwood and Mansour Karam OCP November 2013

Memory Systems. Static Random Access Memory (SRAM) Cell

CS 3530 Operating Systems. L02 OS Intro Part 1 Dr. Ken Hoganson

Multi-GPU Load Balancing for Simulation and Rendering

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Example of Standard API

System Structures. Services Interface Structure

Can You Trust Your JVM Diagnostic Tools?

Computer Organization and Components

Decomposition into Parts. Software Engineering, Lecture 4. Data and Function Cohesion. Allocation of Functions and Data. Component Interfaces

CS3600 SYSTEMS AND NETWORKS

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Objectives. Chapter 2: Operating-System Structures. Operating System Services (Cont.) Operating System Services. Operating System Services (Cont.

Gary Frost AMD Java Labs

Fastboot Techniques for x86 Architectures. Marcus Bortel Field Application Engineer QNX Software Systems

Building Applications Using Micro Focus COBOL

Infrastructure Matters: POWER8 vs. Xeon x86

Visualizing gem5 via ARM DS-5 Streamline. Dam Sunwoo ARM R&D December 2012

Transcription:

Wiggins/Redstone: An On-line Program Specializer Dean Deaver Rick Gorton Norm Rubin {dean.deaver,rick.gorton,norm.rubin}@compaq.com Hot Chips 11 Wiggins/Redstone 1

W/R is a Software System That: u Makes arbitrary binary applications run faster without requiring any work from the programmer u Aggressively optimizes/specialize an application for a particular use on a particular machine u Moves optimization/compilation closer to the actual use of the program Hot Chips 11 Wiggins/Redstone 2

W/R is On-line u Executes dynamically while the program is running u Uses path profiles u Modifies in-memory images to take advantage of: Data values Temporal effects Hot Chips 11 Wiggins/Redstone 3

Motivation u Static code optimization (at compile time) Know little about the dynamic behavior of programs Know little about the actual machine Know nothing about the actual data u Feedback directed optimization gets some knowledge of dynamic behavior, but is tricky to use u Prior dynamic approaches not general purpose Hot Chips 11 Wiggins/Redstone 4

W/R Hotspot u Binary images u Paths u Specializes code in many ways u Opts are platform dependent u Starts with the code produced by an optimizing compiler u Java u Procedures (or parts) u Specializes code to remove virtual calls u Opts are platform independent u Starts with the code produced by a JIT Hot Chips 11 Wiggins/Redstone 5

W/R Approach u Input is an arbitrary binary without any special compilation switches or annotations u Load the profiler and optimizer into the application u Do the specialization at run-time u Use value profiling at key points u Optimizations performed are specific to the underlying micro-architecture u Optimizations are also data set specific Hot Chips 11 Wiggins/Redstone 6

W/R Benefits u At run time, automatically (without programmer direction) reorganize, optimize and specialize important dynamic code sequences u Exact knowledge of: Program behavior, phases Program invariants (glacial variables), data u Low overhead Hot Chips 11 Wiggins/Redstone 7

Relationship With Hardware u Hardware engines are starting to optimize code Out of order execution Branch and value prediction Trace processors u W/R uses hardware (performance counters) to direct the software to start building software instruction traces u Dynamic compilation may be required to exploit new hardware u Not either/or software/hardware technique Hot Chips 11 Wiggins/Redstone 8

The W/R System Architecture 1 The agent - A modified loader/launcher that starts the system 2 A low overhead, hardware based sampler 3 A trace builder that finds and instruments parts of a program 4 Optimizer/specializer (works on superblocks) 5 OS independent Windows NT Tru64 UNIX Hot Chips 11 Wiggins/Redstone 9

System Flow While the program is running { 1. Identify a hot instruction 2. Build a trace containing the instruction 3. Instrument the trace 4. Specialize the trace 5. Optimize the trace } Step 1 is hardware, 2-5 are software Hot Chips 11 Wiggins/Redstone 10

Agent u A special loader u Adds code to an image when started. This code contains the profiler and optimizer u The agent is shared over applications u The agent knows about the actual platform, so old programs can run on new platforms u Allows us to add new optimizations to old programs Hot Chips 11 Wiggins/Redstone 11

Sampler u We use a hardware PC sampler to find hot seed instructions The sampler is a source of frequent interrupts Look for frequent values of program counter at Interrupt time Code is based on DCPI u Approach works on out-of-order machines such as 21264 Hot Chips 11 Wiggins/Redstone 12

Trace Builder u Given a seed instruction Copy it and the remainder of the block to a side buffer Add instrumentation code, guards to insure correctness, branch back Patch the image to branch to the copy After the instrumentation code finds the most common successor extend the copy u Copied instructions form a superblock u Effectively a lazy instruction trace constructor Hot Chips 11 Wiggins/Redstone 13

Optimizer/specializer u Specializes "hot" traces using machine-specific information. Introduce guards as necessary u Exploits temporal info u Analyzes what to monitor u Performs architectural and micro-architectural optimizations (byte/word loads and stores on alpha) u Applications will continuously monitor themselves and perform self-improvements whenever necessary Hot Chips 11 Wiggins/Redstone 14

Advantages u The application carries no machine-specific information u Can update the agent to incorporate new optimization techniques as they become available u Programs compiled using generic or 21064 specific features run faster on 21164; 21164 specific programs run faster on 21264,... Hot Chips 11 Wiggins/Redstone 15

Some Data Points u Povray - a freely available rendering package u Image matches.pov 2 billion calls to power(x,y) If you perform three levels of inline on the frequent path you find that y = 8.0 Calls to power() are on the frequent path 95% of the time Hot Chips 11 Wiggins/Redstone 16

Povray Image Hot Chips 11 Wiggins/Redstone 17

How Many Traces? u Typically less than16 traces at a time u Traces contain several hundred instructions u Traces often account for 50-90 of the run time of an image u Traces are removed as the computation evolves Hot Chips 11 Wiggins/Redstone 18

PovRay shapes.pov Demo Hot Chips 11 Wiggins/Redstone 19

Percent of Time on Traces 100% 80% 60% 40% 20% 0% 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 tim e Hot Chips 11 Wiggins/Redstone 20

Characteristics of Traces (shapes.pov) u Povray: 661069 static instructions u Traces 7 traces total 1586 instructions (0.239%) 819 unique instructions (0.123%) Hot Chips 11 Wiggins/Redstone 21

Characteristics of Traces u Inter-procedural Often 2-4 levels deep u Can include one loop But may include many unrolled loops u May be up to 2000 instructions long Often 300-500 instructions Long enough to insert pre-fetch instructions u Need not stop at a register transfer, return, or call site Hot Chips 11 Wiggins/Redstone 22

Conditional Branches (Cbrs) u 76 unique cbrs u 115 instances of a CBR show up on various traces u Trace probabilities vs. Aggregate probabilities Correlated branches Temporal effects u 5-10% of branches have multiple instances with reversed directions Hot Chips 11 Wiggins/Redstone 23

Cbrs Vs Static Branch Probs Branch Trace Probability Aggregate 0x12003cc48 3 1.00 0.13 8 0.00 0x12003cc74 3 0.00 0.00 0x12003cc9c 1 1.00 1.00 3 1.00 3 1.00 Hot Chips 11 Wiggins/Redstone 24

Temporal Effects u A single program using one data set can show phases, which may not be apparent in the source code u Different phases require different optimizations u E.G.. Compress (SPEC95) - For each data item - look it up in hash table Initially most items are not in table Later most items are in table Hot Chips 11 Wiggins/Redstone 25

Sunsethf.pov Trace construction (Time) Hot Chips 11 Wiggins/Redstone 26

Temporal Effects Hot Chips 11 Wiggins/Redstone 27

What Don t We Do? u W/R works on applications not system kernels u Does not modify OS components u Does not modify program memory layout u Does not work with device drivers Hot Chips 11 Wiggins/Redstone 28

Final Comments u Wiggins/Redstone is the software analog of a trace processor u Runs on stock hardware/stock OS u Optimizes/specializes binary images u Runs on-line u Captures temporal effects u One tool, in a more adaptive computing model? u Return to self-modifying code? Hot Chips 11 Wiggins/Redstone 29