Optimizing compilers. CS6013 - Modern Compilers: Theory and Practise. Overview of different optimizations.




Optimizing compilers
CS6013 - Modern Compilers: Theory and Practise
Overview of different optimizations
V. Krishna Nandivada (IIT Madras), CS6013, Jan 2015

Copyright (c) 2015 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permission to publish from hosking@cs.purdue.edu.

Compiler structure

token stream -> Parser -> syntax tree -> Semantic analysis (e.g., type checking) -> syntax tree -> Intermediate code generator -> low-level IR (e.g., canonical trees/tuples) -> Optimizer -> low-level IR -> Machine code generator -> machine code

Potential optimizations:
- Source-language (AST): constant bounds in loops/arrays, loop unrolling, suppressing run-time checks; enables later optimizations
- IR: local and global CSE elimination, live variable analysis, code hoisting; enables later optimizations
- Code generation (machine code): register allocation, instruction scheduling, peephole optimization

Optimization

Goal: produce fast code. But what is optimality? The underlying problems are often hard: many are intractable or even undecidable, and many are NP-complete. Which optimizations should be used? Many optimizations overlap or interact.

Optimization

Definition: an optimization is a transformation that is expected to improve the running time of a program or decrease its space requirements. The point is improved code, not optimal code (forget "optimum"):
- it sometimes produces worse code
- the range of speedup might be anywhere from 1.000001 to xxx
- it should be applicable across a broad range of machines

Machine-independent transformations:
- remove redundant computations
- move an evaluation to a less frequently executed place
- specialize some general-purpose code
- find useless code and remove it
- expose opportunities for other optimizations

Machine-dependent transformations:
- capitalize on machine-specific properties
- improve the mapping from IR onto the machine
- replace a costly operation with a cheaper one
- hide latency
- replace a sequence of instructions with a single more powerful one (use exotic instructions)

This is the classical distinction, though it is not always clear-cut: replacing a multiply with shifts and adds, for example, straddles both categories.

Optimization

Desirable properties of an optimizing compiler:
- code at least as good as an assembler programmer's
- stable, robust performance (predictability)
- architectural strengths fully exploited
- architectural weaknesses fully hidden
- broad, efficient support for language features
- instantaneous compiles

Unfortunately, modern compilers often drop the ball.

Good compilers are crafted, not assembled:
- consistent philosophy
- careful selection of transformations
- thorough application
- coordinated transformations and data structures
- attention to results (code, time, space)

Compilers are engineered objects: minimize the running time of compiled code, minimize compile time, and use reasonable compile-time space (a serious problem). Thus, results are sometimes unexpected.

Scope of optimization
- Local (single block): confined to straight-line code; simplest to analyse. Time frame: 60s to present, particularly now.
- Intraprocedural (global): consider the whole procedure. What do we need to optimize an entire procedure? Classical data-flow analysis, dependence analysis. Time frame: 70s to present.
- Interprocedural (whole program): analyse whole programs. What do we need to optimize an entire program? Less information is discernible. Time frame: late 70s to present, particularly now.

Optimization

Three considerations arise in applying a transformation: safety, profitability, and opportunity. We need a clear understanding of these issues; the literature often hides them, and every discussion should list them clearly.

Safety

Fundamental question: does the transformation change the results of executing the code?
- yes: don't do it!
- no: it is safe

Compile-time analysis:
- may be safe in all cases (loop unrolling)
- analysis may be simple (DAGs and CSEs)
- may require complex reasoning (data-flow analysis)

Profitability

Fundamental question: is there a reasonable expectation that the transformation will be an improvement?
- yes: do it!
- no: don't do it

Compile-time estimation:
- always profitable
- heuristic rules
- compute the benefit (rare)

Opportunity

Fundamental question: can we efficiently locate sites for applying the transformation?
- yes: compilation time won't suffer
- no: it had better be highly profitable

Issues:
- provides a framework for applying the transformation systematically
- find all sites
- update safety information to reflect previous changes
- order of application (hard)

Optimization

Successful optimization requires:
- a test for safety
- profit = local improvement * executions
- a focus on loops: loop unrolling, factoring loop invariants, strength reduction
- minimizing side-effects like code growth

Example: loop unrolling

Idea: reduce loop overhead by creating multiple successive copies of the loop's body and increasing the increment appropriately.
- Safety: always safe.
- Profitability: reduces overhead (watch for instruction cache blowout, a subtle secondary effect).
- Opportunity: loops.

Unrolling is easy to understand and perform. This is the most overstudied example in the literature.

Matrix-matrix multiply:

do i = 1, n, 1
  do j = 1, n, 1
    c(i,j) = 0
    do k = 1, n, 1
      c(i,j) = c(i,j) + a(i,k) * b(k,j)

2n^3 flops, n^3 loop increments and branches; each iteration does 2 loads and 2 flops.

Matrix-matrix multiply, inner loop unrolled by 4 (assume a 4-word cache line):

do i = 1, n, 1
  do j = 1, n, 1
    c(i,j) = 0
    do k = 1, n, 4
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
      c(i,j) = c(i,j) + a(i,k+1) * b(k+1,j)
      c(i,j) = c(i,j) + a(i,k+2) * b(k+2,j)
      c(i,j) = c(i,j) + a(i,k+3) * b(k+3,j)

2n^3 flops, n^3/4 loop increments and branches; each iteration does 8 loads and 8 flops. Memory traffic is better:
- c(i,j) is reused
- a(i,k) references come from cache
- b(k,j) is problematic (put it in a register)

Matrix-matrix multiply, restructured to improve traffic on b:

do j = 1, n, 1
  do i = 1, n, 4
    c(i,j) = 0
    do k = 1, n, 4
      c(i,j)   = c(i,j)   + a(i,k)*b(k,j)   + a(i,k+1)*b(k+1,j)   + a(i,k+2)*b(k+2,j)   + a(i,k+3)*b(k+3,j)
      c(i+1,j) = c(i+1,j) + a(i+1,k)*b(k,j) + a(i+1,k+1)*b(k+1,j) + a(i+1,k+2)*b(k+2,j) + a(i+1,k+3)*b(k+3,j)
      c(i+2,j) = c(i+2,j) + a(i+2,k)*b(k,j) + a(i+2,k+1)*b(k+1,j) + a(i+2,k+2)*b(k+2,j) + a(i+2,k+3)*b(k+3,j)
      c(i+3,j) = c(i+3,j) + a(i+3,k)*b(k,j) + a(i+3,k+1)*b(k+1,j) + a(i+3,k+2)*b(k+2,j) + a(i+3,k+3)*b(k+3,j)

Example: loop unrolling (cont.)

What happened?
- interchanged the i and j loops
- unrolled the i loop
- fused the inner loops

2n^3 flops, n^3/16 loop increments and branches. The first assignment does 8 loads and 8 flops; the 2nd through 4th do 4 loads and 8 flops each. Memory traffic is better:
- c(i,j) is reused (register)
- a(i,k) references come from cache
- b(k,j) is reused (register)

Loop transformations

It is not as easy as it looks:
- Safety: loop interchange? loop unrolling? loop fusion?
- Opportunity: find memory-bound loop nests.
- Profitability: machine dependent (mostly).

Summary: there is a chance for large improvement, answering the fundamental questions is tough, and the resulting code is ugly. Matrix-matrix multiply is everyone's favorite example.

Loop optimizations: factoring loop-invariants

Loop invariants: expressions constant within the loop body.
Relevant variables: those used to compute an expression.

Opportunity:
1 identify variables defined in the body of the loop (LoopDef)
2 loop invariants have no relevant variables in LoopDef
3 assign each loop invariant to a temporary in the loop header
4 use the temporary in the loop body

Safety: a loop-invariant expression may throw an exception early.
Profitability: the loop may execute 0 times; the loop invariant may not be needed on every path through the loop body.

Example: factoring loop invariants

for i = 1 to 100 do        // LoopDef = {i,j,k,A}
  for j = 1 to 100 do      // LoopDef = {j,k,A}
    for k = 1 to 100 do    // LoopDef = {k,A}
      A[i,j,k] = i * j * k;

3 million index operations, 2 million multiplications.

Example: factoring loop invariants (cont.)

Factoring the inner loop:

for i = 1 to 100 do          // LoopDef = {i,j,k,A}
  for j = 1 to 100 do        // LoopDef = {j,k,A}
    t1 = &A[i][j]; t2 = i * j;
    for k = 1 to 100 do      // LoopDef = {k,A}
      t1[k] = t2 * k;

And the second loop:

for i = 1 to 100 do          // LoopDef = {i,j,k,A}
  t3 = &A[i];
  for j = 1 to 100 do        // LoopDef = {j,k,A}
    t1 = &t3[j]; t2 = i * j;
    for k = 1 to 100 do      // LoopDef = {k,A}
      t1[k] = t2 * k;

Strength reduction in loops

Loop induction variable: incremented on each iteration: i0, i0 + 1, i0 + 2, ...
Induction expression: i*c1 + c2, where c1 and c2 are loop invariant: i0*c1 + c2, (i0 + 1)*c1 + c2, (i0 + 2)*c1 + c2, ...

1 replace i*c1 + c2 by t in the body of the loop
2 insert t := i0*c1 + c2 before the loop
3 insert t := t + c1 at the end of the loop

Example: strength reduction in loops (cont.)

From the previous example:

for i = 1 to 100 do
  t3 = &A[i];
  t4 = i;                    // i * j0 = i
  for j = 1 to 100 do
    t1 = &t3[j];
    t2 = t4;                 // t4 = i * j
    t5 = t2;                 // t2 * k0 = t2
    for k = 1 to 100 do
      t1[k] = t5;            // t5 = t2 * k
      t5 = t5 + t2;
    t4 = t4 + i;

After copy propagation and exposing the indexing:

for i = 1 to 100 do
  t3 = A + (10000 * i) - 10000;
  t4 = i;
  for j = 1 to 100 do
    t1 = t3 + (100 * j) - 100;
    t5 = t4;
    for k = 1 to 100 do
      *(t1 + k - 1) = t5;
      t5 = t5 + t4;
    t4 = t4 + i;

Applying strength reduction to the exposed index expressions:

t6 = A;
for i = 1 to 100 do
  t3 = t6;
  t4 = i;
  t7 = t3;
  for j = 1 to 100 do
    t1 = t7;
    t5 = t4;
    t8 = t1;
    for k = 1 to 100 do
      *t8 = t5;
      t5 = t5 + t4;
      t8 = t8 + 1;
    t4 = t4 + i;
    t7 = t7 + 100;
  t6 = t6 + 10000;

Again, copy propagation further improves the code.

Ordering optimization phases
1 semantic analysis and intermediate code generation: loop unrolling, inline expansion
2 intermediate code generation: build basic blocks with their Def and Kill sets
3 build the control flow graph: perform initial data-flow analyses; assume the worst case for calls if there is no interprocedural analysis
4 early data-flow optimizations: constant/copy propagation (may expose dead code, changing the flow graph, so iterate)
5 CSE and live/dead variable analyses
6 translate basic blocks to target code: local optimizations (register allocation/assignment, code selection)
7 peephole optimization

Loop optimizations
- Loop unswitching
- Loop tiling
- Loop peeling
- Loop reversal
- Loop-invariant code motion
- Loop inversion
- Loop interchange
- Loop fusion
- Loop distribution