How Compilers Work. by Walter Bright. Digital Mars



Similar documents
Compiler Construction

CA Compiler Construction

Buffer Overflows. Security 2011

Hacking Techniques & Intrusion Detection. Ali Al-Shemery arabnix [at] gmail

Informatica e Sistemi in Tempo Reale

Compilers. Introduction to Compilers. Lecture 1. Spring term. Mick O Donnell: michael.odonnell@uam.es Alfonso Ortega: alfonso.ortega@uam.

How to make the computer understand? Lecture 15: Putting it all together. Example (Output assembly code) Example (input program) Anatomy of a Computer

Compiler Construction

Programming Language Pragmatics

64-Bit NASM Notes. Invoking 64-Bit NASM

Compiler Compiler Tutorial

Stack Overflows. Mitchell Adair

Overview of IA-32 assembly programming. Lars Ailo Bongo University of Tromsø

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

Software Vulnerabilities

8. MACROS, Modules, and Mouse

Comp151. Definitions & Declarations

The previous chapter provided a definition of the semantics of a programming

C Compiler Targeting the Java Virtual Machine

Scanning and parsing. Topics. Announcements Pick a partner by Monday Makeup lecture will be on Monday August 29th at 3pm

Software Fingerprinting for Automated Malicious Code Analysis

CSCI 3136 Principles of Programming Languages

Return-oriented programming without returns

Jonathan Worthington Scarborough Linux User Group

A Tiny Guide to Programming in 32-bit x86 Assembly Language

Scoping (Readings 7.1,7.4,7.6) Parameter passing methods (7.5) Building symbol tables (7.6)

Object Oriented Software Design II

Course Name: ADVANCE COURSE IN SOFTWARE DEVELOPMENT (Specialization:.Net Technologies)

For a 64-bit system. I - Presentation Of The Shellcode

KITES TECHNOLOGY COURSE MODULE (C, C++, DS)

CSC230 Getting Starting in C. Tyler Bletsch

An Incomplete C++ Primer. University of Wyoming MA 5310

The C Programming Language course syllabus associate level

03 - Lexical Analysis

Language Processing Systems

CMYK 0/100/100/20 66/54/42/17 34/21/10/0 Why is R slow? How to run R programs faster? Tomas Kalibera

Introduction. Compiler Design CSE 504. Overview. Programming problems are easier to solve in high-level languages

Off-by-One exploitation tutorial

How To Port A Program To Dynamic C (C) (C-Based) (Program) (For A Non Portable Program) (Un Portable) (Permanent) (Non Portable) C-Based (Programs) (Powerpoint)

CS412/CS413. Introduction to Compilers Tim Teitelbaum. Lecture 20: Stack Frames 7 March 08

TitanMist: Your First Step to Reversing Nirvana TitanMist. mist.reversinglabs.com

Lexical Analysis and Scanning. Honors Compilers Feb 5 th 2001 Robert Dewar

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Lecture 9. Semantic Analysis Scoping and Symbol Table

SQLITE C/C++ TUTORIAL

Chapter 6: Programming Languages

Compiler I: Syntax Analysis Human Thought

X86-64 Architecture Guide

1/20/2016 INTRODUCTION

Programming Languages

Lex et Yacc, exemples introductifs

GENERIC and GIMPLE: A New Tree Representation for Entire Functions

Programming Languages

1. General function and functionality of the malware

TODAY, FEW PROGRAMMERS USE ASSEMBLY LANGUAGE. Higher-level languages such

Introduction. Application Security. Reasons For Reverse Engineering. This lecture. Java Byte Code

TECHNOLOGY Computer Programming II Grade: 9-12 Standard 2: Technology and Society Interaction

Attacking x86 Windows Binaries by Jump Oriented Programming

Sources: On the Web: Slides will be available on:

Storage Classes CS 110B - Rule Storage Classes Page 18-1 \handouts\storclas

Introduction to Lex. General Description Input file Output file How matching is done Regular expressions Local names Using Lex

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

Assembly Language: Function Calls" Jennifer Rexford!

The programming language C. sws1 1

Title: Bugger The Debugger - Pre Interaction Debugger Code Execution

esrever gnireenigne tfosorcim seiranib

Lab Experience 17. Programming Language Translation

Fighting malware on your own

Chapter 1. Dr. Chris Irwin Davis Phone: (972) Office: ECSS CS-4337 Organization of Programming Languages

Channel Access Client Programming. Andrew Johnson Computer Scientist, AES-SSG

6.170 Tutorial 3 - Ruby Basics

Glossary of Object Oriented Terms

Language Evaluation Criteria. Evaluation Criteria: Readability. Evaluation Criteria: Writability. ICOM 4036 Programming Languages

COMP 356 Programming Language Structures Notes for Chapter 4 of Concepts of Programming Languages Scanning and Parsing

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

High-speed image processing algorithms using MMX hardware

Debugging with TotalView

Computer Organization and Architecture

An Introduction to Assembly Programming with the ARM 32-bit Processor Family

csce4313 Programming Languages Scanner (pass/fail)

CSCI E 98: Managed Environments for the Execution of Programs

Lumousoft Visual Programming Language and its IDE

Rakudo Perl 6 on the JVM. Jonathan Worthington

URI and UUID. Identifying things on the Web.

Test Driven Development in Assembler a little story about growing software from nothing

1 The Java Virtual Machine

Advanced compiler construction. General course information. Teacher & assistant. Course goals. Evaluation. Grading scheme. Michel Schinz

Guile Present. version 0.3.0, updated 21 September Andy Wingo

Perl in a nutshell. First CGI Script and Perl. Creating a Link to a Script. print Function. Parsing Data 4/27/2009. First CGI Script and Perl

Character Translation Methods

Application-Specific Attacks: Leveraging the ActionScript Virtual Machine

REpsych. : psycholigical warfare in reverse engineering. def con 2015 // domas

Parallelization: Binary Tree Traversal

CS61: Systems Programing and Machine Organization

Lecture 27 C and Assembly

Transcription:

How Compilers Work by Walter Bright Digital Mars

Compilers I've Built D programming language C++ C Javascript Java A.B.E.L

Compiler Compilers Regex Lex Yacc Spirit Do only the easiest part Not very customizable Make compiler source less portable Not worth the bother

Source Code extern int printf(const char *,...); int main(int argc, char **argv) { // Print out each argument for (int i = 0; i < argc; i++) { printf("arg[%d] = '%s'\n",i,argv[i]); } return 0; }

Compiler Output! " "" #$%$$%$$ & $## '() '* +,* + + +$-./- 0$1 +/23 ++ + '43567 858

Compiler Passes. Lexing. Parsing. Semantic Analysis. Intermediate Code Generation. Optimization. Code Generation. Object File Generation

Lexing turns character stream into tokens eliminates whitespace eliminates comments distinguishes keywords from identifiers strings, numbers turn into single tokens input is dramatically simplified

Tokenized Result Token Name Optional Data extern int identifier printf lparen const char star identifier format comma ellipsis rparen semicolon int identifier main lparen int identifier argc comma char star star identifier argv rparen lbrace for lparen int identifier i equals number 0 semicolon identifier i lessthan identifier argc semicolon identifier i plusplus rparen lbrace identifier printf lparen string arg[%d] = '%s'\n comma identifier argv lbracket identifier i rbracket rparen semicolon rbrace return number 0 semicolon rbrace

249 characters of source becomes 54 'characters' after lexing

Parsing Grammar looks like: )'$, )'9":'*); <'*); <'*)= $, Data structure looks like: '()'$,$, > ":'*)?:; <'*)?); <'*)?*; $,?+)@; A; Note close correspondence

Semantic Analysis 1. Declaring symbols 2. Resolving symbols 3. Type determination 4. Type checking 5. Language rules checking 6. Overload resolution 7. Template expansion 8. Inlining functions

Results of Semantic Analysis Compiled program is an array of symbols. Symbols point to the parsed data structures. Parsed data structures are 'decorated' with types, attributes, storage classes, and other information needed to generate intermediate code.

Intermediate Code A 'pseudo machine' is targetted Often several front ends target this 'pseudo machine' Sharing a common optimizer and native code generator Most semantic information is stripped gcc is a prime example of this Digital Mars compilers do it too Interpreters tend to go no further An interpreter like the Java VM and.net execute intermediate code Optimization and native code gen passes are done at runtime using a JIT

Intermediate Code Generation Symbols resulting from semantic analysis are "walked" to generate intermediate code. The statements are converted into basic blocks connected by edges.

i = 0 i < argc true false printf(... ) i++ 0 return

Expressions to Trees i = 0 = i 0

i < argc < i argc

printf("arg[%d] = '%s'\n", i, argv[i]); call printf param * param + argv * i arg[%d] = '%s'\n i 4

i++ i ++

0 0

Optimization Rewrites the expressions and the basic blocks to an optimized form i = 0 argc > 0 true printf(...) i += 1 i < argc false false 0 return

Note the loop rotation and the replacement of i++ with i += 1

Reason for Loop Rotation one less jump can sometimes eliminate loop header, such as the loop: for (int i = 0; i < 10; i++) { body } is rewritten as: int i = 0; do { body } while (++i < 10);

Other Optimizations Constant propagation Copy propagation Common subexpressions Strength reduction Constant folding Loop induction variables

More Optimizations Very busy expressions Tail call elimination Dead assignment elimination Live variable analysis Code hoisting Data flow analysis

Register Assignment i EBX argc ESI argv EDI

Register Assignment Methods Registers reserved for variables Good when there are many registers By live analysis More advanced Best when there are few registers

Instruction Selection prolog push push push mov mov EBX ESI EDI ESI,0Ch[ESP] EDI,014h[ESP] i = 0 xor EBX,EBX argc > 0 test jle ESI,ESI L27

printf("arg[%d] = '%s'\n", i, argv[i]) push push push call add [EBX*4][EDI] EBX offset _DATA _printf ESP,0Ch i += 1 inc EBX i < argc cmp jl EBX,ESI L11 0 xor EAX,EAX epilog pop pop pop ret EDI ESI EBX

Putting It Together push EBX push ESI push EDI mov ESI,0Ch[ESP] mov EDI,014h[ESP] xor EBX,EBX test ESI,ESI jle L27 L11: push [EBX*4][EDI] push EBX push offset FLAT:_DATA call _printf add ESP,0Ch inc EBX cmp EBX,ESI jl L11 L27: xor EAX,EAX pop EDI pop ESI pop EBX ret

Instruction Scheduling Do register loads as soon as possible Do register reads as late as possible

push EBX push ESI push EDI mov ESI,0Ch[ESP] mov EDI,014h[ESP] xor EBX,EBX test ESI,ESI jle L27 L11: push [EBX*4][EDI] push EBX push offset FLAT:_DATA call _printf add ESP,0Ch inc EBX cmp EBX,ESI jl L11 L27: xor EAX,EAX pop EDI pop ESI pop EBX ret push EBX xor EBX,EBX push ESI mov ESI,0Ch[ESP] test ESI,ESI push EDI mov EDI,014h[ESP] jle L27 L11: push [EBX*4][EDI] push EBX push offset FLAT:_DATA call _printf inc EBX add ESP,0Ch cmp EBX,ESI jl L11 L27: pop EDI xor EAX,EAX pop ESI pop EBX ret

Object File Generation _TEXT segment dword use32 public 'CODE' ;size is 45 _TEXT ends _DATA segment dword use32 public 'DATA' ;size is 16 _DATA ends CONST segment dword use32 public 'CONST' ;size is 0 CONST ends _BSS segment dword use32 public 'BSS' ;size is 0 _BSS ends FLAT group includelib SNN.lib extrn acrtused_con extrn _printf public _main _TEXT segment assume CS:_TEXT _main: 53 push EBX 31 DB xor EBX,EBX 56 push ESI 8B 74 24 0C mov ESI,0Ch[ESP] 85 F6 test ESI,ESI 57 push EDI 8B 7C 24 14 mov EDI,014h[ESP] 7E 16 jle L27 L11: FF 34 9F push [EBX*4][EDI] 53 push EBX 68 00 00 00 00 push offset FLAT:_DATA E8 00 00 00 00 call near ptr _printf 43 inc EBX 83 C4 0C add ESP,0Ch 39 F3 cmp EBX,ESI 7C EA jl L11 L27: 5F pop EDI 31 C0 xor EAX,EAX 5E pop ESI 5B pop EBX C3 ret _TEXT ends _DATA segment D0 db 061h,072h,067h,05bh,025h,064h,05dh,020h db 03dh,020h,027h,025h,073h,027h,00ah,000h _DATA ends CONST segment CONST ends _BSS segment _BSS ends end

Object File Contents n Intel Object Module Format Header THEADR, COMENT Symbol Table EXTDEF, PUB386 Code/Data Sections SEG386, GRPDEF, LNAMES Section Contents LED386 Fixups FIX386 Footer MODEND

THEADR 8 74 65 73 74 2e 63 70 70.test.cpp COMENT 0 9d 37 6e 4f..7nO COMENT 0 a1 1 43 56...CV LNAMES 0 4 46 4c 41 54 5 5f 54 45 58 54 4 43 4f 44..FLAT._TEXT.COD 45 5 5f 44 41 54 41 4 44 41 54 41 5 43 4f 4e E._DATA.DATA.CON 53 54 4 5f 42 53 53 3 42 53 53 ST._BSS.BSS SEG386 a9 33 0 0 0 3 4 1.3... SEG386 a9 10 0 0 0 5 6 1... SEG386 a9 0 0 0 0 7 7 1... SEG386 a9 0 0 0 0 8 9 1... GRPDEF 2. COMENT 0 9f 53 4e 4e..SNN EXTDEF e 5f 5f 61 63 72 74 75 73 65 64 5f 63 6f 6e 0. acrtused_con. 7 5f 70 72 69 6e 74 66 0._printf. PUB386 0 1 5 5f 6d 61 69 6e 0 0 0 0 0..._main... LED386 1 0 0 0 0 c8 4 0 0 c7 45 fc 0 0 0 0...E... 8b 45 fc 3b 45 8 7d 1c 8b 4d fc 8b 55 c ff 34.E.;E.}..M..U..4 8a 51 68 0 0 0 0 e8 0 0 0 0 83 c4 c ff.qh... 45 fc eb dc 31 c0 c9 c3 E...1... FIX386 a4 23 16 1 2 e4 1e 14 1 2.#... LED386 2 0 0 0 0 61 72 67 5b 25 64 5d 20 3d 20 27...arg[%d] = ' 25 73 27 a 0 %s'.. MODEND 0.

Other Compiler Topics Exception handling Thread local storage Closures Virtual functions

More Topics RAII Position Independent Code Concurrency Runtime Library Symbolic debug information

Conclusion Compilers are complex, but can be broken down into well understood passes Best way to learn a language is to write a compiler for it Compilers are fun