Compiler Construction




Compiler Construction: Regular expressions, Scanning
Görel Hedin
Revised 2013-01-23.a
Compiler Construction 2013 F02-1

Compiler overview
  Analysis: source code -> lexical analysis -> tokens -> syntactic analysis -> AST -> semantic analysis -> attributed AST
  Synthesis: attributed AST -> intermediate code generation -> intermediate code -> optimization -> intermediate code -> machine code generation -> machine code

Compiler phases
  Lexical analysis: source code -> scanner (defined by regular expressions) -> tokens
  Syntactic analysis: tokens -> parser (defined by a context-free grammar) -> parse tree (implicit) -> AST builder -> AST (explicit)

Typical tokens
  Token kind                   Example lexemes
  Reserved words (keywords)    if  then  else  begin
  Identifiers                  i  alpha  k10
  Literals                     123  3.1416  'A'  "text"
  Operators                    +  ++  !=
  Separators                   ;  ,  (

Non-tokens
  Kind          Example lexemes
  whitespace    tab  newline  formfeed
  comments      /* comment */   //eol comment

Formal languages
  An alphabet, Σ, is a set of symbols. (Nonempty and finite)
  A string is a sequence of symbols. (Each string is finite)
  A formal language, L, is a set of strings. (Can be infinite)
We would like to have rules or algorithms for deciding whether a certain string over the alphabet belongs to the language or not.

Example: Languages over binary numbers
Suppose we have the alphabet Σ = {0, 1}. Example languages:
  The set of all possible combinations of zeros and ones: L0 = {0, 1, 00, 01, 10, 11, 000, ...}
  All binary numbers without unnecessary leading zeros: L1 = {0, 1, 10, 11, 100, 101, 110, 111, 1000, ...}
  All binary numbers with two digits: L2 = {00, 01, 10, 11}
  ...

Example: Languages over UNICODE
Here, the alphabet Σ is the set of UNICODE characters. Example languages:
  All possible Java keywords: {class, import, public, ...}
  All possible lexemes corresponding to Java tokens.
  All possible lexemes corresponding to Java whitespace.
  All binary numbers.
  ...

Example: Languages over Java tokens
Here, the alphabet Σ is the set of Java tokens. Example languages:
  All syntactically correct Java programs
  All that are syntactically incorrect
  All that are compile-time correct
  All that terminate (undecidable)
  ...

Defining languages using rules
Increasingly powerful:
  Regular expressions
  Context-free grammars
  Attribute grammars

Regular expressions (core notation)
  RE       read                is called
  a        a                   symbol
  M | N    M or N              alternative
  M N      M followed by N     concatenation
  ɛ        the empty string    epsilon
  M*       zero or more M      repetition (Kleene star)
  (M)
where a is a symbol in the alphabet (e.g., {0, 1} or UNICODE) and M and N are regular expressions.
Each regular expression defines a language over the alphabet (a set of strings that belong to the language).
Priorities: M | N P* = M | (N (P*))

Example
  a | b c*   means   {a, b, bc, bcc, bccc, ...}
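The example can be checked mechanically. The sketch below assumes that java.util.regex is an acceptable stand-in for the core notation (its `|` and `*` have the same priorities); the class name CoreNotationDemo and the method inLanguage are made up for illustration.

```java
import java.util.regex.Pattern;

class CoreNotationDemo {
    // a | b c*  in java.util.regex syntax: '*' binds tightest, then concatenation, then '|'
    static final Pattern RE = Pattern.compile("a|bc*");

    // True if the whole string s belongs to the language of a | b c*
    static boolean inLanguage(String s) {
        return RE.matcher(s).matches(); // matches() requires the entire string to match
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "b", "bcc", "ab", ""}) {
            System.out.println("\"" + s + "\" -> " + inLanguage(s));
        }
    }
}
```

Strings such as "ab" or the empty string are rejected, which agrees with the set listed on the slide.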

Regular expressions with extended notation
  RE          read                is called                   means
  a           a                   symbol
  M | N       M or N              alternative
  M N         M followed by N     concatenation
  ɛ           the empty string    epsilon
  M*          zero or more M      repetition (Kleene star)
  (M)
  M+          at least one ...                                M M*
  M?          optional ...                                    ɛ | M
  [a-zA-Z]    one of ...                                      a | b | ... | z | A | ... | Z
  [^0-9]      not ...                                         one character, but not any one of those listed
  "a+b"       the string ...                                  a \+ b

Exercise
Write a regular expression that defines the language of all decimal numbers, like
  3.14  0.75  4711  0  ...
But not numbers lacking an integer part, and not numbers with a decimal point but lacking a fractional part. So not numbers like
  17.  .236
Leading and trailing zeros are allowed. So the following numbers are ok:
  007  008.00  0.0  1.700
a) Use the extended notation.
b) Then translate the expression to the core notation.
c) Then write an expression that disallows leading zeros.

Solution
a) [0-9]+ ("." [0-9]+)?
b) (0 | ... | 9) (0 | ... | 9)* (ɛ | ("." ((0 | ... | 9) (0 | ... | 9)*)))
c) 0 ("." [0-9]+)? | [1-9] [0-9]* ("." [0-9]+)?
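Solutions a) and c) can be tried out against the numbers from the exercise. This is a sketch that assumes java.util.regex syntax as a stand-in for the slide notation (the literal decimal point is written `\\.`); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class DecimalNumberDemo {
    // Solution a): one or more digits, optionally followed by "." and one or more digits
    static final Pattern WITH_LEADING_ZEROS = Pattern.compile("[0-9]+(\\.[0-9]+)?");

    // Solution c): the integer part is a single 0 or starts with a nonzero digit
    static final Pattern NO_LEADING_ZEROS = Pattern.compile("(0|[1-9][0-9]*)(\\.[0-9]+)?");

    static boolean isDecimal(String s) {
        return WITH_LEADING_ZEROS.matcher(s).matches();
    }

    static boolean isDecimalWithoutLeadingZeros(String s) {
        return NO_LEADING_ZEROS.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"3.14", "007", "008.00", "17.", ".236"}) {
            System.out.println(s + ": a) " + isDecimal(s)
                    + ", c) " + isDecimalWithoutLeadingZeros(s));
        }
    }
}
```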

JavaCC notation
  Appel        JavaCC
  ɛ
  M+           M+
  M?           M?
  [a-zA-Z]     ["a"-"z", "A"-"Z"]
  [^0-9]       ~["0"-"9"]
  a b c        "abc"
  \n           "\n"

JavaCC example ...
  /* Skip whitespace */
  SKIP : { " " | "\t" | "\n" | "\r" }

  /* Reserved words */
  TOKEN [IGNORE_CASE]: {
    < IF: "if" >
  | < THEN: "then" >
  }

  /* Identifiers */
  TOKEN: {
    < ID: (["A"-"Z", "a"-"z"])+ >
  }

... JavaCC example
  /* Literals */
  TOKEN: {
    < INT: (["0"-"9"])+ >
  }

  /* Operators and separators */
  TOKEN: {
    < ASSIGN: "=" >
  | < MULT: "*" >
  | < LT: "<" >
  | < LE: "<=" >
  ...
  }

Ambiguous lexical definitions
The scanned text can match tokens in more than one way.
  Which tokens match "<="?
  Which tokens match "if"?

Extra rules for resolving the ambiguities
  Longest match: If one rule can be used to match a token, but there is another rule that will match a longer token, the latter rule will be chosen. This way, the scanner will match the longest token possible.
  Rule priority: If two rules can be used to match the same sequence of characters, the first one takes priority.
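The two rules can be illustrated with a small rule list. The sketch below is not how a generated scanner works internally (that uses a DFA); it simply tries every rule with java.util.regex and keeps the longest match, letting the earlier rule win ties. The rule set and all names are assumptions chosen to mirror the "<=" and "if" questions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleResolutionDemo {
    // Rules in priority order: an earlier rule wins when two rules match the same text.
    static final String[] NAMES = {"IF", "ID", "LT", "LE"};
    static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[A-Za-z]+"),
        Pattern.compile("<"),
        Pattern.compile("<="),
    };

    // Returns "KIND:lexeme" for the token starting at pos, or "ERROR" if no rule matches there.
    static String matchAt(String input, int pos) {
        String bestKind = "ERROR";
        int bestLength = 0;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            m.region(pos, input.length());
            // lookingAt() anchors the match at the start of the region.
            // Strictly-greater comparison: longest match first, earlier rule on ties.
            if (m.lookingAt() && m.end() - pos > bestLength) {
                bestLength = m.end() - pos;
                bestKind = NAMES[i];
            }
        }
        return bestLength == 0 ? "ERROR" : bestKind + ":" + input.substring(pos, pos + bestLength);
    }

    public static void main(String[] args) {
        System.out.println(matchAt("<=", 0));
        System.out.println(matchAt("if", 0));
        System.out.println(matchAt("iffy", 0));
    }
}
```

With this rule order, "<=" is scanned as LE by longest match, "if" as IF by rule priority, and "iffy" as ID by longest match.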

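Implementation of scanners
Finite automata
[Figure: automaton for IF: start state 1 -i-> state 2 -f-> final state 3 (IF). Automaton for INT: start state 1 -[0-9]-> final state 2 (INT), with a [0-9] transition from state 2 back to itself. Legend: state, transition, start state, final state.]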

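More automata
[Figure: automaton for WHITESPACE: start state 1 with transitions on " ", \t and \n to the final states 2, 3 and 4. Automaton for ID: start state 1 -[a-z]-> final state 2, with an [a-z] transition from state 2 back to itself.]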

Automaton for the whole language
Combine the automata for individual tokens:
  merge the start states
  re-enumerate the states so they get unique numbers
  mark each final state with the token that is matched

Example 1
Combine IF and INT
[Figure: start state 1 -i-> state 2 -f-> final state 3 (IF); state 1 -[0-9]-> final state 4 (INT), with a [0-9] transition from state 4 back to itself.]

Example 2
Combine IF and ID
[Figure: start state 1 -i-> state 2 -f-> final state 3 (IF); state 1 -[a-z]-> final state 4 (ID), with an [a-z0-9] transition from state 4 back to itself.]

Is the automaton deterministic?

Translating a RE to an NFA
[Figure: NFA fragments for the regular expressions a, M N, M | N, M* and ɛ.]

DFA vs. NFA
Deterministic Finite Automaton (DFA)
  A finite automaton is deterministic if
    all outgoing edges from any given node have disjoint character sets
    there are no ɛ edges (the empty string)
  Can be implemented efficiently
Non-deterministic Finite Automaton (NFA)
  An NFA may have
    two outgoing edges with overlapping character sets
    ɛ edges
Every DFA is also an NFA.

Each NFA can be translated to a DFA
Simulate the NFA
  keep track of a set of current NFA states
  follow ɛ edges to extend the current set (take the closure)
Construct the corresponding DFA
  Each such set of NFA states corresponds to one DFA state.
  If any of the NFA states are final, the DFA state is also final, and is marked with the corresponding token.
  If there is more than one token to choose from, select the token that is defined first (rule priority).
(Minimize the DFA for efficiency)
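The simulation idea can be sketched as a worklist algorithm over sets of NFA states. Everything below is an illustrative assumption rather than code from the course: the class name, the use of '\0' as the marker for ɛ edges, and the nested-map representation. Marking final states with tokens and minimization are left out to keep the sketch short. The example NFA is the one for IF and ID: 1 -i-> 2 -f-> 3, and 1 -[a-z]-> 4 -[a-z]-> 4.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubsetConstructionDemo {
    static final char EPS = '\0'; // assumed marker for ɛ edges

    // NFA edges: state -> (symbol -> set of target states)
    final Map<Integer, Map<Character, Set<Integer>>> edges = new HashMap<>();

    void addEdge(int from, char symbol, int to) {
        edges.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(symbol, k -> new TreeSet<>())
             .add(to);
    }

    Set<Integer> targets(int state, char symbol) {
        return edges.getOrDefault(state, Map.of()).getOrDefault(symbol, Set.of());
    }

    // Extend a set of NFA states with everything reachable over ɛ edges.
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : targets(s, EPS)) {
                if (result.add(t)) work.push(t);
            }
        }
        return result;
    }

    // One DFA step: follow the symbol from every state in the set, then take the closure.
    Set<Integer> move(Set<Integer> states, char symbol) {
        Set<Integer> next = new TreeSet<>();
        for (int s : states) next.addAll(targets(s, symbol));
        return closure(next);
    }

    // Build all reachable DFA states; each DFA state is a set of NFA states.
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int start, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.push(closure(Set.of(start)));
        while (!work.isEmpty()) {
            Set<Integer> current = work.pop();
            if (dfa.containsKey(current)) continue; // already expanded
            Map<Character, Set<Integer>> row = new HashMap<>();
            dfa.put(current, row);
            for (char c : alphabet.toCharArray()) {
                Set<Integer> target = move(current, c);
                if (target.isEmpty()) continue; // no edge: implicit dead state
                row.put(c, target);
                work.push(target);
            }
        }
        return dfa;
    }

    // The NFA for IF and ID.
    static SubsetConstructionDemo ifAndId() {
        SubsetConstructionDemo nfa = new SubsetConstructionDemo();
        nfa.addEdge(1, 'i', 2);
        nfa.addEdge(2, 'f', 3);
        for (char c = 'a'; c <= 'z'; c++) {
            nfa.addEdge(1, c, 4);
            nfa.addEdge(4, c, 4);
        }
        return nfa;
    }

    public static void main(String[] args) {
        System.out.println(ifAndId().toDfa(1, "abcdefghijklmnopqrstuvwxyz").keySet());
    }
}
```

For this NFA the construction yields four DFA states: {1}, {2,4}, {3,4} and {4}.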

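Translate an NFA to a DFA
[Figure, NFA: start state 1 -i-> state 2 -f-> final state 3 (IF); state 1 -[a-z]-> final state 4 (ID), with an [a-z] transition from state 4 back to itself.]
[Figure, DFA: start state 1 -i-> state 2,4 -f-> final state 3,4 (IF); state 1 -[a-hj-z]-> final state 4 (ID); state 2,4 -[a-eg-z]-> state 4; state 3,4 -[a-z]-> state 4; state 4 -[a-z]-> state 4.]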

Construction of a Scanner
  Define all tokens as regular expressions
  Construct a finite automaton for each token
  Combine all these automata to a new automaton (an NFA in general)
  Translate to a corresponding DFA
  Minimize the DFA
  Implement the DFA

Implementation alternatives for DFAs
Table-driven
  Represent the automaton by a table
  Additional table to keep track of final states and token kinds
  A global variable keeps track of the current state
Switch statements
  Each state is implemented as a switch statement
  Each case implements a state transition as a jump (to another switch statement)
  The current state is represented by the program counter

Error handling
  Add a dead state (state 0), corresponding to erroneous input
  Add transitions to the dead state for all erroneous input
  Generate an Error token when the dead state is reached

Table-driven implementation
DFA for IF and ID
  state   ..   a-e   f     g-h   i     j-z   ..   final   kind
  0                                               true    ERROR
  1       0    4     4     4     2,4   4     0    false
  2,4     0    4     3,4   4     4     4     0    true    ID
  3,4     0    4     4     4     4     4     0    true    IF
  4       0    4     4     4     4     4     0    true    ID
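The table can be written down directly as Java arrays. In this sketch the DFA states 2,4 and 3,4 are renumbered 2 and 3 so they can be used as array indices, the ".." columns are folded into one "other" column, and the dead state is assumed to loop to itself; the class and method names are hypothetical.

```java
class TableDrivenDfaDemo {
    // States: 0 = dead, 1 = start, 2 = {2,4}, 3 = {3,4}, 4 = {4}
    static final boolean[] IS_FINAL = {true, false, true, true, true};
    static final String[] KIND = {"ERROR", null, "ID", "IF", "ID"};

    // Columns: 0 = other, 1 = a-e, 2 = f, 3 = g-h, 4 = i, 5 = j-z
    static final int[][] EDGES = {
        {0, 0, 0, 0, 0, 0}, // 0: dead state stays dead (assumption)
        {0, 4, 4, 4, 2, 4}, // 1: start
        {0, 4, 3, 4, 4, 4}, // 2: {2,4}
        {0, 4, 4, 4, 4, 4}, // 3: {3,4}
        {0, 4, 4, 4, 4, 4}, // 4: {4}
    };

    // Map a character to its column in EDGES.
    static int column(char ch) {
        if (ch >= 'a' && ch <= 'e') return 1;
        if (ch == 'f') return 2;
        if (ch == 'g' || ch == 'h') return 3;
        if (ch == 'i') return 4;
        if (ch >= 'j' && ch <= 'z') return 5;
        return 0;
    }

    // Run the DFA over the whole lexeme and report the kind of the state it ends in.
    static String classify(String lexeme) {
        int state = 1; // start state
        for (char ch : lexeme.toCharArray()) {
            state = EDGES[state][column(ch)];
        }
        return IS_FINAL[state] ? KIND[state] : "ERROR";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"if", "i", "iffy", "x", "i9"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Note that classify consumes a complete lexeme; it does not yet decide where a token ends in a longer input.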

Design
[Figure: the Parser calls Token next() on the Scanner, which calls int get() on a Reader. A Token has the fields int kind and String string.]

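Scanner (sketch)
  import java.io.IOException;
  import java.io.PushbackReader;

  // Does not handle: longest match, EOF, whitespace, token strings
  class Scanner {
    PushbackReader reader;
    boolean isfinal[];   // isfinal[s]: is state s a final state?
    int edges[][];       // edges[s][ch]: state reached from s on character ch
    int kind[];          // kind[s]: token kind for final state s

    Token next() throws IOException {
      int state = 1; // start state
      while (!isfinal[state]) {
        char ch = (char) reader.read(); // read() returns an int
        state = edges[state][ch];
      }
      return new Token(kind[state]);
    }
  }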

Handling End-Of-File (EOF) and non-tokens
EOF
  construct an explicit EOF token when the EOF character is read
Non-tokens (Whitespace & Comments)
  view as tokens of a special kind
  scan them as normal tokens, but don't create token objects for them
  loop in next() until a real token has been found
Errors
  construct an explicit ERROR token to be returned when no valid token can be found
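The three cases can be seen together in a small hand-written scanner. This is a sketch under simplifying assumptions: tokens are plain strings instead of Token objects, the only real token kind is INT, and the class name is made up. IOException is wrapped so that next() can be called without a throws clause.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SkippingScannerDemo {
    private final PushbackReader reader;

    SkippingScannerDemo(String input) {
        reader = new PushbackReader(new StringReader(input));
    }

    // Returns the next real token; whitespace is scanned but never returned.
    String next() {
        try {
            while (true) { // loop until a real token has been found
                int ch = reader.read();
                if (ch == -1) return "EOF";               // explicit EOF token
                if (Character.isWhitespace(ch)) continue; // non-token: drop it and keep scanning
                if (Character.isDigit(ch)) {
                    StringBuilder lexeme = new StringBuilder();
                    while (ch != -1 && Character.isDigit(ch)) {
                        lexeme.append((char) ch);
                        ch = reader.read();
                    }
                    if (ch != -1) reader.unread(ch);      // give back the character we did not use
                    return "INT:" + lexeme;
                }
                return "ERROR:" + (char) ch;              // no valid token starts here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SkippingScannerDemo scanner = new SkippingScannerDemo("  12 \n7?");
        String token;
        do {
            token = scanner.next();
            System.out.println(token);
        } while (!token.equals("EOF"));
    }
}
```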

Handling longest match
  When a token is matched (a final state reached), don't stop scanning.
  Keep track of the latest matched token in a variable.
  Continue scanning until we reach the dead state.
  Restore the input stream. Use PushbackReader to be able to do this.
  Return the latest matched token (or return the ERROR token if there was no latest matched token).

Scanner with longest match
  class Scanner {
    ...
    Token next() throws IOException {
      int state = 1, lastfinalstate = 0;
      String lastfinalstr = "";
      StringBuilder builder = new StringBuilder();
      while (state != 0) {
        char ch = (char) reader.read(); // read() returns an int
        builder.append(ch);
        state = edges[state][ch];
        // The dead state 0 is marked final (ERROR) in the table, so exclude it here;
        // otherwise it would overwrite the latest real match.
        if (state != 0 && isfinal[state]) {
          lastfinalstate = state;
          lastfinalstr = builder.toString();
        }
      }
      reader.unread(...); // unused characters
      return new Token(kind[lastfinalstate], lastfinalstr);
    }
  }

JavaCC as a scanner generator
[Figure: javacc reads the specification toy.jj and generates ToyTokenManager.java, ToyConstants.java and further .java files; the generated scanner reads sum.toy and produces tokens.]

Other scanner generators
  tool     author                  implementation       generates
  lex      Schmidt, Lesk, 1975     table-driven         C code
  flex     Paxson, 1987            switch statements    C code
  jlex                             table-driven         Java code
  jflex                            switch statements    Java code

Additional functionality
Lexical actions
  extra (Java) code written in the specification for adapting the generated scanner
  e.g., create special token objects, count tokens, keep track of position in input text (line and column numbers), etc.
Lexical states ("start states")
  give the possibility to switch automata during scanning
  make it easier to scan certain tokens, e.g., multi-line comments
  make it possible to scan certain combined languages, e.g., HTML

Lexical states, example
Scanning of /* ... */
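The idea of switching automata can be sketched by hand in plain Java, independent of any scanner generator. The two states DEFAULT and IN_COMMENT, the string-based tokens and the class name are assumptions for illustration; a generated scanner would express the same thing in its specification language.

```java
import java.util.ArrayList;
import java.util.List;

class LexicalStatesDemo {
    enum State { DEFAULT, IN_COMMENT }

    // Scans identifiers made of letters, skipping whitespace and /* ... */ comments.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        int i = 0;
        while (i < input.length()) {
            if (state == State.IN_COMMENT) {
                if (input.startsWith("*/", i)) {
                    state = State.DEFAULT;     // comment closed: back to the normal automaton
                    i += 2;
                } else {
                    i++;                       // any other character is comment text
                }
            } else if (input.startsWith("/*", i)) {
                state = State.IN_COMMENT;      // switch to the comment automaton
                i += 2;
            } else if (Character.isLetter(input.charAt(i))) {
                int start = i;
                while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                tokens.add("ID:" + input.substring(start, i));
            } else if (Character.isWhitespace(input.charAt(i))) {
                i++;
            } else {
                tokens.add("ERROR:" + input.charAt(i));
                i++;
            }
        }
        if (state == State.IN_COMMENT) tokens.add("ERROR:unterminated comment");
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("a /* b\n * c */ d"));
    }
}
```

Inside IN_COMMENT the only interesting input is "*/", so characters that would be tokens or errors in DEFAULT are simply consumed.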

Summary questions
  What is meant by an ambiguous lexical definition? Give some typical examples of this and how they may be resolved.
  Construct an NFA for a given lexical definition.
  Construct a DFA for a given NFA.
  What is the difference between a DFA and an NFA?
  Implement a DFA in Java.
  How is rule priority handled in the implementation? How is longest match handled? EOF? Whitespace?

What's next?
  Tomorrow: Seminar 1 (Scanning)
  Next week: Parsing and Assignment 1 (Scanning)