Compiler Construction Regular expressions Scanning Görel Hedin Reviderad 2013 01 23.a 2013 Compiler Construction 2013 F02-1
Compiler overview source code lexical analysis tokens intermediate code generation intermediate code syntactic analysis attributed AST optimization AST semantic analysis analysis intermediate code machine code generation synthesis machine code Compiler Construction 2013 F02-2
Compiler phases source code scanner regular expressions tokens lexical analysis parser context free grammar parse tree (implicit) AST builder AST (explicit) syntactic analysis Compiler Construction 2013 F02-3
Typical tokens Reserved words (keywords) Identifiers example lexemes Literals Operators Separators Compiler Construction 2013 F02-4
Typical tokens Reserved words (keywords) Identifiers example lexemes if then else begin i alpha k10 Literals 123 3.1416 A text Operators + ++!= Separators ;, ( Compiler Construction 2013 F02-5
Non-tokens whitespace example lexemes comments Compiler Construction 2013 F02-6
Non-tokens whitespace comments example lexemes tab newline formfeed /* comment */ //eol comment Compiler Construction 2013 F02-7
Formal languages An alphabet, Σ, is a set of symbols. (Nonempty and finite) A string is a sequence of symbols. (Each string is finite) A formal language, L, is a set of strings. (Can be infinite) Compiler Construction 2013 F02-8
Formal languages An alphabet, Σ, is a set of symbols. (Nonempty and finite) A string is a sequence of symbols. (Each string is finite) A formal language, L, is a set of strings. (Can be infinite) We would like to have rules or algorithms for deciding if a certain string over the alphabet belongs to the language or not. Compiler Construction 2013 F02-8
Example: Languages over binary numbers Suppose we have the alphabet Σ = {0, 1} Example languages: The set of all possible combinations of zeros and ones: L 0 = {0, 1, 00, 01, 10, 11, 000,...} All binary numbers without unnecessary leading zeros: L 1 = {0, 1, 10, 11, 100, 101, 110, 111, 1000,...} All binary numbers with two digits: L 2 = {00, 01, 10, 11}... Compiler Construction 2013 F02-9
Example: Languages over UNICODE Here, the alphabet Σ is the set of UNICODE characters Example languages: All possible Java keywords: { class, import, public,...} All possible lexemes corresponding to Java tokens. All possible lexemes corresponding to Java whitespace. All binary numbers... Compiler Construction 2013 F02-10
Example: Languages over Java tokens Here, the alphabet Σ is the set of Java tokens Example languages: All syntactically correct Java programs All that are syntactically incorrect All that are compile-time correct All that terminate Compiler Construction 2013 F02-11
Example: Languages over Java tokens Here, the alphabet Σ is the set of Java tokens Example languages: All syntactically correct Java programs All that are syntactically incorrect All that are compile-time correct All that terminate (undecidable)... Compiler Construction 2013 F02-11
Defining languages using rules Increasingly powerful: Regular expressions Context-free grammars Attribute grammars Compiler Construction 2013 F02-12
Regular expressions (core notation) RE read is called a a symbol M N M or N alternative M N M followed by N concatenation ɛ the empty string epsilon M zero or more M repetition (Kleene star) (M) where a is a symbol in the alphabet (e.g., {0, 1} or UNICODE) and M and N are regular expressions Each regular expression defines a language over the alphabet (a set of strings that belong to the language). Priorities: M N P = M (N (P )) Compiler Construction 2013 F02-13
Example a b c means Compiler Construction 2013 F02-14
Example a b c means {a, b, bc, bcc, bccc,...} Compiler Construction 2013 F02-14
Regular expressions with extended notation RE read is called a a symbol M N M or N alternative M N M followed by N concatenation ɛ the empty string epsilon M zero or more M repetition (Kleene star) (M) means M + at least one... M M M? optional... ɛ M [a za Z] one of... a b... z A... Z [0 9] not... one character, but not anyone of those listed a+b the string... a \+ b Compiler Construction 2013 F02-15
Exercise Write a regular expression that defines the language of all decimal numbers, like 3.14 0.75 4711 0... But not numbers lacking an integer part. And not numbers with a decimal point but lacking a fractional part. So not numbers like 17..236. Leading and trailing zeros are allowed. So the following numbers are ok: 007 008.00 0.0 1.700 a) Use the extended notation. b) Then translate the expression to the core notation. c) Then write an expression that disallows leading zeros. Compiler Construction 2013 F02-16
Solution a) [0 9] + (. [0 9] + )? b) (0... 9) (0... 9) (ɛ (. ((0... 9) (0... 9) ))) c) 0 (. [0 9] + )? [1 9] [0 9] (. [0 9] + )? Compiler Construction 2013 F02-17
JavaCC notation Appel JavaCC ɛ M + M+ M? M? [a za Z] [ a - z, A - Z ] [ 0-9 ] a b c abc \n \n Compiler Construction 2013 F02-18
JavaCC example... /* Skip whitespace */ SKIP : { " " "\t" "\n" "\r" } /* Reserved words */ TOKEN [IGNORE_CASE]: { < IF: "if"> < THEN: "then"> } /* Identifiers */ TOKEN: { < ID: (["A"-"Z", "a"-"z"])+ > } Compiler Construction 2013 F02-19
... JavaCC example /* Literals */ TOKEN: { < INT: (["0"-"9"])+ > } /* Operators and separators */ TOKEN: { < ASSIGN: "=" > < MULT: "*" > < LT: "<" > < LE: "<=" >... } Compiler Construction 2013 F02-20
Ambiguous lexical definitions The scanned text can match tokens in more than one way Which tokens match <=? Compiler Construction 2013 F02-21
Ambiguous lexical definitions The scanned text can match tokens in more than one way Which tokens match <=? Which tokens match if? Compiler Construction 2013 F02-21
Extra rules for resolving the ambiguities Longest match If one rule can be used to match a token, but there is another rule that will match a longer token, the latter rule will be chosen. This way, the scanner will match the longest token possible. Compiler Construction 2013 F02-22
Extra rules for resolving the ambiguities Longest match If one rule can be used to match a token, but there is another rule that will match a longer token, the latter rule will be chosen. This way, the scanner will match the longest token possible. Rule priority If two rules can be used to match the same sequence of characters, the first one takes priority. Compiler Construction 2013 F02-22
Implementation of scanners Finite automata 1 i f 2 3 state IF 0 9 0 9 1 2 a transition start state INT final state Compiler Construction 2013 F02-23
More automata WHITESPACE ID Compiler Construction 2013 F02-24
More automata WHITESPACE 1 \t \n 2 3 4 ID a z a z 1 2 Compiler Construction 2013 F02-25
Automaton for the whole language Combine the automata for individual tokens merge the start states re-enumerate the states so they get unique numbers mark each final state with the token that is matched Compiler Construction 2013 F02-26
Example 1 Combine IF and INT Compiler Construction 2013 F02-27
Example 1 1 i f 2 3 0 9 IF 0 9 4 INT Compiler Construction 2013 F02-28
Example 2 Combine IF and ID Compiler Construction 2013 F02-29
Example 2 1 i f 2 3 a z0 9 IF a z 4 ID Compiler Construction 2013 F02-30
Is the automaton deterministic? Compiler Construction 2013 F02-31
Translating a RE to an NFA a MN M N M ɛ Compiler Construction 2013 F02-32
DFA vs. NFA Deterministic Finite Automaton (DFA) A finite automaton is deterministic if all outgoing edges from any given node have disjoint character sets there are no ɛ edges (the empty string) Can be implemented efficiently Compiler Construction 2013 F02-33
DFA vs. NFA Deterministic Finite Automaton (DFA) A finite automaton is deterministic if all outgoing edges from any given node have disjoint character sets there are no ɛ edges (the empty string) Can be implemented efficiently Non-deterministic Finite Automaton (NFA) An NFA may have two outgoing edges with overlapping character sets ɛ-edges DFA NFA Compiler Construction 2013 F02-33
Each NFA can be translated to a DFA Simulate the NFA keep track of a set of current NFA-states follow ɛ edges to extend the current set (take the closure) Construct the corresponding DFA Each such set of NFA states corresponds to one DFA state If any of the NFA states are final, the DFA state is also final, and is marked with the corresponding token. If there is more than one token to choose from, select the token that is defined first (rule priority) (Minimize the DFA for efficiency) Compiler Construction 2013 F02-34
Translate an NFA to a DFA NFA start 1 f 2 3 i a-z a-z 4 ID IF Compiler Construction 2013 F02-35
Translate an NFA to a DFA NFA 2 f 3 i IF start 1 a-z a-z 4 ID DFA 2,4 f 3,4 i a-eg-z IF start 1 a-z a-z a-hj-z 4 ID Compiler Construction 2013 F02-36
Construction of a Scanner Define all tokens as regular expressions Construct a finite automaton for each token Combine all these automata to a new automaton (an NFA in general) Translate to a corresponding DFA Minimize the DFA Implement the DFA Compiler Construction 2013 F02-37
Implementation alternatives for DFAs Table-driven Represent the automaton by a table Additional table to keep track of final states and token kinds A global variable keeps track of the current state Compiler Construction 2013 F02-38
Implementation alternatives for DFAs Table-driven Represent the automaton by a table Additional table to keep track of final states and token kinds A global variable keeps track of the current state Switch statements Each state is implemented as a switch statement Each case implements a state transition as a jump (to another switch statement). The current state is represented by the program counter Compiler Construction 2013 F02-38
Error handling Add a dead state (state 0), corresponding to erroneous input Add transitions to the dead state for all erroneous input Generate an Error token when the dead state is reached Compiler Construction 2013 F02-39
Table-driven implementation DFA for IF and ID 1 2,4 3,4 4.. a-e f g-h i j-z.. final kind true ERROR Compiler Construction 2013 F02-40
Table-driven implementation DFA for IF and ID.. a-e f g-h i j-z.. final kind 0 true ERROR 1 0 4 4 4 2,4 4 0 false 2,4 0 4 3,4 4 4 4 0 true ID 3,4 0 4 4 4 4 4 0 true IF 4 0 4 4 4 4 4 0 true ID Compiler Construction 2013 F02-41
Design Token int kind String string Parser Scanner Token next() Reader int get() Compiler Construction 2013 F02-42
Scanner (sketch) class Scanner { PushbackReader reader; boolean isfinal [ ]; int edges [ ] [ ]; int kind [ ]; Token next() { } } Compiler Construction 2013 F02-43
Scanner (sketch) class Scanner { //does not handle: PushbackReader reader; //longest match boolean isfinal [ ]; //eof int edges [ ] [ ]; //whitespace int kind [ ]; //token strings Token next() { int state = 1; //start state while(!isfinal[state]) { char ch = reader.read(); state = edges[state][ch]; } return new Token(kind[state]); } } Compiler Construction 2013 F02-44
Handling End-Of-File (EOF) and non-tokens Compiler Construction 2013 F02-45
Handling End-Of-File (EOF) and non-tokens EOF construct an explicit EOF token when the EOF character is read Non-tokens (Whitespace & Comments) Errors view as tokens of a special kind scan them as normal tokens, but don t create token objects for them loop in next() until a real token has been found construct an explicit ERROR token to be returned when no valid token can be found. Compiler Construction 2013 F02-45
Handling longest match When a token is matched (a final state reached), don t stop scanning. Keep track of the latest matched token in a variable. Continue scanning until we reach the dead state Restore the input stream. Use PushbackReader to be able to do this. Return the latest matched token (or return the ERROR token if there was no latest matched token) Compiler Construction 2013 F02-46
Scanner with longest match class Scanner {... Token next() { } } Compiler Construction 2013 F02-47
Scanner with longest match class Scanner {... Token next() { int state = 1, lastfinalstate = 0; int lastfinalstr = "": StringBuilder builder = new StringBuilder(); while(state!=0) { char ch = reader.read(); builder.append(ch); state = edges[state][ch]; if(isfinal[state]) { lastfinalstate = state; lastfinalstr = builder.tostring(); } } reader.unread(... ); //unused characters return new Token(kind[lastFinalState], lastfinalstr); } Compiler Construction 2013 F02-48
JavaCC as a scanner generator toy.jj sum.toy javacc ToyTokenManager.java ToyConstants.java...java tokens Compiler Construction 2013 F02-49
Other scanner generators tool author implementation generates lex Schmidt, Lesk, 1975 table-driven C code flex Paxon. 1987 switch statements C code jlex table-driven Java code jflex switch statements Java code Compiler Construction 2013 F02-50
Additional functionality Lexical actions extra (Java) code written in the specification for adapting the generated scanner e.g., create special token objects, count tokens, keep track of position in input text (line and column numbers), etc. Compiler Construction 2013 F02-51
Additional functionality Lexical actions extra (Java) code written in the specification for adapting the generated scanner e.g., create special token objects, count tokens, keep track of position in input text (line and column numbers), etc. Lexical states ( start states ) gives the possibility to switch automata during scanning makes it easier to scan certain tokens, e.g., multi-line comments makes it possible to scan certain combined languages, e.g., HTML Compiler Construction 2013 F02-51
Lexcial states, example Scanning of /*... */ Compiler Construction 2013 F02-52
Summary questions What is meant by an ambiguous lexical definition? Give some typical examples of this and how they may be resolved. Construct an NFA for a given lexical definition Construct a DFA for a given NFA What is the difference between a DFA and an NFA? Implement a DFA in Java. How is rule priority handled in the implementation? How is longest match handled? EOF? Whitespace? Compiler Construction 2013 F02-53
What s next? Tomorrow: Seminar 1 (Scanning) Next week: Parsing and Assignment 1 (Scanning) Compiler Construction 2013 F02-54