University of Wales Swansea
Department of Computer Science

Compilers

Course notes for module CS 218

Dr. Matt Poole 2002, edited by Mr. Christopher Whyley, 2nd Semester 2006/2007
www-compsci.swan.ac.uk/~cschris/compilers

1 Introduction

1.1 Compilation

Definition. Compilation is a process that translates a program in one language (the source language) into an equivalent program in another language (the object or target language).

    source program --> [ compiler ] --> target program
                            |
                      error messages

An important part of any compiler is the detection and reporting of errors; this will be discussed in more detail later in the introduction. Commonly, the source language is a high-level programming language (i.e. a problem-oriented language), and the target language is a machine language or assembly language (i.e. a machine-oriented language). Thus compilation is a fundamental concept in the production of software: it is the link between the (abstract) world of application development and the low-level world of application execution on machines.

Types of Translators. An assembler is also a type of translator:

    assembly program --> [ assembler ] --> machine program

An interpreter is closely related to a compiler, but takes both source program and input data. The translation and execution phases of the source program are one and the same.

    source program + input data --> [ interpreter ] --> output data

Although the above types of translator are the most well-known, we also need knowledge of compilation techniques to deal with the recognition and translation of many other types of languages, including:

- Command-line interface languages;
- Typesetting / word processing languages (e.g. TeX);
- Natural languages;
- Hardware description languages;
- Page description languages (e.g. PostScript);
- Set-up or parameter files.

Early Development of Compilers.

1940s. Early stored-program computers were programmed in machine language. Later, assembly languages were developed, where machine instructions and memory locations were given symbolic forms.

1950s. Early high-level languages were developed, for example FORTRAN. Although more problem-oriented than assembly languages, the first versions of FORTRAN still had many machine-dependent features. Techniques and processes involved in compilation were not well understood at this time, and compiler-writing was a huge task: e.g. the first FORTRAN compiler took 18 man-years of effort to write.

Chomsky's study of the structure of natural languages led to a classification of languages according to the complexity of their grammars. The context-free languages proved to be useful in describing the syntax of programming languages.

1960s onwards. The study of the parsing problem for context-free languages during the 1960s and 1970s has led to efficient algorithms for the recognition of context-free languages. These algorithms, and associated software tools, are central to compiler construction today. Similarly, the theory of finite state machines and regular expressions (which correspond to Chomsky's regular languages) has proven useful for describing the lexical structure of programming languages.

From Algol 60, high-level languages have become more problem-oriented and machine-independent, with features much removed from the machine languages into which they are compiled. The theory and tools available today make compiler construction a manageable task, even for complex languages. For example, your compiler assignment will take only a few weeks (hopefully) and will only be about 1000 lines of code (although, admittedly, the source language is small).

1.2 The Context of a Compiler

The complete process of compilation is illustrated as:

    skeletal source program
      --> [ preprocessor ]    --> source program
      --> [ compiler ]        --> assembly program
      --> [ assembler ]       --> relocatable m/c code
      --> [ link/load editor ] --> absolute m/c code

1.2.1 Preprocessors

Preprocessing performs (usually simple) operations on the source file(s) prior to compilation. Typical preprocessing operations include:

(a) Expanding macros (shorthand notations for longer constructs). For example, in C,

    #define foo(x,y) (3*x+y*(2+x))

defines a macro foo that, when used later in the program, is expanded by the preprocessor. For example, a = foo(a,b) becomes

    a = (3*a+b*(2+a))

(b) Inserting named files. For example, in C,

    #include "header.h"

is replaced by the contents of the file header.h.

1.2.2 Linkers

A linker combines object code (machine code that has not yet been linked) produced from compiling and assembling many source programs, as well as standard library functions and resources supplied by the operating system. This involves resolving references in each object file to external variables and procedures declared in other files.
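As a concrete illustration of the references a linker must resolve, consider a C program split across two files. (This is a sketch, not from the notes; the file and symbol names are invented.)

    /* counter.c -- defines a global variable and a procedure */
    int counter = 0;

    void increment(void)
    {
        counter = counter + 1;
    }

    /* main.c -- uses them; "extern" tells the compiler that the
       definitions live in some other object file, so the compiler
       emits unresolved references for the linker to bind */
    extern int counter;
    extern void increment(void);

    int main(void)
    {
        increment();
        return counter;    /* exit status 1 */
    }

Compiling each file separately (e.g. gcc -c main.c and gcc -c counter.c) produces object files in which the references to counter and increment in main.o are unresolved; linking them (gcc main.o counter.o) binds each reference to its definition.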

1.2.3 Loaders

Compilers, assemblers and linkers usually produce code whose memory references are made relative to an undetermined starting location that can be anywhere in memory (relocatable machine code). A loader calculates appropriate absolute addresses for these memory locations and amends the code to use these addresses.

1.3 The Phases of a Compiler

The process of compilation is split up into six phases, each of which interacts with a symbol table manager and an error handler. This is called the analysis/synthesis model of compilation. There are many variants on this model, but the essential elements are the same.

    source program
      --> lexical analyser
      --> syntax analyser
      --> semantic analyser
      --> intermediate code generator
      --> code optimizer
      --> code generator
      --> target program

    (each phase interacting with the symbol-table manager
     and the error handler)

1.3.1 Lexical Analysis

A lexical analyser or scanner is a program that groups sequences of characters into lexemes, and outputs (to the syntax analyser) a sequence of tokens. Here:

(a) Tokens are symbolic names for the entities that make up the text of the program; e.g. if for the keyword if, and id for any identifier. These make up the output of the lexical analyser.

(b) A pattern is a rule that specifies when a sequence of characters from the input constitutes a token; e.g. the sequence i, f for the token if, and any sequence of alphanumerics starting with a letter for the token id.

(c) A lexeme is a sequence of characters from the input that matches a pattern (and hence constitutes an instance of a token); for example if matches the pattern for if, and foo123bar matches the pattern for id.

For example, the following code might result in the table given below.

    program foo(input,output); var x:integer;
    begin readln(x); writeln('value read = ',x) end.

    Lexeme           Token                      Pattern
    ---------------------------------------------------------------------------
    program          program                    p, r, o, g, r, a, m
    (whitespace)                                newlines, spaces, tabs
    foo              id (foo)                   letter followed by seq. of alphanumerics
    (                leftpar                    a left parenthesis
    input            input                      i, n, p, u, t
    ,                comma                      a comma
    output           output                     o, u, t, p, u, t
    )                rightpar                   a right parenthesis
    ;                semicolon                  a semi-colon
    var              var                        v, a, r
    x                id (x)                     letter followed by seq. of alphanumerics
    :                colon                      a colon
    integer          integer                    i, n, t, e, g, e, r
    ;                semicolon                  a semi-colon
    begin            begin                      b, e, g, i, n
    (whitespace)                                newlines, spaces, tabs
    readln           readln                     r, e, a, d, l, n
    (                leftpar                    a left parenthesis
    x                id (x)                     letter followed by seq. of alphanumerics
    )                rightpar                   a right parenthesis
    ;                semicolon                  a semi-colon
    writeln          writeln                    w, r, i, t, e, l, n
    (                leftpar                    a left parenthesis
    'value read = '  literal ('value read = ')  seq. of chars enclosed in quotes
    ,                comma                      a comma
    x                id (x)                     letter followed by seq. of alphanumerics
    )                rightpar                   a right parenthesis
    (whitespace)                                newlines, spaces, tabs
    end              end                        e, n, d
    .                fullstop                   a fullstop

It is the sequence of tokens in the middle column that is passed as output to the syntax analyser. This token sequence represents almost all the important information from the input program required by the syntax analyser. Whitespace (newlines, spaces and tabs), although often important in separating lexemes, is usually not returned as a token. Also, when outputting an id or literal token, the lexical analyser must also return the value of the matched lexeme (shown in parentheses), or else this information would be lost.

1.3.2 Symbol Table Management

A symbol table is a data structure containing all the identifiers (i.e. names of variables, procedures etc.) of a source program, together with all the attributes of each identifier. For variables, typical attributes include: its type, how much memory it occupies, its scope. For procedures and functions, typical attributes include: the number and type of each argument (if any), the method of passing each argument, and the type of value returned (if any). The purpose of the symbol table is to provide quick and uniform access to identifier attributes throughout the compilation process. Information is usually put into the symbol table during the lexical analysis and/or syntax analysis phases.

1.3.3 Syntax Analysis

A syntax analyser or parser is a program that groups sequences of tokens from the lexical analysis phase into phrases, each with an associated phrase type. A phrase is a logical unit with respect to the rules of the source language. For example, consider:

    a := x * y + z

After lexical analysis, this statement has the structure

    id1 assign id2 binop1 id3 binop2 id4

Now, a syntactic rule of Pascal is that there are objects called expressions for which the rules are (essentially):

(1) Any constant or identifier is an expression.
(2) If exp1 and exp2 are expressions then so is exp1 binop exp2.

Taking all the identifiers to be variable names for simplicity, we have: by rule (1), exp1 = id2 and exp2 = id3 are both phrases with phrase type expression; by rule (2), exp3 = exp1 binop1 exp2 is also a phrase with phrase type expression; by rule (1), exp4 = id4 is a phrase with phrase type expression; by rule (2), exp5 = exp3 binop2 exp4 is a phrase with phrase type expression.

Of course, Pascal also has a rule that says

    id assign exp

is a phrase with phrase type assignment, and so the Pascal statement above is a phrase of type assignment.

Parse Trees and Syntax Trees. The structure of a phrase is best thought of as a parse tree or a syntax tree. A parse tree is a tree that illustrates the grouping of tokens into phrases. A syntax tree is a compacted form of parse tree in which the operators appear as the interior nodes. The construction of a parse tree is a basic activity in compiler-writing. A parse tree for the example Pascal statement is:

                assignment
               /    |     \
            id1  assign   exp5
                        /   |    \
                    exp3  binop2  exp4
                   /   |   \        |
               exp1 binop1 exp2    id4
                 |           |
                id2         id3

and a syntax tree is:

            assign
           /      \
        id1      binop2
                /      \
            binop1     id4
           /      \
         id2      id3

Comment. The distinction between lexical and syntactical analysis sometimes seems arbitrary. The main criterion is whether the analyser needs recursion or not:

- lexical analysers hardly ever use recursion; they are sometimes called linear analysers since they scan the input in a straight line (from left to right);
- syntax analysers almost always use recursion; this is because phrase types are often defined in terms of themselves (cf. the phrase type expression above).

1.3.4 Semantic Analysis

A semantic analyser takes its input from the syntax analysis phase in the form of a parse tree and a symbol table. Its purpose is to determine if the input has a well-defined meaning; in practice semantic analysers are mainly concerned with type checking and type coercion based on type rules. Typical type rules for expressions and assignments are:

Expression Type Rules. Let exp be an expression.

(a) If exp is a constant then exp is well-typed and its type is the type of the constant.
(b) If exp is a variable then exp is well-typed and its type is the type of the variable.
(c) If exp is an operator applied to further subexpressions such that:
    (i) the operator is applied to the correct number of subexpressions,
    (ii) each subexpression is well-typed and
    (iii) each subexpression is of an appropriate type,
    then exp is well-typed and its type is the result type of the operator.

Assignment Type Rules. Let var be a variable of type T1 and let exp be a well-typed expression of type T2. If

(a) T1 = T2 and
(b) T1 is an assignable type

then var assign exp is a well-typed assignment. For example, consider the following code fragment:

    intvar := intvar + realarray

where intvar is stored in the symbol table as being an integer variable, and realarray as an array of reals. In Pascal this assignment is syntactically correct, but semantically incorrect since + is only defined on numbers, whereas its second argument is an array. The semantic analyser checks for such type errors using the parse tree, the symbol table and type rules.

1.3.5 Error Handling

Each of the six phases (but mainly the analysis phases) of a compiler can encounter errors. On detecting an error the compiler must:

- report the error in a helpful way,
- correct the error if possible, and
- continue processing (if possible) after the error to look for further errors.

Types of Error. Errors are either syntactic or semantic.

Syntax errors are errors in the program text; they may be either lexical or grammatical:

(a) A lexical error is a mistake in a lexeme, for example, typing tehn instead of then, or missing off one of the quotes in a literal.
(b) A grammatical error is one that violates the (grammatical) rules of the language, for example if x = 7 y := 4 (missing then).

Semantic errors are mistakes concerning the meaning of a program construct; they may be either type errors, logical errors or run-time errors:

(a) Type errors occur when an operator is applied to an argument of the wrong type, or to the wrong number of arguments.

(b) Logical errors occur when a badly conceived program is executed, for example:

    while x = y do ...

when x and y initially have the same value and the body of the loop need not change the value of either x or y.

(c) Run-time errors are errors that can be detected only when the program is executed, for example:

    var x : real; readln(x); writeln(1/x)

which would produce a run-time error if the user inputs 0.

Syntax errors must be detected by a compiler and at least reported to the user (in a helpful way). If possible, the compiler should make the appropriate correction(s). Semantic errors are much harder and sometimes impossible for a computer to detect.

1.3.6 Intermediate Code Generation

After the analysis phases of the compiler have been completed, a source program has been decomposed into a symbol table and a parse tree, both of which may have been modified by the semantic analyser. From this information we begin the process of generating object code according to either of two approaches:

(1) generate code for a specific machine, or
(2) generate code for a general or abstract machine, then use further translators to turn the abstract code into code for specific machines.

Approach (2) is more modular and efficient provided the abstract machine language is simple enough to:

(a) produce and analyse (in the optimisation phase), and
(b) be easily translated into the required language(s).

One of the most widely used intermediate languages is Three-Address Code (TAC).

TAC Programs. A TAC program is a sequence of optionally labelled instructions. Some common TAC instructions include:

(i) var1 := var2 binop var3
(ii) var1 := unop var2
(iii) var1 := num

(iv) goto label
(v) if var1 relop var2 goto label

There are also TAC instructions for addresses and pointers, arrays and procedure calls, but we will use only the above for the following discussion.

Syntax-Directed Code Generation. In essence, code is generated by recursively walking through a parse (or syntax) tree, and hence the process is referred to as syntax-directed code generation. For example, consider the code fragment:

    z := x * y + x

and its syntax tree (with lexemes replacing tokens):

          :=
         /  \
        z    +
            / \
           *   x
          / \
         x   y

We use this tree to direct the compilation into TAC as follows. At the root of the tree we see an assignment whose right-hand side is an expression, and this expression is the sum of two quantities. Assume that we can produce TAC code that computes the value of the first and second summands and stores these values in temp1 and temp2 respectively. Then the appropriate TAC for the assignment statement is just

    z := temp1 + temp2

Next we consider how to compute the values of temp1 and temp2 in the same top-down recursive way. For temp1 we see that it is the product of two quantities. Assume that we can produce TAC code that computes the value of the first and second multiplicands and stores these values in temp3 and temp4 respectively. Then the appropriate TAC for computing temp1 is

    temp1 := temp3 * temp4

Continuing the recursive walk, we consider temp3. Here we see it is just the variable x and thus the TAC code

    temp3 := x

is sufficient. Next we come to temp4 and, similarly to temp3, the appropriate code is

    temp4 := y

Finally, considering temp2, of course

    temp2 := x

suffices. Each code fragment is output when we leave the corresponding node; this results in the final program:

    temp3 := x
    temp4 := y
    temp1 := temp3 * temp4
    temp2 := x
    z := temp1 + temp2

Comment. Notice how a compound expression has been broken down and translated into a sequence of very simple instructions, and furthermore, the process of producing the TAC code was uniform and simple. Some redundancy has been brought into the TAC code, but this can be removed (along with redundancy that is not due to the TAC generation) in the optimisation phase.

1.3.7 Code Optimisation

An optimiser attempts to improve the time and space requirements of a program. There are many ways in which code can be optimised, but most are expensive in terms of time and space to implement. Common optimisations include:

- removing redundant identifiers,
- removing unreachable sections of code,
- identifying common subexpressions,
- unfolding loops and

- eliminating procedures.

Note that here we are concerned with the general optimisation of abstract code.

Example. Consider the TAC code:

        temp1 := x
        temp2 := temp1
        if temp1 = temp2 goto 200
        temp3 := temp1 * y
        goto 300
    200 temp3 := z
    300 temp4 := temp2 + temp3

Removing redundant identifiers (just temp2) gives

        temp1 := x
        if temp1 = temp1 goto 200
        temp3 := temp1 * y
        goto 300
    200 temp3 := z
    300 temp4 := temp1 + temp3

Removing redundant code gives

        temp1 := x
    200 temp3 := z
    300 temp4 := temp1 + temp3

Notes. Attempting to find a best optimisation is expensive for the following reasons:

- A given optimisation technique may have to be applied repeatedly until no further optimisation can be obtained. (For example, removing one redundant identifier may introduce another.)
- A given optimisation technique may give rise to other forms of redundancy, and thus sequences of optimisation techniques may have to be repeated. (For example, above we removed a redundant identifier and this gave rise to redundant code, but removing redundant code may lead to further redundant identifiers.)
- The order in which optimisations are applied may be significant. (How many ways are there of applying n optimisation techniques to a given piece of code?)
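To make one of these optimisations concrete, here is a sketch in C (ours, not the notes' own code; the type and function names are invented) of removing redundant identifiers by eliminating copies over straight-line TAC. It is sound here only because each temporary is assigned at most once, as in the code the TAC generator above produces.

    #include <stdio.h>
    #include <string.h>

    /* A TAC instruction dst := src1 op src2; op = ' ' marks a plain
       copy dst := src1 (src2 empty). Labels and jumps are omitted. */
    typedef struct {
        char dst[8], src1[8], src2[8];
        char op;
    } Tac;

    /* Replace every later use of 'from' by 'to'. */
    static void substitute(Tac *code, int n, int start,
                           const char *from, const char *to)
    {
        for (int j = start; j < n; j++) {
            if (strcmp(code[j].src1, from) == 0) strcpy(code[j].src1, to);
            if (strcmp(code[j].src2, from) == 0) strcpy(code[j].src2, to);
        }
    }

    /* Delete each copy "x := y" and use y in place of x afterwards.
       Returns the new number of instructions. */
    int remove_copies(Tac *code, int n)
    {
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (code[i].op == ' ')
                substitute(code, n, i + 1, code[i].dst, code[i].src1);
            else
                code[kept++] = code[i];
        }
        return kept;
    }

    int main(void)
    {
        /* temp1 := x; temp2 := temp1; temp4 := temp2 + temp3 */
        Tac code[] = { { "temp1", "x",     "",      ' ' },
                       { "temp2", "temp1", "",      ' ' },
                       { "temp4", "temp2", "temp3", '+' } };
        int n = remove_copies(code, 3);
        for (int i = 0; i < n; i++)    /* prints: temp4 := x + temp3 */
            printf("%s := %s %c %s\n",
                   code[i].dst, code[i].src1, code[i].op, code[i].src2);
        return 0;
    }

A real optimiser must be more careful: if a name can be reassigned, substitution is only valid up to the next assignment, and, as noted above, passes typically have to be repeated until no further improvement is found.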

1.3.8 Code Generation

The final phase of the compiler is to generate code for a specific machine. In this phase we consider: memory management, register assignment and machine-specific optimisation. The output from this phase is usually assembly language or relocatable machine code.

Example. The TAC code above could typically result in the ARM assembly program shown below. Note that the example illustrates a mechanical translation of TAC into ARM; it is not intended to illustrate compact ARM programming!

    .x    EQUD 0            four bytes for x
    .z    EQUD 0            four bytes for z
    .temp EQUD 0            four bytes each for temp1,
          EQUD 0              temp3, and
          EQUD 0              temp4
    .prog MOV R12,#temp     R12 = base address
          MOV R0,#x         R0 = address of x
          LDR R1,[R0]       R1 = value of x
          STR R1,[R12]      store R1 at R12
          MOV R0,#z         R0 = address of z
          LDR R1,[R0]       R1 = value of z
          STR R1,[R12,#4]   store R1 at R12+4
          LDR R1,[R12]      R1 = value of temp1
          LDR R2,[R12,#4]   R2 = value of temp3
          ADD R3,R1,R2      add temp1 to temp3
          STR R3,[R12,#8]   store R3 at R12+8
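As a postscript to this chapter, the following is a sketch in C of the recursive walk described in Section 1.3.6 (not the notes' own code; the node layout and function names are invented). It emits TAC for the tree of z := x * y + x; the temporaries come out numbered in a slightly different order from the hand-worked example, but the program produced is equivalent.

    #include <stdio.h>

    /* An expression-tree node: a leaf holds a variable name;
       an interior node holds an operator and two subtrees. */
    typedef struct Node {
        char op;                  /* '+' or '*', or 0 for a leaf */
        const char *name;         /* variable name when a leaf   */
        struct Node *left, *right;
    } Node;

    static int next_temp = 0;

    /* Emit TAC that computes the value of e; return the number of
       the temporary holding the result. Code for a node is output
       on leaving that node, as described in Section 1.3.6. */
    int gen(Node *e)
    {
        int t = ++next_temp;
        if (e->op == 0) {
            printf("temp%d := %s\n", t, e->name);
        } else {
            int t1 = gen(e->left);
            int t2 = gen(e->right);
            printf("temp%d := temp%d %c temp%d\n", t, t1, e->op, t2);
        }
        return t;
    }

    int main(void)
    {
        /* the syntax tree of  z := x * y + x  */
        Node x1  = { 0,   "x" }, y = { 0, "y" }, x2 = { 0, "x" };
        Node mul = { '*', 0, &x1, &y };
        Node add = { '+', 0, &mul, &x2 };
        printf("z := temp%d\n", gen(&add));   /* gen's output comes first */
        return 0;
    }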

2 Languages

In this section we introduce the formal notion of a language, and the basic problem of recognising strings from a language. These are central concepts that we will use throughout the remainder of the course.

Note. This section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.

2.1 Basic Definitions

An alphabet Σ is a finite non-empty set (of symbols).

A string or word over an alphabet Σ is a finite concatenation (or juxtaposition) of symbols from Σ. The length of a string w (that is, the number of characters comprising it) is denoted |w|. The empty or null string is denoted ǫ. (That is, ǫ is the unique string satisfying |ǫ| = 0.)

The set of all strings over Σ is denoted Σ*. For each n ≥ 0 we define

    Σ^n = { w ∈ Σ* : |w| = n }.

We define

    Σ+ = ⋃_{n≥1} Σ^n.

(Thus Σ* = Σ+ ∪ {ǫ}.)

For a symbol or word x, x^n denotes x concatenated with itself n times, with the convention that x^0 denotes ǫ.

A language over Σ is a set L ⊆ Σ*. Two languages L1 and L2 over a common alphabet Σ are equal if they are equal as sets. Thus L1 = L2 if, and only if, L1 ⊆ L2 and L2 ⊆ L1.

2.2 Decidability

Given a language L over some alphabet Σ, a basic question is: for each possible word w ∈ Σ*, can we effectively decide if w is a member of L or not? We call this the decision problem for L. Note the use of the word "effectively": this implies the mechanism by which we decide on membership (or non-membership) must be a finitistic, deterministic and mechanical procedure

that can be carried out by some form of computing agent. Also note the decision problem asks if a given word is a member of L or not; that is, it is not sufficient to be only able to decide when words are members of L.

More precisely then, a language L ⊆ Σ* is said to be decidable if there exists an algorithm such that for every w ∈ Σ*:

(1) the algorithm terminates with output Yes when w ∈ L, and
(2) the algorithm terminates with output No when w ∉ L.

If no such algorithm exists then L is said to be undecidable.

Note. Decidability is based on the notion of an algorithm. In standard theoretical computer science this is taken to mean a Turing Machine; this is an abstract, but extremely low-level, model of computation that is equivalent to a digital computer with an infinite memory. Thus it is sufficient in practice to use a more convenient model of computation, such as Pascal programs, provided that any decidability arguments we make assume an infinite memory.

Example. Let Σ = {0, 1} be an alphabet. Let L be the (infinite) language L = { w ∈ Σ* : w = 0^n 1 for some n }. Does the program below solve the decision problem for L?

    read( char );
    if char = END_OF_STRING then
       print( "No" )
    else /* char must be 0 or 1 */
       while char = 0 do read( char ) od;
       /* char must be 1 or END_OF_STRING */
       if char = 1 then print( "Yes" ) else print( "No" ) fi
    fi

Answer: No. The program prints "Yes" as soon as it reads a 1, without checking that the 1 is the last symbol of the input; for example the string 011, which is not in L, is accepted. (A corrected sketch is given at the end of this section.)

2.3 Basic Facts

(1) Every finite language is decidable. (Hence every undecidable language is infinite.)

(2) Not every infinite language is undecidable.

(3) Programming languages are (usually) infinite but (always) decidable. (Why?)

2.4 Applications to Compilation

Languages may be classified by the means by which they are defined. Of interest to us are regular languages and context-free languages.

Regular Languages. The significant aspects of regular languages are:

- they are defined by patterns called regular expressions;
- every regular language is decidable;
- the decision problem for any regular language is solved by a deterministic finite state automaton (DFA); and
- programming languages' lexical patterns are specified using regular expressions, and lexical analysers are (essentially) DFAs.

Regular languages and their relationship to lexical analysis are the subjects of the next section.

Context-Free Languages. The significant aspects of context-free languages are:

- they are defined by rules called context-free grammars;
- every context-free language is decidable;
- the decision problem for any context-free language of interest to us is solved by a deterministic push-down automaton (DPDA); and
- programming language syntax is specified using context-free grammars, and (most) parsers are (essentially) DPDAs.

Context-free languages and their relationship to syntax analysis are the subjects of Sections 4 and 5.
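Finally, returning to the decision-problem example of Section 2.2: here is a sketch in C (the notes use Pascal-style pseudocode; this translation is ours) of a program that does solve the decision problem for L = { 0^n 1 : n ≥ 0 }, by also checking that nothing follows the 1.

    #include <stdio.h>

    /* Decide membership of L = { 0^n 1 : n >= 0 } for one word read
       from standard input; end-of-file or a newline ends the word. */
    int main(void)
    {
        int c = getchar();
        while (c == '0')                 /* consume the leading 0s        */
            c = getchar();
        if (c != '1') {                  /* next must be a single 1 ...   */
            printf("No\n");
            return 0;
        }
        c = getchar();                   /* ... followed by end of input  */
        printf((c == EOF || c == '\n') ? "Yes\n" : "No\n");
        return 0;
    }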

3 Lexical Analysis

In this section we study some theoretical concepts with respect to the class of regular languages and apply these concepts to the practical problem of lexical analysis. Firstly, in Section 3.1, we define the notion of a regular expression and show how regular expressions determine regular languages. We then, in Section 3.2, introduce deterministic finite automata (DFAs), the class of algorithms that solve the decision problems for regular languages. We show how regular expressions and DFAs can be used to specify and implement lexical analysers in Section 3.3, and in Section 3.4 we take a brief look at Lex, a popular lexical analyser generator built upon the theory of regular expressions and DFAs.

Note. This section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.

3.1 Regular Expressions

Recall from the Introduction that a lexical analyser uses pattern matching with respect to rules associated with the source language's tokens. For example, the token then is associated with the pattern t, h, e, n, and the token id might be associated with the pattern "an alphabetic character followed by any number of alphanumeric characters". The notation of regular expressions is a mathematical formalism ideal for expressing patterns such as these, and thus ideal for expressing the lexical structure of programming languages.

3.1.1 Definition

Regular expressions represent patterns of strings of symbols. A regular expression r matches a set of strings over an alphabet. This set is denoted L(r) and is called the language determined or generated by r.

Let Σ be an alphabet. We define the set RE(Σ) of regular expressions over Σ, the strings they match and thus the languages they determine, as follows:

- ∅ ∈ RE(Σ) matches no strings. The language determined is L(∅) = ∅.
- ǫ ∈ RE(Σ) matches only the empty string. Therefore L(ǫ) = {ǫ}.
- If a ∈ Σ then a ∈ RE(Σ) matches the string a. Therefore L(a) = {a}.
- If r and s are in RE(Σ) and determine the languages L(r) and L(s) respectively, then r|s ∈ RE(Σ) matches all strings matched either by r or by s. Therefore L(r|s) = L(r) ∪ L(s).

- rs ∈ RE(Σ) matches any string that is the concatenation of two strings, the first matching r and the second matching s. Therefore the language determined is L(rs) = L(r)L(s) = { uv ∈ Σ* : u ∈ L(r) and v ∈ L(s) }. (Given two sets S1 and S2 of strings, the notation S1S2 denotes the set of all strings formed by appending members of S1 to members of S2.)
- r* ∈ RE(Σ) matches all finite concatenations of strings which all match r. The language denoted is thus

      L(r*) = (L(r))* = ⋃_{i∈ℕ} (L(r))^i = {ǫ} ∪ L(r) ∪ L(r)L(r) ∪ ...

3.1.2 Regular Languages

Let L be a language over Σ. L is said to be a regular language if L = L(r) for some r ∈ RE(Σ).

3.1.3 Notation

We need to use parentheses to override the convention concerning the precedence of the operators. The normal convention is: * is higher than concatenation, which is higher than |. Thus, for example, a|bc* is a|(b(c*)).

We write r+ for rr*. We write r? for ǫ|r. We write r^n as an abbreviation for r...r (n times r), with r^0 denoting ǫ.

3.1.4 Lemma

Writing r = s to mean L(r) = L(s) for two regular expressions r, s ∈ RE(Σ), the following identities hold for all r, s, t ∈ RE(Σ):

    r|s = s|r              (| is commutative)
    (r|s)|t = r|(s|t)      (| is associative)
    (rs)t = r(st)          (concatenation is associative)
    r(s|t) = rs|rt         (concatenation
    (r|s)t = rt|st           distributes over |)

    ∅r = r∅ = ∅
    ∅|r = r|∅ = r
    ∅* = ǫ
    ǫ|r? = r?
    r*r* = r*
    (r*s*)* = (r|s)*
    ǫr = rǫ = r

3.1.5 Regular definitions

It is often useful to give names to complex regular expressions, and to use these names in place of the expressions they represent. Given an alphabet comprising all ASCII characters,

    letter = A|B|...|Z|a|b|...|z
    digit  = 0|1|...|9
    ident  = letter(letter|digit)*

are examples of regular definitions for letters, digits and identifiers.

3.1.6 The Decision Problem for Regular Languages

For every regular expression r ∈ RE(Σ) there exists a string-processing machine M = M(r) such that for every w ∈ Σ*, when input to M:

(1) if w ∈ L(r) then M terminates with output Yes, and
(2) if w ∉ L(r) then M terminates with output No.

Thus every regular language is decidable. The machines in question are Deterministic Finite State Automata.

3.2 Deterministic Finite State Automata

In this section we define the notion of a DFA without reference to its application in lexical analysis. Here we are interested purely in solving the decision problem for regular languages; that is, defining machines that say Yes or No given an input string, depending on its membership of a particular language. In Section 3.3 we use DFAs as the basis for lexical analysers: pattern-matching algorithms that output sequences of tokens.

3.2.1 Definition

A deterministic finite state automaton (or DFA) M is a 5-tuple

    M = (Q, Σ, δ, q0, F)

where

- Q is a finite non-empty set of states,
- Σ is an alphabet,
- δ : Q × Σ → Q is the transition or next-state function,
- q0 ∈ Q is the initial state, and
- F ⊆ Q is the set of accepting or final states.

The idea behind a DFA M is that it is an abstract machine that defines a language L(M) ⊆ Σ* in the following way:

- The machine begins in its start state q0.
- Given a string w ∈ Σ*, the machine reads the symbols of w one at a time from left to right.
- Each symbol causes the machine to make a transition from its current state to a new state; if the current state is q and the input symbol is a, then the new state is δ(q, a).
- The machine terminates when all the symbols of the string have been read.
- If, when the machine terminates, its state is a member of F, then the machine accepts w, else it rejects w.

Note the name "final state" is not a good one, since a DFA does not terminate as soon as a final state has been entered. The DFA only terminates when all the input has been read.

We formalise this idea as follows:

Definition. Let M = (Q, Σ, δ, q0, F) be a DFA. We define δ̂ : Q × Σ* → Q by

    δ̂(q, ǫ) = q                  for each q ∈ Q, and
    δ̂(q, aw) = δ̂(δ(q, a), w)     for each q ∈ Q, a ∈ Σ and w ∈ Σ*.

We define the language of M by

    L(M) = { w ∈ Σ* : δ̂(q0, w) ∈ F }.
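The definition of δ̂ is exactly a loop over the input symbols. As an illustration, here is a sketch in C (ours, not the notes') that runs the DFA M1 of the Examples below and answers its decision problem; it assumes input words contain only the symbols a and b.

    #include <stdio.h>

    /* The DFA M1 of Section 3.2.3: Q = {1,2,3,4}, Sigma = {a,b},
       q0 = 1, F = {4}; delta[q][x] with x = 0 for 'a', 1 for 'b'. */
    static const int delta[5][2] = {
        { 0, 0 },     /* dummy row so that states index from 1 */
        { 2, 3 },     /* delta(1,a) = 2, delta(1,b) = 3 */
        { 3, 4 },     /* delta(2,a) = 3, delta(2,b) = 4 */
        { 3, 3 },     /* delta(3,a) = 3, delta(3,b) = 3 */
        { 3, 4 },     /* delta(4,a) = 3, delta(4,b) = 4 */
    };

    /* Compute delta-hat(q0, w) and test membership of F. */
    int accepts(const char *w)
    {
        int q = 1;                       /* start state q0        */
        for (; *w; w++)
            q = delta[q][*w == 'b'];     /* one transition/symbol */
        return q == 4;                   /* F = {4}               */
    }

    int main(void)
    {
        const char *tests[] = { "ab", "abbb", "ba", "a", "" };
        for (int i = 0; i < 5; i++)
            printf("%-5s %s\n", tests[i], accepts(tests[i]) ? "Yes" : "No");
        return 0;
    }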

3.2.2 Transition Diagrams

DFAs are best understood by depicting them as transition diagrams; these are directed graphs with nodes representing states and labelled arcs between states representing transitions. A transition diagram for a DFA is drawn as follows:

(1) Draw a node labelled q for each state q ∈ Q.
(2) For every q ∈ Q and every a ∈ Σ, draw an arc labelled a from node q to node δ(q, a).
(3) Draw an unlabelled arc from outside the DFA to the node representing the initial state q0.
(4) Indicate each final state by drawing a concentric circle around its node to form a double circle.

3.2.3 Examples

Let M1 = (Q, Σ, δ, q0, F) where Q = {1, 2, 3, 4}, Σ = {a, b}, q0 = 1, F = {4} and where δ is given by:

    δ(1, a) = 2    δ(1, b) = 3
    δ(2, a) = 3    δ(2, b) = 4
    δ(3, a) = 3    δ(3, b) = 3
    δ(4, a) = 3    δ(4, b) = 4

From the transition diagram for M1 it is clear that:

    L(M1) = { w ∈ {a, b}* : δ̂(1, w) ∈ F }
          = { w ∈ {a, b}* : δ̂(1, w) = 4 }
          = { ab, abb, abbb, ..., ab^n, ... }
          = L(ab+).

Let M2 be obtained from M1 by adding states 1 and 2 to F. Then L(M2) = L(ǫ|ab*).

Let M3 be obtained from M1 by changing F to {3}. Then L(M3) = L((b|aa|abb*a)(a|b)*).

Simplifications to transition diagrams.

It is often the case that a DFA has an error state, that is, a non-accepting state from which there are no transitions other than back to the error state. In such a case it is convenient to apply the convention that any apparently missing transitions are transitions to the error state.

It is also common for there to be a large number of transitions between two given states in a DFA, which results in a cluttered transition diagram. For example, in an identifier-recognition DFA there may be 52 arcs, labelled with each of the lower- and upper-case letters, from the start state to a state representing that a single letter has been recognised. It is convenient in such cases to define a set comprising the labels of each of these arcs, for example

    letter = {a, b, c, ..., z, A, B, C, ..., Z} ⊆ Σ

and to replace the arcs by a single arc labelled by the name of this set, e.g. letter. It is acceptable practice to use these conventions provided it is made clear that they are in operation.

3.2.4 Equivalence Theorem

(1) For every r ∈ RE(Σ) there exists a DFA M with alphabet Σ such that L(M) = L(r).
(2) For every DFA M with alphabet Σ there exists an r ∈ RE(Σ) such that L(r) = L(M).

Proof. See J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, 1979).

Applications. The significance of the Equivalence Theorem is that its proof is constructive: there is an algorithm that, given a regular expression r, builds a DFA M such that L(M) = L(r). Thus, if we can write a fragment of a programming language's syntax in terms of regular expressions, then by part (1) of the Theorem we can automatically construct a lexical analyser for that fragment.

Part (2) of the Equivalence Theorem is a useful tool for showing that a language is regular, since if we cannot find a regular expression directly, part (2) states that it is sufficient to find a DFA that recognises the language.

The standard algorithm for constructing a DFA from a given regular expression is not difficult, but would require that we also take a look at nondeterministic finite state automata (NFAs). NFAs are equivalent in power to DFAs but are slightly harder to understand (see the course text for details). Given a regular expression, the RE-to-DFA algorithm first constructs an NFA equivalent to the RE (by a method known as Thompson's Construction), and then transforms the NFA into an equivalent DFA (by a method known as the Subset Construction).

3.3 DFAs for Lexical Analysis

Let's suppose we wish to construct a lexical analyser based on a DFA. We have seen that it is easy to construct a DFA that recognises lexemes for a given programming language token (e.g. for individual keywords, for identifiers, and for numbers). However, a lexical analyser has to deal with all of a programming language's lexical patterns, and has to repeatedly match sequences of characters against these patterns and output corresponding tokens. We illustrate how lexical analysers may be constructed using DFAs by means of an example.

3.3.1 An example DFA lexical analyser

Consider first writing a DFA for recognising tokens for a (minimal!) language with identifiers and the symbols +, ; and := (we'll add keywords later).

[Transition diagram for the DFA. From start: a letter leads to in_ident, which loops on letters and digits and exits on any other character (re-read) to done with token identifier (or keyword); ':' leads to in_assign, from which '=' leads to done with token assign and any other character to done with token error; '+' leads to done with token plus; ';' leads to done with token semi_colon; whitespace loops back to start; end_of_file leads to done with token end_of_input; any other character leads to done with token error.]

The lexical analyser code (see next section) consists of a procedure get_next_token which outputs the next token, which can be either identifier (for identifiers), plus (for +), semi_colon (for ;), assign (for :=), error (if an error occurs, for example if an invalid character such as '(' is read, or if a ':' is not followed by a '=') and end_of_input (for when the complete input file has been read). The lexical analyser based on the DFA begins in state start, and returns a token when it enters state done; the token returned depends on the final transition it takes to enter the done state, and is shown on the right-hand side of the diagram.

For a state with an output arc labelled "other", the intuition is that this transition is made on reading any character except those labelled on the state's other arcs; "(re-read)" denotes that the read character should not be consumed: it is re-read as the first character when get_next_token is next called.

Notice that adding keyword recognition to the above DFA would be tricky to do by hand and would lead to a complex DFA (why?). However, we can recognise keywords as identifiers using the above DFA, and when the accepting state for identifiers is entered the lexeme stored in the buffer can be checked against a table of keywords. If there is a match, then the appropriate keyword token is output, else the identifier token is output. Further, when we have recognised an identifier we will also wish to output its string value as well as the identifier token, so that this can be used by the next phase of compilation.

3.3.2 Code for the example DFA lexical analyser

Let's consider how code may be written based on the above DFA. Let's add the keywords if, then, else and fi to the language to make it slightly more realistic. Firstly, we define enumerated types for the sets of states and tokens:

    state = (start, in_identifier, in_assign, done);
    token = (k_if, k_then, k_else, k_fi, plus, identifier,
             assign, semi_colon, error, end_of_input);

Next, we define some variables that are shared by the lexical analyser and the syntax analyser. The job of the procedure get_next_token is to set the value of current_token to the next token, and if this token is identifier it also sets the value of current_identifier to the current lexeme. The value of the Boolean variable reread_character determines whether the last character read during the previous execution of get_next_token should be re-read at the beginning of its next execution. The current_character variable holds the value of the last character read by get_next_token.

    current_token      : token;
    current_identifier : string[100];
    reread_character   : boolean;
    current_character  : char;

We also need the following auxiliary functions (their implementations are omitted here) with the obvious interpretations:

    function is_alpha(c : char) : boolean;

    function is_digit(c : char) : boolean;
    function is_white_space(c : char) : boolean;

Finally, we define two constant arrays:

    { Constants used to recognise keyword matches }
    NUM_KEYWORDS = 4;
    token_tab   : array[1..NUM_KEYWORDS] of token  = (k_if, k_then, k_else, k_fi);
    keyword_tab : array[1..NUM_KEYWORDS] of string = ('if', 'then', 'else', 'fi');

that store keyword tokens and keywords (with associated keywords and tokens stored at the same location in each array), and a function that searches the keyword array for a string and returns the token associated with a matched keyword, or the token identifier if there is no match. Notice that the arrays and function are easily modified for any number of keywords appearing in our source language.

    function keyword_lookup(s : string) : token;
    { If s is a keyword, return this keyword's token;
      else return the identifier token }
    var
       index : integer;
       found : boolean;
    begin
       keyword_lookup := identifier;
       found := FALSE;
       index := 1;
       while (index <= NUM_KEYWORDS) and (not found) do
       begin
          if keyword_tab[index] = s then
          begin
             keyword_lookup := token_tab[index];
             found := TRUE
          end;
          index := index + 1
       end
    end;

The get_next_token procedure is implemented as follows. Notice that within the main loop a case statement is used to deal with transitions from the current state. After the loop exits (when the done state is entered), if an identifier has been recognised, keyword_lookup is used to check whether or not a keyword has been matched.

    procedure get_next_token;
    { Sets the value of current_token by matching input characters. Also
      sets the values of current_identifier and reread_character
      if appropriate }
    var
       current_state : state;
       no_more_input : boolean;
    begin
       current_state := start;
       current_identifier := '';
       while not (current_state = done) do
       begin
          no_more_input := eof;  { check whether at end of file }
          if not (reread_character or no_more_input) then
             read(current_character);
          reread_character := FALSE;
          case current_state of
             start:
                if no_more_input then
                begin
                   current_token := end_of_input;
                   current_state := done
                end
                else if is_white_space(current_character) then
                   current_state := start
                else if is_alpha(current_character) then
                begin
                   current_identifier := current_identifier + current_character;
                   current_state := in_identifier
                end
                else
                   case current_character of
                      ';' : begin
                               current_token := semi_colon;
                               current_state := done
                            end;
                      '+' : begin
                               current_token := plus;
                               current_state := done
                            end;
                      ':' : current_state := in_assign
                   else
                      begin
                         current_token := error;
                         current_state := done
                      end
                   end; { case }
             in_identifier:
                if (no_more_input or not (is_alpha(current_character) or
                                          is_digit(current_character))) then
                begin
                   current_token := identifier;

                   current_state := done;
                   reread_character := TRUE
                end
                else
                   current_identifier := current_identifier + current_character;
             in_assign:
                if no_more_input or (current_character <> '=') then
                begin
                   current_token := error;
                   current_state := done
                end
                else
                begin
                   current_token := assign;
                   current_state := done
                end
          end; { case }
       end; { while }
       if (current_token = identifier) then
          current_token := keyword_lookup(current_identifier);
    end;

Test code (in the absence of a syntax analyser) might be the following. This just repeatedly calls get_next_token until the end of the input file has been reached, and prints out the value of the read token.

    { Request tokens from the lexical analyser, outputting their
      values, until end_of_input }
    begin
       reread_character := FALSE;
       repeat
          get_next_token;
          writeln('Current Token is ', token_to_text(current_token));
          if (current_token = identifier) then
             writeln('Identifier is ', current_identifier)
       until (current_token = end_of_input)
    end.

where

    function token_to_text(t : token) : string;

converts token values to text.

3.4 Lex

Lex is a widely available lexical analyser generator.

3.4.1 Overview

Given a Lex source file comprising regular expressions for various tokens, Lex generates a lexical analyser (based on a DFA), written in C, that groups characters matching the expressions into lexemes, and can return their corresponding tokens. In essence, a Lex file comprises a number of lines, typically of the form:

    pattern    action

where pattern is a regular expression and action is a piece of C code. When run on a Lex file, Lex produces a C file called lex.yy.c (a lexical analyser). When compiled, lex.yy.c takes a stream of characters as input, and whenever a sequence of characters matches a given regular expression the corresponding action is executed. Characters not matching any regular expressions are simply copied to the output stream.

Example. Consider the Lex fragment:

    a    { printf( "read a\n" ); }
    b    { printf( "read b\n" ); }

After compiling (see below on how to do this) we obtain a binary executable which, when executed on the input

    sdfghjklaghjbfghjkbbdfghjk
    dfghjkaghjklaghjk

produces

    sdfghjklread a
    ghjread b
    fghjkread b
    read b
    dfghjk
    dfghjkread a
    ghjklread a
    ghjk

Example. Consider the Lex program:

    %{
    int abc_count, xyz_count;
    %}
    %%
    ab[cC]  { abc_count++; }
    xyz     { xyz_count++; }
    \n      { ; }
    .       { ; }
    %%
    main()
    {
      abc_count = xyz_count = 0;
      yylex();
      printf( "%d occurrences of abc or abC\n", abc_count );
      printf( "%d occurrences of xyz\n", xyz_count );
    }

This file first declares two global variables for counting the number of occurrences of abc or abC and of xyz. Next come the regular expressions for these lexemes, and actions to increment the relevant counters. Finally, there is a main routine to initialise the counters and call yylex(). When executed on input

    akhabfabcdbcaxyzxyzabchsdk
    dfhslkdxyzabcabcdkkjxyzkdf

the lexical analyser produces:

    4 occurrences of abc or abC
    3 occurrences of xyz

Some features of Lex illustrated by this example are:

(1) The character-class notation [ ]; for example, [cC] matches either c or C.
(2) The regular expression \n, which matches a newline.
(3) The regular expression ., which matches any character except a newline.
(4) The action { ; }, which does nothing except suppress printing.

3.4.2 Format of Lex Files

The format of a Lex file is:

    definitions
    %%
    analyser specification
    %%
    auxiliary functions

Lex Definitions. The (optional) definitions section comprises macros (see below) and global declarations of types, variables and functions to be used in the actions of the lexical analyser and the auxiliary functions (if present). All such global declaration code is written in C and surrounded by %{ and %}.

Macros are abbreviations for regular expressions to be used in the analyser specification. For example, the token identifier could be defined by:

    IDENTIFIER [a-zA-Z][a-zA-Z0-9]*

The shorthand character-range construction [x-y] matches any of the characters between (and including) x and y. For example, [a-c] means the same as a|b|c, and [a-cA-C] means the same as a|b|c|A|B|C.

Definitions may use other definitions (enclosed in braces), as illustrated in:

    ALPHA      [a-zA-Z]
    ALPHANUM   [a-zA-Z0-9]
    IDENTIFIER {ALPHA}{ALPHANUM}*

and:

    ALPHA      [a-zA-Z]
    NUM        [0-9]
    ALPHANUM   ({ALPHA}|{NUM})
    IDENTIFIER {ALPHA}{ALPHANUM}*

Notice the use of parentheses in the definition of ALPHANUM. What would happen without them?

Lex Analyser Specifications. These have the form:

    r1    { action1 }
    r2    { action2 }
    ...
    rn    { actionn }

where r1, r2, ..., rn are regular expressions (possibly involving macros enclosed in braces) and action1, action2, ..., actionn are sequences of C statements. Lex translates the specification into a function yylex() which, when called, causes the following to happen:

- The current input character(s) are scanned to look for a match with the regular expressions. If there is no match, the current character is printed out, and the scanning process resumes with the next character.
- If the next m characters match ri then (a) the matching characters are assigned to the string variable yytext, (b) the integer variable yyleng is assigned the value m, (c) the next m characters are skipped, and (d) actioni is executed. If the last instruction of actioni is return n; (where n is an integer expression) then the call to yylex() terminates and the value of n is returned as the function's value; otherwise yylex() resumes the scanning process.
- If end-of-file is read at any stage, then the call to yylex() terminates, returning the value 0.
- If there is a match against two or more regular expressions, then the expression giving the longest lexeme is chosen; if all lexemes are of the same length then the first matching expression is chosen.

Lex Auxiliary Functions. This optional section has the form:

    fun1
    fun2
    ...
    funn

where each funi is a complete C function.

We can also compile lex.yy.c with the lex library using the command:

    gcc lex.yy.c -ll

This has the effect of automatically including a standard main() function, equivalent to:

    main()
    {
      yylex();
      return;
    }

Thus, in the absence of any return statements in the analyser's actions, this one call to yylex() consumes all the input up to and including end-of-file.

3.4.3 Lexical Analyser Example

The Lex program below illustrates how a lexical analyser for a Pascal-type language is defined. Notice that the regular expression for identifiers is placed at the end of the list (why?). We assume that the syntax analyser requests tokens by repeatedly calling the function yylex(). The global variable yylval (of type integer in this example) is generally used to pass tokens' attributes from the lexical analyser to the syntax analyser, and is shared by both phases of the compiler. Here it is being used to pass integer values and identifiers' symbol table positions to the syntax analyser.

    %{
    definitions (as integers) of IF, THEN, ELSE, ID, INTEGER, ...
    %}
    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    integer  [+\-]?{digit}+
    %%
    {ws}      { ; }
    if        { return(IF); }
    then      { return(THEN); }
    else      { return(ELSE); }
    {integer} { yylval = atoi(yytext); return(INTEGER); }
    {id}      { yylval = InstallInTable(); return(ID); }
    %%
    int InstallInTable()
    {
      put yytext in the symbol table and
      return the position at which it has been inserted.
    }

4 Syntax Analysis

In this section we will look at the second phase of compilation: syntax analysis, or parsing. Since parsing is a central concept in compilation, and because (unlike lexical analysis) there are many approaches to parsing, this section makes up most of the remainder of the course.

In Section 4.1 we discuss the class of context-free languages and its relationship to the syntactic structure of programming languages and the compilation process. Parsing algorithms for context-free languages fall into two main categories: top-down and bottom-up parsers (the names refer to the process of parse tree construction). Different types of top-down and bottom-up parsing algorithms will be discussed in Sections 4.2 and 4.3 respectively.

4.1 Context-Free Languages

Regular languages are inadequate for specifying all but the simplest aspects of programming language syntax. To specify more complex languages such as

    L = { w ∈ {a, b}* : w = a^n b^n for some n },
    L = { w ∈ {(, )}* : w is a well-balanced string of parentheses }

and the syntax of most programming languages, we use context-free languages. In this section we define context-free grammars and languages, and their use in describing the syntax of programming languages. This section is intended to provide a foundation for the following sections on parsing and parser construction.

Note. Like Section 3, this section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.

4.1.1 Context-Free Grammars

Definition. A context-free grammar is a tuple G = (T, N, S, P) where:

- T is a finite nonempty set of (terminal) symbols (tokens),
- N is a finite nonempty set of (nonterminal) symbols (denoting phrase types), disjoint from T,
- S ∈ N (the start symbol), and
- P is a set of (context-free) productions (denoting rules for phrase types) of the form A → α, where A ∈ N and α ∈ (T ∪ N)*.

Notation. In what follows we use:

    a, b, c, ...     for members of T,
    A, B, C, ...     for members of N,
    ..., X, Y, Z     for members of T ∪ N,
    u, v, w, ...     for members of T*, and
    α, β, γ, ...     for members of (T ∪ N)*.

Examples.

(1) G1 = (T, N, S, P) where T = {a, b}, N = {S} and P = {S → ab, S → aSb}.

(2) G2 = (T, N, S, P) where T = {a, b}, N = {S, X} and P = {S → X, S → aa, S → bb, S → aSa, S → bSb, X → a, X → b}.

Notation. It is customary to define a context-free grammar by simply listing its productions and assuming:

- The terminals and nonterminals of the grammar are exactly those symbols appearing in the productions. (It is usually clear from the context whether a symbol is a terminal or a nonterminal.)
- The start symbol is the nonterminal on the left-hand side of the first production.
- Right-hand sides separated by | indicate alternatives.

For example, G2 above can be written as

    S → X | aa | bb | aSa | bSb
    X → a | b
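The grammar G1 generates exactly { a^n b^n : n ≥ 1 }, the first of the languages above that no DFA can recognise, since recognition requires counting an unbounded number of a's. Here is a sketch in C of a decider for it (ours, not from the notes), where a simple counter stands in for the stack of a push-down automaton:

    #include <stdio.h>

    /* Decide membership of L(G1) = { a^n b^n : n >= 1 }.
       The counter n plays the role of the DPDA's stack. */
    int in_L(const char *w)
    {
        size_t i = 0, n = 0;
        while (w[i] == 'a') { i++; n++; }   /* count the leading a's      */
        while (w[i] == 'b') {               /* match each b against an a  */
            if (n == 0) return 0;
            i++; n--;
        }
        /* accept iff all a's were matched, the whole word was
           consumed, and the word was non-empty */
        return n == 0 && w[i] == '\0' && i > 0;
    }

    int main(void)
    {
        const char *tests[] = { "ab", "aabb", "aab", "ba", "" };
        for (int i = 0; i < 5; i++)
            printf("%-5s %s\n", tests[i], in_L(tests[i]) ? "Yes" : "No");
        return 0;
    }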


More information

Pushdown Automata. place the input head on the leftmost input symbol. while symbol read = b and pile contains discs advance head remove disc from pile

Pushdown Automata. place the input head on the leftmost input symbol. while symbol read = b and pile contains discs advance head remove disc from pile Pushdown Automata In the last section we found that restricting the computational power of computing devices produced solvable decision problems for the class of sets accepted by finite automata. But along

More information

Programming Assignment II Due Date: See online CISC 672 schedule Individual Assignment

Programming Assignment II Due Date: See online CISC 672 schedule Individual Assignment Programming Assignment II Due Date: See online CISC 672 schedule Individual Assignment 1 Overview Programming assignments II V will direct you to design and build a compiler for Cool. Each assignment will

More information

Semantic Analysis: Types and Type Checking

Semantic Analysis: Types and Type Checking Semantic Analysis Semantic Analysis: Types and Type Checking CS 471 October 10, 2007 Source code Lexical Analysis tokens Syntactic Analysis AST Semantic Analysis AST Intermediate Code Gen lexical errors

More information

Syntaktická analýza. Ján Šturc. Zima 208

Syntaktická analýza. Ján Šturc. Zima 208 Syntaktická analýza Ján Šturc Zima 208 Position of a Parser in the Compiler Model 2 The parser The task of the parser is to check syntax The syntax-directed translation stage in the compiler s front-end

More information

Scanning and parsing. Topics. Announcements Pick a partner by Monday Makeup lecture will be on Monday August 29th at 3pm

Scanning and parsing. Topics. Announcements Pick a partner by Monday Makeup lecture will be on Monday August 29th at 3pm Scanning and Parsing Announcements Pick a partner by Monday Makeup lecture will be on Monday August 29th at 3pm Today Outline of planned topics for course Overall structure of a compiler Lexical analysis

More information

Informatica e Sistemi in Tempo Reale

Informatica e Sistemi in Tempo Reale Informatica e Sistemi in Tempo Reale Introduction to C programming Giuseppe Lipari http://retis.sssup.it/~lipari Scuola Superiore Sant Anna Pisa October 25, 2010 G. Lipari (Scuola Superiore Sant Anna)

More information

Moving from CS 61A Scheme to CS 61B Java

Moving from CS 61A Scheme to CS 61B Java Moving from CS 61A Scheme to CS 61B Java Introduction Java is an object-oriented language. This document describes some of the differences between object-oriented programming in Scheme (which we hope you

More information

Bachelors of Computer Application Programming Principle & Algorithm (BCA-S102T)

Bachelors of Computer Application Programming Principle & Algorithm (BCA-S102T) Unit- I Introduction to c Language: C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating

More information

CSC4510 AUTOMATA 2.1 Finite Automata: Examples and D efinitions Definitions

CSC4510 AUTOMATA 2.1 Finite Automata: Examples and D efinitions Definitions CSC45 AUTOMATA 2. Finite Automata: Examples and Definitions Finite Automata: Examples and Definitions A finite automaton is a simple type of computer. Itsoutputislimitedto yes to or no. It has very primitive

More information

How to make the computer understand? Lecture 15: Putting it all together. Example (Output assembly code) Example (input program) Anatomy of a Computer

How to make the computer understand? Lecture 15: Putting it all together. Example (Output assembly code) Example (input program) Anatomy of a Computer How to make the computer understand? Fall 2005 Lecture 15: Putting it all together From parsing to code generation Write a program using a programming language Microprocessors talk in assembly language

More information

Programming Project 1: Lexical Analyzer (Scanner)

Programming Project 1: Lexical Analyzer (Scanner) CS 331 Compilers Fall 2015 Programming Project 1: Lexical Analyzer (Scanner) Prof. Szajda Due Tuesday, September 15, 11:59:59 pm 1 Overview of the Programming Project Programming projects I IV will direct

More information

Finite Automata. Reading: Chapter 2

Finite Automata. Reading: Chapter 2 Finite Automata Reading: Chapter 2 1 Finite Automaton (FA) Informally, a state diagram that comprehensively captures all possible states and transitions that a machine can take while responding to a stream

More information

The programming language C. sws1 1

The programming language C. sws1 1 The programming language C sws1 1 The programming language C invented by Dennis Ritchie in early 1970s who used it to write the first Hello World program C was used to write UNIX Standardised as K&C (Kernighan

More information

1 Introduction. 2 An Interpreter. 2.1 Handling Source Code

1 Introduction. 2 An Interpreter. 2.1 Handling Source Code 1 Introduction The purpose of this assignment is to write an interpreter for a small subset of the Lisp programming language. The interpreter should be able to perform simple arithmetic and comparisons

More information

Scoping (Readings 7.1,7.4,7.6) Parameter passing methods (7.5) Building symbol tables (7.6)

Scoping (Readings 7.1,7.4,7.6) Parameter passing methods (7.5) Building symbol tables (7.6) Semantic Analysis Scoping (Readings 7.1,7.4,7.6) Static Dynamic Parameter passing methods (7.5) Building symbol tables (7.6) How to use them to find multiply-declared and undeclared variables Type checking

More information

3515ICT Theory of Computation Turing Machines

3515ICT Theory of Computation Turing Machines Griffith University 3515ICT Theory of Computation Turing Machines (Based loosely on slides by Harald Søndergaard of The University of Melbourne) 9-0 Overview Turing machines: a general model of computation

More information

CS5236 Advanced Automata Theory

CS5236 Advanced Automata Theory CS5236 Advanced Automata Theory Frank Stephan Semester I, Academic Year 2012-2013 Advanced Automata Theory is a lecture which will first review the basics of formal languages and automata theory and then

More information

Notes on Complexity Theory Last updated: August, 2011. Lecture 1

Notes on Complexity Theory Last updated: August, 2011. Lecture 1 Notes on Complexity Theory Last updated: August, 2011 Jonathan Katz Lecture 1 1 Turing Machines I assume that most students have encountered Turing machines before. (Students who have not may want to look

More information

CS154. Turing Machines. Turing Machine. Turing Machines versus DFAs FINITE STATE CONTROL AI N P U T INFINITE TAPE. read write move.

CS154. Turing Machines. Turing Machine. Turing Machines versus DFAs FINITE STATE CONTROL AI N P U T INFINITE TAPE. read write move. CS54 Turing Machines Turing Machine q 0 AI N P U T IN TAPE read write move read write move Language = {0} q This Turing machine recognizes the language {0} Turing Machines versus DFAs TM can both write

More information

Regular Languages and Finite State Machines

Regular Languages and Finite State Machines Regular Languages and Finite State Machines Plan for the Day: Mathematical preliminaries - some review One application formal definition of finite automata Examples 1 Sets A set is an unordered collection

More information

Automata and Formal Languages

Automata and Formal Languages Automata and Formal Languages Winter 2009-2010 Yacov Hel-Or 1 What this course is all about This course is about mathematical models of computation We ll study different machine models (finite automata,

More information

Chapter One Introduction to Programming

Chapter One Introduction to Programming Chapter One Introduction to Programming 1-1 Algorithm and Flowchart Algorithm is a step-by-step procedure for calculation. More precisely, algorithm is an effective method expressed as a finite list of

More information

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C Embedded Systems A Review of ANSI C and Considerations for Embedded C Programming Dr. Jeff Jackson Lecture 2-1 Review of ANSI C Topics Basic features of C C fundamentals Basic data types Expressions Selection

More information

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Exam Name MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. 1) The JDK command to compile a class in the file Test.java is A) java Test.java B) java

More information

C Compiler Targeting the Java Virtual Machine

C Compiler Targeting the Java Virtual Machine C Compiler Targeting the Java Virtual Machine Jack Pien Senior Honors Thesis (Advisor: Javed A. Aslam) Dartmouth College Computer Science Technical Report PCS-TR98-334 May 30, 1998 Abstract One of the

More information

Automata Theory. Şubat 2006 Tuğrul Yılmaz Ankara Üniversitesi

Automata Theory. Şubat 2006 Tuğrul Yılmaz Ankara Üniversitesi Automata Theory Automata theory is the study of abstract computing devices. A. M. Turing studied an abstract machine that had all the capabilities of today s computers. Turing s goal was to describe the

More information

The previous chapter provided a definition of the semantics of a programming

The previous chapter provided a definition of the semantics of a programming Chapter 7 TRANSLATIONAL SEMANTICS The previous chapter provided a definition of the semantics of a programming language in terms of the programming language itself. The primary example was based on a Lisp

More information

Formal Grammars and Languages

Formal Grammars and Languages Formal Grammars and Languages Tao Jiang Department of Computer Science McMaster University Hamilton, Ontario L8S 4K1, Canada Bala Ravikumar Department of Computer Science University of Rhode Island Kingston,

More information

Finite Automata. Reading: Chapter 2

Finite Automata. Reading: Chapter 2 Finite Automata Reading: Chapter 2 1 Finite Automata Informally, a state machine that comprehensively captures all possible states and transitions that a machine can take while responding to a stream (or

More information

A Lex Tutorial. Victor Eijkhout. July 2004. 1 Introduction. 2 Structure of a lex file

A Lex Tutorial. Victor Eijkhout. July 2004. 1 Introduction. 2 Structure of a lex file A Lex Tutorial Victor Eijkhout July 2004 1 Introduction The unix utility lex parses a file of characters. It uses regular expression matching; typically it is used to tokenize the contents of the file.

More information

Regular Languages and Finite Automata

Regular Languages and Finite Automata Regular Languages and Finite Automata 1 Introduction Hing Leung Department of Computer Science New Mexico State University Sep 16, 2010 In 1943, McCulloch and Pitts [4] published a pioneering work on a

More information

(IALC, Chapters 8 and 9) Introduction to Turing s life, Turing machines, universal machines, unsolvable problems.

(IALC, Chapters 8 and 9) Introduction to Turing s life, Turing machines, universal machines, unsolvable problems. 3130CIT: Theory of Computation Turing machines and undecidability (IALC, Chapters 8 and 9) Introduction to Turing s life, Turing machines, universal machines, unsolvable problems. An undecidable problem

More information

Introduction to Java Applications. 2005 Pearson Education, Inc. All rights reserved.

Introduction to Java Applications. 2005 Pearson Education, Inc. All rights reserved. 1 2 Introduction to Java Applications 2.2 First Program in Java: Printing a Line of Text 2 Application Executes when you use the java command to launch the Java Virtual Machine (JVM) Sample program Displays

More information

How To Port A Program To Dynamic C (C) (C-Based) (Program) (For A Non Portable Program) (Un Portable) (Permanent) (Non Portable) C-Based (Programs) (Powerpoint)

How To Port A Program To Dynamic C (C) (C-Based) (Program) (For A Non Portable Program) (Un Portable) (Permanent) (Non Portable) C-Based (Programs) (Powerpoint) TN203 Porting a Program to Dynamic C Introduction Dynamic C has a number of improvements and differences compared to many other C compiler systems. This application note gives instructions and suggestions

More information

Introduction to Turing Machines

Introduction to Turing Machines Automata Theory, Languages and Computation - Mírian Halfeld-Ferrari p. 1/2 Introduction to Turing Machines SITE : http://www.sir.blois.univ-tours.fr/ mirian/ Automata Theory, Languages and Computation

More information

University of Toronto Department of Electrical and Computer Engineering. Midterm Examination. CSC467 Compilers and Interpreters Fall Semester, 2005

University of Toronto Department of Electrical and Computer Engineering. Midterm Examination. CSC467 Compilers and Interpreters Fall Semester, 2005 University of Toronto Department of Electrical and Computer Engineering Midterm Examination CSC467 Compilers and Interpreters Fall Semester, 2005 Time and date: TBA Location: TBA Print your name and ID

More information

Deterministic Finite Automata

Deterministic Finite Automata 1 Deterministic Finite Automata Definition: A deterministic finite automaton (DFA) consists of 1. a finite set of states (often denoted Q) 2. a finite set Σ of symbols (alphabet) 3. a transition function

More information

Basics of Compiler Design

Basics of Compiler Design Basics of Compiler Design Anniversary edition Torben Ægidius Mogensen DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF COPENHAGEN Published through lulu.com. c Torben Ægidius Mogensen 2000 2010 torbenm@diku.dk

More information

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design i About the Tutorial A compiler translates the codes written in one language to some other language without changing the meaning of the program. It is also expected that a compiler should make the target

More information

JavaScript: Introduction to Scripting. 2008 Pearson Education, Inc. All rights reserved.

JavaScript: Introduction to Scripting. 2008 Pearson Education, Inc. All rights reserved. 1 6 JavaScript: Introduction to Scripting 2 Comment is free, but facts are sacred. C. P. Scott The creditor hath a better memory than the debtor. James Howell When faced with a decision, I always ask,

More information

7.1 Our Current Model

7.1 Our Current Model Chapter 7 The Stack In this chapter we examine what is arguably the most important abstract data type in computer science, the stack. We will see that the stack ADT and its implementation are very simple.

More information

Mathematics for Computer Science/Software Engineering. Notes for the course MSM1F3 Dr. R. A. Wilson

Mathematics for Computer Science/Software Engineering. Notes for the course MSM1F3 Dr. R. A. Wilson Mathematics for Computer Science/Software Engineering Notes for the course MSM1F3 Dr. R. A. Wilson October 1996 Chapter 1 Logic Lecture no. 1. We introduce the concept of a proposition, which is a statement

More information

Semester Review. CSC 301, Fall 2015

Semester Review. CSC 301, Fall 2015 Semester Review CSC 301, Fall 2015 Programming Language Classes There are many different programming language classes, but four classes or paradigms stand out:! Imperative Languages! assignment and iteration!

More information

LEX/Flex Scanner Generator

LEX/Flex Scanner Generator Compilers: CS31003 Computer Sc & Engg: IIT Kharagpur 1 LEX/Flex Scanner Generator Compilers: CS31003 Computer Sc & Engg: IIT Kharagpur 2 flex - Fast Lexical Analyzer Generator We can use flex a to automatically

More information

CS 3719 (Theory of Computation and Algorithms) Lecture 4

CS 3719 (Theory of Computation and Algorithms) Lecture 4 CS 3719 (Theory of Computation and Algorithms) Lecture 4 Antonina Kolokolova January 18, 2012 1 Undecidable languages 1.1 Church-Turing thesis Let s recap how it all started. In 1990, Hilbert stated a

More information

Computability Theory

Computability Theory CSC 438F/2404F Notes (S. Cook and T. Pitassi) Fall, 2014 Computability Theory This section is partly inspired by the material in A Course in Mathematical Logic by Bell and Machover, Chap 6, sections 1-10.

More information

Flex/Bison Tutorial. Aaron Myles Landwehr aron+ta@udel.edu CAPSL 2/17/2012

Flex/Bison Tutorial. Aaron Myles Landwehr aron+ta@udel.edu CAPSL 2/17/2012 Flex/Bison Tutorial Aaron Myles Landwehr aron+ta@udel.edu 1 GENERAL COMPILER OVERVIEW 2 Compiler Overview Frontend Middle-end Backend Lexer / Scanner Parser Semantic Analyzer Optimizers Code Generator

More information

CHAPTER 7 GENERAL PROOF SYSTEMS

CHAPTER 7 GENERAL PROOF SYSTEMS CHAPTER 7 GENERAL PROOF SYSTEMS 1 Introduction Proof systems are built to prove statements. They can be thought as an inference machine with special statements, called provable statements, or sometimes

More information

We will learn the Python programming language. Why? Because it is easy to learn and many people write programs in Python so we can share.

We will learn the Python programming language. Why? Because it is easy to learn and many people write programs in Python so we can share. LING115 Lecture Note Session #4 Python (1) 1. Introduction As we have seen in previous sessions, we can use Linux shell commands to do simple text processing. We now know, for example, how to count words.

More information

Programming Languages

Programming Languages Programming Languages Programming languages bridge the gap between people and machines; for that matter, they also bridge the gap among people who would like to share algorithms in a way that immediately

More information

Antlr ANother TutoRiaL

Antlr ANother TutoRiaL Antlr ANother TutoRiaL Karl Stroetmann March 29, 2007 Contents 1 Introduction 1 2 Implementing a Simple Scanner 1 A Parser for Arithmetic Expressions 4 Symbolic Differentiation 6 5 Conclusion 10 1 Introduction

More information

Language Processing Systems

Language Processing Systems Language Processing Systems Evaluation Active sheets 10 % Exercise reports 30 % Midterm Exam 20 % Final Exam 40 % Contact Send e-mail to hamada@u-aizu.ac.jp Course materials at www.u-aizu.ac.jp/~hamada/education.html

More information

Chapter 2: Elements of Java

Chapter 2: Elements of Java Chapter 2: Elements of Java Basic components of a Java program Primitive data types Arithmetic expressions Type casting. The String type (introduction) Basic I/O statements Importing packages. 1 Introduction

More information

GENERIC and GIMPLE: A New Tree Representation for Entire Functions

GENERIC and GIMPLE: A New Tree Representation for Entire Functions GENERIC and GIMPLE: A New Tree Representation for Entire Functions Jason Merrill Red Hat, Inc. jason@redhat.com 1 Abstract The tree SSA project requires a tree representation of functions for the optimizers

More information

CS 141: Introduction to (Java) Programming: Exam 1 Jenny Orr Willamette University Fall 2013

CS 141: Introduction to (Java) Programming: Exam 1 Jenny Orr Willamette University Fall 2013 Oct 4, 2013, p 1 Name: CS 141: Introduction to (Java) Programming: Exam 1 Jenny Orr Willamette University Fall 2013 1. (max 18) 4. (max 16) 2. (max 12) 5. (max 12) 3. (max 24) 6. (max 18) Total: (max 100)

More information

Automata on Infinite Words and Trees

Automata on Infinite Words and Trees Automata on Infinite Words and Trees Course notes for the course Automata on Infinite Words and Trees given by Dr. Meghyn Bienvenu at Universität Bremen in the 2009-2010 winter semester Last modified:

More information

Symbol Tables. Introduction

Symbol Tables. Introduction Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Fall 2005 Handout 7 Scanner Parser Project Wednesday, September 7 DUE: Wednesday, September 21 This

More information

Python Loops and String Manipulation

Python Loops and String Manipulation WEEK TWO Python Loops and String Manipulation Last week, we showed you some basic Python programming and gave you some intriguing problems to solve. But it is hard to do anything really exciting until

More information

Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy

Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy Kim S. Larsen Odense University Abstract For many years, regular expressions with back referencing have been used in a variety

More information

Finite Automata and Regular Languages

Finite Automata and Regular Languages CHAPTER 3 Finite Automata and Regular Languages 3. Introduction 3.. States and Automata A finite-state machine or finite automaton (the noun comes from the Greek; the singular is automaton, the Greek-derived

More information

AUTOMATED TEST GENERATION FOR SOFTWARE COMPONENTS

AUTOMATED TEST GENERATION FOR SOFTWARE COMPONENTS TKK Reports in Information and Computer Science Espoo 2009 TKK-ICS-R26 AUTOMATED TEST GENERATION FOR SOFTWARE COMPONENTS Kari Kähkönen ABTEKNILLINEN KORKEAKOULU TEKNISKA HÖGSKOLAN HELSINKI UNIVERSITY OF

More information

The C Programming Language course syllabus associate level

The C Programming Language course syllabus associate level TECHNOLOGIES The C Programming Language course syllabus associate level Course description The course fully covers the basics of programming in the C programming language and demonstrates fundamental programming

More information

Data Integrator. Pervasive Software, Inc. 12365-B Riata Trace Parkway Austin, Texas 78727 USA

Data Integrator. Pervasive Software, Inc. 12365-B Riata Trace Parkway Austin, Texas 78727 USA Data Integrator Event Management Guide Pervasive Software, Inc. 12365-B Riata Trace Parkway Austin, Texas 78727 USA Telephone: 888.296.5969 or 512.231.6000 Fax: 512.231.6010 Email: info@pervasiveintegration.com

More information

Fast nondeterministic recognition of context-free languages using two queues

Fast nondeterministic recognition of context-free languages using two queues Fast nondeterministic recognition of context-free languages using two queues Burton Rosenberg University of Miami Abstract We show how to accept a context-free language nondeterministically in O( n log

More information

VIRTUAL LABORATORY: MULTI-STYLE CODE EDITOR

VIRTUAL LABORATORY: MULTI-STYLE CODE EDITOR VIRTUAL LABORATORY: MULTI-STYLE CODE EDITOR Andrey V.Lyamin, State University of IT, Mechanics and Optics St. Petersburg, Russia Oleg E.Vashenkov, State University of IT, Mechanics and Optics, St.Petersburg,

More information

The following themes form the major topics of this chapter: The terms and concepts related to trees (Section 5.2).

The following themes form the major topics of this chapter: The terms and concepts related to trees (Section 5.2). CHAPTER 5 The Tree Data Model There are many situations in which information has a hierarchical or nested structure like that found in family trees or organization charts. The abstraction that models hierarchical

More information

Ed. v1.0 PROGRAMMING LANGUAGES WORKING PAPER DRAFT PROGRAMMING LANGUAGES. Ed. v1.0

Ed. v1.0 PROGRAMMING LANGUAGES WORKING PAPER DRAFT PROGRAMMING LANGUAGES. Ed. v1.0 i PROGRAMMING LANGUAGES ii Copyright 2011 Juhász István iii COLLABORATORS TITLE : PROGRAMMING LANGUAGES ACTION NAME DATE SIGNATURE WRITTEN BY István Juhász 2012. március 26. Reviewed by Ágnes Korotij 2012.

More information

A Programming Language for Mechanical Translation Victor H. Yngve, Massachusetts Institute of Technology, Cambridge, Massachusetts

A Programming Language for Mechanical Translation Victor H. Yngve, Massachusetts Institute of Technology, Cambridge, Massachusetts [Mechanical Translation, vol.5, no.1, July 1958; pp. 25-41] A Programming Language for Mechanical Translation Victor H. Yngve, Massachusetts Institute of Technology, Cambridge, Massachusetts A notational

More information

CSE 135: Introduction to Theory of Computation Decidability and Recognizability

CSE 135: Introduction to Theory of Computation Decidability and Recognizability CSE 135: Introduction to Theory of Computation Decidability and Recognizability Sungjin Im University of California, Merced 04-28, 30-2014 High-Level Descriptions of Computation Instead of giving a Turing

More information

CMPSCI 250: Introduction to Computation. Lecture #19: Regular Expressions and Their Languages David Mix Barrington 11 April 2013

CMPSCI 250: Introduction to Computation. Lecture #19: Regular Expressions and Their Languages David Mix Barrington 11 April 2013 CMPSCI 250: Introduction to Computation Lecture #19: Regular Expressions and Their Languages David Mix Barrington 11 April 2013 Regular Expressions and Their Languages Alphabets, Strings and Languages

More information