Scanning Outline
2. Scanning
- The basics
- Ad-hoc scanning
- FSM-based techniques
- A lexical analysis tool: Lex (a scanner generator)

Recall: Compiler Structure
[Figure: Source Code → Lexical Analysis/Scanning → Syntax Analysis/Parsing → Semantic Analysis → Machine-Independent Optimization → Code Generation → Machine-Dependent Optimization → Machine Code. The phases are grouped under the labels Front End and Back End.]
Compiler Structure: Another View
[Figure: Input Language → Lexical Analyzer → (Lexemes or Tokens) → Syntax Analyzer → (Phrase Structure) → Code Generator → Output Language]

Lexical Analysis: What is it?
- The input to a compiler/interpreter is a source program, which arrives as an essentially unstructured sequence (stream) of characters
- Processing individual characters is tedious and highly inefficient
- So the first thing we have to do is add some basic structure to the source code
Lexical Analysis: What is it?
- A lexical analyzer (a.k.a. scanner) converts a stream of characters into a stream of tokens, i.e. it tokenizes the input
- This is a many:1 transformation, so later phases of compilation only need to deal with comparatively few tokens
- A token (a.k.a. lexeme or syntactic unit) is a fundamental component of a program

Lexical Analysis: Tokens
- Tokens are typically the bottom-level entities in syntax diagrams
- Typical tokens include:
  - identifiers (e.g. variable names)
  - keywords
  - operators
  - literals (i.e. constant values)
  - punctuation
- Consider a simple program and its tokens:
Lexical Analysis: Tokens
Source Code:
    PROGRAM test <crlf>
    VAR x : INTEGER; <crlf>
    BEGIN <crlf>
    x := x <operator> <literal>; <crlf>
    END. { test }
Tokens:
    keyword(PROGRAM) ident(test)
    keyword(VAR) ident(x) punct keyword(INTEGER) punct
    keyword(BEGIN)
    ident(x) operator ident(x) operator literal punct
    keyword(END) punct

Other Scanner Functions
- A scanner also removes white space from a program
  - White space consists of spaces, tabs, carriage returns, comments, and the like
  - It is put into the source code solely for readability and does not affect the functional specification provided by the program
- Some scanners also enter symbols in the symbol table (more later)
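One way to picture the token stream above is as a sequence of (category, lexeme) pairs. The following is a minimal sketch in C, assuming a fixed-size lexeme buffer; the names `TokenKind`, `Token`, and `make_token` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <string.h>

/* Token categories named on the slide; the type names are illustrative. */
typedef enum {
    TOK_KEYWORD, TOK_IDENT, TOK_OPERATOR, TOK_LITERAL, TOK_PUNCT
} TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];   /* the lexeme; a fixed size keeps the sketch simple */
} Token;

/* Build a token, truncating an over-long lexeme rather than overflowing. */
Token make_token(TokenKind kind, const char *text)
{
    Token t;
    assert(text != NULL);
    t.kind = kind;
    strncpy(t.text, text, sizeof t.text - 1);
    t.text[sizeof t.text - 1] = '\0';
    return t;
}
```

With this representation, `make_token(TOK_KEYWORD, "PROGRAM")` would stand for the first token of the sample program.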
Ad-hoc Scanning
- There are many applications outside of compiler construction that require simple scanning functions
  - e.g. recognizing numeric values in financial and other applications
- These applications either implement their own recognition functions or rely on library routines or language-based pattern matching to provide the needed functionality
- Manual recognition of tokens involves a multitude of IF, WHILE, and SWITCH statements
- This approach is ugly, extremely tedious, highly error-prone, and difficult to understand, maintain, and extend
- Using existing routines for pattern matching is a significant improvement
Ad-hoc Scanning
- In many cases (e.g. the C language) these facilities are provided by library routines
  - #include <string.h>: strchr (historically index), strlen, strcat, etc.
- In other cases (e.g. some variants of Pascal) they are incorporated into the language
  - substring functions, sets, etc.
  - or consider the language Perl!
- Both reflect the prevalence and importance of such functionality
- Anyone who has had to do a significant amount of such scanning/pattern matching knows how awkward it is
  - e.g. consider data verification or the processing of command line arguments as other examples
- Scanning in a compiler/interpreter is typically far worse
  - Even simple languages have complex lexemes
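As a small illustration of leaning on library routines, the sketch below uses the standard `<string.h>` functions `strspn` and `strchr`. The helper names `digit_run` and `split_option`, and the `name=value` argument format, are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Length of the leading run of decimal digits in s. */
size_t digit_run(const char *s)
{
    assert(s != NULL);
    return strspn(s, "0123456789");
}

/* Splits "name=value" in place by overwriting '=' with a terminator.
   Returns a pointer to the value, or NULL if arg contains no '='. */
char *split_option(char *arg)
{
    char *eq;
    assert(arg != NULL);
    eq = strchr(arg, '=');
    if (eq == NULL)
        return NULL;
    *eq = '\0';
    return eq + 1;
}
```

This is shorter than testing characters by hand, but it is still ad hoc: each new pattern needs new glue code.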
Grammars
- A generative grammar is a set of rules to generate valid phrases in a particular language
- Grammar G = (V, T, P, S), where
  - V is a finite set of nonterminals (variables)
  - T is a finite set of terminals (tokens)
  - P is a finite set of productions
  - S is a nonterminal called the start symbol
- Noam Chomsky defined classes of complexity of generative grammars
- The hierarchy of four classes, each of which properly contains the next, is called the Chomsky hierarchy

Grammars: Chomsky hierarchy
- Type 0: Unrestricted Grammars
- Type 1: Context-Sensitive Grammars (CSGs)
- Type 2: Context-Free Grammars (CFGs)
- Type 3: Regular Grammars (RGs)
Unrestricted Grammars
- This type of grammar is too complex for programming languages: efficient parsers cannot be constructed for it
- This grammar consists of productions of the form α → β

Context-Sensitive Grammars
- Most computer languages fall into this class of grammars
- The productions in this class are of the form α₁Aα₂ → α₁βα₂
  - A becomes β in the context of α₁ and α₂
  - In general these grammars are still too complex for efficient computer analysis
- The context sensitivity of programming languages is handled by other means so that context-free grammars can be used for programming languages
Context-Free Grammars
- A production of a context-free grammar (CFG) is of the form A → α, where A is a variable and α is a string of symbols
- In CFGs, derivations on variables are independent of what surrounds them
- To generate phrases in the language, strings of terminals are derived by repeated expansion of non-terminals
- CFGs permit the construction of efficient syntax analyzers
- Example, productions of the grammar:
    <S> → a <A> b
    <A> → <B> c
    <B> → d
- The language generated by the above grammar is {adcb}
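The single string in that language can be checked by expanding one non-terminal at a time, starting from the start symbol. Written out as a derivation (the notation ⇒ for "derives in one step" is standard but is not used on the slide):

```latex
\[
\langle S\rangle
\;\Rightarrow\; a\,\langle A\rangle\,b
\;\Rightarrow\; a\,\langle B\rangle\,c\,b
\;\Rightarrow\; a\,d\,c\,b
\]
```

Each step applies exactly one production, and because every non-terminal has only one production, no other string can be derived.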
Regular Grammars
- A CFG is a regular grammar (RG) if all of its productions are of one of these two forms:
  - A → ωB or A → ω (right linear), or
  - A → Bω or A → ω (left linear)
  where A and B are non-terminals and ω is a (possibly empty) string of terminals
- RGs are too restrictive for most purposes
- Very efficient parsers can be built
- The reason for the efficiency is that language generation from an RG can be performed without remembering our position in the production currently being expanded
- This lack of memory makes RGs incapable of generating languages with arbitrarily nested structures
- In compilers, RGs are used to describe words and CFGs are used to describe phrases constructed from these words
Regular Expressions (REs)
- Regular expressions are a simplified form of grammar used to represent RGs
- ε (epsilon, the empty string) is a regular expression that matches the empty string
- A symbol (terminal) s in the language is an RE that matches s
- If R is an RE, (R)* matches zero or more occurrences of the pattern R; this is known as the closure of R
- If R is an RE, (R)+ matches one or more occurrences of the pattern R
- If R and S are REs, (R) | (S) matches either the pattern R or the pattern S (alternation)
- If R and S are REs, (R)(S) matches the catenation of pattern R followed by pattern S
- Example:
    <int> ::= (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)+
    <int_no_leading_zero> ::= (1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*
Better Scanning Techniques
- The awkwardness of ad-hoc scanning has motivated the development of both techniques and tools for scanning
- The most common of these are based on what are known as finite state machines (FSMs), which recognize regular languages
- The key to being able to do this is the existence of certain restrictions placed on the format of programming languages
  - e.g. tokens are usually separated by delimiters

FSM-based Scanning
- The most common techniques used for building scanners are based on finite state machines (FSMs)
- FSMs can easily be used to recognize language constructs (tokens) that are described by regular languages
Regular Languages Revisited
- A regular language is one that can be described by a regular expression
- A regular expression consists of simple, atomic elements combined using only three operations: catenation, alternation, and repetition
- Catenation (a.k.a. concatenation or sequencing) is represented by physical adjacency
  - e.g. the regular expression <letter> <digit> simply represents (depending on the definition of letter and digit) a sequence composed of a letter followed by a digit
  - we would use the ::= (equivalence) operator to associate a definition with <letter> or <digit>
Regular Languages Revisited
- Alternation allows selection from a number of choices and is commonly represented by the | operator
  - e.g. <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- Certain shorthand forms are also commonly used with alternation (especially ellipses)
  - e.g. <alpha> ::= a | b | ... | z | A | B | ... | Z
- Finally, repetition permits the expression of constructs that are to be repeated some number of times
- There are two operators used for this purpose: superscript + and superscript *
  - e.g. <word> ::= <letter>+
  - this implies 1 or more letters (* would imply 0 or more letters)
Regular Languages Revisited
- Finally, parentheses, "(" and ")", are used for grouping regular expressions
- Normally, the repetition operators have the highest precedence, followed by catenation and then by alternation
- These 3 simple operations permit us to easily express the tokens that occur in existing programming languages
- Consider the following regular expressions for a few common tokens and token types we might encounter (note the use of quotes):
    <assignop> ::= ":="
    <alphanum> ::= <alpha> | <digit>
    <ident>    ::= (<alpha> | "_" | "$") <alphanum>*
    <intconst> ::= <digit>+
- Not everything is this simple to specify
Regular Languages Revisited
- Because not everything is simple to specify, there are a couple of other shortcuts that make life easier
- These are notational conveniences only and can easily be represented using the basic constructs
- Logical negation (^ or ~), commonly used with other constructs
    <comment> ::= "{" (~"}")* "}"
- ~a implies anything in U-{a}, where U is the set of all symbols
- Negation can be done simply by enumerating everything in U-{a}
  - e.g. if U = {a, b, c, d, e} then we could write (~a)* or, alternatively, (b | c | d | e)*
- Optional constructs
  - sometimes it becomes tedious to list a number of similar options, which could be more conveniently expressed by saying some constructs are optional
Regular Languages Revisited
- The most common notation for an optional construct is the use of square brackets
  - e.g. <signedintconst> ::= ["+" | "-"] <intconst>
- The preceding example is equivalent to the following:
    <signedintconst> ::= <intconst> | "+" <intconst> | "-" <intconst>
- If we could specify the number of times a repetition could take place, we could do it another way too
- Consider:
    <signedintconst> ::= ("+" | "-")^(0..1) <intconst>
- The superscript 0..1 is intended to imply that repetition can take place at most once (0 or 1 times)
- This illustrates yet another possible construct which, like the others, may be expressed using only catenation, alternation, and repetition, albeit more verbosely
Regular Languages Revisited
- Let's try something a bit more challenging: what does a real constant look like?
  - It might have a sign for the mantissa
  - The mantissa consists of some digits followed by a decimal point, possibly followed by some more digits (the fractional part)
  - There might be an exponent as well, which could be signed
- Let's do this in pieces...
    <realconst> ::= <mantissa> [E <exponent>]
- Consider the exponent first; it's just a signed integer constant:
    <exponent> ::= ["+" | "-"] <intconst>
  where
    <intconst> ::= <digit>+
    <digit>    ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Regular Languages Revisited
- Now let's try the mantissa:
    <mantissa> ::= ["+" | "-"] <intconst> "." [<intconst>]
- As with programming, divide and conquer works well to handle the complexity of regular expression specification
- Also, the use of optional constructs greatly simplifies this specification
- As an exercise, try doing the real constant without [ and ]

Regular Languages and FSMs
- A good way to start developing a scanner is to produce regular expressions for the tokens you wish to recognize
- The regular expressions themselves, however, are not the basis of the scanning process
- This requires a Finite State Machine (FSM) specification
Finite State Machines
- Fortunately, there is a direct mapping from regular expressions to the FSMs that implement them
- An FSM is an abstract machine which can be in one of a finite number of states, which makes state transitions based on inputs, and which performs specific actions in specific states or on transitions between states
  - cf. Moore and Mealy machines from digital logic
- FSMs are commonly represented graphically
  - Nodes in the graph represent individual states and are assigned meaningful names
  - Edges represent transitions between the states and are labeled with the input values that cause the state transitions
- An FSM-based scanner takes its input from the source code character stream
Finite State Machines
- The FSM-based scanner performs certain actions, which include recognizing specific characters, accumulating the characters in a particular token, and returning completed tokens to form the output token stream
- We'll begin by just recognizing some simple tokens and worry about actually building the tokens later
- Example:
    <digit>    ::= 0 | 1 | ... | 9
    <intconst> ::= <digit>+
  [Figure: state diagram. An edge labeled 0..9 enters state intconst, which loops back to itself on 0..9 and is left on "other".]
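Finite State Machines
    <ident> ::= (<alpha> | "_" | "$") <alphanum>*
  [Figure: state diagram. An edge labeled alpha, _, $ enters state ident, which loops back to itself on alphanum and is left on "other".]

Equivalence of REs and FSMs
- For each regular expression (RE), there is an FSM that recognizes strings conforming to the regular expression
- Consider the three basic RE operations
- Catenation: a b
  [Figure: start, then an edge labeled a, then an edge labeled b, then done]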
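Equivalence of REs and FSMs
- Alternation: a | b | c
  [Figure: three parallel edges labeled a, b, and c from start to done]
- Repetition: a*
  [Figure: start loops to itself on a; an edge labeled U-{a} (or ε) leads from start to done]

A Sample Regular Language
    <comment>  ::= "{" (~"}")* "}"
    <letter>   ::= a | ... | z | A | ... | Z
    <digit>    ::= 0 | ... | 9
    <ident>    ::= <letter> (<letter> | <digit>)*
    <numconst> ::= <digit>+ ["." <digit>+]
    <strconst> ::= <quote> (~<quote>)* <quote>
    <assignop> ::= ":=" | ":+=" | ":-=" | ":*=" | ":/="
    <negop>    ::= "~" | "~<" | "~>" | "~="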
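A Sample FSM for the language
[Figure: state diagram with states Leading, Comment, Ident, Lit 1, Lit 2, Lit 3, Lit 4, Assign?, Assign!, Neg, and Finish. Edges are labeled with the input characters that cause each transition: letter, digit, ".", ":", "=", "+-*/", "~", "><=", "{", "}", the punctuation characters ; , . [ ], and "other".]

Building a Scanner
- How does a scanner interact with the parser?
- Consider the following:
  [Figure: Source Program → Lexical Analyzer → Syntax Analyzer → Parse Tree. The syntax analyzer calls "get next token()" on the lexical analyzer, which returns a token.]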
Scanner Actions
- As the scanner changes from state to state, it must do something with the characters it scans in order to build the tokens to be returned to the parser calling it
- In some cases, it must append the character seen onto a developing token and consume it so the next input character is visible
  - e.g. when scanning characters in an identifier
- In other cases it must preserve the character and return a completed token
  - e.g. MaxVal := -999;
  - On reaching the character after MaxVal we know that we have found the end of the identifier, so we want to return MaxVal to the parser; but we do not want to lose the character that ended it, so we must preserve it
- Another possible action is to simply consume a character
  - e.g. characters in comments
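The append, consume, and preserve actions can be seen in a few lines of code. This sketch reads from a string through a cursor instead of a character stream, which is an assumption made to keep it self-contained; `scan_ident` is an illustrative name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Scans an identifier starting at *cursor into out (capacity cap, including
   the terminator). Returns the number of characters stored. On return,
   *cursor points at the first character that is not part of the identifier. */
size_t scan_ident(const char **cursor, char *out, size_t cap)
{
    size_t n = 0;
    assert(cursor != NULL && *cursor != NULL && out != NULL && cap > 0);
    while (isalnum((unsigned char)**cursor)) {
        if (n + 1 < cap)
            out[n++] = **cursor;    /* append to the developing token */
        (*cursor)++;                /* consume: expose the next character */
    }
    out[n] = '\0';
    return n;   /* preserve: the terminating character was never consumed */
}
```

For the input `MaxVal:=-999;` (written without spaces so that the colon is what ends the identifier), the call stores `MaxVal` and leaves the cursor on `:`, ready for the next token.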
Implementing the FSM
- A finite state machine may be easily implemented using a table-driven technique
- Table-driven techniques are highly methodical
  - Comparatively easy to handle changes and/or extensions to the grammar
  - Straightforward code that is not error-prone
  - Easy to maintain the code
- Regard the scanner as a device which takes a character stream as input and produces a token stream as output
- At any given point in time...
  - The device is in a specific state
  - Based on the current state and the next input character, it will perform a specific action and move into a new (possibly different) state
Scanner Actions (detail)
- Typical actions include:
  - C  : Consume
  - AC : Append and Consume
  - PI : Preserve and build ID token
  - PL : Preserve and build Literal token
  - PK : Preserve and build Keyword token
  - PP : Preserve and build Punctuation token
  - CO : Consume and build Operator token
  - CL : Consume and build Literal token
- What actions you need depends on the

Sample FSM with actions
[Figure: the sample FSM (states Leading, Comment, Ident, Lit 1, Lit 2, Lit 3, Lit 4, Assign?, Assign!, Neg, Finish) with each edge labeled by its input characters and an action code. The action codes appearing on the edges are C, AC, PI, PL, PP, PO, CO, CL, and CP.]
A scanner mainline
    STATIC GLOBAL ipchar;
    GLOBAL str, token, preserve

    str = ""
    state = Leading
    WHILE (state <> Finish) DO
        preserve = NO
        CALL action[state, ipchar]
        state = nextstate[state, ipchar]
        IF NOT preserve THEN ipchar = getchar()
    RETURN(token)

Action table
Rows are the current state; columns are the input character. Blank entries are not filled in.

    Current State   <alpha>  <digit>   .    <quote>   +    :    =    {    etc.
     1. Leading       AC       AC      CP     AC      CO   AC   CO   C
     2. Comment       C        C       C
     3. Ident         AC       AC      PI     PI      PI   PI   PI   PI
     4. Lit 1         PL       AC
     5. Lit 2
     6. Lit 3
     7. Lit 4
     8. Assign?
     9. Assign!
    10. Neg
    11. Finish
    etc.
Next State table
Rows are the current state; columns are the input character. Entries are state numbers; blank entries are not filled in.

    Current State   <alpha>  <digit>   .    <quote>   +    :    =    {    etc.
     1. Leading        3        4      11      7      11    8   11    2
     2. Comment
     3. Ident          3        3      11     11      11   11   11   11
     4. Lit 1         11        4       5     11      11   11   11   11
     5. Lit 2
     6. Lit 3
     7. Lit 4
     8. Assign?
     9. Assign!
    10. Neg
    11. Finish
    etc.

Additional code
- All we have to do now is add action routines
  - append adds the current character onto a string representing the token being recognized
  - consume vs. preserve is handled by the preserve flag
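The mainline and the two tables fit together as in the following sketch. It is deliberately cut down and is not the full sample FSM: it keeps only the states Leading, Ident, Lit 1, and Finish, uses four character classes instead of individual characters, and reads from a string. The names `Scanner`, `Token`, `classify`, and `next_token`, the `SPACE` class, and the end-of-input handling are all assumptions added for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum State  { LEADING, IDENT, LIT1, FINISH, NSTATES };
enum Class  { ALPHA, DIGIT, SPACE, OTHER, NCLASSES };
enum Action { C, AC, PI, PL, CP };
enum Kind   { T_IDENT, T_LITERAL, T_PUNCT, T_EOF };

/* Action table: rows are states (Finish needs no row), columns are classes. */
const enum Action action[NSTATES - 1][NCLASSES] = {
    /*              ALPHA   DIGIT   SPACE    OTHER  */
    /* LEADING */ { AC,     AC,     C,       CP     },
    /* IDENT   */ { AC,     AC,     PI,      PI     },
    /* LIT1    */ { PL,     AC,     PL,      PL     },
};

/* Next-state table, indexed the same way. */
const enum State nextstate[NSTATES - 1][NCLASSES] = {
    /* LEADING */ { IDENT,  LIT1,   LEADING, FINISH },
    /* IDENT   */ { IDENT,  IDENT,  FINISH,  FINISH },
    /* LIT1    */ { FINISH, LIT1,   FINISH,  FINISH },
};

typedef struct { const char *in; } Scanner;            /* input cursor */
typedef struct { enum Kind kind; char text[32]; } Token;

enum Class classify(int ch)
{
    if (isalpha(ch)) return ALPHA;
    if (isdigit(ch)) return DIGIT;
    if (isspace(ch)) return SPACE;
    return OTHER;                      /* includes the terminating '\0' */
}

Token next_token(Scanner *sc)
{
    Token tok = { T_EOF, "" };
    size_t len = 0;
    enum State state = LEADING;
    assert(sc != NULL && sc->in != NULL);
    while (state != FINISH) {
        int ch = (unsigned char)*sc->in;
        enum Class cls = classify(ch);
        int preserve = 0;
        if (state == LEADING && ch == '\0')
            return tok;                /* end of input: T_EOF, empty text */
        switch (action[state][cls]) {
        case C:
            break;
        case AC:
            if (len + 1 < sizeof tok.text) tok.text[len++] = (char)ch;
            break;
        case CP:
            if (len + 1 < sizeof tok.text) tok.text[len++] = (char)ch;
            tok.kind = T_PUNCT;
            break;
        case PI:
            tok.kind = T_IDENT;
            preserve = 1;
            break;
        case PL:
            tok.kind = T_LITERAL;
            preserve = 1;
            break;
        }
        state = nextstate[state][cls];
        if (!preserve)
            sc->in++;                  /* consume the character */
    }
    tok.text[len] = '\0';
    return tok;
}
```

Scanning the string `x1 42;` with repeated calls to `next_token` would be expected to produce an identifier `x1`, a literal `42`, a punctuation token `;`, and then the end-of-input token. Extending the scanner means adding rows and columns to the two tables rather than writing new control flow.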
A Lexical Analyzer Generator
- Building a scanner manually (even using the FSM technique) is tedious
- We know that the mapping from regular expressions to FSMs is straightforward, so why don't we automate the process?
- Then we just type in regular expressions and get back code to implement a scanner
- That is exactly what lex does

How lex works
[Figure: Lex source program lex.l → Lex Compiler → lex.yy.c; lex.yy.c → C Compiler → a.out; input stream → a.out → sequence of tokens]
lex Specifications
- lex programs are divided into three components:

    declarations          (variables defined, include files specified, etc.)
    %%
    translation rules     (each of the form: pattern { action }, where the
                           pattern is an RE and the action is C/C++ statements)
    %%
    auxiliary procedures  (support routines for the C/C++ statements above)

Sample lex program

    %{
    /*
     * this sample demonstrates (very) simple recognition:
     * a verb/not a verb.
     */
    /* includes and defines should go in this section */
    #include <stdio.h>
    %}
    %%

    [\t ]+      /* ignore white space */ ;

    is |
    am |
    are |
    were |
    was |
    be |
    being |
    been |
    do |
    does |
    did |
    have |
    had |
    go          { printf("%s: is a verb\n", yytext); }

    [a-zA-Z]+   { printf("%s: is not a verb\n", yytext); }

    .|\n        { ECHO; /* normal default anyway */ }
    %%

    int main(void)
    {
        yylex();
        return 0;
    }