Context free grammars and predictive parsing
Programming Language Concepts and Implementation, Fall 2012, Lecture 6

Overview
- Context free grammars
  - Describing programming language syntax
  - Ambiguities and eliminating these
- The parser generator coco/r
- Predictive parsing: under the hood of coco/r
- Next week: LR parsing

Context free grammars
Think of it as regular expressions + recursion

An example and a derivation
Expr = Expr + Expr | Expr * Expr | (Expr) | num
Expr => Expr + Expr => Expr + Expr * Expr => ... => 2 + 3*4

Terminology:
- 1 non-terminal: Expr
- 5 terminals (tokens): +, *, (, ), num
- 4 productions (right hand sides)
- Terminals and nonterminals collectively are symbols

Another example
Straight line programs (from book)
S = S;S | id := E | print(L)
E = id | num | E + E | (S,E)
L = E | L,E

Book excerpt:
Grammar 3.1 is an example of a grammar for straight-line programs. The start symbol is S (when the start symbol is not written explicitly it is conventional to assume that the left-hand nonterminal in the first production is the start symbol). The terminal symbols are
  id print num , + ( ) := ;
and the nonterminals are S, E, and L. One sentence in the language of this grammar is
  id := num; id := id + (id := num + num, id)
where the source text (before lexical analysis) might have been
  a := 7; b := c + (d := 5 + 6, d)
The token-types (terminal symbols) are id, num, :=, and so on; the names (a,b,c,d) and numbers (7, 5, 6) are semantic values associated with some of the tokens.

GRAMMAR 3.1: A syntax for straight-line programs.
1. S -> S; S
2. S -> id := E
3. S -> print (L)
4. E -> id
5. E -> num
6. E -> E + E
7. E -> (S, E)
8. L -> E
9. L -> L, E

DERIVATIONS
To show that this sentence is in the language of the grammar, we can perform a derivation: Start with the start symbol, then repeatedly replace any nonterminal by one of its right-hand sides, as shown in Derivation 3.2.

DERIVATION 3.2
   S
=> S ; S
=> S ; id := E
=> id := E ; id := E
=> id := num ; id := E
=> id := num ; id := E + E
=> id := num ; id := E + (S, E)
=> id := num ; id := id + (S, E)
=> id := num ; id := id + (id := E, E)
=> id := num ; id := id + (id := E + E, E)
=> id := num ; id := id + (id := E + E, id)
=> id := num ; id := id + (id := num + E, id)
=> id := num ; id := id + (id := num + num, id)

Official definition
A context free grammar consists of
- A finite set of nonterminals
- A finite set of terminals
- A finite set of productions
- A choice of start symbol (a non-terminal)
A production consists of
- A nonterminal (called the left hand side)
- A string of symbols (terminals or nonterminals)
This notation is called Backus-Naur Form (BNF)

Example: Mini Java
Grammar figure from MCIJ (note mixed notation)
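
The four components of the official definition map directly onto a small data structure. The following is a minimal Java sketch, not part of the lecture material: the class names `Production` and `Grammar` are made up for illustration, and it assumes that any symbol that never occurs as a left hand side is a terminal. It encodes the example grammar Expr = Expr + Expr | Expr * Expr | (Expr) | num and counts its components.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// A production: one nonterminal on the left, a string of symbols on the right.
// (Hypothetical helper class, for illustration only.)
final class Production {
    final String lhs;
    final List<String> rhs;

    Production(String lhs, String... rhs) {
        this.lhs = lhs;
        this.rhs = Arrays.asList(rhs);
    }
}

// The four components of a context free grammar.
final class Grammar {
    final Set<String> nonterminals = new LinkedHashSet<>();
    final Set<String> terminals = new LinkedHashSet<>();
    final List<Production> productions = new ArrayList<>();
    final String start;

    Grammar(String start, Production... ps) {
        this.start = start;
        // Every left hand side is a nonterminal.
        for (Production p : ps) {
            productions.add(p);
            nonterminals.add(p.lhs);
        }
        // Assumption: every other symbol that occurs is a terminal.
        for (Production p : ps)
            for (String s : p.rhs)
                if (!nonterminals.contains(s)) terminals.add(s);
    }

    // Expr = Expr + Expr | Expr * Expr | (Expr) | num
    static Grammar example() {
        return new Grammar("Expr",
            new Production("Expr", "Expr", "+", "Expr"),
            new Production("Expr", "Expr", "*", "Expr"),
            new Production("Expr", "(", "Expr", ")"),
            new Production("Expr", "num"));
    }

    public static void main(String[] args) {
        Grammar g = example();
        // Prints: 1 nonterminal, 5 terminals, 4 productions
        System.out.println(g.nonterminals.size() + " nonterminal, "
            + g.terminals.size() + " terminals, "
            + g.productions.size() + " productions");
    }
}
```

Running `Grammar.main` reproduces the counts from the terminology slide: 1 nonterminal, 5 terminals, 4 productions.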

SQL specification (in extended BNF)...
Source: http://savage.net.au/SQL/sql-92.bnf

<query specification> ::=
    SELECT [ <set quantifier> ] <select list> <table expression>

<select list> ::=
      <asterisk>
    | <select sublist> [ { <comma> <select sublist> }... ]

<select sublist> ::= <derived column> | <qualifier> <period> <asterisk>

<derived column> ::= <value expression> [ <as clause> ]

<as clause> ::= [ AS ] <column name>

<table expression> ::=
    <from clause>
    [ <where clause> ]
    [ <group by clause> ]
    [ <having clause> ]

<from clause> ::= FROM <table reference> [ { <comma> <table reference> }... ]
...
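
Ambiguity
Expr = Expr + Expr | Expr * Expr | (Expr) | num

Two derivations of 2 + 3*4, with different parse trees:
- Expr => Expr + Expr => Expr + Expr * Expr => ... => 2 + 3*4
  Parse tree with + at the root: 2 + (3*4)
- Expr => Expr * Expr => Expr + Expr * Expr => ... => 2 + 3*4
  Parse tree with * at the root: (2 + 3) * 4

Encoding operator precedence
- Multiplication has higher precedence (binds stronger) than addition
- One nonterminal per precedence level

Expr = Expr + Expr | Term
Term = Term * Term | num | (Expr)

Exercise:
- How many ways can you parse 2+3*4?
- How about 2 + 3 + 4?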
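
To make the ambiguity of the original, unstratified grammar Expr = Expr + Expr | Expr * Expr | num concrete, the parse trees of a token sequence can simply be enumerated. The sketch below is an illustration under stated assumptions, not lecture material: the class name `AmbiguityDemo` is made up, parenthesised input is left out, and each tree is printed as a fully parenthesised string.

```java
import java.util.ArrayList;
import java.util.List;

final class AmbiguityDemo {
    // All parse trees of tokens[lo..hi] (inclusive) under
    //   Expr = Expr + Expr | Expr * Expr | num
    // Tokens alternate: operand, operator, operand, ...
    static List<String> span(String[] tokens, int lo, int hi) {
        List<String> result = new ArrayList<>();
        if (lo == hi) {                 // Expr = num
            result.add(tokens[lo]);
            return result;
        }
        // Expr = Expr op Expr: any operator in the span may be the root.
        for (int op = lo + 1; op < hi; op += 2)
            for (String left : span(tokens, lo, op - 1))
                for (String right : span(tokens, op + 1, hi))
                    result.add("(" + left + tokens[op] + right + ")");
        return result;
    }

    static List<String> parses(String... tokens) {
        return span(tokens, 0, tokens.length - 1);
    }

    public static void main(String[] args) {
        // Prints: [(2+(3*4)), ((2+3)*4)]
        System.out.println(parses("2", "+", "3", "*", "4"));
        // Prints: 5
        System.out.println(parses("2", "+", "3", "+", "4", "+", "5").size());
    }
}
```

The number of trees grows quickly with the number of operators, which is one reason to remove the ambiguity in the grammar rather than pick among the trees afterwards.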

Ambiguity and associativity
Expr = Expr - Expr | num
This grammar gives 5 - 3 - 2 two parse trees: (5 - 3) - 2 and 5 - (3 - 2).

Forcing left associativity
Expr = Expr - num | num

Exercise
What ambiguities exist in the following grammar, and how do we get rid of them?
Expr = Expr + Expr | Expr * Expr | Expr - Expr | Expr / Expr | (Expr) | num
- * and / have higher precedence than -, +
- All operators associate to the left, e.g.,
  - 6-3-2 = (6-3)-2 ≠ 6-(3-2)
  - 6/3*2 = (6/3)*2 ≠ 6/(3*2)
  - 6-3+2 = (6-3)+2 ≠ 6-(3+2)

Exercise (solution)
Original grammar:
Expr = Expr + Expr | Expr * Expr | Expr - Expr | Expr / Expr | (Expr) | num

Encoding operator precedence: use one non-terminal per precedence level
Expr = Expr + Expr | Expr - Expr | Term
Term = Term * Term | Term / Term | num | (Expr)

Encoding associativity
Expr = Expr + Term | Expr - Term | Term
Term = Term * num | Term * (Expr) | Term / num | Term / (Expr) | num | (Expr)

or (better)
Expr = Expr + Term | Expr - Term | Term
Term = Term * Prim | Term / Prim | Prim
Prim = num | (Expr)
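
The stratified grammar with Expr, Term and Prim can be turned into a small evaluator to check that it really gives the intended precedence and left associativity. The following Java sketch is an illustration, not the lecture's code. It assumes integer operands and integer division, and because a recursive descent parser cannot call itself on a left-recursive production, it rewrites Expr = Expr + Term | Expr - Term | Term as the loop Term { (+|-) Term }, folding the result to the left. The class name `LeftAssocEval` is made up.

```java
final class LeftAssocEval {
    private final String src;
    private int pos = 0;

    LeftAssocEval(String src) { this.src = src.replace(" ", ""); }

    static int eval(String src) {
        LeftAssocEval p = new LeftAssocEval(src);
        int v = p.expr();
        if (p.pos != p.src.length())
            throw new IllegalArgumentException("trailing input at " + p.pos);
        return v;
    }

    // Next character, or '$' at end of input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '$'; }

    // Expr = Expr + Term | Expr - Term | Term, written as Term { (+|-) Term }
    private int expr() {
        int v = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            int r = term();
            v = (op == '+') ? v + r : v - r;   // fold to the left
        }
        return v;
    }

    // Term = Term * Prim | Term / Prim | Prim, written as Prim { (*|/) Prim }
    private int term() {
        int v = prim();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            int r = prim();
            v = (op == '*') ? v * r : v / r;   // fold to the left
        }
        return v;
    }

    // Prim = num | (Expr)
    private int prim() {
        if (peek() == '(') {
            pos++;
            int v = expr();
            if (peek() != ')')
                throw new IllegalArgumentException("expected ) at " + pos);
            pos++;
            return v;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos)
            throw new IllegalArgumentException("expected number at " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(eval("6-3-2"));   // (6-3)-2 = 1
        System.out.println(eval("6/3*2"));   // (6/3)*2 = 4
        System.out.println(eval("6-3+2"));   // (6-3)+2 = 5
        System.out.println(eval("2+3*4"));   // 2+(3*4) = 14
    }
}
```

Each precedence level gets its own method, and the loop in each method is what makes the operators at that level associate to the left.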

Associativity of operators
- Most binary operators are left associative, e.g., +, -, *, /
- Few are right associative, e.g. = in C: x = y = 2 is parsed as x = (y = 2)
  Forcing right associativity:
  Expr = ident = Expr | ident
- Some are non-associative, e.g., 1<2<3 is not legal
  LogExpr = Expr < Expr | ...

Ambiguity: Dangling else
Consider the grammar
Stmt = if Expr then Stmt else Stmt | if Expr then Stmt | id = Expr
Ambiguity: How to parse?
if Expr then if Expr then id = Expr else id = Expr
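
A hand-written recursive descent parser typically settles the dangling else question by letting each else attach to the nearest if that is still waiting for one. The following Java sketch illustrates that convention; it is not taken from the lecture. As simplifying assumptions, a condition and an assignment are each a single token (the names c1, c2, s1, s2 are hypothetical), and the class name `DanglingElse` is made up. The result is printed with brackets so the chosen structure is visible.

```java
final class DanglingElse {
    private final String[] toks;
    private int pos = 0;

    DanglingElse(String... toks) { this.toks = toks; }

    // Next token, or "$" at end of input.
    private String next() { return pos < toks.length ? toks[pos] : "$"; }

    private String take() {
        if (pos >= toks.length)
            throw new IllegalArgumentException("unexpected end of input");
        return toks[pos++];
    }

    private void eat(String t) {
        if (!next().equals(t))
            throw new IllegalArgumentException("expected " + t + " at " + pos);
        pos++;
    }

    // Stmt = if cond then Stmt [ else Stmt ] | simple
    private String stmt() {
        if (!next().equals("if")) return take();   // simple statement: one token here
        eat("if");
        String cond = take();
        eat("then");
        String thenPart = stmt();
        if (next().equals("else")) {               // greedy: the innermost open if takes the else
            eat("else");
            String elsePart = stmt();
            return "[if " + cond + " then " + thenPart + " else " + elsePart + "]";
        }
        return "[if " + cond + " then " + thenPart + "]";
    }

    static String parse(String... toks) {
        DanglingElse p = new DanglingElse(toks);
        String result = p.stmt();
        if (p.pos != toks.length)
            throw new IllegalArgumentException("trailing input at " + p.pos);
        return result;
    }

    public static void main(String[] args) {
        // Prints: [if c1 then [if c2 then s1 else s2]]
        System.out.println(parse("if", "c1", "then", "if", "c2", "then", "s1", "else", "s2"));
    }
}
```

The else ends up inside the inner if because the recursive call for the inner statement sees the else token before the outer call gets a chance to.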

Ambiguity: Dangling else
Resolving the ambiguity
Stmt = Matched_Stmt | Unmatched_Stmt
Matched_Stmt = if Expr then Matched_Stmt else Matched_Stmt | id = Expr
Unmatched_Stmt = if Expr then Stmt | if Expr then Matched_Stmt else Unmatched_Stmt
Better to handle this using parser tricks. See later.

The parser generator Coco/R

Extended BNF
Example
Expr = Term { + Term | - Term }
Term = num { * num }
Extra symbols
- {α} means zero, one or many α
- [α] means zero or one α
- (α) is used for grouping
EBNF is no more expressive than BNF, only more convenient

Using coco/r
    COMPILER Expressions
    ...
    PRODUCTIONS
    /*-------------------------------------------------------------------*/
    Expr = Term { '+' Term | '-' Term }.
    Term = number { '*' number }.
    Expressions = Expr.          /* specification of start symbol */
    END Expressions.

Semantic actions in coco/r
    COMPILER Expressions
      public int res;
    ...
    PRODUCTIONS
    /*-------------------------------------------------------------------*/
    Expr<out int n>            (. int n1, n2; .)
      = Term<out n1>           (. n = n1; .)
        { '+' Term<out n2>     (. n = n+n2; .)
        | '-' Term<out n2>     (. n = n-n2; .)
        }.
    Term<out int n>
      = number                 (. n = Convert.ToInt32(t.val); .)
        { '*' number           (. n = n*Convert.ToInt32(t.val); .)
        }.
    Expressions                (. int n; .)
      = Expr<out n>            (. res = n; .)
      .
    END Expressions.

The generated parser
Method for parsing expressions, in the resulting Parser.cs:
    void Expr(out int n) {             // out: pass by reference, similar to ref
        int n1, n2;
        Term(out n1);
        n = n1;
        while (la.kind == 3 || la.kind == 4) {
            if (la.kind == 3) {        // if next token is +
                Get();
                Term(out n2);
                n = n+n2;
            } else {
                Get();
                Term(out n2);
                n = n-n2;
            }
        }
    }

Using coco/r with semantic actions

Predictive parsing
- Top-down parsing method aka LL-parsing
- coco/r generates LL parsers
- Produces left-most derivations
- Guess a production based on the next token
- Example grammar 3.11 (slide version):
  S = if E then S else S | begin S L | print E
  L = end | ; S L
  E = num | ident
- Example parsing on board
Rasmus Ejlers Møgelberg

Book excerpt:
Suppose S is the start symbol of a grammar. To indicate that $ must come after a complete S-phrase, we augment the grammar with a new start symbol S' and a new production S' -> S$. In Grammar 3.8, E is the start symbol, so an augmented grammar is Grammar 3.10.

GRAMMAR 3.10
S -> E $
E -> E + T
E -> E - T
E -> T
T -> T * F
T -> T / F
T -> F
F -> id
F -> num
F -> (E)

3.2 PREDICTIVE PARSING
Some grammars are easy to parse using a simple algorithm known as recursive descent. In essence, each grammar production turns into one clause of a recursive function. We illustrate this by writing a recursive-descent parser for Grammar 3.11.

GRAMMAR 3.11
S -> if E then S else S
S -> begin S L
S -> print E
L -> end
L -> ; S L
E -> num = num

Parser implementation
A recursive-descent parser for this language has one function for each nonterminal and one clause for each production.

    final int IF=1, THEN=2, ELSE=3, BEGIN=4, END=5, PRINT=6, SEMI=7, NUM=8, EQ=9;
    int tok = getToken();
    void advance() {tok=getToken();}
    void eat(int t) {if (tok==t) advance(); else error();}
    void S() {switch(tok) {
        case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
        case BEGIN: eat(BEGIN); S(); L(); break;
        case PRINT: eat(PRINT); E(); break;
        default:    error();
    }}
    void L() {switch(tok) {
        case END:   eat(END); break;
        case SEMI:  eat(SEMI); S(); L(); break;
        default:    error();
    }}
    // E -> num = num (Grammar 3.11)
    void E() { eat(NUM); eat(EQ); eat(NUM); }

Parsing table
S = if E then S else S | begin S L | print E
L = end | ; S L
E = num | ident

          S                          L            E
--------------------------------------------------------------
if        S->if E then S else S
begin     S->begin S L
print     S->print E
end                                  L->end
;                                    L->; S L
num                                               E->num
ident                                             E->ident

Intended learning outcomes
- Construct grammars for programming languages
- Eliminate ambiguity by
  - Encoding operator precedence
  - Encoding operator associativity
- Use coco/r to create parsers and lexers
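
The parsing table can also drive a parser directly, with an explicit stack taking the place of the recursive calls. The following Java sketch is an illustration under stated assumptions rather than anything coco/r generates: tokens are plain strings such as "if" or "num", the table is keyed by "nonterminal token", and the class name `TableParser` is made up. It only decides whether the input is a sentence of the grammar.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class TableParser {
    // TABLE.get(nonterminal + " " + nextToken) = right hand side to expand to.
    private static final Map<String, String[]> TABLE = new HashMap<>();
    static {
        TABLE.put("S if",    new String[] {"if", "E", "then", "S", "else", "S"});
        TABLE.put("S begin", new String[] {"begin", "S", "L"});
        TABLE.put("S print", new String[] {"print", "E"});
        TABLE.put("L end",   new String[] {"end"});
        TABLE.put("L ;",     new String[] {";", "S", "L"});
        TABLE.put("E num",   new String[] {"num"});
        TABLE.put("E ident", new String[] {"ident"});
    }

    private static boolean isNonterminal(String s) {
        return s.equals("S") || s.equals("L") || s.equals("E");
    }

    // Returns true if the token sequence can be derived from the start symbol S.
    static boolean accepts(String... tokens) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("S");
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop();
            String tok = pos < tokens.length ? tokens[pos] : "$";
            if (isNonterminal(top)) {
                String[] rhs = TABLE.get(top + " " + tok);
                if (rhs == null) return false;          // empty table entry: syntax error
                // Push in reverse so the leftmost symbol ends up on top.
                for (int i = rhs.length - 1; i >= 0; i--) stack.push(rhs[i]);
            } else if (top.equals(tok)) {
                pos++;                                  // terminal matches: consume it
            } else {
                return false;                           // terminal mismatch
            }
        }
        return pos == tokens.length;                    // all input must be consumed
    }

    public static void main(String[] args) {
        // Prints: true
        System.out.println(accepts("begin", "print", "num", ";", "print", "ident", "end"));
        // Prints: false
        System.out.println(accepts("print"));
    }
}
```

Because every table cell holds at most one production, the parser never has to backtrack: the nonterminal on top of the stack and the next token determine the expansion.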