Context-Free Grammars

Similar documents
Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

C H A P T E R Regular Expressions regular expression

COMP 356 Programming Language Structures Notes for Chapter 4 of Concepts of Programming Languages Scanning and Parsing

Scanner. tokens scanner parser IR. source code. errors

Fundamentele Informatica II

CS103B Handout 17 Winter 2007 February 26, 2007 Languages and Regular Expressions

6.045: Automata, Computability, and Complexity Or, Great Ideas in Theoretical Computer Science Spring, Class 4 Nancy Lynch

Name: Class: Date: 9. The compiler ignores all comments they are there strictly for the convenience of anyone reading the program.

Syntaktická analýza. Ján Šturc. Zima 208

Reading 13 : Finite State Automata and Regular Expressions

Pushdown automata. Informatics 2A: Lecture 9. Alex Simpson. 3 October, School of Informatics University of Edinburgh als@inf.ed.ac.

University of Toronto Department of Electrical and Computer Engineering. Midterm Examination. CSC467 Compilers and Interpreters Fall Semester, 2005

Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy

Lecture 1. Basic Concepts of Set Theory, Functions and Relations

Automata and Computability. Solutions to Exercises

Honors Class (Foundations of) Informatics. Tom Verhoeff. Department of Mathematics & Computer Science Software Engineering & Technology

Automata and Formal Languages

Pushdown Automata. place the input head on the leftmost input symbol. while symbol read = b and pile contains discs advance head remove disc from pile

Regular Expression Searching

Regular Languages and Finite Automata

A Little Set Theory (Never Hurt Anybody)

Lecture 2: Regular Languages [Fa 14]

THE LANGUAGE OF SETS AND SET NOTATION

Lecture 9. Semantic Analysis Scoping and Symbol Table

Lexical analysis FORMAL LANGUAGES AND COMPILERS. Floriano Scioscia. Formal Languages and Compilers A.Y. 2015/2016

Scoping (Readings 7.1,7.4,7.6) Parameter passing methods (7.5) Building symbol tables (7.6)

Creating Basic Excel Formulas

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

Cartesian Products and Relations

Rational Exponents. Squaring both sides of the equation yields. and to be consistent, we must have

Order of Operations More Essential Practice

THEORY of COMPUTATION

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison

Absolute Value Equations and Inequalities

Formal Grammars and Languages

A.2. Exponents and Radicals. Integer Exponents. What you should learn. Exponential Notation. Why you should learn it. Properties of Exponents

Compiler I: Syntax Analysis Human Thought

5.1 Radical Notation and Rational Exponents

Finite Automata and Regular Languages

3.2. Solving quadratic equations. Introduction. Prerequisites. Learning Outcomes. Learning Style

Semantic Analysis: Types and Type Checking

Compiler Construction

Ecma/TC39/2013/NN. 4 th Draft ECMA-XXX. 1 st Edition / July The JSON Data Interchange Format. Reference number ECMA-123:2009

Basic Concepts of Set Theory, Functions and Relations

Answer Key for California State Standards: Algebra I

Properties of Real Numbers

Computer Programming. Course Details An Introduction to Computational Tools. Prof. Mauro Gaspari:

3 Some Integer Functions

6.3 Conditional Probability and Independence

CMPSCI 250: Introduction to Computation. Lecture #19: Regular Expressions and Their Languages David Mix Barrington 11 April 2013

A Second Course in Mathematics Concepts for Elementary Teachers: Theory, Problems, and Solutions

The following themes form the major topics of this chapter: The terms and concepts related to trees (Section 5.2).

CSC4510 AUTOMATA 2.1 Finite Automata: Examples and D efinitions Definitions

Graphing Parabolas With Microsoft Excel

Unified Language for Network Security Policy Implementation

03 - Lexical Analysis

Automata on Infinite Words and Trees

Review of Basic Algebraic Concepts

Matrix Algebra. Some Basic Matrix Laws. Before reading the text or the following notes glance at the following list of basic matrix algebra laws.

We will learn the Python programming language. Why? Because it is easy to learn and many people write programs in Python so we can share.

BEGINNING ALGEBRA ACKNOWLEDMENTS

Introduction to Automata Theory. Reading: Chapter 1

Static vs. Dynamic. Lecture 10: Static Semantics Overview 1. Typical Semantic Errors: Java, C++ Typical Tasks of the Semantic Analyzer

Symbol Tables. Introduction

Mathematics for Computer Science/Software Engineering. Notes for the course MSM1F3 Dr. R. A. Wilson

26 Integers: Multiplication, Division, and Order

Set Theory Basic Concepts and Definitions

CS143 Handout 08 Summer 2008 July 02, 2007 Bottom-Up Parsing

MULTIPLICATION AND DIVISION OF REAL NUMBERS In this section we will complete the study of the four basic operations with real numbers.

[Refer Slide Time: 05:10]

Scanning and parsing. Topics. Announcements Pick a partner by Monday Makeup lecture will be on Monday August 29th at 3pm

Context free grammars and predictive parsing

POLYNOMIAL FUNCTIONS

The Halting Problem is Undecidable

Vocabulary Words and Definitions for Algebra

Chapter 2: Elements of Java

Approximating Context-Free Grammars for Parsing and Verification

3515ICT Theory of Computation Turing Machines

Lecture 18 Regular Expressions

Introduction. Compiler Design CSE 504. Overview. Programming problems are easier to solve in high-level languages

ECDL. European Computer Driving Licence. Spreadsheet Software BCS ITQ Level 2. Syllabus Version 5.0

How To Understand The Laws Of Algebra

Section IV.1: Recursive Algorithms and Recursion Trees

Applies to Version 6 Release 5 X12.6 Application Control Structure

Compiler Design July 2004

Summation Algebra. x i

Regular Expressions and Automata using Haskell

This is a square root. The number under the radical is 9. (An asterisk * means multiply.)

Basic Set Theory. 1. Motivation. Fido Sue. Fred Aristotle Bob. LX Semantics I September 11, 2008

ÖVNINGSUPPGIFTER I SAMMANHANGSFRIA SPRÅK. 15 april Master Edition

Welcome to Math Video Lessons. Stanley Ocken. Department of Mathematics The City College of New York Fall 2013

Master of Sciences in Informatics Engineering Programming Paradigms 2005/2006. Final Examination. January 24 th, 2006

Regular Languages and Finite State Machines

We can express this in decimal notation (in contrast to the underline notation we have been using) as follows: b + 90c = c + 10b

Finite Automata. Reading: Chapter 2

CS5236 Advanced Automata Theory

Moving from CS 61A Scheme to CS 61B Java

Lab Experience 17. Programming Language Translation

Transcription:

. L E C T U R E 3 Context-Free Grammars 1 Where are Context-Free Grammars (CFGs) Used? CFGs are a more powerful formalism than regular expressions. They are more powerful in the sense that whatever can be expressed using regular expressions can be expressed using grammars (short for context-free grammars here), but they can also express languages that do not have regular expressions. An example of such a language is the set of well-matched parenthesis. Grammars are used to express syntactic rules. These rules are used by the compiler to take a steam of tokens (the output from a scanner/lexical analyzer) and parse it for syntactic correctness, e.g. checking that each construct is well formed, all parentheses are matched, or all keywords are spelled correctly. This process is known as parsing. 2 Definitions A context-free grammar G is a 4-tuple <N,T,P,S> where N is a set of nonterminals, T is a set of terminals, P is a set of production rules of the form A α, A is an element of set N, i.e. A N, and α (N T)*, and S is a specific non-terminal called the start symbol. Sometimes, the set of terminals is also referred to as the alphabet. Recall that for a set of strings I, the notation I*, Kleene closure, refers to the set of all strings obtained by concatenation of zero or more elements taken from the set I in any order. For example, if I={a,b,A,B}, then the set I* is {ε, a, b, A, B, aa, bb, AA, BB, ab, ba, aa, Aa,aB, Ba, ba, Ab, bb, Bb,...} where ε is the empty string. Here are other definitions related to context-free grammars and languages: A derivation using the rule A α is the process of obtaining a new string from a string w by replacing an occurrence of A in w with α. A sentence is a string consisting of only terminal symbols. A valid sentence with respect to grammar G is a sentence that can be derived using the production rules of G starting from S and ending with a sentence. A leftmost (rightmost) derivation is Ramki Thurimella 9/21/2005 1

one in which at every step the leftmost (resp. rightmost) nonterminal is replaced. For a grammar G, the language generated by G (denoted L(G)) is the set of all valid sentences with respect to G. A grammar is said to be ambiguous if L(G) contains a string w such that w has two or more leftmost or rightmost derivations. Sometimes it is possible to rewrite a grammar so that the new grammar produces the same language as the original, but it is not ambiguous. When all grammars for a language are ambiguous, then that language is said to be inherently ambiguous. There are languages that are inherently ambiguous. This is a mathematically established fact the proof of which is beyond the scope of this course. A tree is a parse tree for a sentence w of L(G) if every node of the tree has a label from N T and it further satisfies the following conditions. The root of the tree is labeled S. All the leaves are labeled with terminal symbols which when concatenated from left to right will give w. All interior nodes are labeled with a nonterminal symbol. If an interior node is labeled A and its children from left to right are labeled X 1, X 2... X k where each X i is either a terminal or a nonterminal symbol, then G must contain production rule of the form A X 1 X 2...X k. 3 Examples 3.1 Equal Number of a s and b s Implicit in the question is the fact that T={a,b}. Here is a grammar for it: N = {S} S = S P = {S asb, S bsa, S SS, S ε } For brevity, we write the above grammar as S asb bsa SS ε, and leave out the details. Our convention is to assume that the left-side nonterminal of the first production rule is the start symbol. Uppercase letters stand for nonterminals. Numbers and lowercase letters constitute the terminals. 3.2 Set of Balanced Parenthesis Here T = {(, )}. The grammar is S ( S ) SS ε. Ramki Thurimella 9/21/2005 2

Figure 3-1 A parse tree for the string (())() Here is an example of a leftmost derivation: S SS (S)S ((S))S (())S (())(S) (())() 3.3 Ambiguity Consider the grammar for arithmetic expressions involving addition and multiplication operators: E E+E E E*E E ID It is easy to see that this grammar produces all arithmetic expressions consisting of + and *. Consider the sentence ID+ID*ID. This can be parsed in two different ways: Figure 3-2 Ambiguous way to parse ID+ID*ID In the figure above, the parse tree to the left gives the addition operator precedence over multiplication. In other words, an expression such as 3+5*9 is evaluated as (3+5)*9 with a result of 72. Whereas, the tree to the right, does what is considered the standard practice in programming languages, i.e. giving * precedence over +. The previous example would be evaluated as 3+(5*9) resulting in 48. The above grammar can be rewritten so that the new grammar produces the same language as the original at the same time every string in the language would have a unique parse tree. For example, the following grammar produces parse trees that give * precedence over +: Ramki Thurimella 9/21/2005 3

E E+T T T T*ID ID In general, this is not always possible. As mentioned before, some languages don t have unambiguous grammars. The ambiguity in the grammar for arithmetic expressions exists even if we had only one operator. Consider E E+E ID This grammar is ambiguous as well. The sentence ID+ID+ID has two parse trees as shown below: Figure 3-3 The Associativity Problem The parse tree to the left interprets ID+ID+ID as (ID+ID)+ID while the one to the right treats it as ID+(ID+ID). In mathematics, since addition is associative, both interpretations yield identical results. However, in programming languages, owing to the limited precision of floating point numbers, the addition operator is not strictly associative. To avoid confusion, many programming languages specify the operator evaluation order when two operators of equal precedence are encountered. Typically, + and * are left associative. That is, the interpretation used by the first tree is used. An example of an operator that is right associative is the exponentiation operator. The grammar above can also made unambiguous with left-to-right associativity as E E+ID ID As an exercise, try to come up with an unambiguous grammar for arithmetic expressions involving +, * and ^ (exponentiation) operators so that ^ has precedence over * which has precedence over +. Make the ^ operator right associative and the other two operators left associative. 3.4 A Language that is not Context-Free Consider the language {a n b n c m m, n 1}. A grammar for this language is S PQ P apb ab Q cq c Ramki Thurimella 9/21/2005 4

However, a slight variation of this language {a n b n c n n 1} can shown to be not context free! 3.5 Chomsky Hierarchy The language {a n b n c n n 1} can be captured by context-sensitive grammar (CSG). These grammars are a more powerful formalism than CFGs. There are languages that cannot be expressed using CSGs. For these, we use unrestricted grammars. The hierarchical relationship between various formalisms in language theory is shown as a Venn diagram. This is called the Chomsky hierarchy. Figure 3-4 Chomsky Hierarchy. As is clear from the figure above that if a language is regular, then it is also contextfree. The converse is not true. For example, {a n b n n 1} is context-free, but not regular. Similarly, if a language is context-free, then it is context-sensitive. Again, the inclusion is strict. As mentioned before {a n b n c n n 1} is not context-free, but context-sensitive. This behavior continues for another level to unrestricted grammars. The discussion involving the last two sets is beyond the scope of this course. It should be noted however that the fact that some languages do not have CFGs has an impact on programming-language translation. That is, certain syntactic constraints cannot be enforced during the parsing phase using context-free grammars. Some examples are Checking that a variable is declared before it is used, Ensuring that an array declared to have two dimensions is referenced with exactly two subscripts, Making sure that no identifier is declared twice within the same block, Verifying that the number and the order of parameters in a function call match those in the signature of the function, and Ramki Thurimella 9/21/2005 5

Type checking. These are typically enforced as a separate phase after parsing. This check is known as the static-semantic check; despite the name, it should be noted that the checks enforced in this phase are strictly syntactic and have nothing to do with semantics. 4 Equivalent Syntactic Formalisms There are two formalisms that are commonly used to express the syntax of programming languages: Extended BNF (EBNF) notation and syntax graphs. It should be noted that these formalisms only provide convenience and are not any more powerful than context-free grammars. In EBNF, the nonterminals are enclosed between angular brackets. The production rules use the symbol ::= instead of an arrow. Recursive production rules can be simplified by using Kleene closure. Square brackets are used indicate an optional element that can be used at most once. Zero or more uses of an element is expressed by enclosing that element within curly braces. Vertical bar, as shown in the examples before, gives the alternatives for a production. Here is an example grammar in EBNF for assignment statements: assignment ::= variable = expression expression ::= term { [+ -] term }* term ::= primary {[ /] primary }* primary ::= variable number ( expression ) variable ::= identifier identifier [ subscript list ] subscript list ::= expression {, expression }* In the grammar above, choices on the right side are enclosed between regular square brackets whereas the terminal symbol for square bracket that is used to enclose array indices is shown in bold. This grammar can be shown graphically using syntax charts as in the figure below: Ramki Thurimella 9/21/2005 6

Figure 3--5 Syntax graph corresponding to the example EBNF grammar Exercises 1. Give context-free grammars generating the following sets: (a) The set of palindromes (strings that read the same forward as backward) over {a,b}. (b) The complement of {a n b n n 0} 2. Show that CFGs are closed under concatenation, union, and Kleene closure. (A set S is closed under an operation, if the result of applying that operator also belongs to the set S. For example, the set of integers are closed under addition, but not under division.) Problems 1. Give context-free grammars generating the following sets: (a) {a i b j c k i j or j k}. (b) The set of all valid regular expressions over {0,1,,(,),*} (c) {w w contains more 1s than 0s} 2. Consider the following unambiguous grammar for arithmetic expressions: Ramki Thurimella 9/21/2005 7

E E + T T T T * F F F (E) ID Change it to an unambiguous grammar that produces all arithmetic expressions with no redundant parentheses. Note that a set of parentheses is redundant if its removal does not change the expression, e.g., the parentheses are redundant in ID+(ID*ID) but not in (ID+ID)*ID. 3. Is the following grammar ambiguous? S ab ba A a as baa B b bs abb 4. Show using mathematical induction that the strings in the language produced by the following grammar do not contain ba. S as Sb a b 5. Give a simple English description of the language generated by the following grammar: S asb by Ya Y by ay ε Answers to Exercises 1. (a) S asa bsb a b ε (b) The desired set consists of two subsets: One consisting of all strings that have ba as a substring, and the other {a m b n m n}. Producing them separately and taking a union, we get S S 1 S 2 S 1 X ba X X ax bx ε S 2 a S 2 b A B A aa a B bb b 2. We will show CFGs are closed under concatenation. Let G 1 and G 2 be two context-free grammars. Let S 1 and S 2 be the start of the production rules in G 1 and G 2 respectively. Now, we can create new grammar G 3 from G 1 and G 2 that produces L(G 1 )L(G 2 ). Introduce a new start symbol S 3 and the production rule S 3 S 1S 2 to G 3. Also, add all production rules of G 1 and G 2 to G 3. It is easy to see that G 3 is a context-free grammar that produces L(G 1 )L(G 2 ). Proofs for other operations are similar. Ramki Thurimella 9/21/2005 8