MIT Specifying Languages with Regular Expressions and Context-Free Grammars


MIT 6.035 Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard, Laboratory for Computer Science, Massachusetts Institute of Technology

Language Definition Problem: how to precisely define a language. Language definition has a layered structure. Start with a set of letters in the language. Lexical structure identifies the words in the language (each word is a sequence of letters). Syntactic structure identifies the sentences in the language (each sentence is a sequence of words). Semantics gives the meaning of a program (specifies what the result should be for each input). Today's topic: lexical and syntactic structure.

Specifying Formal Languages. A huge triumph of computer science: beautiful theoretical results, and practical techniques and applications. There are two dual notions: the generative approach (grammar or regular expression) and the recognition approach (automaton), with lots of theorems about converting one approach automatically to the other.

Specifying Lexical Structure Using Regular Expressions. Have some alphabet Σ = set of letters. Regular expressions are built from: ε, the empty string; any letter a from the alphabet Σ; r1 r2, regular expression r1 followed by r2 (sequence); r1 | r2, either regular expression r1 or r2 (choice); r*, iterated sequence and choice, i.e. ε | r | rr | ...; and parentheses to indicate grouping/precedence.

Concept of Regular Expression: Generating a String. General rule: rewrite the regular expression until only a sequence of letters (a string) is left. Rewrite rules: 1) r1 | r2 → r1; 2) r1 | r2 → r2; 3) r* → r r*; 4) r* → ε. Example: (0|1)*.(0|1)* → (0|1)(0|1)*.(0|1)* → 1(0|1)*.(0|1)* → 1.(0|1)* → 1.(0|1)(0|1)* → 1.(0|1) → 1.0
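These rewrite rules can be played out mechanically. Below is a minimal sketch, not from the slides, that represents regular expressions as small Python classes and generates one random string by applying the rules (pick a side of a choice, unroll or stop a star); all names are illustrative.

```python
import random
from dataclasses import dataclass

# Illustrative regular expression representation: Letter, Seq (r1 r2),
# Choice (r1 | r2), and Star (r*). Generation applies the rewrite rules
# randomly, so different runs produce different strings.

@dataclass
class Letter:
    ch: str

@dataclass
class Seq:
    r1: object
    r2: object

@dataclass
class Choice:
    r1: object
    r2: object

@dataclass
class Star:
    r: object

def generate(r):
    if isinstance(r, Letter):
        return r.ch
    if isinstance(r, Seq):                      # generate both parts in order
        return generate(r.r1) + generate(r.r2)
    if isinstance(r, Choice):                   # rules 1 and 2: pick a side
        return generate(random.choice((r.r1, r.r2)))
    if isinstance(r, Star):                     # rules 3 and 4: unroll or stop
        return "" if random.random() < 0.5 else generate(r.r) + generate(r)
    raise TypeError(r)

bit = Choice(Letter("0"), Letter("1"))
binary_float = Seq(Star(bit), Seq(Letter("."), Star(bit)))
print(generate(binary_float))   # e.g. "1.0", "11.01", or just "." on different runs
```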

Nondeterminism in Generation. Rewriting is similar to equational reasoning, but different rule applications may yield different final results. Example 1: (0|1)*.(0|1)* → (0|1)(0|1)*.(0|1)* → 1(0|1)*.(0|1)* → 1.(0|1)* → 1.(0|1)(0|1)* → 1.(0|1) → 1.0. Example 2: (0|1)*.(0|1)* → (0|1)(0|1)*.(0|1)* → 0(0|1)*.(0|1)* → 0.(0|1)* → 0.(0|1)(0|1)* → 0.(0|1) → 0.1

Concept of Language Generated by Regular Expressions. The set of all strings generated by a regular expression is the language of the regular expression. In general, the language may be (countably) infinite. A string in the language is often called a token.

Examples of Languages and Regular Expressions. With Σ = { 0, 1, . }: (0|1)*.(0|1)* - binary floating point numbers; (00)* - even-length all-zero strings; 1*(01*01*)* - strings with an even number of zeros. With Σ = { a, b, c, 0, 1, 2 }: (a|b|c)(a|b|c|0|1|2)* - alphanumeric identifiers; (0|1|2)* - ternary numbers.
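As a quick sanity check (not part of the lecture), the first example can be tried with Python's re module; note that re also spells choice with | but the literal dot must be escaped, and fullmatch is used so the whole string must match.

```python
import re

# The slide's (0|1)*.(0|1)* in Python's re syntax.
binary_float = re.compile(r"(0|1)*\.(0|1)*")

for s in ["1.0", "11.0", "0.1", "10", "1.2"]:
    print(s, bool(binary_float.fullmatch(s)))
# 1.0 True, 11.0 True, 0.1 True, 10 False, 1.2 False
```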

Alternate Abstraction: Finite-State Automata. A set of states with initial and accept states, and transitions between states labeled with letters of the alphabet. [Diagram: a two-state automaton for (0|1)*.(0|1)* - the start state loops on 0 and 1, a transition labeled . leads to the accept state, and the accept state loops on 0 and 1.]

Automaton Accepting a String. Conceptually, run the string through the automaton. Keep a current state and a current letter in the string. Start with the start state and the first letter of the string. At each step, match the current letter against a transition whose label is the same as the letter. Continue until reaching the end of the string or the match fails. If the run ends in an accept state, the automaton accepts the string. The language of the automaton is the set of strings it accepts.
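This recognition procedure is straightforward to code for a deterministic automaton. A minimal sketch follows, with an illustrative encoding of the (0|1)*.(0|1)* automaton as a transition table; the state names are invented here.

```python
# Minimal DFA simulation sketch (illustrative names, not from the slides).
# State "left" loops on 0/1, '.' moves to "right", which loops on 0/1 and accepts.

def dfa_accepts(string, transitions, start, accepting):
    state = start
    for letter in string:
        if (state, letter) not in transitions:
            return False          # match fails: no transition for this letter
        state = transitions[(state, letter)]
    return state in accepting     # accept iff the run ends in an accept state

transitions = {
    ("left", "0"): "left", ("left", "1"): "left", ("left", "."): "right",
    ("right", "0"): "right", ("right", "1"): "right",
}

print(dfa_accepts("11.0", transitions, "left", {"right"}))   # True
print(dfa_accepts("110", transitions, "left", {"right"}))    # False
```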

Example: running the automaton on the string 11.0. [Slide sequence: the current state starts at the start state and the current letter at the first 1; each matched letter advances both. After reading 1, 1, ., 0 the run ends in the accept state, so the string is accepted.]

Generative Versus Recognition. Regular expressions give you a way to generate all strings in a language; automata give you a way to recognize whether a specific string is in the language. Philosophically very different, but theoretically equivalent (for regular expressions and automata). Standard approach: use regular expressions to define the language, translated automatically into automata for the implementation.

From Regular Expressions to Automata. Construction by structural induction: given an arbitrary regular expression r, assume we can convert its subexpressions to automata with one start state and one accept state, and show how to convert each constructor to deliver an automaton with one start state and one accept state.

Basic Constructs. [Diagram: for ε and for a single letter a from Σ, an automaton with a start state, one transition labeled ε or a, and an accept state.]

Sequence r1 r2. [Diagram, built step by step: take the automata for r1 and r2; the start state of r1 becomes the new start state, the accept state of r2 becomes the new accept state, and an ε transition connects the old accept state of r1 to the old start state of r2.]

Choice r1 | r2. [Diagram, built step by step: add a new start state with ε transitions to the old start states of r1 and r2, and a new accept state with ε transitions from their old accept states.]

Kleene Star r*. [Diagram, built step by step: add a new start state and a new accept state; ε transitions go from the new start state to the old start state and to the new accept state, and from the old accept state back to the old start state and on to the new accept state.]
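The structural induction sketched in the last few slides is Thompson's construction. Below is a rough Python sketch of it under an assumed NFA encoding (a dict of edges keyed by state and label, with None standing for ε); the helper names are invented for illustration, and the same edge encoding is reused in the subset-construction sketch further on.

```python
# Sketch of the slide's structural-induction construction (Thompson's
# construction). Each NFA has one start and one accept state; edges map
# (state, letter-or-None) to a set of states, None meaning an ε edge.

import itertools
_ids = itertools.count()

def _new_state():
    return next(_ids)

def _nfa(start, accept, edges):
    return {"start": start, "accept": accept, "edges": edges}

def _merge(*edge_dicts):
    out = {}
    for d in edge_dicts:
        for k, v in d.items():
            out[k] = set(v)       # copy sets to avoid aliasing between NFAs
    return out

def _add(edges, src, label, dst):
    edges.setdefault((src, label), set()).add(dst)

def letter(a):
    s, f = _new_state(), _new_state()
    edges = {}
    _add(edges, s, a, f)
    return _nfa(s, f, edges)

def seq(n1, n2):
    edges = _merge(n1["edges"], n2["edges"])
    _add(edges, n1["accept"], None, n2["start"])      # ε edge joining the parts
    return _nfa(n1["start"], n2["accept"], edges)

def choice(n1, n2):
    s, f = _new_state(), _new_state()
    edges = _merge(n1["edges"], n2["edges"])
    for n in (n1, n2):
        _add(edges, s, None, n["start"])              # ε into each alternative
        _add(edges, n["accept"], None, f)             # ε out of each alternative
    return _nfa(s, f, edges)

def star(n):
    s, f = _new_state(), _new_state()
    edges = _merge(n["edges"])
    _add(edges, s, None, n["start"])
    _add(edges, s, None, f)                           # skip the body entirely
    _add(edges, n["accept"], None, n["start"])        # loop back for repetition
    _add(edges, n["accept"], None, f)
    return _nfa(s, f, edges)

# (0|1)* . (0|1)*
binary_float_nfa = seq(seq(star(choice(letter("0"), letter("1"))), letter(".")),
                       star(choice(letter("0"), letter("1"))))
```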

NFA vs. DFA. A DFA has no ε transitions and at most one transition from each state for each letter; an NFA has neither restriction. [Diagram: a state with one a transition and one b transition is OK; a state with two outgoing transitions both labeled a is not OK for a DFA.]

Conversions. Our regular expression to automata conversion produces an NFA. We would like to have a DFA to make the recognition algorithm simpler. We can convert from NFA to DFA (but the DFA may be exponentially larger than the NFA).

NFA to DFA Construction. The DFA has a state for each subset of states in the NFA. The DFA start state corresponds to the set of states reachable by following ε transitions from the NFA start state. A DFA state is an accept state if an NFA accept state is in its set of NFA states. To compute the transition for a given DFA state D and letter a: set S to the empty set; find the set N of D's NFA states; for each NFA state n in N, compute the set N' of states the NFA may be in after matching a from n (including following ε transitions), and set S to S ∪ N'. If S is nonempty, there is a transition for a from D to the DFA state that has the set S of NFA states; otherwise, there is no transition for a from D.
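A direct transcription of this subset construction into Python might look roughly as follows; the NFA encoding (edges keyed by state and label, None for ε) and the helper names are assumptions for illustration, and only the reachable DFA states are built rather than all subsets.

```python
# Sketch of the subset construction described above.

def eps_closure(states, edges):
    """All states reachable from `states` by following only ε edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in edges.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def nfa_to_dfa(edges, start, accepts, alphabet):
    dfa_start = eps_closure({start}, edges)
    dfa_edges, todo, seen = {}, [dfa_start], {dfa_start}
    while todo:
        D = todo.pop()
        for a in alphabet:
            S = set()                                  # NFA states reachable on a
            for n in D:
                S |= edges.get((n, a), set())
            S = eps_closure(S, edges)
            if S:
                dfa_edges[(D, a)] = S
                if S not in seen:
                    seen.add(S)
                    todo.append(S)
    dfa_accepts = {D for D in seen if D & accepts}     # contains an NFA accept state
    return dfa_start, dfa_edges, dfa_accepts

# Tiny hand-written NFA for a*b: 0 loops on a, 0 --b--> 1, accept state 1.
edges = {(0, "a"): {0}, (0, "b"): {1}}
start, dfa_edges, accepts = nfa_to_dfa(edges, 0, {1}, {"a", "b"})
print(start, dfa_edges, accepts)
```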

NFA to DFA Example for (a|b)*.(a|b)*. [Diagram: the 16-state ε-NFA produced by the construction, and the resulting DFA whose states are labeled with sets of NFA states such as {1,2,3,4,8}, {5,7,2,3,4,8}, {6,7,2,3,4,8}, {9,10,11,12,16}, {13,15,10,11,12,16}, and {14,15,10,11,12,16}.]

Lexical Structure in Languages. Each language typically has several categories of words. In a typical programming language: keywords (if, while); arithmetic operators (+, -, *, /); integer numbers (1, 2, 45, 67); floating point numbers (1.0, .2, 3.337); identifiers (abc, i, j, ab345). There is typically a lexical category for each keyword and/or each category of word, and each lexical category is defined by a regular expression.

Lexical Categories Example. IfKeyword = if; WhileKeyword = while; Operator = + | - | * | /; Integer = [0-9][0-9]*; Float = [0-9]*.[0-9]*; Identifier = [a-z]([a-z]|[0-9])*. Note that [0-9] = (0|1|2|3|4|5|6|7|8|9) and [a-z] = (a|b|c|...|y|z). We will use the lexical categories at the next level.
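In practice a scanner can be assembled from these category regular expressions. Here is a small illustrative sketch using Python's re module; the ordering of the categories, the whitespace-skipping rule, and the behavior on unmatched characters are choices made for this example, not part of the slides.

```python
import re

# Sketch of a scanner for the lexical categories above. Keywords are listed
# before Identifier so "if" is not scanned as an identifier; a real scanner
# would also apply maximal munch so "ifx" is one identifier, not "if" + "x".
token_spec = [
    ("IfKeyword",    r"if"),
    ("WhileKeyword", r"while"),
    ("Float",        r"[0-9]*\.[0-9]*"),
    ("Integer",      r"[0-9][0-9]*"),
    ("Identifier",   r"[a-z]([a-z]|[0-9])*"),
    ("Operator",     r"[+\-*/]"),
    ("Skip",         r"\s+"),          # whitespace, discarded below
]
scanner = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in token_spec))

def tokenize(text):
    for m in scanner.finditer(text):   # unmatched characters are silently skipped
        if m.lastgroup != "Skip":
            yield m.lastgroup, m.group()

print(list(tokenize("while x1 + 2.5")))
# [('WhileKeyword', 'while'), ('Identifier', 'x1'), ('Operator', '+'), ('Float', '2.5')]
```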

Programming Language Syntax. Regular languages are suboptimal for specifying programming language syntax. Why? Constructs with nested syntax, such as (a+(b-c))*(d-(x-(y-z))) or if (x < y) if (y < z) a = 5 else a = 6 else a = 7. Regular languages lack the state required to model nesting. Canonical example: nested expressions - there is no regular expression for the language of parenthesized expressions.

Solution: Context-Free Grammar. A set of terminals { Op, Int, Open, Close }, each terminal defined by a regular expression: Op = + | - | * | /; Int = [0-9][0-9]*; Open = <; Close = >. A set of nonterminals { Start, Expr }. A set of productions, each with a single nonterminal on the LHS and a sequence of terminals and nonterminals on the RHS: Start → Expr; Expr → Expr Op Expr; Expr → Int; Expr → Open Expr Close.

Production Game. Keep a current string; start with the Start nonterminal. Loop until no nonterminals remain: choose a nonterminal in the current string, choose a production with that nonterminal on the LHS, and replace the nonterminal with the RHS of the production. Finally, substitute the terminals' regular expressions with corresponding strings. The generated string is in the language. Note: different choices produce different strings.
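The production game is easy to simulate. A small sketch follows under an assumed grammar encoding (nonterminals mapped to lists of alternative right-hand sides); the terminal generators stand in for the "substitute regular expressions with corresponding strings" step, and the depth cutoff is just a device to keep the random string finite.

```python
import random

# Sketch of the production game for the expression grammar above.
grammar = {
    "Start": [["Expr"]],
    "Expr":  [["Expr", "Op", "Expr"], ["Int"], ["Open", "Expr", "Close"]],
}
# For each terminal, pick a concrete string matching its regular expression.
terminals = {
    "Op":    lambda: random.choice("+-*/"),
    "Int":   lambda: str(random.randint(0, 99)),
    "Open":  lambda: "<",
    "Close": lambda: ">",
}

def generate(symbol="Start", depth=0):
    if symbol in terminals:
        return terminals[symbol]()
    alternatives = grammar[symbol]
    if depth > 4 and symbol == "Expr":
        alternatives = [["Int"]]        # cut recursion off so the string stays finite
    rhs = random.choice(alternatives)
    return "".join(generate(s, depth + 1) for s in rhs)

print(generate())   # e.g. "<2-1>+1"; different runs produce different strings
```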

Sample Derivation. Terminals: Op = + | - | * | /; Int = [0-9][0-9]*; Open = <; Close = >. Productions: 1) Start → Expr; 2) Expr → Expr Op Expr; 3) Expr → Int; 4) Expr → Open Expr Close. Derivation: Start → Expr → Expr Op Expr → Open Expr Close Op Expr → Open Expr Op Expr Close Op Expr → Open Int Op Expr Close Op Expr → Open Int Op Expr Close Op Int → Open Int Op Int Close Op Int → < 2 - 1 > + 1

Parse Tree. Internal nodes: nonterminals. Leaves: terminals. Edges: from the nonterminal on the LHS of a production to the nodes from the RHS of that production. The parse tree captures the derivation of the string.

Parse Tree for <2-1>+1. [Diagram: Start expands to Expr Op Expr; the left Expr expands to Open Expr Close (< ... >) containing Expr Op Expr with Int 2, Op -, Int 1; the top-level Op is + and the right Expr is Int 1.]

Ambiguity in Grammar. A grammar is ambiguous if there are multiple derivations (and therefore multiple parse trees) for a single string. The derivation and parse tree usually reflect the semantics of the program, so ambiguity in the grammar often reflects ambiguity in the semantics of the language (which is considered undesirable).

Ambiguity Example. Two parse trees for 2-1+1: one corresponding to <2-1>+1 (the - grouped first) and one corresponding to 2-<1+1> (the + grouped first). [Diagram: the two parse trees.]

Eliminating Ambiguity. Solution: hack the grammar. Original grammar: Start → Expr; Expr → Expr Op Expr; Expr → Int; Expr → Open Expr Close. Hacked grammar: Start → Expr; Expr → Expr Op Int; Expr → Int; Expr → Open Expr Close. Conceptually, this makes all operators associate to the left.

Parse Trees for Hacked Grammar. Only one parse tree for 2-1+1! [Diagram: the left-associative tree corresponding to <2-1>+1 is still a valid parse tree; the tree corresponding to 2-<1+1> is no longer valid.]

Precedence Violations. All operators associate to the left, which violates the precedence of * over + and -: 2-3*4 associates like <2-3>*4. [Diagram: parse tree for 2-3*4 under the left-associative grammar, with 2-3 grouped first and then multiplied by 4.]

Hacking Around Precedence. Original grammar: Op = + | - | * | /; Int = [0-9][0-9]*; Open = <; Close = >; Start → Expr; Expr → Expr Op Int; Expr → Int; Expr → Open Expr Close. Hacked grammar: AddOp = + | -; MulOp = * | /; Int = [0-9][0-9]*; Open = <; Close = >; Start → Expr; Expr → Expr AddOp Term; Expr → Term; Term → Term MulOp Num; Term → Num; Num → Int; Num → Open Expr Close.

Parse Tree Changes. [Diagram: the old parse tree for 2-3*4 groups 2-3 first and then applies *; the new parse tree under the precedence grammar groups 3*4 under a Term, so * binds tighter than -.]

General Idea. Group operators into precedence levels: * and / are at the top level and bind strongest; + and - are at the next level and bind next strongest. Use a nonterminal for each precedence level: Term is the nonterminal for * and /, Expr is the nonterminal for + and -. Operators can be made left or right associative within each level. This generalizes to arbitrary levels of precedence.

Parser. Converts a program into a parse tree. Can be written by hand, or produced automatically by a parser generator, which accepts a grammar as input and produces a parser as output. Practical problem: the parse tree for the hacked grammar is complicated; we would like to work with a more intuitive parse tree.
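As an illustration of a hand-written parser (a sketch, not the lecture's implementation), here is a rough recursive descent parser for the precedence grammar above. The left-recursive productions are replaced by loops, which preserves left associativity, and the output is a nested-tuple abstract syntax tree rather than a concrete parse tree; taking a token list as input is an assumption made for simplicity.

```python
# Recursive descent sketch for: Expr -> Expr AddOp Term | Term,
# Term -> Term MulOp Num | Num, Num -> Int | Open Expr Close (with < and >).

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_expr():                      # Expr: Term (AddOp Term)*
        node = parse_term()
        while peek() in ("+", "-"):
            node = (take(), node, parse_term())
        return node

    def parse_term():                      # Term: Num (MulOp Num)*
        node = parse_num()
        while peek() in ("*", "/"):
            node = (take(), node, parse_num())
        return node

    def parse_num():                       # Num: Int | Open Expr Close
        if peek() == "<":
            take()
            node = parse_expr()
            assert take() == ">", "expected closing >"
            return node
        return int(take())

    return parse_expr()

print(parse(["2", "-", "3", "*", "4"]))             # ('-', 2, ('*', 3, 4))
print(parse(["<", "2", "-", "3", ">", "*", "4"]))   # ('*', ('-', 2, 3), 4)
```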

Solution: Abstract versus Concrete Syntax. Abstract syntax corresponds to the intuitive way of thinking about the structure of the program; it omits details like superfluous keywords that are there only to make the language unambiguous, and may itself be ambiguous. Concrete syntax corresponds to the full grammar used to parse the language. Parsers are often written to produce abstract syntax trees.

Abstract Syntax Trees. Start with an intuitive but ambiguous grammar. Hack the grammar to make it unambiguous; the resulting concrete parse trees are less intuitive. Convert the concrete parse trees to abstract syntax trees, which correspond to the intuitive grammar for the language and are simpler for a program to manipulate.
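Concretely, an abstract syntax tree type for this expression language might look like the following sketch (illustrative names, not from the slides); notice that the Term, Num, Open, and Close nodes of the concrete grammar have no counterpart, with grouping represented purely by nesting.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative abstract syntax tree types: only the operator structure survives.

@dataclass
class IntLit:
    value: int

@dataclass
class BinOp:
    op: str            # one of + - * /
    left: "Ast"
    right: "Ast"

Ast = Union[IntLit, BinOp]

# <2-3>*4 as an abstract syntax tree: the angle brackets are gone.
tree = BinOp("*", BinOp("-", IntLit(2), IntLit(3)), IntLit(4))
```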

Example. Intuitive but ambiguous grammar: Op = * | / | + | -; Int = [0-9][0-9]*; Start → Expr; Expr → Expr Op Expr; Expr → Int. Hacked unambiguous grammar: AddOp = + | -; MulOp = * | /; Int = [0-9][0-9]*; Open = <; Close = >; Start → Expr; Expr → Expr AddOp Term; Expr → Term; Term → Term MulOp Num; Term → Num; Num → Int; Num → Open Expr Close.

Concrete parse tree versus abstract syntax tree for <2-3>*4. [Diagram: the concrete parse tree under the hacked grammar, with Term, Num, Open, and Close nodes; and the abstract syntax tree, which uses the intuitive grammar and eliminates superfluous terminals such as Open and Close.]

Abstract parse tree for <2-3>*4, and a further simplified abstract syntax tree for <2-3>*4. [Diagram: in the simplified tree the operators themselves become the internal nodes, with * at the root, the - node over Int 2 and Int 3 as its left child, and Int 4 as its right child.]

Summary. Lexical and syntactic levels of structure: lexical - regular expressions and automata; syntactic - grammars, grammar ambiguities, hacked grammars, abstract syntax trees. Generation versus recognition approaches: generation is more convenient for specification; recognition is required in the implementation.

Handling If Then Else. Start → Stat; Stat → if Expr then Stat else Stat; Stat → if Expr then Stat; Stat → ...

Parse Trees. Consider the statement: if e1 then if e2 then s1 else s2

Two Parse Trees. [Diagram: in the first tree the else attaches to the inner if, i.e. if e1 then (if e2 then s1 else s2); in the second it attaches to the outer if, i.e. if e1 then (if e2 then s1) else s2. Which is correct?]

Alternative Readings. Parse tree number 1: if e1 { if e2 { s1 } else { s2 } }. Parse tree number 2: if e1 { if e2 { s1 } } else { s2 }. The grammar is ambiguous.

Hacked Grammar. Goal → Stat; Stat → WithElse; Stat → LastElse; WithElse → if Expr then WithElse else WithElse; WithElse → <statements without if then or if then else>; LastElse → if Expr then Stat; LastElse → if Expr then WithElse else LastElse

Hacked Grammar. Basic idea: control carefully where an if without an else can occur: either at the top level of a statement, or as the very last in a sequence of if then else if then ... statements.

Grammar Vocabulary. Leftmost derivation: always expands the leftmost remaining nonterminal; similarly for rightmost derivation. Sentential form: a partially or fully derived string from a step in a valid derivation, for example 0 + Expr Op Expr and 0 + Expr - 2.

Grammar Defining a Language. Generative approach: the language is all strings that the grammar generates. (How many are there for the grammar in the previous example?) Automaton, recognition approach: all strings that the automaton accepts. There are different flavors of grammars and automata; in general, grammars and automata correspond.

Regular Languages: Automaton Characterization (S, A, F, s0, SF). Finite set of states S; finite alphabet A; transition function F : S × A → S; start state s0; final states SF. The language is the set of strings accepted by the automaton.

Regular Languages: Regular Grammar Characterization (T, NT, S, P). Finite set of terminals T; finite set of nonterminals NT; start nonterminal S (goal symbol, start symbol); finite set of productions P : NT → T ∪ NT ∪ T NT (each RHS is a terminal, a nonterminal, or a terminal followed by a nonterminal). The language is the set of strings generated by the grammar.

Grammar and Automata Correspondence. Regular grammar - finite-state automaton; context-free grammar - push-down automaton; context-sensitive grammar - Turing machine.

Context-Free Grammars: Grammar Characterization (T, NT, S, P). Finite set of terminals T; finite set of nonterminals NT; start nonterminal S (goal symbol, start symbol); finite set of productions P : NT → (T ∪ NT)*. The RHS of a production can be any sequence of terminals and nonterminals.

Push-Down Automata: a DFA plus a stack, characterized by (S, A, V, F, s0, SF). Finite set of states S; finite input alphabet A; stack alphabet V; transition relation F : S × (A ∪ {ε}) × V → S × V*; start state s0; final states SF. Each configuration consists of a state, a stack, and the remaining input string.
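The slides do not give an implementation, but the "DFA plus a stack" idea can be seen on the canonical example from earlier: balanced < > strings, which no regular expression can describe. A minimal sketch, with invented names:

```python
# A one-state push-down recognizer for balanced < > strings.
# The stack counts unmatched open brackets.

def balanced(s):
    stack = []
    for c in s:
        if c == "<":
            stack.append(c)          # push on an open bracket
        elif c == ">":
            if not stack:
                return False         # close bracket with nothing to match
            stack.pop()              # pop the matching open bracket
        else:
            return False             # letter outside the alphabet
    return not stack                 # accept iff the stack is empty at the end

print(balanced("<<>><>"))   # True
print(balanced("<<>"))      # False
```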

CFG Versus PDA. CFGs and PDAs are of equivalent power. The grammar is the specification; the PDA is the implementation mechanism: translate the CFG to a PDA, then use the PDA to parse the input string. This is the foundation for bottom-up parser generators.

Context-Sensitive Grammars and Turing Machines. Context-sensitive grammars allow productions to use context: P : (T ∪ NT)+ → (T ∪ NT)*. Turing machines have a finite state control and a two-way tape instead of a stack.

MIT OpenCourseWare http://ocw.mit.edu 6.035 Computer Language Engineering Spring 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.