Lexical Analysis: Constructing a Scanner from Regular Expressions

Similar documents
Compiler Construction

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Finite Automata. Reading: Chapter 2

Scanner. tokens scanner parser IR. source code. errors

Regular Expressions and Automata using Haskell

6.045: Automata, Computability, and Complexity Or, Great Ideas in Theoretical Computer Science Spring, Class 4 Nancy Lynch

Finite Automata. Reading: Chapter 2

Automata and Computability. Solutions to Exercises

Lexical analysis FORMAL LANGUAGES AND COMPILERS. Floriano Scioscia. Formal Languages and Compilers A.Y. 2015/2016

CSC4510 AUTOMATA 2.1 Finite Automata: Examples and D efinitions Definitions

Pushdown Automata. place the input head on the leftmost input symbol. while symbol read = b and pile contains discs advance head remove disc from pile

Lexical Analysis and Scanning. Honors Compilers Feb 5 th 2001 Robert Dewar

CSE 135: Introduction to Theory of Computation Decidability and Recognizability

How to make the computer understand? Lecture 15: Putting it all together. Example (Output assembly code) Example (input program) Anatomy of a Computer

03 - Lexical Analysis

The Halting Problem is Undecidable

Automata Theory. Şubat 2006 Tuğrul Yılmaz Ankara Üniversitesi

Regular Languages and Finite State Machines

(IALC, Chapters 8 and 9) Introduction to Turing s life, Turing machines, universal machines, unsolvable problems.

NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions

How To Compare A Markov Algorithm To A Turing Machine

3515ICT Theory of Computation Turing Machines

Chapter 7 Uncomputability

Compiler I: Syntax Analysis Human Thought

Turing Machines: An Introduction

CS154. Turing Machines. Turing Machine. Turing Machines versus DFAs FINITE STATE CONTROL AI N P U T INFINITE TAPE. read write move.

T Reactive Systems: Introduction and Finite State Automata

Finite Automata and Regular Languages

ω-automata Automata that accept (or reject) words of infinite length. Languages of infinite words appear:

Programming Project 1: Lexical Analyzer (Scanner)

Automata and Formal Languages

Genetic programming with regular expressions

Converting Finite Automata to Regular Expressions

Automata on Infinite Words and Trees

CS 3719 (Theory of Computation and Algorithms) Lecture 4

Learning Analysis by Reduction from Positive Data

Reading 13 : Finite State Automata and Regular Expressions

COMP 356 Programming Language Structures Notes for Chapter 4 of Concepts of Programming Languages Scanning and Parsing

Computational Models Lecture 8, Spring 2009

Web Data Extraction: 1 o Semestre 2007/2008

Composability of Infinite-State Activity Automata*

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

Notes on Complexity Theory Last updated: August, Lecture 1

Lecture 9. Semantic Analysis Scoping and Symbol Table

Deterministic Finite Automata

Programming Languages CIS 443

ML-Flex Implementation Notes

Positional Numbering System

6.080/6.089 GITCS Feb 12, Lecture 3

Pushdown automata. Informatics 2A: Lecture 9. Alex Simpson. 3 October, School of Informatics University of Edinburgh als@inf.ed.ac.

Programming Assignment II Due Date: See online CISC 672 schedule Individual Assignment

Operators are typically activated through the menu items. Choose th 't. Ig on. r ht N d. IS ac Iva es an operator that highlights A-tr 't' 1 b 1 d

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Turing Machines, Part I

Course Manual Automata & Complexity 2015

Omega Automata: Minimization and Learning 1

Theory of Computation Chapter 2: Turing Machines

CS 301 Course Information

24 Uses of Turing Machines

Class notes Program Analysis course given by Prof. Mooly Sagiv Computer Science Department, Tel Aviv University second lecture 8/3/2007

Regular Languages and Finite Automata

Philadelphia University Faculty of Information Technology Department of Computer Science First Semester, 2007/2008.

CS5236 Advanced Automata Theory

C H A P T E R Regular Expressions regular expression

1 Definition of a Turing machine

Algebraic Computation Models. Algebraic Computation Models

ASSIGNMENT ONE SOLUTIONS MATH 4805 / COMP 4805 / MATH 5605

Static Analysis. Find the Bug! : Analysis of Software Artifacts. Jonathan Aldrich. disable interrupts. ERROR: returning with interrupts disabled

Properties of Stabilizing Computations

Objects for lexical analysis

Lecture I FINITE AUTOMATA

Eventia Log Parsing Editor 1.0 Administration Guide

Programming Languages

Lossless Data Compression Standard Applications and the MapReduce Web Computing Framework

Sources: On the Web: Slides will be available on:

FINITE STATE AND TURING MACHINES

Compilers Lexical Analysis

Scanning and parsing. Topics. Announcements Pick a partner by Monday Makeup lecture will be on Monday August 29th at 3pm

Modeling of Graph and Automaton in Database

Increasing Interaction and Support in the Formal Languages and Automata Theory Course

Computer Science Theory. From the course description:

Reminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new

Reminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new GAP (2) Graph Accessibility Problem (GAP) (1)

Introduction to Lex. General Description Input file Output file How matching is done Regular expressions Local names Using Lex

Introduction to Programming (in C++) Loops. Jordi Cortadella, Ricard Gavaldà, Fernando Orejas Dept. of Computer Science, UPC

Example. Introduction to Programming (in C++) Loops. The while statement. Write the numbers 1 N. Assume the following specification:

Recovering Business Rules from Legacy Source Code for System Modernization

Testing LTL Formula Translation into Büchi Automata

Lecture 2: Universality

Output: struct treenode{ int data; struct treenode *left, *right; } struct treenode *tree_ptr;

CSCI 3136 Principles of Programming Languages

Data Structure with C

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

Intrusion Detection via Static Analysis

Villanova University CSC 2400: Computer Systems I

Chapter 2: Elements of Java

Social Media Mining. Graph Essentials

Dynamic Programming. Lecture Overview Introduction

Basics of Compiler Design

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

Transcription:

Lexical Analysis: Constructing a Scanner from Regular Expressions

Goal Show how to construct a DFA to recognize any RE Scanner simulates a DFA to generate tokens Last Lecture Convert RE to an nondeterministic finite automaton (NFA) Use Thompson s construction This Lecture Convert an NFA to a deterministic finite automaton (DFA) Use Subset construction

Convert NFA to DFA NFA is a 5-tuple (N, Σ, δ N, n 0, N A ) DFA is a 5-tuple (D, Σ, δ D, d 0, D A ) Want to create a DFA that simulates the NFA Non-trivial part is constructing D and δ D

NFA DFA: need to build a simulation of the NFA Two key functions Delta(q i, a) set of states reachable from states in q i by a Returns a set of states, for each n q i of δ i (n, a) ε-closure(q i ) set of states reachable from q i by ε moves Functions help create states of DFA by removing nondeterministic edges of the NFA.

Subset Construction Algorithm in English The algorithm: Start state q 0 derived from n 0 of the NFA Add q 0 to the Worklist Loop while Worklist not empty Remove a state q from worklist Compute t by Delta(q, α) for each α Σ, and take its ε-closure If t not in set Q add it to Q and Worklist Iterate until no more states are added Sounds more complex than it is

The Subset Construction Algorithm q 0 ε-closure(n 0 ) Q {q 0 } WorkList {q 0 } while ( WorkList is not empty ) remove q from WorkList for each α Σ t ε-closure(delta(q,α)) T[q,α] t if ( t Q ) then add t to Q and WorkList Let s think about why this works

NFA DFA with Subset Construction The algorithm: q 0 ε-closure(n 0 ) Q {q 0 } WorkList {q 0 } while ( WorkList ф ) remove q from WorkList for each α Σ t ε-closure(delta(q,α)) T[q,α] t if ( t Q ) then add t to Q and WorkList Let s think about why this works The algorithm halts: 1. Q contains no duplicates (test before adding) 2. 2 N is finite 3. while loop adds to Q, but does not remove from Q (monotone) the loop halts Q contains all the reachable NFA states It tries each character in each q. Q gives us D set of states of DFA T gives us δ D set of transitions of DFA

NFA DFA with Subset Construction Example of a fixed-point computation Monotone construction of some finite set Halts when it stops adding to the set These computations arise in many contexts We will see many more fixed-point computations

NFA DFA with Subset Construction a ( b c )* : a q 0 q 1 ε ε b ε q 4 q 5 ε q ε q 3 q ε 2 8 q 9 ε c q 6 q 7 ε Applying the subset construction: ε The algorithm: q 0 ε-closure(n 0 ) Q {q 0 } WorkList {q 0 } while ( WorkList ф ) remove q from WorkList for each α Σ t ε-closure(delta(q,α)) T[q,α] t if ( t Q ) then add t to Q and WorkList

NFA DFA with Subset Construction a ( b c )* : a q 0 q 1 ε ε b ε q 4 q 5 ε q ε q 3 q ε 2 8 q 9 ε c q 6 q 7 ε Applying the subset construction: ε Final states

NFA DFA with Subset Construction The DFA for a ( b c ) * Ends up smaller than the NFA All transitions are deterministic Use same code skeleton as before b b a s 0 s 1 c b s 2 c s 3 c

Where are we? Why are we doing this? RE NFA (Thompson s construction) Build an NFA for each term Combine them with ε-moves NFA DFA (subset construction) Build the simulation DFA Minimal DFA Hopcroft s algorithm DFA RE All pairs, all paths problem Union together paths from s 0 to a final state The Cycle of Constructions RE NFA DFA minimal DFA

Extra Slides

What we expect of the Scanner Report errors for lexicographically malformed inputs reject illegal characters, or meaningless character sequences E.g., # or floop in COOL Return an abstract representation of the code character sequences (e.g., if or loop ) turned into tokens. Resulting sequence of tokens will be used by the parser Makes the design of the parser a lot easier.

How to specify a scanner A scanner specification (e.g., for JLex), is list of (typically short) regular expressions. Each regular expressions has an action associated with it. Typically, an action is to return a token. On a given input string, the scanner will: find the longest prefix of the input string, that matches one of the regular expressions. will execute the action associated with the matching regular expression highest in the list. Scanner repeats this procedure for the remaining input. If no match can be found at some point, an error is reported.

Example of a Specification Consider the following scanner specification. 1. aaa { return T1 } 2. a*b { return T2 } 3. b { return S } Given the following input string into the scanner aaabbaaa the scanner as specified above would output T2 T2 T1 Note that the scanner will report an error for example on the string aa.

Special Return Tokens Sometimes one wants to extract information out of what prefix of the input was matched. Example: [a-za-z0-9]* { return STRING(yytext()) } Above RE matches every string that starts and ends with quotes, and has any number of alpha-numerical chars between them. Associated action returns a string token, which is the exact string that the RE matched. Note that yytext() will also include the quotes. Furthermore, note that this regular expression does not handle escaped characters.