Scanner. tokens scanner parser IR. source code. errors



Similar documents
Compiler Construction

6.045: Automata, Computability, and Complexity Or, Great Ideas in Theoretical Computer Science Spring, Class 4 Nancy Lynch

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Lexical analysis FORMAL LANGUAGES AND COMPILERS. Floriano Scioscia. Formal Languages and Compilers A.Y. 2015/2016

Reading 13 : Finite State Automata and Regular Expressions

COMP 356 Programming Language Structures Notes for Chapter 4 of Concepts of Programming Languages Scanning and Parsing

Automata and Computability. Solutions to Exercises

Regular Languages and Finite State Machines

Finite Automata. Reading: Chapter 2

ω-automata Automata that accept (or reject) words of infinite length. Languages of infinite words appear:

CSC4510 AUTOMATA 2.1 Finite Automata: Examples and D efinitions Definitions

Regular Expressions and Automata using Haskell

CMPSCI 250: Introduction to Computation. Lecture #19: Regular Expressions and Their Languages David Mix Barrington 11 April 2013

Automata on Infinite Words and Trees

University of Toronto Department of Electrical and Computer Engineering. Midterm Examination. CSC467 Compilers and Interpreters Fall Semester, 2005

Introduction to Finite Automata

C H A P T E R Regular Expressions regular expression

03 - Lexical Analysis

SOLUTION Trial Test Grammar & Parsing Deficiency Course for the Master in Software Technology Programme Utrecht University

University of Wales Swansea. Department of Computer Science. Compilers. Course notes for module CS 218

Honors Class (Foundations of) Informatics. Tom Verhoeff. Department of Mathematics & Computer Science Software Engineering & Technology

Lexical Analysis and Scanning. Honors Compilers Feb 5 th 2001 Robert Dewar

Programming Languages CIS 443

(IALC, Chapters 8 and 9) Introduction to Turing s life, Turing machines, universal machines, unsolvable problems.

Finite Automata. Reading: Chapter 2

Basics of Compiler Design

Introduction to Automata Theory. Reading: Chapter 1

Automata Theory. Şubat 2006 Tuğrul Yılmaz Ankara Üniversitesi

Finite Automata and Regular Languages

Automata and Formal Languages

The Halting Problem is Undecidable

T Reactive Systems: Introduction and Finite State Automata

Compilers Lexical Analysis

Pushdown automata. Informatics 2A: Lecture 9. Alex Simpson. 3 October, School of Informatics University of Edinburgh als@inf.ed.ac.

NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions

Regular Languages and Finite Automata

Grammars with Regulated Rewriting

How to make the computer understand? Lecture 15: Putting it all together. Example (Output assembly code) Example (input program) Anatomy of a Computer

CS5236 Advanced Automata Theory

3515ICT Theory of Computation Turing Machines

CSE 135: Introduction to Theory of Computation Decidability and Recognizability

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

CS154. Turing Machines. Turing Machine. Turing Machines versus DFAs FINITE STATE CONTROL AI N P U T INFINITE TAPE. read write move.

A.2. Exponents and Radicals. Integer Exponents. What you should learn. Exponential Notation. Why you should learn it. Properties of Exponents

Compiler I: Syntax Analysis Human Thought

CHAPTER 7 GENERAL PROOF SYSTEMS

MACM 101 Discrete Mathematics I

Chapter 2: Elements of Java

Pushdown Automata. place the input head on the leftmost input symbol. while symbol read = b and pile contains discs advance head remove disc from pile

If-Then-Else Problem (a motivating example for LR grammars)

Formal Grammars and Languages

ASSIGNMENT ONE SOLUTIONS MATH 4805 / COMP 4805 / MATH 5605

Lecture 18 Regular Expressions

CS103B Handout 17 Winter 2007 February 26, 2007 Languages and Regular Expressions

Bachelors of Computer Application Programming Principle & Algorithm (BCA-S102T)

Introduction to Lex. General Description Input file Output file How matching is done Regular expressions Local names Using Lex

Programming Assignment II Due Date: See online CISC 672 schedule Individual Assignment

Copy in your notebook: Add an example of each term with the symbols used in algebra 2 if there are any.

Modeling of Graph and Automaton in Database

Informatique Fondamentale IMA S8

CS 3719 (Theory of Computation and Algorithms) Lecture 4

Lights and Darks of the Star-Free Star

Lecture summary for Theory of Computation

Journal of Mathematics Volume 1, Number 1, Summer 2006 pp

6.080/6.089 GITCS Feb 12, Lecture 3

Turing Machines: An Introduction

Learning Automata and Grammars

Algorithmic Software Verification

12. Finite-State Machines

Approximating Context-Free Grammars for Parsing and Verification

Excel: Introduction to Formulas

A TOOL FOR DATA STRUCTURE VISUALIZATION AND USER-DEFINED ALGORITHM ANIMATION

Web Data Extraction: 1 o Semestre 2007/2008

Testing LTL Formula Translation into Büchi Automata

JavaScript: Introduction to Scripting Pearson Education, Inc. All rights reserved.

Joint Tactical Radio System Standard. JTRS Platform Adapter Interface Standard Version 1.3.3

Regular Languages and Finite Automata

Lecture I FINITE AUTOMATA

THEORY of COMPUTATION

Two Way F finite Automata and Three Ways to Solve Them

Syntaktická analýza. Ján Šturc. Zima 208

Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test

Chapter 7 Uncomputability

FINITE STATE AND TURING MACHINES

Objects for lexical analysis

Boolean Algebra Part 1

9.2 Summation Notation

Introduction to Java Applications Pearson Education, Inc. All rights reserved.

Software Modeling and Verification

Deterministic Finite Automata

Omega Automata: Minimization and Learning 1

Eventia Log Parsing Editor 1.0 Administration Guide

Converting Finite Automata to Regular Expressions

Transcription:

Scanner source code tokens scanner parser IR errors maps characters into tokens the basic unit of syntax x = x + y; becomes <id, x> = <id, x> + <id, y> ; character string value for a token is a lexeme typical tokens: number, id, +, -, *, /, do, end eliminates white space (tabs, blanks, comments) a key issue is speed use specialized recognizer (as opposed to lex) 1

Specifying patterns A scanner must recognize the units of syntax Some parts are easy: white space <ws> ::= <ws> <ws> \t \t keywords and operators specified as literal patterns: do, end comments opening and closing delimiters: /* */ 2

Specifying patterns A scanner must recognize the units of syntax Other parts are much harder: identifiers alphabetic followed by k alphanumerics (, $, &,... ) numbers integers: 0 or digit from 1-9 followed by digits from 0-9 decimals: integer. digits from 0-9 reals: (integer or decimal) E (+ or -) digits from 0-9 complex: ( real, real ) We need a powerful notation to specify these patterns 3

Operations on languages Operation Definition union of L and M L M = {s s L or s M} written L M concatenation of L and M LM = {st s L and t M} written LM Kleene closure of L L = S i=0 L i written L positive closure of L L + = S i=1 L i written L + 4

Regular expressions Patterns are often specified as regular languages Notations used to describe a regular language (or a regular set) include both regular expressions and regular grammars Regular expressions (over an alphabet Σ): 1. is a RE denoting the set {} 2. if a Σ, then a is a RE denoting {a} 3. if r and s are REs, denoting L(r) and L(s), then: (r) is a RE denoting L(r) (r) (s) is a RE denoting L(r) S L(s) (r)(s) is a RE denoting L(r)L(s) (r) is a RE denoting L(r) If we adopt a precedence for operators, the extra parentheses can go away. We assume closure, then concatenation, then alternation as the order of precedence. 5

Examples identifier letter (a b c... z A B C... Z) digit (0 1 2 3 4 5 6 7 8 9) id letter ( letter digit ) numbers integer (+ ) (0 (1 2 3... 9) digit ) decimal integer. ( digit ) real ( integer decimal ) E (+ ) digit complex ( real, real ) Numbers can get much more complicated Most programming language tokens can be described with REs We can use REs to build scanners automatically 6

Algebraic properties of REs Axiom Description r s = s r is commutative r (s t) = (r s) t is associative (rs)t = r(st) concatenation is associative r(s t) = rs rt concatenation distributes over (s t)r = sr tr r = r is the identity for concatenation r = r r = (r ) relation between and r = r is idempotent 7

Examples Let Σ = {a,b} 1. a b denotes {a, b} 2. (a b)(a b) denotes {aa, ab, ba, bb} i.e., (a b)(a b) = aa ab ba bb 3. a denotes {,a,aa,aaa,...} 4. (a b) denotes the set of all strings of a s and b s (including ) i.e., (a b) = (a b ) 5. a a b denotes {a,b,ab,aab,aaab,aaaab,...} 8

Recognizers From a regular expression we can construct a deterministic finite automaton (DFA) Recognizer for identifier: letter digit letter other 0 1 2 3 digit other accept error identifier letter (a b c... z A B C... Z) digit (0 1 2 3 4 5 6 7 8 9) id letter ( letter digit ) 9

Code for the recognizer char next char(); state 0; /* code for state 0 */ done false; token value "" /* empty string */ while( not done ) { class char class[char]; state next state[class,state]; switch(state) { case 1: /* building an id */ token value token value + char; char next char(); break; case 2: /* accept state */ token type = identifier; done = true; break; case 3: /* error */ token type = error; done = true; break; } } return token type; 10

Tables for the recognizer Two tables control the recognizer char class: next state: a z A Z 0 9 other value letter letter digit other class 0 1 2 3 letter 1 1 digit 3 1 other 3 2 To change languages, we can just change tables 11

Automatic construction Scanner generators automatically construct code from RE-like descriptions construct a DFA use state minimization techniques emit code for the scanner (table driven or direct code ) A key issue in automation is an interface to the parser lex is a scanner generator supplied with UNIX emits C code for scanner provides macro definitions for each token (used in the parser) 12

Grammars for regular languages Can we place a restriction on the form of a grammar to ensure that it describes a regular language? Provable fact: For any RE r, a grammar g such that L(r) = L(g) Grammars that generate regular sets are called regular grammars: They have productions in one of 2 forms: 1. A aa 2. A a where A is any non-terminal and a is any terminal symbol These are also called type 3 grammars (Chomsky) 13

More regular languages Example: the set of strings containing an even number of zeros and an even number of ones 1 s 0 s 1 1 0 0 1 0 0 s 2 s 3 The RE is (00 11) ((01 10)(00 11) (01 10)(00 11) ) 1 14

More regular expressions What about the RE (a b) abb? a b a b b s 0 s 1 s 2 s 3 State s 0 has multiple transitions on a! nondeterministic finite automaton a b s 0 {s 0,s 1 } {s 0 } s 1 {s 2 } s 2 {s 3 } 15

Finite automata A non-deterministic finite automaton (NFA) consists of: 1. a set of states S = {s 0,...,s n } 2. a set of input symbols Σ (the alphabet) 3. a transition function move mapping state-symbol pairs to sets of states 4. a distinguished start state s 0 5. a set of distinguished accepting or final states F A Deterministic Finite Automaton (DFA) is a special case of an NFA: 1. no state has a -transition, and 2. for each state s and input symbol a, there is at most one edge labelled a leaving s A DFA accepts x iff. a unique path through the transition graph from s 0 to a final state such that the edges spell x. 16

DFAs and NFAs are equivalent 1. DFAs are clearly a subset of NFAs 2. Any NFA can be converted into a DFA, by simulating sets of simultaneous states: each DFA state corresponds to a set of NFA states possible exponential blowup 17

fs 0;s 3g NFA to DFA using the subset construction: example 1 a b a b b s 0 s 1 s 2 s 3 a b {s 0 } {s 0,s 1 } {s 0 } {s 0,s 1 } {s 0,s 1 } {s 0,s 2 } {s 0,s 2 } {s 0,s 1 } {s 0,s 3 } {s 0,s 3 } {s 0,s 1 } {s 0 } b fs 0g b a b b fs 0;s fs 0;s 2g 1g a a a 18

Constructing a DFA from a regular expression DFA minimized RE DFA NFA moves RE NFA w/ moves build NFA for each term connect them with moves NFA w/ moves to DFA construct the simulation the subset construction DFA minimized DFA merge compatible states DFA RE construct R k i j = Rk 1 ik (Rkk k 1 ) R k 1 k j S R k 1 i j 19

RE to NFA N() N(a) a N(A) A N(A B) N(B) B N(AB) N(A) A N(B) B N(A ) N(A) A 20

RE to NFA: example a 2 3 a b 1 6 4 5 b a 2 3 (a b) 0 1 6 7 4 5 b a b b 7 8 9 10 abb 21

NFA to DFA: the subset construction Input: Output: Method: NFA N A DFA D with states Dstates and transitions Dtrans such that L(D) = L(N) Let s be a state in N and T be a set of states, and using the following operations: Operation -closure(s) -closure(t) move(t, a) Definition set of NFA states reachable from NFA state s on -transitions alone set of NFA states reachable from some NFA state s in T on -transitions alone set of NFA states to which there is a transition on input symbol a from some NFA state s in T add state T = -closure(s 0 ) unmarked to Dstates while unmarked state T in Dstates mark T for each input symbol a U = -closure(move(t, a)) if U Dstates then add U to Dstates unmarked Dtrans[T,a] = U endfor endwhile -closure(s 0 ) is the start state of D A state of D is final if it contains at least one final state in N 22

NFA to DFA using subset construction: example 2 a 2 3 0 1 6 7 a b b 8 9 10 4 5 b A = {0,1,2,4,7} D = {1,2,4,5,6,7,9} B = {1,2,3,4,6,7,8} E = {1,2,4,5,6,7,10} C = {1,2,4,5,6,7} a b A B C B B D C B C D B E E B C b C b a a b A a b b B D E a a 23

Limits of regular languages Not all languages are regular One cannot construct DFAs to recognize these languages: L = {p k q k } L = {wcw r w Σ } Note: neither of these is a regular expression! (DFAs cannot count!) But, this is a little subtle. One can construct DFAs for: alternating 0 s and 1 s ( 1)(01) ( 0) sets of pairs of 0 s and 1 s (01 10) + 24

Ramification - Internet Protocol How does your browser establish a connection with a web server? The client sends a SYN message to the server. In response, the server replies with a SYN-ACK. Finally the client sends an ACK back to the server. This is done through two DFAs in the client and server, respectively. 25

Ramification - Intrusion Detection Code FILE * f; f=fopen("demo", "r"); strcpy(...); //vulnerability if (!f) printf("fail to open\n"); else fgets(f, buf);... ------> ------> ------> Operating System SYS_OPEN SYS_WRITE SYS_READ A DFA will be exercised simultaneously with the program on the OS side to detect intrusion. 26