Lexical Analysis: Scanners, Regular Expressions, and Automata (cs4713)


Phases of compilation
Compilers
  front end: read input program
  mid end: optimization
  back end: translate into machine code
Pipeline: lexical analysis → parsing → semantic analysis → code generation → assembler → linker
Units handled along the way: characters, words/strings, sentences/statements, meaning, translation

Lexical analysis
The first phase of compilation
  Also known as lexer, scanner
  Takes a stream of characters and returns tokens (words)
  Each token has a type and an optional value
  Called by the parser each time a new token is needed
Example:
  if (a == b) c = a;
  IF LPARAN <ID "a"> EQ <ID "b"> RPARAN <ID "c"> ASSIGN <ID "a">

Lexical analysis
Typical tokens of programming languages
  Reserved words: class, int, char, bool, ...
  Identifiers: abc, def, mmm, mine, ...
  Constant numbers: 123, 123.45, 1.2E3
  Operators and separators: (, ), <, <=, +, -, ...
Goal: recognize token classes, and report an error if a string does not match any class
Each token class could be
  A single reserved word: CLASS, INT, CHAR, ...
  A single operator: LE, LT, ADD, ...
  A single separator: LPARAN, RPARAN, COMMA, ...
  The group of all identifiers: <ID "a">, <ID "b">, ...
  The group of all integer constants: <INTNUM 1>, ...
  The group of all floating point numbers: <FLOAT 1.0>, ...

Simple recognizers
Recognizing keywords: only need to return the token type
  c ← NextChar()
  if (c == 'f') {
    c ← NextChar()
    if (c == 'e') {
      c ← NextChar()
      if (c == 'e') return <FEE>
    }
  }
  report syntax error
DFA: s0 --f--> s1 --e--> s2 --e--> s3

Recognizing integers
Token class recognizer: return <type,value> for each token
  c ← NextChar();
  if (c == '0') then return <INT,0>
  else if (c >= '1' && c <= '9') {
    val = c - '0';
    c ← NextChar()
    while (c >= '0' && c <= '9') {
      val = val * 10 + (c - '0');
      c ← NextChar()
    }
    return <INT,val>
  }
  else report syntax error
DFA: s0 --0--> s1; s0 --1..9--> s2; s2 --0..9--> s2
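The integer recognizer above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the function name scan_int and the convention of returning the token together with the next read position are assumptions made for the example.

```python
def scan_int(text, pos=0):
    """Recognize an integer starting at text[pos].

    Returns ((type, value), next_pos). The name and return shape are
    hypothetical; the slide only specifies the <INT,val> token.
    """
    if pos < len(text) and text[pos] == "0":
        # A leading 0 is a complete token by itself (state s1).
        return ("INT", 0), pos + 1
    if pos < len(text) and "1" <= text[pos] <= "9":
        val = ord(text[pos]) - ord("0")
        pos += 1
        # Stay in state s2 while digits keep coming.
        while pos < len(text) and "0" <= text[pos] <= "9":
            val = val * 10 + (ord(text[pos]) - ord("0"))
            pos += 1
        return ("INT", val), pos
    raise SyntaxError(f"no integer at position {pos}")
```

Note that on input such as 007 the recognizer returns <INT,0> after one character, exactly as the DFA with separate states for 0 and 1..9 prescribes.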

Multi-token recognizers
  c ← NextChar()
  if (c == 'f') {
    c ← NextChar()
    if (c == 'e') {
      c ← NextChar()
      if (c == 'e') return <FEE> else report error
    }
    else if (c == 'i') {
      c ← NextChar()
      if (c == 'e') return <FIE> else report error
    }
  }
  else if (c == 'w') {
    c ← NextChar()
    if (c == 'h') { c ← NextChar(); ... }
    else report error;
  }
  else report error
DFA: s0 --f--> s1 --e--> s2 --e--> s3; s1 --i--> s4 --e--> s5; s0 --w--> s6 --h--> s7 --i--> s8 --l--> s9 --e--> s10

Skipping white space
  c ← NextChar();
  while (c == ' ' || c == '\n' || c == '\r' || c == '\t')
    c ← NextChar();
  if (c == '0') then return <INT,0>
  else if (c >= '1' && c <= '9') {
    val = c - '0';
    c ← NextChar()
    while (c >= '0' && c <= '9') {
      val = val * 10 + (c - '0');
      c ← NextChar()
    }
    return <INT,val>
  }
  else report syntax error
DFA: s0 --0--> s1; s0 --1..9--> s2; s2 --0..9--> s2

Recognizing operators
  c ← NextChar();
  while (c == ' ' || c == '\n' || c == '\r' || c == '\t')
    c ← NextChar();
  if (c == '0') then return <INT,0>
  else if (c >= '1' && c <= '9') {
    val = c - '0';
    c ← NextChar()
    while (c >= '0' && c <= '9') {
      val = val * 10 + (c - '0');
      c ← NextChar()
    }
    return <INT,val>
  }
  else if (c == '<') return <LT>
  else if (c == '*') return <MULT>
  else ...
  else report syntax error
DFA: s0 --0--> s1; s0 --1..9--> s2; s2 --0..9--> s2; s0 --<--> s3; s0 --*--> s4

Reading ahead
What if both <= and < are valid tokens?
  c ← NextChar();
  ...
  else if (c == '<') {
    c ← NextChar();
    if (c == '=') return <LE>
    else { PutBack(c); return <LT>; }
  }
  else ...
  else report syntax error

  static char putback = 0;
  NextChar() {
    if (putback == 0) return GetNextChar();
    else { char c = putback; putback = 0; return c; }
  }
  PutBack(char c) {
    if (putback == 0) putback = c;
    else error;
  }
DFA: s0 --0--> s1; s0 --1..9--> s2; s2 --0..9--> s2; s0 --<--> s3; s3 --=--> s5; s0 --*--> s4
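The one-character putback buffer can be sketched in Python as below. The class CharStream, its method names, and the helper scan_relop are hypothetical names chosen for this illustration; end of input is modeled as the empty string, which is an assumption the slide does not make.

```python
class CharStream:
    """Character source with a single slot of putback."""

    def __init__(self, text):
        self._text = text
        self._pos = 0
        self._putback = None  # None means the slot is empty

    def next_char(self):
        # Serve the put-back character first, if there is one.
        if self._putback is not None:
            c, self._putback = self._putback, None
            return c
        if self._pos >= len(self._text):
            return ""  # end of input
        c = self._text[self._pos]
        self._pos += 1
        return c

    def put_back(self, c):
        if self._putback is not None:
            raise RuntimeError("only one character of putback is supported")
        self._putback = c


def scan_relop(stream):
    """Return "LE" for <= and "LT" for <, reading one character ahead."""
    c = stream.next_char()
    if c != "<":
        raise SyntaxError("expected '<'")
    c = stream.next_char()
    if c == "=":
        return "LE"
    stream.put_back(c)  # not part of this token: give it back
    return "LT"
```

For the input <=<a, two calls to scan_relop yield LE and then LT, and the character a is still available to the next recognizer.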

Recognizing identifiers
Identifiers: names of variables, returned as <ID,val>
May recognize keywords as identifiers, then use a hashtable to find the token type of keywords
  c ← NextChar();
  if (c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c == '_') {
    val = STR(c);
    c ← NextChar()
    while (c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9' || c == '_') {
      val = AppendString(val,c);
      c ← NextChar()
    }
    return <ID,val>
  }
  else ...
DFA: s0 --a..z, A..Z, _--> s2; s2 --a..z, A..Z, _, 0..9--> s2
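A rough Python sketch of the "scan as identifier, then look up keywords" approach follows. The KEYWORDS table and the function scan_word are illustrative assumptions; a Python dict plays the role of the hashtable.

```python
# Hypothetical keyword table: lexeme -> token type.
KEYWORDS = {"class": "CLASS", "int": "INT", "char": "CHAR", "bool": "BOOL"}


def scan_word(text, pos=0):
    """Scan an identifier or keyword at text[pos]; return (token, next_pos)."""

    def is_start(c):
        return c.isascii() and (c.isalpha() or c == "_")

    def is_part(c):
        return c.isascii() and (c.isalnum() or c == "_")

    if pos >= len(text) or not is_start(text[pos]):
        raise SyntaxError(f"no identifier at position {pos}")
    end = pos + 1
    while end < len(text) and is_part(text[end]):
        end += 1
    word = text[pos:end]
    # Keywords are matched by the same automaton, then reclassified.
    if word in KEYWORDS:
        return (KEYWORDS[word], None), end
    return ("ID", word), end
```

Because the whole word is scanned before the lookup, classy comes out as an identifier rather than the keyword class followed by y.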

Describing token types
Each token class includes a set of strings
  CLASS = {"class"}; LE = {"<="}; ADD = {"+"};
  ID = {strings that start with a letter}
  INTNUM = {strings composed of only digits}
  FLOAT = { ... }
Use formal language theory to describe sets of strings
  An alphabet Σ is a finite set of all characters/symbols
    e.g. {a, b, ..., z, 0, 1, ..., 9}, {+, -, *, /, <, >, (, )}
  A string over Σ is a sequence of characters drawn from Σ
    e.g. "abc", "begin", "end", "class", "if a then b"
    Empty string: ε
  A formal language is a set of strings over Σ
    {"class"}, {"<="}, {abc, def, ...}, {..., -3, -2, -1, 0, 1, ...}
    The C programming language
    English

Operations on strings and languages
Operations on strings
  Concatenation: "abc" + "def" = "abcdef"; can also be written as s1s2 or s1 · s2
  Exponentiation: s^i = ss...s (i copies of s)
Operations on languages
  Union: L1 ∪ L2 = { x | x ∈ L1 or x ∈ L2 }
  Concatenation: L1L2 = { xy | x ∈ L1 and y ∈ L2 }
  Exponentiation: L^i = { x1x2...xi | each xj ∈ L }, with L^0 = {ε}
  Kleene closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ ... (the union of L^i for all i >= 0)
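For finite languages these operations can be tried out directly with Python sets of strings. The sketch below is only an illustration: the function names are made up, and since the Kleene closure is infinite in general, kleene_up_to computes just the part built from at most n concatenations.

```python
def concat(l1, l2):
    """L1L2: every string of L1 followed by every string of L2."""
    return {x + y for x in l1 for y in l2}


def power(lang, i):
    """L^i: L concatenated with itself i times; L^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, lang)
    return result


def kleene_up_to(lang, n):
    """Finite approximation of L*: the union of L^0 through L^n."""
    result = set()
    for i in range(n + 1):
        result |= power(lang, i)
    return result
```

Union needs no helper because it is the built-in set operator, e.g. {"a"} | {"b"}.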

Regular expression
Compact description of a subset of formal languages
  L(α): the formal language described by α
Regular expressions over Σ
  ε, the empty string, is a r.e., L(ε) = {ε}
  for each s ∈ Σ, s is a r.e., L(s) = {s}
  if α and β are regular expressions then
    (α) is a r.e., L((α)) = L(α)
    αβ is a r.e., L(αβ) = L(α)L(β)
    α | β is a r.e., L(α | β) = L(α) ∪ L(β)
    α^i is a r.e., L(α^i) = L(α)^i
    α* is a r.e., L(α*) = L(α)*

Regular expression example
Σ = {a, b}
  a | b          {a, b}
  (a | b)(a | b) {aa, ab, ba, bb}
  a*             {ε, a, aa, aaa, aaaa, ...}
  aa*            {a, aa, aaa, aaaa, ...}
  (a | b)*       all strings over {a, b}
  a(a | b)*      all strings over {a, b} that start with a
  a(a | b)*b     all strings that start with a and end with b

Describing token classes
  letter = A | B | C | ... | Z | a | b | c | ... | z
  digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  ID = letter (letter | digit)*
  NAT = digit digit*
  FLOAT = digit* . NAT | NAT . digit*
  EXP = NAT (e | E) (+ | - | ε) NAT
  INT = NAT | - NAT
What languages can be defined by regular expressions?
  alternatives (|) and loops (*)
  each definition can refer only to previous definitions
  no recursion
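As a quick check, the definitions above can be transcribed into Python's re module. This is a sketch under the assumption that the alternation bars and the optional sign in EXP are as written above; the helper matches is a hypothetical name. Each pattern is built only from earlier ones, mirroring the "no recursion" rule.

```python
import re

LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"
ID = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
NAT = rf"{DIGIT}{DIGIT}*"
FLOAT = rf"(?:{DIGIT}*\.{NAT}|{NAT}\.{DIGIT}*)"
EXP = rf"{NAT}[eE][+-]?{NAT}"
INT = rf"-?{NAT}"


def matches(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None
```

For example, matches(FLOAT, ".5") and matches(FLOAT, "12.") both hold, while a lone "." is rejected because each alternative requires at least one digit on one side of the point.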

Shorthand for regular expressions
Character classes
  [abcd] = a | b | c | d
  [a-z] = a | b | ... | z
  [a-f0-3] = a | b | ... | f | 0 | 1 | 2 | 3
  [^a-f] = Σ - [a-f]
Regular expression operations
  Concatenation: α ∘ β = αβ = α · β
  One or more instances: α+ = αα*
  i instances: α^i = αα...α (i copies)
  Zero or one instance: α? = α | ε
Precedence of operations
  * >> concatenation >> |
  when in doubt, use parentheses


Writing regular expressions
Given an alphabet Σ = {0, 1}, describe
  the set of all strings of alternating pairs of 0s and pairs of 1s
  the set of all strings that contain an even number of 0s or an even number of 1s
Write a regular expression to describe
  any sequence of tabs and blanks (white space)
  comments in the C programming language

Recognizing token classes from regular expressions
Describe each token class in regular expressions
For each token class (regular expression), build a recognizer
  Alternative operator (|) → conditionals
  Closure operator (*) → loops
To get the next token, try each token recognizer in turn, until a match is found
  if (IFmatch()) return IF;
  else if (THENmatch()) return THEN;
  else if (IDmatch()) return ID;
  ...

Building lexical analyzers
Manual approach
  Write it yourself; control your own file IO and input buffering
  Recognize different types of tokens; group characters into identifiers, keywords, integers, floating points, etc.
Automatic approach
  Use a tool to build a state-driven LA (lexical analyzer)
  Must manually define different token classes
What is the tradeoff?
  Manually written code could run faster
  Automatic code is easier to build and modify

Finite automata (finite state machines)
Deterministic finite automata (DFA)
  A set of states S
  A start (initial) state s0
  A set F of final (accepting) states
  Alphabet Σ: a set of input symbols
  Transition function δ: S × Σ → S
  Example: δ(1, a) = 2
Non-deterministic finite automata (NFA)
  Transition function δ: S × (Σ ∪ {ε}) → 2^S, where ε represents the empty string
  Example: δ(1, a) = {2, 3}, δ(2, ε) = {4}
Language accepted by an FA
  All strings that correspond to a path from the start state s0 to a final state f ∈ F

Implementing DFA
  char ← NextChar()
  state ← s0
  while (char ≠ eof and state ≠ ERROR)
    state ← δ(state, char)
    char ← NextChar()
  if (state ∈ F) then report acceptance
  else report failure
Example DFA
  S = {s0, s1, s2}
  Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
  δ(s0, 0) = s1
  δ(s0, 1-9) = s2
  δ(s2, 0-9) = s2
  F = {s1, s2}
Diagram: s0 --0--> s1; s0 --1..9--> s2; s2 --0..9--> s2
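The interpreter loop and the example transition function translate almost line for line into Python. In this sketch the names DELTA, ACCEPTING, and dfa_accepts are assumptions, and a missing table entry stands in for the ERROR state.

```python
# Transition table for the integer DFA: (state, symbol) -> next state.
DELTA = {("s0", "0"): "s1"}
DELTA.update({("s0", d): "s2" for d in "123456789"})
DELTA.update({("s2", d): "s2" for d in "0123456789"})
ACCEPTING = {"s1", "s2"}


def dfa_accepts(text):
    """Run the DFA over the whole input and report acceptance."""
    state = "s0"
    for ch in text:
        state = DELTA.get((state, ch))
        if state is None:
            # No transition defined: this plays the role of ERROR.
            return False
    return state in ACCEPTING
```

Changing the language only requires changing the table, which is the property that scanner generators rely on.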

DFA examples
  [Figure: DFA with states 0, 1, 2, 3 over {a, b}] Accepted language: (a | b)*abb
  [Figure: DFA with start state 0 over {a, b}] Accepted language: a+ | b+

NFA examples
  [Figure: NFA with states 0, 1, 2, 3 over {a, b}] Accepted language: (a | b)*abb
  [Figure: NFA with states 0, 1, 2, 3, 4 over {a, b}] Accepted language: a+ | b+

Automatically building scanners
  Regular expressions/lexical patterns → NFA
  NFA → DFA
  DFA → lexical analyzer
DFA interpreter:
  char ← NextChar()
  state ← s0
  while (char ≠ eof and state ≠ ERROR)
    state ← δ(state, char)
    char ← NextChar()
  if (state ∈ F) then report acceptance
  else report failure
Structure: lexical patterns → scanner generator → DFA transition table; the scanner is the DFA interpreter reading from the input buffer and driven by the DFA transition table

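Converting RE to NFA
Thompson's construction
  Takes a regexp r and returns an NFA N(r) that accepts L(r)
Recursive rules
  For each symbol c ∈ Σ ∪ {ε}, define NFA N(c) as a start state with a single transition on c to a final state
  Alternation: if (r = r1 | r2), build N(r) from N(r1) and N(r2) side by side, with ε-transitions from a new start state into both and from both into a new final state
  Concatenation: if (r = r1r2), build N(r) as N(r1) followed by N(r2)
  Repetition: if (r = r1*), build N(r) from N(r1) with ε-transitions that allow skipping N(r1) and looping back through it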
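The recursive rules can be sketched in Python as below. This is a simplified variant, not the exact textbook diagrams: regular expressions are given as nested tuples (a made-up mini syntax, so no parser is needed), and concatenation links the two sub-automata with ε-transitions instead of merging states. The names Builder and accepts are assumptions for the example.

```python
import itertools

EPS = None  # label used for ε-transitions


class Builder:
    """Builds an NFA from a regex tree such as ("cat", ("sym", "a"), ("sym", "b"))."""

    def __init__(self):
        self.trans = {}  # (state, symbol) -> set of next states
        self._ids = itertools.count()

    def add(self, src, sym, dst):
        self.trans.setdefault((src, sym), set()).add(dst)

    def build(self, r):
        """Return (start, final) of an NFA fragment for r."""
        start, final = next(self._ids), next(self._ids)
        if r[0] == "sym":
            self.add(start, r[1], final)
        elif r[0] == "alt":
            for sub in r[1:]:
                s, f = self.build(sub)
                self.add(start, EPS, s)
                self.add(f, EPS, final)
        elif r[0] == "cat":
            s1, f1 = self.build(r[1])
            s2, f2 = self.build(r[2])
            self.add(start, EPS, s1)
            self.add(f1, EPS, s2)
            self.add(f2, EPS, final)
        elif r[0] == "star":
            s, f = self.build(r[1])
            self.add(start, EPS, s)      # enter the loop body
            self.add(start, EPS, final)  # or skip it entirely
            self.add(f, EPS, s)          # repeat the body
            self.add(f, EPS, final)      # or leave
        else:
            raise ValueError(f"unknown node {r[0]!r}")
        return start, final


def accepts(r, text):
    """Build N(r) and simulate it on text."""
    b = Builder()
    start, final = b.build(r)

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for t in b.trans.get((stack.pop(), EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for ch in text:
        moved = set()
        for s in current:
            moved |= b.trans.get((s, ch), set())
        current = closure(moved)
    return final in current
```

Each fragment has exactly one start and one final state, which is what lets the rules compose recursively.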

RE to NFA examples
  [Figure: NFA built by Thompson's construction for a*b*, states 0 to 9]
  [Figure: NFA built by Thompson's construction for (a | b)*, states 0 to 7]

Automatically building lexical analyzer
  Token → pattern
  Pattern → regular expression
  Regular expression → NFA or DFA
  NFA/DFA → lexical analyzer
  [Figure: an NFA and the equivalent DFA, each with states 0, 1, 2, 3 over {a, b}]

Lexical analysis generators
Generation: lexical analysis specification → Lex compiler → transition table
The generated lexical analyzer consists of an input buffer, a finite automata simulator (NFA or DFA), and the transition table
Layout of the specification:
  declarations (token classes, help functions)
    N1 RE1
    ...
    Nm REm
    %{ typedef enum { ... } Tokens; %}
  %%
    P1 {action_1}
    P2 {action_2}
    ...
    Pn {action_n}
  %%
    int main() { ... }

Using Lex to build scanners
Build steps
  Lex program Lex.l → Lex compiler → lex.yy.c
  lex.yy.c → C compiler → a.out
  Input stream → a.out → tokens
Lex program:
  cconst '([^\']+|\\\')'
  sconst \"[^\"]*\"
  %pointer
  %{
  /* put C declarations here */
  %}
  %%
  foo { return FOO; }
  bar { return BAR; }
  {cconst} { yylval=*yytext; return CCONST; }
  {sconst} { yylval=mk_string(yytext,yyleng); return SCONST; }
  [ \t\n\r]+ {}
  . { return ERROR; }

NFA-based lexical analysis
Specifications
  P1 {action_1}
  P2 {action_2}
  ...
  Pn {action_n}
Steps
  (1) Create an NFA N(pi) for each pattern
  (2) Combine all NFAs into a single composite NFA: a new start state s0 with ε-transitions to N(p1), N(p2), ..., N(pn)
  (3) Simulate the composite NFA: must find the longest string matched by a pattern, so continue making transitions until reaching termination
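The "longest match, then highest priority" rule can be illustrated without building the composite NFA. The sketch below swaps in Python's re module as the per-pattern matcher, which is a substitution made for brevity and not the automaton simulation described above; the rule list and function names are hypothetical. Python's matching is greedy rather than guaranteed longest, which happens to coincide for these simple patterns.

```python
import re

# Rules in priority order: earlier rules win ties.
RULES = [
    ("IF", r"if"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("NAT", r"[0-9]+"),
    ("WS", r"[ \t\n\r]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def next_token(text, pos):
    """Return (name, lexeme, next_pos) for the longest match at pos."""
    best_name, best_end = None, pos
    for name, regex in COMPILED:
        m = regex.match(text, pos)
        # Strictly longer matches replace the current best, so on a tie
        # the rule listed first is kept.
        if m and m.end() > best_end:
            best_name, best_end = name, m.end()
    if best_name is None:
        raise SyntaxError(f"no token at position {pos}")
    return best_name, text[pos:best_end], best_end


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        name, lexeme, pos = next_token(text, pos)
        if name != "WS":
            tokens.append((name, lexeme))
    return tokens
```

On the input "if ifx 42" the word if matches both IF and ID with the same length and is reported as IF, while ifx is longer under ID and is reported as an identifier.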

Simulate NFA
Movement through the NFA on each input character
  Similar to DFA simulation, but must deal with multiple transitions from a set of states
  Idea: each DFA state corresponds to a set of NFA states
  t is a single state: ε-closure(t) = { s | s is reachable from t through ε-transitions }
  T is a set of states: ε-closure(T) = { s | ∃ t ∈ T s.t. s ∈ ε-closure(t) }
Simulation:
  S = ε-closure(s0);
  a = nextchar();
  while (a != eof) {
    S = ε-closure(move(S, a));
    a = nextchar();
  }
  if (S ∩ F != ∅) return "yes";
  else return "no"
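A small Python sketch of this simulation follows. The NFA used here is a hypothetical one for a+ | b+ whose state numbering is assumed for the example; ε is represented by the empty string, and the function names are illustrative.

```python
EPS = ""  # label for ε-transitions

# Hypothetical NFA for a+ | b+ : (state, symbol) -> set of next states.
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, "a"): {2},
    (3, "b"): {4},
    (4, "b"): {4},
}
START, FINAL = 0, {2, 4}


def eps_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(states, ch):
    """States reachable from `states` by one transition on ch."""
    result = set()
    for s in states:
        result |= NFA.get((s, ch), set())
    return result


def nfa_accepts(text):
    current = eps_closure({START})
    for ch in text:
        current = eps_closure(move(current, ch))
    return bool(current & FINAL)
```

The variable current is the set S of the pseudocode: it tracks every NFA state the automaton could be in after the input read so far.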

DFA-based lexical analyzers
Convert the composite NFA to a DFA before simulation
  Match the longest string before termination
  Match the pattern specification with the highest priority
Subset construction:
  add ε-closure(s0) to Dstates unmarked
  while there is an unmarked T in Dstates do
    mark T;
    for each symbol c in Σ do begin
      U := ε-closure(move(T, c));
      Dtrans[T, c] := U;
      if U is not in Dstates then add U to Dstates unmarked
    end

Convert NFA to DFA example
NFA: s0 --a,b--> s0; s0 --a--> s1; s1 --b--> s2; s2 --b--> s3
  Dstates = {ε-closure(s0)} = { {s0} };
  Dtrans[{s0},a] = ε-closure(move({s0}, a)) = {s0,s1};
  Dtrans[{s0},b] = ε-closure(move({s0}, b)) = {s0};
  Dstates = { {s0}, {s0,s1} };
  Dtrans[{s0,s1},a] = ε-closure(move({s0,s1}, a)) = {s0,s1};
  Dtrans[{s0,s1},b] = ε-closure(move({s0,s1}, b)) = {s0,s2};
  Dstates = { {s0}, {s0,s1}, {s0,s2} };
  Dtrans[{s0,s2},a] = ε-closure(move({s0,s2}, a)) = {s0,s1};
  Dtrans[{s0,s2},b] = ε-closure(move({s0,s2}, b)) = {s0,s3};
  Dstates = { {s0}, {s0,s1}, {s0,s2}, {s0,s3} };
  Dtrans[{s0,s3},a] = ε-closure(move({s0,s3}, a)) = {s0,s1};
  Dtrans[{s0,s3},b] = ε-closure(move({s0,s3}, b)) = {s0};

Convert NFA to DFA example
DFA (states named by their NFA state sets: 0; 0,1; 0,2; 0,3)
  Dstates = { {s0}, {s0,s1}, {s0,s2}, {s0,s3} };
  Dtrans[{s0},a] = {s0,s1};     Dtrans[{s0},b] = {s0};
  Dtrans[{s0,s1},a] = {s0,s1};  Dtrans[{s0,s1},b] = {s0,s2};
  Dtrans[{s0,s2},a] = {s0,s1};  Dtrans[{s0,s2},b] = {s0,s3};
  Dtrans[{s0,s3},a] = {s0,s1};  Dtrans[{s0,s3},b] = {s0};
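The worked example can be reproduced with a short Python sketch of the subset construction. The NFA table encodes the (a | b)*abb automaton from the example; since it has no ε-transitions, ε-closure is the identity and is left out, which is a simplification specific to this sketch. Function and variable names are illustrative.

```python
# NFA for (a | b)*abb: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
ALPHABET = "ab"


def move(states, ch):
    result = set()
    for s in states:
        result |= NFA.get((s, ch), set())
    return result


def subset_construction(start):
    """Return (dstates, dtrans) where DFA states are frozensets of NFA states."""
    start_set = frozenset({start})  # would be ε-closure(start) in general
    dstates = [start_set]
    dtrans = {}
    unmarked = [start_set]
    while unmarked:
        T = unmarked.pop()  # popping a state marks it
        for c in ALPHABET:
            U = frozenset(move(T, c))
            dtrans[(T, c)] = U
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
    return dstates, dtrans
```

Calling subset_construction(0) yields the four DFA states {0}, {0,1}, {0,2}, {0,3} and the same Dtrans entries as the table above; the accepting DFA states are those containing NFA state 3.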