Recall Compiler Structure

Scanning: Outline
2. Scanning
- The basics
- Ad-hoc scanning
- FSM based techniques
- A Lexical Analysis tool - Lex (a scanner generator)

Recall: Compiler Structure
[Figure: compiler pipeline. Source Code -> Lexical Analysis/Scanning -> Syntax Analysis/Parsing -> Semantic Analysis -> Machine-Independent Optimization -> Code Generation -> Machine-Dependent Optimization -> Machine Code, with the phases grouped into a Front End and a Back End.]

Compiler Structure: Another View
[Figure: Input Language -> Lexical Analyzer -> Lexemes or Tokens -> Syntax Analyzer -> Phrase Structure -> Code Generator -> Output Language.]

Lexical Analysis: What is it?
- The input to a compiler/interpreter is a source program, which arrives as a sequence (stream) of characters and is otherwise unstructured
- Processing individual characters is pretty tedious and highly inefficient
- As such, the first thing we have to do is add some basic structure to the source code

Lexical Analysis: What is it?
- A Lexical Analyzer (a.k.a. scanner) converts a stream of characters into a stream of tokens, i.e. it tokenizes the input
- This is a many:1 transformation, and thus later phases of compilation only need to deal with comparatively few tokens
- A token (a.k.a. lexeme or syntactic unit) is a fundamental component of a program

Lexical Analysis: Tokens
- Tokens are typically the bottom level entities in syntax diagrams
- Typical tokens include:
  - identifiers (e.g. variable names, etc.)
  - keywords
  - operators
  - literals (i.e. constant values)
  - punctuation
- Consider a simple program and its tokens:

Lexical Analysis: Tokens
Source Code:
  PROGRAM test crlf
  VAR x INTEGER crlf
  BEGIN crlf
  x x crlf
  END { test }
Tokens:
  keyword PROGRAM, ident test
  keyword VAR, ident x, punct, keyword INTEGER, punct
  keyword BEGIN
  ident x, operator, ident x, operator, literal, punct
  keyword END, punct

Other Scanner Functions
- A scanner also removes white space from a program
  - white space consists of spaces, tabs, carriage returns, comments, and the like
  - stuff put into the source code solely for readability, which does not affect the functional specification provided by the program
- Some scanners also enter symbols in the symbol table (more later)

Ad-hoc Scanning
- There are many applications outside of compiler construction that require simple scanning functions
  - e.g. recognizing numeric values in financial and other applications
- These applications either implement their own recognition functions or rely on library routines or language based pattern matching to provide the needed functionality

Ad-hoc Scanning
- Manual recognition of tokens involves a multitude of IF, WHILE, and SWITCH statements
- This approach is ugly, extremely tedious, highly error prone, and difficult to understand, maintain, and extend
- Using existing routines for doing pattern matching is a significant improvement
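To make the point about hand-written recognition concrete, here is a minimal sketch in C of an ad-hoc recognizer for a numeric value of the kind a financial application might accept. The function name is_amount and the accepted format (optional sign, digits, optional fractional part) are assumptions made for illustration; they do not come from the slides.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ad-hoc check: optional sign, one or more digits, then
 * optionally '.' followed by one or more digits. Nothing else may follow. */
bool is_amount(const char *s)
{
    assert(s != NULL);
    size_t i = 0;

    if (s[i] == '+' || s[i] == '-')          /* optional sign */
        i++;

    if (!isdigit((unsigned char)s[i]))       /* at least one digit */
        return false;
    while (isdigit((unsigned char)s[i]))
        i++;

    if (s[i] == '.') {                       /* optional fractional part */
        i++;
        if (!isdigit((unsigned char)s[i]))
            return false;
        while (isdigit((unsigned char)s[i]))
            i++;
    }
    return s[i] == '\0';                     /* reject trailing characters */
}
```

Even for this one small format the IF and WHILE statements pile up quickly, and every change to the accepted format means editing the control flow by hand.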

Ad-hoc Scanning
- In many cases (e.g. C language) these facilities are provided by library routines
  - #include <string.h>: index, strlen, strcat, etc.
- In other cases (e.g. some variants of Pascal) they are incorporated into the language
  - substring functions, sets, etc.
  - or consider the language Perl!!!
- Both of these reflect the prevalence and importance of such functionality

Ad-hoc Scanning
- Anyone who has had to do a significant amount of such scanning/pattern matching knows how awkward it is
  - e.g. consider data verification or the processing of command line arguments as other examples
- Scanning in a compiler/interpreter is typically far worse
- Even simple languages have complex lexemes

Grammars
- A generative grammar is a set of rules to generate valid phrases in a particular language
- Grammar G = {V, T, P, S}
  - V: finite set of nonterminals or variables
  - T: finite set of terminals or tokens
  - P: finite set of productions
  - S: a nonterminal called the start symbol
- Noam Chomsky defined classes of complexity of generative grammars
- The hierarchy of four classes, each of which properly contains the next, is called the Chomsky hierarchy

Grammars: Chomsky hierarchy
- Type 0: Unrestricted Grammars
- Type 1: Context-Sensitive Grammars (CSGs)
- Type 2: Context-Free Grammars (CFGs)
- Type 3: Regular Grammars (RGs)

Unrestricted Grammars
- This type of grammar is too complex for programming languages: efficient parsers cannot be constructed for this type of grammar
- This grammar consists of productions of the form α → β

Context-Sensitive Grammars
- Most computer languages fall into this class of grammars
- The productions in this class are of the form α₁Aα₂ → α₁βα₂
  - A becomes β in the context of α₁ and α₂
  - in general these grammars are still too complex for efficient computer analysis
- The context sensitivity of programming languages is handled by other means so that context free grammars can be used for programming languages

Context Free Grammars
- A production of a context free grammar (CFG) is of the form A → α, where A is a variable and α is a string of symbols
- In CFGs, the derivations on variables are independent of what surrounds them
- To generate phrases in the language, strings of terminals are derived by repeated expansion of non-terminals
- CFGs permit the construction of efficient syntax analyzers

Context Free Grammars
- Example (productions of the grammar):
    <S> → a <A> b
    <A> → <B> c
    <B> → d
- The language generated by the above grammar is the single string adcb: <S> ⇒ a <A> b ⇒ a <B> c b ⇒ a d c b

Regular Grammars
- A CFG is a regular grammar (RG) if all of its productions follow one of these two forms, with the same form used throughout the grammar:
  - A → ωB or A → ω, where A, B are non-terminals and ω is a string of terminals (possibly empty)
  - A → Bω or A → ω, where A, B are non-terminals and ω is a string of terminals (possibly empty)
- The first form is called right linear and the second form is called left linear
- RGs are too restrictive for most purposes
- Very efficient parsers can be built

Regular Grammars
- The reason for the efficiency is that language generation from an RG can be performed without remembering our current position in the production that is currently being expanded
- Lack of memory makes RGs incapable of generating languages with arbitrarily nested structures
- In compilers, RGs will be used to describe words and CFGs will be used to describe phrases constructed from these words

Regular Expressions (REs)
- Regular expressions are a simplified form of grammar used to represent RGs
- ε (epsilon, the empty string) is a regular expression that matches the empty string, i.e. a sequence of no characters
- a symbol (terminal) s in the language is a RE that matches s
- if R is a RE, (R)* matches zero or more occurrences of the pattern R, known as the closure of R
- if R is a RE, (R)+ matches one or more occurrences of the pattern R

Regular Expressions (REs)
- If R and S are REs, (R) | (S) matches either the pattern R or the pattern S (alternation)
- If R and S are REs, (R)(S) matches the catenation of pattern R followed by pattern S
- Example:
    <int> ::= (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)+
    <int_no_leading_zero> ::= (1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*
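The two example definitions can be tried out directly with a regular expression library. The sketch below uses the POSIX regcomp/regexec interface from <regex.h> with extended syntax; the helper name matches and the two pattern constants are assumptions made for illustration, and the ^ and $ anchors are added so that the whole string has to match.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* <int> and <int_no_leading_zero> written as POSIX extended REs. */
static const char INT_RE[] = "^(0|1|2|3|4|5|6|7|8|9)+$";
static const char INT_NO_LEADING_ZERO_RE[] =
    "^(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*$";

/* Hypothetical helper: true if text matches pattern in its entirety
 * (the patterns above carry their own anchors). */
bool matches(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                       /* pattern failed to compile */
    bool ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With these definitions, matches(INT_RE, "007") holds while matches(INT_NO_LEADING_ZERO_RE, "007") does not, which is exactly the difference between the two productions.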

Better Scanning Techniques
- The awkwardness of ad-hoc scanning has motivated the development of both techniques and tools for doing scanning
- The most common of these are based on what are known as finite state machines (FSMs), which recognize regular languages
- The key to being able to do this is the existence of certain restrictions placed on the format of programming languages
  - E.g. tokens are usually separated by delimiters

FSM-based Scanning
- The most common techniques used for building scanners are based on finite state machines (or FSMs)
- FSMs can be easily used to recognize language constructs (tokens) which are described by regular languages

Regular Languages Revisited
- A regular language is one which can be described by a regular expression
- A regular expression consists of simple, atomic elements combined using only three operations: catenation, alternation, and repetition

Regular Languages Revisited
- Catenation (a.k.a. concatenation or sequencing) is represented by physical adjacency
  - e.g. the regular expression <letter> <digit> simply represents (depending on the definition of letter and digit) a sequence composed of a letter followed by a digit
  - we would use the ::= (equivalence) operator to associate a definition with <letter> or <digit>

Regular Languages Revisited
- Alternation allows selection from a number of choices and is commonly represented by the | operator
  - E.g. <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- Certain shorthand forms are also commonly used with alternation (especially ellipses)
  - E.g. <alpha> ::= a | b | ... | z | A | B | ... | Z

Regular Languages Revisited
- Finally, repetition permits the expression of constructs which are to be repeated some number of times
- There are two operators used for this purpose: superscript +, and superscript *
  - E.g. <word> ::= <letter>+
  - this implies 1 or more letters (* would imply 0 or more letters)

Regular Languages Revisited
- Finally, parentheses, ( and ), are used for grouping regular expressions
- Normally, the repetition operators have the highest precedence, followed by catenation and then by alternation
- These 3 simple operations permit us to easily express the tokens that occur in existing programming languages

Regular Languages Revisited
- Consider the following regular expressions for a few common tokens and token types we might encounter (note the use of quotes):
    <assignop> ::= ":="
    <alphanum> ::= <alpha> | <digit>
    <ident> ::= (<alpha> | "_" | "$") <alphanum>*
    <intconst> ::= <digit>+
- Not everything is this simple to specify

Regular Languages Revisited
- For this reason, there are a couple of other short cuts that make life easier
- These are notational conveniences only and can easily be represented using the basic constructs
- Logical Negation ( ^ or ~ ): commonly used with other constructs
    <comment> ::= { (~ })* }

Regular Languages Revisited
- ~a implies anything in U-{a}, where U is the set of all symbols
- Negation can be done simply by enumerating everything in U-{a}
  - e.g. if U = {a, b, c, d, e} then we could write (~a)* or, alternatively, (b | c | d | e)*
- Optional Constructs: sometimes it becomes tedious to list a number of similar options which could be more conveniently expressed by saying some constructs are optional

Regular Languages Revisited
- The most common notation for an optional construct is the use of square brackets
  - E.g. <signedintconst> ::= [+ | -] <intconst>
- The preceding example is equivalent to the following:
    <signedintconst> ::= <intconst> | + <intconst> | - <intconst>
- If we could specify the number of times a repetition could take place we could do it another way too

Regular Languages Revisited
- Consider:
    <signedintconst> ::= (+ | -)^{0..1} <intconst>
- The superscript 0..1 is intended to imply that repetition can take place at most once (0 or 1 times)
- This illustrates yet another possible construct which, like the others, may be expressed using only catenation, alternation, and repetition, albeit more verbosely

Regular Languages Revisited
- Let's try something a bit more challenging: what does a real constant look like?
  - It might have a sign for the mantissa
  - The mantissa consists of some digits followed by a decimal point, possibly followed by some more digits (the fractional part)
  - There might be an exponent as well, which could be signed

Regular Languages Revisited
- Let's do this in pieces...
    <realconst> ::= <mantissa> [ E <exponent> ]
- Consider the exponent first; it's just a signed integer constant:
    <exponent> ::= [+ | -] <intconst>
  where
    <intconst> ::= <digit>+
    <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Regular Languages Revisited
- Now let's try the mantissa
    <mantissa> ::= [+ | -] <intconst> . [ <intconst> ]
- As with programming, divide and conquer works well to handle the complexity of regular expression specification
- Also, the use of the optional constructs greatly simplifies this specification
- As an exercise, try doing the real constant without [ and ]

Regular Languages: FSMs
- A good way to start developing a scanner is to produce regular expressions for the tokens you wish to recognize
- The regular expressions themselves, however, are not the basis of the scanning process
- The scanning process itself requires a Finite State Machine (FSM) specification
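The divide-and-conquer structure of the <realconst> specification carries over to code: each named piece can become a small function. The following is a minimal sketch in C under that assumption; all function names are hypothetical, and the recognizer accepts only an upper-case E because that is what the specification shows.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* [+ | -] : skip an optional sign. */
static const char *opt_sign(const char *p)
{
    return (*p == '+' || *p == '-') ? p + 1 : p;
}

/* <intconst> ::= <digit>+ ; returns NULL when no digit is present. */
static const char *intconst(const char *p)
{
    if (!isdigit((unsigned char)*p))
        return NULL;
    while (isdigit((unsigned char)*p))
        p++;
    return p;
}

/* <mantissa> ::= [+ | -] <intconst> . [ <intconst> ] */
static const char *mantissa(const char *p)
{
    p = intconst(opt_sign(p));
    if (p == NULL || *p != '.')
        return NULL;
    p++;
    const char *q = intconst(p);      /* optional fractional part */
    return q != NULL ? q : p;
}

/* <exponent> ::= [+ | -] <intconst> */
static const char *exponent(const char *p)
{
    return intconst(opt_sign(p));
}

/* <realconst> ::= <mantissa> [ E <exponent> ] ; the whole string must match. */
bool is_realconst(const char *s)
{
    assert(s != NULL);
    const char *p = mantissa(s);
    if (p == NULL)
        return false;
    if (*p == 'E') {
        p = exponent(p + 1);
        if (p == NULL)
            return false;
    }
    return *p == '\0';
}
```

Each helper returns a pointer just past what it recognized, so the optional constructs reduce to "try it, and fall back to the old position if it is absent".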

Finite State Machines
- Fortunately, there is a direct 1:1 mapping between regular expressions and the FSMs that implement them
- An FSM is an abstract machine which can be in one of a finite number of states, which makes state transitions based on inputs, and which performs specific actions in specific states or on transitions between states
  - cf. Moore and Mealy machines from digital logic

Finite State Machines
- FSMs are commonly represented graphically
  - Nodes in the graph represent individual states and are assigned meaningful names
  - Edges represent transitions between the states and are labeled with the input values which cause the state transitions
- An FSM-based scanner takes its input from the source code character stream

Finite State Machines
- The FSM-based scanner performs certain actions, which include recognizing specific characters, accumulating the characters in a particular token, and returning completed tokens to form the output token stream
- We'll begin by just recognizing some simple tokens and worry about actually building the tokens later

Finite State Machines
    <digit> ::= 0 | 1 | ... | 9
    <intconst> ::= <digit>+
[Figure: FSM for <intconst>. Edge labels: 0..9, 0..9, other; state label: intconst.]
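One way to read the <intconst> machine is as an explicit state variable updated once per input character. The sketch below is written under that reading; the state names START, DONE and ERROR and the function scan_intconst are assumptions, since the figure itself only names the intconst state and the edge labels 0..9 and other.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum state { START, INTCONST, DONE, ERROR };

/* Runs the FSM over s and returns the length of the <intconst> found at
 * the start of s, or 0 if s does not begin with a digit. */
size_t scan_intconst(const char *s)
{
    assert(s != NULL);
    enum state st = START;
    size_t n = 0;                        /* characters accepted so far */

    while (st != DONE && st != ERROR) {
        int is_digit = isdigit((unsigned char)s[n]);
        switch (st) {
        case START:
            if (is_digit) { st = INTCONST; n++; }   /* edge 0..9 */
            else          { st = ERROR; }
            break;
        case INTCONST:
            if (is_digit) { n++; }                  /* loop on 0..9 */
            else          { st = DONE; }            /* "other": not consumed */
            break;
        default:
            break;
        }
    }
    return st == DONE ? n : 0;
}
```

For the input 123+4 the function returns 3: the + is the "other" character that ends the token and stays available for whatever is scanned next.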

Finite State Machines
    <ident> ::= (<alpha> | _ | $) <alphanum>*
[Figure: FSM for <ident>. Edge labels: alpha,_,$; alphanum; other; state label: ident.]

Equivalence of REs and FSMs
- For each regular expression (RE), there is an FSM that recognizes strings conforming to the regular expression
- Consider the three basic RE operations
- Catenation: a b
  [Figure: states start and done; edge labels: a, b.]

Equivalence of REs and FSMs
- Alternation: a | b | c
  [Figure: states start and done; edge labels: a, b, c.]
- Repetition: a*
  [Figure: states start and done; edge labels: a, a, U-{a}, ε.]

A Sample Regular Language
    <comment> ::= { (~ })* }
    <letter> ::= a | ... | z | A | ... | Z
    <digit> ::= 0 | ... | 9
    <ident> ::= <letter> (<letter> | <digit>)*
    <numconst> ::= <digit>+ [ . <digit>+ ]
    <strconst> ::= (~ )*
    <assignop> ::= := | :+= | :-= | :*= | :/=
    <negop> ::= ~ | ~< | ~> | ~=

A Sample FSM for the language
[Figure: FSM with states Leading, Comment, Ident, Lit 1, Lit 2, Lit 3, Lit 4, Assign?, Assign!, Neg, and Finish. Edges are labeled with input characters: { and } for comments, letter and digit, the decimal point, :, ~, =, +-*/, ><=, the punctuation ; , . [ ], and other.]

Building a Scanner
- How does a scanner interact with the parser? Consider the following:
[Figure: Source Program -> Lexical Analyzer -> token -> Syntax Analyzer -> Parse Tree; the Syntax Analyzer asks the Lexical Analyzer for each token via get next token().]

Scanner Actions
- As the scanner changes from state to state, it must do something with the characters it scans in order to build the tokens to be returned to the parser calling it
- In some cases, it must append the character seen onto a developing token and consume it so the next input character is visible
  - E.g. when scanning characters in an identifier

Scanner Actions
- In other cases it must preserve the character and return a completed token
  - E.g. MaxVal := -999;
  - After scanning the : we know that we have found the end of the identifier MaxVal, so we want to return that to the parser, but we do not want to lose the : so we must preserve it
- Another possible action is to simply consume a character
  - E.g. characters in comments

Implementing the FSM
- A finite state machine may be easily implemented using a table driven technique
- Table driven techniques are highly methodical
  - Comparatively easy to handle changes and/or extensions to the grammar
  - Straightforward code that is not error-prone
  - Easy to maintain the code

Implementing the FSM
- Regard the scanner as a device which takes a character stream as input and produces a token stream as output
- At any given point in time...
  - The device is in a specific state
  - Based on the current state and the next input character, it will perform a specific action, and move into a new (possibly different) state

Scanner Actions (detail)
- Typical actions include:
  - C: Consume
  - AC: Append and Consume
  - PI: Preserve and build ID token
  - PL: Preserve and build Literal token
  - PK: Preserve and build Keyword token
  - PP: Preserve and build Punctuation token
  - CO: Consume and build Operator token
  - CL: Consume and build Literal token
- What actions you need depends on the

Sample FSM with actions
[Figure: the sample FSM (states Leading, Comment, Ident, Lit 1, Lit 2, Lit 3, Lit 4, Assign?, Assign!, Neg, Finish) with every edge annotated by an action code. Codes appearing in the figure: C, AC, PI, PL, PP, PO, CO, CL, CP.]

A scanner mainline
    STATIC GLOBAL ipchar;
    GLOBAL str, token, preserve

    str = ""
    state = Leading
    WHILE (state <> Finish) DO
        preserve = NO
        CALL action[state,ipchar]
        state = nextstate[state,ipchar]
        IF NOT preserve THEN
            ipchar = getchar()
    RETURN(token)

Action table
    Current State    Input Character: <alpha> <digit> . + : = { etc.
    1. Leading       AC AC CP AC CO AC CO C
    2. Comment       C C C
    3. Ident         AC AC PI PI PI PI PI PI
    4. Lit 1         PL AC
    5. Lit 2
    6. Lit 3
    7. Lit 4         etc.
    8. Assign?
    9. Assign!
    10. Neg
    11. Finish
    etc.

Next State table
    Current State    Input Character: <alpha> <digit> . + : = { etc.
    1. Leading       3 4 11 7 11 8 11 2
    2. Comment
    3. Ident         3 3 11 11 11 11 11 11
    4. Lit 1         11 3 5 11 11 11 11 11
    5. Lit 2
    6. Lit 3
    7. Lit 4         etc.
    8. Assign?
    9. Assign!
    10. Neg
    11. Finish
    etc.

Additional code
- All we have to do now is add action routines
  - append adds the current character onto a string representing the token being recognized
  - consume vs. preserve is handled by the preserve flag
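The mainline and the two tables can be put together in a few dozen lines. What follows is a reduced sketch in C, not the scanner from the slides: it keeps only the states Leading, Ident, one literal state and Finish, groups input characters into classes, and adds a hypothetical PE action for end of input. Instead of a preserve flag it simply does not advance the input position when a character is to be preserved. All identifiers are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum state  { LEADING, IDENT, LIT, FINISH, NSTATES };
enum cclass { ALPHA, DIGIT, BLANK, OTHER, END, NCLASSES };
enum action { C, AC, CP, PI, PL, PE };   /* PE: preserve, report end of input */
enum kind   { T_EOF, T_IDENT, T_LITERAL, T_PUNCT };

/* Rows are states (FINISH needs no row), columns are character classes. */
static const enum action action_tab[NSTATES - 1][NCLASSES] = {
    /*               ALPHA DIGIT BLANK OTHER END */
    /* LEADING */ {  AC,   AC,   C,    CP,   PE },
    /* IDENT   */ {  AC,   AC,   PI,   PI,   PI },
    /* LIT     */ {  PL,   AC,   PL,   PL,   PL },
};
static const enum state next_tab[NSTATES - 1][NCLASSES] = {
    /* LEADING */ { IDENT,  LIT,   LEADING, FINISH, FINISH },
    /* IDENT   */ { IDENT,  IDENT, FINISH,  FINISH, FINISH },
    /* LIT     */ { FINISH, LIT,   FINISH,  FINISH, FINISH },
};

struct scanner {
    const char *src;   /* NUL-terminated input */
    size_t pos;        /* index of the next unread character */
    char text[64];     /* spelling of the token being built */
    size_t len;
};

static enum cclass classify(char ch)
{
    unsigned char u = (unsigned char)ch;
    if (ch == '\0') return END;
    if (isalpha(u)) return ALPHA;
    if (isdigit(u)) return DIGIT;
    if (isspace(u)) return BLANK;
    return OTHER;
}

/* Table-driven mainline: returns the kind of the next token and leaves
 * its spelling in sc->text. */
enum kind next_token(struct scanner *sc)
{
    assert(sc != NULL && sc->src != NULL);
    enum state st = LEADING;
    enum kind result = T_EOF;
    sc->len = 0;

    while (st != FINISH) {
        char ch = sc->src[sc->pos];
        enum cclass cc = classify(ch);
        enum action act = action_tab[st][cc];

        if ((act == AC || act == CP) && sc->len + 1 < sizeof sc->text)
            sc->text[sc->len++] = ch;            /* append */
        if (act == C || act == AC || act == CP)
            sc->pos++;                           /* consume */

        if (act == PI)      result = T_IDENT;
        else if (act == PL) result = T_LITERAL;
        else if (act == CP) result = T_PUNCT;

        st = next_tab[st][cc];
    }
    sc->text[sc->len] = '\0';
    return result;
}
```

Calling next_token repeatedly on a scanner initialized with the text max1 42; yields an identifier, a literal, a punctuation token and then T_EOF. Supporting a new token type means adding rows and columns to the two tables rather than rewriting control flow, which is the maintainability argument made for table driven techniques.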

A Lexical Analyzer Generator
- Building a scanner manually (even using the FSM technique) is tedious
- We know that the mapping from regular expressions to FSM is straightforward, so why don't we automate the process?
- Then we just type in regular expressions and get back code to implement a scanner
- That is exactly what lex does

How lex works
[Figure: Lex Source Program (lex.l) -> Lex Compiler -> lex.yy.c; lex.yy.c -> C Compiler -> a.out; input stream -> a.out -> sequence of tokens.]

lex Specifications
- lex programs are divided into three components:
    declarations (variables defined, include files specified, etc.)
    %%
    translation rules: pattern (using REs) followed by action { C/C++ statements }
    %%
    auxiliary procedures (support routines for the C/C++ statements above)

Sample lex program
%{
/*
 * this sample demonstrates (very) simple recognition:
 * a verb/not a verb.
 */
/* includes and defines should go in this section */
%}
%%

[\t ]+    /* ignore white space */ ;

is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
have |
had |
go        { printf("%s: is a verb\n", yytext); }

[a-zA-Z]+ { printf("%s: is not a verb\n", yytext); }

.|\n      { ECHO; /* normal default anyway */ }
%%

int main(void)
{
    yylex();
    return 0;
}