Lexical Analysis and Scanning. Honors Compilers Feb 5 th 2001 Robert Dewar




The Input
- Read string input
  - Might be a sequence of characters (Unix)
  - Might be a sequence of lines (VMS)
- Character set
  - ASCII
  - ISO Latin-1
  - ISO 10646 (16-bit = Unicode)
  - Others (EBCDIC, JIS, etc.)

The Output
- A series of tokens
  - Punctuation: ( ) ; , [ ]
  - Operators: + - ** :=
  - Keywords: begin end if
  - Identifiers: Square_Root
  - String literals: "hello this is a string"
  - Character literals: 'x'
  - Numeric literals: 123 4_5.23e+2 16#ac#

Free Form vs Fixed Form
- Free form languages
  - White space does not matter
  - Tabs, spaces, new lines, carriage returns
  - Only the ordering of tokens is important
- Fixed format languages
  - Layout is critical
  - Fortran: label in cols 1-5
  - COBOL: Area A / Area B
  - Lexical analyzer must worry about layout

Punctuation
- Typically individual special characters
  - Such as ( ) ;
- Lexical analyzer attaches no meaning to them
- Sometimes double characters
  - E.g. (* treated as a kind of bracket
- Returned just as identity of token
  - And perhaps location
  - For error message and debugging purposes

Operators
- Like punctuation
  - No real difference for lexical analyzer
- Typically single or double special chars
  - Operators: + -
  - Operations: :=
- Returned just as identity of token
  - And perhaps location

Keywords
- Reserved identifiers
  - E.g. BEGIN END in Pascal, if in C
- May be distinguished from identifiers
  - E.g. the stropped keyword 'mode' vs the identifier mode in Algol-68
- Returned just as token identity
  - With possible location information
- Unreserved keywords (e.g. PL/1)
  - Handled as identifiers (parser distinguishes)

Identifiers
- Rules differ
  - Length, allowed characters, separators
- Need to build table
  - So that every occurrence of junk1 is recognized as the same identifier
  - Typical structure: hash table
- Lexical analyzer returns token type
  - And key to table entry
  - Table entry includes location information

More on Identifier Tables
- Most common structure is hash table
  - With fixed number of headers
  - Chain according to hash code
  - Serial search on one chain
- Hash code computed from characters
  - No hash code is perfect!
- Avoid any arbitrary limits

String Literals
- Text must be stored
  - Actual characters are important
  - Not like identifiers
  - Character set issues
- Table needed
  - Lexical analyzer returns key to table
  - May or may not be worth hashing

Character Literals
- Similar issues to string literals
- Lexical analyzer returns
  - Token type
  - Identity of character
- Note: cannot assume character set of host machine, may be different

Numeric Literals
- Also need a table
- Typically record value
  - E.g. 123 = 0123 = 01_23 (Ada)
  - But cannot use int for values
  - Because the target machine may have different characteristics
- Float stuff much more complex
  - Denormals, correct rounding
  - Very delicate stuff

Handling Comments
- Comments have no effect on program
  - Can therefore be eliminated by scanner
  - But may need to be retrieved by tools
- Error detection issues
  - E.g. unclosed comments
- Scanner does not return comments

Case Equivalence
- Some languages have case equivalence
  - Pascal, Ada
- Some do not
  - C, Java
- Lexical analyzer ignores case if needed
  - This_Routine = THIS_RouTine
- Error analysis may need exact casing

Issues to Address
- Speed
  - Lexical analysis can take a lot of time
  - Minimize processing per character
  - I/O is also an issue (read large blocks)
- We compile frequently
  - Compilation time is important
  - Especially during development

General Approach
- Define set of token codes
  - An enumeration type
  - A series of integer definitions
  - These are just codes (no semantics)
- Some codes associated with data
  - E.g. key for identifier table
- May be useful to build tree node
  - For identifiers, literals, etc.

Interface to Lexical Analyzer
- Either: convert entire file to a file of tokens
  - Lexical analyzer is a separate phase
- Or: parser calls lexical analyzer
  - "Get next token"
  - This approach avoids extra I/O
  - Parser builds tree as we go along

Implementation of Scanner
- Given the input text
  - Generate the required tokens
  - Or provide token by token on demand
- Before we describe implementations
  - We take this short break
  - To describe relevant formalisms

Relevant Formalisms
- Type 3 (Regular) Grammars
- Regular Expressions
- Finite State Machines

Regular Grammars
- Non-terminals (arbitrary names)
- Terminals (characters)
- Two forms of rules
  - Non-terminal ::= terminal
  - Non-terminal ::= terminal Non-terminal
- One non-terminal is the start symbol
- Regular (type 3) grammars cannot count
  - No concept of matching nested parens

Regular Grammars
- E.g. grammar of reals with no exponent:

    REAL    ::= 0 REAL1       (repeat for 1 .. 9)
    REAL1   ::= 0 REAL1       (repeat for 1 .. 9)
    REAL1   ::= . INTEGER
    INTEGER ::= 0 INTEGER     (repeat for 1 .. 9)
    INTEGER ::= 0             (repeat for 1 .. 9)

- Start symbol is REAL

Regular Expressions
- Regular expressions (REs) defined by:
  - Any terminal character is an RE
  - Alternation: RE | RE
  - Concatenation: RE1 RE2
  - Repetition: RE* (zero or more REs)
- Language of REs = type 3 grammars
  - Regular expressions are more convenient

Specifying REs in Unix Tools
- Single characters: a b c d \x
- Match any character: .
- Alternation: [bcd] [b-z] ab|cd
- Repetition: x* y+
- Concatenation: abc[d-q]
- Optional: [0-9]+(\.[0-9]*)?

Finite State Machines
- Languages and automata
  - A language is a set of strings
  - An automaton is a machine that determines whether a given string is in the language or not
- FSMs are automata that recognize regular languages (regular expressions)

Definitions of FSM
- A set of labeled states
- Directed arcs labeled with characters
- A state may be marked as terminal
- Transition from state S1 to S2
  - If and only if there is an arc from S1 to S2
  - Labeled with the next character (which is eaten)
- String recognized if machine ends up in a terminal state
- One state is the distinguished start state

Building FSM from Grammar
- One state for each non-terminal
- A rule of the form
  - Nont1 ::= terminal
  - Generates transition from S1 (state for Nont1) to final state
- A rule of the form
  - Nont1 ::= terminal Nont2
  - Generates transition from S1 to S2

Building FSMs from REs
- Every RE corresponds to a grammar
- For all regular expressions
  - A natural translation to FSM exists
- We will not give details of the algorithm here

Non-Deterministic FSM
- A non-deterministic FSM
  - Has at least one state
  - With two arcs to two separate states
  - Labeled with the same character
- Which way to go?
  - Implementation requires backtracking
  - Nasty!

Deterministic FSM
- For all states S
  - For all characters C
  - There is either ONE or NO arc
  - From state S labeled with character C
- Much easier to implement
  - No backtracking

Dealing with ND FSM
- Construction naturally leads to ND FSM
- For example, consider the FSM for
  - [0-9]+ | [0-9]+\.[0-9]+  (integer or real)
- We will naturally get a start state
  - With two sets of 0-9 branches
  - And thus non-deterministic

Converting to Deterministic
- There is an algorithm for converting
  - From any ND FSM
  - To an equivalent deterministic FSM
- Algorithm is in the text book
- Example (given in terms of REs):
  - [0-9]+ | [0-9]+\.[0-9]+  becomes  [0-9]+(\.[0-9]+)?

Implementing the Scanner
- Three methods:
  - Completely informal: just write code
  - Define tokens using regular expressions
    - Convert REs to ND finite state machine
    - Convert ND FSM to deterministic FSM
    - Program the FSM
  - Use an automated program to achieve the above three steps

Ad Hoc Code (forget FSMs)
- Write normal hand code
  - A procedure called Scan
  - Normal coding techniques
- Basically scan over white space and comments till a non-blank character is found
- Base subsequent processing on that character
  - E.g. colon may be : or :=
  - / may be operator or start of comment
- Return token found
- Write aggressive, efficient code

Using FSM Formalisms
- Start with regular grammar or RE
  - Typically found in the language standard
- For example, for Ada (Chapter 2, Lexical Elements):

    digit ::= 0|1|2|3|4|5|6|7|8|9
    decimal-literal ::= integer [.integer] [exponent]
    integer ::= digit {[underline] digit}
    exponent ::= E [+] integer | E - integer

Using FSM Formalisms, cont.
- Given REs or grammar
  - Convert to finite state machine
  - Convert ND FSM to deterministic FSM
- Write a program to recognize tokens
  - Using the deterministic FSM

Implementing FSM (Method 1)
- Each state is code of the form:

    <<state1>>
       case Next_Character is
          when 'a' => goto state3;
          when 'b' => goto state1;
          when others => End_of_token_processing;
       end case;
    <<state2>>
       ...

Implementing FSM (Method 2)
- There is a variable called State:

    loop
       case State is
          when state1 =>
             case Next_Character is
                when 'a' => State := state3;
                when 'b' => State := state1;
                when others => End_token_processing;
             end case;
          when state2 =>
             ...
       end case;
    end loop;

Implementing FSM (Method 3)
- A transition table indexed by state and character:

    T : array (State, Character) of State;

    while More_Input loop
       Curstate := T (Curstate, Next_Char);
       if Curstate = Error_State then
          ...
       end if;
    end loop;

Automatic FSM Generation
- Our example: FLEX
  - See home page for manual in HTML
- FLEX is given
  - A set of regular expressions
  - Actions associated with each RE
- It builds a scanner
  - Which matches REs and executes actions

Flex General Format
- Input to Flex is a set of rules:

    Regexp    actions (C statements)
    Regexp    actions (C statements)
    ...

- Flex scans for the longest matching Regexp
  - And executes the corresponding actions

An Example of a Flex Scanner

    DIGIT [0-9]
    ID    [a-z][a-z0-9]*
    %%
    {DIGIT}+  {
        printf ("an integer %s (%d)\n", yytext, atoi (yytext));
    }
    {DIGIT}+"."{DIGIT}*  {
        printf ("a float %s (%g)\n", yytext, atof (yytext));
    }
    if|then|begin|end|procedure|function  {
        printf ("a keyword: %s\n", yytext);
    }

Flex Example (continued)

    {ID}             printf ("an identifier %s\n", yytext);
    "+"|"-"|"*"|"/"  { printf ("an operator %s\n", yytext); }
    "--".*\n         /* eat Ada style comment */
    [ \t\n]+         /* eat white space */
    .                printf ("unrecognized character\n");
    %%

Assembling the Flex Program

    %{
    #include <math.h>  /* for atof */
    %}
    <<flex text we gave goes here>>
    %%
    main (argc, argv)
    int argc;
    char **argv;
    {
        yyin = fopen (argv[1], "r");
        yylex ();
    }

Running Flex
- flex is a program that is executed
  - The input is as we have given
  - The output is a C program
- For Ada fans
  - Look at aflex (www.adapower.com)
- For C++ fans
  - flex can run in C++ mode
  - Generates appropriate classes

Choice Between Methods?
- Hand-written scanners
  - Typically much faster execution
  - And pretty easy to write
  - And easier for good error recovery
- Flex approach
  - Simple to use
  - Easy to modify token language

The GNAT Scanner
- Hand written (scn.adb/scn.ads)
- Basically a call does
  - Super quick scan past blanks/comments etc.
  - Big case statement
    - Process based on first character
  - Call special routines
    - Namet.Get_Name for identifiers (hashing)
    - Keywords recognized by special hash
    - Strings (stringt.ads)
    - Integers (uintp.ads)
    - Reals (ureal.ads)

More on the GNAT Scanner
- Entire source read into memory
  - Single contiguous block
  - Source location is index into this block
  - Different index range for each source file
- See sinput.adb/ads for source management
- See scans.ads for definitions of tokens

More on the GNAT Scanner
- Read the scn.adb code
- Very easy reading

ASSIGNMENT TWO
- Write a flex or aflex program
  - Recognize tokens of an Algol-68s program
  - Print out tokens in style of flex example
- Extra credit
  - Build hash table for identifiers
  - Output hash table key

Preprocessors
- Some languages allow preprocessing
  - This is a separate step
  - Input is source, output is expanded source
- Can either be done as a separate phase
  - Or embedded into the lexical analyzer
- Often done as a separate phase
  - Need to keep track of source locations

Nasty Glitches
- Separation of tokens
  - Not all languages have clear rules
- FORTRAN has optional spaces:
  - DO10I=1.6  is  DO10I = 1.6   (identifier operator literal)
  - DO10I=1,6  is  DO 10 I = 1, 6   (keyword label loopvar operator literal punctuation literal)
- Modern languages avoid this kind of thing!