Compilers: Lexical Analysis
SITE: http://www.info.univ-tours.fr/~mirian/
TLC - Mírian Halfeld-Ferrari
The Role of the Lexical Analyzer

The first phase of a compiler. Lexical analysis is the process of taking an input string of characters (such as the source code of a computer program) and producing a sequence of symbols called lexical tokens, or just tokens, which may be handled more easily by a parser.

Because the lexical analyzer reads the source text, it may also perform certain secondary tasks:
- Eliminate comments and white space in the form of blanks, tabs, and newline characters.
- Correlate error messages from the compiler with the source program (e.g., keep track of the number of lines).

The interaction with the parser is usually arranged by making the lexical analyzer a subroutine of the parser.
Interaction of lexical analyzer with parser

[Diagram: the parser asks the lexical analyzer to "get next token"; the lexical analyzer returns a token; both components consult the symbol table.]
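As a sketch, this interface can be as small as a single function that the parser calls whenever it needs another token. The names Token, TokenKind and get_next_token below are illustrative assumptions, not part of the slides:

    #include <stdio.h>

    /* Illustrative token representation. */
    typedef enum { TOK_ID, TOK_NUM, TOK_EOF } TokenKind;

    typedef struct {
        TokenKind kind;
        const char *lexeme;   /* the matched source text */
    } Token;

    /* The lexical analyzer as a subroutine of the parser: each call
     * consumes one lexeme from the input and returns one token.
     * (Stubbed here; a real scanner reads the source text.) */
    static Token get_next_token(void) {
        Token t = { TOK_EOF, "" };
        return t;
    }

    /* The parser drives the loop: "get next token" until end of input. */
    static void parse(void) {
        for (Token t = get_next_token(); t.kind != TOK_EOF;
             t = get_next_token()) {
            printf("token %d: %s\n", (int)t.kind, t.lexeme);
        }
    }

    int main(void) { parse(); return 0; }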
Tokens, Patterns, Lexemes

Token: a group of characters having collective meaning: typically a word or punctuation mark, separated out by the lexical analyzer and passed to the parser. A lexeme is an actual character sequence forming a specific instance of a token, such as num.

Pattern: a rule that describes the set of strings associated with a token, expressed as a regular expression that describes how the token can be formed. For example: [A-Za-z][A-Za-z_0-9]*. The pattern matches each string in the set.

A lexeme is a sequence of characters in the source text that is matched by the pattern for a token.
Example

    Token      Sample Lexemes          (Informal) Description of Pattern
    const      const                   const
    if         if                      if
    relation   <, <=, =, <>, >, >=     < or <= or = or <> or > or >=
    id         pi, count, D2           letter.(letter|digit)*
    num        3.1416, 0.6, 6.22       any numeric constant
    literal    "core dumped"           any characters between " and " except "

Note: in the Pascal statement const pi = 3.1416 the substring pi is a lexeme for the token identifier.
When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme. Example: the pattern num matches both 0 and 1, and it is essential for the code generator to know which string was actually matched.

The lexical analyzer collects information about tokens into their associated attributes. In practice a token usually has only a single attribute: a pointer to the symbol-table entry in which the information about the token is kept, such as the lexeme, the line number on which it was first seen, etc.

Example Fortran statement: E = M * C ** 2

Tokens and associated attributes:
    < id, pointer to symbol-table entry for E >
    < assignop >
    < id, pointer to symbol-table entry for M >
    < multiop >
    ...
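A sketch of this representation; the structure and field names are assumptions for illustration:

    /* A token as a kind plus a single attribute: a pointer into the
     * symbol table (NULL for tokens such as assignop that need none). */
    struct symtab_entry {
        char lexeme[32];     /* e.g. "E" or "M" */
        int  first_line;     /* line on which it was first seen */
    };

    typedef enum { ID, ASSIGNOP, MULTIOP, NUM } token_kind;

    typedef struct {
        token_kind kind;
        struct symtab_entry *entry;   /* the token's attribute */
    } token;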
Lexical Errors

Few errors are discernible at the lexical level alone, because the lexical analyzer has a very localized view of the source text. It cannot tell whether the string fi is a misspelling of the keyword if or an identifier. The lexical analyzer can detect characters that are not in the alphabet and strings that match no pattern. In general, when an error is found, the lexical analyzer stops (although other actions are also possible).
Stages of a lexical analyzer

Scanner: based on a finite state machine. If it lands on an accepting state, it takes note of the type and position of the acceptance and continues. Sometimes it lands on a "dead state", which is a non-accepting state; when the lexical analyzer lands on the dead state, it is done. The last accepting state reached is the one that represents the type and length of the longest valid lexeme. The "extra" invalid characters must be "returned" to the input buffer.
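The following sketch shows the shape of that longest-match loop. The helpers step, is_accepting and kind_of stand for the machine's transition and classification functions; they are assumptions for illustration, not a fixed API:

    /* Hypothetical automaton interface. */
    enum { START_STATE = 0, DEAD_STATE = -1 };
    int step(int state, char c);        /* next state, or DEAD_STATE   */
    int is_accepting(int state);        /* non-zero on accepting state */
    int kind_of(int state);             /* token type of a state       */

    /* Maximal munch: remember the last accepting state seen; when the
     * dead state is reached, "return" the extra characters by
     * resetting the position to that last acceptance. */
    int scan(const char *input, int *pos, int *kind) {
        int state = START_STATE;
        int last_state = -1, last_pos = -1;
        int i = *pos;
        while (state != DEAD_STATE && input[i] != '\0') {
            state = step(state, input[i++]);
            if (is_accepting(state)) {   /* note type and position */
                last_state = state;
                last_pos = i;
            }
        }
        if (last_state < 0) return 0;    /* no valid lexeme: error */
        *kind = kind_of(last_state);     /* longest valid lexeme   */
        *pos = last_pos;                 /* retract the rest       */
        return 1;
    }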
Stages of a lexical analyzer

Evaluator: goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. Some tokens, such as parentheses, do not really have values, so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes an evaluator can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.
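A minimal sketch of an evaluator for integer lexemes (illustrative; a real evaluator would also handle signs, overflow, and so on):

    #include <stddef.h>

    /* Evaluator sketch: fold the characters of an integer lexeme into
     * the value that, together with the type, forms the token. */
    long eval_integer(const char *lexeme, size_t len) {
        long value = 0;
        for (size_t i = 0; i < len; i++)
            value = value * 10 + (lexeme[i] - '0');
        return value;
    }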
Example

In the source code of a computer program the string

    net_worth_future = (assets - liabilities);

might be converted (with whitespace suppressed) into the lexical token stream:

    NAME "net_worth_future"
    EQUALS
    OPEN_PARENTHESIS
    NAME "assets"
    MINUS
    NAME "liabilities"
    CLOSE_PARENTHESIS
    SEMICOLON
General idea of input buffering

We use a buffer divided into two N-character halves. We read N input characters into each half of the buffer with one system read command, rather than invoking a read command for each input character. If fewer than N characters remain in the input, a special character eof is read instead.

Two pointers into the buffer are maintained; the string of characters between the two pointers is the current lexeme. Initially both pointers point to the first character of the lexeme to be found. One of them, the forward pointer, scans ahead until a match for a pattern is found. Once the next lexeme is determined, the forward pointer is set to the character at its right end. After the lexeme is processed, both pointers are set to the character immediately past the lexeme.

With this scheme, comments and white space can be treated as patterns that yield no token.
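A sketch of the scheme; the buffer size, sentinel value and helper names are assumptions:

    #include <stdio.h>

    #define N 4096              /* characters per buffer half         */
    #define EOF_SENTINEL '\0'   /* stands in for the special eof mark */

    static char buffer[2 * N + 2];   /* two halves, each ending in eof */
    static char *lexeme_begin;       /* first character of the lexeme  */
    static char *forward;            /* scans ahead for a match        */

    /* One system read fills a whole half; a short read leaves the
     * eof sentinel inside the half. */
    static void fill_half(FILE *src, char *half) {
        size_t n = fread(half, 1, N, src);
        half[n] = EOF_SENTINEL;
    }

When forward reaches the sentinel at the end of one half, the other half is refilled and scanning wraps around; once a lexeme has been processed, both pointers move to the character immediately past it.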
The importance of lexical analysis

Lexical analysis makes writing a parser much easier. Instead of having to build up names such as "net_worth_future" from their individual characters, the parser can start with tokens and concern itself only with syntactical matters. This leads to efficiency of programming, if not efficiency of execution. However, since the lexical analyzer is the subsystem that must examine every single character of the input, it can be a compute-intensive step whose performance is critical, for example when it is used in a compiler.
Regular expressions and lexical analysis

There is an almost perfect match between regular expressions and the lexical analysis problem, with two exceptions:

1. There are many different kinds of lexemes that need to be recognized, and the lexer treats them differently, so a simple accept/reject answer is not sufficient: there should be a different kind of accept answer for each different kind of lexeme.

2. A DFA reads a string from beginning to end and then accepts or rejects; a lexer must instead find the end of the lexeme in the input stream, and the next time it is called it must find the next lexeme in the string.
Lexical Analyzer Specification

To specify a lexical analyzer we need a state machine, sometimes called a Transition Diagram (TD), which is similar to an FSA. Transition diagrams depict the actions that take place when the lexer is called by the parser to get the next token.

Differences between a TD and an FSA:
- An FSA accepts or rejects a string; a TD reads characters until finding a token, returns the token read, and prepares the input buffer for the next call.
- In a TD there is no out-transition from accepting states (for some authors).
- A transition labeled other (or not labeled) should be taken on any character except those labeling the other transitions out of the given state.
- States can be marked with a *: this indicates states on which an input retraction must take place.

To handle different kinds of lexemes, we usually build a separate DFA (or TD) from the regular expression for each kind of lexeme and then merge them into a single combined DFA (or TD).
Example

[Figure 1: an FSA and a TD for integers. In the FSA, state 0 reaches accepting state 1 on [0-9], and state 1 loops on [0-9]. In the TD, an other transition leads from state 1 to accepting state 2, where the extra character is retracted.]
Recognizing keywords

Keywords have the same pattern as identifiers but do not correspond to the token identifier. Two solutions are possible:

1. Treat keywords as identifiers and search a table to know whether the lexeme is an identifier or a keyword.
2. Take the regular expressions of all keywords and integrate the TDs obtained from them into the global TD.
Keywords as identifiers

This technique is almost essential if the lexer is coded by hand. Without it, the number of states in a lexer for a typical programming language is several hundred; using the trick, fewer than a hundred will probably suffice.

The technique for separating keywords from identifiers consists of appropriately initializing the symbol table in which information about identifiers is saved. For instance, we enter the strings if, then and else into the symbol table before any characters in the input are seen. When a string is recognized by the TD:

1. The symbol table is examined.
2. If the lexeme is found there marked as a keyword, then the string is a keyword; else the string is an identifier.
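A sketch of this initialization, using a flat array as the table for brevity (a real lexer would hash, and would copy the lexeme string rather than store the caller's pointer):

    #include <string.h>

    enum kind { KEYWORD, IDENTIFIER };

    struct entry { const char *lexeme; enum kind kind; };

    static struct entry symtab[1024];
    static int n_entries;

    static void enter(const char *s, enum kind k) {
        symtab[n_entries].lexeme = s;    /* sketch: no copy made */
        symtab[n_entries].kind = k;
        n_entries++;
    }

    /* Called before any characters in the input are seen. */
    void init_symtab(void) {
        enter("if", KEYWORD);
        enter("then", KEYWORD);
        enter("else", KEYWORD);
    }

    /* After the TD recognizes a string: a keyword if marked so in
     * the table, otherwise a (possibly new) identifier. */
    enum kind classify(const char *lexeme) {
        for (int i = 0; i < n_entries; i++)
            if (strcmp(symtab[i].lexeme, lexeme) == 0)
                return symtab[i].kind;
        enter(lexeme, IDENTIFIER);       /* first occurrence: enter it */
        return IDENTIFIER;
    }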
Example

Symbol Table:

    do       keyword      \
    end      keyword       |  keywords
    for      keyword       |
    while    keyword      /
    ...
    Cont     identifier       identifiers
Regular expressions for all keywords

Keywords can be prefixes of identifiers (e.g., do and done). The lexer that results from this technique is much more complex, but the technique is necessary when we use lexical analyzer generators, which build the analyzer from a specification; the specification is made of regular expressions.

[Combined TD (states 0-10): letter transitions build identifiers (accepting state IDENTIFIER, with retraction on other), digit transitions build integers (accepting state INTEGER, with retraction on other), and the letters d, o, n, e thread through dedicated states to the accepting states DO and DONE.]
Example: another transition diagram

TD for unsigned or negative signed integers, the addition operator + and the increment operator ++.

[Transition diagram with states 0-8: + leads from state 0 through states 1 and 3 toward the accepting ADD states (2 and 4) and the accepting INCR state (5); digits lead to state 6, which loops on digits and exits on other to the accepting INTEGER state 7. Its transition table is given on the next slide.]
Example: Transition Table

                 Input
    state    +    digit    other    token      retraction
    0        1    6        8
    1        3    2        2
    2        -    -        -        ADD        1
    3        5    4        4
    4        -    -        -        ADD        2
    5        -    -        -        INCR       0
    6        7    6        7
    7        -    -        -        INTEGER    1
    8        -    -        -        error
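A table-driven driver for this machine might look like the sketch below (column order +, digit, other, as in the table; next_token and its return convention are illustrative assumptions):

    #include <ctype.h>
    #include <stdio.h>

    /* Sketch of a driver for the transition table above. */
    enum { ERR = 8 };

    static const int delta[9][3] = {
        /* state 0 */ {1, 6, 8},
        /* state 1 */ {3, 2, 2},
        /* state 2 */ {0, 0, 0},   /* accepting: ADD, retract 1     */
        /* state 3 */ {5, 4, 4},
        /* state 4 */ {0, 0, 0},   /* accepting: ADD, retract 2     */
        /* state 5 */ {0, 0, 0},   /* accepting: INCR, retract 0    */
        /* state 6 */ {7, 6, 7},
        /* state 7 */ {0, 0, 0},   /* accepting: INTEGER, retract 1 */
        /* state 8 */ {0, 0, 0},   /* error                         */
    };

    static const char *token_of[9] = {
        NULL, NULL, "ADD", NULL, "ADD", "INCR", NULL, "INTEGER", NULL
    };
    static const int retract_of[9] = {0, 0, 1, 0, 2, 0, 0, 1, 0};

    static int column(char c) {
        if (c == '+') return 0;
        if (isdigit((unsigned char)c)) return 1;
        return 2;
    }

    /* Returns the token name and leaves *pos just past the lexeme. */
    const char *next_token(const char *input, int *pos) {
        int state = 0;
        while (token_of[state] == NULL && state != ERR)
            state = delta[state][column(input[(*pos)++])];
        if (state == ERR) return "error";
        *pos -= retract_of[state];     /* give extra characters back */
        return token_of[state];
    }

For the input 3+45, successive calls return INTEGER, ADD, INTEGER, the retraction column giving back the look-ahead character each time.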
Implementation of Lexical Analyzers

There are different ways of creating a lexical analyzer. One is to use an automatic generator of lexical analyzers (such as Lex or Flex). Though it is possible, and sometimes necessary, to write a lexer by hand, lexers are often generated by automated tools. These tools accept regular expressions that describe the tokens allowed in the input stream.

Input: a specification consisting of
- regular expressions representing the patterns;
- actions to take according to the detected token.

Each regular expression is associated with a phrase in a programming language that will evaluate the lexemes matching the regular expression. The tool then constructs a state table for the appropriate finite state machine and creates program code containing the table, the evaluation phrases, and a routine that uses them appropriately.
Lexical analyzer: use a generator or construct by hand?

Lexical analyzer generator
- Advantages: easier and faster development.
- Disadvantages: the lexical analyzer is not very efficient and its maintenance can be complicated.

Writing the lexical analyzer in a high-level language
- Advantages: more efficient and compact.
- Disadvantages: done by hand.

Writing the lexical analyzer in a low-level language
- Advantages: very efficient and compact.
- Disadvantages: development is complicated.
Remember

Regular expressions and finite state machines are not capable of handling recursive patterns, such as n opening parentheses, followed by a statement, followed by n closing parentheses. They are not capable of keeping count and verifying that n is the same on both sides.
Priority of tokens

Longest lexeme:
- DO and DOT: DOT is taken.
- > and >=: >= is taken.

First-listed matching pattern. Suppose the following regular expressions appear in the lexical specification:
- w.h.i.l.e: keyword while
- letter.(letter|digit)*: identifier

When we read while in the input, the lexer considers it a keyword. If we exchanged the order of the two rules in the specification, we would never detect the keyword while, as the sketch below illustrates.
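A minimal Flex illustration of this ordering rule (the printf messages are illustrative): because Flex prefers the first-listed rule when two patterns match a lexeme of the same length, the keyword rule must come before the identifier rule.

    while     printf("keyword while recognised\n");
    [a-z]+    printf("identifier %s recognised\n", yytext);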
The use of Lex or Flex

The general form of the input expected by Flex is:

    { definitions }
    %%
    { rules }
    %%
    { user subroutines }

The second part MUST be present.
Definitions and Rules

Definitions:
- declarations of ordinary C variables and constants;
- Flex definitions.

Rules have the form:

    regular-expression    action

The actions are C/C++ code.
Flex actions

Actions are C source fragments. If an action is compound, or takes more than one line, enclose it in braces ({ }).

    [a-z]+        printf("found word\n");
    [A-Z][a-z]*   { printf("found capitalized word:\n");
                    printf("  %s\n", yytext); }
Regular-expression-like notation

- a represents the single character a
- \a or "a" represents a when a is a character used in the notation (to avoid ambiguity)
- a|b represents a or b
- a? represents zero or one occurrence of a
- a* represents zero or more occurrences of a
- a+ represents one or more occurrences of a
- a{m,n} represents between m and n occurrences of a
- [a-z] represents a character set
- [^a-z] represents the complement of the first character set
Regular-expression-like notation (continued)

- {name} represents the regular expression defined by name
- ^a represents a at the start of a line
- a$ represents a at the end of a line
- ab/xy represents ab, but only when followed by xy
- . represents any character except newline
Examples of regular expressions in Flex

- a*          zero or more a's
- .*          zero or more of any character except newline
- .+          one or more characters
- [a-z]       a lowercase letter
- [a-zA-Z]    any alphabetic letter
- [^a-zA-Z]   any non-alphabetic character
- a.b         a followed by any character followed by b
- rs|tu       rs or tu
- END$        the characters END followed by an end-of-line
- a(b|c)d     abd or acd
Definition Format

    name definition

name is just a word beginning with a letter (or an underscore) followed by zero or more letters, digits, underscores, or dashes. definition goes from the first non-whitespace character to the end of the line. You can then refer to it via {name}, which expands to (definition).

Examples:

    DIGIT [0-9]

    {DIGIT}*\.{DIGIT}+   or   ([0-9])*\.([0-9])+
Example

    %option lex-compat
    letter      [A-Za-z]
    digit       [0-9]
    identifier  {letter}({letter}|{digit})*
    %%
    {identifier}   { printf("identifier %s on line %d recognised\n", yytext, yylineno); }
    %%
Example

    %option lex-compat
    digit       [0-9]
    letter      [A-Za-z]
    intconst    [+\-]?{digit}+
    realconst   [+\-]?{digit}+\.{digit}+(e[+\-]?{digit}+)?
    identifier  {letter}({letter}|{digit})*
    whitespace  [ \t\n]
    stringch    [^"]
    string      \"{stringch}+\"
    otherch     [^0-9a-zA-Z+\- \t\n]
    othersym    {otherch}+
Example

    %%
    main          printf("keyword main - program recognised\n");
    \{            printf("begin recognised\n");
    for           printf("keyword for recognised\n");
    do            printf("keyword do recognised\n");
    while         printf("keyword while recognised\n");
    switch        printf("keyword switch recognised\n");
    case          printf("keyword case recognised\n");
    if            printf("keyword if recognised\n");
    else          printf("keyword else recognised\n");
    {intconst}    printf("integer %s on line %d\n", yytext, yylineno);
    {realconst}   printf("real %s on line %d\n", yytext, yylineno);
    {string}      printf("string %s on line %d\n", yytext, yylineno);
    {identifier}  printf("identifier %s on line %d recognised\n", yytext, yylineno);
    {whitespace}  ; /* no action */
    {othersym}    ; /* no action */
    %%
Homework

Implement the examples in the book!