Why separate lexical from syntactic analysis? Lexical Analysis / Scanning. Complications. Lexemes, tokens, and patterns


Craig Chambers, CSE 401

Lexical Analysis / Scanning

- Purpose: turn character stream (input program) into token stream
  - parser turns token stream into syntax tree
- Token: group of characters forming basic, atomic chunk of syntax; a "word"
- Whitespace: characters between tokens that are ignored

Why separate lexical from syntactic analysis?

- Separation of concerns / good design
  - scanner:
    - handle grouping chars into tokens
    - ignore whitespace
    - handle I/O, machine dependencies
  - parser:
    - handle grouping tokens into syntax trees
- Restricted nature of scanning allows faster implementation
  - scanning is time-consuming in many compilers

Complications

- Most languages today are free-form
  - layout doesn't matter
  - whitespace separates tokens
- Alternatives:
  - Fortran: line-oriented, whitespace doesn't separate
      do 1 i = 1.1
      .. a loop ..
      1 continue
  - Haskell: can use indentation & layout to imply grouping
- Most languages separate scanning and parsing
- Alternative: C/C++/Java: type vs. identifier
  - parser wants scanner to distinguish names that are types from names that are variables
  - but scanner doesn't know how things are declared -- that's done during semantic analysis, a.k.a. typechecking!

Lexemes, tokens, and patterns

- Lexeme: group of characters that form a token
- Token: class of lexemes that match a pattern
  - token may have attributes, if more than one lexeme in token
- Pattern: typically defined using a regular expression
  - REs are the simplest language class that's powerful enough
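The lexeme / token / pattern distinction can be made concrete with a small sketch. The token classes INT and IDENT, their patterns, and the names `Token` and `classify` are all assumptions made for illustration; they are not taken from the slides.

```python
import re
from dataclasses import dataclass

# Hypothetical token classes, each defined by a pattern (a regular expression).
PATTERNS = [
    ("INT", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
]


@dataclass(frozen=True)
class Token:
    kind: str    # the token class, e.g. "IDENT"
    lexeme: str  # attribute recording which lexeme of that class was seen


def classify(lexeme: str) -> Token:
    """Return the token whose pattern matches the whole lexeme."""
    for kind, pattern in PATTERNS:
        if pattern.fullmatch(lexeme):
            return Token(kind, lexeme)
    raise ValueError(f"no pattern matches {lexeme!r}")
```

Here `classify("count")` and `classify("x1")` give two tokens of the same class IDENT that differ only in their lexeme attribute, which is why a token class with more than one lexeme needs an attribute.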

Languages and language specifications

- Alphabet: a finite set of characters/symbols
- String: a finite, possibly empty sequence of characters in alphabet
- Language: a (possibly empty or infinite) set of strings
- Grammar: a finite specification of a set of strings
- Language automaton: a finite machine for accepting a set of strings and rejecting all others
- A language can be specified by many different grammars and automata
- A grammar or automaton specifies only one language

Classes of languages

- Regular languages can be specified by regular expressions/grammars, finite-state automata (FSAs)
- Context-free languages can be specified by context-free grammars, push-down automata (PDAs)
- Turing-computable languages can be specified by general grammars, Turing machines
- [Figure: nested sets, from outermost to innermost: all languages, Turing-computable languages, context-free languages, regular languages]

Syntax of regular expressions

- Defined inductively
  - base cases:
    - the empty string (ε)
    - a symbol from the alphabet (e.g. x)
  - inductive cases:
    - sequence of two REs: E1 E2
    - either of two REs: E1 | E2
    - Kleene closure (zero or more occurrences) of a RE: E*
- Notes:
  - can use parentheses for grouping
  - precedence: * highest, sequence, | lowest
  - whitespace insignificant

Notational conveniences

- E+ means 1 or more occurrences of E
- E^k means k occurrences of E
- [E] means 0 or 1 occurrence of E (optional E)
- {E} means E*
- not(x) means any character in the alphabet but x
- not(E) means any string of characters in the alphabet but those strings matching E
- E1-E2 means any string matching E1 except those matching E2
- No additional expressive power through these conveniences
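The claim that the conveniences add no expressive power can be illustrated by rewriting some of them into the core forms. The sketch below is a minimal one under stated assumptions: the class names, the helpers `plus`, `opt`, `power`, `any_of`, and the naive matcher `ends` are all invented here. It covers only E+, [E] and E^k; not(E) and E1-E2 also stay regular but need more machinery than a simple rewrite.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Eps:            # the empty string
    pass


@dataclass(frozen=True)
class Sym:            # a single alphabet symbol
    ch: str


@dataclass(frozen=True)
class Seq:            # E1 E2
    left: object
    right: object


@dataclass(frozen=True)
class Alt:            # E1 | E2
    left: object
    right: object


@dataclass(frozen=True)
class Star:           # E*
    inner: object


def plus(e):          # E+  rewritten as  E E*
    return Seq(e, Star(e))


def opt(e):           # [E] rewritten as  E | ε
    return Alt(e, Eps())


def power(e, k):      # E^k rewritten as k copies of E in sequence
    return Eps() if k == 0 else Seq(e, power(e, k - 1))


def any_of(chars):    # c1 | c2 | ... for a non-empty string of symbols
    return reduce(Alt, [Sym(c) for c in chars])


def ends(e, s, i):
    """Set of positions j such that e matches s[i:j]."""
    if isinstance(e, Eps):
        return {i}
    if isinstance(e, Sym):
        return {i + 1} if s[i:i + 1] == e.ch else set()
    if isinstance(e, Seq):
        return {k for j in ends(e.left, s, i) for k in ends(e.right, s, j)}
    if isinstance(e, Alt):
        return ends(e.left, s, i) | ends(e.right, s, i)
    # Star: keep applying the inner RE until no new end positions appear.
    seen, frontier = {i}, {i}
    while frontier:
        frontier = {k for j in frontier for k in ends(e.inner, s, j)} - seen
        seen |= frontier
    return seen


def matches(e, s):
    return len(s) in ends(e, s, 0)


integer = plus(any_of("0123456789"))
signed_int = Seq(opt(any_of("+-")), integer)
```

The matcher only ever sees the five core forms, yet `signed_int` behaves as if `+` and `[ ]` were primitive.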

Naming regular expressions

- Can assign names to regular expressions
- Can use the name of a RE in the definition of another RE
- Examples:
    letter ::= a | b | ... | z
    digit ::= 0 | 1 | ... | 9
    alphanum ::= letter | digit
- Grammar-like notation for named REs: a regular grammar
- Can reduce named REs to plain RE by "macro expansion"
  - no recursive definitions allowed, unlike full context-free grammars

Using regular expressions to specify tokens

- Identifiers
    ident ::= letter (letter | digit)*
- Integer constants
    integer ::= digit+
    sign ::= + | -
    signed_int ::= [sign] integer
- Real number constants
    real ::= signed_int [fraction] [exponent]
    fraction ::= . digit+
    exponent ::= (E | e) signed_int

More token specifications

- String and character constants
    string ::= " char* "
    character ::= ' char '
    char ::= not(" | ' | \) | escape
    escape ::= \ (" | ' | \ | n | r | t | v | b | a)
- Whitespace
    whitespace ::= <space> | <tab> | <newline> | comment
    comment ::= /* not(*/) */

Meta-rules

- Can define a rule that a legal program is a sequence of tokens and whitespace
    program ::= (token | whitespace)*
    token ::= ident | integer | real | string | ...
- But this doesn't say how to uniquely break up an input program into tokens -- it's highly ambiguous!
  - E.g. what tokens to make out of hi2bob?
    - one identifier, hi2bob?
    - three tokens, hi 2 bob?
    - six tokens, each one character long?
- The grammar states that it's legal, but not how tokens should be carved up from it
- Apply extra rules to say how to break up string into sequence of tokens
  - longest match wins
  - yield tokens, drop whitespace
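The two meta-rules (longest match wins; yield tokens, drop whitespace) can be sketched with Python's `re` module. This is a minimal sketch, not a full scanner: the three-entry `SPECS` list is a cut-down, hypothetical subset of the token definitions above, and breaking ties in favour of the earlier entry is an assumption the slides do not state.

```python
import re

# Hypothetical subset of the token specification, one pattern per token class.
SPECS = [
    ("ident", re.compile(r"[a-z][a-z0-9]*")),
    ("integer", re.compile(r"[0-9]+")),
    ("whitespace", re.compile(r"[ \t\n]+")),
]


def scan(text):
    """Split text into (kind, lexeme) pairs using the longest-match rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in SPECS:
            m = pattern.match(text, pos)      # try every pattern at this position
            if m and m.end() > best_end:      # keep only a strictly longer match
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError(f"no token matches at position {pos}")
        if best_kind != "whitespace":         # yield tokens, drop whitespace
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens
```

With these rules `scan("hi2bob")` yields a single identifier, while `scan("hi 2 bob")` yields three tokens, so the ambiguity in the grammar is resolved the same way every time.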

RE specification of initial MiniJava lexical structure

    Program ::= (Token | Whitespace)*
    Token ::= ID | Integer | ReservedWord | Operator | Delimiter
    ID ::= Letter (Letter | Digit)*
    Letter ::= a | ... | z | A | ... | Z
    Digit ::= 0 | ... | 9
    Integer ::= Digit+
    ReservedWord ::= class | public | static | extends | void | int | boolean | if | else | while | return | true | false | this | new | String | main | System.out.println
    Operator ::= + | - | * | / | < | <= | >= | > | == | != | && | !
    Delimiter ::= ; | . | , | = | ( | ) | { | } | [ | ]
    Whitespace ::= <space> | <tab> | <newline>

Building scanners from RE patterns

- Convert RE specification into finite state automaton (FSA)
- Convert FSA into scanner implementation
  - by hand into collection of procedures
  - mechanically into table-driven scanner

Finite state automata

- An FSA has:
  - a set of states
    - one marked the initial state
    - some marked final states
  - a set of transitions from state to state
    - each transition labelled with a symbol from the alphabet or ε
- [Figure: example FSA; recovered arc labels: /, *, not(*), *, not(*,/), /]
- Operate by reading symbols and taking transitions, beginning with the start state
  - if no transition with a matching label is found, reject
- When done with input, accept if in final state, reject otherwise

Determinism

- FSA can be deterministic or nondeterministic
- Deterministic: always know which way to go
  - at most 1 arc leaving a state with particular symbol
  - no ε arcs
- Nondeterministic: may need to explore multiple paths, only choose right one later
- Example: [Figure: example nondeterministic FSA]
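The operating rule for a deterministic FSA (read symbols, take transitions, reject on a missing transition, accept only in a final state) can be sketched as follows. The automaton is an assumed one for the comment pattern `/* not(*/) */`: its arc labels match those recovered from the figure, but the state numbering and the grouping of characters into three classes are choices made here.

```python
def char_class(ch):
    """Map a character to the column it selects: '/', '*', or 'other'."""
    return ch if ch in "/*" else "other"


# Transitions keyed by (state, character class); a missing key means "reject".
TRANSITIONS = {
    (0, "/"): 1,                      # opening '/'
    (1, "*"): 2,                      # opening '*', now inside the comment
    (2, "/"): 2, (2, "other"): 2,     # not(*): stay inside
    (2, "*"): 3,                      # a '*' that might start the closing '*/'
    (3, "*"): 3,                      # further '*' characters
    (3, "other"): 2,                  # not(*,/): back inside the comment
    (3, "/"): 4,                      # closing '/'
}
START, FINAL = 0, {4}


def accepts(text):
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:             # no transition with a matching label
            return False
    return state in FINAL             # accept only if input ends in a final state
```

Because every (state, class) pair has at most one entry and there are no ε arcs, the loop never has to choose between paths.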

NFAs vs. DFAs

- A problem:
  - REs (e.g. specifications) map to NFAs easily
  - Can write code from DFA easily
- How to bridge the gap?
- Can it be bridged?

A solution

- Cool algorithm to translate any NFA into equivalent DFA!
  - proves that NFAs aren't more expressive than DFAs
- Plan:
  1) Convert RE into NFA [they're equivalent]
  2) Convert NFA into DFA
  3) Convert DFA into code
- Can be done by hand, or fully automatically

RE -> NFA

- Define by cases:
  - ε
  - x
  - E1 E2
  - E1 | E2
  - E*
- [Figures: one NFA fragment per case]

NFA -> DFA

- Problem: NFA can choose among alternative paths, while DFA must have only one path
- Solution: subset construction of DFA
  - each state in DFA represents set of states in NFA, all that the NFA might be in during its traversal
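Step 1 of the plan, building an NFA by cases on the RE, might look like the sketch below. The slide's diagrams did not survive, so this uses the standard Thompson-style construction as an assumption; the fragments on the original slide may differ in detail. `NFABuilder`, `any_of`, and `nfa_accepts` are names invented here, and each fragment is meant to be used only once.

```python
import itertools
from collections import defaultdict


class NFABuilder:
    """Builds NFA fragments; a fragment is a (start, accept) pair of states."""

    def __init__(self):
        self.edges = defaultdict(list)  # state -> [(label, target)]; label None is ε
        self._ids = itertools.count()

    def _fresh(self):
        return next(self._ids), next(self._ids)

    def _arc(self, src, label, dst):
        self.edges[src].append((label, dst))

    def eps(self):                      # case ε
        s, f = self._fresh()
        self._arc(s, None, f)
        return s, f

    def sym(self, x):                   # case x
        s, f = self._fresh()
        self._arc(s, x, f)
        return s, f

    def seq(self, a, b):                # case E1 E2: glue a's accept to b's start
        self._arc(a[1], None, b[0])
        return a[0], b[1]

    def alt(self, a, b):                # case E1 | E2: fork into both, then join
        s, f = self._fresh()
        for frag in (a, b):
            self._arc(s, None, frag[0])
            self._arc(frag[1], None, f)
        return s, f

    def star(self, a):                  # case E*: allow skipping and repeating
        s, f = self._fresh()
        self._arc(s, None, a[0])
        self._arc(s, None, f)
        self._arc(a[1], None, a[0])
        self._arc(a[1], None, f)
        return s, f


def any_of(b, chars):
    """Fragment for c1 | c2 | ... over a non-empty string of symbols."""
    frag = b.sym(chars[0])
    for ch in chars[1:]:
        frag = b.alt(frag, b.sym(ch))
    return frag


def eps_closure(edges, states):
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in edges.get(s, ()):
            if label is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def nfa_accepts(edges, frag, text):
    """Track every state the NFA might be in, exploring all paths at once."""
    current = eps_closure(edges, {frag[0]})
    for ch in text:
        moved = {t for s in current for label, t in edges.get(s, ()) if label == ch}
        current = eps_closure(edges, moved)
    return frag[1] in current
```

For example, `ident ::= letter (letter | digit)*` over a tiny alphabet could be built as `b.seq(any_of(b, "ab"), b.star(b.alt(any_of(b, "ab"), any_of(b, "01"))))`. Note that `nfa_accepts` already tracks a set of NFA states, which is the idea the subset construction turns into DFA states.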

Subset construction algorithm

- Given NFA with states and transitions
  - label all NFA states uniquely
- Create start state of DFA
  - label it with the set of NFA states that can be reached from the NFA's start state by ε transitions (i.e. without consuming any input)
- Process the start state
- To process a DFA state S with label {s1,..,sN}:
  - For each symbol x in the alphabet:
    - compute the set T of NFA states reached from any of the NFA states s1,..,sN by an x transition followed by any number of ε transitions
    - if T not empty:
      - if a DFA state has T as a label, add a transition labeled x from S to T
      - otherwise create a new DFA state labeled T, add a transition labeled x from S to T, and process T
- A DFA state is final iff at least one of the NFA states in its label is final

DFA -> code

- Option 1: implement scanner by hand using procedures
  - one procedure for each token
  - each procedure reads characters
  - choices implemented using if & switch statements
- Pros
  - straightforward to write by hand
  - fast
- Cons
  - a fair amount of tedious work
  - may have subtle differences from language specification

DFA -> code (cont.)

- Option 2: use tool to generate table-driven scanner
  - rows: states of DFA
  - columns: input characters
  - entries: action
    - go to new state
    - accept token, go to start state
    - error
- Pros
  - convenient for automatic generation
  - exactly matches specification, if tool-generated
- Cons
  - "magic"
  - table lookups may be slower than direct code
    - but switch statements get compiled into table lookups, so...
    - can translate table lookups into switch statements, if beneficial
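The subset construction steps can be sketched directly. The NFA representation (a dict from state to a list of `(label, target)` pairs, with `None` standing for ε), the worklist in place of recursive "process" calls, and the example NFA for strings over {0, 1} ending in 01 are all assumptions made for illustration.

```python
def eps_closure(edges, states):
    """All NFA states reachable from `states` using only ε (None) arcs."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in edges.get(s, ()):
            if label is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def subset_construction(edges, start, finals, alphabet):
    start_set = frozenset(eps_closure(edges, {start}))
    dfa = {}                        # DFA state (set of NFA states) -> {symbol: DFA state}
    worklist = [start_set]
    while worklist:
        S = worklist.pop()
        if S in dfa:                # already processed
            continue
        dfa[S] = {}
        for x in alphabet:
            # x transition from any state in S, then any number of ε transitions
            moved = {t for s in S for label, t in edges.get(s, ()) if label == x}
            T = frozenset(eps_closure(edges, moved))
            if T:
                dfa[S][x] = T
                if T not in dfa:
                    worklist.append(T)
    # A DFA state is final iff its label contains a final NFA state.
    dfa_finals = {S for S in dfa if S & finals}
    return start_set, dfa, dfa_finals


def dfa_accepts(start, dfa, finals, text):
    state = start
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:
            return False
    return state in finals


# Hypothetical NFA: state 0 loops on 0 and 1, then guesses that "01" ends the input.
EDGES = {0: [("0", 0), ("1", 0), ("0", 1)], 1: [("1", 2)]}
DFA_START, DFA, DFA_FINALS = subset_construction(EDGES, 0, {2}, "01")
```

The resulting `DFA` dict is already in the rows-by-symbol shape that Option 2's table-driven scanner expects: each key is a row and each inner dict maps an input character to the next state.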