Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.

Similar documents
COMP 356 Programming Language Structures Notes for Chapter 4 of Concepts of Programming Languages Scanning and Parsing

03 - Lexical Analysis

Compiler I: Syntax Analysis Human Thought

Pushdown automata. Informatics 2A: Lecture 9. Alex Simpson. 3 October, School of Informatics University of Edinburgh als@inf.ed.ac.

Compiler Construction

CSC4510 AUTOMATA 2.1 Finite Automata: Examples and D efinitions Definitions

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

Lexical analysis FORMAL LANGUAGES AND COMPILERS. Floriano Scioscia. Formal Languages and Compilers A.Y. 2015/2016

Outline of today s lecture

Compilers Lexical Analysis

Scanning and parsing. Topics. Announcements Pick a partner by Monday Makeup lecture will be on Monday August 29th at 3pm

Introduction. Compiler Design CSE 504. Overview. Programming problems are easier to solve in high-level languages

Design Patterns in Parsing

Scanner. tokens scanner parser IR. source code. errors

CS143 Handout 08 Summer 2008 July 02, 2007 Bottom-Up Parsing

5HFDOO &RPSLOHU 6WUXFWXUH

If-Then-Else Problem (a motivating example for LR grammars)

Syntaktická analýza. Ján Šturc. Zima 208

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

How to make the computer understand? Lecture 15: Putting it all together. Example (Output assembly code) Example (input program) Anatomy of a Computer

Symbol Tables. Introduction

Context free grammars and predictive parsing

Flex/Bison Tutorial. Aaron Myles Landwehr CAPSL 2/17/2012

CSCI 3136 Principles of Programming Languages

Lexical Analysis and Scanning. Honors Compilers Feb 5 th 2001 Robert Dewar

FIRST and FOLLOW sets a necessary preliminary to constructing the LL(1) parsing table

Programming Project 1: Lexical Analyzer (Scanner)

Pushdown Automata. place the input head on the leftmost input symbol. while symbol read = b and pile contains discs advance head remove disc from pile

Programming Assignment II Due Date: See online CISC 672 schedule Individual Assignment

Learning Translation Rules from Bilingual English Filipino Corpus

Bottom-Up Parsing. An Introductory Example

Master of Sciences in Informatics Engineering Programming Paradigms 2005/2006. Final Examination. January 24 th, 2006

Basic Parsing Algorithms Chart Parsing

Regular Expressions and Automata using Haskell

University of Toronto Department of Electrical and Computer Engineering. Midterm Examination. CSC467 Compilers and Interpreters Fall Semester, 2005

Name: Class: Date: 9. The compiler ignores all comments they are there strictly for the convenience of anyone reading the program.

Classification/Decision Trees (II)

Language Processing Systems

Programming Languages

Compilers. Introduction to Compilers. Lecture 1. Spring term. Mick O Donnell: michael.odonnell@uam.es Alfonso Ortega: alfonso.ortega@uam.

Scoping (Readings 7.1,7.4,7.6) Parameter passing methods (7.5) Building symbol tables (7.6)

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

Programming Languages CIS 443

A TOOL FOR DATA STRUCTURE VISUALIZATION AND USER-DEFINED ALGORITHM ANIMATION

Compiler Design July 2004

Lecture 9. Semantic Analysis Scoping and Symbol Table

Honors Class (Foundations of) Informatics. Tom Verhoeff. Department of Mathematics & Computer Science Software Engineering & Technology

Dynamic Cognitive Modeling IV

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

Semester Review. CSC 301, Fall 2015

Reading 13 : Finite State Automata and Regular Expressions

The previous chapter provided a definition of the semantics of a programming

A Chart Parsing implementation in Answer Set Programming

Automata and Computability. Solutions to Exercises

Advanced compiler construction. General course information. Teacher & assistant. Course goals. Evaluation. Grading scheme. Michel Schinz

Automata Theory. Şubat 2006 Tuğrul Yılmaz Ankara Üniversitesi

Cardinality. The set of all finite strings over the alphabet of lowercase letters is countable. The set of real numbers R is an uncountable set.

Introduction to Automata Theory. Reading: Chapter 1

Yacc: Yet Another Compiler-Compiler

Textual Modeling Languages

OBTAINING PRACTICAL VARIANTS OF LL (k) AND LR (k) FOR k >1. A Thesis. Submitted to the Faculty. Purdue University.

How To Understand A Sentence In A Syntactic Analysis

1 Introduction. 2 An Interpreter. 2.1 Handling Source Code

Factoring Surface Syntactic Structures

Acknowledgements. In the name of Allah, the Merciful, the Compassionate

3515ICT Theory of Computation Turing Machines

Eventia Log Parsing Editor 1.0 Administration Guide

A chart generator for the Dutch Alpino grammar

Algorithms and Data Structures

ANTLR - Introduction. Overview of Lecture 4. ANTLR Introduction (1) ANTLR Introduction (2) Introduction

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Artificial Intelligence Exam DT2001 / DT2006 Ordinarie tentamen

Natural Language Database Interface for the Community Based Monitoring System *

Constraints in Phrase Structure Grammar

Compiler Construction

NATURAL LANGUAGE QUERY PROCESSING USING SEMANTIC GRAMMAR

Compiler Construction

Web Data Extraction: 1 o Semestre 2007/2008

2) Write in detail the issues in the design of code generator.

CS154. Turing Machines. Turing Machine. Turing Machines versus DFAs FINITE STATE CONTROL AI N P U T INFINITE TAPE. read write move.

Turing Machines, Part I

A Programming Language for Mechanical Translation Victor H. Yngve, Massachusetts Institute of Technology, Cambridge, Massachusetts

The following themes form the major topics of this chapter: The terms and concepts related to trees (Section 5.2).

Patterns of Input Processing Software

CA Compiler Construction

Module: Software Instruction Scheduling Part I

Translating while Parsing

Chapter One Introduction to Programming

CSCE-608 Database Systems COURSE PROJECT #2

1/20/2016 INTRODUCTION

Chapter 4: Computer Codes

Peking: Profiling Syntactic Tree Parsing Techniques for Semantic Graph Parsing

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

24 Uses of Turing Machines

Eindhoven University of Technology

Computer Programming. Course Details An Introduction to Computational Tools. Prof. Mauro Gaspari:

Managing Users and Identity Stores

Transcription:

Topics Chapter 4 Lexical and Syntax Analysis Introduction Lexical Analysis Syntax Analysis Recursive -Descent Parsing Bottom-Up parsing 2 Language Implementation Compilation There are three possible approaches to translating human readable code to machine code 1. Compilation 2. Interpretation 3. Hybrid 3 4 Introduction The syntax analysis portion of a language processor nearly always consists of two parts: A low-level part called a lexical analyzer Based on a regular grammar. Output: set of tokens. A high-level part called a syntax analyzer Based on a context-free grammar or BNF Output: parse tree. Issues in Lexical and Syntax Analysis Reasons for separating both analysis: 1) Simpler design. Separation allows the simplification of one or the other. Example: A parser with comments or white spaces is more complex 2) Compiler efficiency is improved. Optimization of lexical analysis because a large amount of time is spent reading the source program and partitioning it into tokens. 3) Compiler portability is enhanced. Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer. 5 6 1

Lexical Analyzer Examples of Tokens First phase of a compiler. It is also called scanner. Main task: read the input characters and produce as output a sequence of tokens. Process: Input: program as a single string of characters. Collects characters into logical groupings and assigns internal codes to the groupings according to their structure. Groupings: lexemes Internal codes: tokens 7 Example of an assignment Token IDENT result = value / 100; Lexeme result ASSIGNMENT_OP = IDENT value DIVISION_OP / INT_LIT 100 SEMICOLON ; 8 Interaction between lexical and syntax analyzers source program lexical analyzer token get next token symbol table syntax analyzer parse tree Secondary tasks: Lexical Analysis Stripping out from the source program comments and white spaces in the form of blank, tab, and new line characters. Correlating error messages from the compiler with the source program. Inserting lexemes for user-defined names into the symbol table. 9 10 Building a Lexical Analyzer State Transition Diagram Three different approaches: Write a formal description of the tokens and use a software tool that constructs table-driven lexical analyzers given such a description (e,g, lex) Design a state diagram that describes the tokens and write a program that implements the state diagram Design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram Directed graph Nodes are labeled with state names. Arcs are labeled with the input characters that cause the transitions An arc may also include actions the lexical analyzer must perform when the transition is taken. A state diagrams represent a finite automaton which recognizes regular languages (expressions). 11 12 2

State Diagram Design State Diagram: Example A naive state diagram would have a transition from every state on every character in the source language - such a diagram would be very large. In many cases, transitions can be combined to simplify the state diagram When recognizing an identifier, all uppercase and lowercase letters are equivalent Use a character class that includes all letters When recognizing an integer literal, all digits are equivalent - use a digit class 13 14 Syntax Analyzer Parser The syntax analyzer or parser must determine the structure of the sequence of tokens provided to it by the scanner. Check the input program to determine whether is syntactically correct. Produce either a complete parse tree of at least trace the structure of the complete parse tree. Error: produce a diagnostic message and recover (gets back to a normal state and continue the analysis of the input program: find as many errors as possible in one pass). Two categories of parsers Top-down: produce the parse tree, beginning at the root down to the leaves. Bottom-up: produce the parse tree, beginning at the leaves upward to the root. 15 16 Conventions Top-Down Parser Terminal symbols: lowercase letters at the beginning of the alphabet (a,b, ) Nonterminal symbols: uppercase letters at the beginning of the alphabet (A,B, ) Terminals or nonterminals: uppercase letters at the end of the alphabet (W,X,Y,Z) Strings of terminals: lowercase letters at the end of the alphabet (w,x,y,z) Mixed strings (terminals and/or nonterminals): lowercase Greek letters (α,β,δ,γ) Build the parse tree in preorder. Begins with the root. Each node is visited before its branches are followed. Branches from a particular node are followed in left-to-right order. Uses leftmost derivation. 17 18 3

Next sentential form? Example Given a sentential form xaα, A is the leftmost nonterminal that could be expanded to get the next sentential form in a leftmost derivation. Current sentential form: xaα A-rules: A? bb, A? cbb, and A? a Next sentential form?: xbbα or xcbbα or xaα This is known as the parsing decision problem for top-down parsers. 19 <S> ::= <NP> <VP> <S> ::= <aux> <NP> <VP> <S> ::= <VP> <NP> ::= <Proper_Noun> <NP> ::= <Det> <Nom> <Nom> ::= <Noun> <Nom> ::= <Noun> <Nom> <VP> ::= <Verb> <VP> ::= <Verb> <NP> <Det> ::= that this a <Noun> ::= book flight meal <Verb> ::= book prefer <Aux> ::= does <Prep> ::= from to on <Proper_Noun> ::= Houston A miniature English grammar 20 Suppose the parser is able to build all possible partial trees at the same time. The algorithm begins assuming that the input can be derived by the designated start symbol S. The next step is to find the tops of all trees which can start with S, by looking for all the grammars rules with S on the LHS. There are 3 rules that expand S, so the second ply, or level, has 3 partial trees. These constituents of these 3 new trees are expanded in the same way we just expanded S; and so on. At each ply we use the RHS of the rules to provide new sets of expectations for the parser, which are then used to recursively generate the rest of the tree. Trees are grown downward until they eventually reach the terminals at the bottom of the tree. 21 22 Top-Down Parser At this point, trees whose leaves fail to match all the words in the input can be rejected, leaving behind those trees that represent successful parses. In this example, only the fifth parse tree (the one which has expanded the rule <VP> ::= <Verb><NP>) will eventually match the input sentence. Parsers look only one token ahead in the input Given a sentential form, xaα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A The most common top-down parsing algorithms: Recursive descent - a coded implementation LL parsers - table driven implementation 23 24 4

Bottom-Up Parser Bottom-up parsing is the earliest know parsing algorithm. Build the parse tree beginning at the leaves and progressing towards the root. Order corresponds to the reverse of a rightmost derivation. The parse is successful if the parser succeeds in building a tree rooted in the start symbol that covers all of the input. The parse beginning by looking to each word and building 3 partial trees with the category of each terminal. The word book is ambiguous (it can be a noun or a verb). Thus the parser must consider two possible sets of trees. Each of the trees in the second ply is then expanded, and so on. 25 26 Top-Down vs. Bottom-Up In general, the parser extends one ply to the next by looking for places in the parse-inprogress where the right-hand-side of some rule might fit. In the fifth ply, the interpretation of book as a noun has been pruned because this parse cannot be continued: there is no rule in the grammar with RHS <Nominal><NP> The final ply is the correct parse tree. The most common bottom-up parsing algorithms are in the LR family Advantage (top-down) Never waste time exploring trees that cannot result in the root symbol (S), since it begins by generating just those trees. Never explores subtrees that cannot find a place in some S-rooted tree Bottom-up: left branch is completely wasted effort because it is based on interpreting book as Noun at the beginning of the sentence despite the fact no such tree can lead to an S given this grammar. 27 28 Top-Down vs. Bottom-Up Disadvantage (top-down) Spend considerable effort on S trees that are not consistent with the input. Firsts four of six trees in the third ply have left branches that cannot match the word book. This weakness arises from the fact that they can generate trees before ever examining the input. 29 5