COMP 356 Programming Language Structures Notes for Chapter 4 of Concepts of Programming Languages Scanning and Parsing




The scanner (or lexical analyzer) of a compiler processes the source program, recognizing tokens and returning them one at a time. The parser (or syntactic analyzer) of a compiler organizes the stream of tokens (from the scanner) into parse trees according to the grammar for the source language.

Separating the scanner and parser into different stages of the compiler:

- makes implementing the parser easier
- increases portability of the compiler
- permits parallel development
- eases tool support

1 Scanning

Scanning is essentially pattern matching: each token is described by some pattern for its lexemes. Examples:

- the pattern for an identifier (token id) is a letter followed by a sequence of letters, digits and underscores. Lexemes include: i3, foo, foo_7, ...
- the pattern for a floating point numeric literal (token num) is an optional sign, followed by a sequence of digits, followed by a decimal point, followed by a sequence of digits. Lexemes include: -7.3, 3.44, ...

1.1 Regular Expressions

Regular expressions are a notation for defining patterns.

  notation  name                         meaning                example   matches
  [...]     character class              any of a set           [a-zA-Z]  any letter
  *         Kleene * (Kleene closure)    0 or more occurrences  a*        0 or more a's
  +         Kleene + (positive closure)  1 or more occurrences  a+        1 or more a's
  ?         optional                     0 or 1 occurrences     a?        a or ε
  |         alternative                  one or the other       a|b       a or b
  (...)     grouping                     overrides precedence   a(b|c)    a followed by b or c

  Table 1: Regular expression notation

Because | has lowest precedence, ab|c (the last example above with the parentheses removed) matches ab or c, and not a followed by b or c.

A regular definition is a named regular expression. Such names can be used in place of the regular expression in following definitions. Examples:

  letter = [a-zA-Z]
  digit  = [0-9]
  id     = {letter}({letter}|{digit}|_)*
  num    = ('+'|'-')?{digit}+'.'{digit}+
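The id and num patterns above can be tried out with Java's java.util.regex package. This is only an illustration of the languages the patterns specify (java.util.regex syntax differs slightly from the lex-style notation used in these notes, and a generated scanner does not work this way):

```java
import java.util.regex.Pattern;

public class TokenPatterns {
    // id = {letter}({letter}|{digit}|_)* : a letter, then letters/digits/underscores
    static final Pattern ID = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*");
    // num = ('+'|'-')?{digit}+'.'{digit}+ : optional sign, digits, point, digits
    static final Pattern NUM = Pattern.compile("[+-]?[0-9]+\\.[0-9]+");

    public static void main(String[] args) {
        System.out.println(ID.matcher("foo_7").matches());   // true
        System.out.println(NUM.matcher("-7.3").matches());   // true
        System.out.println(NUM.matcher("3.").matches());     // false: digits required after the point
    }
}
```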

The name of a regular definition is put in { } on the R.H.S. of another regular definition. This distinguishes the use of the named definition, for example {digit}, from the simple pattern given by the letters in the name, for example d followed by i followed by g followed by i followed by t.

The language specified by a regular expression is the set of all strings that match the pattern given by the regular expression. For example, the language specified by the pattern for id given above includes strings such as foo, i3, foo_7 and so on. A regular language is any language that can be specified by a regular expression.

Every regular language is context free (can be specified by a BNF definition), but not vice versa. For example, {aⁿbⁿ | n ≥ 0} is a context-free language, because it is defined by the following grammar:

  <S> → a<S>b | ε

but it is not a regular language. Hence, regular expressions are strictly less powerful than context-free grammars (BNF definitions). However, they can be used to generate more efficient pattern matchers. To generate a pattern matcher from regular expressions, tools such as lex and JFlex first convert the regular expressions to state transition diagrams, and then generate pattern matching code from the diagrams.

1.2 Using JFlex

JFlex is a tool for generating Java implementations of scanners from regular expression definitions of tokens.

To configure your account for using JFlex (and CUP), first download the file cup.jar from the course web page (see the links for Homework 3). Remember which directory this file is stored in. Next, use a text editor such as pico to create/edit a file called .profile in your home directory, add the following content to it, and save it:

  CLASSPATH=~/cup.jar:.
  export CLASSPATH
  PATH=.:$PATH
  export PATH

If you saved cup.jar somewhere other than your home directory, the first line should include the appropriate directory.
For example, if you saved cup.jar to your Desktop directory, the first line should be:

  CLASSPATH=~/Desktop/cup.jar:.

Your PATH and CLASSPATH environment variables are now set correctly for all of the software we will use in this course.

A JFlex specification is put in a file with a .lex extension, e.g. example.lex. Assuming that example.lex contains a valid JFlex specification, the command:

  java JFlex.Main example.lex

generates the scanner code and places it in file Yylex.java. (Any error messages about inability to find main or JFlex.Main indicate that your CLASSPATH environment variable is not set correctly.)

The scanner code is in class Yylex. The constructor for class Yylex takes a stream reader (use System.in for standard input) as the input stream to read the source program from. The other significant method in class Yylex is next_token(), which returns the next token. For use with CUP (a parser generator), next_token() should return instances of class Symbol from java_cup.runtime.Symbol (which requires that CUP is installed and your CLASSPATH is set correctly). Class Symbol has two significant fields (data members):

- int sym - the token recognized (represented as an integer)

- Object value - can contain instances of any Java class, and is often used to hold lexemes as instances of classes such as String, Double and Integer. CUP generates parsers from attribute grammars, and this value field will become an intrinsic attribute of the token in the attribute grammar.

The one-argument constructor of class Symbol takes the token (of type int) as its parameter, while the two-argument constructor has parameters for the values of both of these fields. The token values should be defined as integer constants in class sym. CUP generates this class automatically, or it can be constructed by hand if CUP is not being used.

We will construct a JFlex scanner and a CUP parser for the following simple calculator language. This language defines lists of arithmetic expressions, and the output from our parser will be the value of each expression.

  <exprl> → <exprl> <expr> ; | ε
  <expr>  → num
          | <expr> addop <expr>
          | <expr> mulop <expr>
          | ( <expr> )

Format of a JFlex specification file:

  user code
  %%
  JFlex directives
  %%
  regular expression rules and actions

For an example, see file example.lex on the course web page. The contents of the user code section are copied directly into the generated scanner code. For example, imports of any needed Java classes would go here.

The JFlex directives section is used to customize the behavior of JFlex. Some common directives include:

  %cup        // to specify CUP compatibility

  // the following three lines make JFlex recognize EOF properly
  %eofval{
  return new Symbol(sym.EOF);
  %eofval}

Notes on JFlex directives:

- the value of sym.EOF must be 0 (this only matters if you construct class sym by hand)
- every directive must start at the beginning of a line
- the line breaks in the macro defined with %eofval above are critical

Any regular definitions also go in this section.
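Since class sym can be constructed by hand when CUP is not being used, a hand-written version for the calculator language might look like the following sketch. Every constant name except EOF is invented for illustration (CUP would normally generate this file, and its names must match the scanner's); EOF must be 0, as noted above:

```java
// Hand-constructed token class for the calculator language (a sketch;
// CUP would normally generate this). Names other than EOF are hypothetical.
public class sym {
    public static final int EOF    = 0;  // must be 0 for the %eofval directive to work
    public static final int NUM    = 1;
    public static final int ADDOP  = 2;
    public static final int MULOP  = 3;
    public static final int LPAREN = 4;
    public static final int RPAREN = 5;
    public static final int SEMI   = 6;
}
```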
These use exactly the syntax of regular definitions presented in class, except that double quotes (" ") are used instead of single quotes (' ') to distinguish metasymbols from literal characters.

The regular expression rules and actions section gives a sequence of regular expression patterns for tokens, together with actions to be executed when the associated patterns are matched. Notes:

- each action is Java code, and the Java keyword return can be used to specify the value returned by method next_token()
- white space (spaces, tabs, ...) in the source program is ignored by giving a pattern for white space with an empty action. Since the action contains no return, method next_token() ignores white space and keeps scanning the source program until something else is found.
- the patterns are tried in order, and the first matching pattern is used.

- a period (.) is a pattern that matches any character. This is useful for reporting lexical errors. However, this also means that the period must be put in double quotes, i.e. ".", in a pattern if you mean the character '.' and not the metasymbol.
- in each action, the string of characters that matched the associated pattern (the lexeme) can be accessed using function yytext()

2 Bottom-Up Parsers

The parser (or syntactic analyzer) of a compiler organizes the stream of tokens (from the scanner) into parse trees according to the grammar for the source language. All grammars used in designing parsers must be unambiguous. Choices:

- remove ambiguity by hand (by modifying the grammar)
- use disambiguating rules to specify which choices to make when constructing the parse tree.

Parsing algorithms that can parse any unambiguous grammar are O(n³), where n is the length of the string being parsed. Commercial compilers use O(n) (linear) parsing algorithms that can handle specific classes of grammars, called LL(1) and LR(1) grammars. These grammars are sufficient to describe the syntax of the vast majority of programming language features. Parsers are classified as:

- top-down parsers, which build parse trees from the root to the leaves. These parsers are designed using LL(1) grammars.
- bottom-up parsers, which build parse trees from the leaves to the root. These parsers are designed using LR(1) grammars.

Bottom-up parsers build parse trees from the leaves to the root. The idea is to use the grammar rules backward: the parser matches against the R.H.S. of grammar rules, and when some rule is matched, it is used to construct a subtree with the R.H.S. of the rule as the children and the L.H.S. as the root.

A bottom-up parser uses a stack to keep track of the parse. At each parsing step, there are two possible actions:

- shift - push the next token from the input onto the stack
- reduce - if the sequence of grammar symbols on top of the stack matches the R.H.S.
of a grammar rule, pop them all and replace them with the L.H.S. of that rule.

The parser is controlled by a table generated from the grammar. The tables are too large and complex to construct by hand, and so are generated using a tool such as yacc or CUP.

Bottom-up parsers are also called LR parsers, because they parse LR grammars:

- L - Left to right scan of the input
- R - generates a Rightmost derivation (in reverse)

Every LL(1) grammar is an LR grammar, but not vice versa:

  LL(1) grammars ⊂ LR grammars

For example, many left recursive grammars are LR. Hence, bottom-up parsers have more parsing power than top-down parsers.

The parsers generated by yacc and CUP are LALR(1) parsers, where LALR(1) stands for Look-Ahead LR(1). These parsers are less powerful than LR parsers, i.e.:

  LL(1) grammars ⊂ LALR(1) grammars ⊂ LR grammars

but the loss of parsing power is small, and the parse tables for LALR(1) parsers are much smaller than those for LR parsers.

If the grammar used as input to yacc or CUP is ambiguous or otherwise not LALR(1), these tools will report parsing conflicts as error messages. The possible conflicts are:

- shift-reduce conflicts - at some step, the parser can't tell whether to shift a token onto the stack or do a reduction.
- reduce-reduce conflicts - at some step, the parser can't tell which of two reductions to perform. This can occur when two rules in the grammar have identical R.H.S.'s but different L.H.S.'s.

2.1 Using CUP

CUP generates an LALR(1) parser (in Java) from an attribute grammar specification:

- this approach works well with synthesized attributes
- each time a rule is used in a reduction, the attribute of the L.H.S. of the rule is computed (via user-specified Java code) from the attributes of the R.H.S.
- bottom-up parsing ensures that the attributes of the R.H.S. are available when needed.

JFlex scanners can easily be made compatible with CUP parsers. A CUP specification is placed in a file with extension .cup, for example example.cup. To generate the parser code:

  java java_cup.Main example.cup

This generates:

- parser.java, which contains the parser code
- sym.java, which contains definitions of all of the tokens used, as integer constants. These must match the tokens used in the scanner.

To compile and run the parser:

1. ensure that the scanner code generated using JFlex (for example, Yylex.java) is in the same directory as parser.java and sym.java
2. compile by executing: javac parser.java
3. run the parser by executing: java parser < input
   where input is the file containing the source program. Or, to enter the source program directly from the keyboard, just enter: java parser
   and then type the source program and press Ctrl-D when finished.

Format of a CUP specification file:

  imports
  directives
  declarations of tokens and nonterminals
  grammar rules and actions

For an example, see file example.cup on the course web page or the class handout. The imports are copied directly to the beginning of the generated code in parser.java. For example,

  import java_cup.runtime.*;

is usually placed here to provide access to the needed library classes.

The directives supply user code to be placed in the indicated class in the generated code. For example:

  parser code {: code :}

causes the code in the special brackets {: :} to be placed in class parser.

In the declarations of tokens and nonterminals section, all grammar symbols are declared, as follows:

- the name used for a token must match the name assigned to the sym field of class Symbol when that token is recognized by the scanner.
- if a token has an attribute (placed in the value field of class Symbol by the scanner), the type of that attribute must be declared and must match the attribute value assigned by the scanner.
- if a nonterminal has an attribute, the type of that attribute must also be given. The majority of nonterminals will have attributes.
- by convention, token names are in all caps and nonterminal names are all lowercase. Note that nonterminal names are not put in angle brackets (< >).

This section also includes any specifications of associativity and precedence of tokens. These specifications are actually disambiguating rules, and permit parsing of many ambiguous grammars when used appropriately. The syntax of these specifications is:

  precedence <spec> terminal {, terminal};

where:

  <spec> → left | right | nonassoc

The <spec> specifies associativity. If a token is declared as nonassoc, then it cannot occur multiple times in one expression. For example, 1 < 2 < 3 is illegal in many programming languages.

Precedence rules:

- all tokens listed in the same declaration have the same precedence.
- if there are multiple precedence declarations, later declarations have higher precedence.
- all tokens not listed in a precedence declaration have lowest precedence.

For example:

  precedence left ADDOP;
  precedence left MULOP;

declares that MULOP has higher precedence than ADDOP, and that both are left associative.
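Putting these conventions together, the declarations section of a .cup file for the calculator language might look like the following sketch. This is not the course's example.cup; the token names and attribute types here are assumptions for illustration:

```
/* tokens: names in all caps; NUM carries a Double attribute (its value field) */
terminal        ADDOP, MULOP, LPAREN, RPAREN, SEMI;
terminal Double NUM;

/* nonterminals: lowercase, no angle brackets; expr has a Double attribute */
non terminal        exprl;
non terminal Double expr;

/* later declarations have higher precedence: MULOP binds tighter than ADDOP */
precedence left ADDOP;
precedence left MULOP;
```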
In many cases, these precedence and associativity declarations can be used to remove shift-reduce and reduce-reduce conflicts from the grammar.

The grammar rules and actions section is an attribute grammar for the language being parsed:

- the L.H.S. of the first rule is the start symbol
- the symbol ::= is used in place of →
- nonterminal names are not in < > (the declarations specify which names are nonterminals and which are tokens)
- each rule ends in ;

- each symbol on the R.H.S. of a rule can be labeled using : and a name, e.g. expr:l ADDOP:op expr:r, where l, op and r are labels
- for tokens, the label is used to refer to the intrinsic attribute assigned by the scanner (stored in the value field of the Symbol)
- for nonterminals, the attribute value is computed using actions associated with each rule
- the attribute of the L.H.S. of the rule is always RESULT

The actions associated with each rule:

- are in special brackets {: :}
- contain Java code that can refer to attribute labels and RESULT
- typically compute the value of RESULT based on the attribute values of the grammar symbols on the R.H.S. of the rule. Hence, the attributes are synthesized attributes.
- are executed each time the rule is used in a reduction
- can easily be used to build parse trees or do other computation

The commands needed to compile and run a CUP parser with a JFlex scanner are as follows, assuming:

- file example.lex contains the JFlex specification
- file example.cup contains the CUP specification
- file input contains the source program
- JFlex and CUP are installed and the CLASSPATH is set correctly

  command                          comment
  java JFlex.Main example.lex      creates Yylex.java
  java java_cup.Main example.cup   creates sym.java and parser.java
  javac parser.java                compiles everything
  java parser < input              runs the parser

  Table 2: Commands needed to compile and run a CUP parser with a JFlex scanner.

To add tokens, you must declare them in the .cup file and also add actions to the .lex file to recognize and return them. Also, you must run CUP (to generate constant declarations for the new tokens in sym.java) before recompiling anything.
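As a closing illustration, the grammar rules and actions for the calculator language might be sketched as follows. This is hypothetical code, not the course's example.cup: it assumes NUM and expr carry Double attributes and that the scanner stores each operator's lexeme as a String in the value field of ADDOP and MULOP:

```
exprl ::= exprl expr:e SEMI
          {: System.out.println(e); :}   /* print the value of each expression */
        | /* empty */
        ;

expr  ::= expr:l ADDOP:op expr:r
          {: RESULT = op.equals("+") ? l + r : l - r; :}
        | expr:l MULOP:op expr:r
          {: RESULT = op.equals("*") ? l * r : l / r; :}
        | NUM:n
          {: RESULT = n; :}
        | LPAREN expr:e RPAREN
          {: RESULT = e; :}
        ;
```

Note how each action computes RESULT (the attribute of the L.H.S.) from the labeled attributes on the R.H.S., so the attributes are synthesized, and how the precedence declarations for ADDOP and MULOP resolve the shift-reduce conflicts this ambiguous grammar would otherwise cause.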