csce4313 Programming Languages Scanner (pass/fail) John C. Lusth Revision Date: January 18, 2005 This is your first pass/fail assignment. You may develop your code using any procedural language, but you must ensure it runs correctly on turing.csce.uark.edu before submission. Your task is to write a scanner for Java. A scanner picks through the source code of a program, identifying the individual tokens. To scan a file, it is necessary to understand lexical analysis. Lexical Analysis Lexical analysis is the process of breaking up a language stream into its tokens (words) and categorizing those tokens as to their types (parts of speech). These token - type pairs are known as lexemes. Formally, a lexical analyzer maps a stream of characters onto a stream of lexemes. At a minimum, a lexeme is an object that holds at least two pieces of information, the type of the token and, in some cases, the actual token itself. As can be seen further on, a lexeme representing a semicolon would generally only use the type field, since the actual word ; can easily be reconstructed from the type information. The basic definition of a lexeme class might look like: class lexeme implements types String type; String word; Note: for programming in a non-object oriented language, substitute the approriate data structure for the above class. With respect to the lexemes found in a programming language, it may be efficient to store some of the words as other than strings. For example, numbers may best be stored as native numbers rather than as strings. To reflect this idea, we add a field corresponding to each of the primitive types in the language we are to implement: class lexeme extends types String type; String string; int integer; double real; Since memory is relatively cheap, we will not worry about the fact that for any given lexeme, two of the three value fields will remain empty. As an example of lexical analysis with respect to computer languages, consider some simple expressions in Java: 1
int a = 2; double zeta; if (a!= 0) a = 64.0 / a; zeta = log(a*a,2); System.out.println("the value of zeta is " + zeta); The parts of speech in this language are keywords, variables, operators, numbers and punctuation (in order of appearance). Now consider a program for taking this source code and producing the lexical stream. The function that reads a portion of the input and produces the next lexeme has historically been called lex. How does the lex decide what type a particular token is? Through its morphology! A plus sign, for example, has a distinctive shape, both as a character and as a bit pattern representing that character. As a more complicated example, a variable may have the following shape: VARIABLE: a letter followed by any number of letters or digits An integer has more restrictive shape: INTEGER : a digit followed by any number of digits Often, in a language, a keyword has the same general shape as a variable, yet its specific shape is different. For example, the keyword if has the following exact shape: IF : an i followed by an f Putting these ideas together, it becomes rather simple to write the lex function: lexeme lex() char ch; skipwhitespace(); ch = Input. read(); if (Input. failed) return new lexeme(endofinput); switch(ch) // single character tokens case ( : return new lexeme(oparen); case ) : return new lexeme(cparen); case, : return new lexeme(comma); case + : return new lexeme(plus); //what about ++ and +=? case * : return new lexemetimes); case - : return new lexeme(minus); case / : return new lexeme(divides); case < : return new lexeme(lessthan); case > : 2
return new lexeme(greaterthan); case = : return new lexeme(assign); case ; : return new lexeme(semicolon); //add any other cases here default: // multi-character tokens (only numbers, // variables/keywords, and strings) if (Character. isdigit(ch)) Input. pushback(ch); return lexnumber(); else if (Character. isletter(ch)) Input. pushback(ch); return lexvariable(); //and keywords! else if (ch == \" ) Input. pushback(ch); return lexstring(); else return new Lexeme(UNKNOWN, ch); In the (partial) implementation of lex given above, the function lexvariable is charged with looking for the specific shapes of the keywords within the general shape of variables. The implementations of lexnumber, lexvariable, lexstring, and skipwhitespace are left to the reader. The tokens ENDofINPUT, PLUS, MINUS, etc. are String constants, defined in the class/module types, and are used to fill out the type component of a lexeme. Note that when input is exhausted, lex repeatedly returns the ENDofINPUT lexeme. When given the input... int alpha = 10; while (alpha > 0) System.out.println("alpha is " + alpha); alpha = alpha - 1;...repeated calls to lex might produce the following stream of lexemes: TYPE_INT ASSIGN INTEGER 10 SEMICOLON WHILE OPEN_PAREN GREATER_THAN 3
INTEGER 0 CLOSE_PAREN OPEN_BRACE VARIABLE System DOT VARIABLE out DOT VARIABLE println OPEN_PAREN STRING "alpha is " PLUS CLOSE_PAREN SEMICOLON ASSIGN MINUS INTEGER 1 SEMICOLON CLOSE_BRACE ENDofINPUT A program which repeatedly calls lex and displays the resulting lexemes is called a scanner: void scanner(string filename) lexeme token; lexer i = new lexer(filename); token = i.lex(); while (token. type!= ENDofINPUT) System.out.println(token); token = i.lex(); Lexical analysis is the first phase of implementing a programming language. The next phase, covered in detail in the next pass/fail assignment, ensures that the order of the lexemes makes logical sense. Specifics Specifically, you are to write a scanner for Java. You should implement the following classes/modules: types lexeme lexer scanner Your implementation must be based upon one-character-at-a-time input (e.g. you may not use Java s string tokenizer if you are writing in Java). That is, your lex method reads a single character and then analyzes what to do with it as above. If you are the least bit confused about this requirement, see me. One should be able to run the scanner on a file of Java source code by typing in the command... scanner f...where f is the name of the Java file to be scanned. 4
Passing the assignment To pass, you must, at a minimum, correctly scan the file test.java. You do not need to be able to lex real numbers, but you will need to be able to skip both over /* */ comments and // comments. Coding Standards For this assignment, there are three possible scores: a strong pass worth 100pass will be awarded if the program is successful but the following coding standards and the coding standards in the syllabus are not followed. You will recieve no credit if your program fails to compile or you do not supply a makefile. Compiling Style Compile at the highest level of warnings (-Wall for gcc, g++) Supply code that compiles with no warnings or errors (C, C++ programmers) main returns int (C, C++ programmers) no implicit typing of functions or variables (C, C++ programmers) Each object should have it s own module/class (C, C++ programmers) The public interface for that object is placed in the header file (C, C++ programmers) Typedef all structures (C++ programmers) Use namespaces and modern include files In general, methods should be not contain very much code - if they do, see if there is some way to reduce the complexity with helper functions - in this particular program, the lex method will be longish because of the switch statement There should be no repetitious code Use symbolic constants (no numeric contants in code or declarations,) Programs should not contain explicit limits Portability Do not supply code that is tied to a specific operating system Do not call system Do not include conio.h or any other OS specific header file Miscellaneous Do not fflush(stdin) - it doesn t do anything useful Sign your work (all files) Comment appropriately (not too much, not too little, but just right) Indent no more than four spaces Don t change the length of a tab in your editor Do not read unlimited length strings into fixed length buffers! 5
Submitting the assignment To submit your assignment, delete all object or class files from your working directory, leaving only source code, a makefile, a readme file, and any test cases you may have. Then, while in your working directory on turing.csce.uark.edu, type the command submit csce4313 lusth scanner The submit program will bundle up all the files in your current directory and ship them to me. This includes subdirectories as well since all the files in any subdirectories will also be shipped to me. You will receive a weak pass if you supply extraneous files (such as.o files or.class files), so be careful. Make sure you that I will be able to call your scanner with the command named scanner. The command scanner must take a filename as a command line argument. You must supply a makefile that will compile your scanner when responding to the command make. If you do not supply a makefile, you will receive a failing grade for this assignment. You may submit as many times as you want before the deadline; new submissions replace old submissions. You must sign your work and give credit where credit is due! You must submit from turing.csce.uark.edu! 6