CS 246 Winter 2016 Assignment 3 Instructors: Peter Buhr and Rob Schluntz Due Date: Monday, March 7, 2016 at 22:00

March 1, 2016

This assignment examines intermediate-level C++ and classes. Use it to become familiar with these facilities, and ensure you use the specified concepts in your assignment solution, i.e., writing a C-style solution for questions is unacceptable, and will receive little or no marks. (You may freely use the code from these example programs.)

1. Given the C++ program in Figure 1, compile the program with and without preprocessor variable DYN defined.

    $ g++ -DDYN new.cc
    $ g++ new.cc

Compare the two versions of the program with respect to performance by doing the following for each version:

- Run the program and time the execution using the time command:

    $ /usr/bin/time -f "%Uu %Ss %E" ./a.out
    3.21u 0.02s 0:03.32

  (Output from time differs depending on the shell, so use the system time command.) Compare the user time (3.21u) only, which is the CPU time consumed solely by the execution of user code (versus system and real time).
- Use the program command-line argument (if necessary) to adjust the number of times the experiment is performed to get user times approximately in the range 0.1 to 100 seconds. (Timing results below 0.1 seconds are inaccurate.) Use the same command-line value for all experiments.
- Run both experiments again after recompiling the programs with compiler optimization turned on (i.e., compiler flag -O2).

    $ g++ -O2 -DDYN new.cc
    $ g++ -O2 new.cc

- Include 4 timing results to validate the experiments.
- Explain the relative differences in the timing results with respect to stack and dynamic allocation.
- State the performance difference when compiler optimization is used.
- Explain the use of 0 instead of NULL to initialize a pointer. (Hint: change the 0 to NULL, comment out the #include, and compile the program.)
- Explain why the call to delete with an address of 0 does not produce an error.

2. Write a C++ program to verify a string of bytes is a valid Unicode Transformation Format 8-bit character (UTF-8). UTF-8 allows any universal character to be represented while maintaining full backwards-compatibility with ASCII encoding, which is achieved by using a variable-length encoding. The following table provides a summary of the Unicode value ranges in hexadecimal, and how they are represented in binary for UTF-8.

    Unicode ranges    UTF-8 binary encoding
    000000-00007F     0xxxxxxx
    000080-0007FF     110xxxxx 10xxxxxx
    000800-00FFFF     1110xxxx 10xxxxxx 10xxxxxx
    010000-10FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
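The table above can be read as a set of bit masks on the first byte of a character. The following is a minimal sketch of that reading, not part of the required interface: the helper names utf8Length and isContinuation are hypothetical, and the sketch deliberately uses plain masks rather than the data structure the assignment asks you to use.

```cpp
// Hypothetical helper (name is an assumption, not from the assignment):
// returns the number of bytes announced by the first byte of a UTF-8
// character, or 0 if the byte cannot start a character.
unsigned int utf8Length( unsigned char first ) {
    if ( ( first & 0x80 ) == 0x00 ) return 1;   // 0xxxxxxx
    if ( ( first & 0xE0 ) == 0xC0 ) return 2;   // 110xxxxx
    if ( ( first & 0xF0 ) == 0xE0 ) return 3;   // 1110xxxx
    if ( ( first & 0xF8 ) == 0xF0 ) return 4;   // 11110xxx
    return 0;                                   // 10xxxxxx or 11111xxx
}

// Hypothetical helper: true if the byte has the form 10xxxxxx, i.e., it is
// one of the extra bytes that follow the first byte.
bool isContinuation( unsigned char byte ) {
    return ( byte & 0xC0 ) == 0x80;
}
```

A first byte for which utf8Length returns 0 matches none of the four rows of the table, so it cannot begin a valid character.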

    #include <cstdlib>                          // atoi

    void alloc( unsigned int size, unsigned int times ) {
        for ( unsigned int i = 0; i < times; i += 1 ) {
    #ifdef DYN
            volatile int * arr = new int[size];
            arr[0] = 5;
            delete [ ] arr;
    #else
            volatile int arr[size];             // ignore volatile, prevents elimination of declaration & assignment
            arr[0] = 5;
    #endif
        } // for
    } // alloc

    int main( int argc, char * argv[ ] ) {
        int times = 100000000;
        switch ( argc ) {
          case 2:
            times = atoi( argv[1] );
        } // switch
        alloc( 10, times );
        volatile int * arr = 0;                 // ignore volatile, prevents elimination of declaration & deallocation
        delete arr;
    } // main

Figure 1: Stack versus Dynamic Allocation

For example, the symbol £ is represented by Unicode value 0xA3 (binary 1010 0011). Since £ falls within the range of 0x80 to 0x7FF, it is encoded by the UTF-8 bit string 110xxxxx 10xxxxxx. To fit the character into the eleven bits of the UTF-8 encoding, it is padded on the left with zeroes to 00010100011. The UTF-8 encoding becomes 11000010 10100011, where the x's are replaced with the 11-bit binary encoding, giving the UTF-8 character encoding 0xC2A3 for symbol £. Note, UTF-8 is a minimal encoding; e.g., it is incorrect to represent the value 0 by any encoding other than the first UTF-8 binary encoding.

Use unformatted I/O to read the Unicode bytes and the data structure in Figure 2 to decode the bytes. The shell interface to the utf8 program is as follows:

    utf8 [ filename ]

(Square brackets indicate optional command-line parameters, and do not appear on the actual command line.) If no input file name is specified, input comes from standard input. Output is sent to standard output. Issue appropriate runtime error messages for incorrect usage or if a file cannot be opened. The input file contains an unknown number of packed UTF-8 characters, meaning there is no newline separation, but a newline can appear as a UTF-8 character.

Structure the program in two translation units.
One translation unit contains routine read:

    wchar_t read( istream & infile, character & ch );

which reads in sufficient bytes from infile to accumulate a valid UTF-8 character in ch.data, with the length of the UTF-8 character set in ch.length. It also returns the Unicode value of the UTF-8 character, e.g., for the UTF-8 character 0xC2A3, the value 0xA3 is returned. Routine read does not print. If read finds an error in the format of the UTF-8 character, it raises a UTF8err exception containing an appropriate message, e.g.:

    throw UTF8err( "length" );

indicating a problem with the encoded length of a UTF-8 character.

The other translation unit contains the main program, which handles the command-line arguments, calls read until end-of-file is raised, and does all necessary printing of valid UTF-8 characters or errors. Print the bytes of the UTF-8 character in hexadecimal. Hint: to print a character in hexadecimal use the following cast:

    char ch = 0xff;
    cout << hex << (unsigned int)(unsigned char)ch << endl;

    struct UTF8err {                            // exception
        const char * msg;
        UTF8err( const char * msg ) : msg( msg ) {}
    };
    struct character {
        union UTF8 {
            unsigned char ch;
            struct {                            // types for 1st utf-8 byte
                unsigned char dt : 7;
                unsigned char ck : 1;           // check
            } t1;
            struct {
                unsigned char dt : 5;
                unsigned char ck : 3;           // check
            } t2;
            struct {
                unsigned char dt : 4;
                unsigned char ck : 4;
            } t3;
            struct {
                unsigned char dt : 3;
                unsigned char ck : 5;
            } t4;
            struct {                            // type for extra utf-8 bytes
                unsigned char dt : 6;
                unsigned char ck : 2;
            } dt;
        } data[4];                              // bytes in UTF-8 character
        unsigned int length;                    // number of bytes in UTF-8 character
    }; // character

Figure 2: UTF8 Data Structure

For example, given the input file:

    $ od -t x1 infile
    0000000 23 d7 90 d7 c2 c2 a3 b0 e0 e3 e9 80 80 e0 93 90
    0000020 ff f0 90 89 f0 f0 90 89 80 01

the program prints:

    0x23 : valid value 0x23
    0xd790 : valid value 0x5d0
    0xd7c2 : invalid padding
    0xc2a3 : valid value 0xa3
    0xb0 : invalid length
    0xe0e3 : invalid padding
    0xe98080 : valid value 0x9000
    0xe09390 : invalid range
    0xff : invalid length
    0xf09089f0 : invalid padding
    0xf0908980 : valid value 0x10240
    0x01 : valid value 0x1

3. Write a C++ class named string that contains a sequence of UTF-8 characters. Since the name string is already used for the C++ string, it is important to prevent name clashes between the new UTF-8 strings and std::string. To prevent conflicts, place the UTF-8 string in its own namespace, called utf8. The interface for the UTF-8 string is:

    struct string {
        string();
        string(const string &);
        string(const char * );
        ~string();
        string & operator=(const string &);     // copy assignment operator
        void push_back( character ch );         // add one UTF-8 character to the end of the string
        void reserve( unsigned int n );         // if n > capacity, resize string to have enough space
                                                // for n UTF-8 characters
        character * chars;                      // dynamically allocated array of UTF-8 characters
        unsigned int length;                    // # of UTF-8 characters currently in the chars array
        unsigned int capacity;                  // maximum # of UTF-8 characters that chars can store
    };
    // IMPLEMENT INPUT, OUTPUT, ADDITION

Implement the appropriate constructors and destructor, and member routines push_back and reserve for the string type. Furthermore, overload the input, output, assignment, and addition operators for the UTF-8 string type. The following example illustrates how a UTF-8 string is used.

    using utf8::string;
    string s1;                  // create an empty string (length is zero, chars is NULL)
    string s2( "foobar" );      // create a UTF-8 string initialized with the character string foobar
    string s3( s2 );            // initialize UTF-8 string s3 with a copy of the UTF-8 string in s2
    string s4( "\xc2\xa3" );    // initialize with UTF-8 pound symbol
    cin >> s1 >> s4;            // read in whitespace-delimited UTF-8 strings from stdin
    cout << s1 << " " << s2 << " " << s3 << " " << s4 << endl; // print UTF-8 strings to stdout
    s1 = s1 + s4;               // concatenate UTF-8 strings s1 and s4
    s2 = s2 + "baz";            // concatenate UTF-8 string s2 and character string baz

Implementation notes

The declaration of the string type can be found in utf8string.h. For your submission you should add all routine and member definitions to utf8string.cc.

You are not allowed to use the C++ string type to solve this question. However, you may include the header cstring and use the functions declared therein. In particular, you may find memcpy useful.
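The reason for placing the new type in its own namespace can be illustrated with a small sketch. The namespace demo and the functions bothCoexist and usingDeclaration are hypothetical stand-ins chosen for illustration; the type shown is a stripped-down placeholder, not the required string interface.

```cpp
#include <string>

namespace demo {                    // hypothetical namespace, stands in for utf8
    struct string {                 // same name as std::string, but no clash
        unsigned int length;
        string() : length( 0 ) {}
    };
}

// Both types can be used side by side when each name is fully qualified.
unsigned int bothCoexist() {
    std::string text( "abc" );      // the standard string
    demo::string s;                 // the new type
    return static_cast<unsigned int>( text.size() ) + s.length;
}

// A using declaration makes the unqualified name refer to the new type
// within this scope only.
unsigned int usingDeclaration() {
    using demo::string;
    string s;                       // demo::string, not std::string
    return s.length;
}
```

Because the using declaration names one specific type, the rest of the program can still refer to std::string explicitly where it is permitted.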
For memory allocation, you must follow this allocation scheme: every default-constructed string begins with a capacity of 0. The first time data is stored in a default-constructed string, it is given a capacity of 5 and space is allocated accordingly. If the string was not allocated with the default constructor, you may choose a different, reasonable initial capacity. If at any point this capacity proves to be insufficient, you must double the capacity (for example, capacities can go from 5 to 10 to 20 to 40...). Note that there is no realloc in C++, so doubling the size of an array necessitates allocating a new array and copying items over. Your program must not leak memory.

Becoming familiar with cin.peek() and the isspace function declared in the <cctype> header may aid you in solving this question. In particular, note that cin.peek() does not by default skip leading whitespace. Also note that cin.peek() returns an int.

The provided driver (q3.cc) can be compiled with your solution to test (and then debug) your code. Please keep in mind that the purpose of the test harness is to provide a convenient means of verifying that the code you are asked to write is working correctly. Therefore, although some effort has been expended to make the harness reasonably robust, we do not guarantee that it is perfect, as that is not the point. The test harness should function correctly if you use it as intended; it may fail horribly if you abuse it. But the point of your testing is to verify your code, rather than the harness.
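The allocation scheme described above (capacity 0, then 5, then doubling) can be sketched on a simpler type. IntBuffer is a hypothetical buffer of ints invented for this illustration; it is not the assignment's string type, and copying is disabled here only to keep the sketch short.

```cpp
#include <cstring>                              // memcpy

// Hypothetical simplified buffer, used only to illustrate the
// 0 -> 5 -> 10 -> 20 ... capacity scheme.
struct IntBuffer {
    int * items;
    unsigned int length;
    unsigned int capacity;

    IntBuffer() : items( 0 ), length( 0 ), capacity( 0 ) {}
    ~IntBuffer() { delete [] items; }           // deleting a 0 pointer is a no-op
    IntBuffer( const IntBuffer & ) = delete;    // copying omitted in this sketch
    IntBuffer & operator=( const IntBuffer & ) = delete;

    // There is no realloc: allocate a new array, copy, free the old array.
    void reserve( unsigned int n ) {
        if ( n <= capacity ) return;
        int * bigger = new int[n];
        if ( length > 0 ) std::memcpy( bigger, items, length * sizeof( int ) );
        delete [] items;                        // no leak of the old array
        items = bigger;
        capacity = n;
    }

    void push_back( int value ) {
        if ( length == capacity ) reserve( capacity == 0 ? 5 : 2 * capacity );
        items[length] = value;
        length += 1;
    }
};
```

Using memcpy is reasonable here because the elements are plain data with no constructors or destructors of their own.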

As a hint, some of the operations are easier to implement than others. In addition, some of the operations are useful as helper functions for implementing the more difficult operations.

Submission Guidelines

Please follow these guidelines carefully. Review the Assignment Guidelines and C++ Coding Guidelines before starting each assignment. Each text file, i.e., *.*txt file, must be ASCII text and not exceed 500 lines in length, where a line is a maximum of 120 characters. Name your submitted files as follows:

1. new.txt: contains the information required by question 1.
2. utf8char.h, utf8char.{cc,c,cpp}, q2.{cc,c,cpp}: code for question 2. The program must be divided into separate compilation units with the file names given above. Program documentation must be present in your submitted code. Output for this question is checked via a marking program, so it must match exactly with the given program.
3. q2utf8.testtxt: test documentation for question 2, which includes the input and output of your tests. Write a brief description for each test explaining what aspects of the program it is testing and how you decided if the program passed the test.
4. utf8char.h, utf8char.{cc,c,cpp}, utf8string.h, utf8string.{cc,c,cpp}: code for question 3. The program must be divided into separate compilation units with the file names given above. Program documentation must be present in your submitted code. Output for this question is checked via a marking program, so it must match exactly with the given program.

Use the following Makefile to compile the programs for questions 2 and 3 (do not submit this file):

    CXX = g++-4.9                                   # compiler
    CXXFLAGS = -g -Wall -Werror -std=c++11 -MMD     # compiler flags
    MAKEFILE_NAME = ${firstword ${MAKEFILE_LIST}}   # makefile name

    OBJECTS2 = utf8char.o q2.o                      # object files forming executable
    EXEC2 = utf8ch                                  # executable name

    OBJECTS3 = utf8char.o utf8string.o q3.o         # object files forming executable
    EXEC3 = utf8str                                 # executable name

    OBJECTS = ${OBJECTS2} ${OBJECTS3}
    EXECS = ${EXEC2} ${EXEC3}
    DEPENDS = ${OBJECTS:.o=.d}                      # substitute ".o" with ".d"

    .PHONY : all clean

    all : ${EXECS}

    ${EXEC2} : ${OBJECTS2}                          # link step
    	${CXX} $^ -o $@

    ${EXEC3} : ${OBJECTS3}                          # link step
    	${CXX} $^ -o $@

    ${OBJECTS} : ${MAKEFILE_NAME}                   # OPTIONAL : changes to this file => recompile

    -include ${DEPENDS}                             # include *.d files containing program dependences

    clean :                                         # remove files that can be regenerated
    	rm -f ${DEPENDS} ${OBJECTS} ${EXECS}

Put this Makefile in the directory with your programs, name your source files appropriately, and then execute shell command make utf8ch or make utf8str in the directory to compile a program (make without an argument compiles all the programs). This Makefile is used by Marmoset to build programs, so make sure your programs compile with it. Do not make any changes to the Makefile.

Follow these guidelines. Your grade depends on it!