CS 525 -- Assignment Two

Due Monday, May 4th@ 5pm in CS Homework Station

 


Homework Guidelines

Submission Instructions: Please hand in a hard copy of your solutions to all problems, with a suitable cover page with your name, the date, the course, and the number of the assignment. Do each problem on a separate sheet clearly labelled at the top (e.g., "Problem 3"), and staple all pages together in the upper-left-hand corner (do NOT submit solutions separately or update your solutions after handing in your assignment). For any problem statement that asks multiple questions, answer each in a separate paragraph, labelled (a), (b), etc. Submit "pencil and paper" problems into the CS 525 mail slot (not the shelf) in the CS homework station; submit code or files produced by software using Gsubmit as specified on the course home page. A small penalty will be taken for any homework that does not follow these guidelines.

Late Policy: Both parts of the homework (pencil and paper, and electronic submission) are due at 5 pm on Monday; late homeworks will be penalized at the rate of 25% for each weekday late (each day ending at 5pm, weekends not included).

Policy on Academic Honesty: Please see the end of this homework for comments on the University policy for academic honesty that I adhere to.


 

Problem 1

Convert the following RE to an NFA as discussed in class:

(a|b)* aa(a|b)*

You may take shortcuts as discussed in class, but you must show all work for full credit

Problem 2

Convert the NFA from the last problem to a DFA using the Subset Construction. You may simplify the NFA by removing silly epsilon transitions before applying the construction. Show all work. Is this DFA minimal? Explain why or why not (be precise!), and if it is not minimal, construct a minimal automaton.

 

Problem 3

This problem has two parts. (3.1) Prove that the language { w | w is a string of bits with exactly twice as many 1's as 0's } is not regular, using the Pumping Lemma. (3.2) Suppose someone gives you the following proof that the language { w | w is a string of bits with an odd number of 1's } is NOT regular: "Using the Pumping Lemma, choose the three strings xyz such that y contains an odd number of 1's (e.g., it consists of the first occurence of a 1 in the string); then you can pump the string y twice, to get a string with an even number of 1's; hence the language is not regular." Clearly this proof does not work, as the language is obviously regular. Where does it go wrong? Be as precise as possible.

Problem 4

In this problem you will construct a lexical analyzer for your term project (by yourself--we will form groups for the last stages of the project), by extending a bare-bones lexer that I supply.

What To Do First

Look at the sections on Lex in the Lex & Yacc book.If you read this material carefully, you will see that this assignment is not at hard as it looks, as much of what you need to do is spelled out explicitly in the text.

Details of the Lexer

You will write a lexical analyzer for a subset of C. For now, you will use the following tokens:

Keywords: main, int, float, if, else, while, for, break, struct, return, continue

Operators: +, ++, -, --, %, *, /, =, &&, ||, !, <, <=, >, >=, ==, !=

Punctuation: , (comma), ; (semicolon), (, ), {, }, [, ]

plus identifiers and integer constants (no floating point constants!). These will suffice for the subset of C we will be translating. Your numbers should be unsigned, as we will be adding signs using the unary + and - in the grammar for expressions.

For now, you can assign any token numbers you want to these, starting at 1 (0 is reserved for a special use by lex). You will need to modify the sample lexer given below to recognize these tokens, and return the token number for each one. You will need to write your main() routine in the analyzer in such a way that you loop through the input file, calling yylex() until the end of file, printing out the token numbers and the lexemes (this is done in the sample lexer provided below).

For the purposes of this assignment, you will also need to write a routine to print out the contents of the symbol table. Do not take the easy way out and simply print the identifiers out as they are found, and before you insert them; I want to see the whole symbol table at the end of the run. When you incorporate the lexer into the compiler, of course you will not need this looping main() routine or the printing routines. We will discuss this when you get your next project.

Symbol Table

You must implement your symbol table as a hash table that stores structs (or classes) using the lexeme as the key; you must be able to add new identifiers to the table, using a function insert(key) that returns a pointer to the new entry, or a pointer to an existing entry if the key is already in the table (you should not insert duplicates!). For now the struct or class for each entry can have only a key value; you may add other fields later. The details of the hash table implementation are up to you, but, of course, you should make intelligent choices based on your deep knowledge of such things from CS 112.

I am having you do the symbol table routines now, because there will be plenty to do in the next assignment when you build the parser; however, you will be calling your symbol table routines in the final project from the parser, not the lexer. Thus, you should use appropriate modularity when designing these routines so that they can be moved when necessary.

Error Reporting

You should keep track of line numbers in the input file, and when you find an illegal character in the input, you should print the line number and the line itself, with some indication of the error, e.g.,

    Error: Illegal character on line 13:

    while ( A &$ B ) {
               ^
    Terminating compilation!

You can read about how to do this on pp.246-7 in the Lex & Yacc book; note however that yyerror() is a yacc function, and you should not use it (yet). Just put your error reporting code in the actions following the default pattern that catches any characters not belonging to a correct token (as done in the sample code).

Don't worry about error recovery: you can simply halt after printing the error message.

Lex Bugs

There are bugs in Lex, and some of these are discussed in the Lex and Yacc book. There are a few others not mentioned. The weirdest is that your .l file must have a blank line at the end; if you look at the sample file you will see that after the final "}" of the main() function I have put a blank line. If you remove this you will get some bizarre errors. Another annoying thing is that when Lex creates the file "lex.yy.c" of course this is a C program, and it defines various constants and etc. of its own, and if you HAPPEN to use the same name, this will cause a problem. For example, if you define a constant "BEGIN" corresponding to the "begin" keyword in Pascal, since Lex defines a constant "BEGIN" itself, you will get an error. In general, watch out for using constants that have familiar names; use bizarre names and you should be ok.

In general, keep in mind that you are inserting code into the middle of some already-existing code when you look at any error messages that might crop up. You need to look at the lex.yy.c program to see what the error is and then try to figure out what you did wrong in the .l file. Try deleting code to isolate the error, moving things around, and so on.

Sample Lexical Analyzer and Test Files

A bare-bones lexical analyzer (without symbol table and without error reporting) based on the one in the book is provided with test files in the following directory:

www.cs.bu.edu/~snyder/cs525/proj1/

You can start with this and add functionality to construct this project. The header comment describes how to run the program.

What to Hand In

For problems, 1 -- 3, you must write up solutions as described at the top of this document. For problem 4, you must provide hardcopy and electronic copy. Specifically: create your lexer in a file called lexer.l, then run your it on the three test files found in the directory given above, saving the result of the correct execution on all test files in a script file called session1.txt. You must simultaneous Gsubmit these two files into a directory called P1 (upper case 'P' followed by a digit '1') in your gsubmit directory. Instructions for using Gsubmit are on the class web site.


Policy on Academic Conduct

Whenever you submit a piece of academic work and sign your name to it, you are verifying that this work is the result of your own intellectual efforts; it is Plagiarism to submit work solely under your own name in which:

In this course, I will not allow you to submit any joint work, although in other courses and in other contexts you may be permitted to do this. In general, you must always provide proper attribution of authorship by naming all persons, books, or resources that provided intellectual content towards the final result. In some cases, you will need to describe the extend of the contribution, particularly when literally copying the words or artistic artifacts of another. To fail to provide proper attribution is plagiarism. Plagiarism demeans the seriousness of what we do in class, and does not allow you as a student to obtain a fair grade for the results you worked hard for.

I will discuss the issue of plagiarism with the grader, and use, when appropriate, automated tools to check submitted programs and files for copying. I don't like to be involved in this, but it is necessary in order to provide a fair environment for your work. In cases of suspected plagiarism, I will discuss the matter with the student(s) involved and, when warrented, submit the case to the Academic Conduct Committee (of which I was chairman for three years). Unfortunately, students in my classes violate these policies several times a year on average; it is my hope that by being completely clear and straight-forward about my policy that I can reduce this number. Please take this issue serious and talk to me if you have any concerns or questions.