This homework consists of four problems: the first three exercise your understanding of symbol tables, and the last will help you understand both the implementation and behavior of hash tables. You MUST make sure that you do the following:
Submit each of problems 1-3 in a file called Answers.05.txt. Each problem is worth 10 points.
For this problem, you are going to solve an interesting problem that arises when we search web documents: it is often the case that you are reading a paper or a web page and would like to find other documents that are similar to it. For example, you are doing research about hoplite tactics during the Battle of Marathon, the author has an interesting take on the physical evidence revealed by archeology, and you would like to find other articles that approach the problem from the same angle; or perhaps you are an English teacher reading a student paper that sounds very familiar, and you would like to see whether it has been plagiarized from the web. Of course you can Google keywords, but another way to estimate the similarity of documents is to build a concordance of each document (i.e., a list of all the words in the document, perhaps with other information such as a list of occurrences or simply the number of occurrences of each word) and then check how big the intersection of the concordances is (i.e., how many words appear in both documents).

In this program, we will perform a test like this to estimate how similar a number of documents are, by building a single concordance of all the words in all the documents, and recording for each word how many occurrences there were in each document. We will output a sorted list of all the words in the concordance, ordered by the total number of occurrences of each word.
The client program TextAnalyzer will read from the standard input a series of lines of text. Each line is assumed to be part of a document: a line beginning with "$next$" separates one document from the next, and a line beginning with "$end$" marks the end of the input (see the sample interaction below). At most 10 documents can be processed in one session; you may input any number up to that limit. The input for each document will be stripped of punctuation (,.!:;"'?()) and converted to all lower case, and then "tokenized" into an array of separate words (which were separated by white space in the original line). Each word is then inserted into the concordance along with the number of the document it came from.
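The stripping and tokenizing step can be done with a regular expression. Here is one possible sketch; the class and method names are our own choices, not mandated by the assignment:

```java
public class Tokenizer {
    // Strip the listed punctuation characters, convert to lower case,
    // and split the remainder on runs of white space.
    public static String[] tokenize(String line) {
        String cleaned = line.replaceAll("[,.!:;\"'?()]", "").toLowerCase().trim();
        if (cleaned.isEmpty()) {
            return new String[0];   // avoid split("") returning [""]
        }
        return cleaned.split("\\s+");
    }
}
```

Note the guard for an empty cleaned line: calling split on an empty string would otherwise yield a one-element array containing "".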
The outline of your client code should be something like this:
    Scanner in = new Scanner(System.in);
    int docNumber = 0;
    // print a welcome message and explain what to do
    String line = in.nextLine();
    while ( the line does not start with the terminator word ) {
        if ( the line starts with the separator word )
            ++docNumber;
        else {
            remove punctuation from line and convert to all lower case
            String[] lineArray = line.split("\\s+");
            insert each member of the array (the words that were separated
            by white space) into the concordance with their document number
        }
        line = in.nextLine();
    }
    // remove all the words from the concordance, sort them by total number
    // of occurrences and then alphabetically, and print them out
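As one way to check your understanding of the loop above, here is a runnable sketch of the input-processing logic. To keep it self-contained, a plain HashMap mapping each word to its per-document counts stands in for the Concordance class you will write (and the blacklist check is omitted); the class and method names are our own:

```java
import java.util.*;

public class TextAnalyzerSketch {
    static final int MAX_DOCS = 10;

    // Process input lines up to the "$end$" terminator, counting
    // occurrences of each word per document.
    public static Map<String, int[]> process(List<String> lines) {
        Map<String, int[]> conc = new HashMap<>();
        int docNumber = 0;
        for (String line : lines) {
            if (line.startsWith("$end$")) break;          // terminator
            if (line.startsWith("$next$")) {              // separator
                ++docNumber;
                continue;
            }
            String cleaned = line.replaceAll("[,.!:;\"'?()]", "")
                                 .toLowerCase().trim();
            if (cleaned.isEmpty()) continue;
            for (String w : cleaned.split("\\s+")) {
                conc.computeIfAbsent(w, k -> new int[MAX_DOCS])[docNumber]++;
            }
        }
        return conc;
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        List<String> lines = new ArrayList<>();
        while (in.hasNextLine()) lines.add(in.nextLine());
        System.out.println(process(lines).size() + " distinct words");
    }
}
```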
You will build a class Concordance which uses a hash table to store linked-list buckets of nodes which contain the following information:
String word;
int [] count;
the second being an array that records how many times the word occurred in document 0, 1, 2, etc. Every time you insert a word into the concordance, you will need to hash the word using a suitable hash function for strings, and then search the corresponding bucket for the word. If it already appears, update its count array to record another occurrence of the word for this document; if the word does not appear, add it to the bucket, recording a count of 1 for this document.
The size of your hash table should be some relatively large prime number (Google "list of primes" to find one).
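Putting the pieces together, a minimal sketch of the Concordance class might look like the following. The table size 4999 is just one choice of a reasonably large prime, and the Horner's-rule hash is a standard string hash; both are assumptions, not requirements:

```java
public class Concordance {
    private static final int M = 4999;        // a large prime table size
    private static final int MAX_DOCS = 10;

    private static class Node {               // one entry in a bucket's list
        String word;
        int[] count = new int[MAX_DOCS];
        Node next;
        Node(String word, Node next) { this.word = word; this.next = next; }
    }

    private final Node[] table = new Node[M];

    // Horner's-rule string hash, reduced mod M at every step to avoid overflow.
    private int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (31 * h + s.charAt(i)) % M;
        }
        return h;
    }

    // Record one occurrence of word in the given document.
    public void insert(String word, int doc) {
        int h = hash(word);
        for (Node p = table[h]; p != null; p = p.next) {
            if (p.word.equals(word)) {        // already present: bump its count
                p.count[doc]++;
                return;
            }
        }
        table[h] = new Node(word, table[h]);  // new word: prepend to the bucket
        table[h].count[doc] = 1;
    }

    // Look up the count array for a word (null if absent) -- useful for testing.
    public int[] counts(String word) {
        for (Node p = table[hash(word)]; p != null; p = p.next) {
            if (p.word.equals(word)) return p.count;
        }
        return null;
    }
}
```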
Since it is not very useful to count simple words such as "and", "or", "but" and so on, you should build a "blacklist" of words that never get inserted into the concordance---they should simply be ignored when you see them. You should include the following words in your blacklist:
and, or, but, however, he, she, it, thus, when, not, now, is, the, of, this, a, that, been
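One simple way to implement the blacklist is a HashSet checked before each insertion; the class and method names below are our own choices:

```java
import java.util.*;

public class Blacklist {
    // The assignment's list of words that never get inserted.
    private static final Set<String> SKIP = new HashSet<>(Arrays.asList(
        "and", "or", "but", "however", "he", "she", "it", "thus", "when",
        "not", "now", "is", "the", "of", "this", "a", "that", "been"));

    public static boolean blocked(String word) {
        return SKIP.contains(word);
    }
}
```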
Here is a sample interaction with the program. Red is the user input. We will test your code by hand, and then give it some relatively large documents by redirecting the standard input.
Welcome to the Text Intersection Program!
You may load up to 10 documents by typing each to the standard input, separating each by a
line starting with the word "$next$"; insert a line starting with "$end$" to terminate the list
and output the results.
This is the first line of the first document, and
this is the second line of that document.
$next$
THis is the first line of the second one, and
this is the second,
no, the third line.
$next$
And this is, finally, the third document; which
is "supposed" to be the last one.
Yeah, I guess it will be the last one.
$end$
Concordance:
line [2][2][0]
document [2][0][1]
first [2][1][0]
one [0][1][2]
second [1][2][0]
be [0][0][2]
last [0][0][2]
third [0][1][1]
finally [0][0][1]
guess [0][0][1]
i [0][0][1]
no [0][1][0]
supposed [0][0][1]
to [0][0][1]
which [0][0][1]
will [0][0][1]
yeah [0][0][1]
Note that we output the concordance in sorted order, by greatest total frequency first, then alphabetically to break ties within a given frequency. We'd like you to do this too. One method is to use heapsort: dump the entries of the concordance into a heap-based priority queue after you've built it up, and then print the entries as they are deleted. Think carefully about what you'd like your keys to be. Hint: you may need to implement Comparable.
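One possible key for a max-oriented priority queue is sketched below: an entry compares as "greater" if its total count is larger, or, on a tie, if its word comes earlier alphabetically, so that repeatedly deleting the maximum yields exactly the order shown above. The class name Entry is our own choice:

```java
public class Entry implements Comparable<Entry> {
    final String word;
    final int[] count;
    final int total;

    Entry(String word, int[] count) {
        this.word = word;
        this.count = count;
        int t = 0;
        for (int c : count) t += c;     // total occurrences across documents
        this.total = t;
    }

    // Greater total wins; on a tie, the alphabetically earlier word
    // compares as greater (note the reversed comparison), so it is
    // deleted from a max-PQ first.
    @Override
    public int compareTo(Entry other) {
        if (total != other.total) return Integer.compare(total, other.total);
        return other.word.compareTo(word);
    }
}
```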