Final Project FAQ

Part I
Part II
Part III
Part IV
Part V

If you don’t see your question here, post it on Piazza or come to office hours! See the links in the navigation bar for both of those options.

Part I

When splitting the text in clean_text, what separator character should we pass in to the split method?

None. Because we want split to split on all whitespace characters (spaces, tabs, and newlines), you should not pass in a separator – not even a space. Rather, you should use empty parentheses when calling split, which will cause it to split on all whitespace characters.
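
As a small illustration of the difference (the sample string below is made up for this example), compare calling split with empty parentheses against passing in a single space:

```python
text = "the quick\tbrown\n  fox"

# No separator: splits on any run of whitespace and drops empty strings.
words = text.split()
print(words)     # ['the', 'quick', 'brown', 'fox']

# A single space as the separator: tabs and newlines stay inside the
# pieces, and consecutive spaces produce empty strings.
pieces = text.split(' ')
print(pieces)    # ['the', 'quick\tbrown\n', '', 'fox']
```

The second result is usually not what you want when counting words.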

My clean_text function splits the text into a list of words, and then it uses a loop to clean each word in the list. However, when I look at the return value, the individual words in the list have not been changed. What am I doing wrong?

Don’t forget that you can only change the internals of a list if you assign something to one of the positions in the list. For example, consider the following code fragment:
```
my_words = ['hello', 'world']

for w in my_words:
    w = w.upper()     # changes w, but *not* my_words!

print(my_words)       # prints ['hello', 'world']
```
If you run this code fragment, you’ll see that the contents of my_words are unchanged. That’s because the assignment inside the loop changes w, but it doesn’t change the contents of my_words. In order to change the contents of my_words, we would need to use an index-based loop.
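
A minimal sketch of such an index-based loop, reusing the same my_words list from the fragment above:

```python
my_words = ['hello', 'world']

# Loop over the positions, not the values, so that we can assign
# a new string back into each slot of the list.
for i in range(len(my_words)):
    my_words[i] = my_words[i].upper()

print(my_words)    # ['HELLO', 'WORLD']
```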

An easier approach would be to clean the text BEFORE splitting it!

Is my clean_text function good enough?

The clean_text function should at least remove the specified punctuation symbols and make every letter lowercase. Also remember that clean_text must return a list of words. This means you should split up the cleaned string inside of clean_text. If you are unsure of how to do this, check out the word_frequencies function in the example code online.

These are the minimum requirements. If you have time, you are welcome to take additional steps to further clean the text.

How do I update the words and word_lengths dictionaries?

You should start by reading the pseudocode we’ve given you for add_string.

Note that the for loop in the pseudocode goes through each word in a list called word_list that contains all of the words in the original string. You should complete the body of that loop so that, for each value of the variable w, it updates the frequency of w in the self.words dictionary. What are the keys for that dictionary? How can you correctly update that dictionary in light of the current word w? You may want to review the example code from PS 8 for a reminder of how to update a dictionary.

When you update word_lengths, what are the keys in the word_lengths dictionary? How can you transform a word into a key in this dictionary? Once you answer these questions, you can add the code needed to update word_lengths.
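
As a reminder of the general pattern for updating a dictionary of counts, here is a rough sketch. The function count_words is a hypothetical standalone helper written for this example; it is not part of the TextModel class, and you will still need to adapt the idea to self.words and to the keys of word_lengths.

```python
def count_words(word_list):
    """Return a dictionary mapping each word to how many times it appears."""
    counts = {}
    for w in word_list:
        if w not in counts:
            counts[w] = 1      # first time we have seen this key
        else:
            counts[w] += 1     # key already present: bump its count
    return counts

print(count_words(['the', 'cat', 'and', 'the', 'hat']))
# {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
```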

How do I read from a file?

In lecture, we presented two different ways to read the contents of a file. You can consult the lecture notes from a couple of weeks ago, or check out the example code online. In the problem set, we recommend reading the entire file into a single string and then adding that string to your model.
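
One possible way to read an entire file into a single string is sketched below. The helper name read_whole_file and the encoding argument are assumptions made for this example, and the approach shown in lecture may look slightly different.

```python
def read_whole_file(filename):
    """Return the full contents of a text file as one string."""
    # The with statement closes the file automatically when the block ends.
    with open(filename, 'r', encoding='utf8') as f:
        return f.read()
```

You could then call it with the name of a file in your project folder, for example text = read_whole_file('foo.txt'), and pass the resulting string to your model.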

How can I test add_file?

In Spyder, open a new file. It doesn’t matter what you call it, but you must save it in your project folder. Add a few sentences to the file and save it. Suppose you called the file foo. Try adding the file to a TextModel object:
```
model = TextModel("Test")  # you can call the model anything you want 
model.add_file("foo")      # we want to add the file `foo` to the model
```
(Note: You should replace "foo" with the full name of the file that you saved in Spyder. If Spyder gave the file a .py extension, you should include that .py in the name of the file. If Spyder gave the file a .txt extension, you should include that .txt in the name of the file.)

Now try printing the model and the dictionaries that it contains. Do the right words and frequencies appear in the model? If everything looks good, then your add_file function should be fine. If not, it may be an issue in add_file or in any methods you use inside of the function. You can use debugging print statements to narrow down the cause of the issue.

Part II

How do I read from a file?

In lecture, we presented two different ways to read the contents of a file. You can consult the lecture notes from a couple of weeks ago, or check out the example code online. In the problem set, we recommend reading the entire file into a single string and then adding that string to your model.

Why are my save_model and read_model functions not working properly?

Go through the test case we give you in the assignment one step at a time. After you save a model, open one of the dictionary files using a text editor (such as the editor in Spyder). Are the correct dictionaries inside of the files? If so, the issue is likely inside of your read_model function. Remember that after you read the dictionaries from the appropriate file, you must store them somewhere in the TextModel object. For example, to store the word-frequency dictionary, you must do an assignment that looks something like this:
```
self.words = ...
```
where you replace ... with the correct expression.
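
To check a save-and-read round trip outside of the class, you can experiment with something like the sketch below. It assumes that a dictionary is saved as its string representation in a text file; the assignment may specify a different format, and the helper names save_dict and load_dict are hypothetical.

```python
import ast

def save_dict(d, filename):
    """Write the dictionary d to a text file as its string representation."""
    with open(filename, 'w', encoding='utf8') as f:
        f.write(str(d))

def load_dict(filename):
    """Read back a dictionary written by save_dict."""
    with open(filename, 'r', encoding='utf8') as f:
        d_str = f.read()
    # literal_eval parses Python literals (dicts, strings, numbers) only,
    # so it will not execute arbitrary code found in the file.
    return ast.literal_eval(d_str)
```

Note that a helper like load_dict only returns the dictionary. Inside read_model, the result still has to be assigned to the appropriate attribute of the TextModel object.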

Part III

How do we update sentence_lengths in our add_string method?

Let’s try to break this problem up into smaller parts. The first thing you can do is split your string into a list of words, but without removing any punctuation. If you were to go through every word in this list, what would it mean if you found a word that ended with a punctuation mark? How could you use this fact to count the number of words in each sentence? You will need to use some type of cumulative computation, and you should be careful to reset your count as needed.

When splitting the text in order to determine the sentence lengths, what separator character should we pass in to the split method?

None. Because we want split to split on all whitespace characters (spaces, tabs, and newlines), you should not pass in a separator – not even a space. Rather, you should use empty parentheses when calling split, which will cause it to split on all whitespace characters.

What criteria should we use for determining when a sentence ends?

You should use the same sentence-ending characters that we used in PS 8: a period ('.'), question mark ('?') or exclamation point ('!').

In addition, the last sentence in a string or file should be considered a sentence, even if the final word does not end with a sentence-ending character.
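
The rules above can be sketched as a standalone function. This is only one possible approach: it assumes that sentence_lengths maps a sentence length (in words) to the number of sentences with that length, and the function below is a hypothetical helper rather than the required add_string code.

```python
def sentence_lengths(text):
    """Return a dictionary mapping sentence length (in words) to frequency."""
    lengths = {}
    count = 0                      # words seen so far in the current sentence
    for w in text.split():         # split on all whitespace, keep punctuation
        count += 1
        if w[-1] in '.?!':         # this word ends a sentence
            lengths[count] = lengths.get(count, 0) + 1
            count = 0              # reset for the next sentence
    if count > 0:                  # trailing words with no end punctuation
        lengths[count] = lengths.get(count, 0) + 1
    return lengths

print(sentence_lengths('Hi there. How are you? Fine thanks'))
# {2: 2, 3: 1}
```

The final if statement handles the case in which the last word does not end with a sentence-ending character.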

Part IV

The numbers that my compare_dictionaries function produces seem too negative. What is an acceptable range?

There is no specific range of numbers required other than the fact that similarities should be less than or equal to 0. If you are getting positive numbers, then you should try debugging your similarity score function.
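
To see why scores of this kind can never be positive, the sketch below assumes that a score is a sum of logarithms of probabilities. That is an assumption suggested by the non-positive range described above, not a statement of the required formula, and the probabilities are made up.

```python
import math

# Hypothetical probabilities, one per item being compared.
probs = [0.5, 0.01, 1.0]

# log(p) is negative for 0 < p < 1 and exactly 0 for p == 1,
# so a sum of such terms can never be positive.
score = sum(math.log(p) for p in probs)
print(score <= 0)    # True
```

Under that assumption, every additional term can only lower the total, so longer texts naturally produce more negative scores.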

How do I know if my methods are working?

One thing to try is the test function that we give you near the end of Part IV. Copy and paste this function into the bottom of your finalproject.py file – outside of the TextModel class – and try calling test() from the Shell. Compare the scores that you get for source1 and source2 with the ones that we get. Not all of the scores should be the same, but some of them should be – in particular, the first scores (the ones based on the words dictionaries) and the second scores (the ones based on the word_lengths dictionaries) should be the same as ours.

It’s also possible that your classify method may conclude that mystery is more likely to have come from source2 (rather than source1, as our method concludes). That may be fine as well. Just make sure that your classify method is drawing the correct conclusion from the two lists of five scores that your code produces.

When I test my code using the test() function that you provide or when I submit my code on Gradescope, I seem to be getting an incorrect score for just the word_lengths dictionary. Why would that be the case?

In the instructions for the similarity_scores method, there is a note labeled Important. Review that note, and make sure that your similarity_scores method is following its guidelines. If you aren’t following the guidelines in that note, it’s possible in certain cases for some but not all of the scores to be incorrect.

Part V

I get some unexpected results when I compare texts. Is that a problem?

Not necessarily. Depending on the texts that you use to build your models, it’s possible that the classifications may not be correct. For example, your method may claim a GQ article is from Cosmopolitan, or vice versa. This can happen, and it does not necessarily indicate that your code is wrong. As long as you get reasonable results for our test() function from Part IV (see above), you should be fine.

Make sure to document your results in your reflections, including any unexpected results that you encounter. Try to think of reasons for the unexpected results. Does our approach to modeling a body of text miss some important features that differentiate the sources that you used?

Last updated on April 28, 2025.