150 points; pair-optional
Parts I and II are due by 11:59 p.m. on Thursday, December 4, 2025
The full project is due by 11:59 p.m. on Tuesday, December 9, 2025
Important: This assignment cannot be replaced by the final exam, so it is essential that you complete it.
For the remainder of the semester, you will be working on a final course project. The project is a bit larger in scope than a problem set and is thus worth 150 points. We encourage you to read through the final project description in its entirety before starting.
You may work on the project individually or as a pair. Since this is a large project, we especially encourage you to work in pairs, but please review the collaboration policies of the course before doing so.
If you have questions while working on the final project, please
come to office hours, post them on Piazza, or email
cs111-staff@cs.bu.edu.
Make sure to submit your work on Gradescope, following the procedures found below.
The project will give you an opportunity to use Python to model, analyze, and score the similarity of text samples. In essence, you will develop a statistical tool that will allow you to identify a particular author or style of writing!
Background
Statistical models of text are one way to quantify how similar one piece of text is to another. Such models were used as evidence that the book The Cuckoo’s Calling was written by J. K. Rowling (using the name Robert Galbraith) in the summer of 2013. Details on that story and the analysis that was performed appear here.
The comparative analysis of Rowling’s works used surprisingly simple techniques to model the author’s style. Possible “style signatures” could include:
Other similar measures of style are also possible. Perhaps the most common application of this kind of feature-based text classification is spam detection and filtering (which you can read more about here, if you’re interested).
Your task
To allow you to compare and classify texts, you will create a statistical model of a body of text using several different techniques. At a minimum, you should integrate five features:
word frequencies
word-length frequencies
stem frequencies
frequencies of different sentence lengths
one other feature of your choice. You might choose one of the features used in the analysis that revealed Rowling’s authorship of The Cuckoo’s Calling. Or you could choose something else entirely (e.g., something based on punctuation) – anything that you can compute/count using Python!
Important: Like the other features in the model, your extra feature must involve maintaining frequencies – i.e., counting how often different things occur.
In addition, the feature should not be too closely tied to a specific type of text. Rather, it must be generic enough to apply to any reasonably long English paragraph.
Computing word lengths from words should be fairly easy, but computing stems, sentence lengths, and your additional feature is an additional part of the challenge, and it offers plenty of room for algorithmic creativity!
In general, the project description below will offer fewer guidelines than the problem sets did. If the guidelines do not explicitly mandate or forbid certain behavior, you should feel free to be creative!
Create a subfolder called project within your
cs111 folder, and put all of the files for this assignment in that
folder.
For the first part of the project, you will create an initial version
of a TextModel class, which will serve as a blueprint for objects
that model a body of text (i.e., a collection of one or more text documents).
To get started, open a new file in Spyder and name it
finalproject.py. In this file, declare a class named
TextModel. Add appropriate comments to the top of the file.
Write a constructor __init__(self, model_name) that
constructs a new TextModel object by accepting a string
model_name as a parameter and initializing the following three
attributes:
name – a string that is a label for this text model, such as
'JKRowling' or 'Shakespeare'. This will be used in the
filenames for saving and retrieving the model. Use the
model_name that is passed in as a parameter.
words – a dictionary that records the number of times each word
appears in the text.
word_lengths – a dictionary that records the number of times
each word length appears.
Each of the dictionaries should be initialized to the empty
dictionary ({}). In Part III, you will add dictionaries
for the other three features as well.
For example:
    >>> model = TextModel('J.K. Rowling')
    >>> model.name
    result: 'J.K. Rowling'
    >>> model.words
    result: {}
    >>> model.word_lengths
    result: {}
Write a method __repr__(self) that returns a string that
includes the name of the model as well as the sizes of the
dictionaries for each feature of the text.
For example:
    >>> model = TextModel('J.K. Rowling')
    >>> model
    result: text model name: J.K. Rowling
     number of words: 0
     number of word lengths: 0
    >>> model.words = {'love': 25, 'spell': 275, 'potter': 700}
    >>> model
    result: text model name: J.K. Rowling
     number of words: 3
     number of word lengths: 0
Here again, information about the other feature dictionaries will eventually be included, but for now only the first two dictionaries are needed.
Notes:
Remember that the __repr__ method should create a single string
and return it. You should not use the print function in this
method.
Since the returned string should have multiple lines, you will
need to add in the newline character ('\n'). Below is a
starting point for this method that already adds in some newline
characters. You will have to expand upon this to finish the
method:
    def __repr__(self):
        """Return a string representation of the TextModel."""
        s = 'text model name: ' + self.name + '\n'
        s += ' number of words: ' + str(len(self.words)) + '\n'
        return s
In the final version of your class, the returned string should not include the contents of the dictionaries because they will become very large. However, it may be a good idea to include their contents in the early stages of your code to facilitate small-scale testing.
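To illustrate the two notes above on a smaller scale, here is a minimal sketch that uses a hypothetical Inventory class (it is not part of the project, and its attribute names are made up for this example). Its __repr__ builds one multi-line string with '\n' and returns it without calling print:

```python
class Inventory:
    """Hypothetical class, used only to illustrate __repr__."""

    def __init__(self, owner):
        self.owner = owner
        self.items = {}

    def __repr__(self):
        """Build and return a single multi-line string; do not print it."""
        s = 'inventory owner: ' + self.owner + '\n'
        s += '  number of items: ' + str(len(self.items))
        return s


inv = Inventory('Ada')
inv.items = {'quill': 2, 'ink': 1}
text = repr(inv)    # the same string that print(inv) would display
print(text)
```

Because __repr__ returns the string instead of printing it, the result can be stored in a variable, compared in a test, or printed later.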
Write a helper function named clean_text(txt) that takes a
string of text txt as a parameter and returns a list containing
the words in txt after it has been “cleaned”.
This function will be used when you need to process each word in
a text individually, without having to worry about punctuation
or special characters.
Notes:
Because this is a regular function and not a method, you
should define it outside of your TextModel class (e.g.,
before the class header). And when you call clean_text, you
should just use the function name; you will not need to
prepend a called object. The reason for implementing
clean_text as a function rather than a method is that it
doesn’t need to access the internals of a TextModel object.
Your clean_text must at least do the following:
Remove the following punctuation symbols: period (.), comma (,),
question mark (?), double quote ("), single quote ('),
exclamation point (!), semicolon (;), and colon (:).
(See below for our recommended approach for doing this.)
Convert all of the letters to lowercase (which you can do
using the string method lower).
Split the text into a list of words that is then returned.
For example:
    >>> clean_text('How are you? Fine, thanks. How about you?')
    result: ['how', 'are', 'you', 'fine', 'thanks', 'how', 'about', 'you']
You are also welcome to take additional steps as you see fit.
You may find it helpful to use the string method replace. To
remind yourself of how it works, try the following in the console:
    >>> s = 'Mr. Smith programs.'
    >>> s = s.replace('.', '')
    >>> s
    result: 'Mr Smith programs'
Note that you can avoid the need for multiple calls to replace
by using a loop to remove one punctuation symbol at a time.
For example:
    for symbol in """.,?"'!;:""":
        # use replace to remove symbol from your text
Note that we use triple quotes to surround the string of punctuation symbols so that we can include a single-quote character and a double-quote character within the string.
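As one possible way to picture the loop described above, here is a small sketch. The helper name strip_symbols and the sample sentence are hypothetical and chosen only for illustration. Since strings are immutable, replace returns a new string, so the result must be assigned back to the variable on each pass:

```python
def strip_symbols(text, symbols):
    """Return text with every character in symbols removed."""
    for symbol in symbols:
        # replace does not change text in place; it returns a new string.
        text = text.replace(symbol, '')
    return text


sample = 'Wait... what?! "Really"; yes: really.'
stripped = strip_symbols(sample, """.,?"'!;:""")
print(stripped)    # Wait what Really yes really
```

Forgetting the assignment (writing text.replace(symbol, '') on its own line) is a common mistake, because the call then has no lasting effect.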
When splitting the text into a list of words, use the split
method without any inputs, as we did in the Markov model problem
in PS 8.
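To see why split should be called without any inputs, the following sketch compares it with split(' ') on a made-up string that contains extra spaces, a newline, and a tab:

```python
line = '  how   are\nyou\ttoday  '

# No argument: split on any run of whitespace (spaces, tabs, newlines)
# and ignore leading and trailing whitespace.
words = line.split()

# Explicit single space: every space is a separator, so runs of spaces
# produce empty strings, and the newline and tab are left inside a piece.
pieces = line.split(' ')

print(words)    # ['how', 'are', 'you', 'today']
print(pieces)
```

With no argument, the list contains only the words themselves, which is what a word-counting loop needs.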
Write a method add_string(self, s) that adds a string of text
s to the model by augmenting the feature dictionaries defined in the
constructor. It should not explicitly return a value.
For example:
    >>> model = TextModel('test')
    >>> model.add_string('How are you? Fine, thanks. How about you?')
    >>> model.words
    result: {'how': 2, 'are': 1, 'you': 2, 'fine': 1, 'thanks': 1, 'about': 1}
    >>> model.word_lengths
    result: {3: 5, 4: 1, 6: 1, 5: 1}
(Note: It’s okay if the contents of the dictionaries are printed in a different order, since items in a dictionary do not have a position.)
Here is some pseudocode to get you started:
    def add_string(self, s):
        """Analyzes the string s and adds its pieces
           to all of the dictionaries in this text model.
        """
        # Add code to clean the text and split it into a list of words.
        # *Hint:* Call one of the functions you have already written!
        word_list = ...

        for w in word_list:
            # Update self.words to reflect w
            # either add a new key-value pair for w
            # or update the existing key-value pair.

        # Add code to update self.word_lengths
For now, you should complete the pseudocode that we’ve given you,
and then add code to update the word_lengths dictionary. Later,
you will extend the method to update the other dictionaries as
well.
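The "add a new key-value pair or update the existing one" pattern mentioned in the pseudocode can be sketched on its own, apart from the TextModel class. The function name tally and the sample list below are hypothetical and used only for illustration:

```python
def tally(items):
    """Return a dictionary mapping each item to how often it occurs."""
    counts = {}
    for item in items:
        if item in counts:
            counts[item] += 1    # seen before: update the existing pair
        else:
            counts[item] = 1     # first occurrence: add a new pair
    return counts


colors = ['red', 'blue', 'red', 'green', 'red']
color_counts = tally(colors)
length_counts = tally([len(c) for c in colors])

print(color_counts)     # {'red': 3, 'blue': 1, 'green': 1}
print(length_counts)    # {3: 3, 4: 1, 5: 1}
```

The same pattern works whether the keys are strings (as in words) or integers (as in word_lengths).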
Write a method add_file(self, filename) that adds all of the
text in the file identified by filename to the model. It should
not explicitly return a value.
Important: When you open the file for reading, you should specify two additional arguments as follows:
f = open(filename, 'r', encoding='utf8', errors='ignore')
These encoding and errors arguments should allow Python to handle special characters (e.g., “smart quotes”) that may be present in your text files.
Hints:
You may find it helpful to consult the lecture notes on
file-reading from earlier in the semester, or check out the
example code online. Rather than reading the file
line-by-line, it makes sense to use the read() method to
read the entire file into a single string, and then add
that string to your model.
Take advantage of add_string()!
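Here is a small sketch of reading an entire file into one string with read(), using the open arguments given above. The temporary folder and the file name sample.txt are assumptions made only so that the example can run by itself:

```python
import os
import tempfile

# Create a throwaway file so that there is something to read back.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
f = open(path, 'w', encoding='utf8')
f.write('“Smart quotes” and\nmore than one line.')
f.close()

# Read the whole file at once, as recommended in the hints.
f = open(path, 'r', encoding='utf8', errors='ignore')
text = f.read()    # one string, including the newline characters
f.close()

print(len(text.split()))    # 7
```

Because read() returns a single string, the result can be passed along in one step rather than line by line.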
At this point, you are ready to run some additional tests on your methods. For example, you could try entering the following commands from the console, but remember that the contents of your dictionaries may be printed in a different order.
    >>> model = TextModel('A. Poor Righter')
    >>> model.add_string("The partiers love the pizza party.")
    >>> print(model)
    text model name: A. Poor Righter
     number of words: 5
     number of word lengths: 4
    >>> model.words
    result: {'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
    >>> model.word_lengths
    result: {8: 1, 3: 2, 4: 1, 5: 2}
Important
If you add test code to your Python files, please put it in one or more separate test functions, which you can then call to do the testing. Having test functions is not required. However, you should not have any test code in the global scope (i.e., outside of a function).
You should continue testing your code frequently from this point forward to make sure everything is working correctly at each step. Otherwise, if you reach the end and realize there are errors, it can be very difficult to determine the causes of those errors in such a large program!
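One way to keep test code out of the global scope is to wrap it in a function, as in the sketch below. Both shout and test_shout are hypothetical names invented for this example, and your own test functions can be organized differently:

```python
def shout(s):
    """Hypothetical function under test: uppercase s and add '!'."""
    return s.upper() + '!'


def test_shout():
    """Check shout on a few inputs and report whether they all passed."""
    results = [
        shout('hi') == 'HI!',
        shout('') == '!',
        shout('Go BU') == 'GO BU!',
    ]
    passed = all(results)
    print('shout tests passed:', passed)
    return passed
```

Nothing runs when the file is loaded. To run the checks, call test_shout() from the console.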
Creating a text model can require a lot of computational power and
time. Therefore, once we have created a model, we want to be able to
save it for later use. The easiest way to do this is to write each of the
feature dictionaries to a different file so that we can read them back in
at a later time. In this part of the project, you will add methods to
your TextModel class that allow you to save and retrieve a
text model in this way.
To get you started, here is a function that defines a small dictionary and saves it to a file:
    def sample_file_write(filename):
        """A function that demonstrates how to write a
           Python dictionary to an easily-readable file.
        """
        d = {'test': 1, 'foo': 42}   # Create a sample dictionary.
        f = open(filename, 'w')      # Open file for writing.
        f.write(str(d))              # Writes the dictionary to the file.
        f.close()                    # Close the file.
Notice that:
The file is opened for writing by using a second
parameter of 'w' in the open function call.
We write to the file by using the file handle object’s write
method, passing in a string representation of the dictionary d that
is created using the built-in str() function.
Below is a function that reads in a string representing a dictionary
from a file, and converts this string (which is a string
that looks like a dictionary) to an actual dictionary object. The
conversion is performed using a combination of two built-in
functions: dict, the constructor for dictionary objects;
and eval, which evaluates a string as if it were an expression.
    def sample_file_read(filename):
        """A function that demonstrates how to read a
           Python dictionary from a file.
        """
        f = open(filename, 'r')    # Open for reading.
        d_str = f.read()           # Read in a string that represents a dict.
        f.close()

        d = dict(eval(d_str))      # Convert the string to a dictionary.

        print("Inside the newly-read dictionary, d, we have:")
        print(d)
Try saving these functions in a Python file (or even in
finalproject.py), and then use the following calls from the IPython
console:
    >>> filename = 'testfile.txt'
    >>> sample_file_write(filename)
    >>> sample_file_read(filename)
    Inside the newly-read dictionary, d, we have:
    {'test': 1, 'foo': 42}
There should also now be a file named testfile.txt in your project
folder that contains this dictionary.
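The same str and eval round trip also works for a dictionary whose keys are integers, such as a dictionary of word lengths. The sketch below assumes a throwaway file in a temporary folder, and the file name lengths_demo is made up for this example:

```python
import os
import tempfile

lengths = {3: 5, 4: 1, 6: 1}
path = os.path.join(tempfile.mkdtemp(), 'lengths_demo')

f = open(path, 'w')
f.write(str(lengths))    # the file now holds the text {3: 5, 4: 1, 6: 1}
f.close()

f = open(path, 'r')
d_str = f.read()         # a string that looks like a dictionary
f.close()

restored = dict(eval(d_str))    # an actual dictionary again
print(restored == lengths)      # True
```

Note that the keys of restored are integers again, not strings, so the dictionary can be used exactly as it was before it was saved.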
Now that you know how to write dictionaries to files and read dictionaries
from files, add the following two methods to the TextModel class:
Write a method save_model(self) that saves the TextModel
object self by writing its various feature dictionaries to
files. There will be one file written for each feature
dictionary. For now, you just need to handle the words and
word_lengths dictionaries.
In order to identify which model and dictionary is stored in a given
file, you should use the name attribute concatenated with
the name of the feature dictionary. For example, if self.name
is 'JKR' (for J. K. Rowling), then you should use the
filenames:
'JKR_words'
'JKR_word_lengths'
In general, the filenames are self.name + '_' +
name_of_dictionary. Taking this approach will ensure that you
don’t overwrite one model’s dictionary files when you go to save
another model.
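The naming scheme can be sketched with plain string concatenation. The helper name feature_filename and the second model name 'WS' below are hypothetical and used only to show that two models get distinct filenames:

```python
def feature_filename(model_name, dict_name):
    """Return the filename for one feature dictionary of one model."""
    return model_name + '_' + dict_name


dict_names = ['words', 'word_lengths']
jkr_files = [feature_filename('JKR', d) for d in dict_names]
ws_files = [feature_filename('WS', d) for d in dict_names]

print(jkr_files)    # ['JKR_words', 'JKR_word_lengths']
print(ws_files)     # ['WS_words', 'WS_word_lengths']
```

Since the model name is part of every filename, saving the second model leaves the first model's files untouched.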
It may help to use the code for sample_file_write as a starting
point for this method, but don’t forget that you should create a
separate file for each dictionary.
Write a method read_model(self) that reads the stored
dictionaries for the called TextModel object from their files and
assigns them to the attributes of the called TextModel.
This is the complementary method to save_model, and you should
assume that the necessary files have filenames that follow the
naming scheme used in save_model.
Remember that you can use the dict and eval functions to convert
a string that represents a dictionary to an actual dictionary object.
It may help to use the code for sample_file_read as a starting
point for this method, but note that you should read a separate
file for each dictionary.
Examples:
You can test your save_model and read_model methods as follows:
    # Create a model for a simple text, and save the resulting model.
    >>> model = TextModel('A. Poor Righter')
    >>> model.add_string("The partiers love the pizza party.")
    >>> model.save_model()

    # Create a new TextModel object with the same name as the original one,
    # and assign it to a new variable.
    >>> model2 = TextModel('A. Poor Righter')

    # Read the dictionaries that were saved for the original model,
    # and use them as the dictionaries of model2.
    >>> model2.read_model()
    >>> print(model2)
    text model name: A. Poor Righter
     number of words: 5
     number of word lengths: 4
    >>> model2.words
    results: {'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
    >>> model2.word_lengths
    results: {8: 1, 3: 2, 4: 1, 5: 2}
Coming soon!
Last updated on November 21, 2025.