Boston University

CAS CS 660 - Introduction To Database Systems

PA3-extra: NoSQL Databases and Twitter Analysis

This is ONLY for CS 660 students

Due on: Tuesday, Dec 12, 2017 at 11:59PM.

1. Introduction

In this project you will learn how to collect tweets from Twitter in real time (streaming mode), store them in a MongoDB database, and retrieve them from Python code using PyMongo. You will also explore the data within the Mongo shell itself.


This extra project is geared only toward grad students in CS660, so we expect you to install all the necessary packages on your own and to research how to accomplish each task. For some of the tasks we have provided suggestions, but you may use any other method to get the task done as long as you use PyMongo within Python. The exception is the last extra-point task, for which you might want to use other methods and languages. We tried to keep it fun and engaging, and we wish you a great rest of the semester.


For each part, write the related Python code using the PyMongo API, pure Python, or other third-party libraries. Use gsubmit to submit all of your code in a zip file named firstname_lastname_CS660.zip by Tuesday, December 12, 2017 at 11:59PM.


Part 1)  


For this part of the project, use the Twitter data mining script given to you (pymongo_tweepy.py) and modify it so that it mines tweets with the keywords #deeplearning, #computervision, #datascience, and #bigdata. Like the original file, your streamer should stream on track, that is, search for keywords (in Part 2 you will stream based on location).

Here's what a single tweet would look like when stored in MongoDB:

Use the command > db.twitter_search.find().limit(1)


In order to find the number of tweets in your database, you could use the following command:

> db.twitter_search.find().count()
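
The same count can also be obtained from Python. The following is a minimal sketch, not the course's reference solution. The collection name twitter_search comes from the shell command above; the connection URI, the database name "twitterdb", and the helper names get_collection and count_tweets are assumptions made for illustration. It uses count_documents, which is available in PyMongo 3.7 and later; older versions expose collection.count() instead.

```python
def get_collection(uri="mongodb://localhost:27017", db_name="twitterdb",
                   coll_name="twitter_search"):
    """Return a PyMongo collection handle.

    The URI and database name are assumptions; adjust them to match
    the settings used in your copy of pymongo_tweepy.py.
    """
    # Imported lazily so that count_tweets can be used and tested
    # without a running MongoDB server.
    from pymongo import MongoClient
    return MongoClient(uri)[db_name][coll_name]


def count_tweets(collection):
    """Count all stored tweets, like db.twitter_search.find().count()."""
    # An empty filter document matches every document in the collection.
    return collection.count_documents({})
```

With a MongoDB server running locally, print(count_tweets(get_collection())) would print the number of tweets stored so far.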


For the purpose of the project, please retrieve ~1000 tweets using the instructions at https://github.com/monajalal/mongo_tweets . If you have already cloned the repository, run git pull to get the latest version of the code. For further instructions on how to get the repo and get started with the Twitter API, please check Lab11_extra in case you didn't attend the lab on December 1st, 2017.

Lab11_extra: https://docs.google.com/document/d/1rCAgy7V1q8u4E33XwW0-3E0d9xU07WN6prpfbnoATRA/edit?usp=sharing


For this project, you will need to refer to the Tweet Object definitions here: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object



A. Find the number of tweets that have "data" somewhere in the tweet's text (case-insensitive search using regex).
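
One possible way to express this search with PyMongo is sketched below. It is only an illustration under stated assumptions: the helper names data_filter and count_matching are hypothetical, the filter looks only at the top-level text field of each stored tweet, and count_documents requires PyMongo 3.7 or later.

```python
import re


def data_filter(word="data"):
    """Build a MongoDB filter matching tweets whose text contains `word`.

    The "i" option makes the regular expression case insensitive, so
    "data", "Data", and "BigData" all match.
    """
    # re.escape guards against regex metacharacters in the search word.
    return {"text": {"$regex": re.escape(word), "$options": "i"}}


def count_matching(collection, word="data"):
    """Count the tweets in `collection` whose text contains `word`."""
    return collection.count_documents(data_filter(word))
```

Here collection is assumed to be a PyMongo collection handle for twitter_search; the same filter document can also be passed to collection.find() to iterate over the matching tweets.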


B. Of all the "data"-related tweet objects, how many are geo_enabled?
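
In the Tweet Object definition, geo_enabled is a boolean field of the embedded user object, so it can be reached with MongoDB's dot notation. The sketch below combines it with a case-insensitive regex on the text field. The helper names are hypothetical, and count_documents again assumes PyMongo 3.7 or later.

```python
def geo_enabled_data_filter():
    """Filter for tweets that mention "data" and whose author is geo_enabled."""
    return {
        # Case-insensitive match anywhere in the tweet text.
        "text": {"$regex": "data", "$options": "i"},
        # Dot notation reaches into the embedded user document.
        "user.geo_enabled": True,
    }


def count_geo_enabled(collection):
    """Count the "data" tweets whose user has geo_enabled set to true."""
    return collection.count_documents(geo_enabled_data_filter())
```

Both conditions sit in one filter document, which MongoDB treats as an implicit AND.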


C. For all the "data"-related tweets, use the TextBlob Python library to detect whether the tweet's sentiment is "Positive", "Neutral", or "Negative". You are free to use other sensible methods and libraries to do so. (Hint: to get better results you could clean the tweet's text of unwanted characters, emoji, etc. This is not obligatory, and we won't deduct points based on accuracy.)
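
A minimal sketch of one way to approach this is shown below. The cleaning rules, the zero threshold for "Neutral", and the helper names clean_tweet, classify_polarity, and tweet_sentiment are all assumptions chosen for illustration, not requirements of the assignment. TextBlob reports a polarity score between -1.0 and 1.0 through TextBlob(text).sentiment.polarity.

```python
import re


def clean_tweet(text):
    """Strip links, @mentions, '#' signs, and non-ASCII characters such as emoji."""
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop @mentions
    text = text.replace("#", " ")               # keep the hashtag word itself
    text = text.encode("ascii", "ignore").decode("ascii")  # drop emoji etc.
    return " ".join(text.split())               # collapse whitespace


def classify_polarity(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


def tweet_sentiment(text):
    """Label one tweet using TextBlob's polarity score."""
    # Imported lazily so the two helpers above work without TextBlob installed.
    from textblob import TextBlob
    return classify_polarity(TextBlob(clean_tweet(text)).sentiment.polarity)
```

To label every "data"-related tweet, one could loop over the documents returned by collection.find() with a case-insensitive regex filter on text and call tweet_sentiment(doc["text"]) for each one.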

Your final results should look something like this: