Boston University

CAS CS 660 - Introduction To Database Systems

PA3-extra: NoSQL Databases and Twitter Analysis

This is ONLY for CS 660 students

Due on: Tuesday, Dec 12, 2017 at 11:59PM.

1. Introduction

In this project you will learn how to collect tweets from Twitter in real time (streaming mode), store them in a MongoDB database, and retrieve them from Python code using PyMongo, in addition to playing with the data inside the Mongo shell itself.


This extra project is geared only toward grad students in CS660, so we expect students to install all the necessary packages on their own and to research ways of doing things independently. For some of the tasks we have provided suggestions on how to perform them, but you may use any other method to get the task done as long as you use PyMongo within Python (except for the last extra-credit task, for which you may want to use other methods and languages). We tried to keep it fun and engaging, and we wish you a great rest of the semester.


For each part you should write the related Python code, using the PyMongo API, pure Python, or other third-party libraries. You need to gsubmit your entire code as a zip file named firstname_lastname_CS660.zip by Tuesday, December 12, 2017 at 11:59PM.


Part 1)  


For this part of the project, use the Twitter data-mining script (pymongo_tweepy.py) given to you and modify it so that it mines tweets containing the keywords #deeplearning, #computervision, #datascience, and #bigdata. Your streamer, like the original file, should stream on track (i.e., filter by keywords), while in Part 2 you will stream based on location.
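The streaming pattern itself is already in pymongo_tweepy.py; as a reference, here is a minimal sketch of the on-track version, assuming the Tweepy 3.x StreamListener API and the original script's database/collection names (twitterdb, twitter_search). The credential strings are hypothetical placeholders; your keys and error handling will differ:

import json

from pymongo import MongoClient
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

KEYWORDS = ["#deeplearning", "#computervision", "#datascience", "#bigdata"]

class MongoListener(StreamListener):
    def __init__(self):
        super(MongoListener, self).__init__()
        # Same database/collection names as the original script (assumption).
        self.collection = MongoClient()["twitterdb"]["twitter_search"]

    def on_data(self, data):
        self.collection.insert_one(json.loads(data))  # store the raw tweet JSON
        return True

    def on_error(self, status_code):
        return status_code != 420  # disconnect on rate limiting

auth = OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # hypothetical placeholders
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
Stream(auth, MongoListener()).filter(track=KEYWORDS)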

Here's what a single tweet would look like when stored in MongoDB:

Use the command > db.twitter_search.find().limit(1)


In order to find the number of tweets in your database, you could use the following command:

> db.twitter_search.find().count()


For the purposes of this project, please retrieve ~1000 tweets using the instructions given at https://github.com/monajalal/mongo_tweets . If you have already cloned the repository, do a git pull to get the latest version of the code. For further instructions on how to get the repo and get started with the Twitter API, please check Lab11_extra in case you didn't attend the lab on December 1st, 2017.

Lab11_extra: https://docs.google.com/document/d/1rCAgy7V1q8u4E33XwW0-3E0d9xU07WN6prpfbnoATRA/edit?usp=sharing


For this project, you will need to refer to the Tweet object definitions here: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object



A. Find the number of tweets that have "data" somewhere in the tweet's text (case-insensitive search using regex).
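One way to express this in PyMongo, as a sketch assuming the twitterdb/twitter_search names from the streaming script (cursor.count() mirrors the shell command shown above):

from pymongo import MongoClient

collection = MongoClient()["twitterdb"]["twitter_search"]

# Case-insensitive regex match anywhere in the tweet text.
n = collection.find({"text": {"$regex": "data", "$options": "i"}}).count()
print("Tweets mentioning 'data':", n)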


B. Of all the "data"-related tweets, how many of their users are geo_enabled?
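Since geo_enabled lives on the embedded user object, one possible query uses dot notation -- a sketch under the same naming assumptions as above:

from pymongo import MongoClient

collection = MongoClient()["twitterdb"]["twitter_search"]

n = collection.find({
    "text": {"$regex": "data", "$options": "i"},
    "user.geo_enabled": True,  # dot notation reaches the nested user object
}).count()
print("geo_enabled 'data' tweets:", n)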


C. For all the "data"-related tweets, use the TextBlob Python library to detect whether the tweet's sentiment is "Positive", "Neutral", or "Negative". You are free to use other sensible methods and libraries to do so. (Hint: to get better results you could clean the tweet's text of unwanted characters/emoji/etc. -- this is not obligatory, and we won't deduct points based on accuracy, whatsoever.)
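A minimal sketch using TextBlob's polarity score; the thresholds that map polarity onto the three labels are our own choice, not something TextBlob mandates:

from pymongo import MongoClient
from textblob import TextBlob

collection = MongoClient()["twitterdb"]["twitter_search"]

counts = {"Positive": 0, "Neutral": 0, "Negative": 0}
for tweet in collection.find({"text": {"$regex": "data", "$options": "i"}}):
    polarity = TextBlob(tweet["text"]).sentiment.polarity
    if polarity > 0:
        counts["Positive"] += 1
    elif polarity < 0:
        counts["Negative"] += 1
    else:
        counts["Neutral"] += 1
print(counts)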

Your final results should look something like the following:



Part 2)

A. Create a new script that mines tweets from Twitter using the Tweepy API so that, instead of mining tweets based on keywords, it mines tweets based on location. Basically, change the stream.filter so that its locations field takes the United States bounding box (visit http://boundingbox.klokantech.com to find this info) as longitude/latitude values. Additionally, modify the script so that, of all the real-time tweets it streams, it only saves into MongoDB (using PyMongo) those whose coordinates field is not None. Also, in the mining code given to you, rename twitterdb to usa_db and twitter_search to usa_tweets_collection. Leave it running until you have mined ~10000 tweets, checking with db.usa_tweets_collection.find().count()
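A sketch of the changed pieces, again assuming the Tweepy 3.x listener pattern from Part 1; the bounding-box numbers below are approximate values for the continental US, so verify them on the site above:

import json

from pymongo import MongoClient
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

# Approximate continental-US bounding box: SW longitude, SW latitude,
# NE longitude, NE latitude (verify on boundingbox.klokantech.com).
USA_BBOX = [-125.0, 24.9, -66.9, 49.6]

class USAListener(StreamListener):
    def __init__(self):
        super(USAListener, self).__init__()
        self.collection = MongoClient()["usa_db"]["usa_tweets_collection"]

    def on_data(self, data):
        tweet = json.loads(data)
        if tweet.get("coordinates") is not None:  # keep only GPS-tagged tweets
            self.collection.insert_one(tweet)
        return True

auth = OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # hypothetical placeholders
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
Stream(auth, USAListener()).filter(locations=USA_BBOX)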


B. (mostly Python coding) Do some searching on Google to find out how to extract emojis from a text. We suggest using from emoji import UNICODE_EMOJI for this purpose, but feel free to use any library that helps. Find the tweets that have at least one emoji in them and use a defaultdict (or a plain dictionary) to save the count of emojis per state and states per emoji (a sketch follows the questions below).

1. What are the top 15 emojis used across all the tweets?

Your report should look something like this: [('', 61), ('🎄', 39), ('💙', 23), ('😍', 22), ('🔥', 19), ('🎅', 18), ('👍', 13), ('😁', 12), ('', 12), ('😂', 12), ('🎁', 11), ('💕', 11), ('', 11), ('🎶', 10), ('💖', 9)]

2. What are the top 5 states for the emoji 🎄?

3. What are the top 5 emojis for MA?

4. What are the top 5 states that use emojis?
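Here is a minimal sketch of the counting step. It assumes the flat UNICODE_EMOJI dictionary from the emoji package as it existed at the time of this assignment (later versions nest it per language), and it assumes the state can be read from the two-letter suffix of place.full_name (e.g. "Boston, MA"), which is a property of the data rather than a Twitter guarantee:

from collections import Counter, defaultdict

from emoji import UNICODE_EMOJI  # flat emoji -> name dict in 2017-era versions
from pymongo import MongoClient

collection = MongoClient()["usa_db"]["usa_tweets_collection"]

emoji_totals = Counter()                 # emoji -> count over all tweets
states_per_emoji = defaultdict(Counter)  # emoji -> {state: count}
emojis_per_state = defaultdict(Counter)  # state -> {emoji: count}

for tweet in collection.find():
    place = tweet.get("place") or {}
    full_name = place.get("full_name", "")
    # Assumption: US city places are named "City, ST".
    state = full_name.rsplit(", ", 1)[-1] if ", " in full_name else None
    for ch in tweet.get("text", ""):
        # Per-character matching misses multi-codepoint emoji; fine for a sketch.
        if ch in UNICODE_EMOJI:
            emoji_totals[ch] += 1
            if state:
                states_per_emoji[ch][state] += 1
                emojis_per_state[state][ch] += 1

print(emoji_totals.most_common(15))           # question 1
print(states_per_emoji["🎄"].most_common(5))  # question 2
print(emojis_per_state["MA"].most_common(5))  # question 3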


C. Use MongoDB queries through the PyMongo API to answer the following (see the sketch after the questions):

1. What are the top 5 states that have tweets?

2. In the state of California, what are the top 5 cities that tweet?
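One way to answer question 2 with MongoDB's aggregation framework through PyMongo -- a sketch that assumes city places carry a full_name of the form "City, CA"; for question 1 you could similarly group on the state suffix, extracted as in the emoji sketch above:

from pymongo import MongoClient

collection = MongoClient()["usa_db"]["usa_tweets_collection"]

# Top 5 California cities, grouping on place.full_name ("City, CA").
pipeline = [
    {"$match": {"place.full_name": {"$regex": ", CA$"}}},
    {"$group": {"_id": "$place.full_name", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5},
]
for doc in collection.aggregate(pipeline):
    print(doc["_id"], doc["count"])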


D. We have given you the file json_to_csv.py, which converts the database saved in MongoDB to CSV format. Use it to create a map of all the tweets with the Folium Python library (you are free to use other methods). Your eventual map should look something like the one below (and of course brownie points if it looks better). Some of the Folium methods you will likely find useful are Map and CircleMarker. Eventually, use the save method to store your map in a map.html file. We expect students to read more about the necessary methods in the Folium documentation, which has plenty of examples: https://media.readthedocs.org/pdf/folium/latest/folium.pdf (this task will not really require more than 10 lines of code).
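A minimal sketch, assuming the CSV produced by json_to_csv.py contains latitude and longitude columns (check the actual column names and output file name in your run; tweets.csv below is a hypothetical name):

import csv

import folium

# Center the map roughly on the continental US.
usa_map = folium.Map(location=[39.8, -98.6], zoom_start=4)

with open("tweets.csv") as f:  # hypothetical name of the json_to_csv.py output
    for row in csv.DictReader(f):
        folium.CircleMarker(
            location=[float(row["latitude"]), float(row["longitude"])],
            radius=2,
        ).add_to(usa_map)

usa_map.save("map.html")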



Extra credit (10 out of 100)

For part 2) B, create the map of the USA with the top 2 emojis per state (you may use any language you want).

