CAS CS 591 K1, Tools and Techniques for Data Mining and Applications

Schedule

Tue/Thur 3:30-5:00 pm in CAS 314

Office hours: Tue 12:00-1:30 pm and Thur 5:00-6:30 pm (George, MCS 283); Mon 3:30-5:00 pm and Wed 12:30-2:00 pm (Katherine, Undergrad Lab)

Instructor and TF

Instructor: George Kollios www.cs.bu.edu/~gkollios

Teaching Fellow: Katherine Zhao cs-people.bu.edu/kzhao

Course Outline

The course emphasizes practical skills in working with data, while introducing students to a wide range of techniques that are commonly used in the analysis of data, such as clustering, classification, regression, and network analysis. The goal of the class is to give students a hands-on understanding of classical data analysis techniques and to develop proficiency in applying these techniques in a modern programming language (Python).

Lectures will present the fundamentals of each technique; the focus is not on the theoretical underpinnings of the methods, but rather on helping students understand the practical settings in which these methods are useful. Class discussion will study use cases and will go over relevant Python libraries that enable students to perform hands-on experiments with their data.

Note that this class is different from CS 565 (Data Mining): while CS 565 focuses on the fundamental algorithmic problems behind a set of data-mining tasks and emphasizes the analysis of the algorithms for those tasks, this class will focus on how these algorithms work in practice.

Target audience

This course is targeted towards graduate or advanced undergraduate students who need to be proficient in working with and analyzing large datasets for their research, or who aim to find a job that requires data-analysis skills.

Prerequisites

Students taking this class must have some prior familiarity with programming, at the level of CS 105, 108, or 111, or equivalent. CS 112 is also helpful.

Suggested Textbooks

Python for Data Analysis

Programming Collective Intelligence

Workload

There will be weekly or bi-weekly programming assignments. In these assignments students will be given datasets, which they will analyze using the tools and techniques presented during that particular week. The assignments will be narrowly targeted, and their goal is to practice the material taught during the week.

In addition, there will be a final project. The project will be done in groups of two. For the project the students will use a dataset of their choice and will have to extract some knowledge or conclusions from the analysis of the dataset. The analysis will be done using a subset of the methods we described in class.

The project will have three essential components: 1) a data collection piece (which may involve crawling, calls to an API, combining data from different sources, etc.), 2) a data analysis piece (which will involve applying different techniques we described in class) and 3) a conclusion component (where conclusions will be drawn from the results of the data analysis). The students will submit a 5-page report clearly explaining all three components of their project. Finally, a poster presentation will be required, in which the students present their effort and results in front of their poster.

As an example, a student may choose to collect data from Twitter related to a specific topic (e.g., Ebola virus) and then measure the intensity of posts about that topic in different areas of the world. Other examples of projects may include (but are not limited to): analysis of MBTA data, analysis of NYC Taxi data, analysis of movie and sports data, and crawling of YouTube (or other social media) data to analyze social behavior such as trolling or bullying.

The project is due by the end of the exam week. The project presentations will be given in the form of a final poster explaining components 1, 2 and 3 of the project.

Students are expected to work individually on homeworks and on the final project. There will be no final exam.

Grading scheme:
Homeworks: 50%
Project: 50%

Tentative schedule

Week 1:
Introduction. Analyzing data from files using Python: I/O, parsing and simple computations on data in files.

Week 2:
Basic analytics and data summarization of tabular data, e.g., group-by summaries (python package: pandas). Basic visualization tools for data and data summaries (python package: matplotlib)

Week 3:
Similarity/Distance functions (applications to recommendation systems) (python package: scikit-learn, scipy)

Week 4:
Discovering groups of similar items (k-NN, locality sensitive hashing) (python package: scikit-learn)

Week 5:
Partition-based clustering (k-means, k-median) (python package: scikit-learn)

Week 6:
Hierarchical clustering (python package: scikit-learn)

Week 7:
Dimensionality Reduction: SVD (python package: numpy)

Week 8:
Regression (linear, logistic) (python package: scikit-learn)

Week 9:
Crawling: building your own mini-crawler (python package: beautifulsoup)

Week 10:
Social network analysis: computing network statistics (python package: networkx)

Week 11:
Finding communities in networks and network visualization (python package: networkx, matplotlib)

Weeks 12 and 13:
Tentative: Large-scale data analytics (MapReduce, Spark, GraphX)

The project is due by the end of the exam week.
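
To give a flavor of the early weeks (parsing data from files and computing group-by summaries), here is a minimal sketch using only the Python standard library. The sample data, the column names (city, rides) and the helper mean_by_group are made up for illustration and are not part of any course assignment; in class, a package such as pandas would typically be used for this kind of summary.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE = """city,rides
Boston,120
Cambridge,80
Boston,100
Cambridge,60
"""


def mean_by_group(lines, key, value):
    """Return {group: mean of `value`} for rows parsed from CSV lines."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        totals[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# io.StringIO lets the string be read like an open file; with a real
# file one would pass open("rides.csv", newline="") instead.
summary = mean_by_group(io.StringIO(SAMPLE), "city", "rides")
print(summary)  # {'Boston': 110.0, 'Cambridge': 70.0}
```

The same function works on any open text file with a header row, since csv.DictReader only needs an iterable of lines.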

Collaborations/Academic Honesty

All course participants must adhere to the CAS Academic Conduct Code. All instances of academic dishonesty will be reported to the academic conduct committee.