\documentclass[12pt,twoside]{article}
\usepackage{graphicx}
\usepackage{fullpage}
\usepackage{hyperref}
\begin{document}
\begin{center}
\large{\textbf{CAS CS 112 -- Spring 2012, Programming Assignment 6}} \\
\large{\textbf{Due at 10:00 pm on Thursday, May 3}} \\
\end{center}
In this assignment you have to investigate the structure of a real world graph using
some of the techniques that we discussed in class. The graph that you will use is
a co-authorship graph from the DBLP database that contains information about articles
published in a large number of computer science journals and conferences:
{\tt http://www.informatik.uni-trier.de/\textasciitilde ley/db/}.
In a co-authorship graph, each vertex in the graph represents an author and an edge
is present between two authors if they have published at least one paper together.
Your task is to use simple techniques to analyze this graph. You will try to find
the ``importance'' of an author based on a number of different measures including
number of co-authors, neighborhood characteristics, and its ``centrality'' in this graph:
{\tt http://en.wikipedia.org/wiki/Eigenvector\_centrality\#eigenvector\_centrality}.
\vspace{0.5cm}
\noindent
{\bf Building the Graph}
(20 Points) You will be provided with a file ({\bf graph.txt}) which contains the graph data
using the format from the text described on p. 431. Each vertex in the graph is represented as an integer
number between 0 and 226412, some of which refer to authors of single-authored papers (no incident edges).
There are roughly 700 thousand edges in the graph.
The edges are undirected. Therefore, if an edge (a, b) is specified, so is (b,a).
Since the graph is very sparse, use an adjacency list representation to store it.
Your first task is to read in the {\bf graph.txt} file line by line and add the corresponding edge
to the graph representation. (Hint: do NOT start with this huge graph. Do everything on
much smaller graphs from the booksite, like mediumG.txt.)
%% GK: using HashMap you need to increase the heapsize of JVM to 500MB because it is not the mots space efficient implementation.
%%{\em Hint:} In Java, a convenient way to represent a graph is to use the following
%data stricture: \\
%
%{\tt HashMap>}\\
%
%where each vertex is represented as an Integer and is mapped to the set of its neighbors.
%
\vspace{0.5cm}
\noindent
{\bf Ranking Vertices by Degree}
(20 Points)
The simplest way to determine the importance of a vertex is by its degree.
For example, in a social interaction network the person who knows the most people may
be important (but of course that is not always the case). Using our graph representation,
it is very easy to determine the degree of each vertex (just get the size of its adjacency list).
First, output a list of vertices sorted by decreasing degree, where each line contains the name of
the vertex and its degree. You should notice that most vertices have very low degree, but there are
still a considerable number of hub vertices with high degree. For this and subsequent questions,
break any ties (if they occur) by printing lower-numbered vertices among ties first.
\vspace{0.5cm}
\noindent
{\bf Ranking Vertices by Closeness Centrality}
(20 Points) A second measure that signifies the ``importance'' of a vertex in a graph
is the {\bf closeness centrality} of this vertex. The idea behind this measure is that if
a vertex is close to the center of the graph, it should be easy to reach any
other vertex in the graph. Given a graph $G=(V, E)$ that is connected (that is, there is at least
one path between every pair of vertices), the closeness centrality $CLC$ of a vertex $v$ is defined as follows:
\begin{center}
$ CLC(v) = {{\sum_{u \in V'} d(v,u)} \over {|V|-1 }}$
\end{center}
where $V' = V-\{v\}$ and $d(v,u)$ is the shortest distance between vertices $v$ and $u$ in the graph (the size of the
shortest path). Thus, the $CLC(v)$ is the average distance of vertex $v$ to every other vertex in the
graph.
To calculate $CLC(v)$ of a vertex $v$ you can run a breadth-first search from $v$ and keep a
counter of the number of vertices with distance 1, 2, 3, and so on. After all vertices of the graph
have been visited, you can use these counters to compute $CLC(v)$.
Compute the closeness centrality of the top-100 vertices ranked by degree
and output those as a sorted list (sorted by increasing closeness centrality) as well.
\vspace{0.5cm}
\noindent
{\bf Ranking Vertices by Clustering Coefficient}
(20 Points) Another way to determine the importance of a vertex is by considering how
dense is the immediate neighborhood of this vertex. One way to measure this is by
considering what fraction of its neighbors are connected.
This metric is known as the {\bf clustering coefficient}.
Formally, for undirected graph $G = (V;E)$, define the neighbor set of $v_i \in V$,
denoted by $N_{v_i}$ , to be the set of vertices connected to $v_i$:
$$N_{v_i} = \{ v_j | (v_i, v_j) \in E\}.$$
Then the clustering coefficient of $v_i$, denoted by $CC(v_i)$,
is the fraction of vertex pairs in $N_{v_i}$ that share an edge:
$$CC(v_i) = {{|\{(v_j, v_k)\}|} \over {{|N_{v_i}|} \choose 2}}: v_j, v_k \in N_{v_i}, (v_j, v_k) \in E.$$
Compute the clustering coefficient of the top-100 vertices ranked by degree,
and again output a sorted list (sorted by decreasing clustering coefficient).
\end{document}