Software and Datasets
National Science Foundation
Award Number: III-1320542
PIs: George Kollios and Evimaria Terzi
Network data appear in diverse applications such as social networks, biological networks, and mobile adhoc networks. In many cases, uncertainty or imprecise information becomes a critical impediment to understanding and effectively utilizing the information contained in such graphs. Uncertainty can be due either to the data-collection process or due to the machine-learning techniques utilized to preprocess the data. Such uncertainty is usually encoded in the edges; e.g., a node in a social network influences another node with some probability. In protein-protein interaction networks, the edges, representing interactions among proteins, are results of noisy and error-prone measurements.
How can we measure the closeness of two proteins in a protein-protein interaction network? How can we identify communities of users in an uncertain network? Which nodes are the most important nodes? How can we find the most important edges in an uncertain network and how can we decrease their uncertainty? How do we handle network evolution over time?
Most of the above questions have been extensively studied in the context of deterministic graphs. A straightforward approach to answering these questions on probabilistic graphs is to heuristically cast the probability of every edge into a weight and apply existing methodologies on this weighted graph. This approach is problematic; not only there is no meaningful way to perform such a casting, but also there is no principled way to additionally encode normal weights on the edges. Motivated by the possible- world semantics of probabilistic databases and probabilistic graphs, we treat every probabilistic graph G as a generative model for deterministic graphs. Each such deterministic graph is a possible world of G and is associated with a probability to be generated. The flexibility of possible-world semantics comes with high computational cost; after all, the expectation of any measure is taken over all (exponentially many) possible worlds.
The goal of this project is to adopt the possible-world semantics to probabilistic graphs and within this framework develop algorithms for well-established data-mining problems. Motivated by real-life applications, we focus on specific data-analysis tasks. However, our goal is to make our framework general enough so that it can be used for a wider set of tasks and applications.
Intellectual merit: In the presence of uncertainty, even basic notions, e.g., the shortest path between two nodes in the graph, lose their intuitive interpretation. Existing solutions to graph-analysis tasks prove insufficient to handle uncertain graphs. In this project, we lay the foundations of a principled framework that will facilitate the development of models and algorithms for data-analysis tasks in uncertain graphs. In order to achieve the above goals we will extend traditional data-analysis methods and address several challenging computational and scalability problems.
Broader impact: We expect our research to have impact on several application domains including online social networks, mobile adhoc networks and biological networks. For example, it can help internet-based and social-media related companies to analyze their data and improve their targeting advertisement policies and practices. This will create opportunities for making these companies viable and help the economy of internet-based business. In medicine, biology and bio-chemistry, networks play a very important role and many of these networks can be better modeled as uncertain networks. Our project can help analyze these networks and lead to new biological insights and development of new drugs and treatments. We plan to disseminate our results as follows: (1) We will develop publicly available prototypes. (2) We will include the results of our work in our classes lectures. We will also engage in our research graduate and undergraduate students, including women and minorities. (3) We will communicate our results to scientists of computer science and other fields and our industry collaborators through publications and demos.
Key Words: uncertain graphs; biological networks; social networks; uncertainty; data mining
Selected Recent Papers!
- Haohan Zhu, Xianrui Meng, and George Kollios.
NED: An Inter-Graph Node Metric Based On Edit Distance. arXiv 2016.
- Giovanni Comarela, Mark Crovella, and Evimaria Terzi.
Detecting Unusually-Routed ASes: Methods and Applications ACM IMC 2016.
- Xianrui Meng, Seny Kamara, Kobbi Nissim, and George Kollios.
GRECS: Graph Encryption for Approximate Shortest Distance Queries. ACM CCS 2015.
- Natali Ruchansky, Mark Crovella, and Evimaria Terzi.
Matrix Completion with Queries. ACM SIGKDD 2015.
- Charalampos Mavroforakis, Richard Garcia-Lebron, Ioannis Koutis, and Evimaria Terzi.
Spanning Edge Centrality: Large-scale Computation and Applications. WWW 2015.
A Python Library for efficient computation of spanning edge centralities:
Any opinions, findings, and conclusions or recommendations expressed
here are those of the author(s) and do not necessarily reflect the
views of the National Science Foundation.