Software and Datasets Related to My Research

1995 WWW Client Datasets

These datasets are the basis for many published studies. They are available from the Internet Traffic Archive. The format of the traces and the collection process are documented in our tech report BUCS-TR-1995-010.

This material is based upon work supported by the National Science Foundation under Grant no. CCR-9501822.

1998 WWW Client Datasets

In 1998 we captured a new set of client logs, using a method different from that used for the 1995 set. The format of the trace data and the collection process are documented in our tech report BUCS-TR-1999-011, and the trace itself is here.

This material is based upon work supported by the National Science Foundation under Grant no. CCR-9501822.

tcpeval

Tcpeval constructs critical path analyses of TCP transactions. It was developed by Paul Barford during his PhD thesis research. Its algorithms are described in the paper Critical Path Analysis of TCP Transactions, in Proceedings of the 2000 ACM SIGCOMM Conference, Stockholm, Sweden, September 2000.

The source code is available in a compressed tarfile here. Included in the tarfile is a HOWTO with installation instructions.

If you download this code, please send an email to Paul Barford (pb at cs.wisc.edu) to let him know that you are using it and whether you find it useful.

This material is based upon work supported by the National Science Foundation under Grant no. CCR-9706685.

Surge

Surge, which generates Web requests intended to mimic the measured statistical properties of real Web traffic, is available here.

The paper describing Surge's rationale and design is Generating Representative Web Workloads for Network and Server Performance Evaluation in Proceedings of Performance ’98 / ACM SIGMETRICS ’98.

However, the default models and parameter settings used in this version of Surge are based on analyses of the 1998 dataset, documented in Changes in Web Client Access Patterns: Characteristics and Caching Implications in World Wide Web, Special Issue on Characterization and Performance Evaluation, Vol. 2, pp. 15-28, 1999.

This is the HTTP/1.1-compliant version of the code (HTTP/1.0 is still supported in this release). There is a detailed HOW-TO included which should get you going.

If you download this code, please send an email to Paul Barford (pb at cs.wisc.edu), who developed it; he will put your name on the SURGE interest mailing list so that you will be notified about future updates. We'd also be interested in what you will be using the code for; a brief overview would be appreciated.

While the HOW-TO suggests using the MIT pthreads library, if you are using a 2.2 Linux kernel we recommend compiling with kernel threads instead (make sure your thread limit is set high enough!); doing so requires a few modifications to the build.

This material is based upon work supported by the National Science Foundation under Grant nos. CCR-9501822 and CCR-9706685.

BPROBE

BPROBE is a tool for measuring the bottleneck bandwidth of an Internet path using the packet-pair technique. It was developed by Bob Carter during his PhD research. Source code for BPROBE is available here, and the paper describing the design of BPROBE is here.
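
For intuition, here is a minimal Python sketch of the packet-pair principle (not the BPROBE code itself; the noise filtering here, a simple median, is an assumption made for illustration): two same-size packets sent back to back are spread apart by the bottleneck link, so packet size divided by the receiver-side gap estimates the bottleneck bandwidth.

def packet_pair_estimate(packet_size_bytes, arrival_gaps_sec):
    """Estimate bottleneck bandwidth (bits/sec) from packet-pair gaps.

    arrival_gaps_sec holds the receiver-side spacing of each probe pair;
    the median of the per-pair estimates is a simple guard against
    queueing noise (BPROBE itself filters far more carefully).
    """
    estimates = sorted(packet_size_bytes * 8 / g for g in arrival_gaps_sec if g > 0)
    return estimates[len(estimates) // 2]

if __name__ == "__main__":
    # Hypothetical measured gaps (seconds) for 1500-byte probe pairs.
    gaps = [0.00121, 0.00119, 0.00240, 0.00120, 0.00122]
    print("bottleneck estimate: %.1f Mbit/s" % (packet_pair_estimate(1500, gaps) / 1e6))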

This material is based upon work supported by the National Science Foundation under Grant no. CCR-9501822.

Aest: A Tool For Estimating the Heavy Tail Index from Scaling Properties

This tool estimates the tail index alpha of empirical heavy-tailed distributions, such as those encountered in telecommunication systems. It uses a method (called the "scaling estimator") based on the scaling properties of sums of heavy-tailed random variables. The software is available here, and the paper describing aest is available here.
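
The core idea fits in a few lines of Python; this is a toy illustration of the scaling principle under strong assumptions, not aest itself, which identifies the tail region and aggregation levels far more carefully. For tail index alpha, m-fold sums scale deep tail quantiles by roughly m**(1/alpha), so alpha can be read off a quantile ratio.

import numpy as np

def scaling_alpha(x, m=10, tail_quantile=0.99):
    # Aggregate in non-overlapping blocks of m, then compare a deep tail
    # quantile before and after; under heavy-tailed scaling the quantile
    # grows by roughly m**(1/alpha).
    x = np.asarray(x, dtype=float)
    n = len(x) // m
    agg = x[: n * m].reshape(n, m).sum(axis=1)
    q_raw = np.quantile(x, tail_quantile)
    q_agg = np.quantile(agg, tail_quantile)
    return np.log(m) / np.log(q_agg / q_raw)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_true = 0.8
    x = (1.0 / rng.random(200_000)) ** (1.0 / alpha_true)  # Pareto, tail index 0.8
    print("estimated alpha: %.2f" % scaling_alpha(x))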

Traffic Matrices

In the paper Mining Anomalies Using Traffic Feature Distributions we used data from two networks: GEANT and Abilene. The GEANT data was provided to us under NDA, so we can't distribute it, but the Abilene data is freely distributable. It can be downloaded here as a Matlab file with associated metadata and instructions. Note: this data consists of byte counts per unit time (not the entropy measures used in the paper).

To implement subspace-based anomaly detection as used in that paper, probably the best source is the PhD School short course I gave in May 2015 (described below), which provides examples of using these techniques in practice on the Abilene data.
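
As a rough taste of the technique, here is a minimal Python sketch of subspace-based detection on a synthetic link-traffic matrix; the number of principal components and the threshold (a multiple of the median, standing in for the formal Q-statistic) are assumptions made for illustration.

import numpy as np

def subspace_anomalies(X, k=4, thresh_mult=5.0):
    """X: (time bins x links) matrix of byte counts. Project each bin onto
    the residual subspace (everything outside the top-k principal
    components) and flag bins with large squared prediction error."""
    Xc = X - X.mean(axis=0)                  # center each link's series
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                             # basis of the normal subspace
    resid = Xc - Xc @ P @ P.T                # residual traffic
    spe = (resid ** 2).sum(axis=1)           # squared prediction error
    cutoff = thresh_mult * np.median(spe)    # crude stand-in for the Q-statistic
    return np.flatnonzero(spe > cutoff)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    t = np.arange(2016)                                 # a week of 5-minute bins
    base = 1e6 * (1.5 + np.sin(2 * np.pi * t / 288))    # shared diurnal pattern
    X = np.outer(base, rng.uniform(0.5, 2.0, 40))       # 40 links
    X += rng.normal(0, 5e4, X.shape)                    # measurement noise
    X[700, 3] += 5e6                                    # injected volume anomaly
    print("flagged time bins:", subspace_anomalies(X))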

This material is based upon work supported by the National Science Foundation under Grant no. CCR-0325701.

Latency Matrices (Virtual Landmarks)

Virtual Landmarks uses a Lipschitz embedding of network nodes based on distances to a set of landmarks, along with dimensionality reduction via PCA. The method is described in this paper. The datasets used in that paper are here.
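
A minimal Python sketch of the two steps follows; the synthetic delays stand in for real measurements, and the choice of 15 landmarks and 3 output dimensions is arbitrary.

import numpy as np

def virtual_landmarks(delays_to_landmarks, d=3):
    """delays_to_landmarks: (nodes x landmarks) matrix of measured delays,
    i.e. the Lipschitz embedding. Returns (nodes x d) coordinates after
    PCA-style dimensionality reduction."""
    L = delays_to_landmarks - delays_to_landmarks.mean(axis=0)
    _, _, Vt = np.linalg.svd(L, full_matrices=False)
    return L @ Vt[:d].T          # project onto the top-d principal directions

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    pos = rng.uniform(0, 100, (200, 2))     # hypothetical node positions
    landmarks = pos[:15]                    # first 15 nodes serve as landmarks
    delays = np.linalg.norm(pos[:, None] - landmarks[None], axis=2)
    print(virtual_landmarks(delays).shape)  # -> (200, 3)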

This material is based upon work supported by the National Science Foundation under Grant no. ANI-0322990.

Constraint-Based Geolocation

Constraint-Based Geolocation (CBG) uses round-trip-time delays measured from hosts at known locations to estimate the geographic position of a target. The technique is described in this paper. The code for CBG is here, as a collection of routines in R.
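
The geometric heart of CBG can be sketched in Python (this is not the R code; the single distance-per-millisecond constant is a crude stand-in for the per-landmark calibrated "bestlines" that CBG actually derives). Each landmark's RTT yields a disk the target must lie in, and the target is placed at the centroid of the intersection of those disks.

import numpy as np

KM_PER_MS = 100.0   # rough one-way distance per ms of RTT/2 (assumed constant)

def cbg_estimate(landmarks_xy, rtts_ms, grid_step=5.0):
    """Return the centroid of the grid region satisfying all distance
    constraints (or the least-violating point if the region is empty)."""
    bounds = (np.asarray(rtts_ms) / 2.0) * KM_PER_MS   # max distance per landmark
    xs = np.arange(0.0, 1000.0, grid_step)
    gx, gy = np.meshgrid(xs, xs)
    pts = np.stack([gx.ravel(), gy.ravel()], axis=1)
    d = np.linalg.norm(pts[:, None] - landmarks_xy[None], axis=2)
    violation = np.clip(d - bounds, 0, None).sum(axis=1)
    feasible = pts[violation == 0]
    return feasible.mean(axis=0) if len(feasible) else pts[violation.argmin()]

if __name__ == "__main__":
    lm = np.array([[100.0, 100.0], [900.0, 150.0], [500.0, 900.0]])
    target = np.array([420.0, 380.0])
    rtts = 2 * np.linalg.norm(lm - target, axis=1) / KM_PER_MS + 1.0  # +1 ms slack
    print("estimated position:", cbg_estimate(lm, rtts))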

This material is based upon work supported by the National Science Foundation under Grant no. ANI-0322990.

Multidimensional Scaling in the Poincare Disk

Multidimensional Scaling in the Poincare Disk is a method for embedding a set of points equipped with interpoint distances (or dissimilarities) into the Poincare model of hyperbolic space, seeking to minimize the difference between the input distances and the distances as measured in the embedding. Matlab code for MDS-PD is here and the method is described in this paper.
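
For intuition, here is a small Python sketch (not the Matlab MDS-PD code) that minimizes squared stress between target distances and Poincare-disk hyperbolic distances using a generic optimizer; the initialization and the clamp keeping points inside the disk are implementation assumptions.

import numpy as np
from scipy.optimize import minimize

def poincare_dist(u, v):
    """Hyperbolic distance between two points of the open unit disk."""
    num = 2 * np.sum((u - v) ** 2)
    den = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + num / den)

def mds_pd(D, seed=0):
    """D: (n x n) symmetric matrix of target distances."""
    n = D.shape[0]
    x0 = 0.1 * np.random.default_rng(seed).standard_normal((n, 2))

    def stress(flat):
        P = flat.reshape(n, 2)
        r = np.linalg.norm(P, axis=1, keepdims=True)
        P = P * np.minimum(1.0, 0.999 / np.maximum(r, 1e-12))  # stay in the disk
        return sum((poincare_dist(P[i], P[j]) - D[i, j]) ** 2
                   for i in range(n) for j in range(i + 1, n))

    res = minimize(stress, x0.ravel(), method="Nelder-Mead")
    return res.x.reshape(n, 2)

if __name__ == "__main__":
    D = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
    print(mds_pd(D))   # three points along a hyperbolic geodesic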

This material is based upon work supported by the National Science Foundation under Grant no. CNS-1018266.

Hyperbolic Greedy Embedding for Dynamic Graphs

A greedy embedding of a graph is an assignment of coordinates to the graph's vertices in some metric space such that greedy routing always succeeds. Robert Kleinberg showed that any graph can be embedded greedily in two-dimensional hyperbolic space (Infocom 2007), but that algorithm may require all nodes to be re-embedded if the graph grows. In Hyperbolic Embedding and Routing for Dynamic Graphs we presented an algorithm for greedy embedding in which no node needs to change coordinates as the graph grows.

This archive contains two sets of code: (1) Matlab code for both Kleinberg's embedding and our own, and (2) C code, using MPFR functions, that implements our embedding and routing in variable precision. Variable precision is needed for large graphs, whose vertices can wind up close (in a Euclidean sense) to the boundary of the Poincare disk. The archive includes all code developed for our papers Hyperbolic Embedding and Routing for Dynamic Graphs, Low-Stretch Greedy Embedding Heuristics, and On the Choice of a Spanning Tree for Greedy Embedding of Network Graphs.
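
To make the notion of greedy routing concrete, here is a small Python sketch (independent of the archive above); its graph and coordinates are hand-placed and hypothetical, since producing coordinates under which this loop provably never gets stuck is exactly what the embedding algorithms provide.

import numpy as np

def hyp_dist(u, v):
    """Hyperbolic distance between points of the open unit (Poincare) disk."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    num = 2 * np.sum((u - v) ** 2)
    den = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + num / den)

def greedy_route(adj, coords, src, dst):
    """Forward to the neighbor hyperbolically closest to dst; return the
    path, or None if stuck at a local minimum (embedding not greedy)."""
    path = [src]
    while path[-1] != dst:
        here = path[-1]
        nxt = min(adj[here], key=lambda n: hyp_dist(coords[n], coords[dst]))
        if hyp_dist(coords[nxt], coords[dst]) >= hyp_dist(coords[here], coords[dst]):
            return None
        path.append(nxt)
    return path

if __name__ == "__main__":
    # A 3-leaf star with hand-placed coordinates in the disk.
    adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    coords = {0: (0.0, 0.0), 1: (0.5, 0.0), 2: (-0.25, 0.43), 3: (-0.25, -0.43)}
    print(greedy_route(adj, coords, 1, 3))   # -> [1, 0, 3]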

This material is based upon work supported by the National Science Foundation under Grant no. CNS-1018266.

AliasCluster

AliasCluster is a method for interface disambiguation – identifying which IP addresses are associated with the same router. It is described in AliasCluster: A Lightweight Approach to Interface Disambiguation, Global Internet Symposium, 2013.

Matlab code for AliasCluster is here. The main results in the paper are obtained by running loadIndividualComponents.m and then script2010GlobalInternet.m.

This material is based upon work supported by the National Science Foundation under Grants no. CNS-0905565, CNS-1018266, CNS-1012910, and CNS-1117039, and supported by the Army Research Office under grant W911NF-11-1-0227.

Estimating Intrinsic Dimension via Clustering

Estimating the intrinsic dimension of a data set from pairwise distances is a critical issue in a wide range of disciplines, including genomics, finance, and networking. Our method, described in Estimating Intrinsic Dimension via Clustering, IEEE Statistical Signal Processing Workshop, 2012, uses clustering to estimate dimension, and shows greater accuracy and better scalability than prior techniques.
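
For context, here is a toy covering-number estimator in Python. It is not the estimator from the paper, only a simple clustering-flavored baseline: the number of epsilon-balls needed to cover a d-dimensional set grows like epsilon**(-d), so d can be read off from greedy covering counts at two scales.

import numpy as np

def covering_number(X, eps):
    """Greedily count the eps-balls needed to cover the points in X."""
    remaining, count = X, 0
    while len(remaining):
        d = np.linalg.norm(remaining - remaining[0], axis=1)
        remaining = remaining[d > eps]
        count += 1
    return count

def covering_dimension(X, eps1, eps2):
    n1, n2 = covering_number(X, eps1), covering_number(X, eps2)
    return np.log(n2 / n1) / np.log(eps1 / eps2)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    # A 2-dimensional plane embedded in 10-dimensional ambient space.
    basis = np.linalg.qr(rng.standard_normal((10, 2)))[0]
    X = rng.uniform(0, 1, (5000, 2)) @ basis.T
    print("estimated dimension: %.2f" % covering_dimension(X, 0.1, 0.05))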

Code for the method has been published in the ider package for R, available from CRAN: ider: Intrinsic Dimension Estimation with R.

Routing State Distance

Routing State Distance is a new metric and analysis toolset for understanding the structure of paths in graphs. It is described in a series of papers: Inferring Visibility: Who's (Not) Talking to Whom? (SIGCOMM 2012); Routing State Distance: A Path-based Metric for Network Analysis (IMC 2012), winner of a 2013 IETF/IRTF Applied Networking Research Prize; and Studying Interdomain Routing over Long Timescales (IMC 2013).

Code and data related to RSD and TRSD are here.

This material is based upon work supported by the National Science Foundation under Grants no. CNS-0905565, CNS-1018266, CNS-1012910, and CNS-1117039, and supported by the Army Research Office under grant W911NF-11-1-0227.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Matrix Completion with Queries

Code for the Order-and-Extend algorithm is here; it is in Matlab. Unfortunately, there is no accompanying documentation describing how to run it.

This research was supported in part by NSF grants CNS-1018266, CNS-1012910, IIS-1421759, IIS-1218437, CAREER-1253393, IIS-1320542, and IIP-1430145.

Mining Low Dimensional Network Data

In May 2015 I gave a PhD School short course covering the theory and practice of data mining techniques based on exploiting low dimensionality in network data.

It is in the form of a Jupyter notebook, using Python.

Alternatively, you can interact with the notebook directly via Binder.

Explanations for Matrix Factorization Recommender Systems

Code for generating gradient-based explanations for matrix factorization based recommender systems, as described in Exploring Explanations for Matrix Factorization Recommender Systems in the FATREC Workshop on Responsible Recommendation 2017, is here.
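
One simple way to realize a gradient-based explanation (a sketch under assumptions; the linked code is authoritative for the paper's actual method): if the user's factor vector is fit by ridge regression of her observed ratings against the item factors, then the predicted score is linear in those ratings, and its gradient assigns each past rating an influence on the recommendation.

import numpy as np

def explanation_scores(Q_o, q_i, lam=0.1):
    """With user factor p_u = (Q_o^T Q_o + lam*I)^{-1} Q_o^T r_o, the
    prediction rhat_ui = q_i . p_u is linear in the observed ratings r_o;
    return d(rhat_ui)/d(r_o), one influence score per observed rating.
    (Being linear, the gradient does not depend on the rating values.)"""
    k = Q_o.shape[1]
    A = Q_o.T @ Q_o + lam * np.eye(k)
    return q_i @ np.linalg.solve(A, Q_o.T)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    Q_o = rng.standard_normal((6, 3))   # latent factors of 6 items the user rated
    q_i = rng.standard_normal(3)        # latent factors of the candidate item
    scores = explanation_scores(Q_o, q_i)
    print("most influential past rating: item", int(np.abs(scores).argmax()))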

Alternatively, you can interact with the notebook directly via Binder.

Generating Antidote Data to Improve Fairness

Code for generating antidote data to improve the societal impact of matrix factorization based recommender systems, as described in Fighting Fire with Fire: Using Antidote Data to Improve Polarization and Fairness of Recommender Systems in WSDM 2019, is here.

How YouTube Leads Privacy-Seeking Users away from Reliable Information

The repository containing material related to the above paper is here.

Licensing

All code on this page is licensed under a Creative Commons License.