From Statistical Learning Theory to Sampling Algorithms

**Random sampling** is a natural
technique to speed up the execution of algorithms on very large
datasets. The results obtained by analyzing only a random sample of the
dataset are an *approximation* of the exact solution. When only
a single value must be computed, the trade-off between the size of the
sample and the accuracy of the approximation can be studied through
probabilistic bounds (e.g., the Chernoff-Hoeffding bounds) for the
deviation of the quantity of interest in the sample from its exact
value in the dataset. In many classical data mining problems, *the
number of quantities of interest can be extremely large* (e.g.,
betweenness centrality requires computing one value for each node in a
graph). In these cases, uniform (i.e., simultaneous) bounds on the
deviations of all quantities are needed. *Classical techniques like
the union bound are insufficient*: they are excessively loose because
of worst-case assumptions that do not hold in many data mining
problems.
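As a minimal illustration of the trade-off above (not part of the tutorial materials), the following sketch uses the Hoeffding bound for averages of [0,1]-valued quantities to derive a sample size, and shows how the union bound inflates it by a log factor in the number of quantities; the function name and parameters are illustrative, not from the tutorial.

```python
import math

def hoeffding_sample_size(eps: float, delta: float, k: int = 1) -> int:
    """Sample size so that, with probability >= 1 - delta, each of the
    k sample averages of [0,1]-valued quantities is within eps of its
    true mean. For k > 1, the failure probability delta is split over
    the k quantities via the union bound (delta -> delta / k)."""
    return math.ceil(math.log(2 * k / delta) / (2 * eps * eps))

# One quantity vs. one quantity per node of a 10-million-node graph:
single = hoeffding_sample_size(eps=0.01, delta=0.1)
union = hoeffding_sample_size(eps=0.01, delta=0.1, k=10**7)
```

Even with ten million quantities the union-bound sample size grows only by the factor `log(2k/delta) / log(2/delta)`; the looseness comes not from this growth but from the worst-case independence assumption, which ignores correlations between the quantities that Rademacher averages and VC-dimension can exploit.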

**Rademacher Averages and the
Vapnik-Chervonenkis dimension**
have been developed to overcome this issue: they yield much tighter
uniform deviation bounds by taking into account the nature of the
problem and the properties of the dataset and of the sampling process. They
have been used successfully in the analysis of sampling algorithms for
data and graph analysis problems on very large datasets.
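For intuition on what a Rademacher average measures, here is a small Monte-Carlo sketch (an illustrative example, not from the tutorial) of the *empirical* Rademacher average of a finite function family on a fixed sample: the expected supremum, over the family, of the correlation with random ±1 signs. Small values indicate the family cannot fit random noise, which is what permits tight uniform deviation bounds.

```python
import random

def empirical_rademacher(values, trials=1000, rng=None):
    """Monte-Carlo estimate of the empirical Rademacher average
    E_sigma[ max_f (1/m) sum_i sigma_i f(x_i) ] for a finite family.

    values[f][i] holds f(x_i) for function f and sample point x_i;
    each sigma_i is an independent uniform ±1 sign.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    m = len(values[0])
    total = 0.0
    for _ in range(trials):
        sigmas = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(
            sum(s * row[i] for i, s in enumerate(sigmas)) / m
            for row in values
        )
    return total / trials

# Two hypothetical functions evaluated on a sample of 100 points:
fam = [[1.0] * 100, [0.0, 1.0] * 50]
r = empirical_rademacher(fam)
```

The estimate `r` shrinks roughly like 1/sqrt(m) as the sample grows, and, unlike the union bound, it depends on the actual function family and the actual sample rather than on a worst case.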

In this tutorial, we *survey the use of
Rademacher Averages and VC-dimension for developing sampling-based
algorithms* for graph analysis and pattern mining. We start from
their *theoretical foundations at the core of machine learning theory*, then
show a *generic recipe for formulating data mining problems* in a way that
allows using these concepts in the analysis of efficient randomized
algorithms for those problems. Finally, we *show examples of the
application of this recipe to graph problems* (connectivity, shortest
paths, betweenness centrality) *and pattern mining*.

Our goal is to show how these techniques can be a valuable addition to the toolkit of the data mining researcher, and to encourage further research in the area.

Slides presented at ECML PKDD (Updated: September 11)

Slides presented at KDD (Updated: August 11)

Matteo Riondato is a research scientist in the Labs group at Two Sigma. Previously he was a postdoc at Brown University and at Stanford University. He received his Ph.D. from Brown in May 2014, where he was advised by Eli Upfal, with a dissertation on sampling-based randomized algorithms for data analytics, which received the Best Student Poster Award at SIAM SDM 2014. He presented a Nectar track talk about modern sampling algorithms at ECML PKDD 2014. His research focuses on exploiting theoretical results for practical algorithms in pattern and graph mining.

Eli Upfal is a professor of computer science at Brown University, where he was also the department chair from 2002 to 2007. Prior to joining Brown in 1998, he was a researcher and project manager at the IBM Almaden Research Center in California, and a professor of Applied Mathematics and Computer Science at the Weizmann Institute of Science in Israel. Upfal’s research focuses on the design and analysis of algorithms. In particular he is interested in randomized algorithms, probabilistic analysis of algorithms, and computational statistics, with applications ranging from combinatorial and stochastic optimization to routing and communication networks, computational biology, and computational finance. Upfal is a fellow of the IEEE and the ACM. He received the IBM Outstanding Innovation Award, and the IBM Research Division Award. His work at Brown has been funded in part by the National Science Foundation (NSF), the Defense Advanced Research Projects Agency (DARPA), the Office of Naval Research (ONR), and the National Institutes of Health (NIH). He is co-author of a popular textbook “Probability and Computing: Randomized Algorithms and Probabilistic Analysis” (with M. Mitzenmacher, Cambridge University Press 2005).

A BibTeX bibliography of the main publications about VC-dimension and Rademacher averages: vcrade.bib.

This tutorial is part of Project BIGDATA at Brown CS.

This work is supported in part by NSF grant IIS-1247581 and NIH grant R01-CA180776. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.