Random sampling is a natural technique to speed up the execution of algorithms on very large datasets. The results obtained by analyzing only a random sample of the dataset are an approximation of the exact solution. When only a single value must be computed, the trade-off between the size of the sample and the accuracy of the approximation can be studied through probabilistic bounds (e.g., the Chernoff-Hoeffding bounds) for the deviation of the quantity of interest in the sample from its exact value in the dataset. In many classical data mining problems, the number of quantities of interest can be extremely large (e.g., betweenness centrality requires to compute one value for each node in a graph). In these cases, uniform (i.e., simultaneous) bounds to the deviations of all quantities are needed. Classical techniques like the Union bound are insufficient because excessively loose due to their worst-case assumptions that do not hold in many data mining problems.
Rademacher Averages and the Vapnik-Chervonenkis dimension have been developed to overcome this issue: they obtain much stricter uniform deviation bounds by taking into account the nature of the problem and properties of the dataset and of the sampling process. They have been used with success in the analysis of sampling algorithms for data and graph analysis problems on very large datasets.
In this tutorial, we survey the use of Rademacher Averages and VC-dimension for developing sampling-based algorithms for graph analysis and pattern mining. We start from their theoretical foundations at the core of machine learning theory, then show a generic recipe for formulating data mining problems in a way that allows using these concepts in the analysis of efficient randomized algorithms for those problems. Finally, we show examples of the application of this recipe to graph problems (connectivity, shortest paths, betweenness centrality) and pattern mining.
Our goal is to show how these techniques can be a valuable addition to the toolkit of the data mining researcher, and to encourage further research in the area.
Slides presented at ECML PKDD (Updated: September 11)
Slides presented at KDD (Updated: August 11)
Matteo Riondato is a research scientist in the Labs group at Two Sigma. Previously he was a postdoc at Brown University and at Stanford University. He received his Ph.D. from Brown in May 2014, where he was advised by Eli Upfal, with a dissertation on sampling-based randomized algorithms for data analytics, which received the Best Student Poster Award at SIAM SDM 2014. He presented a nectar talk about modern sampling algorithms at ECML PKDD 2014. His research focuses on exploiting theoretical results for practical algorithms in pattern and graph mining.
Eli Upfal is a professor of computer science at Brown University, where he was also the department chair from 2002 to 2007. Prior to joining Brown in 1998, he was a researcher and project manager at the IBM Almaden Research Center in California, and a professor of Applied Mathematics and Computer Science at the Weizmann Institute of Science in Israel. Upfal’s research focuses on the design and analysis of algorithms. In particular he is interested in randomized algorithms, probabilistic analysis of algorithms, and computational statistics, with applications ranging from combinatorial and stochastic optimization to routing and communication networks, computational biology, and computational finance. Upfal is a fellow of the IEEE and the ACM. He received the IBM Outstanding Innovation Award, and the IBM Research Division Award. His work at Brown has been funded in part by the National Science Foundation (NSF), the Defense Advanced Research Projects Agency (DARPA), the Office of Naval Research (ONR), and the National Institute of Health (NIH). He is co-author of a popular textbook “Probability and Computing: Randomized Algorithms and Probabilistic Analysis” (with M. Mitzenmacher, Cambridge University Press 2005).
A BibTeX bibliography of the main publications about VC-dimension and Rademacher averages: vcrade.bib.
This tutorial is part of Project BIGDATA at Brown CS.
This work is supported in part by NSF grant IIS-1247581 and NIH grant R01-CA180776. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.