The goal of this project is to design and test mathematically well-founded algorithmic and statistical techniques for analyzing large-scale, heterogeneous, and noisy data. The proposed research is transformative in its emphasis on rigorous analytical evaluation of algorithms' performance and on statistical measures of output uncertainty, in contrast to the primarily heuristic approaches currently used in data mining and machine learning. Any progress in that direction will contribute significantly to the reliability and scientific impact of massive data analysis. This project is motivated by the challenges in analyzing molecular biology data. Molecular biology provides an excellent source of data for testing advanced data analysis techniques: specifically, DNA/RNA sequence data repositories are growing at a super-exponential rate. The data is typically large and noisy, and in some cases includes both genotype and phenotype features that permit experimental validation of the analysis. However, the methods and techniques developed in this project will be broadly applicable to other scientific communities that process massive multivariate data sets.
The major technical goals of the project include: (1) Design efficient algorithms that provide guarantees on the output when the data consists of independent random samples from an unknown distribution. (2) Develop techniques for estimating the minimum number of samples required to test hypotheses of varying complexity in large datasets, building on techniques in computational statistics. (3) Design algorithms to analyze data on graphs that represent interactions between samples or features in the dataset. These data may be static (e.g., mutations on interacting genes represented by a protein interaction network) or dynamic (e.g., information dissemination on a social network).
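To make goal (2) concrete, the following is a minimal sketch of one textbook way to bound a sufficient sample size. It is not the project's own method. It assumes each hypothesis concerns the mean of a quantity bounded in [0, 1] (for example, the frequency of a mutation), and combines Hoeffding's inequality with a union bound. The function name `hoeffding_sample_size` and its parameters are hypothetical, chosen only for illustration.

```python
import math


def hoeffding_sample_size(epsilon, delta, num_hypotheses=1):
    """Return a sufficient number of i.i.d. samples (illustrative sketch).

    Assumption: every hypothesis concerns the mean of a random variable
    bounded in [0, 1]. With this many samples, all `num_hypotheses`
    empirical means are within `epsilon` of their true means
    simultaneously, with probability at least 1 - delta.
    """
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must be in (0, 1)")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    if num_hypotheses < 1:
        raise ValueError("num_hypotheses must be at least 1")
    # Hoeffding: P(|empirical mean - true mean| >= epsilon) <= 2 * exp(-2 * n * epsilon^2).
    # Union bound over k hypotheses: 2 * k * exp(-2 * n * epsilon^2) <= delta,
    # which rearranges to n >= ln(2 * k / delta) / (2 * epsilon^2).
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon ** 2))


if __name__ == "__main__":
    for k in (1, 1_000, 1_000_000):
        print(k, hoeffding_sample_size(epsilon=0.05, delta=0.05, num_hypotheses=k))
```

Under these assumptions the required sample size grows only logarithmically in the number of hypotheses tested, but quadratically in 1/epsilon. This bound is a sufficient condition and is generally loose; techniques that account for the actual complexity of the hypothesis class can give sharper estimates.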
This project will advocate a responsible approach to data analysis, based on well-founded mathematical and statistical concepts. The capacity-building activities of the project include: (1) Creation and dissemination of algorithms and software that implement rigorous computational and statistical approaches to big data analysis. (2) Educational initiatives at the graduate and undergraduate levels to build a larger workforce of data scientists with the foundational skills needed both to apply analytical tools to existing datasets and to develop new approaches for future datasets. The proposed work will be tested on extensive cancer genome data, contributing to health IT, one of the National Priority Domain Areas.
Sung-Ho (Justin) Oh
The project is supported by NSF grant IIS-1247581 and NIH grant R01-CA180776. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.