III-Core: Discovering and Exploring Patterns in Subspaces

IIS0812464 (September 1, 2008 ~ August 31, 2011)
High-throughput experimental methods have revolutionized scientific inquiry. In contrast to the hypothesis-driven scientific method, data-driven science seeks to discover and explore hypotheses supported by the huge volume of data generated in high-throughput experiments. Such datasets are large and high-dimensional: they consist of a multitude of samples and many measured attributes for each sample. A typical hypothesis corresponds to a subspace of this dataset: a subset of samples that share similar values on a subset of attributes.
The goal of this project is to develop a series of new data mining methods that can effectively discover these subspaces, the embedded patterns among the values, and the relationships between patterns. The underlying problems are highly combinatorial, so efficient algorithms are required to enable users to mine and explore subspace patterns in large and complex datasets. The proposed methods combine the advantages of efficient matrix decomposition, effective sampling techniques, and advanced graph algorithms. Solutions to these research problems will be integrated into an interactive visual interface for exploring subspace patterns mined from experimental data. While the proposed methods are applicable across a wide range of domains, the focus of the project is the analysis of gene regulatory networks and of protein structure, in collaboration with geneticists and pharmacologists, respectively.
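To make the notion of a subspace concrete, the following is a minimal Python sketch of one possible reading of "a subset of samples that share similar values on a subset of attributes". It is an illustration only, not one of the project's algorithms: the function names, the toy data, and the `tolerance` criterion (every chosen attribute may vary by at most a fixed amount across the chosen samples) are all assumptions made for this example.

```python
from typing import Sequence


def value_spread(data: Sequence[Sequence[float]],
                 samples: Sequence[int],
                 attribute: int) -> float:
    """Range (max minus min) of one attribute over the chosen samples."""
    values = [data[s][attribute] for s in samples]
    return max(values) - min(values)


def is_subspace_pattern(data: Sequence[Sequence[float]],
                        samples: Sequence[int],
                        attributes: Sequence[int],
                        tolerance: float) -> bool:
    """Return True if every chosen attribute varies by at most `tolerance`
    across the chosen samples (an assumed similarity criterion)."""
    if not samples or not attributes:
        return False
    return all(value_spread(data, samples, a) <= tolerance for a in attributes)


# Toy dataset: rows are samples, columns are measured attributes.
data = [
    [1.0, 5.1, 9.0, 2.0],
    [7.5, 5.0, 3.2, 2.1],
    [4.2, 5.2, 6.6, 1.9],
    [0.3, 8.8, 1.1, 7.4],
]

# Samples 0-2 agree closely on attributes 1 and 3, but not on attribute 0.
print(is_subspace_pattern(data, [0, 1, 2], [1, 3], tolerance=0.5))
print(is_subspace_pattern(data, [0, 1, 2], [0, 1], tolerance=0.5))
```

Checking one candidate is cheap. The difficulty noted above lies in the number of candidates: with n samples and m attributes there are on the order of 2^n × 2^m sample/attribute subset pairs, which is why exhaustive enumeration is impractical for large datasets.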
Personnel
Principal Investigators:
- Wei Wang (PI)
- Leonard McMillan (co-PI)
- Jan Prins (co-PI)
Students:
- Ning Jin
- Yi Liu
- Feng Pan
- Abhishek Sarkar
- Calvin Young
- Qi Zhang
- Xiang Zhang
- Zhaojun Zhang
Projects
- Tree-based Genome-wide Association Mapping
- Inferring Genome-wide Mosaic Structure
- FastANOVA: an Efficient Algorithm for Genome-Wide Association Study
- Genotype Sequence Segmentation
Publications
- Split-order distance for clustering and classification hierarchies, by Qi Zhang, Eric Yi Liu, Abhishek Sarkar, and Wei Wang. Proceedings of the 21st International Conference on Scientific and Statistical Database Management (SSDBM), pp. 517-534, 2009.
- COE: a general approach for efficient genome-wide two-locus epistasis test in disease association study, by Xiang Zhang, Feng Pan, Yuying Xie, Fei Zou, and Wei Wang. Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pp. 253-269, 2009.
- TreeQA: quantitative genome wide association mapping using local perfect phylogeny trees, by Feng Pan, Leonard McMillan, Fernando Pardo-Manuel de Villena, David Threadgill and Wei Wang. Proceedings of the 14th Pacific Symposium on Biocomputing (PSB), pp. 415-426, 2009.
- Inferring genome-wide mosaic structure, by Qi Zhang, Wei Wang, Leonard McMillan, Fernando Pardo-Manuel de Villena, and David Threadgill. Proceedings of the 14th Pacific Symposium on Biocomputing (PSB), pp. 150-161, 2009.
- FastChi: an efficient algorithm for analyzing gene-gene interactions, by Xiang Zhang, Fei Zou, and Wei Wang. Proceedings of the 14th Pacific Symposium on Biocomputing (PSB), pp. 528-539, 2009.
- Quantitative association analysis using tree hierarchies, by Feng Pan, Lynda Yang, Leonard McMillan, Fernando Pardo-Manuel de Villena, David Threadgill and Wei Wang. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), pp. 971-976, 2008.
- Functional neighbors: relationships between non-homologous protein families inferred using family-specific fingerprints, by Deepak Bandyopadhyay, Luke Huan, Jinze Liu, Jan Prins, Jack Snoeyink, Wei Wang, and Alexander Tropsha. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2008.
- REDUS: finding reducible subspaces in high dimensional data, by Xiang Zhang, Feng Pan, and Wei Wang. Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), pp. 961-970, 2008.
- Mining non-redundant high order correlations in binary data, by Xiang Zhang, Feng Pan, Wei Wang, and Andrew Nobel. Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), pp. 1178-1188, 2008.
- FastANOVA: an efficient algorithm for genome-wide association study, by Xiang Zhang, Fei Zou, and Wei Wang. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 821-829, 2008. (Best Research Paper)
- CRD: a general framework for fast co-clustering on large datasets utilizing sample-based matrix decomposition, by Feng Pan, Xiang Zhang, and Wei Wang. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 173-184, 2008.
- CARE: finding local linear correlations in high dimensional data, by Xiang Zhang, Feng Pan, and Wei Wang. Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE), pp. 130-139, 2008. (Best Student Paper)
- Poclustering: lossless clustering of dissimilarity data, by Jinze Liu, Qi Zhang, Wei Wang, Leonard McMillan, and Jan Prins. Proceedings of the 7th SIAM Conference on Data Mining (SDM), 2007.
- Clustering pair-wise dissimilarity data into partially ordered sets, by Jinze Liu, Qi Zhang, Wei Wang, Leonard McMillan, and Jan Prins. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 637-642, 2006.