CAREER: Mining Salient Localized Patterns in Complex Data

NSF IIS 0448392 (March 15, 2005 to February 28, 2010)

One of the greatest challenges in modern data analysis is to find significant and non-obvious patterns within immense and complex data sets. The detection of such salient patterns is indispensable for comprehending the trends and meaning of data, and tools for it are required by scientists, economists, marketing analysts, and all other data analysts. This project is developing new methods and tools for identifying the salient patterns within complex data sets and has the following objectives: design robust and scalable algorithms for mining the most salient patterns; evaluate the significance of mined patterns in the context of complex and noisy data; and integrate and correlate heterogeneous data sets based on corresponding patterns. The project focuses on problems in bioinformatics, with four driving applications: Integrative Genetics of Cancer Susceptibility, HIV Salivary Gland Disease (SGD) Pathogenesis, Discovering Family Specific Residue Packing Patterns of Proteins, and Integrative Functional Annotation of Proteins. All of these applications produce massive quantities of data, providing an excellent testbed for the salient pattern mining algorithms being developed in this project.

The intellectual merits include a new class of data analysis tools for analyzing the huge data sets generated by modern quantitative genetics technologies. These tools will assist biologists in their study of functional proteomics, aid in their understanding of disease progression, and assist in the search for effective treatments. To be useful, the data mining techniques must also be accurate, computationally efficient, and able to operate autonomously. If successful, this project will make significant contributions to bioinformatics and computational biology. Results from this research will be disseminated through publications, and the software will be made publicly available through a web portal.

The broader impacts of this research include interdisciplinary collaboration and training, immediate applications to fields other than life sciences, a multitude of educational impacts, and outreach to underrepresented groups in the sciences. The pattern mining methods will be applied to analyze the administrative paperwork of child welfare cases from the North Carolina Department of Health and Human Services (NC-DHHS) in an effort to improve services and achieve better outcomes for children in the welfare system. Long-term interdisciplinary collaborations with scientists have been established and will be strengthened during the course of this project. Educational impacts include new curriculum developments for computer science and bioinformatics, support of multidisciplinary educational experiences, and services to the research community.


Principal Investigators:

to appear…



  1. The polymorphism architecture of mouse genetic resources elucidated using genome-wide resequencing data: implications for QTL discovery and systems genetics, by Adam Roberts, Fernando Pardo-Manuel de Villena, Wei Wang, Leonard McMillan, and David Threadgill, Mammalian Genome, Aug 3, 2007.
  2. Structure-based function inference using protein family-specific fingerprints, by Deepak Bandyopadhyay, Jun Huan, Jinze Liu, Jan Prins, Jack Snoeyink, Wei Wang, and Alexander Tropsha, Protein Science, v.15, 2006, p. 1537.
  3. Benchmarking the effectiveness of sequential pattern mining methods, by Hye-Chung Kum, Joong-Hyuk Chang, and Wei Wang, Data and Knowledge Engineering, v.60, 2007, p. 30.
  4. Sequential pattern mining in multi-databases via multiple alignment, by Hye-Chung Kum, Joong-Hyuk Chang, and Wei Wang, Data Mining and Knowledge Discovery (DMKD), v.12, 2006, p. 151.
  5. Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows, by Adam Roberts, Leonard McMillan, Wei Wang, Joel Parker, Ivan Rusyn, and David Threadgill, Proceedings of the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), 2007.
  6. FastANOVA: an efficient algorithm for genome-wide association study, by Xiang Zhang, Fei Zou, and Wei Wang, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’08).
  7. TreeQA: Quantitative Genome Wide Association Mapping Using Local Perfect Phylogeny Trees, by Feng Pan, Leonard McMillan, Fernando Pardo-Manuel de Villena, David Threadgill, and Wei Wang, Proceedings of the 14th Pacific Symposium on Biocomputing (PSB’09).
  8. Inferring Genome-Wide Mosaic Structure, by Qi Zhang, Wei Wang, Leonard McMillan, Fernando Pardo-Manuel de Villena, and David Threadgill, Proceedings of the 14th Pacific Symposium on Biocomputing (PSB’09).
  9. Genotype Sequence Segmentation: Handling Constraints and Noise, by Qi Zhang, Wei Wang, Leonard McMillan, Jan Prins, Fernando Pardo-Manuel de Villena, and David Threadgill, Proceedings of the 8th Workshop on Algorithms in Bioinformatics (WABI’08), 2008.