Visualizing and Exploring High-dimensional Data


NSF IIS 0534580 (September 1, 2006 to August 31, 2010)

The aim of this project is to develop new methods for interactively exploring relationships within large high-dimensional data sets, such as those typical of high-throughput scientific experiments. The resulting tools will aid scientists before they apply traditional offline data-analysis techniques such as clustering, segmentation, and classification. Scientists will be able to explore hypotheses and incorporate their own knowledge to steer traditional unsupervised data-mining algorithms in sensible and more promising directions. The visualization tools will assist scientists in many disciplines, including biologists studying gene function, medical doctors comprehending disease susceptibility, chemists developing candidate drugs, and high-energy physicists analyzing the data generated by particle accelerators.

A key component of the novel approach is the ability to interactively explore parameter spaces and combine attributes of high-dimensional data points. The visualization tool will provide two alternate views of the data sets: a dissimilarity-matrix view that offers insights into the size, compactness, separation, and relative proximity of clusters, and a point-cloud view that provides a 3-D projection of the high-dimensional source data that best preserves the distances between points. This dual-view approach excels at communicating the flow and migration of points from one cluster to another as parameters are tuned. It also allows the user to probe and interact with the data, including such tasks as hand-clustering the data and examining particular points. The resulting visualization tools will support dynamic cluster formation and migration as the contributions of various data set features are interactively modified. The project provides an excellent interdisciplinary education and research environment, and its collaborative nature also enhances the potential for disseminating results.
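The two views described above can be illustrated with a small sketch. This is not the project's implementation: it assumes that attributes are combined through a weighted Euclidean distance and that the 3-D projection is computed with classical multidimensional scaling (MDS), neither of which is stated on this page. The function names `weighted_dissimilarity` and `classical_mds` and the random data are hypothetical, chosen only for illustration.

```python
import numpy as np

def weighted_dissimilarity(X, weights):
    """Pairwise weighted Euclidean distances between the rows of X.

    Assumption: attribute contributions are combined as per-feature
    weights inside a Euclidean distance.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    diff = X[:, None, :] - X[None, :, :]          # shape (n, n, d)
    return np.sqrt((w * diff ** 2).sum(axis=-1))  # shape (n, n)

def classical_mds(D, k=3):
    """Embed n points in k dimensions from an n-by-n distance matrix D.

    Assumption: classical MDS stands in for the distance-preserving
    projection; other projection methods would serve equally well here.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    lam = np.clip(eigvals[order], 0.0, None)      # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical data: 50 points with 10 attributes each.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

weights = np.ones(10)                    # every attribute contributes equally
D = weighted_dissimilarity(X, weights)   # input to the dissimilarity-matrix view
Y = classical_mds(D, k=3)                # coordinates for the point-cloud view
```

Changing `weights` and recomputing `D` and `Y` mimics the interactive tuning described above: both views are refreshed from the same dissimilarities, so points that move between clusters in one view move in the other as well.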

Project Personnel

Principal Investigators:

Collaborators:

  • David Threadgill (Genetics, NC State)
  • Fernando Pardo Manuel de Villena (Genetics, UNC)

Students:

  • Shriram Alapathy
  • Jeremy R Wang
  • Catherine Welsh
  • Xiang Zhang (Microsoft Ph.D. Fellowship winner)

Alumni:

  • Jinze Liu (Ph.D. 2006, Postdoc 2007, Assistant Professor, University of Kentucky)
  • Kyle Moore (M.S. 2007)
  • Mengsheng Zhang (M.S. 2008)
  • Feng Pan (Ph.D. 2009)
  • Lynda Yang (B.S. 2008, NSF Fellowship, currently at UIUC)
  • Tynia Yang (M.S. 2006)
  • Adam Roberts (B.S. 2007, NSF Fellowship, currently at UC-Berkeley)
  • Qi Zhang (Ph.D. 2009)

Projects:

Publications

  1. Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows, by Adam Roberts, Leonard McMillan, Wei Wang, Joel Parker, Ivan Rusyn, and David Threadgill, Proceedings of the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), 2007.
  2. Sample selection for maximal diversity, by Feng Pan, Adam Roberts, Leonard McMillan, Fernando Pardo Manuel de Villena, David Threadgill, and Wei Wang, 2007 IEEE International Conference on Data Mining (ICDM’07).
  3. The polymorphism architecture of mouse genetic resources elucidated using genome-wide resequencing data: implications for QTL discovery and systems genetics, by Adam Roberts, Fernando Pardo-Manuel de Villena, Wei Wang, Leonard McMillan, and David W. Threadgill, Mammalian Genome, 2007.
  4. CRD: fast co-clustering on large datasets utilizing sample-based matrix decomposition, by Feng Pan, Xiang Zhang, and Wei Wang, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2008, p. 173.
  5. Poclustering: lossless clustering of dissimilarity data, by Jinze Liu, Qi Zhang, Wei Wang, Leonard McMillan, and Jan Prins, Proceedings of 2007 SIAM International Conference on Data Mining (SDM2007), 2007.
  6. Mining approximate order preserving clusters in the presence of noise, by Mengsheng Zhang, Wei Wang, and Jinze Liu, Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE), 2008, p. 160.
  7. Accelerating Profile Queries in Elevation Maps, by Feng Pan, Wei Wang, and Leonard McMillan, International Conference on Data Engineering (ICDE 2007), 2007.
  8. CARE: Finding Local Linear Correlations in High Dimensional Data, by Xiang Zhang, Feng Pan, and Wei Wang, Proceedings of 2008 International Conference on Data Engineering (ICDE’08).
  9. Split-order distance for clustering and classification hierarchies, by Q. Zhang, E. Y. Liu, A. Sarkar, and W. Wang, Proceedings of the 21st International Conference on Scientific and Statistical Database Management (SSDBM), 2009, p. 517.
  10. REDUS: finding reducible subspaces in high dimensional data, by Xiang Zhang, Feng Pan, and Wei Wang, Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM’08).