Projects

HiDimViewer

HiDimViewer is a visualization tool we are developing for high-dimensional datasets. It is designed to be used as an interactive data exploration tool to aid scientists in selecting and observing clusters in high-dimensional data. More...

NPUTE

NPUTE is an efficient data structure we have developed for finding pair-wise haplotype similarity. Its simplicity can improve speed and make exhaustive searches over multiple parameters practical. More...

Genetic Diversity of Mus musculus Laboratory Strains

Mouse genetic resources include inbred strains, recombinant inbred lines, chromosome substitution strains, heterogeneous stocks and the Collaborative Cross (CC). These resources were generated through various breeding designs that potentially produce different genetic architectures, including the level of diversity represented, the spatial distribution of the variation and the allele frequencies within the resource. By combining sequencing data for sixteen inbred strains and the recorded history of related strains, the architecture of genetic variation in mouse resources was determined. The most commonly used resources harbor only a fraction of Mus musculus genetic diversity, and that diversity is not uniformly distributed, resulting in many blind spots. Only resources that include wild-derived inbred strains from subspecies other than M. m. domesticus have no blind spots and a uniform distribution of the variation. Unlike other resources that are primarily suited for gene discovery, the CC is the only resource that can support genome-wide network analysis, which is the foundation of systems genetics. More...

snpBrowser

snpBrowser is an application designed to analyze and visualize the immense SNP datasets that are currently available. It provides modes for analyzing genetic diversity, marker segregation, strain selection, and QTL mapping. More...

Full-Genome SNP Compatibility

Recent studies suggest that mammalian genomes can be subdivided into segments within which there is limited haplotype diversity. Understanding the distribution and structure of these blocks will help to unravel many biological problems, including identifying genes associated with complex diseases, finding the ancestral origins of a given population, and localizing regions of historical recombination and homoplasy. We are developing methods for partitioning a genome into blocks for which there are no apparent recombinations, thus providing parsimonious sets of compatible genome intervals based on the four-gamete test. We have developed theory and methods for dividing a genome into compatible intervals and also developed the notion of an interval set that achieves an interval lower-bound, yet maximizes interval overlap. More...
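
As a rough illustration of the four-gamete test, the Python sketch below checks whether two biallelic SNP sites show all four allele combinations (which signals apparent recombination or homoplasy) and then scans left to right to cut the sequence into non-overlapping compatible intervals. It assumes phased haplotypes given as equal-length strings of '0' and '1' with no missing data; the function names are hypothetical, and the greedy scan is a simple stand-in, not the overlap-maximizing interval set described above.

```python
def four_gamete_compatible(site_a, site_b):
    """Return True if two SNP columns show fewer than four distinct gametes."""
    gametes = set(zip(site_a, site_b))
    return len(gametes) < 4


def greedy_compatible_intervals(haplotypes):
    """Split sites into consecutive intervals with no four-gamete conflicts.

    haplotypes: list of equal-length strings over '0'/'1' (an assumption).
    Returns a list of (first_site, last_site) index pairs, inclusive.
    """
    if not haplotypes or not haplotypes[0]:
        return []
    n_sites = len(haplotypes[0])
    # Transpose rows (haplotypes) into columns (sites).
    columns = [[h[j] for h in haplotypes] for j in range(n_sites)]
    intervals = []
    start = 0
    for j in range(1, n_sites):
        # Close the current interval as soon as site j conflicts with any site in it.
        conflict = any(
            not four_gamete_compatible(columns[i], columns[j])
            for i in range(start, j)
        )
        if conflict:
            intervals.append((start, j - 1))
            start = j
    intervals.append((start, n_sites - 1))
    return intervals


if __name__ == "__main__":
    haps = ["000", "010", "100", "111"]
    print(greedy_compatible_intervals(haps))
```

In this toy input, sites 0 and 1 together show all four gametes, so the scan places a breakpoint between them, while sites 1 and 2 remain in one interval.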

Tree-based Genome-wide Association Mapping

The goal of genome-wide association (GWA) mapping in modern genetics is to identify genes or narrow regions in the genome that contribute to genetically complex phenotypes such as morphology or disease. Among the existing methods, tree-based association mapping methods have clear advantages over single marker-based and haplotype-based methods because they incorporate information about the evolutionary history of the genome into the analysis. However, existing tree-based methods are designed primarily for binary phenotypes derived from case/control studies or fail to scale genome-wide. In this project, we developed TreeQA, a quantitative GWA mapping algorithm. TreeQA utilizes local perfect phylogenies constructed in genomic regions exhibiting no evidence of historical recombination. Through careful algorithm design and implementation, TreeQA conducts quantitative genome-wide association analysis efficiently and is more effective than the previous methods. More...

Strain Sequence Identity Interval Viewer

Strain Sequence Identity (SSI) Interval Viewer is a web application that allows the user to choose a subset of mouse strains from a list. When the Update button is clicked, the image at the bottom of the page is updated to display every sequence identity interval that includes all of the selected strains. If no strains are selected, all sequence identity intervals are displayed. More...
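
The selection rule described above can be sketched as a subset test. The record layout below (a dictionary with "start", "end" and a "strains" set) and the function name are assumptions made for illustration; they are not the viewer's actual data model.

```python
def intervals_for_selection(intervals, selected):
    """Keep the intervals whose strain set contains every selected strain.

    intervals: list of dicts with a "strains" set (hypothetical layout).
    selected: iterable of strain names chosen by the user.
    """
    wanted = frozenset(selected)
    # The empty set is a subset of every set, so an empty selection
    # returns all intervals, matching the behavior described above.
    return [iv for iv in intervals if wanted <= iv["strains"]]


if __name__ == "__main__":
    demo = [
        {"start": 100, "end": 200, "strains": {"A", "B"}},
        {"start": 300, "end": 400, "strains": {"B", "C"}},
    ]
    print(intervals_for_selection(demo, ["A", "B"]))
```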

Collaborative Cross Simulator

The Collaborative Cross Simulator will provide both data and visual simulations for the Collaborative Cross experiment. The simulator creates synthetic founder mice and breeds them in the CC Funnel scheme to produce a synthetic inbred line of mice. Using computer-simulated cross-over and breeding events, the G1, G2, and G2:Fx mice are created. The simulator will provide a powerful tool for the community by allowing them to generate synthetic lines and populations. Using these synthetic mice, researchers can compare actual mouse data against statistically neutral and random data. More...

Inferring Genome-wide Mosaic Structure

In this project, we study the Minimum Mosaic Problem: given a set of genome sequences from individuals within a population, compute a mosaic structure containing the minimum number of breakpoints. This mosaic structure provides a good estimation of the minimum number of recombination events (and their location) required to generate the existing haplotypes in the population. We solve this problem by finding the shortest path in a directed graph. Our algorithm’s efficiency permits genome-wide analysis. More...

Statistical Significance of Clustering

Clustering methods provide a powerful tool for the exploratory analysis of High Dimension, Low Sample Size (HDLSS) datasets such as gene expression microarray data. A fundamental statistical issue of clustering is which clusters are really there, as opposed to being artifacts of the natural sampling variation. In this project, SigClust is proposed as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p-values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model.

Flexible Large Margin Classifiers

Classification has become increasingly important as a means for facilitating information extraction. The invention of the Support Vector Machine (SVM) in machine learning has provided a new learning system based on the margin concept. As a product of optimization techniques and flexible functional estimation, such as learning in Reproducing Kernel Hilbert Spaces, the SVM and other large-margin classifiers have been providing good classification accuracy for complex data, especially high-dimensional genomic data. In this research direction, we propose several new large-margin classifiers which are expected to yield highly competitive classification accuracy, class probability estimation, and variable selection.

Computational Models for Biological Signaling Networks

Understanding how biological systems work requires knowledge of their component parts, how those parts are connected together, and how those connected parts then work together as part of a dynamic evolving system. With the advent of large-scale sequencing projects, a catalog of molecular parts has started to be compiled. While the molecular wiring between these parts has been under investigation by biologists for decades, only recently have comprehensive studies of connectivity and dynamics become possible. Although modeling and analysis of these systems are still in their infancy, systems approaches have significant potential in helping us to understand both biological function and dysfunction. Our primary research is in the area of computational systems biology, with particular interest in the study of biological signaling networks and in understanding their structure, evolution and dynamics. In collaboration with wet lab experimentalists, we develop and apply computational models, including probabilistic graphical and multivariate methods along with more traditional engineering approaches such as system identification and control theory. Most relevant to the systems genetics work of this program, we are actively pursuing several computational studies focused on the reconstruction of gene-regulatory and protein interaction networks, network comparison methodologies, and network analysis methods for use in the identification of genes most relevant to the presentation of a given phenotype. We have recently published our method for tree/network comparison and will be using it in phylogeny-based association studies as well as gene/protein network comparison and substructure identification. We are also completing work describing a novel approach for identification and prioritization of potentially relevant genes identified from eQTL or other studies.
Furthermore, in collaboration with Daniel Pomp and Fernando Pardo-Manuel de Villena, we have initiated systems-level studies of metabolism in mice derived from the Collaborative Cross, with particular emphasis on linking genetic variation with metabolic function. In this work, we are developing mechanistic models of metabolism that can be used to help link genetic variation to mechanistic explanations for changes in biological function. More...

Genotype Sequence Segmentation

Recombination plays an important role in shaping the genetic variations present in current-day populations. We consider populations evolved from a small number of founders, where each individual genomic sequence is composed of segments from the founders. In this project, we study the problem of segmenting the genotype sequences into the minimum number of segments attributable to the founder sequences. The minimum segmentation can be used for inferring the relationship among sequences to identify the genetic basis of traits, which is important for disease association studies. We propose two dynamic programming algorithms which can solve the minimum segmentation problem in polynomial time. Our algorithms incorporate biological constraints to greatly reduce the computation, and guarantee that only minimum segmentation solutions with comparable numbers of segments on both haplotypes of the genotype sequence are computed. Our algorithms can also work on noisy data including genotyping errors, point mutations, gene conversions, and missing values. More...
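
To make the segmentation idea concrete, here is a minimal dynamic programming sketch for a much simpler setting than the one described above: a single phased haplotype, known founder sequences, and noise-free data. The function name and the string encoding are assumptions for illustration. It does not handle unphased genotypes, the balance constraint between the two haplotypes, or genotyping errors, all of which the project's algorithms address.

```python
def min_segments(founders, haplotype):
    """Minimum number of founder segments needed to spell out a haplotype.

    founders: list of strings, all the same length as haplotype (an assumption).
    Returns an int, or None if some site matches no founder.
    Runs in O(sites * founders) time.
    """
    n = len(haplotype)
    if n == 0:
        return 0
    inf = float("inf")
    # cost[f] = fewest segments covering sites 0..j with founder f used at site j.
    cost = [1 if f[0] == haplotype[0] else inf for f in founders]
    for j in range(1, n):
        best_prev = min(cost)
        cost = [
            # Either continue the current segment from founder f,
            # or start a new segment after the cheapest previous state.
            min(cost[i], best_prev + 1) if f[j] == haplotype[j] else inf
            for i, f in enumerate(founders)
        ]
    best = min(cost)
    return None if best == inf else best


if __name__ == "__main__":
    print(min_segments(["0000", "1111"], "0011"))
```

With founders "0000" and "1111", the haplotype "0011" needs one switch between founders, so the sketch reports two segments.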

FastANOVA: an Efficient Algorithm for Genome-Wide Association Study

In this project, we study the problem of finding SNP-pairs that have significant associations with a given quantitative phenotype. We propose an efficient algorithm, FastANOVA, for performing ANOVA tests on SNP-pairs in a batch mode, which also supports large permutation tests. FastANOVA only needs to perform the ANOVA test on a small number of candidate SNP-pairs without the risk of missing any significant ones. More...
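
For context, the sketch below shows the brute-force baseline that a method like FastANOVA is meant to avoid: computing an ANOVA F-statistic for every SNP-pair. It assumes the test is a one-way ANOVA over the groups formed by the joint genotypes of the two SNPs, which is a simplification chosen for illustration; the function names are hypothetical, and none of FastANOVA's candidate pruning is reproduced here.

```python
from itertools import combinations


def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of groups of phenotype values."""
    groups = [g for g in groups if g]
    n = sum(len(g) for g in groups)
    k = len(groups)
    if k < 2 or n <= k:
        return None  # not enough groups or degrees of freedom
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    if ss_within == 0:
        return float("inf")  # groups are perfectly separated
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def scan_pairs(snps, phenotype):
    """Naively test every SNP-pair; cost grows quadratically with the SNP count.

    snps: list of genotype lists, one list per SNP, one entry per individual.
    phenotype: list of quantitative phenotype values, one per individual.
    """
    results = {}
    for a, b in combinations(range(len(snps)), 2):
        groups = {}
        # Group individuals by their joint genotype at SNPs a and b.
        for ga, gb, y in zip(snps[a], snps[b], phenotype):
            groups.setdefault((ga, gb), []).append(y)
        results[(a, b)] = f_statistic(list(groups.values()))
    return results


if __name__ == "__main__":
    snps = [[0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1]]
    print(scan_pairs(snps, [1, 2, 3, 4, 5, 6]))
```

A permutation test would repeat this scan on shuffled phenotype values, which is why avoiding the exhaustive pass matters at genome scale.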