Program Overview

Over the last decade, advances have been made in our understanding of how genes influence phenotypes and contribute to common disease. It has become increasingly clear that the underlying mechanisms are complex: observed clinical outcomes result from a diverse range of causes interconnected through networks of genetic, biological and environmental interactions. Most of the advances have been achieved in our ability to measure phenotypes, to understand gene function through single-gene mutagenesis and to identify individual genetic variants associated with common diseases. However, the experimental basis for understanding how traits with complex polygenic and gene-environment etiologies are interconnected has so far lagged behind, largely because too little coordinated effort has been devoted to designing and deploying user-friendly computational approaches for data integration, analysis and interpretation. The historic one-gene-at-a-time approach is insufficient when faced with a mammalian genome, which contains more than 30,000 genes and a nearly limitless amount of combinatorial variation. A far more efficient approach is factorial experimentation, a process by which many or all components of a complex system are altered simultaneously through genetic randomization. This approach allows all genes to be analyzed concurrently, illuminating the underlying biology well enough to synthetically reassemble biological knowledge, which is essential for making accurate biological predictions, the cornerstone of personalized medicine for common diseases. Without this capability, it is impossible to predict or alter the probability of developing common but complex diseases, such as cancer, diabetes, infections, and metabolic syndrome.

A clear picture of biological complexity is available only through the development and efficient deployment of innovative computational and statistical tools. Because of the tremendous technological advances enabled by the genome projects, we have entered an era where progress in understanding complex biological systems is limited only by our creativity and by the development of new approaches to integrate, analyze and ultimately interpret high-dimensional data. To overcome current deficiencies, we have assembled a stellar cast of scientists from the UNC Computational Genetics Research Group (CompGen Group) to pioneer a new paradigm in experimental research. Our revolutionary approach reverses the normal paradigm of integrative programs. Rather than restricting the computational sciences to support roles that assist biological projects, we propose a “reverse” program in which knowledge advancement is driven by innovative computational projects supported by a biological core that provides both high-content data and the biological validation needed to iteratively improve the accuracy of the computational methods developed by each project. This new paradigm, applied to the nascent field of systems genetics, will elucidate the biological interactions and engineering principles of complex biological systems, using genetic control of energy balance as the exemplar. Energy balance is a complex biological process that drives a variety of diseases and disease outcomes, from diabetes to cancer. This innovative “reverse” program is focused on developing the new computational methods that will be required to support the future development of predictive biology and personalized medicine.

Structure of the program

Figure 1. Relationship between projects and cores.

Within this “reverse” program project application, we propose to develop innovative tools, integrated into a common biological platform, that support the analysis of complex biological data and advance the integrated field of systems genetics. Systems genetics is a non-reductionist field based on the integration of large-scale systems biology experiments with the genetic diversity of populations. In this program, we have brought together computer scientists, statisticians, engineers, mathematicians, geneticists and biologists to combine their individual expertise in a synergistic program with much greater capability than the sum of its individual projects (Figure 1).

The program is organized around four computationally intensive research projects focusing on large-scale data mining, statistical learning, network models, and data visualization and interaction, with the goal of pioneering an integrated computational platform to elucidate the mechanics of complex biological systems. Three cores (administration; data generation and validation; and high-performance computing, statistics and software engineering) will support the research projects. As will be evident from the following project and core descriptions, this program has a highly integrated structure.

Analytical approaches that integrate across multiple high-throughput experiments are required to dissect the genetic and environmental interactions contributing to diseases. These next-generation integrative tools include new data-mining methods that address heterogeneous data types (Project 1), new statistical methods that make predictions based on the analysis of diverse data collections (Project 2), and the derivation of analytical models for biological systems, pathways, and networks (Project 3). It is also necessary to close the loop by providing biologists with intuitive visualization interfaces for interpreting data and the outcomes of analyses (Project 4). Finally, these tools require experimental validation and large amounts of high-quality data: high-density DNA genotypes, global transcriptional (RNA) profiling across multiple tissues, analysis of intermediary metabolites in blood, and detailed life-course clinical phenotypes.

Figure 2. Overview of data resources. From top to bottom, the figure depicts the origin of the mice used to derive the lines that generated the four populations, the number of generations involved in each cross, a graphical overview of the genotypes and the size of the existing populations, and finally the phenotypes measured. Note that reshuffling of genotypes in the experimental populations is related to the number of generations involved in breeding. Arrows represent breeding, and circles in the arrows represent the number of generations. Red arrows represent selective breeding, with the selection criteria shown in red italics. Green arrows represent full-sib matings used to inbreed strains. Black arrows represent standard crosses. Phenotypes are shown in the lower boxes, and colored lines from each population indicate whether the phenotype has already been measured in that cross. Open boxes represent physiological phenotypes and pink boxes represent molecular phenotypes.

The mouse populations of Core B (Figure 2) cumulatively represent an extraordinary biological resource in terms of the range and diversity of biomolecular measurements. They provide us with a glimpse into the future of medical informatics, as well as a unique opportunity to develop the required computational and analytical methods ahead of the oncoming data flood. These resources include controlled environmental exposures, extensive life-course clinical phenotyping, and diverse biomolecular measurements (gene expression profiling across multiple tissues, metabolites, etc.) in the context of a reference genetic population whose complexity is comparable to that of humans. We propose to exploit this resource to develop the integrative computational and analytical methods that will be required in the next decade. Equally important to the successful implementation of our objectives are the world-class high-performance computational infrastructure and the web services for distributing outcomes that will be provided by Core C. Our methods and principles will be applicable across a wide range of biomedical domains. They provide users with a new paradigm for learning from multiple data types and for discovering and exploring complex networks; enable the interactive examination of these relationships; and aid in the generation and evaluation of hypotheses.