Full-Genome SNP Compatibility
Genome-Wide Compatibility Region Viewer
Introduction
The local block structure of haplotypes within a population sheds light on many important biological questions. Haplotype blocks are central to quantifying and localizing recombinations (both recent and historical), are widely used to identify maximally informative marker sets [38], and are essential building blocks for constructing genetic maps. Haplotype-block structure also underlies many methods for genome-wide association studies, provides fundamental biochemical evidence for genetic selection, and offers a tool for ascertaining the ancestral origins of a population.
The task of decomposing a genome into meaningful blocks, however, has proven to be ill-defined, inconsistent, and often ambiguous. In part, the problem lies in the ad hoc definition of what constitutes a haplotype block: blocks are often defined to serve a specific purpose. Examples include the minimum number of tagging SNPs sufficient to capture the majority of haplotypes, intervals of SNPs surrounding core SNPs that exceed a given threshold of Linkage Disequilibrium (LD), and maximal regions whose haplotype diversity falls below a threshold. Partitioning haplotypes into blocks that support perfect phylogenies, and the related selection of blocks lacking evidence of recombination, have been used to support genotype phasing and to construct Ancestral Recombination Graphs (ARGs).
We propose an unambiguous definition for haplotype blocks and efficient methods for computing them. Where ambiguity is unavoidable, we have uncovered properties that are common to all solutions. Our haplotype block definition directly supports, and has been used for, association mapping, construction of genetic maps, and determining the ancestral origins within local genomic regions.
We assume the availability of haplotype data, which is problematic for human genotypes. However, dense SNP data sets that are homozygous at every locus are readily available for many inbred mammal and plant models commonly used for association mapping. Such data sets do not need to be phased, but it is still important to identify haplotype blocks in order to explore their local diversity structure and ancestral origins. Like others, we choose our haplotype blocks for their lack of evidence of historical recombination.
We define SNP compatibility in terms of the Four-Gamete Test (FGT). The FGT is of particular interest because of its close relation to perfect phylogeny. Specifically, a necessary and sufficient condition for a perfect phylogeny is that all pairs of SNPs satisfy the FGT. We partition the genome into a set of potentially overlapping, maximal compatible intervals, each of which admits a perfect phylogeny, and whose union covers the full data set. We address the question of the fewest such intervals required, and we also identify suspect SNPs whose removal reduces the overall complexity of the haplotype-block structure (perhaps indicating genotyping errors, homoplasy, or gene conversions).
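The pairwise test described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results on this page; it assumes biallelic SNPs encoded as 0/1 with no missing calls, and the function names `four_gamete_compatible` and `admits_perfect_phylogeny` are hypothetical.

```python
from itertools import combinations


def four_gamete_compatible(snp_a, snp_b):
    """Return True if two biallelic SNPs pass the Four-Gamete Test.

    Each SNP is a sequence of 0/1 alleles, one entry per haplotype
    (assumption: no missing data). The pair fails the test only when
    all four gametes 00, 01, 10, and 11 are observed.
    """
    gametes = set(zip(snp_a, snp_b))
    return len(gametes) < 4


def admits_perfect_phylogeny(snps):
    """Return True if every pair of SNPs in the list passes the FGT."""
    return all(four_gamete_compatible(a, b) for a, b in combinations(snps, 2))
```

For example, the SNPs `[0, 0, 1, 1]` and `[0, 1, 0, 1]` exhibit all four gametes across four haplotypes and are therefore incompatible, whereas `[0, 0, 1, 1]` and `[0, 0, 0, 1]` exhibit only three and are compatible.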
Our contribution is an analysis of the problem of dividing a genome into compatible intervals and of its computational complexity. We provide an achievable lower bound on the number of such intervals. While in general there are numerous ways of dividing a genome into a minimum number of compatible intervals (a fact overlooked by others), we also identify non-overlapping core subintervals that are common to all valid solutions. Finally, we define a specific interval set that achieves the interval lower bound yet maximizes the block overlap, thus minimizing the number of perfect phylogeny trees while providing the richest possible set of contributing SNPs to each tree.
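One simple way to see why the lower bound on the number of intervals is achievable is a greedy left-to-right scan: extend the current interval until the next SNP conflicts with some SNP already in it, then start a new interval. Because every subinterval of a compatible interval is itself compatible, such a scan uses as few intervals as any cover can. The sketch below is an illustrative assumption rather than the method used here: it returns one non-overlapping solution only, does not compute the core subintervals or the maximally overlapping interval set, and uses a straightforward quadratic scan. The name `greedy_compatible_intervals` is hypothetical, and SNPs are assumed to be 0/1 encoded with no missing calls.

```python
def four_gamete_compatible(snp_a, snp_b):
    """Two 0/1-encoded SNPs pass the FGT unless all four gametes occur."""
    return len(set(zip(snp_a, snp_b))) < 4


def greedy_compatible_intervals(snps):
    """Split SNP indices into left-to-right maximal compatible runs.

    `snps` is a list of SNPs in genomic order, each a sequence of 0/1
    alleles with one entry per haplotype. Returns a list of inclusive
    (start, end) index pairs; within each pair every two SNPs pass
    the Four-Gamete Test.
    """
    intervals = []
    start = 0
    for j in range(1, len(snps)):
        # Close the current interval as soon as SNP j conflicts with
        # any SNP already inside it.
        conflict = any(
            not four_gamete_compatible(snps[i], snps[j])
            for i in range(start, j)
        )
        if conflict:
            intervals.append((start, j - 1))
            start = j
    if snps:
        intervals.append((start, len(snps) - 1))
    return intervals
```

On the four SNPs `[0, 0, 1, 1]`, `[0, 0, 0, 1]`, `[0, 1, 0, 1]`, `[0, 1, 1, 1]`, the scan closes the first interval when the third SNP conflicts with the first, giving the intervals `(0, 1)` and `(2, 3)`.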

Figure: Compatibility matrix for Chromosome Y (1420 SNPs).
A movie illustrating genome-scale compatibility.
Try our demo online.
Research Sponsor
NSF IIS 0534580: “Visualizing and Exploring High-dimensional Data”
NIH GM 076468: “The Center for Genome Dynamics at Jackson Laboratory: An NIGMS National Center of Systems Biology”