Modern high-throughput genotyping techniques produce numerous missing calls that confound subsequent analyses, such as disease association studies. Common remedies for this problem include removing affected markers and/or samples or, alternatively, inferring the missing data. On small marker sets, imputation is frequently based on a vote of the K-nearest-neighbor (KNN) haplotypes, but this technique is neither practical nor justifiable for large datasets.

We have developed a data structure called the mismatch accumulator array (MAA) that supports efficient KNN queries over arbitrarily sized sliding haplotype windows, and we employ it for genotype imputation.
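The idea of accumulating mismatches so that any window can be queried cheaply can be illustrated with a cumulative-sum array. The following is a minimal sketch under stated assumptions, not NPUTE's actual implementation: the names `MISSING`, `build_accumulator`, and `impute` are hypothetical, alleles are assumed to be encoded as small integers with -1 for a missing call, the input is assumed to have one row per SNP and one column per strain, and a single nearest neighbor (K = 1) is used for brevity.

```python
import numpy as np

MISSING = -1  # hypothetical encoding for a missing call


def build_accumulator(haps):
    """Return acc with acc[s, i, j] = mismatches between strains i and j over SNPs [0, s).

    haps is an (n_snps, n_strains) integer array; positions where either
    strain has a missing call are not counted as mismatches.
    """
    n_snps, n_strains = haps.shape
    a = haps[:, :, None]
    b = haps[:, None, :]
    mismatch = (a != b) & (a != MISSING) & (b != MISSING)
    acc = np.zeros((n_snps + 1, n_strains, n_strains), dtype=np.int64)
    acc[1:] = np.cumsum(mismatch, axis=0, dtype=np.int64)
    return acc


def impute(haps, acc, snp, strain, half_width):
    """Copy the call at `snp` from the strain closest to `strain` in the window."""
    lo = max(0, snp - half_width)
    hi = min(haps.shape[0], snp + half_width + 1)
    # Window mismatch count is a difference of two accumulator rows;
    # the target SNP itself is excluded from the distance.
    dist = (acc[hi, strain] - acc[lo, strain]) - (acc[snp + 1, strain] - acc[snp, strain])
    dist = dist.astype(float)
    # Only strains with a known call at this SNP (other than the target) can donate.
    known = haps[snp] != MISSING
    known[strain] = False
    if not known.any():
        return MISSING
    dist[~known] = np.inf
    return haps[snp, int(np.argmin(dist))]


# Toy panel: strain 1 agrees with strain 0 wherever it is known
# and has a missing call at SNP 2.
haps = np.array([
    [0, 0, 1],
    [1, 1, 0],
    [0, MISSING, 1],
    [1, 1, 0],
    [0, 0, 1],
])
acc = build_accumulator(haps)
imputed = impute(haps, acc, snp=2, strain=1, half_width=2)
print(imputed)
```

Because each window distance is a subtraction of two precomputed rows, changing the window size does not require rescanning the haplotypes. This sketch stores all strain pairs at every SNP, which is affordable only for small panels.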

The performance of our method enables exhaustive exploration over all window sizes and known sites in large (150K, 8.3M) SNP panels. The graph below shows NPUTE's performance on 150K SNPs with 46 strains (65 µs per imputation, ~7.5 minutes for the entire dataset).
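The two timing figures agree with each other if every genotype entry in the 150K panel is imputed once. That is an assumption made here for the arithmetic; the text does not state how many imputations were run.

```python
# Assumption: one imputation per (SNP, strain) entry in the panel.
n_snps = 150_000
n_strains = 46
seconds_per_imputation = 65e-6  # 65 microseconds

total_calls = n_snps * n_strains
total_minutes = total_calls * seconds_per_imputation / 60
print(total_calls, round(total_minutes, 2))
```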

More information about the underlying methods used in our viewer can be found in [Roberts2007a] and the [NPUTE Presentation at ISMB 2007].

The NPUTE Python source code is available packaged in a .zip archive here, with usage instructions here.

Research Sponsors

NSF IIS 0448392: “CAREER: Mining Salient Localized Patterns in Complex Data”
NSF IIS 0534580: “Visualizing and Exploring High-dimensional Data”
NIH U01 CA105417: “Integrative Genetics of Cancer Susceptibility”
EPA STAR RD832720: “Environmental Bioinformatics Research Center to Support Computational Toxicology Applications”

2 comments on “NPUTE”
  1. vsp_123 says:


    How can I obtain a haplotype format from regular Merlin-format genotype data (below), so that I can use NPUTE to impute missing data?

    Ind ID, Family ID, Father ID, Mother ID, Sex, Disease status, marker 1, marker 2, …


  2. adarob says:


    You simply need to remove all columns before the markers and make sure that the file is formatted as a comma-separated (csv) file. Once the values have been imputed, you can add the ID columns back.
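The column handling described in this reply could be sketched as follows. This is an illustrative sketch, not part of NPUTE: the helper names (`split_ids`, `rejoin_ids`, `to_csv_text`) are hypothetical, the six leading columns are taken from the layout listed in the question, and the input is assumed to be whitespace-delimited. The sketch does not transpose the data or recode alleles, so check the usage instructions for the orientation and encoding NPUTE expects.

```python
import csv
import io

N_ID_COLS = 6  # assumption: the six leading columns listed in the question


def split_ids(rows, n_id_cols=N_ID_COLS):
    """Separate the leading ID columns from the marker columns."""
    ids = [row[:n_id_cols] for row in rows]
    markers = [row[n_id_cols:] for row in rows]
    return ids, markers


def rejoin_ids(ids, imputed_markers):
    """Put the saved ID columns back in front of the imputed marker columns."""
    if len(ids) != len(imputed_markers):
        raise ValueError("row counts differ between IDs and imputed markers")
    return [id_cols + marker_cols for id_cols, marker_cols in zip(ids, imputed_markers)]


def to_csv_text(rows):
    """Format rows as comma-separated text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Toy input: two individuals, two markers, "?" standing in for a missing call.
lines = [
    "ind1 fam1 0 0 1 2 A C",
    "ind2 fam1 0 0 2 1 A ?",
]
rows = [line.split() for line in lines]
ids, markers = split_ids(rows)
marker_csv = to_csv_text(markers)  # marker-only CSV to hand to the imputation step
restored = rejoin_ids(ids, markers)  # in practice, pass the imputed markers here
```

Keeping the ID columns in a separate list, in the same row order, is what makes it safe to add them back after imputation.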