Modern high-throughput genotyping techniques produce numerous missing calls that confound subsequent analyses, such as disease association studies. Common remedies for this problem include removing the affected markers and/or samples or, alternatively, inferring the missing data. On small marker sets, imputation is frequently based on a vote of the K-nearest-neighbor (KNN) haplotypes, but this technique is neither practical nor justifiable for large datasets.
We have developed a data structure, the mismatch accumulator array (MAA), that supports efficient KNN queries over arbitrarily sized, sliding haplotype windows, and we employ it for genotype imputation.
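The idea can be illustrated with a minimal sketch. This is not the NPUTE implementation: it assumes the accumulator stores, for each pair of haplotypes, a running count of mismatches along the marker positions, so that the mismatch count over any window is the difference of two entries. The function names, the '?' symbol for a missing call, and the toy haplotypes are all hypothetical.

```python
from collections import Counter
from itertools import combinations

MISSING = "?"  # assumed symbol for a missing call


def build_mismatch_accumulators(haplotypes):
    """For each pair (i, j), store running mismatch counts.

    acc[(i, j)][p] is the number of mismatches between haplotypes
    i and j over sites 0 .. p-1 (sites with a missing call are skipped).
    """
    n_sites = len(haplotypes[0])
    acc = {}
    for i, j in combinations(range(len(haplotypes)), 2):
        running = [0] * (n_sites + 1)
        for p in range(n_sites):
            a, b = haplotypes[i][p], haplotypes[j][p]
            mismatch = a != MISSING and b != MISSING and a != b
            running[p + 1] = running[p] + int(mismatch)
        acc[(i, j)] = running
    return acc


def window_mismatches(acc, i, j, start, stop):
    """Mismatches between i and j over sites start .. stop-1, in O(1)."""
    if i == j:
        return 0
    running = acc[(i, j) if i < j else (j, i)]
    return running[stop] - running[start]


def impute(haplotypes, acc, sample, site, half_window, k=1):
    """Vote among the k haplotypes nearest to `sample` around `site`."""
    n_sites = len(haplotypes[0])
    start = max(0, site - half_window)
    stop = min(n_sites, site + half_window + 1)
    candidates = []
    for other in range(len(haplotypes)):
        if other == sample or haplotypes[other][site] == MISSING:
            continue
        dist = window_mismatches(acc, sample, other, start, stop)
        # Leave the target site itself out of the distance.
        dist -= window_mismatches(acc, sample, other, site, site + 1)
        candidates.append((dist, other))
    candidates.sort()
    votes = Counter(haplotypes[o][site] for _, o in candidates[:k])
    return votes.most_common(1)[0][0] if votes else MISSING


# Toy data: the last haplotype has a missing call at site 2.
haplotypes = ["AACGT", "AACGT", "TTGCA", "AA?GT"]
acc = build_mismatch_accumulators(haplotypes)
print(impute(haplotypes, acc, sample=3, site=2, half_window=2, k=1))
```

Because each window query reduces to one subtraction, changing the window size or sliding the window does not require rescanning the haplotypes, which is the property that would make trying many window sizes affordable. Storing one array per pair is a simplification chosen for readability here.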
The performance of our method enables exhaustive exploration over all window sizes and known sites in large (150K, 8.3M) SNP panels. The graph below shows NPUTE's performance on 150K SNPs with 46 strains (65 µs per imputation, ~7.5 minutes for the entire dataset).
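As a rough consistency check on the quoted timings, the short calculation below multiplies the per-imputation cost by the number of genotype calls. It assumes that each of the 150K × 46 calls is imputed exactly once, which is an interpretation of the figures rather than something stated here; the variable names are illustrative.

```python
# Figures quoted in the text.
n_snps = 150_000
n_strains = 46
microseconds_per_imputation = 65

# Assumption: every genotype call is imputed exactly once.
n_imputations = n_snps * n_strains
total_seconds = n_imputations * microseconds_per_imputation / 1_000_000
total_minutes = total_seconds / 60

print(f"{n_imputations:,} imputations, {total_minutes:.1f} minutes")
```

Under that assumption the total comes to roughly 7.5 minutes, in line with the figure reported for the entire dataset.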
The NPUTE Python source code is available packaged in a .zip archive here, with usage instructions here.
NSF IIS 0448392: “CAREER: Mining Salient Localized Patterns in Complex Data”
NSF IIS 0534580: “Visualizing and Exploring High-dimensional Data”
NIH U01 CA105417: “Integrative Genetics of Cancer Susceptibility”
EPA STAR RD832720: “Environmental Bioinformatics Research Center to Support Computational Toxicology Applications”