Abstract
With the emergence of population-scale whole-genome sequencing (WGS), rare variants can be captured precisely. Studying rare variants explains part of the heritability of complex traits that is overlooked by conventional genome-wide association studies (GWASs). However, the extent to which imputed data can approximate or improve upon the power of WGS data in rare variant association studies remains unclear. Using the UK Biobank WGS data (n = 150,119) as the ground truth, we first evaluated the consistency of rare variants in single-nucleotide polymorphism (SNP) array data imputed using the TOPMed or HRC+UK10K reference panel. Imputation quality (average R²) of the TOPMed-imputed data reached 0.6 even for extremely rare variants with minor allele count ≤ 5. TOPMed-imputed data were also closer to WGS data across three ethnic groups, with average Cramér's V > 0.75. We then performed association tests on 45 traits. At the same sample size (n = 150,119), neither imputed dataset outperformed the WGS data, but the results from the TOPMed-imputed data were more consistent with those from the WGS data. When the sample size was increased to 488,377, the number of significant rare variants identified from the TOPMed-imputed data increased by 27.71% for quantitative traits and by approximately 10-fold for binary traits. Finally, we meta-analyzed the SNP array and WGS association results for lung cancer and for epithelial ovarian cancer separately. These meta-analyses identified more significant variants and genes than the WGS-based analyses alone. Our findings highlight that incorporating rare variants imputed using large-scale sequencing populations can boost the power of rare variant association studies when WGS sample sizes are limited.
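For readers unfamiliar with the two concordance measures named above, the following is a minimal sketch of how they might be computed for a single variant. It is not the authors' pipeline. It assumes WGS genotypes coded as 0/1/2 alternate-allele counts, imputed hard calls on the same coding for Cramér's V, and imputed dosages in [0, 2] for R², taken here as the squared Pearson correlation with the WGS genotype. The function names `cramers_v` and `dosage_r2` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(truth, called):
    """Cramér's V between two equal-length sequences of genotype classes.

    Assumption: both inputs are hard-call genotypes (e.g. 0/1/2).
    Returns NaN when either sequence has a single class (V is undefined).
    """
    if len(truth) != len(called) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    joint = Counter(zip(truth, called))   # observed contingency table
    row, col = Counter(truth), Counter(called)
    k = min(len(row), len(col))
    if k < 2:
        return float("nan")
    chi2 = 0.0
    for r, n_r in row.items():
        for c, n_c in col.items():
            expected = n_r * n_c / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def dosage_r2(truth, dosage):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Assumption: this stands in for the per-variant imputation R²; the
    abstract does not specify the exact estimator used.
    """
    if len(truth) != len(dosage) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    mean_t, mean_d = sum(truth) / n, sum(dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(truth, dosage))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_d = sum((d - mean_d) ** 2 for d in dosage)
    if var_t == 0 or var_d == 0:
        return float("nan")  # monomorphic in this sample
    return cov * cov / (var_t * var_d)


# Toy data (made up for illustration, not from the study)
wgs = [0, 0, 0, 0, 1, 1, 2, 0]
imputed_calls = [0, 0, 0, 0, 1, 0, 2, 0]
imputed_dosage = [0.0, 0.1, 0.0, 0.0, 0.9, 0.4, 1.8, 0.0]
print(cramers_v(wgs, imputed_calls), dosage_r2(wgs, imputed_dosage))
```

Averaging such per-variant values within minor allele count bins would give summary figures of the kind reported above, although the binning and averaging scheme used in the study is not described in the abstract.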