Abstract
Electronic health records have been increasingly adopted as useful resources for genomic research. However, case-control labeling of clinical data from electronic health records is challenging, and most studies utilize phenotype codes to define case/control labels, resulting in suboptimal downstream analyses. Here we describe the liability threshold phenotypic integration, a method combining genetic relatedness with phenotypic data, including binary and continuous traits such as diagnosis codes, family disease history, laboratory measurements and biomarkers, to derive new continuous phenotypes for target diseases. The model utilizes an automatic trait selection algorithm that increases performance in disease risk prediction and provides insights into nontarget traits associated with the target disease. Our simulations and applications to the eMERGE network and the UK Biobank data demonstrate consistent performance gains in disease risk prediction and genome-wide association study power compared to conventional phenotype codes, models that solely incorporate family history and the phenotype imputation method SoftImpute, with similar false-positive rate control.