Abstract
Introduction: Transformer-based models have shown strong potential for clinical prediction using electronic health record data, yet their performance can vary depending on modelling decisions and data characteristics.
Methods: We trained a BEHRT model on hospital-based UK Biobank data and evaluated its performance across four clinical prediction tasks, including next-visit diagnosis and longer-term diagnosis prediction up to five years. We exhaustively assessed the impact of model size, medical terminology (CALIBER vs ICD-10), and data-splitting strategy.
Results: The larger model consistently outperformed the smaller one in long-term prediction tasks (AUROC = 0.874 vs 0.858 at 5 years), while differences were marginal in 6-month prediction tasks. Performance was also sensitive to vocabulary size, with the CALIBER model yielding a higher average precision score than the ICD-10 model (0.773 vs 0.678).
Discussion: Our results show that transformer models can achieve high predictive performance across diverse clinical scenarios, but outcomes vary considerably depending on modelling choices, particularly in long-term prediction tasks.