JMIR Med Inform. 2026 May 19;14:e81130. doi: 10.2196/81130.
ABSTRACT
BACKGROUND: Dyslipidemia is a multifactorial and complex condition that warrants investigation through advanced analytical approaches such as machine learning (ML). Few previous ML studies predicting dyslipidemia have been validated across multiple international populations.
OBJECTIVE: This study aimed to develop an ML model to predict the 5-year incidence of dyslipidemia using routinely collected health examination data. To ensure generalizability, the model was externally validated in populations from South Korea, Japan, and the United Kingdom. Furthermore, the clinical relevance of the model-derived risk was evaluated by examining its association with atherosclerotic outcomes, including acute myocardial infarction and cerebral infarction.
METHODS: This study was conducted using 3 independent, large-scale, population-based cohorts. The discovery cohort from South Korea (n=471,650) was used for model training and internal validation, while 2 validation cohorts from Japan (validation A; n=7,255,685) and the United Kingdom (validation B; n=408,725) were used for external validation. We evaluated various ML-based models using 23 features extracted from regular health screening data to predict the new onset of dyslipidemia within 5 years. Shapley Additive Explanations values were calculated to assess feature importance. To ensure the robustness of the proposed ML model, we evaluated the risk of atherothrombotic events (acute myocardial infarction or cerebral infarction) based on the model probability (tertiles; T1, T2, and T3) using a Cox proportional hazards model.
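The tertile grouping of model probabilities described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the cut points are the sample tertiles of the predicted probabilities (the abstract does not state how they were defined), the function name `assign_tertiles` is hypothetical, and the probabilities are toy values. The Cox proportional hazards step is not shown.

```python
from statistics import quantiles


def assign_tertiles(probabilities):
    """Label each predicted probability as T1, T2, or T3.

    Assumption: cut points are the sample tertiles of the
    probabilities passed in (not specified in the abstract).
    """
    # quantiles(..., n=3) returns the two cut points that
    # split the data into three equally sized groups.
    low_cut, high_cut = quantiles(probabilities, n=3)
    labels = []
    for p in probabilities:
        if p <= low_cut:
            labels.append("T1")  # lowest predicted risk
        elif p <= high_cut:
            labels.append("T2")
        else:
            labels.append("T3")  # highest predicted risk
    return labels


# Toy predicted probabilities, purely illustrative.
example_probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.40, 0.70, 0.80, 0.90]
example_labels = assign_tertiles(example_probs)
```

The resulting labels would then serve as the categorical exposure in a survival model, with one tertile as the reference group.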
RESULTS: In the discovery cohort, a soft-voting ensemble of Light Gradient Boosting Machine and categorical boosting achieved an area under the receiver operating characteristic curve (AUROC) of 0.783, a precision of 37.9%, and an area under the precision-recall curve of 0.469. The model showed moderate discriminatory performance in the external validation cohorts (cohort A: AUROC 0.744; precision 27.2%; and cohort B: AUROC 0.687; precision 5.07%). Shapley Additive Explanations value analysis identified smoking, alcohol intake, and physical activity as the most important features for predicting dyslipidemia. Finally, a higher model probability (T3 vs reference) was associated with an increased risk of acute myocardial infarction (adjusted hazard ratio 2.34, 95% CI 1.84-2.97) and cerebral infarction (adjusted hazard ratio 2.43, 95% CI 2.19-2.71).
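For readers unfamiliar with soft voting, the idea is that each base model outputs a class probability and the ensemble averages them before thresholding. The sketch below is a minimal illustration under stated assumptions: equal weights and a 0.5 decision threshold are assumed (the abstract reports neither), the function names `soft_vote` and `precision` are hypothetical, and the probabilities and labels are toy values, not study data.

```python
def soft_vote(prob_a, prob_b, weight_a=0.5):
    """Combine two models' predicted probabilities by weighted averaging.

    Assumption: equal weights by default; the study's actual
    weighting is not given in the abstract.
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("probability lists must have the same length")
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(prob_a, prob_b)]


def precision(y_true, y_prob, threshold=0.5):
    """Fraction of predicted positives that are true positives."""
    # Keep the true labels of every case predicted positive.
    predicted_positive = [y for y, p in zip(y_true, y_prob) if p >= threshold]
    if not predicted_positive:
        return 0.0  # no positive predictions, so precision is reported as 0
    return sum(predicted_positive) / len(predicted_positive)


# Toy outputs standing in for the two base models (hypothetical values).
lgbm_probs = [0.8, 0.4, 0.7, 0.2]
catboost_probs = [0.6, 0.7, 0.5, 0.1]
y_true = [1, 0, 1, 0]

ensemble_probs = soft_vote(lgbm_probs, catboost_probs)
ensemble_precision = precision(y_true, ensemble_probs)
```

In this toy case, three of the four averaged probabilities reach the threshold and two of those three are true cases, so the precision is 2/3. Precision depends on outcome prevalence as well as on the model, which is one plausible reason it can vary across cohorts even when AUROC changes less.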
CONCLUSIONS: This multinational study developed and validated an ML-based model using routine health checkup data to predict the 5-year risk of new-onset dyslipidemia. The model-derived risk was also associated with atherosclerotic events.
PMID:42155057 | DOI:10.2196/81130

