BMC Med Inform Decis Mak. 2026 Feb 26. doi: 10.1186/s12911-026-03389-1. Online ahead of print.
ABSTRACT
BACKGROUND: Cardiovascular disease is the leading public health challenge in China: it accounts for 48.98% and 47.35% of deaths in rural and urban populations, respectively, and affects approximately 330 million individuals. Existing risk stratification models derive predominantly from Western populations; the Framingham Risk Equation overestimates cardiovascular risk by 276% in Chinese men and 102% in Chinese women, underscoring the need for population-specific prediction tools. Although machine learning methods show considerable promise for cardiovascular risk prediction, their "black-box" nature substantially impedes clinical translation.
OBJECTIVE: Using longitudinal cohort data from the China Health and Retirement Longitudinal Study (CHARLS) and combining machine learning with explainable artificial intelligence techniques, we sought to develop and validate a long-term cardiovascular disease risk prediction model tailored to the Chinese middle-aged and elderly population, balancing predictive accuracy with clinical interpretability through quantitative analysis of risk factor contributions.
METHODS: We used four waves of CHARLS data spanning 2011-2020; after application of the inclusion criteria, 8,080 participants aged ≥ 45 years who completed 9-year follow-up were included. Recursive feature elimination was used to select the optimal predictors from 90 candidate variables. We systematically evaluated 12 machine learning algorithms covering linear, non-linear, ensemble, and deep learning methods, using a stratified random 7:3 split into training and validation cohorts. SHAP (SHapley Additive exPlanations) was used for global and local interpretability analyses, and decision curve analysis was used to assess clinical net benefit.
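As an illustration of the stratified random 7:3 partitioning described in the methods, the following is a minimal sketch using only the Python standard library. It is not the authors' code: the function name `stratified_split`, the seed, and the synthetic labels (1,000 hypothetical participants with a 22% event rate) are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices so each part keeps roughly the same event rate.

    Hypothetical helper: indices are shuffled within each outcome class,
    then the first `train_frac` of each class goes to the training part.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, valid_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * train_frac)
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(valid_idx)


# Synthetic labels only (assumption): 220 events among 1,000 participants.
labels = [1] * 220 + [0] * 780
train_idx, valid_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
valid_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), round(train_rate, 3), round(valid_rate, 3))
```

Stratifying on the outcome keeps the event rate comparable between the two cohorts, which matters when events are a minority of the sample.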
RESULTS: Among 5,699 training cohort participants, 1,248 (21.9%) experienced cardiovascular events during follow-up. Recursive feature elimination identified 18 key predictors spanning lipid metabolism, anthropometric parameters, renal function, and glucose homeostasis. The gradient boosting machine showed the best overall performance, achieving a validation cohort AUC of 0.798 (95% CI: 0.776-0.820), specificity of 98%, and positive predictive value of 78%. SHAP analysis identified waist circumference, triglycerides, and hypertension history as the three leading predictors, with mean absolute SHAP values significantly exceeding those of the other variables. Individual risk attribution analysis showed substantial heterogeneity: extremely high-risk individuals (predicted probability 0.991) exhibited synergistic multi-factorial risk amplification, with standardized waist circumference contributing a SHAP value of +0.0778 and triglycerides (477 mg/dL) contributing +0.0729; conversely, in low-risk individuals (predicted probability 0.0393), triglycerides (45.1 mg/dL) provided the largest single protective contribution, -0.166. Decision curve analysis confirmed positive net benefit across the 0-0.95 threshold probability range, consistently surpassing conventional strategies.
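To make the idea of per-feature SHAP contributions concrete, the sketch below computes exact Shapley values for a single instance of a small toy risk function, using only the standard library. It is a simplified illustration, not the study's model or the SHAP library: the function `toy_risk`, its coefficients, the input values, and the single-reference-point baseline are all hypothetical assumptions. Practical SHAP implementations for tree ensembles use specialized algorithms and a background dataset rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating all coalitions.

    Features outside a coalition are set to their baseline value
    (a simplifying assumption made for this sketch).
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi


def toy_risk(z):
    """Hypothetical risk score with one interaction term (not the study model)."""
    waist, triglycerides, hypertension = z
    return (0.05
            + 0.002 * (waist - 80)
            + 0.0003 * (triglycerides - 100)
            + 0.10 * hypertension
            + 0.0005 * hypertension * (waist - 80))


x = [100.0, 300.0, 1.0]        # hypothetical individual
baseline = [80.0, 100.0, 0.0]  # hypothetical reference individual
phi = shapley_values(toy_risk, x, baseline)

for name, contribution in zip(["waist", "triglycerides", "hypertension"], phi):
    print(f"{name}: {contribution:+.4f}")
```

The contributions sum to the difference between the prediction for the individual and the prediction for the baseline, which is the additivity property that lets signed per-feature values be read as pushing a prediction up or down.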
CONCLUSIONS: The gradient boosting machine model achieved superior discrimination (AUC 0.798, 95% CI 0.785-0.825) compared to Framingham (0.638) and China-PAR (0.654) scores for 9-year cardiovascular disease prediction in Chinese adults aged ≥ 45 years. Waist circumference, triglycerides, and hypertension emerged as principal predictive features, though SHAP-derived importance reflects statistical contribution rather than causal effects. Decision curve analysis demonstrated clinical utility across threshold probabilities 0.05-0.95, enabling flexible deployment from population screening (98.3% sensitivity) to targeted intervention (98.7% specificity). External validation in independent cohorts is essential to establish generalizability before clinical implementation.
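For readers unfamiliar with decision curve analysis, the following minimal sketch shows the standard net benefit calculation at a given threshold probability, alongside the "treat all" reference strategy. The outcome labels and predicted probabilities are made up for the example and do not come from the study; the function names are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every participant regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Made-up outcomes and predicted probabilities (assumption, not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.3, 0.6, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]

for pt in (0.1, 0.25, 0.5):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    print(f"threshold={pt:.2f} model={model_nb:+.3f} treat_all={all_nb:+.3f}")
```

A model shows clinical utility at a threshold when its net benefit exceeds both the "treat all" value and zero, the net benefit of treating no one.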
CLINICAL TRIAL NUMBER: Not applicable.
PMID:41749231 | DOI:10.1186/s12911-026-03389-1

