BMC Public Health. 2026 May 9. doi: 10.1186/s12889-026-27570-3. Online ahead of print.
ABSTRACT
BACKGROUND: Cardiovascular disease (CVD) remains a leading cause of morbidity and mortality worldwide. Although traditional cardiovascular risk models primarily rely on biomedical factors, socioeconomic and occupational characteristics are increasingly recognized as important correlates of cardiovascular health. However, applying machine learning to population-based survey data raises methodological concerns, particularly reverse causation and post-diagnosis information leakage.
METHODS: We conducted a cross-sectional analysis using data from the 2023 Behavioral Risk Factor Surveillance System (BRFSS). The analytic objective was classification of prevalent myocardial infarction (MI) or coronary heart disease (CHD) rather than prospective risk prediction. Four supervised machine learning algorithms (logistic regression, decision tree, random forest, and gradient boosting) were evaluated. To address potential label leakage, we implemented two model variants: an inclusive model incorporating all available predictors (Model A) and a restricted model excluding post-diagnosis proxy variables such as medication use, functional limitations, and disability indicators (Model B). Model performance was assessed using the area under the receiver operating characteristic curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), precision, recall, and F1 score.
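The two-variant design described above can be illustrated with a minimal sketch. The column names below are hypothetical placeholders, not actual BRFSS variable names, and the choice of which columns count as post-diagnosis proxies is an assumption made only for illustration; the abstract does not list the exact variables excluded.

```python
# Hypothetical predictor names for illustration only (not BRFSS variable names).
ALL_PREDICTORS = [
    "age_group", "sex", "income_bracket", "employment_status",
    "smoking_status", "diabetes", "bp_medication", "difficulty_walking",
    "any_disability",
]

# Assumed post-diagnosis proxies: variables that may be consequences of an
# MI/CHD diagnosis (medication use, functional limitation, disability).
POST_DIAGNOSIS_PROXIES = {"bp_medication", "difficulty_walking", "any_disability"}


def build_feature_sets(all_predictors, proxies):
    """Return the predictor lists for Model A (inclusive) and Model B (restricted)."""
    model_a = list(all_predictors)
    model_b = [col for col in all_predictors if col not in proxies]
    return {"A": model_a, "B": model_b}


def select_columns(rows, columns):
    """Keep only the requested columns from a list of dict-style records."""
    return [{col: row[col] for col in columns} for row in rows]


FEATURE_SETS = build_feature_sets(ALL_PREDICTORS, POST_DIAGNOSIS_PROXIES)
```

Each algorithm would then be fitted once per feature set on the same train/test split, so that any drop in performance from Model A to Model B can be attributed to the removed proxies rather than to a change in the sample.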
RESULTS: Gradient boosting demonstrated the strongest discriminative performance among the evaluated models. In the inclusive setting, the model achieved a ROC-AUC of 0.867 and a PR-AUC of 0.389, with a best F1 score of 0.433. After removal of post-diagnosis variables, performance remained robust (ROC-AUC = 0.858; PR-AUC = 0.372; F1 = 0.418), suggesting that predictive capacity was not driven solely by downstream disease indicators. Feature importance analyses showed that socioeconomic and employment-related variables remained prominent predictors in Model B alongside established clinical risk factors.
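For readers less familiar with the reported metrics, the sketch below shows one way ROC-AUC, PR-AUC, and a "best" F1 score could be computed from predicted scores. It is written under two assumptions the abstract does not confirm: that PR-AUC is the step-wise average precision, and that the best F1 is the maximum F1 over all candidate decision thresholds. The toy labels and scores are made up and unrelated to the study data.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pr_points(y_true, scores):
    """(threshold, precision, recall) at each distinct score, highest score first."""
    total_pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points, tp, fp = [], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only once all records tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((score, tp / (tp + fp), tp / total_pos))
    return points


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumed PR-AUC definition)."""
    ap, prev_recall = 0.0, 0.0
    for _, precision, recall in pr_points(y_true, scores):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def best_f1(y_true, scores):
    """Maximum F1 over all candidate thresholds, with the threshold that attains it."""
    best_score, best_threshold = 0.0, None
    for threshold, precision, recall in pr_points(y_true, scores):
        if precision + recall == 0:
            continue
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_score:
            best_score, best_threshold = f1, threshold
    return best_score, best_threshold


# Toy example: 3 positives among 8 records (made-up values).
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.1, 0.2, 0.8, 0.3, 0.4, 0.6, 0.05, 0.9]
print(roc_auc(y_true, scores), average_precision(y_true, scores), best_f1(y_true, scores))
```

Unlike ROC-AUC, whose chance level is 0.5 regardless of class balance, the chance-level PR-AUC equals the outcome prevalence, so PR-AUC values are best read against the prevalence of MI/CHD in the evaluation sample.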
CONCLUSIONS: Machine learning models can effectively classify prevalent MI and CHD using large-scale survey data even after explicit mitigation of post-diagnosis information leakage. Socioeconomic and occupational characteristics appear to function primarily as contextual correlates rather than causal determinants of CVD. These findings highlight the value of interpretable machine learning approaches for the population-level classification of prevalent cardiovascular disease while underscoring the limitations inherent to cross-sectional data.
PMID:42106717 | DOI:10.1186/s12889-026-27570-3