Sci Rep. 2026 May 30. doi: 10.1038/s41598-026-55482-0. Online ahead of print.
ABSTRACT
Diabetes mellitus (DM) is an escalating global public health concern, with a rapidly increasing burden in low- and middle-income countries, including Bangladesh. Despite its growing prevalence and associated complications such as cardiovascular disease, kidney failure, and stroke, comprehensive population-level evidence on its determinants and on predictive modeling remains limited. This study aimed to predict DM and identify its associated risk factors among adults in northern Bangladesh using ensemble machine learning (EML) approaches. A community-based cross-sectional study was conducted among 1408 adults in Dinajpur district between March 25 and June 5, 2025, using structured, pilot-tested questionnaires administered through face-to-face interviews. Feature selection was performed using Recursive Feature Elimination, Random Forest importance, and Best First Search methods. Six machine learning models were developed, followed by a stacking ensemble model to enhance predictive performance. Model evaluation was based on accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). Model interpretability was assessed using SHAP analysis, and findings were validated using multivariable logistic regression. The prevalence of DM was 15.1% in the study population. Among individual models, LightGBM demonstrated the highest performance (accuracy: 89.44%; AUC: 0.958 [95% CI 0.945-0.973]), followed by XGBoost (accuracy: 88.69%; AUC: 0.955 [95% CI 0.945-0.972]). The stacking ensemble model outperformed all base learners, achieving an accuracy of 91.67% and an AUC of 0.967 (95% CI 0.957-0.981). SHAP analysis identified age, family history of diabetes, BMI, weight, dietary behaviors (particularly low vegetable intake and added salt/sugar), family income, and gender as key predictors. Multivariable logistic regression confirmed these findings, showing that advancing age (especially 51-60 years), female gender, family history of diabetes, hypertension, kidney disease, and low vegetable consumption were independently associated with DM. Stacking-based ensemble learning therefore improved the predictive accuracy for DM over the individual base learners while enabling robust identification of key risk factors. The consistency between machine learning and traditional statistical approaches strengthens the validity of the findings. These results highlight the importance of integrating advanced analytical methods into public health research to support early detection, targeted prevention, and evidence-based decision-making in resource-constrained settings such as northern Bangladesh.
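The five evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `classification_metrics` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the abstract does not state how the study computed its metrics or confidence intervals.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = DM, 0 = no DM)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the study): true labels and predicted probabilities.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.3, 0.4, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen cutoff, whereas the AUC is computed from the ranking of the scores and does not.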
PMID:42218244 | DOI:10.1038/s41598-026-55482-0

