Development and Validation of a SHAP-Interpretable Machine Learning Model for Stroke Risk Prediction Using Circulating MicroRNA Biomarkers

Written on 12 May 2026

by Qiu Yao

J Mol Neurosci. 2026 May 12;76(2):83. doi: 10.1007/s12031-026-02540-x.

ABSTRACT

BACKGROUND/OBJECTIVE: Stroke remains a leading cause of morbidity and mortality worldwide. Circulating microRNAs (miRNAs) have emerged as promising non-invasive biomarkers for cardiovascular disease diagnosis and risk stratification. However, their integration into predictive models remains limited by challenges in feature selection, model robustness, and interpretability. This study aims to develop an interpretable machine learning framework for predicting stroke incidence using serum miRNA signatures.

METHODS: We analyzed serum miRNA expression profiles from 1,785 human samples (173 stroke patients and 1,612 non-stroke controls) obtained from the GEO dataset GSE117064. Differential expression analysis was performed to identify significantly dysregulated miRNAs using linear modeling and empirical Bayes moderation. LASSO logistic regression was then applied to select predictive miRNA features. Five supervised machine learning classifiers were evaluated using 10-fold cross-validation: logistic regression, support vector machine (SVM), random forest, XGBoost, and k-nearest neighbors. The best-performing model (SVM) was further interpreted using Shapley Additive exPlanations (SHAP) to assess individual miRNA contributions to prediction. To validate the robustness and clinical generalizability of the identified signatures, quantitative real-time PCR (qPCR) and next-generation sequencing (NGS) were performed on independent sets of serum samples. The pre-trained SVM model was applied to the NGS data to verify its classification performance in an external cohort.
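The feature-selection and classification steps described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn, uses a synthetic matrix from `make_classification` in place of the GSE117064 expression data, and the regularization strength (`C=0.5`), the RBF kernel, and the standardization step are illustrative choices that the abstract does not specify.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix (rows = samples, columns = miRNAs).
# Class imbalance loosely mimics a cohort with few cases; all values are made up.
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

# L1-penalized (LASSO) logistic regression keeps features with non-zero coefficients.
lasso_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)  # C is an assumption
)

# Scaling -> LASSO selection -> SVM, refit from scratch inside every fold.
model = make_pipeline(StandardScaler(), lasso_selector, SVC(kernel="rbf"))

# Stratified 10-fold cross-validation preserves the case/control ratio per fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])

print("mean accuracy:", np.mean(scores["test_accuracy"]))
print("mean AUC:", np.mean(scores["test_roc_auc"]))
```

Placing the selector inside the pipeline means the features are re-selected on each training fold, so the held-out fold never influences which features are kept. The abstract does not state whether selection was performed inside or outside the cross-validation loop.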

RESULTS: A total of 604 miRNAs were differentially expressed between stroke and non-stroke groups, including 206 downregulated and 398 upregulated candidates. LASSO regression identified 66 miRNAs with non-zero coefficients and potential predictive value. Among the classifiers tested, the SVM model achieved the highest average accuracy (0.9983) and an AUC of 1.0000, demonstrating superior performance and stability. SHAP analysis revealed that a subset of miRNAs, including hsa-miR-3648, hsa-miR-1290, and hsa-miR-6765-3p, had the greatest impact on classification outcomes, providing mechanistic insights and enhancing model interpretability. qPCR results preliminarily supported the differential expression patterns of key miRNAs. Furthermore, NGS analysis of an independent validation cohort (n = 10) demonstrated distinct hierarchical clustering of the selected biomarkers. Applied to these external samples, the pre-trained SVM model achieved an overall accuracy of 80% (sensitivity 80%, specificity 80%), preliminarily supporting its cross-platform validity and suggesting the potential generalizability of the identified miRNA panel across different quantification platforms.
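As a worked check of the external-validation figures, the short sketch below enumerates the confusion tables for n = 10 that reproduce a sensitivity and specificity of 80%. It assumes the reported percentages are exact rather than rounded and that both classes are present in the cohort; the function name `consistent_confusion_tables` is hypothetical, and the abstract itself does not report the per-class counts.

```python
def consistent_confusion_tables(n, sensitivity, specificity, tol=1e-9):
    """List every confusion table of size n matching the given sensitivity/specificity."""
    tables = []
    for positives in range(1, n):          # at least one case and one control
        negatives = n - positives
        for tp in range(positives + 1):
            for tn in range(negatives + 1):
                sens_ok = abs(tp / positives - sensitivity) < tol
                spec_ok = abs(tn / negatives - specificity) < tol
                if sens_ok and spec_ok:
                    tables.append(
                        {"tp": tp, "fn": positives - tp, "tn": tn, "fp": negatives - tn}
                    )
    return tables


tables = consistent_confusion_tables(n=10, sensitivity=0.80, specificity=0.80)
for t in tables:
    accuracy = (t["tp"] + t["tn"]) / 10
    print(t, "accuracy =", accuracy)
```

Under those assumptions the only consistent table is a 5/5 split with one misclassified sample in each group, which also gives the reported overall accuracy of 8/10 = 80%.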

CONCLUSIONS: This study demonstrates the utility of integrating differential expression analysis, regularized feature selection, and machine learning with SHAP-based interpretation to identify and validate serum miRNA signatures predictive of stroke. The successful external validation using NGS highlights the robustness of the identified biomarkers and their potential for cross-platform application. The results provide a transparent, data-driven framework for biomarker discovery in clinical risk prediction and support the development of non-invasive diagnostic tools for stroke detection.

PMID:42120650 | DOI:10.1007/s12031-026-02540-x