Phys Eng Sci Med. 2026 Jan 13. doi: 10.1007/s13246-025-01682-3. Online ahead of print.
ABSTRACT
Cardiovascular diseases (CVDs) remain the leading cause of death worldwide, underscoring the need for reliable diagnostic systems. This study aims to create a standardized electrocardiogram (ECG) dataset for detecting and classifying six major CVDs with machine learning techniques, and to investigate feature selection and extraction methods for improved performance. A large dataset of 34,580 12-lead ECG recordings was collected from the Sher-i-Kashmir Institute of Medical Sciences (SKIMS), Srinagar, Jammu and Kashmir, spanning six clinically confirmed classes: Normal, Cardiac Arrhythmia, Coronary Heart Disease, Cardiomyopathy, Stroke, and Heart Failure. Data pre-processing involved baseline correction, artifact removal, and the extraction of 14 clinically informative features. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied, so that each class accounted for 16.7% of the data. Ten machine learning and deep learning models (Logistic Regression, Decision Tree, Random Forest, SVM, KNN, Naive Bayes, Gradient Boosting, MLP, DNN, and RNN) were trained and tested. SHAP and LIME methods were used for interpretability. On the raw dataset, Random Forest and Gradient Boosting achieved the highest performance, with test accuracy, precision, recall, and F1-score all at 99.88%. After SMOTE, DNN improved significantly (Accuracy: 97.62%, Precision: 97.66%, Recall: 97.62%, F1-score: 97.64%), while MLP obtained an F1-score of 98.49% and RNN obtained 94.76%. All models exhibited better generalization and stability after SMOTE. The balanced, heterogeneous, and clinically verified ECG dataset supported highly accurate, interpretable, and real-time classification of CVD. SMOTE significantly improved model performance, particularly for deep networks, substantiating its effectiveness for the class imbalance problem. These results position the proposed model and dataset as effective tools for clinical decision support in the diagnosis of cardiovascular disease.
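The balancing step described in the abstract can be illustrated with a minimal sketch of the SMOTE idea: each synthetic sample is an interpolation between a minority-class sample and one of its k nearest same-class neighbours, repeated until every class matches the majority count. This is not the authors' pipeline. The function names (smote_balance, class_shares), the toy class counts, and the randomly generated 14-value feature vectors are all assumptions made for illustration; only the six class names, the 14-feature count, and the resulting 16.7% share per class come from the abstract.

```python
import math
import random
from collections import Counter


def smote_balance(X, y, k=5, seed=0):
    """Oversample every minority class up to the majority-class count.

    X is a list of feature vectors (lists of floats), y a list of labels.
    Returns new lists holding the originals plus the synthetic samples.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(row) for row in X], list(y)

    for label, n in counts.items():
        members = [row for row, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # cannot interpolate from a single sample
        for _ in range(target - n):
            base = rng.choice(members)
            # k nearest same-class neighbours of the base sample
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: math.dist(base, m),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


def class_shares(y):
    """Fraction of the dataset that each class label accounts for."""
    counts = Counter(y)
    return {label: n / len(y) for label, n in counts.items()}


# Class names are from the abstract; the counts below are hypothetical.
labels = ["Normal", "Cardiac Arrhythmia", "Coronary Heart Disease",
          "Cardiomyopathy", "Stroke", "Heart Failure"]
toy_counts = [60, 25, 20, 10, 8, 5]

# Toy data: 14 random values per recording stand in for the 14 ECG features.
data_rng = random.Random(42)
X, y = [], []
for idx, (label, n) in enumerate(zip(labels, toy_counts)):
    for _ in range(n):
        X.append([data_rng.gauss(idx, 1.0) for _ in range(14)])
        y.append(label)

X_bal, y_bal = smote_balance(X, y)
shares = class_shares(y_bal)  # each of the six classes ends up at 1/6
```

With six classes balanced to the same count, each share is 1/6, which rounds to the 16.7% reported above. In practice a maintained implementation, such as the SMOTE class in the imbalanced-learn library, would normally be preferred over a hand-written sketch like this one.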
PMID:41528718 | DOI:10.1007/s13246-025-01682-3