Arthritis Rheumatol. 2026 Apr 13. doi: 10.1002/art.70181. Online ahead of print.
ABSTRACT
OBJECTIVE: Disease activity plays a central role in rheumatoid arthritis (RA) clinical studies. The inconsistent availability of disease activity data in real-world electronic health record (EHR) data has limited the ability to generate real-world evidence (RWE). This study aimed to develop and validate scalable machine learning (ML) models to infer RA disease activity from EHR data.
METHODS: We used EHR data from Mass General Brigham (MGB) and the Department of Veterans Affairs (VA) linked with RA registries that prospectively collected the disease activity score using 28 joint counts (DAS28). Features for the algorithm were extracted from the EHR, including structured data (e.g., ICD codes) and narrative data processed with natural language processing (NLP). ML models were trained on the registry-collected DAS28. We evaluated the performance of models trained and tested within the same institution and of models trained at one institution and tested at the other. To assess face validity, we estimated the association between inferred disease activity and major adverse cardiovascular events (MACE) using stratified Cox models.
RESULTS: We studied 1105 MGB and 2631 VA patients with RA. Models using structured data achieved an AUC of 0.68-0.70; models incorporating both structured and NLP-derived features achieved higher performance (AUC=0.843, MGB; 0.833, VA). Cross-institution validation demonstrated limited transportability of algorithms across sites (AUC=0.679, MGB→VA; 0.718, VA→MGB). Within the same institution, inferred disease activity was significantly associated with increased risk of incident MACE (MGB: HR=1.12; VA: HR=1.14).
CONCLUSION: RA disease activity can be inferred at scale from within-institution EHR data, though cross-institution performance is limited. Inferred disease activity replicated the association with MACE, supporting its use in future studies to generate RWE.
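The within-institution and cross-institution AUC values reported in RESULTS can be read as the probability that a randomly chosen patient in the positive class receives a higher model score than a randomly chosen patient in the negative class. The following is a minimal sketch of that calculation, not the authors' code. It assumes a binary target (for example, higher versus lower disease activity; the abstract does not state how DAS28 was dichotomized), and the labels, scores, and variable names are hypothetical toy values, not study data.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied scores count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example (assumed values, for illustration only).
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive class, 0 = negative class

# Scores from a model evaluated at the site where it was trained.
within_site_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
# Scores from a model trained at a different site.
cross_site_scores = [0.9, 0.4, 0.3, 0.5, 0.35, 0.2]

print(f"within-site AUC: {auc(labels, within_site_scores):.3f}")  # 8 of 9 pairs ranked correctly
print(f"cross-site AUC:  {auc(labels, cross_site_scores):.3f}")   # 6 of 9 pairs ranked correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so a drop from roughly 0.84 within a site to roughly 0.68-0.72 across sites means the transported model orders patients noticeably less reliably.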
PMID:41972843 | DOI:10.1002/art.70181

