J Neurol Neurosurg Psychiatry. 2026 Mar 13:jnnp-2025-337689. doi: 10.1136/jnnp-2025-337689. Online ahead of print.
ABSTRACT
BACKGROUND: The relevance of covert cerebrovascular disease (CCD) in clinical practice is uncertain, partly because estimating risk in whole clinical populations is difficult. Previous studies have successfully extracted CCD from clinical text using natural language processing (NLP), but they have been limited to specific CCD phenotypes. Here, we used NLP to measure multiple clinically reported CCD phenotypes in a large clinical cohort and estimated subsequent disease risk in health record data.
METHODS: From all people with brain imaging in Scotland (2010-2018), we selected those with no prior hospitalisation for neurological disease (n=367 988). NLP of imaging reports identified four phenotypes: white matter hypoattenuation or hyperintensities (WMH), lacunes, cortical infarcts and cerebral atrophy. Adjusted hazard ratios (aHRs) were estimated for the association of each phenotype with stroke, dementia and Parkinson's disease (conditions previously associated with CCD), and with epilepsy and colorectal cancer (control conditions).
RESULTS: By phenotype, the aHR for stroke was WMH 1.4 (95% CI 1.3-1.4), lacunes 1.6 (1.5-1.6), cortical infarct 1.8 (1.7-1.9) and cerebral atrophy 1.1 (1.0-1.1). The aHR for dementia was WMH 1.3 (1.3-1.3), lacunes 1.0 (0.9-1.0), cortical infarct 1.1 (1.1-1.2) and cerebral atrophy 1.7 (1.7-1.8). The aHR for Parkinson's disease was WMH 1.1 (1.0-1.2), lacunes 1.1 (0.9-1.2), cortical infarct 0.7 (0.6-0.9) and cerebral atrophy 1.4 (1.3-1.5). The aHRs for the associations of CCD phenotypes with epilepsy and colorectal cancer were close to the null.
CONCLUSION: CCD and atrophy have implications for future disease risk and can be identified at scale using NLP of clinical reports. Prevention of neurological disease in people with CCD should be a priority for healthcare policy makers.
PMID:41825869 | DOI:10.1136/jnnp-2025-337689

