J Biomed Inform. 2026 Jun 5:105066. doi: 10.1016/j.jbi.2026.105066. Online ahead of print.
ABSTRACT
OBJECTIVE: Traditional causal structure learning algorithms struggle in distributed and privacy-sensitive environments, particularly when dealing with non-independent and identically distributed (non-IID) data. To address these limitations, this study proposes the Clustering-Based Federated Causal Discovery (CFedCD) framework, designed to enhance causal learning accuracy and applicability in multicenter clinical data analysis.
METHODS: The CFedCD framework integrates deterministic representation encoding and federated optimization techniques to address the data heterogeneity and privacy constraints inherent in distributed causal learning tasks. Each client independently extracts high-dimensional feature digests from local electronic medical records (EMR) using a Deep Sets model, which captures complex data distributions while preserving privacy. These locally computed digests are then aggregated on the server side, where K-means clustering is applied to group clients with similar data characteristics into federated clusters. Within each cluster, collaborative construction of cluster-specific causal graphs is facilitated through adaptive aggregation strategies and regularization techniques that mitigate distributional shifts. The framework's effectiveness is validated using EMR data from the eICU Collaborative Research Database for acute kidney injury (AKI) risk prediction, with performance evaluated using the area under the receiver operating characteristic curve (AUROC).
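The digest-then-cluster step described in METHODS can be illustrated with a minimal sketch. It is not the authors' implementation. The `digest` function is a hand-written, permutation-invariant stand-in for a trained Deep Sets encoder (a fixed per-record map, mean pooling, identity readout). The `kmeans` function is plain Lloyd's algorithm with a deterministic farthest-point start. The site names and toy records are hypothetical.

```python
from math import dist


def digest(records):
    """Permutation-invariant summary of one client's records.

    Assumption: stands in for a learned Deep Sets encoder, using
    phi(x) = (x, x**2) per record, mean pooling, and rho = identity.
    """
    n = len(records)
    d = len(records[0])
    pooled = [0.0] * (2 * d)
    for row in records:
        for j, x in enumerate(row):
            pooled[j] += x / n          # first moment of feature j
            pooled[d + j] += x * x / n  # second moment of feature j
    return pooled


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with farthest-point initialisation (deterministic)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre for every digest.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: mean of the members, keeping empty clusters in place.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical clients: each holds local records that never leave the site.
clients = {
    "site_a": [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3]],
    "site_b": [[1.1, 0.2], [1.0, 0.25]],
    "site_c": [[5.0, 3.0], [5.2, 2.8], [4.9, 3.1]],
    "site_d": [[5.1, 2.9], [4.8, 3.2]],
}

# Client side: compute a digest. Server side: cluster the digests only.
digests = {name: digest(rows) for name, rows in clients.items()}
labels, _ = kmeans(list(digests.values()), k=2)
clusters = dict(zip(digests, labels))
print(clusters)
```

In this sketch only the fixed-length digests reach the server, and sites with similar record distributions end up in the same cluster. Learning a causal graph within each cluster, including the adaptive aggregation and regularization mentioned above, is not shown.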
RESULTS: CFedCD identified key candidate causal factors contributing to AKI, including pulmonary disease, hypertension, diabetes, stroke, and blood urea nitrogen levels. It demonstrated significant improvements in both causal learning and predictive performance across heterogeneous client sites. Specifically, baseline federated learning alone decreased AUROC by 0.025 (p < 0.01), whereas adding clustering-driven personalized causal learning improved overall AUROC by 0.014 (p < 0.05). Visualization of the cluster-specific causal graphs revealed substantial heterogeneity in patient populations and clinical practices and uncovered novel associations specific to different subpopulations.
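For readers less familiar with the metric behind these differences, AUROC can be computed as the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case, with ties counted as one half. The sketch below uses this pairwise formulation. The function name `auroc` and the toy labels and scores are illustrative and are not taken from the study.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; rank-based implementations are preferable for large cohorts.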
CONCLUSION: The CFedCD framework offers an effective solution for federated causal structure learning in heterogeneous environments, producing graphical models that represent candidate causal relationships informed by observational data and clinical knowledge.
PMID:42250862 | DOI:10.1016/j.jbi.2026.105066

