Sci Data. 2026 Apr 7. doi: 10.1038/s41597-026-07192-5. Online ahead of print.
ABSTRACT
Publicly available synthetic population datasets often lack detailed health information, limiting their utility in disease modeling. To address this gap, we present the SUPPORT (Synthetic data Using Population Profiles for cardiOvascular Risk facTors) dataset, a large-scale cross-sectional resource comprising 777,358,492 synthetic individuals aged 35-84 across seven geographic regions of China, anchored to the 2020 demographic structure. Each synthetic individual has a detailed profile of sociodemographic attributes and major cardiovascular disease (CVD) risk factors, including blood pressure, cholesterol levels, body mass index, and history of diabetes. The population was constructed using iterative proportional fitting, multivariate normal distribution sampling, and multiple imputation, integrating data from China's Seventh National Population Census (2020), the Global Burden of Disease (GBD) study, and numerous health surveys. Technical validation against census statistics and independent cohorts, including the China Kadoorie Biobank, confirmed that the dataset accurately replicates marginal sociodemographic distributions and adequately approximates the cardiovascular risk profiles of real-world populations. The open-source SUPPORT dataset can be extended with additional attributes, providing a publicly available resource to enable robust, individual-level modeling of CVD.
PMID:41946746 | DOI:10.1038/s41597-026-07192-5
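
The abstract names iterative proportional fitting as one of the construction steps but does not describe how it was applied. As a minimal sketch of the general technique only, and not the authors' implementation, the Python example below rescales a two-way seed table until its row and column totals match target margins. The function name `ipf`, the seed table, and the target totals are all hypothetical values chosen for illustration; none come from the SUPPORT dataset.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale a 2-D seed table until its margins match the targets.

    Assumes a strictly positive seed and row/column targets that
    share the same grand total (hypothetical illustration only).
    """
    table = [row[:] for row in seed]  # work on a copy
    n_cols = len(col_targets)
    for _ in range(max_iter):
        # Row step: scale each row so it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Column step: scale each column so it sums to its target.
        col_sums = [sum(row[j] for row in table) for j in range(n_cols)]
        for j, target in enumerate(col_targets):
            if col_sums[j] > 0:
                factor = target / col_sums[j]
                for row in table:
                    row[j] *= factor
        # Columns now match exactly; stop once the rows also match.
        err = max(abs(sum(table[i]) - row_targets[i])
                  for i in range(len(row_targets)))
        if err < tol:
            break
    return table


# Hypothetical seed counts and target margins (not from the paper).
seed = [[1.0, 2.0], [3.0, 4.0]]
row_targets = [40.0, 60.0]  # e.g. totals for two age bands
col_targets = [55.0, 45.0]  # e.g. totals for two sex categories
fitted = ipf(seed, row_targets, col_targets)
```

In a population-synthesis setting, the seed would typically come from a survey cross-tabulation and the targets from census margins; the fitted cell counts then preserve the seed's association structure while reproducing the target marginal totals.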