Geographically Explicit Synthetic Population Dataset with Networks for the US
Geo-simulation research, including agent-based modeling, requires high-quality synthetic population datasets that capture individuals' demographic characteristics, spatial distribution and social connections.
In this paper published in Nature's Scientific Data, Na (Richard) Jiang, Boyu Wang, Andrew Crooks, and I introduce a Python-based workflow that uses the 2020 US Census data to generate a large-scale, geographically explicit synthetic population for America’s 50 states and Washington, D.C. The synthetic population is generated at the individual level, and its aggregated demographic attributes, including the age and gender distributions, match the US Census data at the census tract level. In addition to demographic attributes such as age, gender, household and urban/rural status, the synthetic population is geographically explicit: individuals are assigned home locations using road networks as a proxy. We also include stylized multimodal networks that capture individuals' family, work, school and daycare connections. This spatially explicit population dataset with social networks can be used to study various complex phenomena that emerge through social and spatial interactions, further fostering the study of complexity. We also validated the dataset against external data sources, namely the American Community Survey Public Use Microdata Sample (ACS PUMS) and census data at the block group level.
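To make the tract-level matching concrete, here is a minimal sketch of how one might check that individual-level records aggregate back to census counts. The record layout, tract IDs, age-group labels and counts below are all hypothetical placeholders, not the actual schema or values of the released dataset, and this is not the validation code used in the paper.

```python
from collections import Counter

# Hypothetical individual-level records: (census_tract_id, age_group).
# The real dataset's field names and categories may differ.
synthetic_people = [
    ("36029000100", "0-4"),
    ("36029000100", "25-29"),
    ("36029000100", "25-29"),
    ("36029000200", "65-69"),
]

# Hypothetical census counts keyed by (census_tract_id, age_group).
census_counts = {
    ("36029000100", "0-4"): 1,
    ("36029000100", "25-29"): 2,
    ("36029000200", "65-69"): 2,
}


def aggregate_by_tract(people):
    """Count synthetic individuals per (tract, age group)."""
    return Counter((tract, age) for tract, age in people)


def compare_to_census(people, census):
    """Return synthetic minus census count for every (tract, age group)."""
    synthetic = aggregate_by_tract(people)
    keys = set(synthetic) | set(census)
    return {key: synthetic.get(key, 0) - census.get(key, 0) for key in sorted(keys)}


differences = compare_to_census(synthetic_people, census_counts)
# Keep only the cells where the synthetic population deviates from the census.
mismatches = {key: diff for key, diff in differences.items() if diff != 0}
print(mismatches)
```

In this toy example the second tract is one person short in the 65-69 group, so it is the only cell reported. A real check would run the same comparison over every tract and attribute of interest.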
The generated synthetic population dataset is available here: https://osf.io/fpnc2/files/osfstorage. Interested readers can download the synthetic population, along with its social networks, state by state.
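As a rough illustration of what a household network looks like as data, the sketch below builds an edge list from person-to-household assignments. It assumes that every pair of people sharing a household is connected, which is an illustrative simplification; the person and household IDs are hypothetical, and the released files may use a different format and construction.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (person_id, household_id) pairs; not the dataset's real schema.
people = [
    ("p1", "h1"),
    ("p2", "h1"),
    ("p3", "h1"),
    ("p4", "h2"),
    ("p5", "h3"),
    ("p6", "h3"),
]


def household_edges(records):
    """Connect every pair of people who share a household (assumed rule)."""
    members = defaultdict(list)
    for person_id, household_id in records:
        members[household_id].append(person_id)

    edges = []
    for household_id in sorted(members):
        # combinations() yields each unordered pair exactly once.
        edges.extend(combinations(sorted(members[household_id]), 2))
    return edges


edges = household_edges(people)
print(edges)
```

A single-person household such as `h2` contributes no edges. Work, school and daycare ties could be stored as edge lists of the same shape, although how those ties are generated is a separate modeling choice.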
Abstract:
Within the geo-simulation research domain, micro-simulation and agent-based modeling often require the creation of synthetic populations. Creating such data is a time-consuming task, and the resulting populations often lack social networks, which are crucial for studying human interactions (e.g., disease spread, disaster response) while at the same time impacting decision-making. We address these challenges by introducing a Python-based method that uses open data, including 2020 U.S. Census data, to generate a large-scale, realistic, geographically explicit synthetic population for America’s 50 states and Washington, D.C., along with stylized social networks (e.g., home, work and schools). The resulting synthetic population can be utilized within various geo-simulation approaches (e.g., agent-based modeling), exploring the emergence of complex phenomena through human interactions and further fostering the study of urban digital twins.
Full Reference:
Jiang, N., Yin, F., Wang, B., & Crooks, A. T. (2024). A large-scale geographically explicit synthetic population with social networks for the United States. Scientific Data, 11(1), 1204. Available at: https://www.nature.com/articles/s41597-024-03970-1 (pdf)
Figure: A Sample of the Social Networks for One Household, Showing Its Home, Work and Educational Networks from the Generated Data.
Figure: Sample of Generated Social Networks Extracted from the City of Buffalo, New York: (a) Household; (b) Work; (c) School; (d) Daycare.
Figure: Validation of the Synthetic Population at Different Levels: (a) Population across 18 Age Groups; (b) Households across Household Types.