Practical Lessons from Generating Synthetic Healthcare Data with Bayesian Networks

de Benedetti, Juan; Oues, Namir; Wang, Zhenchen; Myles, Puja; Tucker, Allan

doi:10.1007/978-3-030-65965-3_3

Conference paper
First Online: 02 February 2021

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1323)

Abstract

Healthcare data holds huge societal and monetary value. It contains information about how disease manifests within populations over time, and could therefore be used to improve public health dramatically. For the growing AI-in-health industry, this data offers huge potential for generating markets for new healthcare technologies. However, primary care data is extremely sensitive: it contains highly personal information about individuals. As a result, many countries are reluctant to release this resource. This paper explores some key issues in the use of synthetic data as a substitute for real primary care data: handling the complexities of real-world data to transparently capture realistic distributions and relationships, modelling time, and minimising the matching of real patients to synthetic datapoints. We show that if the correct modelling approaches are used, then transparency and trust can be ensured in the underlying distributions and relationships of the resulting synthetic datasets. What is more, these datasets offer a strong level of privacy through lower risks of identifying real patients.
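
The two ideas named in the abstract, sampling synthetic records from a Bayesian network and checking how often they coincide with real records, can be illustrated with a minimal sketch. This is not the authors' model: the three-variable network, every probability in it, and the helper names `sample_record`, `sample_dataset` and `exact_match_rate` are hypothetical, and the "real" data here is itself simulated as a stand-in.

```python
import random

# Hypothetical toy network: smoker -> copd, and (smoker, copd) -> cough.
# Every probability below is invented purely for illustration.
P_SMOKER = 0.3
P_COPD_GIVEN_SMOKER = {True: 0.20, False: 0.03}
P_COUGH_GIVEN = {  # keyed by (smoker, copd)
    (True, True): 0.90,
    (True, False): 0.40,
    (False, True): 0.80,
    (False, False): 0.10,
}


def sample_record(rng):
    """Ancestral sampling: draw each variable after its parents."""
    smoker = rng.random() < P_SMOKER
    copd = rng.random() < P_COPD_GIVEN_SMOKER[smoker]
    cough = rng.random() < P_COUGH_GIVEN[(smoker, copd)]
    return (smoker, copd, cough)


def sample_dataset(n, seed):
    """Draw n independent records with a seeded generator."""
    rng = random.Random(seed)
    return [sample_record(rng) for _ in range(n)]


def exact_match_rate(synthetic, real):
    """Fraction of synthetic rows identical to at least one real row."""
    real_set = set(real)
    return sum(row in real_set for row in synthetic) / len(synthetic)


real = sample_dataset(200, seed=0)       # simulated stand-in for real records
synthetic = sample_dataset(200, seed=1)  # the synthetic release
rate = exact_match_rate(synthetic, real)
smoker_rate = sum(row[0] for row in synthetic) / len(synthetic)
print(f"exact-match rate: {rate:.2f}, synthetic smoker rate: {smoker_rate:.2f}")
```

With only three binary attributes there are just eight possible records, so almost every synthetic row will coincide with some real row. An exact-match count of this kind only becomes informative as a privacy indicator when records carry many attributes, which is the setting the abstract has in mind.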

Author information

Authors and Affiliations

Medicine and Health Regulatory Authority, London, UK
Juan de Benedetti, Namir Oues, Zhenchen Wang & Puja Myles
Intelligent Data Analysis Group, Brunel University London, Uxbridge, UK
Allan Tucker

Corresponding author

Correspondence to Allan Tucker.

Editor information

Editors and Affiliations

University of Sydney, Sydney, NSW, Australia
Irena Koprinska
Monash University, Clayton, VIC, Australia
Michael Kamp
University of Bari Aldo Moro, Bari, Italy
Annalisa Appice
University of Bari Aldo Moro, Bari, Italy
Corrado Loglisci
University of Guelph, Guelph, ON, Canada
Luiza Antonie
University of Caen Normandy, Caen, France
Albrecht Zimmermann
University of Pisa, Pisa, Italy
Riccardo Guidotti
Norwegian University of Science and Technology, Trondheim, Norway
Özlem Özgöbek
University of Porto, Porto, Portugal
Rita P. Ribeiro
UPC BarcelonaTech, Barcelona, Spain
Ricard Gavaldà
University of Porto, Porto, Portugal
João Gama
Fraunhofer IAIS, St. Augustin, Germany
Linara Adilova
Royal Holloway University of London, Egham, UK
Yamuna Krishnamurthy
University of Lisbon, Lisbon, Portugal
Pedro M. Ferreira
University of Bari Aldo Moro, Bari, Italy
Donato Malerba
University of Lisbon, Lisbon, Portugal
Ibéria Medeiros
University of Bari Aldo Moro, Bari, Italy
Michelangelo Ceci
ICAR-CNR, Rende, Italy
Giuseppe Manco
University of Naples Federico II, Naples, Italy
Elio Masciari
University of North Carolina, Charlotte, NC, USA
Zbigniew W. Ras
Australian National University, Canberra, ACT, Australia
Peter Christen
Leibniz University Hannover, Hannover, Germany
Eirini Ntoutsi
Technical University of Dortmund, Dortmund, Germany
Erich Schubert
University of Southern Denmark, Odense, Denmark
Arthur Zimek
University of Pisa, Pisa, Italy
Anna Monreale
Warsaw University of Technology, Warsaw, Poland
Przemyslaw Biecek
ISTI-CNR, PISA, Italy
Salvatore Rinzivillo
Berlin Institute of Technology, Berlin, Germany
Benjamin Kille
Berlin Institute of Technology, Berlin, Germany
Andreas Lommatzsch
Norwegian University of Science and Technology, Trondheim, Norway
Jon Atle Gulla

About this paper

Cite this paper

de Benedetti, J., Oues, N., Wang, Z., Myles, P., Tucker, A. (2020). Practical Lessons from Generating Synthetic Healthcare Data with Bayesian Networks. In: Koprinska, I., et al. ECML PKDD 2020 Workshops. ECML PKDD 2020. Communications in Computer and Information Science, vol 1323. Springer, Cham. https://doi.org/10.1007/978-3-030-65965-3_3

DOI: https://doi.org/10.1007/978-3-030-65965-3_3
Published: 02 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65964-6
Online ISBN: 978-3-030-65965-3
eBook Packages: Computer Science, Computer Science (R0)
