Practical Lessons from Generating Synthetic Healthcare Data with Bayesian Networks

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1323)

Abstract

Healthcare data holds huge societal and monetary value. It contains information about how disease manifests within populations over time, and could therefore be used to improve public health dramatically. To the growing AI-in-health industry, this data offers huge potential for generating markets for new healthcare technologies. However, primary care data is extremely sensitive: it contains highly personal information about individuals. As a result, many countries are reluctant to release this resource. This paper explores some key issues in the use of synthetic data as a substitute for real primary care data: handling the complexities of real-world data to transparently capture realistic distributions and relationships, modelling time, and minimising the matching of real patients to synthetic datapoints. We show that if the correct modelling approaches are used, then transparency and trust can be ensured in the underlying distributions and relationships of the resulting synthetic datasets. Moreover, these datasets offer a strong level of privacy through lower risks of identifying real patients.
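As a rough illustration of the general idea (not the authors' model or code), the sketch below samples synthetic records from a tiny hand-specified Bayesian network and then measures how many synthetic rows coincide exactly with a real row. The two variables, all probabilities, and the function names `sample_patient`, `sample_dataset`, and `match_rate` are hypothetical and chosen only for this example.

```python
import random

# Hypothetical toy network: smoker -> copd.
# All probabilities are invented for illustration only.
P_SMOKER = 0.2
P_COPD_GIVEN_SMOKER = {True: 0.15, False: 0.02}


def sample_patient(rng):
    """Ancestral sampling: draw the parent first, then the child given it."""
    smoker = rng.random() < P_SMOKER
    copd = rng.random() < P_COPD_GIVEN_SMOKER[smoker]
    return (smoker, copd)


def sample_dataset(n, seed=0):
    """Draw n synthetic records with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [sample_patient(rng) for _ in range(n)]


def match_rate(synthetic, real):
    """Fraction of synthetic rows identical to at least one real row."""
    real_set = set(real)
    return sum(row in real_set for row in synthetic) / len(synthetic)
```

With only two binary variables almost every synthetic row will coincide with some real row, so the exact-match rate is a crude proxy at best. It becomes more informative as the number of variables grows and individual records become more distinctive.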

Author information

Corresponding author

Correspondence to Allan Tucker.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

de Benedetti, J., Oues, N., Wang, Z., Myles, P., Tucker, A. (2020). Practical Lessons from Generating Synthetic Healthcare Data with Bayesian Networks. In: Koprinska, I., et al. ECML PKDD 2020 Workshops. ECML PKDD 2020. Communications in Computer and Information Science, vol 1323. Springer, Cham. https://doi.org/10.1007/978-3-030-65965-3_3

  • DOI: https://doi.org/10.1007/978-3-030-65965-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-65964-6

  • Online ISBN: 978-3-030-65965-3

  • eBook Packages: Computer Science, Computer Science (R0)