Introduction

It is increasingly evident that the use of historical data within health systems can offer huge rewards in terms of increased accuracy, timely diagnoses, the discovery of new knowledge about disease and its progression, and the ability to offer a more personalised prognosis and care pathway for patients1. What is more, there is a huge demand from the public and governments to make new technology available within health services as quickly as possible while ensuring that any software that uses Artificial Intelligence (AI), in particular Machine Learning, is robustly validated to check for biases and errors2.

Many issues concerning patient privacy have been highlighted since the introduction of the General Data Protection Regulation3. These include protections against the identification of an individual’s data within large data samples4 and the right to an explanation for any decision that is made by an automated system5. As a result of this legislation, the ability to offer large samples of real individual-level patient data to companies and institutions is limited. One possible solution to this problem is the use of synthetic data as an alternative to assist in the rapid development and validation of new tools. These data must capture all of the correct (potentially non-linear and multivariate) dependencies and distributions that are apparent in the real data sets, while also preserving patient privacy and avoiding the risks of individual identification.

In this paper, we explore some of the key issues in generating realistic and useful synthetic data, namely preserving relationships, distributions, predictive capabilities, and patients’ privacy. We also explore what robust methods need to be used to validate models using synthetic data in order to ensure biases in the models, overfitting issues, and high variance are discovered and reported. The paper is broken down into three main sections: first, we discuss some of the key issues concerning the generation and use of synthetic data and introduce a method based on probabilistic graphical models; second, we explore a case study using primary care data from the Clinical Practice Research Datalink (CPRD) in the UK. CPRD is a real-world research service supporting retrospective and prospective public health and clinical studies. It is jointly sponsored by the Medicines and Healthcare products Regulatory Agency and the National Institute for Health Research, as part of the Department of Health and Social Care6. Finally, we draw conclusions and make recommendations about the advantages and disadvantages of using synthetic data for rapid development of AI systems in healthcare.

There are already existing methods for generating synthetic data. One simple approach is data perturbation, where noise is added to the original data set. For example, rotations, cropping, and noise injection have been applied to images7,8,9 to produce more diverse data sets for a more generalisable classifier, and noise drawn from a distribution, such as the Laplace mechanism used in PrivBayes10, has been added to make it more difficult to identify individuals from a data set. Another approach uses generative models of data11. In this case, models that capture the correct relationships and distributions are built, either hand-coded based upon expert knowledge or inferred from real data using models such as Bayesian networks (BNs)10,12 or neural networks13. These can then be used to generate synthetic data via sampling techniques. Generative Adversarial Networks have become particularly popular as a method to generate synthetic image data to build more robust models containing fewer biases than those trained on real data alone14.

Bias in the data can appear due to the way data is collected. In many fields, data analysis involves using historical secondary-use data that was not collected for the analysis in question, as opposed to well-designed research data aimed at answering a specific statistical question (as found in clinical trials for example). This means that secondary-use data sets are often imbalanced, particularly in medicine. For example, in primary care data the number of patients with a specific disease may be far lower than patients who do not have the disease. Conversely, data that is collected by a particular hospital may not reflect the general population as less-severe patients may be managed in primary care, while the data collected in hospitals will only contain more severe patients who are already diagnosed with a specific disease or are at high risk of developing it. As a result, any models that are inferred from such data must deal with these imbalances, either through resampling methods15,16 or synthetic data generation. SMOTE is a commonly used resampling technique in machine learning for dealing with small and imbalanced samples and involves generating synthetic datapoints to supplement existing data17.
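As background to the resampling techniques mentioned above, the following is a minimal sketch (not the paper’s code) of SMOTE-style oversampling using the imbalanced-learn library; the simulated data set and the 1% positive rate are purely illustrative.

```python
# Illustrative sketch: oversampling a rare outcome with SMOTE.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Simulate an imbalanced, primary-care-style outcome (roughly 1% positive cases)
X, y = make_classification(n_samples=10_000, n_features=8,
                           weights=[0.99, 0.01], random_state=0)

smote = SMOTE(random_state=0)
X_res, y_res = smote.fit_resample(X, y)  # adds synthetic minority-class points

print("positives before:", (y == 1).sum(), "| positives after:", (y_res == 1).sum())
```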

An important issue concerning the use of an underlying model to generate synthetic data is that the inherent biases may not be visible. For example, Neural Network approaches whereby models are inferred from data have turned out to be biased, leading to decisions and classifications being made for the wrong reasons18. Agnostic network approaches have attempted to deal with unwanted biases in the data by selecting known “protected concepts” and using domain adversarial training19 to account for these biases. The issue of bias is especially a problem for models where the relationships between features are not explicitly represented because unwanted correlations cannot easily be identified. This is known as the black box problem where it is difficult to know how a model will behave when it has many complex parameters that are not easily interpreted. Approaches that try to deal with this by modelling influences more transparently include probabilistic graphical models20 and tree-based models21,22.

Many data sets will contain specific characteristics that must be taken into account when learning a model for synthetic data generation. For example, missing data are common in most medical data sets. Missing data can arise for many different reasons, but if values are not recorded for some systematic reason then this must be accounted for in the modelling process. Such structurally missing data are known as Missing Not At Random (MNAR), as opposed to Missing At Random (MAR). If the MNAR mechanism is non-ignorable, then robust methods are needed to model this missingness24. For example, in probabilistic graphical models a discrete variable can include a “missing” state, while continuous variables can include a binary node representing whether the variable measurement is missing or not23. This matters because the pattern of missing data can often have value in itself and be exploited to assist in making predictions23. Other approaches include explicitly modelling these unmeasured effects as latent variables25, which we will explore in this paper.

Most data sets will contain unmeasured effects, that is, underlying processes that have not been recorded in the data (perhaps because they were not considered important at the time of collection, or perhaps because they were not known at the time, e.g. a particular clinical test that was introduced part way through the data collection process). These can be modelled using latent variable approaches that use methods such as the FCI algorithm20 to infer the location of these unmeasured variables and the Expectation Maximisation algorithm26 to infer their parameters.

A key issue being explored in this paper is how synthetic data can be used while ensuring patient privacy, that is, the ability to use simulated patient data to build new models without giving away personal information. There are a number of concepts that attempt to measure how easy it is to identify a patient from their data. For example, k-anonymisation is a measure of the least number of individuals (k) in a data set who share the set of attributes that might become identifying for each individual27, while ε-differential privacy is a metric which enables data managers to only release aggregates of data that cannot be used to identify individuals28. Re-identification has proven to be problematic, for example, through a “differentiation attack” where aggregated data are repeatedly requested for different subsets, enabling the attacker to identify an individual. This is a risk even when data have been anonymised29. For many individuals, aggregated data can preserve their privacy, provided data cannot be repeatedly requested, as they cannot be identified from the summary statistics/distributions that are learnt from a large population. However, people who are considered outliers, for example those with rare diseases or demographics, may still be identified. As a result, outlier analysis30 needs to be incorporated. Simply removing these patients may be an option but this can sometimes mean missing out on important data that could be used to help future patients.
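To make the k-anonymisation concept concrete, the sketch below (our own illustrative example, not from the paper) computes k as the size of the smallest group of records sharing a combination of quasi-identifying attributes; the attribute names and values are hypothetical.

```python
# Illustrative sketch: estimating the k-anonymity of a small table.
import pandas as pd

df = pd.DataFrame({
    "age_band": ["60-69", "60-69", "70-79", "70-79", "70-79"],
    "gender":   ["F", "F", "M", "M", "F"],
    "region":   ["London", "London", "North", "North", "North"],
})

quasi_identifiers = ["age_band", "gender", "region"]
group_sizes = df.groupby(quasi_identifiers).size()

# k is the least number of individuals sharing any attribute combination
k = group_sizes.min()
print(f"k-anonymity of this sample: {k}")  # the lone 70-79/F/North record gives k = 1
```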

In summary, there have been numerous attempts to generate synthetic data for different reasons, including to deal with biased, imbalanced, and small sampled data. There is now a push to explore how synthetic data may enable researchers to build predictive models while preserving patient privacy. In this paper, we explore the integration of probabilistic graphical models with latent variables and resampling to simultaneously capture many features of real-world complex primary care data, including missing data, non-linear relationships, and uncertainty, while focussing on the importance of transparency of the modelling and data generation process. In the next section, we describe the methods that we have adopted to construct and robustly validate synthetic data samples. We also describe the primary care data in detail. We then carry out an empirical analysis on a subset of the primary care data with a focus on cardiovascular risk. This includes an evaluation of our probabilistic graphical model approach to handling missing data by comparing the synthetic data to original ground truth data in terms of distributional characteristics. We then explore how the synthetic data compare on machine learning classification tasks by comparing the sensitivity analyses on synthetic and ground truth data.

Results

Data and modelling

Our experiments make use of the CPRD Aurum data set. This includes patient Electronic Healthcare Records collected routinely from primary care practices using the EMIS® patient management software system. When a practice agrees to contribute their patient data to CPRD Aurum, CPRD receives a full historic collection of the coded part of the practice’s electronic health records, which includes data on deceased patients and those who have left the practice. The coded clinical record includes symptoms, diagnoses, prescriptions, immunisations, tests, lifestyle factors, and referrals recorded by the general practitioner (GP) or other practice staff but does not include free text medical notes6. The November 2019 release of the CPRD Aurum database included a total of 27.5 million patients (including deceased and transferred patients) from 1042 practices of whom 9.7 million were currently registered with a GP6.

We have chosen a generative approach to modelling the CPRD data where the focus is on a combination of machine learning that is augmented with expert knowledge. This is because we want to ensure that any biases that occur in the ground truth data are made explicit and can be dealt with at each stage of the data generation process. As a result, the underlying model must deal with all the potential uncertainty in the data while also modelling the distributions and relationships in as transparent a manner as possible. For this reason, we have chosen a BN framework. The first experiment involved learning a BN from the CPRD data set.

The general structure of the discovered BNs from multiple samples of the original CPRD data, which we denote as ground truth (GT), is shown in Fig. 1. This explicit representation of independencies between variables allows experts to assess the underlying model and check for potential biases within the GT data. For example, almost all the black arc relationships are well recognised in medical research:

  • Cholesterol/high-density lipoprotein ratio and type 2 diabetes: increased ratio in type 2 diabetes31

  • Steroid treatment and systemic lupus erythematosus (SLE): steroids used in SLE treatment32

  • Rheumatoid arthritis (RA) and SLE—both are autoimmune conditions, and while RA affects joints, SLE can affect joints in some variants and mimic RA. They are considered distinct diseases but can co-occur33

  • Severe mental illness and migraines: migraines can precede mental illness and are common in those with anxiety disorders34

  • Smoking and severe mental illness: well-known association, especially in schizophrenia (widely observed but may not be causal)35

  • Ethnicity and body mass index (BMI): possibly confounded by lifestyle explanations but widely observed association36

  • Smoking and systolic blood pressure: the grey in the network reflects the conflicting evidence base in this area37

  • Smoking and impotence: this also explains why there is a relationship between the male gender and impotence38

  • Type 1 diabetes and impotence39

  • Age and systolic blood pressure: increasing systolic blood pressure with increasing age40

  • Family history of coronary heart disease increases risk of stroke/heart attacks41

  • Antipsychotics and severe mental illness: antipsychotics used for treatment of severe mental illness (bnf.nice.org.uk)

  • Systolic blood pressure and systolic blood pressure SD: correlated variables

  • Atrial fibrillation (AF) and stroke/heart attack: AF is risk factor for stroke (stroke.org.uk)

  • Chronic kidney disease and stroke/heart attacks: often co-occur42

  • Age and type 2 diabetes: increasing risk of type 2 diabetes with age39

There were, however, some surprises:

  • Region was connected to impotence. Perhaps there is an indirect link via the regional distribution of smoking43

  • There is no clear link between systolic blood pressure and blood pressure treatment. This could possibly be due to systolic blood pressure being a numeric variable spanning normal and high systolic blood pressure readings

Fig. 1: Resultant graph structure for BNs learnt from samples of ground truth data.
figure 1

Confidences of 100% are represented by black arcs while those <100% are represented by varying widths in grey.

We adopt three BN modelling approaches to handle missing data: First, we simply delete all cases with missing data. Second, we model missingness in discrete nodes by adding a “Miss State” to all possible node states, and in continuous nodes by adding a new binary parent (a “Miss Node”) to each node, representing whether the data point is missing or not. Finally, we explore the use of the FCI algorithm20 to infer any latent variables in the network. These methods are explained in more detail in the “Methods” section. The following links to 6 latent variables were discovered:

“L1” → “age”

“L1” → “af”

“L1” → “treathyp”

“L2” → “steroid”

“L2” → “treathyp”

“L3” → “impot”

“L3” → “gender”

“L4” → “migr”

“L4” → “choleratio”

“L4” → “gender”

“L5” → “strokeha”

“L5” → “ckidney”

“L5” → “type2”

“L5” → “choleratio”

“L5” → “sbps”

“L6” → “strokeha”

“L6” → “ckidney”

“L6” → “type2”

Having accepted this underlying BN model (though we can choose to update it based on expert knowledge by removing known false links and adding expected true links), we now explore how it can generate synthetic data with the underlying distributions in the GT data on a variable by variable basis, while accounting for missingness using the “Miss Nodes/States” approach and the latent variable approach.

Synthetic data compared to ground truth data for underlying distributions

We compare distributions of variables from 100,000 data samples generated by the BN with the original ground truth data under three conditions for handling missing data: first, by simply deleting all cases with missing data. Second, by using “Miss Nodes” (for continuous variables) and “Miss States” (for discrete variables). Finally, by additionally learning latent variables within the BN structure using the FCI algorithm to capture unmeasured effects, including potentially MNAR data.

Figures 2 and 3 show the resulting distributions for a sample of features in the CPRD. We explore the distribution comparisons between the GT and SYN that is generated by logic sampling from the BN under two conditions for a number of representative variables—first, when missing data are simply deleted (Fig. 2). Figure 2a shows the result for GT and Fig. 2b shows the SYN data generated from this. Second, we explore explicitly modelling the distributions using our approaches described in the “Methods” (Fig. 3). Figure 3a shows the GT with no missing data removed, Fig. 3b shows the SYN data generated from this using our Miss Nodes/States data approach, and Fig. 3c shows the resulting SYN from using the latent variable method. Also included are the number of data points with missing cases and the number of distinct values for a feature (e.g. a value of two for discrete binary features and potentially large numbers for integers and real values).

Fig. 2
figure 2

Plots of sample distributions and statistics of the original ground truth data when all missing data are deleted along with plots, distributions, and statistics from the synthetic data that are generated using a BN inferred from the ground truth.

Fig. 3
figure 3

Plots of sample distributions and statistics of the original ground truth data including missing data as well as plots for the synthetic data that models missing data with “Miss Nodes/States” and with latent variables.

First, notice how these results show that, for some variables, simply deleting the missing data can result in very different distributions. For example, the age distribution of the GT when missing data are simply removed in Fig. 2a is very different from that of the original GT data without missing data removal in Fig. 3a. What is more, the approach to modelling missingness with “Miss Nodes/States” in Fig. 3b results in a distribution of similar shape to the original for some variables, but in certain cases the latent variable approach in Fig. 3c results in the distribution most similar to the ground truth with missing data (compare bmi in Fig. 3a–c). The bias in categorical data seems less significant, and both the “Miss Nodes/States” and latent variable approaches capture the smoking and stroke distributions very closely, though notice how different the distributions are if the missing data are simply removed, highlighting the importance of modelling missing values rather than removing them.

Note that the amount of missing data that is generated (% Missing) differs between the latent variable approach and the “Miss Nodes/States” approach, with the “Miss Nodes/States” approach in Fig. 3b reflecting this value more closely and the latent variable approach exhibiting far fewer missing cases. This is likely due to the latent variable method in Fig. 3c inferring the missing values. In summary, the synthetic data sets can reproduce distributions that are close to the ground truth, and the distributions are generally closer to the original when missing data are preserved and modelled. We have found this general trend across all features.

Each discrete variable is compared using Chi-squared tests to measure the difference between n samples of the Ground Truth (GT) and n samples of SYNthetic data (SYN). For variables with continuous values, the Kolmogorov–Smirnov (KS) test is used to measure the distribution difference between the GT and SYN data sets. In addition, the Kullback–Leibler divergence (KLD) is used to measure the distribution difference between sampled GT and SYN data sets. These approaches are described in more detail in the “Methods” section.
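For illustration, a minimal sketch of how such comparisons could be computed with SciPy is given below; the simulated GT and SYN columns, bin count, and smoothing constant are assumptions, not the paper’s implementation.

```python
# Illustrative sketch: chi-squared (categorical), two-sample KS (continuous),
# and a histogram-based KL divergence between a GT column and a SYN column.
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp, entropy

rng = np.random.default_rng(0)
gt_smoking = rng.choice(["never", "ex", "current"], size=5000, p=[0.50, 0.30, 0.20])
syn_smoking = rng.choice(["never", "ex", "current"], size=5000, p=[0.49, 0.31, 0.20])
gt_sbp = rng.normal(130, 15, size=5000)
syn_sbp = rng.normal(131, 15, size=5000)

# Chi-squared test on the contingency table of observed category counts
cats = ["never", "ex", "current"]
table = [[(gt_smoking == c).sum() for c in cats],
         [(syn_smoking == c).sum() for c in cats]]
chi2, p_chi2, _, _ = chi2_contingency(table)

# Two-sample KS test on a continuous variable
d_stat, p_ks = ks_2samp(gt_sbp, syn_sbp)

# KL divergence between binned (lightly smoothed) empirical distributions
bins = np.histogram_bin_edges(np.concatenate([gt_sbp, syn_sbp]), bins=50)
p, _ = np.histogram(gt_sbp, bins=bins, density=True)
q, _ = np.histogram(syn_sbp, bins=bins, density=True)
kld = entropy(p + 1e-12, q + 1e-12)

print(f"chi2 p={p_chi2:.3f}, KS D={d_stat:.3f} p={p_ks:.3f}, KLD={kld:.4f}")
```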

The Chi-squared test is performed with the null hypothesis that there is no significant difference between the expected frequencies from SYN and the observed frequencies from GT for each categorical variable.

Table 1 shows that for all features the null hypothesis cannot be clearly rejected in both scenarios, i.e. removing the missing data and modelling it (all p values are far greater than the 0.05 level). This was surprising and implies that simply deleting missing data is not a problem for this primary care data (or at least it does not have much impact on the overall distribution). This could be because of the size of data set that we are dealing with and missing data may be more of an issue with smaller sample sizes. In addition, modelling missingness explicitly is likely to impact certain cases more than others (for example, where people have refused to give certain information for some underlying reason—i.e. MNAR data). These cases may be rare but significant.

Table 1 Chi-squared p values for the null hypothesis of no difference in distributions between the GT and SYN data sets for categorical variables.

The KS test is performed to test the null hypothesis that the numerical variables of the GT and SYN data sets come from the same distribution. We explore this for a number of different sample sizes (n). This is because larger sample sizes make the test more likely to conclude that the two distributions are different (i.e. reject the null hypothesis), as the test is very sensitive to differences between distributions44.

Table 2 shows that the numerical variables from the SYN and GT data sets are indeed from the same distributions (the p values are always >0.01, meaning we cannot reject the hypothesis that they are from the same distribution) for all variables except age and bmi at very high sample sizes (indeed the D statistics are nearly always smaller for the models using latent variables, which means that the data sets are generally closer).

Table 2 KS test p values for the null hypothesis that the numerical variables of the GT and SYN data sets are from the same distribution, and the associated D statistic of the test.

We now look at using the KLD to see the difference over all variables for different samples (of size 100,000) of GT data in comparison to the difference between SYN and GT data sets.

Table 3 shows the mean squared KL distances between repeated GT samples compared with those between SYN samples and GT samples. Table 4 calculates the diffKL values using these results, with the missing data rates of the continuous variables listed in parentheses. Applying a KS test to these results for each variable shows that the KL distances between two ground truth samples are not significantly different from the KL distances between a ground truth sample and a synthetic data sample for the variables with relatively large distances (chol and bmi, with p values of 0.168 and 0.052, respectively). For age (p = 0.0), sbp (p = 0.0), and sbps (p = 0.002), the distances were found to be significantly different: the age and sbps distances are very small (or zero) for both GT and SYN data comparisons (see Table 3), and sbp is, interestingly, the variable for which the synthetic data are actually closer to the ground truth than the ground truth samples are to one another.

Table 3 The mean squared \({\boldsymbol{D}}_{{\boldsymbol{KL}}}({\boldsymbol{GT}}_{\boldsymbol{i}}||{\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{n}})\) \((n \in \{1 \ldots 10\})\) of each variable ending with “gt” and the mean squared \({\boldsymbol{D}}_{{\boldsymbol{KL}}}({\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{m}}||{\boldsymbol{SY}}_{\boldsymbol{i}}^{\boldsymbol{n}})\) \((m,n \in \{1 \ldots 10\})\) of each variable ending with “sy”.
Table 4 KL divergence differences between resampled data sets and synthetic data sets for each variable and associated missing rate in parentheses.

We assume that the synthetic data are suitably similar in distribution to the ground truth if the KL distances of the samples of synthetic data to the ground truth are similar to the KL distances of the resamples of ground truth data between one another. In order to test this, we randomly resample from the ground truth, GT, and calculate the KL distances between each sample. These distances are then compared to the KL distances between the synthetic data, SYN, and the GT. diffKL represents the difference between the KL distances between multiple resamples of GT and those between SYN data and GT, for each variable.

The mean diffKL values for the tested variables (in the bottom rows of Table 4) indicate that the synthetic KLDs vary between 8.244 and −1.286 when missing data are present. In some cases, such as systolic blood pressure (sbp), the synthetic data are consistently closer to the ground truth distribution shape than the resampled data are to one another. For variables without missing values, such as age, the KL distance differences are close to zero; in other words, the synthetic data are as close to the GTi distributions as the resamples are. We can thus conclude that our approach generates synthetic data that are no more different from the ground truth data than multiple resamples of the ground truth are from one another.

We now explore the joint distributions in the synthetic data sets by using kernel maximum mean discrepancy (kMMD) with a radial basis function kernel. We conducted distribution tests for 2-variable (253 combinations), 3-variable (1771 combinations), and 4-variable (8855 combinations) comparisons. The null hypothesis H0 for kMMD is that the samples to be tested come from the same distribution, with alpha ~0.05. Using the same SYN data sets from the previous experiment for each iteration (10 iterations), we aim to see how same-sized samples from the GT population and samples from SYN differ in terms of their distributions. The results of the H0 acceptance rate are shown in Table 5 (joint distribution tests on 1000 samples from a 1 million GT population and 100,000 sampled SYN data).

Table 5 Joint distribution tests for 2-, 3-, and 4-variable combinations using kernel MMD.

We can conclude from these results that the distance between the SYN and GT distributions is generally low when taking account of low-dimensional combinations of data features. What is more, these distances are not significantly worse than those between two GT samples when using our proposed method of latent variable modelling to handle missingness. The distance between SYN and GT can, however, increase as the number of combinations of data features increases (potentially as a result of simplification within the structure of the model).

In order to see the practical implication of differences between GT and SYN data, we further compare GT and SYN’s performance on training and testing machine learning classifiers in the next section.

Synthetic data compared to ground truth data for machine learning classifier comparison

Figure 4 compares the receiver operating characteristic (ROC) and precision recall (PR) curves for the GT data and SYN data (generated using the latent variable method) when a machine learning classifier is inferred for predicting stroke. The results shown are for a Bayesian generalised linear classifier. In particular, the area under the ROC curve (AUC) is calculated for the GT and SYN samples, and the Granger causality statistic, as described in the “Methods”, is calculated to determine how predictive the SYN curves are of the underlying GT curves. Note that a p value is generated to assess the Granger causality statistic at the 5% significance level.

Fig. 4
figure 4

Five-sample sensitivity analyses for a Bayesian generalised linear classifier on GT and SYN data (latent model) for fixed sample size of 100,000, including ROC and PR curves, and AUC and Granger statistics.

First, notice that the ROC and PR curves are similar in shape for the GT data (blue) and the SYN data (red). Given these sample curves, it is not surprising that the Granger causality statistic for all samples is significant at less than the p = 0.001 level. We also applied identical tests to other machine learning classifiers (see Supplementary Figs. 3 and 4), where all p values were found to be <0.001 except for the ROC curves generated when using stepwise regression with N < 1000. We conclude that, for the selected prediction algorithms, the sensitivity analysis obtained using SYN data samples predicts that obtained using the actual GT data (as their difference is not significant). Indeed, this experimental set-up implies that the generated SYN data are able to achieve statistical results equivalent to those of the GT data. (Incidentally, these AUC results are in line with similar results documented by Ozenne et al.45, i.e. high AUC ROC and low AUC PR curves were observed across tests.)

Detecting re-identification risks using outlier analysis with distance metrics

Finally, we explore the risk of re-identification of patients from the SYN data based on the clones (Rclone), inliers (Nin), and outliers (Nout) statistics described in the “Methods” section. We base our experiments on the concept of events per variable (EPV), which explores the effect of sample size and number of variables on predictive accuracy46. The number of EPV is the number of events divided by the number of degrees of freedom required to represent all of the variables in the model. We use an EPV value of 22.2 based on the conclusions in the study by Austin and Steyerberg46. The results in Table 6 below are based on 10 iterations of resampling without replacement. This corresponds to a sample size of 7000 for each iteration within 11 random population groups. Notice how the risk of clones decreases as the sample size increases (as one would expect). While we also see that the risk of outliers decreases, it is always very small. What is more, the actual number of outliers generated stays relatively stable (between 10 and 70). These statistics demonstrate that, while there is always a risk of a synthetic patient being linked to an actual patient in the ground truth data in the case of outliers, we can exploit such metrics to identify the at-risk samples and decide whether they should be removed (if they are clones or outliers with a too-small k-anonymisation value).

Table 6 The risk of seeing clones Rclone, inliers Nin, and outliers Nout in the synthetic data for increasing sample sizes of ground truth data.

Discussion

This paper has introduced and validated a set of techniques to model complex heterogeneous data for generating realistic synthetic data sets that capture the correct dependencies and distributions. The approach exploits resampling with probabilistic graphical modelling that explicitly handles missingness and complex non-linear/non-Gaussian relationships and is transparent in how the data are modelled, enabling biases to be assessed and accounted for. Through a case study on cardiovascular risk, the paper has demonstrated that these synthetic data sets not only generate similar distributions over both discrete and continuous variables but also produce similar sensitivity analyses to the original ground truth data (in the form of PR and ROC curves).

Patient privacy is quantified through a demonstration that the proximity of individual synthetic data points to real patients can be scored using outlier statistics and distance metrics, though more research is required on the robustness of this approach, particularly when clusters of patients with rare diseases or demographics are modelled. We have demonstrated that our method can flag identical or similar patient profiles in the synthetic and real data. While the occurrence of these “clones” or similar rare patient profiles appears to be low (and does not seem to increase with sample size), there is still a small risk. However, our metrics enable these risks to be quantified so that appropriate action can be taken prior to releasing any data (depending on the risk protocol adopted).

Another issue that may impact the production of realistic synthetic data is the temporal nature of many health data sets. The methods that we have adopted here are well suited to handle this characteristic. For example, the dynamic BN47 and the hidden Markov model48 are generalisations of the standard BN model used in this paper. Here the time dimension is represented by unrolling the network so that nodes represent variables at specific time points, as shown in Fig. 5c, d. These approaches will be included in our future directions for the project.

Fig. 5: Bayesian network architectures.
figure 5

a A Bayesian network with four nodes. b A Bayesian network classifier with class node C. c A dynamic Bayesian network with two time-slices, t and t−1. d A Hidden Markov model with latent variable H.

Generating synthetic data from large-scale real-world data that are noisy, contain structurally missing data, and exhibit many non-linear relationships, such as the UK primary care data, can bring enormous benefits to AI research. In particular, it can prevent the need for using real patient data when developing and validating state-of-the-art predictive models. This paper has explored several key issues involved with this, but there is scope for more research to ensure that these data sets do not contain underlying biases (e.g. by exploring data collection processes) or present a privacy risk (e.g. by carrying out simulated privacy attacks), if they are to be made freely available without any access controls to facilitate innovation.

Methods

Data description—CPRD Aurum

For our case study, we used an extract from this database on 122,328 patients (all aged >16 years).

We tested the synthetic data performance using a risk prediction algorithm for cardiovascular disease (encompassing stroke, transient ischaemic attack, myocardial infarction, heart attacks, and angina). We used the same features as used by Hippisley-Cox et al.49 for predicting the onset of cardiovascular disease within 10 years (explained in Table 7).

Table 7 Description of the selected features used from CPRD for analysis based on predicting cardiovascular disease.

BN modelling

We have selected a BN due to its flexibility and transparency (see Fig. 5a). A BN represents the joint probability distribution p(X) over a set of variables, X1,…,XN, by exploiting conditional independence relationships between them; these relationships are captured in a directed acyclic graph (DAG). The conditional probability distribution (CPD) associated with each variable, Xi, encodes the probability of observing its values given the values of its parents and can be described by a continuous or a discrete distribution. All the CPDs in a BN together provide an efficient factorisation of the joint probability (see Eq. 1)

$$p\left( x \right) = \mathop {\prod }\limits_{i = 1}^n p(x_i|{\rm{pa}}_i),$$
(1)

where pai are the parents of the node xi (which denotes both node and variable).

This family of models can be used to perform inference by entering evidence into one or more nodes and inferring the posterior distributions of the remaining nodes. In this way, data can be sampled under different observations. We use logic sampling50 to sample data where we “fix” certain features if necessary, by entering evidence. For example, we can generate data where all samples are formed from people aged >65 years, or female-only samples, or all people who have been diagnosed with hypertension.
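As an illustration of logic sampling with evidence, the following sketch (ours, not the paper’s code) samples a toy three-node discrete network in topological order and rejects samples that contradict the entered evidence; the network structure and probabilities are invented for the example.

```python
# Illustrative sketch: logic sampling from a toy BN
# Age -> Hypertension -> Treatment, with optional evidence.
import numpy as np

rng = np.random.default_rng(0)

def sample_record():
    age_over_65 = rng.random() < 0.2                  # P(age > 65)
    p_hyp = 0.5 if age_over_65 else 0.2               # P(hypertension | age)
    hypertension = rng.random() < p_hyp
    p_treat = 0.8 if hypertension else 0.05           # P(treatment | hypertension)
    treatment = rng.random() < p_treat
    return {"age_over_65": age_over_65,
            "hypertension": hypertension,
            "treatment": treatment}

def logic_sample(n, evidence=None):
    """Sample records in topological order; keep only those matching the evidence."""
    evidence = evidence or {}
    records = []
    while len(records) < n:
        r = sample_record()
        if all(r[k] == v for k, v in evidence.items()):
            records.append(r)
    return records

# e.g. generate a synthetic cohort restricted to diagnosed hypertensives
synthetic = logic_sample(1000, evidence={"hypertension": True})
```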

BNs can be constructed by hand, where the links represent some form of influence, or they can be inferred from data using constraint-based algorithms such as the PC or FCI algorithm20, or search-and-score methods based on scores such as BIC51 or MDL52. Here we use a method known as structural expectation maximisation26,53 to infer models directly from the CPRD data while handling missing data. We record the fit of the models over multiple runs to calibrate the robustness of the models to sampling variation. This family of models can be used to perform machine learning prediction, such as with the BN classifier in Fig. 5b, clustering using the EM algorithm, and time-series forecasting by unrolling the BN into the time domain as in Fig. 5c, d.

We use three approaches to handle missing data: one for discrete nodes, where we add a “missing” state to all possible states (Fig. 6a); one for continuous nodes, where we add a new binary parent to each node representing whether the value is missing or not (Fig. 6b); and one where we use the FCI algorithm to infer any latent variables in the network. The algorithm is applied to 10 resampled data sets to calculate robust statistics for determining the inclusion and position of any latent variables in the networks, e.g. Fig. 6c, where the distribution of a variable is directly influenced by a discrete latent variable that is discovered as a parent. By identifying these robust latent variables, we aim to improve the details of the underlying distributions as well as capture any MNAR effects. Please see Supplementary Fig. 1 for the threshold statistics for each variable and Supplementary Fig. 2 for a sample network including latent variables.
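A minimal preprocessing sketch of the first two encodings is given below (our own illustration, assuming a pandas data frame; the column names are hypothetical): discrete variables receive an explicit “missing” state, and each continuous variable gains a binary miss-indicator column that can act as its parent in the learnt structure.

```python
# Illustrative sketch: encoding "Miss States" and "Miss Nodes" before structure learning.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "smoking": ["never", None, "current", "ex"],   # discrete variable
    "bmi":     [24.1, 31.5, np.nan, 27.8],          # continuous variable
})

# "Miss State" for discrete nodes: missingness becomes an explicit category
df["smoking"] = df["smoking"].fillna("missing")

# "Miss Node" for continuous nodes: a binary indicator that can be a parent of bmi
# (imputation of the value itself is then handled by the model, e.g. via structural EM)
df["bmi_miss"] = df["bmi"].isna().astype(int)

print(df)
```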

Fig. 6: Methods to capture missing data and unmeasured effects.
figure 6

a A binary “Miss Node” pointing to all continuous nodes in a Bayesian network. b A “Miss State” for discrete nodes. c A latent variable with m states to capture Missing Not at Random data and other unmeasured effects (in both discrete and continuous nodes).

Experiments

Modelling missing data to capture underlying distributions

We assume that the synthetic data are suitably similar in distribution to the ground truth if the KL distances of the samples of synthetic data to the ground truth are similar to the KL distances of the resamples of ground truth data between one another. In order to test this, the experiment base population GTi is randomly sampled from the full CPRD primary care database. KL distances are compared to assess if the generated SYN can be representative. Three groups of data are used, where i denotes the sample size.

GTi—The sampled ground truth from the total population P;

\({\boldsymbol{SY}}_{\boldsymbol{i}}^{\boldsymbol{n}}\)—The generated n synthetic data sets based on GTi (with equal size to GTi);

\({\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{n}},{\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{m}}\)—The other n or m sets of resampled ground truth data (with equal size to GTi) from the total population P without replacement.

Two KL distances are obtained from each target variable’s distribution shape and these can then be compared as in Eqs. 2–5.

$$\overline {{\boldsymbol{D}}_{{\boldsymbol{KL}}}^2} ({\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{m}}||{\boldsymbol{SY}}_{\boldsymbol{i}}^{\boldsymbol{n}})\;{\mathrm{and}}\;\overline {{\boldsymbol{D}}_{{\boldsymbol{KL}}}^2} ({\boldsymbol{GT}}_{\boldsymbol{i}}||{\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{n}}).$$
(2)
  • When the \(\overline {{\boldsymbol{D}}_{{\boldsymbol{KL}}}^2}\) is close to 0, then the distributions are almost identical.

  • When the value of \(\overline {{\boldsymbol{D}}_{{\boldsymbol{KL}}}^2} ({\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{m}}||{\boldsymbol{SY}}_{\boldsymbol{i}}^{\boldsymbol{n}})\) is close to \(\overline {{\boldsymbol{D}}_{{\boldsymbol{KL}}}^2} ({\boldsymbol{GT}}_{\boldsymbol{i}}||{\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{n}})\), then the generated synthetic variable has an almost identical distribution as the GTi.

$$\overline {{\boldsymbol{D}}_{{\boldsymbol{KL}}}^2} ({\boldsymbol{GT}}_{\boldsymbol{i}}||{\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{n}})\;{\mathrm{is}}\;{\mathrm{the}}\;{\mathrm{mean}}\;{\mathrm{squared}}\;{\mathrm{KLD}}\;\left( {n \in \left\{ {1 \ldots 10} \right\}} \right).$$
(3)
$$\overline {{\boldsymbol{D}}_{{\boldsymbol{KL}}}^2} ({\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{m}}||{\boldsymbol{SY}}_{\boldsymbol{i}}^{\boldsymbol{n}})\;{\mathrm{is}}\;{\mathrm{the}}\;{\mathrm{mean}}\;{\mathrm{squared}}\;{\mathrm{KLD}}\;\left( {m,n \in \left\{ {1 \ldots 10} \right\}} \right).$$
(4)
$${\boldsymbol{diff}}_{{\boldsymbol{KL}}} = \overline {{\boldsymbol{D}}_{{\boldsymbol{KL}}}^2} ({\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{m}}||{\boldsymbol{SY}}_{\boldsymbol{i}}^{\boldsymbol{n}}) - \overline {{\boldsymbol{D}}_{{\boldsymbol{KL}}}^2} ({\boldsymbol{GT}}_{\boldsymbol{i}}||{\boldsymbol{GT}}_{\boldsymbol{i}}^{\boldsymbol{n}}).$$
(5)
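The following sketch (an assumed implementation, not the paper’s code) illustrates Eqs. 2–5 for one continuous variable using histogram-based KL divergences; the simulated samples, bin count, and smoothing constant are placeholders.

```python
# Illustrative sketch of Eqs. 2-5: mean squared KLD between GT resamples,
# mean squared KLD between GT and SYN samples, and their difference diff_KL.
import numpy as np
from itertools import product
from scipy.stats import entropy

def binned_kld(a, b, bins):
    p, _ = np.histogram(a, bins=bins, density=True)
    q, _ = np.histogram(b, bins=bins, density=True)
    return entropy(p + 1e-12, q + 1e-12)   # KL(p || q) on smoothed histograms

rng = np.random.default_rng(1)
gt_i = rng.normal(130, 15, 10_000)                                   # GT_i
gt_resamples = [rng.normal(130, 15, 10_000) for _ in range(10)]      # GT_i^n
syn_samples = [rng.normal(130.5, 15.5, 10_000) for _ in range(10)]   # SY_i^n

bins = np.histogram_bin_edges(gt_i, bins=50)
kld_gt = np.mean([binned_kld(gt_i, g, bins) ** 2 for g in gt_resamples])        # Eq. 3
kld_syn = np.mean([binned_kld(g, s, bins) ** 2
                   for g, s in product(gt_resamples, syn_samples)])             # Eq. 4
diff_kl = kld_syn - kld_gt                                                       # Eq. 5
print(f"diff_KL = {diff_kl:.4f}")
```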

We also explore the joint distribution of our models compared to the ground truth data using MMD. The MMD is an approach to represent distances between distributions as distances between mean embeddings of features54. The approach tests whether distributions p and q are different on the basis of samples drawn from each of them, by finding a smooth function that is large on the points drawn from p and small (as negative as possible) on the points from q. The test statistic is the difference between the mean function values on the two samples. When this is large, the samples are likely to be drawn from different distributions.

For example, suppose we have joint distributions P from GTi and Q from \({\boldsymbol{SY}}_{\boldsymbol{i}}^{\boldsymbol{n}}\) over a set X. The MMD can be defined by a feature map φ: X → H, where H is called a reproducing kernel Hilbert space. When \(X = H = R_{\rm{d}}\) and \(\varphi(x)\) is a kernel feature map over x, the MMD is defined as in Eq. 6.

$$\begin{array}{l}{\mathrm{MMD}}\left( {P,Q} \right) = \left\| {E_{X \sim P}\left[ {\varphi \left( X \right)} \right] - E_{Y \sim Q}\left[ {\varphi \left( Y \right)} \right]} \right\|_H\\ \qquad\qquad\quad\;\;\; = \left\| {E_{X \sim P}[X] - E_{Y \sim Q}[Y]} \right\|_{R_{\rm{d}}}\\ \qquad\qquad\quad\;\;\; = \left\| {\mu _P - \mu _Q} \right\|_{R_{\rm{d}}},\end{array}$$
(6)

where μP and μQ are the mean embeddings of distributions P and Q.

We take 10 synthetic and ground truth data set pairs. For each pair, we explore the combinations of 2, 3, and 4 variables and apply the MMD test to compare all combinations of these variables. Each test yields a decision on the H0 hypothesis for that combination. We calculate the percentage of times that H0 is not rejected for the combinations of 2, 3, and 4 variables.
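A minimal sketch of the kernel MMD statistic with an RBF kernel is shown below (our own illustrative implementation; the kernel bandwidth and simulated samples are assumptions). Values near zero indicate that the two joint samples are consistent with the same distribution.

```python
# Illustrative sketch: biased estimate of MMD^2 with an RBF kernel
# for comparing the joint distribution of a subset of variables in GT and SYN.
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    sq_dists = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate between samples x and y (rows = records)."""
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    return k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()

rng = np.random.default_rng(2)
gt = rng.normal(size=(1000, 3))    # e.g. a 3-variable combination from GT
syn = rng.normal(size=(1000, 3))   # the same variables from SYN
print(f"MMD^2 estimate: {mmd_rbf(gt, syn):.5f}")  # near 0 when distributions match
```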

Comparing machine learning classifiers inferred and tested from synthetic data and ground truth data

ROC and PR curves are often used to assess the predictive performance of a machine learning model. ROC curves capture the trade-off between false positives and false negatives but can often mask the biases in imbalanced data sets (for example, when the positive case is rare in a population)55. PR curves, on the other hand, can detect these biases as they capture the trade-off between precision (also known as the positive predictive value, representing the proportion of correct true positives among all positive predictions) and recall (sensitivity). We analyse the ROC curves and PR curves that are generated when 3 machine learning classifiers (stepwise regression, linear discriminant analysis, and Bayesian generalised linear models) are used to model and predict the GT data. We explore the ROC and PR plots for the classifiers’ performance on the SYN data and the original GT. We also measure the capability of the synthetic data curves to predict the GT curves for varying sample sizes using a Granger causality test56. In our experiments, the Granger causality test checks the null hypothesis that the synthetic data curves cannot predict (or “Granger-cause”) the ground truth curves.
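The sketch below illustrates this pipeline under stated assumptions (scikit-learn and statsmodels, a logistic regression stand-in for the classifiers above, and a noisy copy standing in for the synthetic data); it is not the paper’s implementation.

```python
# Illustrative sketch: ROC curves for a classifier trained on "GT" vs "SYN" data,
# evaluated on a held-out GT test set, followed by a Granger causality test of
# whether the SYN curve predicts the GT curve. All data and models are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc   # PR curves: precision_recall_curve, analogously
from statsmodels.tsa.stattools import grangercausalitytests

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)
X_gt, X_test, y_gt, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)
X_syn, y_syn = X_gt + rng.normal(scale=0.1, size=X_gt.shape), y_gt  # crude "synthetic" stand-in

def tpr_on_grid(X_train, y_train, grid):
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    fpr, tpr, _ = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
    return np.interp(grid, fpr, tpr)          # common FPR grid for point-wise comparison

grid = np.linspace(0, 1, 200)
tpr_gt = tpr_on_grid(X_gt, y_gt, grid)
tpr_syn = tpr_on_grid(X_syn, y_syn, grid)
print("AUC GT:", auc(grid, tpr_gt), "AUC SYN:", auc(grid, tpr_syn))

# Null hypothesis: the SYN curve does not Granger-cause the GT curve.
# A little jitter avoids singular regressions on these smooth toy curves.
pair = np.column_stack([tpr_gt, tpr_syn]) + rng.normal(scale=1e-4, size=(200, 2))
grangercausalitytests(pair, maxlag=2)
```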

Detecting re-identification risks using outlier analysis with distance metrics

The method we propose aims to generate synthetic data that avoid the privacy issues associated with releasing real patient data. However, if the synthetic data sets enable re-identification of real patients (for example, through proximity between a synthetic data point and a real patient), then this intrinsic value is lost. The probability of re-identification increases the more unique a patient’s data are (for example, the older a patient is, or in cases of rare disease). Here we use a form of outlier detection to measure this risk. We randomly select synthetic datapoints from SYN and calculate the distances between each of them and all GT datapoints. Using an outlier analysis method (based on the distribution of the GT data and the individual synthetic data point), we calculate the number of GT datapoints (k) that are in the same distribution as the synthetic data point (rather than being statistically separate as an outlier). We apply this for varying large samples (100 K to 1 million) of synthetic datapoints. The smallest value of k for each of these can be considered the k-anonymisation value.

We use the quantile function to assess how many real-world patients are close to a synthetic patient given a pre-defined probability of the smallest distance (e.g. Euclidean distance) observations. For example, given a probability of 0.1%, the n observations of real patient records that are closest to a synthetic patient record can be obtained. In this experiment, the GT and SYN data sets (of equal size, S = SGT = SSYN) are combined into one data set, and we define the instances with high privacy risk under any of the following conditions:

Clones: when the distance is 0, i.e. the synthetic patient record is identical to a real-world patient record; the clone rate is used to measure the clone risk Rclone defined in Eq. 7.

$$R_{{\rm{clone}}} = \frac{{{\rm{Total}}\;{\rm{identical}}\;{\rm{instances}}}}{{{\rm{Total}}\;{\rm{instances}}}}.$$
(7)

Inliers: when there is only one real patient instance that is closest to the synthetic patient given a pre-defined probability Pr within the lower quantiles. The total number of such pairs is used to measure the inlier risk Rin defined in Eq. 8.

$$R_{{\rm{in}}} = |\left( {{\rm{Pair}}\left( {{\rm{SYN}}_i,{\rm{GT}}_j} \right)|{\rm{Pr}}} \right)|,i,j \in \{ 1, \ldots ,S\}.$$
(8)

Outliers: when there is only one real patient instance that is closest to the synthetic patient given a pre-defined probability Pt within the upper quantiles. The total number of such pairs is used to measure the outlier risk Rout defined in Eq. 9.

$$R_{{\rm{out}}} = |\left( {{\rm{Pair}}\left( {{\rm{SYN}}_i,{\rm{GT}}_j} \right)|{\rm{Pt}}} \right)|,i,j \in \{ 1, \ldots ,S\}.$$
(9)
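A minimal sketch of one possible reading of these checks is given below (our own illustration, not the paper’s code): nearest-neighbour distances between synthetic and real records give the clone rate, and lower/upper quantile thresholds flag inlier and outlier pairs; the quantile probabilities and simulated data are placeholders.

```python
# Illustrative sketch: clone, inlier, and outlier counts from nearest-neighbour
# distances between synthetic (SYN) and ground truth (GT) records.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)
gt = rng.normal(size=(2000, 5))     # real patient records (numeric features only)
syn = rng.normal(size=(2000, 5))    # synthetic records of equal size

dists = cdist(syn, gt)              # pairwise Euclidean distances
nearest = dists.min(axis=1)         # distance from each SYN record to its closest GT record

r_clone = np.mean(nearest == 0)     # Eq. 7: proportion of exact clones

# Pre-defined probabilities (Pr, Pt) for the lower and upper quantile thresholds
lower, upper = np.quantile(nearest, [0.001, 0.999])
n_in = np.sum(nearest <= lower)     # inlier pairs: suspiciously close matches
n_out = np.sum(nearest >= upper)    # outlier pairs: isolated, potentially identifying matches
print(f"R_clone={r_clone:.4f}, N_in={n_in}, N_out={n_out}")
```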

Ethics

The project was undertaken within the institutional governance framework of the Medicines and Healthcare products Regulatory Agency (MHRA) UK and Brunel University London. The use of real anonymised patient data as ground truth data was undertaken under the CPRD’s overarching research ethics committee (REC) approval (reference: 05/MRE04/87) and within CPRD’s secure research environment. Additional advice on privacy of the ground truth data was obtained from the UK Information Commissioner’s Office (ICO) Innovation Hub in response to a formal query by the MHRA.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.