Synthetic Data Generation

A computer-generated dataset sufficiently similar to an original base dataset.

By Bristena Oprisanu

Published on August 10, 2022

What is synthetic data?

Data collection and data sharing are often necessary steps in making progress for businesses and society, as they enable a number of compelling applications and analytics. Entities are therefore often willing, or compelled, to provide access to their datasets to enable analysis by third parties or to facilitate research. However, useful datasets regularly contain sensitive information, and sharing them can endanger the privacy of the individuals they describe.

Data sharing carries privacy risks of its own, and the EU’s General Data Protection Regulation (GDPR) and similar regulations globally put further pressure on businesses to be more responsible, accountable, and transparent with respect to personal information. Several methods have been proposed to mitigate these risks, e.g., anonymising datasets before sharing them. However, as pointed out on several occasions, anonymisation often fails to provide realistic privacy guarantees in practice. Another approach has been to release aggregate statistics, but this is also vulnerable to a number of attacks, such as membership inference (where one tests whether a given individual’s data is present in the aggregates). These challenges gave rise to additional mechanisms for privacy risk mitigation, including synthetic data generation.

Synthetic datasets are becoming increasingly popular for training artificial intelligence models in place of the original raw datasets from which they are generated. Proponents of this computer-generated data say it protects personal information and reduces the chances of bias emerging in AI systems. The basic idea is to generate a dataset sufficiently similar to the original that it is as statistically useful as the original, or that it can “fill in” empty attributes of the original data with likely values, while mitigating the risk of data abuse to individuals in the original dataset.

In recent years, researchers have focused on the generation of synthetic electronic health records (EHR), aiming to facilitate research in and adoption of machine learning in medicine. NHS England is one of the first organisations to explore the potential of synthetic data for enabling open data release for research purposes.

How is synthetic data generated?

There are various synthetic data generation methods, the most notable being:

1. Imputation models. One of the first approaches for generating fully synthetic data is to treat all observations from the sampling frame as missing data and to impute them using the multiple imputation method.

2. Statistical models. Another approach is to fit a statistical model to the original data. The main idea is to approximate the original data with a low-dimensional distribution, which is then sampled to generate new records.

3. Generative models. More recently, generative machine learning models have attracted a lot of attention from the research community. A generative model learns a data distribution, typically through unsupervised learning, with the aim of generating new samples that follow the same probability distribution as a given dataset. Generative models based on neural networks are trained by optimising the weights of the connections between neurons using back-propagation. For complex networks, the optimisation is usually done with the mini-batch stochastic gradient descent (SGD) algorithm.
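To make the statistical-model approach concrete, here is a minimal sketch in Python. It assumes the simplest possible low-dimensional approximation: one independent categorical distribution per column. The function names (`fit_marginals`, `sample_synthetic`) and the toy records are invented for illustration and do not come from any particular library. Independent marginals discard all correlations between columns, so practical methods usually fit richer models.

```python
import random
from collections import Counter

def fit_marginals(rows):
    """Estimate one categorical distribution per column (independent 1-way marginals)."""
    columns = list(zip(*rows))
    marginals = []
    for col in columns:
        counts = Counter(col)
        total = sum(counts.values())
        values = sorted(counts)
        marginals.append((values, [counts[v] / total for v in values]))
    return marginals

def sample_synthetic(marginals, n, rng):
    """Draw n synthetic rows, sampling each column independently of the others."""
    return [
        tuple(rng.choices(values, weights=probs, k=1)[0] for values, probs in marginals)
        for _ in range(n)
    ]

# Toy "original" data: (age band, smoker) pairs, invented for illustration.
original = [("30-39", "no"), ("30-39", "yes"), ("40-49", "no"), ("40-49", "no")]
model = fit_marginals(original)
synthetic = sample_synthetic(model, n=1000, rng=random.Random(0))
```

The synthetic rows only ever contain values seen in the original columns, and their per-column frequencies approach the fitted marginals as `n` grows. Nothing in this sketch provides a privacy guarantee.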

While these models can be used to fill in values in original datasets or even to generate entirely new datasets, synthetic datasets are still vulnerable to attack. Without appropriate privacy protection in place, a sophisticated adversary can reconstruct (possibly sensitive) training data from a synthetic dataset. As previously shown, synthetic data is subject to the same tradeoffs as previous anonymisation techniques, and the privacy gain of synthetic data publishing is highly unpredictable. Because it is not possible to predict which data features a generative model will preserve, it is possible to anticipate neither the privacy protection that synthetic data publishing will provide nor its utility loss. Unlike deterministic sanitisation techniques, synthetic data generation does not allow data holders to be transparent about what information will be omitted from the published dataset and what will be retained.

To overcome these issues, researchers have recently proposed a number of techniques that use carefully crafted random noise to provide strong privacy guarantees (specifically, Differential Privacy), so that the privacy leakage of synthetic data generation can be quantified by rigorous statistical means. A common approach in several of the proposed techniques is to train a generative machine learning model with a differentially private version of stochastic gradient descent, with the cumulative privacy loss tracked by the so-called moments accountant method. Even though differentially private generative models can provide a significantly higher privacy gain with respect to privacy attacks than traditional data synthesis algorithms, this holds only if the models’ implementation and operational environment do not break any of the theoretical assumptions of the privacy definition.
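The core mechanism of differentially private SGD can be sketched in a few lines of Python. Each example's gradient is clipped to a fixed norm, the clipped gradients are summed, and Gaussian noise scaled to that norm is added before averaging. This is an illustrative sketch only: the toy linear model, the function names, and the parameter values are assumptions, and the privacy accounting (the moments accountant) that turns the noise level into a quantified guarantee is not implemented here.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def per_example_grad(w, x, y):
    """Gradient of the squared error 0.5 * (w.x - y)^2 for a single example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def dp_sgd_step(w, batch, lr, max_norm, noise_multiplier, rng):
    """One noisy update: clip each example's gradient, sum, add Gaussian noise, average."""
    total = [0.0] * len(w)
    for x, y in batch:
        g = clip(per_example_grad(w, x, y), max_norm)
        total = [t + gi for t, gi in zip(total, g)]
    sigma = noise_multiplier * max_norm  # noise scale is tied to the clipping norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [wi - lr * ni / len(batch) for wi, ni in zip(w, noisy)]

# Toy mini-batch of (features, target) pairs, invented for illustration.
rng = random.Random(0)
batch = [([1.0, 2.0], 3.0), ([2.0, 0.5], 1.0), ([0.0, 1.0], 10.0)]
w = dp_sgd_step([0.0, 0.0], batch, lr=0.1, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single record can move the model, which is what allows the added noise to mask that record's contribution. In practice one would rely on a maintained differential privacy library rather than hand-written code like this.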

One challenge is preventing privacy leakage through metadata, which can be difficult to achieve in practice. Data owners are unlikely to have access to either a disjoint subset or a public dataset from the same distribution, which they would need in order to define metadata that fits the raw data without relying solely on the original dataset. This is especially true because synthetic data sharing is often motivated by the unique value of sensitive datasets that are limited in size. Data holders might therefore struggle to achieve the desired strict privacy guarantees when no alternative dataset is available. Even if a disjoint subset or public dataset is available, there is a risk of a large utility loss, whether from using public data or from splitting the available data to derive the necessary metadata. In either case, it is important to take these concerns into consideration when deciding whether synthetic data is the right fit for your use case.
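A small Python sketch shows why metadata derived from the raw data can leak information. It assumes the metadata consists of per-column value ranges. The function name `derive_bounds` and the records are invented for illustration.

```python
def derive_bounds(rows):
    """Compute per-column (min, max) metadata directly from the raw rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

# Toy records of (age, weekly clinic visits); values invented for illustration.
raw = [(34, 1), (41, 0), (29, 2), (103, 5)]
without_outlier = raw[:-1]

print(derive_bounds(raw))              # [(29, 103), (0, 5)]
print(derive_bounds(without_outlier))  # [(29, 41), (0, 2)]
```

The published ranges change when a single record is removed, so the metadata alone reveals that a 103-year-old with five weekly visits is in the data. This happens regardless of how much noise the generative model itself adds.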

While these limitations do exist, there are many use cases for which synthetic data generation may be the right fit. Users of the Bitfount platform are encouraged to connect synthetic data to Pods for the purposes of collaboration or research as they see appropriate.

References

https://transform.england.nhs.uk/ai-lab/explore-all-resources/develop-ai/exploring-how-to-create-mock-patient-data-synthetic-data-from-real-patient-data/

https://www.usenix.org/system/files/sec22summer_stadler.pdf

https://arxiv.org/abs/1607.00133

https://discovery.ucl.ac.uk/id/eprint/10142367/