Synthetic health data, real regulatory challenge

Colin Mitchell, 5 October 2023


Synthetic data, artificial data that closely mimic the properties and relationships of real data, are not a new idea, but recent technological advances have brought them to prominence as a potentially transformative tool for research and innovation, particularly for AI development. High-profile articles, from Forbes and the Wall Street Journal among others, highlight the potential of synthetic data (including for AI-driven genomic data analysis). Most also note, however, that generating synthetic data from real personal data may lead to breaches of privacy; Techmonitor's coverage, for example, warns that 'Synthetic data may not be AI's privacy silver bullet'.

Part of the challenge for regulators, researchers and developers is the wide range of synthetic data generation methods, outputs and uses. A one-size-fits-all response cannot answer whether synthetic data generate privacy risks, or whether they are 'personal data', to adopt the language of data protection law.

In the health context, synthetic data are not always developed specifically to avoid privacy concerns; they may instead be created to fill gaps in the information needed to test products, software or sections of code. Either way, the use of real patient data to develop artificial datasets requires safeguards and careful testing to ensure that individual patients cannot be re-identified from the results.

Researchers in this space, including specialists at the Clinical Practice Research Datalink (CPRD), part of the medicines and medical devices regulator the MHRA, have been testing this process of generating and safeguarding synthetic health data. One part of the equation relates to the technical methods and safeguards involved, including testing whether skilled 'hackers' could feasibly re-identify an individual in synthetic datasets. The other part relates to what we mean by privacy, identification or 'personal data'. The latter are legal constructs, and they are notoriously hard to define at the boundaries between identification and anonymisation.
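By way of illustration only, the sketch below shows one simple heuristic sometimes used in this kind of technical testing: comparing how close each synthetic record sits to the real records used to train the generator versus to real records the generator never saw. If synthetic records are systematically closer to the training data, some training individuals may be recoverable from the output. This is a minimal, assumption-laden Python example (using numpy and scikit-learn, with random placeholder data and a hypothetical helper function); it is not the approach used by CPRD or the MHRA, and published evaluations such as those cited below rely on considerably more sophisticated attacks.

```python
# Illustrative sketch only: a distance-to-closest-record (DCR) check, one common
# heuristic for probing whether a synthetic dataset sits "too close" to the real
# records it was trained on. Data, names and interpretation are placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def distance_to_closest_record(synthetic: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """For each synthetic record, return the distance to its nearest reference record."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference)
    distances, _ = nn.kneighbors(synthetic)
    return distances.ravel()

# Hypothetical numeric feature matrices (rows = patients, columns = attributes).
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 10))      # real records used to fit the generator
holdout = rng.normal(size=(1000, 10))    # real records the generator never saw
synthetic = rng.normal(size=(1000, 10))  # generator output (placeholder here)

dcr_train = distance_to_closest_record(synthetic, train)
dcr_holdout = distance_to_closest_record(synthetic, holdout)

# If synthetic records are much closer to training records than to unseen records,
# the generator may be memorising (and so potentially exposing) real individuals.
print(f"median DCR to training data: {np.median(dcr_train):.3f}")
print(f"median DCR to holdout data:  {np.median(dcr_holdout):.3f}")
```

A check like this is only one small piece of an identification-risk assessment; stronger tests, such as the membership inference attacks described by Chen et al., probe whether an attacker can tell that a specific individual's record was in the training data at all.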

Our report, Are synthetic health data ‘personal data’?, was independently commissioned by the MHRA to assess the status of synthetic health data in UK data protection law. We evaluated the current legal framework (the UK and EU GDPR), regulatory guidance and latest legal commentary to examine whether—or in what circumstances—synthetic health data might be considered ‘personal data’.

We found that regulators and the courts have yet to grapple fully with synthetic data generation, and that data protection authorities across the EU and UK are cautiously positive about synthetic data as a means of safeguarding privacy, while recognising evidence that some risks may remain. We identify a current 'orthodox' approach being adopted by regulators: synthetic data generation is viewed as a novel privacy enhancing technology (PET), and if the input or training data are 'personal data', it is presumed that models and output data will remain personal data unless effective anonymisation can be demonstrated with confidence.

This position is based on evidence that some residual risks of identification may remain in synthetic datasets, depending on the methods used and the nature of the output data (see Stadler et al. or Chen et al.) [1]. As a consequence, synthetic health data developers and users should continue to treat synthetic data as personal data unless they can establish with confidence that the risk of identification has been reduced to remote or negligible levels. This requires careful consideration of the technical nature of the data as well as the environment of legal and organisational controls surrounding the data, in line with best practice on data protection impact assessments and anonymisation.

However, there are potential costs to applying data protection law in this 'orthodox' and cautious manner to all forms of synthetic health data generation. It is likely to encourage risk-averse decisions that limit access to and use of data, and to reduce the production and utility of synthetic data, because of the additional safeguards that may be applied and the resources and expertise needed to fully audit identification risks and make adjustments in response. Ultimately, this may slow the availability of synthetic data for health research and development, and potentially diminish work in this area because of the costs involved.

It is imperative that policymakers, regulators and technical specialists work together to assess and define standards which strike an appropriate balance between privacy and health data research and innovation. In our assessment, an alternative regulatory approach to synthetic health data, viewing some forms of synthetic data as non-personal data unless demonstrated otherwise, might be possible in certain circumstances. Establishing whether this is the case requires a much more context-specific approach from regulators and policymakers, scrutinising the evidence relating to specific methods for generating and safeguarding synthetic health data. While this will require considerable engagement between regulators and health researchers, the benefits of getting this right could be significant for health research, innovation and patient care.

The production of synthetic data is a good example of a novel technology that fundamentally challenges our existing regulatory framework. It raises questions about whether the potential risks or harms involved should be addressed through current legal standards or whether this overstretches the function of those laws. For example, should data protection law govern part of a purely synthetic dataset if it coincidentally matches a living individual, even if they were not part of the training data? Is this identification or something different—more akin to coincidental resemblance to a real person in a work of art?

These questions form part of a wider and increasingly important debate about the regulation of AI and related novel technologies. It could be argued that there are many potential harms of algorithmic processing (including harms to groups rather than individuals) that cannot be fully addressed using existing law. New regulatory approaches may be necessary. These could have the added benefit of reducing any temptation to stretch aspects of data protection law in order to safeguard against threats to individuals and society.

References

1. E.g. Stadler T, Oprisanu B, Troncoso C. Synthetic data – anonymisation groundhog day. In: 31st USENIX Security Symposium (USENIX Security 22); 2022. pp. 1451–1468. Chen D, Yu N, Zhang Y, Fritz M. GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models. In: Proceedings of the ACM Conference on Computer and Communications Security; 2020. pp. 343–362. https://doi.org/10.1145/3372297.3417238