It’s no secret that each day we generate a lot of new data. Over the last 25 years or so, this data generation, together with our newfound ability to analyse it at scale, has ushered in an era of Big Data in which data scientists have been able to improve the efficiency and accuracy of many decision-making processes. However, it’s also no secret that our use of this data is imperfect. Rather than being the new oil, data is much more akin to fire: it improves lives when used appropriately, but can also be a destructive force when abused.
As data collection and analysis methods are refined, there is increasing potential for sensitive data to be misused to make discriminatory decisions or breach individual privacy rights. So, how can we as a society ensure we reap the clear benefits of data analysis while mitigating the risk of abuse? One emerging solution is privacy-enhancing technologies, or PETs.
What are PETs?
PETs are a set of technologies that mitigate the risk of data privacy abuse to the individual. Commercially, the term has grown to cover a broad range, from technologies that aid compliance with data privacy regulation to those that mathematically guarantee privacy-safe data analysis. In the academic context, "PET" has largely referred to technologies that technically protect consumer privacy using cryptography or the algorithmic enforcement of privacy policies. In this article, we’ll discuss privacy-enhancing technologies in the broader sense: the set of technologies that provide privacy protection in some way.
On a basic level, these techniques are used to ensure data scientists or analysts can’t learn anything new about a specific individual within a dataset as a result of the analysis they undertake. The aim is to mitigate the risk of privacy leakage, while still preserving the utility of underlying data. We’ve found it helpful to separate the space of privacy technologies into four broad areas:
- Data modification: These techniques change the data in some way before use. Examples include tokenization, synthetic data generation, and differential privacy when applied to the data itself.
- Disclosure control: Disclosure control specifies constraints on the data that can be released. The data can be checked for release manually, but checks can also be automated using definitions of privacy impact such as k-anonymity and differential privacy. Note that this step can occur at many levels of a given analysis: it can be applied from an organisational perspective, for a specific collaboration, or even against a specific dataset.
- Confidential collaboration: Confidential collaboration technologies let multiple data custodians allow a particular data scientist to compute insights on their combined data without leaking private inputs. Examples include multi-party computation, trusted execution environments and homomorphic encryption.
- Access control: Access control systems confirm that the person trying to gather an insight is authorised to do so. Access control is sometimes not considered a privacy technology, but it is always core to the protection of data. Note that this step can occur at many levels of a given analysis: it can be applied from an organisational perspective, for a specific collaboration, or even against a specific dataset.
Putting these technologies together gives one example of a data pipeline shown diagrammatically below. In the rest of this article we’ll explain the various technologies in a bit more detail and then describe some potential applications.
Data Modification
Tokenization
Data tokenization is a privacy-protection method that replaces pre-specified sensitive data with non-sensitive, pseudonymous tokens. This protects the data to a degree because the resulting data is no longer considered personally identifying. In some forms of tokenization, a mapping is maintained so that the original data can be reconstructed. In other forms, there is no mapping back, and some authorities would consider the data to be anonymised.
Tokenization is a popular approach because it aligns with legal concepts discussed in legislation like GDPR and HIPAA. Drawbacks include loss of data utility, security concerns around maintaining the pseudonymisation mapping, and cumbersome configuration. You can read more about tokenization in our introduction to the topic here.
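To make the mapping idea concrete, here is a minimal sketch of vault-style tokenization. It assumes a keyed HMAC for deterministic tokens and an in-memory vault for reversal; the `Tokenizer` class and its methods are invented for illustration, not a production design.

```python
import hashlib
import hmac
import secrets

class Tokenizer:
    """Toy vault-style tokenizer: swaps sensitive values for pseudonymous
    tokens and keeps a reverse mapping so the originals can be recovered."""

    def __init__(self):
        self.key = secrets.token_bytes(32)  # per-deployment secret key
        self.vault = {}                     # token -> original value

    def tokenize(self, value: str) -> str:
        # A keyed HMAC makes tokens deterministic, so equal inputs still link up
        token = hmac.new(self.key, value.encode(), hashlib.sha256).hexdigest()[:16]
        self.vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self.vault[token]

tok = Tokenizer()
t = tok.tokenize("alice@example.com")   # a 16-hex-character pseudonymous token
tok.detokenize(t)                       # recovers "alice@example.com"
```

Dropping the vault (and the key) turns this into the non-reversible form of tokenization mentioned above.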
Differential Privacy (DP)
Differential Privacy obfuscates data by randomly introducing noise. Imagine a dataset of people labelled as either having or not having a disease, and then randomly flipping some of those labels. The resulting dataset would be a differentially private form of the same data and may still be useful for extracting an overall trend. The same approach of adding noise can be used for any process of creating statistics. Averages, for example, could have a random adjustment made to them to create a differentially private form.
Differential privacy is popular because it gives a clear mathematical definition of privacy in terms of probabilities. It is also easy to understand what happens when multiple pieces of differentially private information are shared. The main drawback is that some accuracy is lost, because the statistics that are generated have noise added. You can read more about differential privacy in our intro here, and also find links to tutorials here.
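As an illustration of the noise-addition idea, the sketch below computes a differentially private mean with the Laplace mechanism. The clipping bounds, the `dp_mean` helper and the sample data are assumptions for the example; a real deployment would use a vetted DP library.

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism.
    Clipping to [lower, upper] bounds any one person's influence."""
    clipped = [min(max(v, lower), upper) for v in values]
    # one person can change the mean by at most (upper - lower) / n
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon)

ages = [34, 42, 29, 51, 46, 38, 60, 27, 33, 45]
dp_mean(ages, lower=0, upper=100, epsilon=1.0)  # true mean 40.5, plus noise
```

Smaller epsilon means more noise and stronger privacy; larger epsilon means the released mean is closer to the true value.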
Synthetic Data Generation
One extreme form of data modification is synthetic data generation. In this approach a model is built that is able to generate data of a form similar to the original data. The new data is sometimes called a digital twin of the original. The model is usually built from the data itself, so it’s important to take care that the generated data does not leak anything private. For this reason, synthetic data generation approaches often also use techniques like differential privacy as a guarantee against privacy leakage.
Synthetic data can be very useful for testing purposes, and some people even argue that it can be used for analytics and machine learning tasks. Concerns include impacts on utility, potential loss of privacy (which can be mitigated with differential privacy) and the inability to link records across datasets. You can read more here.
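For intuition, here is a deliberately minimal generator that fits a single Gaussian to one numeric column and samples from it. Real synthetic data tools model joint distributions across columns and often add differential privacy; the `synthesize_column` helper and the salary figures are invented for the example.

```python
import random
import statistics

def synthesize_column(real_values, n_samples):
    """Fit a single Gaussian to one numeric column and sample from it.
    This one-column version is only a sketch of the general idea."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [random.gauss(mu, sigma) for _ in range(n_samples)]

salaries = [48_000, 52_000, 61_000, 45_000, 58_000, 50_000]
synthetic = synthesize_column(salaries, n_samples=1000)
# synthetic shares the real column's centre and spread, but contains no real record
```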
Disclosure Control
Manual Inspection
The simplest form of disclosure control is manual inspection. Data that is to be released is sent to specified people or placed in a separate area, sometimes called an Airlock. The data custodians then manually review the data and decide whether it can be transferred out of the system.
Manual inspection is a common approach to disclosure control in data clean rooms (also sometimes called trusted research environments or data safe havens). These environments are set up so that data scientists can do arbitrary computations within the environment but need manual inspection for any data to leave.
The advantage of manual inspection is that it is very flexible, but it requires significant effort from the data custodian. In more complex cases, such as machine learning models, it may not be possible for the data custodian to understand what is being released. Even in simple cases, it is not always clear that the data is safe, especially when multiple requests for data are approved. Manual inspection also cannot be used in collaborations between data clean rooms.
k-anonymity
A release of data is said to be k-anonymous if every person involved in the release is indistinguishable from at least k-1 others included in the release. The idea is that everything is grouped together into groups of at least k individuals. Knowing the average salary of a group is considered acceptable if there were, say, 100 individuals in the calculation, but not if there were only 2.
One way of deciding whether to release data is then to choose an acceptable k, say 10, and allow the data to be released as long as it is k-anonymous. This is popular as it is relatively easy to understand. A significant risk with k-anonymity occurs when multiple data releases are approved. Consider two queries for the total salary paid to people aged under 44 and under 45. Both of these may involve more than 100 people, but if someone knows the identity of the only person aged exactly 44, they can calculate that person’s salary from the difference between the two totals.
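The grouping idea can be checked mechanically. The sketch below computes the k of a small dataset over a chosen set of quasi-identifiers; the column names and rows are invented for the example.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """The k of a release: the size of the smallest group of rows that
    share the same values for every quasi-identifier."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Ages are banded and zip codes truncated, yet k is still only 1 here
rows = [
    {"age_band": "40-49", "zip": "101**", "salary": 50_000},
    {"age_band": "40-49", "zip": "101**", "salary": 61_000},
    {"age_band": "50-59", "zip": "101**", "salary": 55_000},
]
k_anonymity(rows, ["age_band", "zip"])  # 1: the 50-59 row stands alone
```

A disclosure control system would compare this value against the chosen threshold (say 10) before approving the release.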
Differential Privacy (DP)
We explained above how DP can be used to modify the data itself for privacy protection. When used as a disclosure control system, differential privacy checks that sufficient noise has been added to the data before allowing it to be released. The amount of noise that has been added is measured through two key differential privacy parameters, epsilon and delta.
The exact meaning of these parameters can be challenging to understand, but one can think of epsilon as a measure of “privacy loss” through the disclosure. An epsilon of between 1 and 5 is often considered relatively acceptable. You can read more about the details of differential privacy in our introduction here.
Differential privacy can be used for disclosure control by simply requiring a maximum level of acceptable epsilon privacy loss. An important feature of DP is that the privacy loss of multiple data releases can be calculated as the sum of the associated epsilons. This has strong advantages as a disclosure control mechanism due to the mathematical understanding of the privacy lost. The drawback is generally considered to be the loss of accuracy.
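Because epsilons add under basic composition, a disclosure control system can be as simple as a running budget. The `PrivacyBudget` class below is a hypothetical sketch of that accounting, not a real library API.

```python
class PrivacyBudget:
    """Track cumulative epsilon across releases. Under basic composition
    the losses add, so releases stop once the cap would be exceeded."""

    def __init__(self, max_epsilon: float):
        self.max_epsilon = max_epsilon
        self.spent = 0.0

    def request(self, epsilon: float) -> None:
        if self.spent + epsilon > self.max_epsilon:
            raise PermissionError("privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(max_epsilon=5.0)
budget.request(2.0)   # first release: 2.0 spent
budget.request(2.5)   # second release: 4.5 spent, still under the cap
# a further budget.request(1.0) would raise PermissionError (5.5 > 5.0)
```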
Confidential Collaboration
Confidential Collaboration is a term that we have coined to describe the various ways that multiple parties can jointly compute insights. The general situation is that a collection of parties hold separate pieces of information, x1, x2, …, xn, and they want one party to receive a function of those values, f(x1, x2, …, xn), without anyone having to share the inputs.
Secure Multi-Party Computation (SMPC)
In SMPC, multiple parties calculate the output of a given query via secret-sharing protocols. The data custodians exchange random secret shares, and by ensuring that the shares cancel each other out in the right ways, a data scientist can compute the desired function securely without learning the values held by any one participating party.
SMPC’s security depends heavily on the assumption that the parties involved in the calculation are not colluding. There is also an impact on computation speed and network cost due to the protocols involved. You can read more here.
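Additive secret sharing, one of the building blocks of SMPC, can be illustrated in a few lines. Three hypothetical custodians compute a total salary without any party seeing another’s input; this sketch runs in one process and omits the networking, trust model and more general protocols a real SMPC system needs.

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a value into n random shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three custodians each hold a salary; only the total should be revealed.
salaries = [50_000, 61_000, 55_000]
all_shares = [share(s, 3) for s in salaries]
# Party i collects the i-th share of every input and publishes its subtotal;
# each subtotal on its own is just a uniformly random number.
subtotals = [sum(col) % PRIME for col in zip(*all_shares)]
total = reconstruct(subtotals)  # 166_000, with no single salary revealed
```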
Homomorphic Encryption
Homomorphic encryption is a cryptographic approach in which computations can be done on encrypted data without knowing the decryption keys. When the encrypted forms of two values are known, Enc(X) and Enc(Y), it becomes possible to compute the encrypted form of a function of those values, Enc(f(X, Y)). This makes confidential collaboration almost trivial, because the encryption protects the input data directly.
The major concern with homomorphic encryption is that current methods are prohibitively costly and have significant impacts on computation time. For this reason, weaker forms of homomorphic encryption are often used, which restrict the types of computation that can be done. You can read more here.
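The Enc(X), Enc(Y) to Enc(X + Y) property can be demonstrated with a toy version of Paillier, a well-known additively homomorphic scheme. The primes below are far too small to be secure and are chosen only so the arithmetic is easy to follow.

```python
import math
import random

# Toy Paillier cryptosystem (additively homomorphic).
# Real keys use primes of 1024+ bits; these are purely illustrative.
p, q = 61, 53
n = p * q
n_sq = n * n
lam = math.lcm(p - 1, q - 1)          # Carmichael function of n
mu = pow(lam, -1, n)                  # with g = n + 1, L(g^lam mod n^2) = lam

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:        # the blinding factor must be invertible
        r = random.randrange(1, n)
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    l = (pow(c, lam, n_sq) - 1) // n  # the L function: L(u) = (u - 1) / n
    return (l * mu) % n

# Multiplying ciphertexts adds the underlying plaintexts.
a, b = encrypt(12), encrypt(30)
product = (a * b) % n_sq
decrypt(product)                      # 42, computed without decrypting a or b
```

This scheme supports only addition of plaintexts (and multiplication by known constants), which is exactly the kind of restriction the "weaker forms" above refer to; fully homomorphic schemes allow arbitrary computation at much greater cost.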
Trusted Execution Environments (TEEs)
Trusted Execution Environments use hardware solutions to enable confidential collaboration. The TEE provides protections so that the operator of the chip is unable to inspect what data is being used inside it. It also provides a mechanism called attestation, which externally verifies what software is running on the chip. The collaborators check that the correct software is running and then encrypt and send their data to the chip for computation.
TEEs provide an approach to confidential collaboration with relatively little network and computational cost. Key concerns include the security of the hardware method and the requirement to still send data outside of the sensitive environment. You can read more here.
Access Control
Regardless of the technologies used to protect against data disclosure, it is also important that access is restricted to the right people, for the right purposes. This is where access control comes in, and it is a critical technical control for any governance system.
The first component of access control is authentication, which is where a system establishes that the person trying to run a particular task is the person they say they are. Most organisations want to integrate the way employees authenticate with a single system, and so several standards have evolved for authentication. The most common open standards are OAuth, OIDC, and SAML. You can read more about our own integration of these standards in our guides.
The second component of access control is authorisation. This is where the system checks that the user has permission to do the desired activities. In the context of sensitive data, there is a useful distinction to be made between usage-based access controls and data-centric access controls. In a data-centric system, all access controls revolve around the data that can be used. Usage-based access control systems augment these controls by also restricting the specific types of activities allowed. You can read more about the distinction in our blog article.
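To illustrate the distinction, here is a hypothetical sketch in which a data-centric check (which dataset) and a usage-based check (which task) are applied in sequence. The grant tables, dataset name and function are invented for the example; a real system would back these with a policy store.

```python
# Hypothetical grant tables for a single user.
DATA_GRANTS = {"alice": {"patients_2023"}}           # data-centric: which datasets
USAGE_GRANTS = {"alice": {"aggregate_statistics"}}   # usage-based: which activities

def authorise(user: str, dataset: str, task: str) -> bool:
    """Allow a task only if both the dataset grant and the usage grant hold."""
    if dataset not in DATA_GRANTS.get(user, set()):
        return False  # data-centric check failed
    if task not in USAGE_GRANTS.get(user, set()):
        return False  # usage-based check failed
    return True

authorise("alice", "patients_2023", "aggregate_statistics")  # True
authorise("alice", "patients_2023", "row_level_export")      # False: task not granted
```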
How Can I Apply PETs to My Work?
We’re thrilled you asked! The most common use of PETs today is in ensuring sufficient information governance and risk mitigation when handling sensitive data, such as healthcare patient data or financial services data. Particular problems can often be solved by combining the techniques described above.
A number of open source libraries and communities have emerged to enable data scientists and AI/ML researchers to apply PETs to their analyses and model development. Bitfount is building a community of data scientists and researchers leveraging sensitive data who want to learn about how to safely unlock its full value. To find more tutorials and resources around PETs, explore the forum here.