Why Synthetic Data is Safer Than Masking

Bridgette Befort

Walking through an example that illustrates the benefits of synthetic data.

Enterprises use sensitive personal data for a variety of applications, including testing internal applications and platforms, building AI models, and retaining historical information about their users to gain future insights.

To keep this personal data private, organizations often turn to masking as a straightforward, easy-to-implement method of anonymizing sensitive data. Masking replaces sensitive data (e.g., a name or social security number) with another value, such as a fake name or a random integer hash, so that personally identifiable information no longer directly references an individual.
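To make this concrete, here is a minimal Python sketch of masking. The records and field names are invented for illustration, and this is not any particular vendor's tool:

```python
import secrets

# Invented example records; "name" and "ssn" are the sensitive fields.
records = [
    {"name": "Alice Smith", "ssn": "123-45-6789", "zip": "02139"},
    {"name": "Bob Jones", "ssn": "987-65-4321", "zip": "02139"},
]

mask_map = {}  # one-to-one mapping: original value -> random ID

def mask(value):
    # Each original value keeps the same random ID every time it appears,
    # so the mapping is one-to-one -- a weakness we return to below.
    if value not in mask_map:
        mask_map[value] = secrets.randbelow(10_000)
    return mask_map[value]

masked = [{**r, "name": mask(r["name"]), "ssn": mask(r["ssn"])} for r in records]
print(masked)  # names and SSNs are replaced, but "zip" (a quasi-identifier) remains
```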

While anonymization and masking functions are easy to implement, the protection they provide is dubious, and they leave data open to privacy attacks.

Privacy practitioners must first determine which data are sensitive (e.g., are birth dates sensitive?) so that they can identify which data require removal or masking. Additionally, because masking creates an intentional one-to-one mapping from each original value to its mask, it increases the likelihood of re-identification: recover one link in the mapping and you recover every record that uses it.

Masked data are also susceptible to linking and singling-out attacks. In a linking attack, adversaries uncover sensitive information about people in one dataset using information from a different dataset. In a singling-out attack, adversaries identify a person in a dataset by their unique combination of feature values, even when their personal information is masked.
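As a quick illustration of singling out, here is a toy Python check over hypothetical masked records (the fields and values are invented):

```python
from collections import Counter

# Invented masked records: direct identifiers are gone, but each row
# still carries a combination of quasi-identifiers.
rows = [
    {"id": 101, "zip": "02139", "age": 34, "gender": "F"},
    {"id": 102, "zip": "02139", "age": 34, "gender": "M"},
    {"id": 103, "zip": "02140", "age": 61, "gender": "F"},
]

def combo(row):
    return (row["zip"], row["age"], row["gender"])

counts = Counter(combo(r) for r in rows)

# A combination that occurs exactly once singles out one individual:
# anyone who knows those attributes knows exactly which row is that person's.
singled_out = [r for r in rows if counts[combo(r)] == 1]
print(f"{len(singled_out)} of {len(rows)} records are singled out")
```

In this tiny example, every row is unique on just three attributes, so masking the names protected no one.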

So, what does masking look like in practice? Pretend you operate a dog walking business and have information about your clients' pets. Your data may look something like this:

Original Data

| Dog's Name | Breed | Street Where Dog Lives | Favorite Toy |
| --- | --- | --- | --- |
| Spot | Dalmatian | Main | Tennis Ball |
| Puff | Poodle | Broadway | Squeaky Toy |
| Rover | Golden Retriever | Park | Bone |

Now, suppose you have a protégé who tells you they want to start a dog walking business in another city, and they ask to see your records as an example of how to organize their new business's data. However, your clients have specifically asked that their privacy, and their dogs' privacy, be protected. So, to maintain privacy, you decide to mask the names of the dogs you walk, replacing them with random ID codes, and you send your protégé the following information:

Masked Data

| Dog's Name | Breed | Street Where Dog Lives | Favorite Toy |
| --- | --- | --- | --- |
| 952 | Dalmatian | Main | Tennis Ball |
| 478 | Poodle | Broadway | Squeaky Toy |
| 003 | Golden Retriever | Park | Bone |

However, it turns out your protégé is an adversary, and because your masked data has severe privacy problems, they can identify all of your clients and steal their business from you! Despite masking each dog's name with a random ID, the information about the dog and its owner remains vulnerable.

For example, if your protégé (turned adversary) had access to a secondary data source, such as social media, they might realize that dog #952 is actually Spot, the Dalmatian whose owner always posts pictures of it playing with a tennis ball near “Main Street Park”.
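In code, a linking attack is just a join. The sketch below uses pandas and an invented "social media" table to show how dog #952 could be re-identified:

```python
import pandas as pd

# The masked records your protégé received.
masked = pd.DataFrame({
    "id":     [952, 478, 3],
    "breed":  ["Dalmatian", "Poodle", "Golden Retriever"],
    "street": ["Main", "Broadway", "Park"],
    "toy":    ["Tennis Ball", "Squeaky Toy", "Bone"],
})

# Invented auxiliary data an adversary might scrape from public posts.
social = pd.DataFrame({
    "dog_name": ["Spot"],
    "breed":    ["Dalmatian"],
    "street":   ["Main"],
    "toy":      ["Tennis Ball"],
})

# Joining on the shared attributes re-identifies dog #952 as Spot,
# despite the masked name.
linked = masked.merge(social, on=["breed", "street", "toy"])
print(linked[["id", "dog_name"]])
```

Even three innocuous attributes are enough to make the join unambiguous here.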

While this is a silly example, failures of masking have directly impacted real people. For example, in 2006, AOL released millions of search queries, with names of the searchers masked by an ID.

These IDs did not protect privacy because the searches provided enough information about individuals that they were easily identified. This event is referred to as the AOL data anonymization debacle (Barbaro), and it resulted in a class action lawsuit that was settled in 2013.

As another example, in 2006 Netflix issued a fun challenge to the programming community, asking coders to improve its movie recommendation algorithm. To aid in this task, Netflix released roughly 100 million movie ratings made by half a million customers, whose personal information was masked using random numbers.

Similar to the AOL debacle, the ratings contained enough information about individuals to identify them, illustrating how little auxiliary information is needed to breach the privacy of masked data (Schneier). Finally, a research report by de Montjoye et al. (2013) found that, even when personal information is removed, 95% of individuals can be uniquely re-identified using just four time-location data points.
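To get some intuition for that result, here is a toy simulation (with entirely invented traces, not the study's data) checking how often four random (hour, location) points pin down exactly one user:

```python
import random

# Toy version of the de Montjoye et al. (2013) question: how often do
# four (hour, cell-tower) points match exactly one user's trace?
random.seed(0)
users = {
    u: {(random.randrange(24), random.randrange(50)) for _ in range(40)}
    for u in range(500)
}  # each invented user visits ~40 random (hour, tower) points

def is_unique(user, k=4):
    # Sample k of this user's points and count how many users' traces
    # contain all k of them (the count includes the user themself).
    sample = random.sample(sorted(users[user]), k)
    return sum(all(p in pts for p in sample) for pts in users.values()) == 1

frac = sum(is_unique(u) for u in users) / len(users)
print(f"{frac:.0%} of users are uniquely identified by 4 points")
```

With even this crude model, nearly every user is pinned down by four points, because real-world traces are far sparser than they feel.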

Unfortunately, as the value of data has increased in recent years, these problems have only worsened. 

There is a safer approach than masking: synthetic data generated by Howso Synthesizer.

Howso creates synthetic data that is entirely new (or "fake"). It has no one-to-one mapping to the original data and is sufficiently different from the original data that it is not susceptible to privacy attacks like linking or singling out. And even though Howso's synthetic data generation method combines a variety of rigorous privacy techniques to ensure that your data is verifiably private, it maintains the overall utility of your original data.

So, circling back to our toy example, next time you need to send private information, you can protect it against adversaries using Howso Synthesizer. This would provide you with new synthetic data that would look something like this:

Synthetic Data

| Dog's Name | Breed | Street Where Dog Lives | Favorite Toy |
| --- | --- | --- | --- |
| Fluff | Bichon Frise | Maple | Tug-o-war Rope |
| Dot | Australian Shepherd | 1st | Rubber Ball |
| Fido | Labrador | Elm | Shoe |

You can see there is now no one-to-one mapping between the original and the synthetic data, so your clients can't be identified (and then stolen from you!). However, the information contained in the synthetic data is still useful for learning how best to organize dog walking data. Now you can relax, knowing your clients' (and their dogs') data are not at risk of privacy leaks!
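For intuition only, here is a toy Python sketch of producing rows with no one-to-one link back to the originals by sampling each column independently from fresh values. To be clear, this is not Howso Synthesizer's method: Howso preserves the statistical relationships in your data while adding rigorous privacy guarantees, whereas this naive version does neither.

```python
import random

# Invented pools of plausible values; none of the original dogs appear.
names   = ["Fluff", "Dot", "Fido", "Rex", "Bella"]
breeds  = ["Bichon Frise", "Australian Shepherd", "Labrador", "Beagle"]
streets = ["Maple", "1st", "Elm", "Oak"]
toys    = ["Tug-o-war Rope", "Rubber Ball", "Shoe", "Frisbee"]

random.seed(1)
synthetic = [
    {
        "dog_name": random.choice(names),
        "breed": random.choice(breeds),
        "street": random.choice(streets),
        "toy": random.choice(toys),
    }
    for _ in range(3)
]
for row in synthetic:
    print(row)  # no row maps back to any original client
```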

Are you interested in learning more about creating private synthetic data with Howso Synthesizer? If so, contact us here!