K-map, the weird cousin of k-anonymity (2017)

rntz · on Oct 5, 2018

The article gives the hypothetical example of redacting a table row

  {zip: 85535, age: 79}

to

  {zip: 85xxx, age: 79}

on the grounds that there are fewer than k (for some tunable constant k, larger = more anonymity) 79-year-olds in zip code 85535, but many more in zip 85xxx. However, if I see the second record, then because it has been redacted, I also know that whatever zip code the person actually had, there were fewer than k 79-year-olds in it! This may narrow the set of candidates considerably.

So it doesn't seem sufficient to count the mere number of people a redacted row could possibly match. You have to consider the meta-level information that knowing the row had to be redacted gives the attacker.
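To make the inference concrete, here is a minimal sketch. The population table, the threshold K = 3, and the rule "a row is redacted exactly when its (zip, age) group has fewer than k people" are all assumptions invented for illustration; none of them come from the article.

```python
from collections import Counter

K = 3  # hypothetical anonymity threshold

# Hypothetical population of (zip, age) pairs, invented for this sketch.
population = (
    [("85535", 79)] * 2
    + [("85001", 79)] * 40
    + [("85200", 79)] * 25
    + [("85777", 79)] * 1
)

def naive_candidates(population, zip_prefix, age):
    """Everyone the redacted row {zip: 85xxx, age: 79} could syntactically match."""
    return [p for p in population if p[0].startswith(zip_prefix) and p[1] == age]

def informed_candidates(population, zip_prefix, age, k):
    """Same set, narrowed by the fact that redaction happened at all:
    the true (zip, age) group must have had fewer than k members."""
    counts = Counter(population)
    return [p for p in naive_candidates(population, zip_prefix, age)
            if counts[p] < k]

naive = naive_candidates(population, "85", 79)
informed = informed_candidates(population, "85", 79, K)
print(len(naive), len(informed))  # 68 3
```

With these made-up numbers, the redacted row appears to hide among 68 people, but an attacker who knows the redaction rule is left with only 3 candidates.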

TedTed · on Oct 5, 2018

(I wrote the article.)

Excellent point! That's one of the reasons why people understood that syntactic definitions (which try to tell you "how anonymous data is" by looking at the data) were not the right approach, and that you had to look at the mechanism to determine whether the process was sufficiently private.

This line of thinking led to the definition of differential privacy, which completely changed the perspective. I wrote two articles about it, one very simple [1] and the other a bit more detailed [2].

[1] https://desfontain.es/privacy/differential-privacy-awesomene...

[2] https://desfontain.es/privacy/differential-privacy-in-more-d...

btilly · on Oct 5, 2018

This is true.

In fact it would be better to not indicate that the data was redacted. Instead of redacting it, change it to something else in the redacted range, preferably more common. With no hint about which pieces of data were changed, the attacker can't use what you describe.

Heck, merely including random small (unidentified) changes makes matching much, much harder.
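A rough sketch of the "change it instead of redacting it" idea, under assumptions of my own: rows are (zip, age) tuples, the 2-digit prefix defines the "redacted range", and the function name `swap_rare_zips` and its parameters are hypothetical.

```python
import random
from collections import Counter

def swap_rare_zips(rows, population_counts, k, prefix_len=2, rng=None):
    """Replace rare zip codes with a more common zip sharing the same prefix.

    population_counts is a Counter mapping (zip, age) to the number of
    people in that group. The output carries no marker showing which rows
    were changed.
    """
    rng = rng or random.Random()
    out = []
    for zip_code, age in rows:
        if population_counts[(zip_code, age)] >= k:
            out.append((zip_code, age))  # common enough, keep as-is
            continue
        # Zips in the same prefix range, same age, with at least k people.
        candidates = [
            z for (z, a), n in population_counts.items()
            if a == age and z[:prefix_len] == zip_code[:prefix_len] and n >= k
        ]
        if not candidates:
            continue  # assumption: drop the row when no safe replacement exists
        # Prefer more common zips by weighting on group size.
        weights = [population_counts[(z, age)] for z in candidates]
        out.append((rng.choices(candidates, weights=weights, k=1)[0], age))
    return out

counts = Counter({("85535", 79): 2, ("85001", 79): 40, ("85200", 79): 25})
rows = [("85535", 79), ("85001", 79)]
result = swap_rare_zips(rows, counts, k=3, rng=random.Random(0))
```

Here the first row comes back with zip 85001 or 85200 and the second row is untouched. This only hides which rows were altered; it is not a formal privacy guarantee.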

nixpulvis · on Oct 5, 2018

Yes, but now when you find out that some important property you're testing for was caused by a mold outbreak in 85535, you'll be disturbed to know that the research was published for a subject in 85001.

oh_sigh · on Oct 6, 2018

You could publish the criteria for how your data may be jittered, but just not the specifics. Then, future users of the data could know to what extent they can rely on the accuracy of the data.