Ars Technica has a good story explaining why so-called “anonymized” data usually isn’t anonymous at all.
In 2000, [researcher Latanya Sweeney] showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Such work by computer scientists over the last fifteen years has shown a serious flaw in the basic idea behind “personal information”: almost all information can be “personal” when combined with enough other relevant bits of data.
That’s the claim advanced by [Paul] Ohm in his lengthy new paper on “the surprising failure of anonymization.” As increasing amounts of information on all of us are collected and disseminated online, scrubbing data just isn’t enough to keep our individual “databases of ruin” out of the hands of the police, political enemies, nosy neighbors, friends, and spies. […]
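To make the ZIP code, birthdate, and sex point concrete, here is a minimal sketch of how one might measure uniqueness in a dataset. The records and the `unique_fraction` helper are invented for illustration; this is not Sweeney's method or data, just a way to see how quickly a few ordinary fields single people out.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, birthdate, sex). Invented for illustration.
records = [
    ("02139", "1965-07-31", "F"),
    ("02139", "1965-07-31", "F"),  # shares all three fields with the row above
    ("02139", "1972-01-15", "M"),
    ("02140", "1980-03-02", "F"),
    ("02140", "1980-03-02", "M"),
]


def unique_fraction(rows):
    """Return the share of rows whose field combination appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)


# Uniqueness using all three fields together.
all_fields = unique_fraction(records)

# Uniqueness using ZIP code alone: every ZIP here is shared by several rows.
zip_only = unique_fraction([(zip_code,) for zip_code, _, _ in records])

print(f"ZIP + birthdate + sex: {all_fields:.0%} unique")
print(f"ZIP only:              {zip_only:.0%} unique")
```

In this toy set nobody is unique on ZIP code alone, but three of the five rows are unique once birthdate and sex are added. No single field is identifying; the combination is.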
Examples of the anonymization failures aren’t hard to find.
When AOL researchers released a massive dataset of search queries, they first “anonymized” the data by scrubbing user IDs and IP addresses. When Netflix made a huge database of movie recommendations available for study, it spent time doing the same thing. Despite scrubbing the obviously identifiable information from the data, computer scientists were able to identify individual users in both datasets. (The Netflix team then moved on to Twitter users.) […]
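The general shape of a reidentification is a join between a “scrubbed” release and some other dataset that still has names attached. The sketch below is a simplified illustration with made-up records and a hypothetical `reidentify` helper. It is not the technique used against the AOL or Netflix data, which the excerpt does not describe; it assumes the two datasets happen to share ZIP code, birthdate, and sex.

```python
# Hypothetical "scrubbed" release: names removed, other fields kept.
scrubbed = [
    {"zip": "02139", "birthdate": "1965-07-31", "sex": "F", "secret": "record A"},
    {"zip": "02140", "birthdate": "1980-03-02", "sex": "M", "secret": "record B"},
    {"zip": "02141", "birthdate": "1990-11-20", "sex": "F", "secret": "record C"},
]

# Hypothetical public list (say, a directory) that includes names.
public = [
    {"name": "Alice Example", "zip": "02139", "birthdate": "1965-07-31", "sex": "F"},
    {"name": "Bob Example", "zip": "02140", "birthdate": "1980-03-02", "sex": "M"},
    {"name": "Carol Example", "zip": "02141", "birthdate": "1990-11-20", "sex": "F"},
    {"name": "Dana Example", "zip": "02141", "birthdate": "1990-11-20", "sex": "F"},
]


def reidentify(scrubbed_rows, public_rows, keys=("zip", "birthdate", "sex")):
    """Link scrubbed rows to names when exactly one public row shares the keys."""
    # Index the public list by the shared fields.
    names_by_key = {}
    for person in public_rows:
        key = tuple(person[k] for k in keys)
        names_by_key.setdefault(key, []).append(person["name"])

    matches = []
    for row in scrubbed_rows:
        candidates = names_by_key.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match puts a name on the row
            matches.append((candidates[0], row["secret"]))
    return matches


matches = reidentify(scrubbed, public)
for name, secret in matches:
    print(f"{name} -> {secret}")
```

Two of the three scrubbed rows get a name back. The third stays ambiguous only because two people in the public list share the same three fields, which is the kind of protection that disappears as more fields or more datasets are added.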
For users, the prospect of some secret leaking to the public grows as databases proliferate. Here is Ohm’s nightmare scenario: “For almost every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm. Perhaps it is a fact about past conduct, health, or family shame. For almost every one of us, then, we can assume a hypothetical ‘database of ruin,’ the one containing this fact but until now splintered across dozens of databases on computers around the world, and thus disconnected from our identity. Reidentification has formed the database of ruin and given access to it to our worst enemies.”