    Ars Technica: ‘Anonymized’ data really isn’t — and here’s why not

    Ars Technica has a good story explaining why so-called “anonymized” data usually isn’t anonymous at all.

    In 2000, [researcher Latanya Sweeney] showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
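    To make the uniqueness claim concrete, here is a minimal Python sketch that measures what share of records in a table are the only ones with their particular combination of ZIP code, birthdate, and sex. The records and the function name `unique_fraction` are hypothetical, made up for illustration; this is not Sweeney's data or her method.

```python
from collections import Counter

def unique_fraction(records, keys=("zip", "birthdate", "sex")):
    """Return the share of records whose combination of `keys` is unique."""
    if not records:
        return 0.0
    # Reduce each record to its quasi-identifier tuple.
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    # A record is uniquely identifiable if nobody else shares its tuple.
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(records)

# Hypothetical toy records, not real people.
people = [
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "M"},
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "M"},
    {"zip": "02138", "birthdate": "1962-03-14", "sex": "F"},
    {"zip": "27514", "birthdate": "1980-11-02", "sex": "F"},
]

print(unique_fraction(people))  # 2 of the 4 records are unique -> 0.5
```

    On a real population table, a result near 0.87 for these three fields would correspond to the figure quoted above.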

    Such work by computer scientists over the last fifteen years has shown a serious flaw in the basic idea behind “personal information”: almost all information can be “personal” when combined with enough other relevant bits of data.

    That’s the claim advanced by [law professor Paul] Ohm in his lengthy new paper on “the surprising failure of anonymization.” As increasing amounts of information on all of us are collected and disseminated online, scrubbing data just isn’t enough to keep our individual “databases of ruin” out of the hands of the police, political enemies, nosy neighbors, friends, and spies. [...]

    Examples of the anonymization failures aren’t hard to find.

    When AOL researchers released a massive dataset of search queries, they first “anonymized” the data by scrubbing user IDs and IP addresses. When Netflix made a huge database of movie recommendations available for study, it spent time doing the same thing. Despite scrubbing the obviously identifiable information from the data, computer scientists were able to identify individual users in both datasets. (The Netflix team then moved on to Twitter users.) [...]
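    One simple way such re-identification can work is by linking the scrubbed records to a second, named dataset on the fields the two have in common. The sketch below is only an illustration of that general idea: the records, the field names, and the `link` function are all hypothetical, and the researchers who studied the AOL and Netflix data used their own, more elaborate techniques.

```python
def link(anonymized, public, keys):
    """Map each anonymized record's index to a name when exactly one
    public record shares its values for `keys`."""
    # Index the named dataset by the shared fields.
    index = {}
    for p in public:
        index.setdefault(tuple(p[k] for k in keys), []).append(p["name"])
    matches = {}
    for i, a in enumerate(anonymized):
        names = index.get(tuple(a[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the record
            matches[i] = names[0]
    return matches

# Hypothetical "scrubbed" records: names removed, other fields kept.
scrubbed = [
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "27514", "birthdate": "1980-11-02", "sex": "F", "diagnosis": "asthma"},
]

# Hypothetical public list that carries names alongside the same fields.
voter_roll = [
    {"name": "A. Example", "zip": "02138", "birthdate": "1945-07-31", "sex": "M"},
    {"name": "B. Sample", "zip": "02139", "birthdate": "1980-11-02", "sex": "F"},
]

print(link(scrubbed, voter_roll, ("zip", "birthdate", "sex")))  # {0: 'A. Example'}
```

    The first scrubbed record is tied back to a name even though it never contained one; the second stays unmatched because no public record shares all three fields.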

    For users, the prospect of some secret leaking to the public grows as databases proliferate. Here is Ohm’s nightmare scenario: “For almost every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm. Perhaps it is a fact about past conduct, health, or family shame. For almost every one of us, then, we can assume a hypothetical ‘database of ruin,’ the one containing this fact but until now splintered across dozens of databases on computers around the world, and thus disconnected from our identity. Reidentification has formed the database of ruin and given access to it to our worst enemies.”


    One Response to “Ars Technica: ‘Anonymized’ data really isn’t — and here’s why not”

    1. Who Had my Genetic Information Yesterday, Who Has it Today, and Who Might Have it Tomorrow? | North Carolina Journal of Law and Technology Says:

      [...] and policies like it have been touted as an excellent method of maintaining privacy, it has been challenged by privacy advocates in many contexts, from the Google Books Settlement to abortion rights. The [...]
