In a recent article for Science, researchers Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, and Alex “Sandy” Pentland showed that the “anonymization” of personal data is no guarantee of privacy for individuals. Before we discuss their study, consider that for almost two decades researchers have been telling us that anonymization, or “de-identification,” of private information has significant problems, and that individuals can be re-identified and have their privacy breached.
Latanya Sweeney has been researching the issue of de-anonymization or re-identification of data for years. (She has taught at Harvard and Carnegie Mellon and has been the chief technologist for the Federal Trade Commission.) In 1998, she explained how a former governor of Massachusetts had his full medical record re-identified by cross-referencing Census information with de-identified health data. Sweeney also found that, with birth date alone, 12 percent of a population of voters can be re-identified. With birth date and gender, that number rises to 29 percent; with birth date and Zip code, to 69 percent. In 2000, using 1990 Census data, Sweeney found that 87 percent of the U.S. population could be identified by birth date, gender, and Zip code.
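Sweeney’s percentages come from counting how many people are unique in their combination of “quasi-identifiers.” A minimal sketch of that counting, using a randomly generated toy population rather than real Census data (the population size, date ranges, and Zip codes below are illustrative assumptions, so the resulting fractions will not match Sweeney’s figures):

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical toy population with three quasi-identifiers.
def make_person():
    return {
        "birth_date": (random.randint(1920, 2000),
                       random.randint(1, 12),
                       random.randint(1, 28)),
        "gender": random.choice("MF"),
        "zip": f"{random.randrange(100):05d}",  # 100 invented Zip codes
    }

population = [make_person() for _ in range(10_000)]

def fraction_unique(people, keys):
    """Fraction of people whose combination of the given quasi-identifier
    values appears exactly once in the population, i.e., is uniquely
    identifying on its own."""
    combos = Counter(tuple(p[k] for k in keys) for p in people)
    return sum(1 for p in people
               if combos[tuple(p[k] for k in keys)] == 1) / len(people)

for keys in (["birth_date"],
             ["birth_date", "gender"],
             ["birth_date", "gender", "zip"]):
    print(keys, round(fraction_unique(population, keys), 3))
```

The pattern the sketch reproduces is the one Sweeney documented: each additional quasi-identifier splits the anonymity sets further, so the fraction of uniquely identifiable people can only grow as attributes are combined.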
In 2008, University of Texas researchers Arvind Narayanan and Vitaly Shmatikov were able to re-identify (pdf) individuals from a dataset that Netflix had released, data that the video-rental and -streaming service had said was anonymized. The researchers said, “Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.”
One of the most-publicized examples of re-identification of anonymized data occurred in 2006, when AOL published the search records of 658,000 Americans, demonstrating that replacing a name or address with a number does not necessarily prevent search data from being linked back to an individual. Although the search logs released by AOL had been anonymized, identifying each user by only a number, New York Times reporters were quickly able to match some user numbers with the correct individuals. User No. 4417749 “conducted hundreds of searches over a three-month period on topics ranging from ‘numb fingers’ to ‘60 single men’ to ‘dog that urinates on everything.’” A short investigation led Times reporters to “Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga.” and has three dogs.
In 2009, University of Colorado law professor Paul Ohm discussed “the surprising failure of anonymization,” and said, “Data can either be useful or perfectly anonymous but never both.” He cited Sweeney’s research, as well as the research of other academics.
In 2011, Sweeney reported on the dangers that can arise from the re-identification of anonymized medical data and advocated a “privacy-preserving marketplace” for data. And in 2013, she led a group that was able to re-identify a sample of anonymous participants in the Personal Genome Project, a DNA study. The group wrote: “We linked names and contact information to publicly available profiles in the Personal Genome Project. These profiles contain medical and genomic information, including details about medications, procedures and diseases, and demographic information, such as date of birth, gender, and postal code. By linking demographics to public records such as voter lists, and mining for names hidden in attached documents, we correctly identified 84 to 97 percent of the profiles for which we provided names.”
Last year, Princeton’s Arvind Narayanan and Edward W. Felten published a paper, “No silver bullet: De-identification still doesn’t work” (pdf), concerning the continued privacy problems with de-identification of personal information. They were blunt in their assessment, stating: “The thrust of our arguments is that (i) there is no evidence that de-identification works either in theory or in practice and (ii) attempts to quantify its efficacy are unscientific and promote a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do.”
And just a few months ago, researchers at Neustar delved into the anonymized NYC taxicab dataset and were able to re-identify passengers and their destinations, including customers of strip clubs.
What does all of this mean? It means that we should not be surprised by the findings of the recent academic paper in Science, “Unique in the shopping mall: On the reidentifiability of credit card metadata” (Science html; archive pdf). The researchers studied “3 months of credit card records for 1.1 million people” and found that four publicly available “spatiotemporal points are enough to uniquely reidentify 90% of individuals.” They were able to identify the individuals even though “the data set was simply anonymized, which means that it did not contain any names, account numbers, or obvious identifiers. Each transaction was time-stamped with a resolution of 1 day and associated with one shop.”
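The measure behind the paper’s 90% figure, which the authors call unicity, asks how often a few known (shop, day) points single out exactly one person’s trace in the whole dataset. A rough sketch of that idea on synthetic data (the population size, shop and day counts, and trace lengths below are invented for illustration and will not reproduce the paper’s numbers):

```python
import random

random.seed(1)

# Hypothetical toy dataset: each person's trace is a set of (shop, day)
# transaction points, with shops and days drawn uniformly at random.
N_PEOPLE, N_SHOPS, N_DAYS, N_TX = 1000, 50, 90, 20
traces = [
    {(random.randrange(N_SHOPS), random.randrange(N_DAYS)) for _ in range(N_TX)}
    for _ in range(N_PEOPLE)
]

def unicity(traces, k):
    """Fraction of people who are uniquely matched when an adversary
    knows k random points from their own trace."""
    unique = 0
    for i, trace in enumerate(traces):
        known_points = set(random.sample(sorted(trace), k))
        matches = [j for j, t in enumerate(traces) if known_points <= t]
        if matches == [i]:
            unique += 1
    return unique / len(traces)

for k in (1, 2, 4):
    print(k, unicity(traces, k))
```

Even in this crude model, a single point rarely pins anyone down, while a handful of points usually does, because the space of (shop, day) combinations is so large that two people almost never share several specific points. That is the qualitative effect the Science paper quantifies on real credit card records.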
Companies and institutions continue to use anonymization or deidentification techniques and processes to release data. Yet we have seen time and again that these processes haven’t worked. Personal, private information is linked back to an individual, violating his or her privacy.