In a recent article for Science, researchers Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, and Alex “Sandy” Pentland showed that the “anonymization” of personal data is not a guarantee of privacy for individuals. Before we discuss their study, let’s consider that it has been almost two decades of researchers telling us that anonymization, or “de-identification,” of private information has significant problems, and individuals can be re-identified and have their privacy breached.
Latanya Sweeney has been researching the issue of de-anonymization or re-identification of data for years. (She has taught at Harvard and Carnegie Mellon and has been the chief technologist for the Federal Trade Commission.) In 1998, she explained how a former governor of Massachusetts had his full medical record re-identified by cross-referencing Census information with de-identified health data. Sweeney also found that, with birth date alone, 12 percent of a population of voters can be re-identified. With birth date and gender, that number increases to 29 percent, and with birth date and Zip code it increases to 69 percent. In 2000, Sweeney found that 87 percent of the U.S. population could be identified with birth date, gender and Zip code. She used 1990 Census data.
In 2008, University of Texas researchers Arvind Narayanan and Vitaly Shmatikov were able to reidentify (pdf) individuals from a dataset that Netflix had released, data that the video-rental and -streaming service had said was anonymized. The researchers said, “Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.” Read more »