We’ve discussed before the pitfalls of various anonymization or “de-identification” techniques and how the information can be “deanonymized” or re-identified, leading to privacy problems for individuals. A few months ago, the EU’s Article 29 Working Party on the Protection of Individuals with regard to the Processing of Personal Data released a detailed report (pdf) on the issue. Now, researchers at Neustar Research have delved into the “anonymized” NYC taxicab dataset and were able to re-identify passengers and their destinations, including customers of strip clubs:
There has been a lot of online comment recently about a dataset released by the New York City Taxi and Limousine Commission. It contains details about every taxi ride (yellow cabs) in New York in 2013, including the pickup and drop off times, locations, fare and tip amounts, as well as anonymized (hashed) versions of the taxi’s license and medallion numbers. It was obtained via a FOIL (Freedom of Information Law) request earlier this year and has been making waves in the hacker community ever since.
The release of this data in this unalloyed format raises several privacy concerns. The most well-documented of these deals with the hash function used to “anonymize” the license and medallion numbers. A bit of lateral thinking from one civic hacker and the data was completely de-anonymized. This data can now be used to calculate, for example, any driver’s annual income. More disquieting, though, in my opinion, is the privacy risk to passengers. With only a small amount of auxiliary knowledge, using this dataset an attacker could identify where an individual went, how much they paid, weekly habits, etc. I will demonstrate how easy this is to do in the following section.
Read the full story for details on how the data was deanonymized in order to be able to identify individuals.