I was taken aback when the clerk at the bookstore said, in a matter-of-fact way, “Can we verify your email address today?” Verify? I don’t recall that I ever gave it – and I was paying with cash. Then I realized: the store was asking me to add my email address to a database. “No,” I replied.
I cannot begin to count – or remember – the number of times I have been asked for personal information with the assurance that the results will “only be used in the aggregate” or that “all personally identifying information will be removed.” Netflix and Amazon encourage me to rate movies and books so that they can provide me with “better recommendations” on later visits. Search engines cannily leave traces of my searches behind: I was beset with offers of black skirts on Yahoo for weeks after I spent a while searching various online stores for one.
Paul Ohm’s article, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, shattered whatever belief I had in those assurances. Early in the article he makes a startling claim: “Data can either be useful or perfectly anonymous but never both.” If that’s the case, any data that is useful, including everything behind those personalized searching and online shopping experiences, is not really anonymous.
Posted on the Social Science Research Network, his lengthy article describes the emerging science of “re-identification”: the practice of combing and combining previously scrubbed data so that the identities of individuals can re-emerge. His hypothetical examples give the reader an idea of how it is done, but his real-life examples reveal the potential dangers.
- AOL released a group of searches, assigning user numbers in place of AOL user names, with the intention of helping researchers build better search engines. Bloggers pored over the data, finding search-term combinations that told a story. Those combinations were specific enough to allow people to knock on the front door of a 62-year-old woman who was selling her home.
- For the Netflix Prize, Netflix released roughly 100 million movie ratings by its users so that researchers could build better algorithms for making recommendations. Other researchers were able to link these ratings to ratings on a public movie site that does include names and, for some users, to re-identify all of their Netflix ratings, even the ones they had not chosen to post with their names.
- Massachusetts released anonymized hospital records for research purposes, assuring the public that all personally identifying information had been removed. Sex, birth date, and zip code were left in, with the presumption that these would not identify an individual out of the thousands of records. By paying $20 for the complete voter registration list of Cambridge, MA (public information), a graduate student was able to find then-Governor Weld’s hospital record and mail it to his office. The data contained only one male with his birth date living in his zip code.
Ohm highlights two mistaken assumptions that make re-identification possible and, often, easy.
The first is ignorance of the “pockets of surprising uniqueness” in even massive data sets – as Governor Weld’s hospitalization data make clear. While the obvious personal identifiers are removed, combinations of the useful variables that remain will be unique to particular individuals.
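To get a feel for how such pockets of uniqueness can be measured, here is a minimal sketch that counts how many records in a scrubbed data set are the only ones with their particular combination of sex, birth date, and zip code. The records and the function name `unique_combinations` are invented for illustration; nothing here comes from Ohm's article or from any real data set.

```python
from collections import Counter

# Toy, invented records: (sex, birth_date, zip_code, diagnosis).
# These values are made up for illustration only.
records = [
    ("M", "1950-01-15", "02138", "flu"),
    ("F", "1950-01-15", "02138", "asthma"),
    ("M", "1962-07-04", "02139", "fracture"),
    ("M", "1962-07-04", "02139", "diabetes"),
    ("F", "1975-03-22", "02139", "migraine"),
]


def unique_combinations(rows):
    """Return the (sex, birth_date, zip_code) triples that occur exactly once."""
    counts = Counter((sex, dob, zip_code) for sex, dob, zip_code, _ in rows)
    return sorted(key for key, n in counts.items() if n == 1)


singletons = unique_combinations(records)
print(f"{len(singletons)} of {len(records)} records are unique "
      "on sex, birth date, and zip code")
```

Any record that appears in `singletons` has no crowd to hide in: anyone who knows those three facts about a person knows which row is theirs.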
The second is the assumption that the data will be used alone. The Netflix example showed that linkage to a single easily available related source of data made re-identification easy, legal, and free.
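The linkage step itself can be sketched in a few lines. This is a simplified illustration that assumes exact matches on sex, birth date, and zip code, in the spirit of the hospital-record example; it is not the method the Netflix researchers used, which had to cope with approximate matches. All names and values below (`scrubbed`, `public`, `link`, "Person A") are hypothetical.

```python
# Toy, invented data. "scrubbed" stands in for a release with names removed;
# "public" stands in for an outside list that carries names.
scrubbed = [
    {"sex": "M", "birth_date": "1950-01-15", "zip": "02138", "diagnosis": "flu"},
    {"sex": "M", "birth_date": "1962-07-04", "zip": "02139", "diagnosis": "fracture"},
    {"sex": "M", "birth_date": "1962-07-04", "zip": "02139", "diagnosis": "diabetes"},
]
public = [
    {"name": "Person A", "sex": "M", "birth_date": "1950-01-15", "zip": "02138"},
    {"name": "Person B", "sex": "M", "birth_date": "1962-07-04", "zip": "02139"},
]

KEYS = ("sex", "birth_date", "zip")


def link(scrubbed_rows, public_rows, keys=KEYS):
    """Pair each named person with the scrubbed row sharing the same key values.

    Only unambiguous matches (exactly one scrubbed row) are returned.
    """
    # Group the scrubbed rows by their combination of key values.
    index = {}
    for row in scrubbed_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    # Look up each named person and keep only one-to-one matches.
    matches = {}
    for person in public_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:
            matches[person["name"]] = candidates[0]
    return matches


reidentified = link(scrubbed, public)
for name, row in reidentified.items():
    print(name, "->", row["diagnosis"])
```

In this toy case "Person A" is re-identified because the key combination is unique in the scrubbed data, while "Person B" stays ambiguous because two scrubbed rows share the same values.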
Ohm’s article focuses on legal and technological approaches to dealing with re-identification, ultimately concluding that neither approach is likely to work. The solutions, such as they are, rest on two things: the development of different standards for releasing information and, the part that we can do something about, the decision of consumers to turn away from the myriad “personalized” services which, over time, create a digital fingerprint that can be tied to their legal identity for all to read.