Friday, January 4, 2008

Staying anonymous online

In an online world, is it possible to keep private data private?

For a while now I've been working with data that is publicly available online. It has struck me recently how much information about themselves people are willing to put online, me included. I have a list of public bookmarks, photos, a web page, a blog, and so on.

I'm posting about this because I read an interesting article by Arvind Narayanan and Vitaly Shmatikov on discovering the identity of users in anonymized datasets. The article is called "Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)".

Before continuing, I should say that I think it is great what Netflix is doing in releasing anonymized data and creating a prize for the best movie recommendation algorithm. It can be very difficult to find good data for research in information retrieval (especially personalization) if you work at a university and not at Google/Yahoo/Amazon/etc. Moreover, I think competitions are a really good way both to motivate research and to advance the state of the art, because they allow truly objective comparisons between different approaches.

In the paper, the authors show that if you have some public and (crucially) non-anonymous data (i.e. information that can be traced to known individuals), and that data overlaps with public but anonymous data from another source, then the natural sparsity of the data can be used to identify the corresponding users in the anonymous data. Data integration researchers will note the similarity to the record linkage problem.

The example they use is to take movie ratings on the Internet Movie Database (IMDb), which the authors argue is not really anonymous (users often log in with their real names), and use them to find the corresponding users in the anonymized Netflix dataset. This technique for "de-anonymizing" data relies on the existence of a non-anonymous data source that overlaps with the anonymous source over sparse attributes. In this case, the sparsity results from the fact that it is extremely unlikely that two users watch the same movies in the same order as one another. Thus the order is sufficient to de-anonymize the data, even when the data is noisy or has been deliberately blurred in some way.
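To make the linkage idea concrete, here is a minimal Python sketch. It is not the algorithm from the paper. It is a toy scoring scheme under assumptions of my own: the movie titles, user ids, the `link_profile` function, the `tolerance` parameter and the rarity weighting are all hypothetical, and it matches on ratings alone rather than on viewing order. Each anonymous record is scored by how many of the public profile's movies it shares with a similar rating, and rarely rated movies count for more than popular ones.

```python
import math
from collections import Counter


def link_profile(public_ratings, anonymous_records, tolerance=1):
    """Score every anonymous record against one public profile.

    Returns (best_id, best_score, runner_up_score). All names and the
    scoring rule are illustrative assumptions, not the paper's method.
    """
    # Count how many anonymous records contain each movie.
    # Rarely rated movies are more identifying than popular ones.
    popularity = Counter(
        movie for record in anonymous_records.values() for movie in record
    )

    scores = {}
    for record_id, record in anonymous_records.items():
        score = 0.0
        for movie, rating in public_ratings.items():
            # Allow ratings to differ slightly, to tolerate noise or blurring.
            if movie in record and abs(record[movie] - rating) <= tolerance:
                score += 1.0 / math.log(1 + popularity[movie])
        scores[record_id] = score

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_id, best_score, runner_up_score


# Toy "anonymized" dataset: opaque ids mapped to {movie: rating}.
anonymous = {
    "user_17": {"Popular Hit": 4, "Obscure Documentary": 5, "Cult Horror": 2},
    "user_42": {"Popular Hit": 5, "Romantic Comedy": 3},
    "user_99": {"Popular Hit": 4, "Romantic Comedy": 4, "Action Sequel": 3},
}

# Toy public profile, e.g. ratings posted under a real name elsewhere.
public = {"Popular Hit": 5, "Obscure Documentary": 5, "Cult Horror": 3}

best_id, best_score, runner_up_score = link_profile(public, anonymous)
print(best_id, round(best_score, 2), round(runner_up_score, 2))
```

In this toy data the public profile links to `user_17`, even though none of the three ratings match exactly for the popular film and one of the rare ones: the two rarely rated movies dominate the score. The gap between the best and second-best score gives a rough sense of how confident the match is. A small gap would suggest there is not enough sparse overlap to single anyone out.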

The fact that the two researchers managed to discover the "identity" (IMDb login) of some users on Netflix is in itself somewhat worrying. When a user rates a movie on IMDb they are aware that they are making information public, while on Netflix they may reasonably assume their individual ratings are private (and that only aggregate data is public). One could imagine a user rating only "politically correct" movies on their public IMDb profile and films relating to their "personal fetishes" on the private Netflix one (the latter being useful for receiving personalized film recommendations).

The lesson is, however, more general than the Netflix dataset. Using similar techniques, it should also be possible to link people's profiles on many different sites (not just movie ratings). For example, we could link somebody's del.icio.us bookmarks to their ratings on Amazon.com, to their photos on Flickr (and their location through geotagging), to their posts in newsgroups, to their profile on LinkedIn and perhaps even to their page on Facebook. All this data is public, and together it would give a fairly detailed description of the individual.

None of this is really very alarming per se. We are out in public all the time: our actions, appearance, etc. can be observed when we walk down the street. In some sense, our (public) activities online shouldn't be any different. One should just take the lack of anonymity into account when posting information to del.icio.us/Flickr/IMDb/etc. As Narayanan and Shmatikov point out, it is possible to guess at personal details like political leanings, age and sexual orientation based solely on the seemingly innocuous information contained in movie ratings.
