I'm a big fan of the social bookmarking site del.icio.us. It contains a lot of really useful information on what's popular online. Users only tag webpages that they find interesting, so the fact that a page is bookmarked on del.icio.us is a pretty good indication of its quality, and the more individuals that have tagged the same page, the more likely it is to be interesting.
Yahoo bought del.icio.us a while back, and finally (according to this post on TechCrunch) are starting to incorporate this wealth of information into their own search results. Currently they are adding a line to each result page summary detailing the number of users that have bookmarked the page on del.icio.us. It is hard to tell whether they are also using the data internally to improve search results, i.e. by pushing highly tagged pages further up the list.
A recent and excellent paper by Paul Heymann, Georgia Koutrika and Hector Garcia-Molina called "Can Social Bookmarking Improve Web Search?" suggests that social bookmarking data may indeed (albeit in certain limited cases) be useful for improving search results.
They find that URLs in del.icio.us are disproportionately common in search results, given the relative size of del.icio.us compared to Yahoo's search engine index (which they estimate to be at least two orders of magnitude larger): Looking at the results for some 30000 popular queries, they found that 19% of the top 10 results and 9% of the top 100 results were present in del.icio.us. These statistics are hardly surprising given that users must first find a web page in order to add it to del.icio.us - and they do that using a search engine!
Far more interesting to me is the question as to whether the users are adding useful information by bookmarking a page once they've found it. The search engines record which search results users click on and can already use this information to improve their algorithms. Presumably, however, a user will often click on lots of different results and save only the very best (most relevant) ones to del.icio.us. Thus the fact that a page appears in del.icio.us should carry more meaning than a user clicking on a particular search engine result and therefore may be useful for reordering search results.
The authors don't deal directly with the question of whether a page's bookmark count is on its own a good indication of quality. They do however study the freshness of URLs on del.icio.us and conclude that the stream of URLs being added to del.icio.us are at least as new as search engine results. The authors also estimate that 12.5% of URLs posted to del.icio.us were not present in Yahoo's index at the time they were added, implying that del.icio.us could serve as a small source of highly relevant content for seeding Web crawls.
(Thanks to Greg Lindon for pointing out the paper on his blog.)
Sunday, January 20, 2008
Friday, January 4, 2008
Staying anonymous online
In an online world is it possible to keep private data private?
I've been working for a while using data that is publicly available online. It has struck me recently how much information about themselves people are willing to put online - me included. I have a list of public bookmarks, photos, a web page, a blog ....
I'm posting about this because I read an interesting article by Arvind Narayanan and Vitaly Shmatikov on discovering the identity of users in anonymized datasets. The article is called "Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)" .
Before continuing, I should say that I think it is great what Netflix is doing in releasing anonymized data and creating a prize for the best movie recommendation algorithm. It can be very difficult to find good data for doing research in information retrieval (especially personalization) if you work for a University and not for Google/Yahoo/Amazon/etc. Moreover, I think competitions are a really good approach for both motivating research and for advancing the state of the art by allowing truly objective comparisons between different approaches.
In the paper the authors show that if you have some public and (crucially) non-anonymous data (i.e. information that can be traced to known individuals) and that data contains some overlap with some public but anonymous data from another source then the natural sparsity in the data can be used to identify corresponding users in the anonymous data. Data integration researchers will note the similarity to the record linkage problem.
The example they use is to take movie ratings on the Internet Movie Database (IMDb), which the authors argue is not really anonymous (users often login with their real names), and use it to find corresponding users in the anonymized Netflix dataset. This technique for "de-anonymizing" data relies on the existence of a non-anonymous data source that overlaps with the anonymous source over sparse attributes. In this case, the sparsity in the data results from the fact that it is extremely unlikely that two users watch the same movies in the same order as one another. Thus the order is sufficient to de-annonymise data even in the case where the data is noisy or has been deliberately blurred in some way.
The fact that the two researchers manage to discover the "identity" (IMDb login) of some users on Netflix is of itself somewhat worrying. When a user rates a movie on IMDb they are aware that they are making information public, while on Netflix they may rightfully assume their individual ratings are private (and that only aggregate data is public). Presumably, one could imagine a user rating only "politically correct" movies on their public IMDb profile and films relating to their "personal fetishes" on the private Netflix one, (the latter being useful for receiving personalized film recommendations).
The lesson is, however, more general than the Netflix dataset. Using similar techniques, it should also be possible to link people's profiles on many different sites (not just movie ratings). For example, we could link somebody's del.icio.us bookmarks, to their ratings on amazon.com, to their photos on flickr (and their location through geotagging), to their emails in newsgroups, to their profile on LinkedIn and perhaps even their page on facebook. All this data is public and together it would give a somewhat detailed description of the individual.
None of this is really very alarming per se. We are out in public all the time - our actions, appearance, etc. can be observed when we walk down the street. In some sense, our (public) activities online shouldn't be any different. One should just take the lack of anonymity into account when posting information to del.icio.us/flickr/IMDb/etc. - As Narayanan and Shmatikov point out, it is possible to guess at personal details like political leanings, age and sexual orientation, based solely on the seemingly innocuous information contained in movie ratings.
I've been working for a while using data that is publicly available online. It has struck me recently how much information about themselves people are willing to put online - me included. I have a list of public bookmarks, photos, a web page, a blog ....
I'm posting about this because I read an interesting article by Arvind Narayanan and Vitaly Shmatikov on discovering the identity of users in anonymized datasets. The article is called "Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)" .
Before continuing, I should say that I think it is great what Netflix is doing in releasing anonymized data and creating a prize for the best movie recommendation algorithm. It can be very difficult to find good data for doing research in information retrieval (especially personalization) if you work for a University and not for Google/Yahoo/Amazon/etc. Moreover, I think competitions are a really good approach for both motivating research and for advancing the state of the art by allowing truly objective comparisons between different approaches.
In the paper the authors show that if you have some public and (crucially) non-anonymous data (i.e. information that can be traced to known individuals) and that data contains some overlap with some public but anonymous data from another source then the natural sparsity in the data can be used to identify corresponding users in the anonymous data. Data integration researchers will note the similarity to the record linkage problem.
The example they use is to take movie ratings on the Internet Movie Database (IMDb), which the authors argue is not really anonymous (users often login with their real names), and use it to find corresponding users in the anonymized Netflix dataset. This technique for "de-anonymizing" data relies on the existence of a non-anonymous data source that overlaps with the anonymous source over sparse attributes. In this case, the sparsity in the data results from the fact that it is extremely unlikely that two users watch the same movies in the same order as one another. Thus the order is sufficient to de-annonymise data even in the case where the data is noisy or has been deliberately blurred in some way.
The fact that the two researchers manage to discover the "identity" (IMDb login) of some users on Netflix is of itself somewhat worrying. When a user rates a movie on IMDb they are aware that they are making information public, while on Netflix they may rightfully assume their individual ratings are private (and that only aggregate data is public). Presumably, one could imagine a user rating only "politically correct" movies on their public IMDb profile and films relating to their "personal fetishes" on the private Netflix one, (the latter being useful for receiving personalized film recommendations).
The lesson is, however, more general than the Netflix dataset. Using similar techniques, it should also be possible to link people's profiles on many different sites (not just movie ratings). For example, we could link somebody's del.icio.us bookmarks, to their ratings on amazon.com, to their photos on flickr (and their location through geotagging), to their emails in newsgroups, to their profile on LinkedIn and perhaps even their page on facebook. All this data is public and together it would give a somewhat detailed description of the individual.
None of this is really very alarming per se. We are out in public all the time - our actions, appearance, etc. can be observed when we walk down the street. In some sense, our (public) activities online shouldn't be any different. One should just take the lack of anonymity into account when posting information to del.icio.us/flickr/IMDb/etc. - As Narayanan and Shmatikov point out, it is possible to guess at personal details like political leanings, age and sexual orientation, based solely on the seemingly innocuous information contained in movie ratings.
Subscribe to:
Posts (Atom)