Thursday, July 10, 2008

Search rankings based on user navigation

Mikhail Bilenko and Ryen White at Microsoft Research performed an interesting study on the utility of website navigation data for improving search results. They first recorded users' post-search browsing activity using (opt-in) functionality of the Microsoft toolbar plugin. They then analyzed the navigation data, turning it into features that they provided to a "learning to rank" system. The resulting retrieval function was able to significantly improve search results on their testbed. The paper is called "Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites From User Activity".

Their basic idea was to record the trace of websites visited by a user from when he/she first issues a query to a search engine up until he/she stops navigating (by closing the browser window, remaining idle on a page for a period, or issuing a new query). The presence of websites in the trace is assumed to provide some indication as to the relevance of those pages to the initial query. The intuition here is that users issue queries to search engines to find "jumping off" pages, from where they begin their own navigation in order to find the truly relevant document for their current "information need".
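
To make the trail idea concrete, here is a minimal sketch (certainly not the paper's actual implementation) of how one might segment a toolbar-style browsing log into post-search trails. The event/trail structures and the 30-minute idle cutoff are illustrative assumptions on my part:

```python
# A minimal sketch of segmenting a toolbar-style browsing log into
# post-search trails. The LogEvent/Trail structures and the idle cutoff
# are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass, field
from typing import List, Optional

IDLE_TIMEOUT = 30 * 60  # assume a 30-minute idle period ends a trail


@dataclass
class LogEvent:
    timestamp: float             # seconds since epoch
    url: str
    query: Optional[str] = None  # set when the event is a search-engine query


@dataclass
class Trail:
    query: str
    pages: List[str] = field(default_factory=list)


def extract_trails(events: List[LogEvent]) -> List[Trail]:
    """Group page visits into trails, one per query, ending on idle or a new query."""
    trails: List[Trail] = []
    current: Optional[Trail] = None
    last_time: Optional[float] = None
    for ev in events:
        idle = last_time is not None and ev.timestamp - last_time > IDLE_TIMEOUT
        if ev.query is not None or idle:
            if current is not None:
                trails.append(current)      # close the previous trail
            # a new query starts a new trail; an idle timeout just ends the old one
            current = Trail(query=ev.query) if ev.query is not None else None
        elif current is not None:
            current.pages.append(ev.url)    # ordinary navigation within a trail
        last_time = ev.timestamp
    if current is not None:
        trails.append(current)
    return trails
```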

To deal with the sparsity of query data (most queries to a search engine are seen infrequently, so little click/navigation data is available for each), they proposed a very nice random walk idea that generalizes the click data across different query terms. Basically, a page that is navigated to after a user submits a query containing term0 (let's call it page0) may also be visited by another user after submitting a different term, say term1. If that latter term also resulted in clicks on other pages, page1, page2, ..., then we might consider these pages to be relevant (to a lesser extent) to the original query term term0. The random walk quantifies this intuition by deciding how heavily these additional pages are weighted for term0. As an aside, performing random walks over click-through data seems to be a popular idea in Web Mining research at the moment - see this post by Greg Linden.
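
Here is a rough sketch of that generalization as a short alternating random walk on the term-page click graph. The uniform transition probabilities and the two-hop walk length are my own illustrative choices, not the weighting scheme used in the paper:

```python
# A rough sketch of generalizing click data across query terms via a short
# alternating random walk on the term-page click graph. The uniform
# transitions and two-hop default are illustrative, not the paper's weighting.

from collections import defaultdict


def walk_scores(clicks, start_term, hops=2):
    """clicks: list of (term, page) pairs observed in search trails.
    Returns page scores for start_term after `hops` term->page transitions."""
    term_to_pages = defaultdict(list)
    page_to_terms = defaultdict(list)
    for term, page in clicks:
        term_to_pages[term].append(page)
        page_to_terms[page].append(term)

    # all probability mass starts on the original query term
    term_mass = {start_term: 1.0}
    page_scores = defaultdict(float)
    for _ in range(hops):
        # term -> page: spread each term's mass uniformly over its clicked pages
        page_mass = defaultdict(float)
        for term, mass in term_mass.items():
            pages = term_to_pages[term]
            for page in pages:
                page_mass[page] += mass / len(pages)
        for page, mass in page_mass.items():
            page_scores[page] += mass
        # page -> term: pages pass their mass back to the terms that led to them
        term_mass = defaultdict(float)
        for page, mass in page_mass.items():
            terms = page_to_terms[page]
            for term in terms:
                term_mass[term] += mass / len(terms)
    return dict(page_scores)
```

On the toy click log [("term0", "page0"), ("term1", "page0"), ("term1", "page1")], page0 ends up with a much higher score than page1 for term0, which matches the intuition described above.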

In general, I like papers that go beyond analyzing the static Web graph (of hyperlinked webpages) to investigate the dynamic Web graph (of hyperlinks weighted by traffic volume) in order to improve search results. I recently read a somewhat related article on using the dynamic Web graph to calculate a better query-independent document prior (i.e. a better version of Google's PageRank). The article is called "Ranking Websites with Real User Traffic" by Meiss, Menczer, Fortunato, Flammini and Vespignani. The authors measure all of the navigation activity of (suitably anonymized) students and employees at Indiana University. (See this other post by Greg Linden.)

The PageRank algorithm takes as input the static graph and uses some rather strong assumptions (that a random surfer will choose among a page's outgoing links with uniform probability) to try to predict the dynamic graph (the amount of user traffic arriving at each webpage). The article looks at surfing data for all Internet users at Indiana University over a period of seven months and finds that the random surfer model of PageRank does a relatively poor job of predicting which pages attract the most traffic.
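
For reference, here is a toy power-iteration sketch of the random surfer computation itself (illustrative graph format and damping factor; not the authors' code). The comparison in the article then amounts to checking how well these scores correlate with the observed share of real traffic arriving at each page:

```python
# A toy power-iteration PageRank, just to make the random surfer assumption
# explicit: from each page the surfer follows an outgoing link uniformly at
# random, or teleports with probability 1 - damping. The sketch assumes every
# page appears as a key of out_links; graph and damping value are illustrative.

def pagerank(out_links, damping=0.85, iterations=50):
    """out_links: dict mapping each page to the list of pages it links to."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, links in out_links.items():
            if links:
                # the key assumption: uniform choice among outgoing links
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # dangling page: spread its mass evenly over all pages
                share = damping * rank[page] / len(pages)
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```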

Of course, user navigation data can be very useful for a number of purposes apart from improving search results, such as e-commerce site optimization and, more recently, spam webpage detection - see, for example, this recent paper on the subject: "Identifying Web Spam with User Behavior Analysis" by Yiqun Liu, Rongwei Cen, Min Zhang, Liyun Ru and Shaoping Ma.

Wednesday, July 2, 2008

Reviewer ratings, best reviewer prize, anyone?

I've been doing some more reviewing lately and have started to think about the problem of motivating reviewers to write good reviews.

The fact that the other reviewers assigned to the same conference paper as you (from whom your identity is usually not hidden) can read your reviews after they've submitted their own is a good motivation, in my opinion - nobody in research wants to be seen as lazy or be caught out by their peers writing something that shows they have misunderstood the paper. Furthermore, the fact that the senior program committee (SPC) member (the "meta-reviewer") who is responsible for the paper reads all the reviews means that those writing poor quality reviews probably won't be invited back next year to write reviews. So I guess putting the time and effort into reading papers thoroughly and writing good reviews does pay off in the long run.

Having said that, I was thinking that there may be other "more direct" ways to reward quality reviewing. One simple idea I had (which has surely been thought of before) would be to have a prize for the best reviewer - just as there is a prize for the best paper at the conference. The system would work the same way: when the other reviewers or the authors of the paper read a review that they find particularly compelling, they could nominate the reviewer for the "best reviewer" award. The tricky thing in this case would be deciding which of the nominated reviewers should be given the award, since even SPC members may be authors of one of the reviewed papers and hence should not become aware of the identity of the reviewer. I'm sure there are simple ways around this problem, though - e.g. automatically nominating an SPC member who the system finds to be free of such conflicts.

A second, and slightly more interesting, idea might be to give the authors (and maybe also the other reviewers) the chance to rate the quality of the reviews they receive. The aggregate scores across all the reviews submitted by a particular reviewer could then be used to directly rank the reviewers. Such a ranked list of reviewers - while containing a lot of noise (i.e. variance in the ranking) - would at least give the SPC some indication of who's writing quality reviews (without any conflict of interest problems). The list could be used to decide who to invite back to review next year, and who to award the "best reviewer prize" to.

Video on document topic models

A friend of mine, Mark Baillie, pointed me to an excellent talk by David Blei to the folks at Google on the use of Latent Dirichlet Allocation, LDA (a more rigorous statistical version of Latent Semantic Analysis), to discover topics/themes in the journal Science from 1880 to the present.

David is a really good speaker and manages to make what is essentially quite a complicated piece of work sound simple. He discusses the fact that the LDA model is modular, which means that it can be embedded in more complicated models. He then creates a generative model that takes time into account, so that the terms used to describe a topic (say "genetics" or "data analysis") can change over the years as word use evolves. For example, the term "relativity" starts appearing frequently in the topic "theoretical physics" after Einstein introduces his famous theories in the early 1900s.
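
For anyone who wants to play with the plain (non-dynamic) version of the model, here is a minimal sketch using gensim; the toy corpus and the topic count are obviously just illustrative:

```python
# A minimal sketch of fitting plain LDA (not the dynamic, time-aware variant
# from the talk) with gensim. The tiny corpus and topic count are illustrative.

from gensim import corpora, models

docs = [
    "gene expression dna sequencing".split(),
    "dna genome sequencing analysis".split(),
    "relativity spacetime gravity physics".split(),
    "quantum physics particle theory".split(),
]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

# the number of topics must be fixed up front, as discussed below
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=20)

for topic_id, top_words in lda.print_topics():
    print(topic_id, top_words)
```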

Personally, I'm not convinced (possibly due to my limited understanding of the area) that such complicated statistical models are much more useful than simple non-exclusive (fuzzy) clustering of documents, especially given the considerable computation required to estimate the models and all of the approximations involved to make the estimation problem tractable. Also, the fact that the generative model needs to be told a priori how many topics to generate means that it has the same free parameter as fuzzy clustering, where the number of clusters must be chosen in advance. (David does mention some ongoing work in his talk on choosing the topic count automatically.)

I guess the real benefit of the generative model is the fact that the probabilities produced by the model have a real/intuitive interpretation. This allows you to calculate the likelihood of a previously unseen document and use that likelihood in a more complicated/embedding model. Moreover, the estimated models are smoothed and thus the simplest possible in the Bayesian sense. This means (I think) that the generative model will do a better job of estimating the topics than a clustering algorithm will in the case where the data is particularly sparse (i.e. there aren't enough documents to generate useful clusters).
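
Continuing the gensim sketch from above, scoring a previously unseen document looks something like this. Note that gensim reports a variational lower bound (per word) rather than the exact likelihood, and the example document is made up:

```python
# Continuing the gensim sketch above: scoring a previously unseen document.
# gensim gives a variational lower bound (reported per word), so this is an
# approximation of the true likelihood; the example document is made up.

unseen = "gravity theory physics".split()
unseen_bow = dictionary.doc2bow(unseen)

# inferred topic mixture for the new document
print(lda.get_document_topics(unseen_bow))

# per-word lower bound on the log likelihood of a held-out chunk of documents
print(lda.log_perplexity([unseen_bow]))
```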

On a highly related topic, Hal Daume has a very nice post on how one can/should evaluate hidden topic models of text corpora. The post sparked an interesting discussion in the comments.