Thursday, July 10, 2008

Search rankings based on user navigation

Mikhail Bilenko and Ryen White at Microsoft Research performed an interesting study on the utility of website navigation data for improving search results. They first recorded users' post-search browsing activity using (opt-in) functionality of the Microsoft toolbar plugin. They then analyzed the navigation data, turning it into features that they provided to a "learning to rank" system. The resulting retrieval function was able to significantly improve search results on their testbed. The paper is called "Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites From User Activity".

Their basic idea was to record the trace of websites visited by a user from when he/she first issues a query to a search engine up until he/she stops navigating (by closing the browser window, remaining idle on a page for a period, or issuing a new query). The presence of websites in the trace is assumed to provide some indication as to the relevance of those pages to the initial query. The intuition here is that users issue queries to search engines to find "jumping off" pages, from where they begin their own navigation in order to find the truly relevant document for their current "information need".

In order to deal with the sparsity of query data (most queries to a search engine are seen infrequently, so little click/navigation data is available for each), they proposed a very nice random walk idea that generalizes the click data across different query terms. Basically, a page that is navigated to after a user submits a query containing term0 (let's call it page0) may also be visited by other users after they submit a different term, say term1. If term1 also resulted in clicks on other pages, page1, page2, ..., then we might consider those pages to be relevant (to a lesser extent) to the original term term0. The random walk quantifies this intuition by deciding how heavily these additional pages should be weighted for term0. As an aside, performing random walks over click-through data seems to be a popular idea in Web Mining research at the moment - see this post by Greg Linden.
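
To make the random walk idea a bit more concrete, here is a minimal sketch (mine, not the authors' actual algorithm) of a walk over a toy bipartite term-page click graph. The click counts, walk length and number of walks are all made up for illustration.

```python
import random
from collections import defaultdict

# Toy click graph: query term -> {page: number of clicks}.
# These counts are invented purely for illustration.
clicks = {
    "term0": {"page0": 10},
    "term1": {"page0": 4, "page1": 6, "page2": 2},
}

# Reverse mapping: page -> {query term: number of clicks}.
pages = defaultdict(dict)
for term, page_counts in clicks.items():
    for page, count in page_counts.items():
        pages[page][term] = count

def pick(weights):
    """Choose a neighbour with probability proportional to its click count."""
    total = sum(weights.values())
    r = random.uniform(0, total)
    for node, w in weights.items():
        r -= w
        if r <= 0:
            return node
    return node

def walk_scores(start_term, walks=10000, steps=3):
    """Estimate page relevance to start_term with random walks that
    alternate term -> page -> term -> page ..."""
    visits = defaultdict(int)
    for _ in range(walks):
        term = start_term
        for _ in range(steps):
            page = pick(clicks[term])   # term -> page
            visits[page] += 1
            term = pick(pages[page])    # page -> term
    total = sum(visits.values())
    return {p: v / total for p, v in visits.items()}

print(walk_scores("term0"))
# page0 dominates, but page1 and page2 pick up some weight via term1.
```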

In general, I like papers that go beyond analyzing the static Web graph (of hyperlinked webpages) to investigate the dynamic Web graph (of hyperlinks weighted by traffic volume) in order to improve search results. I read a somewhat related article recently on using the dynamic Web graph to calculate a better query-independent document prior (i.e. a better version of Google's PageRank). The article is called "Ranking Websites with Real User Traffic" by Meiss, Menczer, Fortunato, Flammini and Vespignani. The authors measure all of the navigation activity by (suitably anonymized) students and employees at Indiana University. (See this other post by Greg Linden.)

The PageRank algorithm takes as input the static graph and uses some rather strong assumptions (that a random surfer will choose among a page's outgoing links with uniform probability) to try to predict the dynamic graph (the amount of user traffic arriving at each webpage). The article looks at surfing data for all Internet users at Indiana University over a period of seven months and finds that the random surfer model of PageRank does a relatively poor job of predicting which pages attract the most traffic.
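
As a reminder of what the uniform random surfer assumption actually computes, here is a minimal power-iteration sketch of PageRank on a tiny, made-up link graph (the damping factor and iteration count are the usual textbook defaults, nothing specific to the paper).

```python
# Tiny, made-up link graph: node -> list of outgoing links.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iters=50):
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        new = {node: (1 - damping) / n for node in nodes}
        for node, outlinks in links.items():
            # The key assumption: the surfer follows each outgoing link
            # with equal probability.
            share = damping * rank[node] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

print(pagerank(links))   # "C" ends up with the most (predicted) traffic
```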

Of course, user navigation data can be very useful for a number of different purposes apart from improving search results, such as e-commerce site optimization and more recently, spam webpage detection. - See for example this recent paper on the subject: "Identifying Web Spam with User Behavior Analysis" by Yiqun Liu, Rongwei Cen, Min Zhang, Liyun Ru and Shaoping Ma.

Wednesday, July 2, 2008

Reviewer ratings, best reviewer prize, anyone?

I've been doing some more reviewing lately and have started to think about the problem of motivating reviewers to write good reviews.

The fact that the other reviewers assigned to the same conference paper as you (from whom your identity is usually not hidden) can read your reviews after they've submitted their own is a good motivation, in my opinion. - Nobody in research wants to be seen as lazy or be caught out by their peers writing something that shows they have misunderstood the paper. Furthermore, the fact that the senior program committee (SPC) member (the "meta-reviewer") responsible for the paper reads all the reviews means that those writing poor-quality reviews probably won't be invited back next year to write reviews. So I guess putting the time and effort into reading papers thoroughly and writing good reviews does pay off in the long run.

Having said that, I was thinking that there may be other, more direct ways to reward quality reviewing. One simple idea (which has surely been thought of before) would be to have a prize for the best reviewer - just as there is a prize for the best paper at the conference. The system would work the same way: when the other reviewers or the authors of a paper read a review that they find particularly compelling, they could nominate the reviewer for the "best reviewer" award. The tricky part would be deciding which of the nominated reviewers should receive the award - since even SPC members may be authors of one of the reviewed papers and hence should not become aware of the reviewer's identity. I'm sure there are simple ways around this problem, though - e.g. automatically nominating an SPC member whom the system finds to be free of such conflicts.

A second, and slightly more interesting idea might be to give the authors (and maybe also the other reviewers) the chance to rate the quality of the reviews they receive. The aggregate scores across all the reviews submitted by a particular reviewer could then be used to directly rank the reviewers. Such a ranked list of reviewers - while containing a lot of error (variance in the ranking) - would at least give the SPC some indication of who's writing quality reviews (without any conflict of interest problems). The list could be used to decide who to invite back to review next year, and who to award the "best reviewer prize" to.

Video on document topic models

A friend of mine, Mark Baillie, pointed me to an excellent talk by David Blei to the folks at Google on the use of Latent Dirichlet Allocation, LDA (a more rigorous statistical version of Latent Semantic Analysis), to discover topics/themes in the journal Science from 1880 to the present.

David is a really good speaker and manages to make what is essentially quite a complicated piece of work sound simple. He discusses the fact that the LDA model is modular, which means that it can be embedded in more complicated models. He then creates a generative model that takes time into account, so that the terms used to describe a topic (say "genetics" or "data analysis") can change over the years as word use evolves. For example, the term "relativity" starts appearing frequently in the topic "theoretical physics" after Einstein introduced his famous theories in the early 1900s.
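
For those who haven't seen LDA before, the "generative" part refers to a simple story of how documents are assumed to be produced. The toy sketch below shows the basic (static) LDA generative process, not Blei's dynamic extension; the vocabulary, topics and hyperparameters are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy setup: 2 topics over a 6-word vocabulary.
vocab = ["gene", "dna", "cell", "data", "model", "fit"]
# Each row is a topic's word distribution (these would normally be learned).
topics = np.array([
    [0.4, 0.3, 0.25, 0.02, 0.02, 0.01],    # a "genetics"-ish topic
    [0.01, 0.02, 0.02, 0.35, 0.35, 0.25],  # a "data analysis"-ish topic
])
alpha = [0.5, 0.5]  # Dirichlet hyperparameter over topics

def generate_document(n_words=15):
    # 1. Draw the document's topic proportions from the Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Draw a topic for this word position, then a word from that topic.
        z = rng.choice(len(topics), p=theta)
        w = rng.choice(len(vocab), p=topics[z])
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document()
print("topic proportions:", theta.round(2))
print("document:", " ".join(doc))
```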

Personally, I'm not convinced (possibly due to my limited understanding of the area) that such complicated statistical models are much more useful than simple non-exclusive (fuzzy) clustering of documents, especially given the considerable computation required to estimate the models and all of the approximations involved in making the estimation problem tractable. Also, the fact that the model needs to be told a priori how many topics to generate means that it shares the same key parameter as fuzzy clustering, where the number of clusters must be chosen in advance. (David does mention some ongoing work in his talk on choosing the topic count automatically.)

I guess the real benefit of the generative model is the fact that the probabilities produced by the model have a real/intuitive interpretation. This allows you to calculate the likelihood of a previously unseen document and use that likelihood in a more complicated/embedding model. Moreover, the estimated models are smoothed and thus the simplest possible in the Bayesian sense. This means (I think) that the generative model will do a better job of estimating the topics than a clustering algorithm will in the case where the data is particularly sparse (i.e. there aren't enough documents to generate useful clusters).

On a highly related topic, Hal Daume has a very nice post on how one can/should evaluate hidden topic models of text corpora. The post sparked an interesting discussion in the comments.

Thursday, June 19, 2008

Gamers provide heuristic support for complex problem solving

It has occurred to me in the past that it would be nice to harness the inherent abstraction and problem-solving skills of humans to better guide heuristics for solving complex problems (such as NP-complete ones). Indeed it would be really cool to let people play some sort of online computer game that acts as a metaphor for an underlying heuristic combinatorial search problem. I was never really sure, though, how one could map a combinatorial problem into a visual form that makes the game both engaging and challenging.

Well, according to this post on the Nature blog, the bioinformatics researchers at fold.it have done just that by creating a video game in which users help the computer perform the task of protein structure prediction. Interestingly, the developers didn't set out with the idea of building a game, but of distributing the workload of the search across the idle computing cycles of Internet users (as in the SETI@home project). It was only when those users told the researchers that they would like to help guide the program toward better solutions that they had the idea of turning it into a game.

In general, I like approaches to harnessing the power of so-called human computation as is done for example in the far less complicated Google Image Labeler.

Everything you ever wanted to know about graph mining for social networks

A colleague of mine, Mostafa Keikha, pointed me to a very nice survey paper on graph mining and, in particular, its applications to the Web and social networks. The paper is called "Graph Mining: Laws, Generators and Algorithms" by D. Chakrabarti and C. Faloutsos. It really is very well written and a good read for those interested in partitioning graphs into communities, generating synthetic graphs that look like real ones, analyzing different properties of social networks, etc.

Friday, April 18, 2008

Google starts indexing behind web forms

A lot of interesting content on the Web is hidden behind Web forms and search interfaces. These pages are usually generated dynamically by database queries built using the form's inputs. Such dynamic sites are sometimes referred to as the Deep Web. Dynamic sites present a problem for Web crawlers as there are often not enough "static links" from the outside Web into the dynamic content (and between the dynamic pages) in order for the crawler to discover all the content on a site. The end result is that the search engine indexes only part of a website's content.

I just saw an interesting post stating that Google has started to automatically "fill in" the inputs on some Web forms so as to better index content behind the forms. It seems that Google's crawler (affectionately termed the Googlebot) has been "taught" how to generate input for search boxes on some sites. Thus the crawler can discover pages that are not linked to from the "Surface Web".

In general it is a difficult problem to automatically generate reasonable and relevant input values for a previously unseen Web form. Some forms contain drop-down boxes (multiple-choice inputs), which simplifies the problem, as the set of possible options is enumerated explicitly in the form. In most cases, however, a search input is required involving an unrestricted free-text query. The difficulty with the latter is that for many valid inputs nothing will be returned by the site, or at best an error page will be "discovered". - Imagine sending the query "space-shuttle" to a website providing information on cancer research, for instance.

Generating relevant inputs so as to probe and characterize a dynamic website is a research problem that I am very much interested in and is central to the area of Distributed Information Retrieval. The best work I've seen recently on this was done by Panagiotis Ipeirotis and Luis Gravano in their article "Classification-Aware Hidden-Web Text Database Selection". In that work, while probing an online database with single-term queries, they try to assign the database to (progressively deeper) nodes in a topic hierarchy. They can then use a distribution of relevant terms for each node in the hierarchy to choose reasonable query terms for effective probing.
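
Roughly speaking, the probing idea looks something like the sketch below, which is a heavy simplification of their approach; the probe terms are invented and query_database() is a hypothetical stand-in for submitting a term to the site's search form.

```python
# Hypothetical probe terms per topic node; a real system would derive
# these from a classifier trained over a topic hierarchy.
probe_terms = {
    "health/cancer": ["tumor", "chemotherapy", "oncology"],
    "science/space": ["spacecraft", "orbit", "telescope"],
    "sports":        ["league", "championship", "goal"],
}

def query_database(term):
    """Hypothetical stand-in for submitting `term` to the site's search
    form and returning the number of matching results."""
    fake_index = {"tumor": 812, "chemotherapy": 344, "oncology": 190}
    return fake_index.get(term, 0)

def classify_database():
    """Assign the database to the topic whose probes match the most documents."""
    scores = {}
    for topic, terms in probe_terms.items():
        scores[topic] = sum(query_database(t) for t in terms)
    return max(scores, key=scores.get), scores

best, scores = classify_database()
print(best, scores)   # -> the cancer-research site lands under 'health/cancer'
```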

The problem is even more complicated for forms requiring multiple input fields. Consider for example a yellow-pages service that takes a company name (free text) and a partially filled address (street name, etc.) as input. Randomly generating plausible inputs for each field separately won't work in this case because the different fields must make sense together (i.e. a business with that name needs to exist on the given street). Thus managing to invoke the interface so as not to receive an empty/error page can be very difficult. - Just detecting the error page can be difficult on its own. This was an issue that I had to deal with (not very satisfactorily) in my thesis.

Monday, April 7, 2008

Women are better at judging personality

Psychology researchers have performed an interesting study on the ability of social network users to understand/judge different aspects of other users' personalities based on their social network profile/page. The study is called "What Elements of an Online Social Networking Profile Predict Target-Rater Agreement in Personality Impressions?" by David C. Evans, Samuel D. Gosling and Anthony Carroll. [Thanks to William Cohen for pointing me to it.]

The attributes of a user's personality that they investigated included being disciplined/casual, alternative/traditional, neurotic/unemotional, cooperative/competitive, and extroverted/introverted.

The study found that women are both better at assessing the personality of others and at describing their own personality (presumably by providing more meaningful answers to profile questions on their pages). In order to check the accuracy of the personality judgments, the authors compared each assessment to the user's own assessment of themselves. - I wonder whether the conjecture that males are not as good judges of their own personality might also explain the second result?

Other interesting findings are that a user's personality is best described by: i) a video they find funny, ii) a description of "what makes them glad to be alive", iii) the most embarrassing thing they ever did, or iv) the proudest thing they ever did. Things like their favorite (or least favorite) book, movie, etc., on the other hand, were found to be not very useful for assessing personality.

Sunday, April 6, 2008

Using mechanical turk for research

Panos Ipeirotis has been posting some very interesting information on his blog regarding a study of users on Amazon's Mechanical Turk (MT). For those not in the know, MT is basically a crowd-sourcing system, where "requesters" define small tasks that can be performed online (like adding text labels to a group of images) and offer a certain amount of money per task. "Workers" then choose which jobs they want to perform and receive small payments (in their US bank accounts) whenever they complete them. Most of the jobs are pretty menial, so the payments are quite low (a few cents per task) and workers are unlikely to get rich doing them. But the point is that all the tasks require humans to perform them, i.e. they can't be automated well by a computer program. The trick for the requesters is to design jobs in such a way that noisy results are removed automatically (e.g. by comparing results from different workers) - as the workers may be motivated more by time than quality when completing tasks.
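
As a simple example of "comparing results from different workers", redundant assignments can be aggregated by majority vote, with disagreements flagged for re-labeling. The labels below are invented; this is just a sketch of the general idea, not any particular requester's pipeline.

```python
from collections import Counter

# Invented example: each image was labeled by three different workers.
labels = {
    "img1": ["cat", "cat", "dog"],
    "img2": ["car", "car", "car"],
    "img3": ["dog", "cat", "bird"],   # no agreement -> flag for review
}

def aggregate(worker_labels, min_agreement=2):
    """Return the majority label, or None if agreement is too weak."""
    counts = Counter(worker_labels)
    label, votes = counts.most_common(1)[0]
    return label if votes >= min_agreement else None

for img, worker_labels in labels.items():
    print(img, aggregate(worker_labels))
```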

Anyway, Panos has started using MT to help him with his research. While contracting out the research itself doesn't make sense (of course!), there are a number of very interesting ways different research-related tasks can be done using the Mechanical Turk. The most obvious is to get MT users to label training data for experiments in Machine Learning. But MT can also be used to set up cheap user studies relatively easily, such as those required for assessing result quality and user satisfaction in Information Retrieval. I'm not exactly sure what the ethical issues are for the latter use, but it does sound like a very good idea to take advantage of all the (bored) Internet users out there. Here is a simple example of a user study, in which the authors ask MT users to name different colors that are presented to them [via Anonymous Prof].

Friday, April 4, 2008

The AOL dataset

I'm currently doing some research in personalization of Distributed Information Retrieval (DIR) and I need access to some personalized query logs to do some experiments. Since I don't work for a major search engine, I don't have access to their logs and am unlikely to be granted access anytime soon. There is of course the AOL query log dataset that was released last year for research use and then quickly retracted by the company due to privacy complaints. The dataset can easily be found online. - As anybody who's ever released sensitive data online knows, it's impossible to "remove" information from the Net once it's been released.

The AOL data is exactly what I need, and I wish to use it for its intended purpose: to support research in Information Retrieval. I don't intend to pry into peoples' private lives and of course am not interested in identifying users from their logs.

My question is whether or not I am allowed (from a practical perspective) to use the data. Some conferences, like SIGIR, seem a little worried about authors using it from what I hear, while others don't seem to mind. For example, I saw a paper at WSDM'08 that used the data as a source of common queries to a Web search engine.

And from a legal perspective, does the fact that AOL retracted the data have any implications? Obviously AOL owns the copyright to their own data. Do I need permission from AOL in order to use their copyrighted data? Or is it considered "fair use" if I make use of the data only for research and don't distribute the raw data, just my analysis of it?

Thursday, April 3, 2008

Conferences starting to provide videos of talks online

I realized recently that I'm watching many more research presentations online than I ever did in the past. It seems that where once you needed to locate and read the paper (and maybe download the slides if available), you can now often find a video online of the author presenting their work. While I'm not suggesting that one no longer needs to read research papers (as I think that is critical for understanding the details of previous work), I am suggesting that the video presentations are a good place to start to understand a new topic. Presentations are often much better at conveying the overall picture of what the research is about and how it fits in with the rest of the literature.

As far as video sites go, I'm a big fan of Google's TechTalks, which are particularly good for their research focus and quality. But often even better are the videos on videolectures.net, which show the slides in a separate window and allow you to jump directly to the part of the talk you are most interested in. This allows "speed-viewing" of a number of different presentations from the same conference/workshop/summer-school/etc.

Some examples of great content that I have found lately include lectures from the 2007 Machine Learning Summer School (particularly good is the lecture on Graphical Models), the 2007 Bootcamp in Machine Learning and the 2006 Autumn School on Machine Learning, (which includes seminars by two of my favorite machine learning researchers: William Cohen and Tom Mitchell).

I think in the very near future all of the top conferences will start recording and streaming the talks online. Some of the more progressive conferences are already doing it - see for example the presentations at WSDM'08. In the future, the fact that all the talks will be available online should mean that attending conferences will be even more about the networking and even less about attending talks than it is now!

Wednesday, March 19, 2008

Degrees of separation in social networks

It's a common belief that there are six degrees of separation between any two individuals on the planet. In other words, if you consider all the people you know, and then all the people they know, and continue for six iterations, you will cover (almost) everybody on the planet. With the advent of online social networks and the ability to mine them for statistics, it's interesting to see how well this "six degrees" rule really works in practice ....

Researchers at Microsoft and CMU recorded one month of chat conversations from Microsoft's Instant Messaging (IM) client and analyzed the data - not the content of the conversations, just metadata regarding the users (age, sex, location) and the conversations (length, time, number of messages). The resulting social network is a biggie, with 240 million users and 30 billion conversations. The paper is called "Planetary-Scale Views on an Instant-Messaging Network" by Jure Leskovec and Eric Horvitz. [Thanks to the physics arXiv blog for pointing me to the paper.]

Interestingly, they discover that the average minimum path length between connected users (99.9% of all users) was 6.6. That is to say, there were indeed about six degrees of separation between users over the one month of data collected. Obviously, since IM users are not a representative sample of the whole world population (not many grandparents use instant messaging), we would expect them to be more highly connected than the rest of the population. My guess is that this average path length would drop significantly if the time window were increased beyond just one month of data.

Other notable statistics from the paper are that almost all conversations (99%) involve only two people and that we converse more frequently, and for longer, with members of the opposite sex than we do with members of our own.

Another interesting study on social networks can be found in "Measurement and Analysis of Online Social Networks" by Alan Mislove, Massilmiliano Marcon, Krishna P. Gummadi, Peter Druschel and Bobby Bhattacharjee. [Thanks to Cyrus Hall for pointing me to the paper.] The interesting thing about this study is that they compare different types of social networks to see which properties they all share in common. The networks investigated are Flickr, YouTube, LiveJournal and Orkut, i.e. a photo-sharing, a video-sharing, a blogging and a friendship network.

The average minimum path length between users (in the main connected component) for the different networks was:
  • Flickr: 5.67 (1.8 million users, 26.9%)
  • LiveJournal: 5.88 (5.2 million users, 95.4%)
  • Orkut: 4.25 (3.0 million users, 11.3%)
  • YouTube: 5.10 (1.1 million users, coverage unknown)
The numbers in brackets are the number of users in the sample and the estimated coverage of the social network. The fact that links in Orkut are bidirectional ("friendship connections" are reciprocal) probably explains why the average path length is lower in that case.
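
As an aside, here is roughly how such an average path length can be estimated on a graph far too big for an all-pairs computation: run breadth-first search from a random sample of users and average the resulting distances. The toy friendship graph below is made up; this is just a sketch of the general approach, not the authors' code.

```python
import random
from collections import deque

# Toy undirected friendship graph (adjacency lists), invented for illustration.
graph = {
    1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 5],
    4: [2, 5], 5: [3, 4, 6], 6: [5],
}

def bfs_distances(source):
    """Shortest hop counts from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def avg_path_length(samples=3):
    # On a 240-million-node graph you can only afford BFS from a sample.
    total, pairs = 0, 0
    for source in random.sample(list(graph), samples):
        for target, d in bfs_distances(source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs

print(avg_path_length())
```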

The thing I find intriguing is that the path lengths for the other three networks are all pretty much within rounding error of one another. And even the values for Orkut and IM aren't that far off. - For comparison, the average path length between pages on the Web is 16 if the direction of hyperlinks is taken into account (and 7 if it isn't).

The paper also comes to other interesting conclusions. They find that for networks with directed connections between users, both the number of incoming connections to a node, and the number of outgoing connections follow power-law distributions. They also find that nodes with high in-degree tend to have high out-degree.
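
A quick-and-dirty way to check for such a power law is to plot the empirical degree distribution on log-log axes and fit a line to it. The sketch below uses synthetic degrees and a least-squares fit for simplicity; a careful analysis would use maximum-likelihood estimation on the real crawl data.

```python
import numpy as np

# Made-up degree sequence; real data would come from the crawled network.
rng = np.random.default_rng(1)
degrees = np.round(rng.pareto(2.0, size=10000) + 1).astype(int)

# Empirical degree distribution.
values, counts = np.unique(degrees, return_counts=True)
freq = counts / counts.sum()

# Least-squares line fit in log-log space: log p(k) ~ slope * log k + c.
slope, intercept = np.polyfit(np.log(values), np.log(freq), 1)
print("estimated power-law exponent:", -slope)
```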

It is well known that hyperlinks in the Web graph also follow power-law distributions (see for example this paper), so this result for social networks was to be expected I guess. A second result regards the importance of "highly connected individuals" in the network and is a little more surprising:
The networks each contain a large, densely connected core. Overall, the network is held together by about 10% of the nodes with highest degree. As a result, path lengths are short, but almost all shortest paths of sufficient length traverse the highly connected core.
I'm wondering if the last result has anything to do with the limited sample made of each social network or the fact that crawling itself biased the network structure detected. - Although from the description of the crawling procedure given in the paper, it sounds like they tried their utmost to make sure their sampling process introduced as little bias as possible.

Wednesday, February 27, 2008

Correlating votes on Digg with website traffic

Anonymous Prof has written a very nice post analyzing the amount of traffic that flows to websites from front-page stories on Digg. S/he took website traffic data from Alexa and investigated the change in traffic that resulted from a page on each site making it onto the front page of Digg. For large websites, the effects were negligible, but for smaller sites (less than 0.0005% of pageviews on Alexa) the effects were quite apparent. Traffic increased by around 0.1% per vote on Digg, meaning that a story receiving 500 diggs (votes) would temporarily increase traffic to the site hosting the content by around 50%. The effect on very small sites (those not indexed by Alexa) would obviously be much bigger, and often causes the web server to go down under the load.

Anonymous Prof's post on visualizing the last.fm social network is also worth a look for the cool pictures.

Tuesday, February 26, 2008

Staying anonymous online: part 2

I wrote recently about the amount of personal information that each of us is making public and putting online, sometimes without even realizing it. In the post, I stated that it should be possible to link the various profiles of a particular user across different social networking sites, even if the user has completely different/anonymous user-ids on the different sites. Turns out that there already exists a company that does exactly that ....

The website is called Spokeo and their aim is to let you keep track of what (read: everything) your friends are up to on a bunch of different social networking sites, including last.fm, Amazon, MySpace, LinkedIn, YouTube, Flickr, Digg, Facebook, etc. According to Phil Bradley, who tried it out, the system is able to link the profiles of your friends on different sites, even if they use different login names on each site.

What kind of information can you see? Well, anything that's public and anything else you have access to according to the website:
"All users can track information that is publicly available. Only users who have accesss to private content on a third party site can track that private content in Spokeo."
The thing is, there is quite a lot of publicly available data about pretty much everybody on the net. A service like this, which aggregates the data and makes it easier to find and search, means that the inherent anonymity that comes from there being so much data on the net, spread across so many websites, no longer applies. Data about you can now be integrated, stored and archived. So what you did a number of years ago on some obscure website may well come back to find (/haunt) you in the future.

Sound fanciful? Imagine for example that you are applying for a job and the prospective employer has a look at your profile on LinkedIn and then links to your Digg profile only to discover that you particularly like stories about some ultra-liberal political candidate ... or that you're into guns ... or maybe you've shown a passing interest in a certain religious sect. - All this information wouldn't be hard to find, but it's probably not the stuff you want on your CV!

Also, it looks like Spokeo isn't the only "social content aggregator" and that these sites are becoming quite common, according to this post on Techcrunch.

Friday, February 15, 2008

Most Popular Data Mining Algorithms

Just saw this post on the Data Mining Research blog about a summary paper on the top 10 data mining algorithms as voted by the program committees of KDD, ICDM and SDM. The paper gives a really nice introduction to and summary of a bunch of particularly useful algorithms. It is a great starting point for somebody interested in finding out more about data mining. Some of the sections are better written than others - I particularly liked the discussions regarding support vector machines (SVM) and k-nearest neighbor (kNN) classification.

Friday, February 8, 2008

Predicting Airfares using Machine Learning

Good research ideas can be turned into interesting and successful businesses. Case in point, a company called Farecast that provides an airfare search engine (similar to Orbitz). The website distinguishes itself by using machine learning techniques to predict the future price of airfares based on historical data. Thus it recommends whether a user should buy immediately (because the price is going up) or wait for a lower price (as the price is likely to fall).
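
The buy-or-wait idea can be caricatured with a toy rule like the one below, which just compares recent fares to slightly older ones; Farecast's actual models (described in the paper) are of course far more sophisticated, and the price history here is invented.

```python
# Made-up daily fare history for one route; a real system would use a
# large archive of observed fares and a trained predictive model.
prices = [420, 415, 430, 445, 450, 465, 470]

def recommend(prices, window=3):
    """Very naive 'buy or wait' rule: if the recent trend is upward,
    buy now; otherwise wait for a likely drop."""
    recent = prices[-window:]
    earlier = prices[-2 * window:-window]
    trend = sum(recent) / window - sum(earlier) / window
    return "buy now" if trend > 0 else "wait"

print(recommend(prices))   # -> 'buy now' (prices have been climbing)
```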

I'm interested in the company in particular because it is based on technology developed by my thesis advisor Craig Knoblock (together with other researchers at USC and the University of Washington). The original idea was described in this paper "To buy or not to buy: Mining airline fare data to minimize ticket purchase price" by Oren Etzioni, Craig A. Knoblock, Rattapoom Tuchinda, and Alexander Yates. The paper was published at KDD back in 2003, but is still a very good read. - Needless to say, I am a big fan of Craig's research. He has a knack for discovering new research problems that are both interesting and practical (and sometimes even commercializable).

According to a post on TechCrunch, the company has just started predicting airfares on some international routes. This will be very useful for those of us not living in the US and who spend a significant portion of our incomes on airfares. - I might be able to travel home and see the family a bit more often.

Thursday, February 7, 2008

Digging into the workings of social news aggregators

An ex-colleague of mine, Kristina Lerman, wrote a very interesting paper analyzing and modeling the process by which news items are discovered and make their way onto the front page of the social news aggregation site Digg. The paper is called "Social Information Processing in Social News Aggregation".

Interesting points made in the article regarding the social nature of the system include:
  • People subscribe to feeds of news items that are posted or voted for ("dugg") by their friends, increasing greatly the probability of viewing (and thus voting on) news items that their friends found interesting.
  • This "social filtering" phenomenon is necessary in order to deal with the huge number of stories that are posted to Digg each day.
  • Stories propagate very fast through the system: anecdotal evidence suggests that they spread faster than on clustering-based approaches such as Google News (which makes sense, as a story must break on multiple news sources before it will appear on a clustering-based news aggregator).
  • According to a mathematical model devised to explain the process by which news items are propagated to the front page, a highly relevant/interesting story (relevance=0.9) may not get onto the front page if the original poster is not well connected - i.e. doesn't have a large number of readers.
  • And conversely, relatively uninteresting articles (relevance=0.1) may make it to the front page if the contributor is particularly well connected (i.e. famous). A toy simulation of this effect is sketched below.
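
Here is the toy simulation mentioned above. It is loosely inspired by, but much simpler than, the model in the paper: newly exposed fans see a story with some probability and "digg" it with probability equal to its relevance, each digg exposes the story to more fans, and the story is promoted only if it collects enough votes before it ages off the upcoming queue. All of the parameter values are invented.

```python
import random

def simulate_story(relevance, fans_of_poster, avg_fans=5, p_view=0.3,
                   threshold=40, max_rounds=10):
    """Return True if the story reaches the front page within max_rounds."""
    votes, exposed = 0, fans_of_poster
    for _ in range(max_rounds):
        viewers = sum(random.random() < p_view for _ in range(exposed))
        new_votes = sum(random.random() < relevance for _ in range(viewers))
        votes += new_votes
        if votes >= threshold:
            return True
        exposed = new_votes * avg_fans   # each new digger's fans now see it
    return False

def promotion_rate(relevance, fans, trials=1000):
    return sum(simulate_story(relevance, fans) for _ in range(trials)) / trials

random.seed(0)
print("interesting story, obscure poster:", promotion_rate(0.9, fans=2))
print("dull story, famous poster:        ", promotion_rate(0.1, fans=2000))
```
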
Reading the article made me think of another article I read recently on using agent-modeling techniques to analyze human social behavior. This time the author was interested in the divergence of opinions rather than the consensus opinion (i.e. newsworthiness on Digg). In his article "Mobility and Social Network Effects on Extremist Opinions", André Martins created agent-based simulations of public opinion so as to analyze the emergence of extremist positions in society. The finding in that case was that more interaction between diverse groups could lower the overall amount of extremism.

(Thanks to the physics arXiv blog for bringing my attention to the second article. - They point out that in this age of terrorism fears, closing borders and restricting travel may well have a negative effect on curbing extremism.)

Friday, February 1, 2008

Gaming in the virtual real world

Google's 3D Warehouse is a repository of user-generated 3D models that can be viewed in Google Earth. Anybody can build models using Sketchup and contribute them to the collection. People have already modeled lots of buildings, bridges, and in some cases entire cities (see Adelaide for example). In true user-generated content style, users can vote on the quality of the models and the best of them are then included by default in Google Earth.

Since the modeling data is free to use/reuse and in an open format (KML), it looks particularly interesting from a data integration perspective. As this collection grows, it should provide a fun dataset for research in spatial data integration. One simple integration problem that comes to mind is:
"find me an office in downtown LA with price less than X and a view out to the ocean"
Answering this query would involve integrating data from different real estate websites along with the 3D building models. Another example integration problem, this time for terrorism defense:
"find all rooftops with a clear line of sight onto a convoy traveling along roads XYZ"
In this case, vector data (street maps) would need to be integrated with the 3D building models.
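
To make the integration idea slightly more concrete, here is a toy sketch of the first query: join (invented) real-estate listings with (invented) building metadata on the address and filter. In practice the view test would be a line-of-sight computation over the actual 3D geometry, not a stored flag.

```python
# Hypothetical toy data: listings scraped from a real-estate site and
# building metadata extracted from 3D Warehouse models (both invented).
listings = [
    {"address": "100 Main St", "price": 450000},
    {"address": "200 Hill Ave", "price": 380000},
]
building_models = {
    "100 Main St": {"height_m": 95, "ocean_view": True},
    "200 Hill Ave": {"height_m": 30, "ocean_view": False},
}

def offices_with_view(max_price):
    """Join listings with building models on address, then filter on
    price and the (hypothetical) ocean_view attribute."""
    results = []
    for listing in listings:
        model = building_models.get(listing["address"])
        if model and listing["price"] <= max_price and model["ocean_view"]:
            results.append(listing)
    return results

print(offices_with_view(max_price=500000))
```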

Anyway, the point of this post was to highlight something very cool (and not so research related) that I saw on the blog digital urban. In the post, the writers take the free content from the 3D Warehouse and add it to a video game, so that game play can take place in a real (virtual) city instead of an imagined one. There is a video of an attack helicopter flying around over London! Looks like you'll soon be able to defend (or invade) your favorite tourist destination.

Sunday, January 20, 2008

Yahoo adds del.icio.us data to search results

I'm a big fan of the social bookmarking site del.icio.us. It contains a lot of really useful information on what's popular online. Users only tag webpages that they find interesting, so the fact that a page is bookmarked on del.icio.us is a pretty good indication of its quality, and the more individuals that have tagged the same page, the more likely it is to be interesting.

Yahoo bought del.icio.us a while back and (according to this post on TechCrunch) is finally starting to incorporate this wealth of information into its own search results. Currently they are adding a line to each result page summary detailing the number of users that have bookmarked the page on del.icio.us. It is hard to tell whether they are also using the data internally to improve search results, i.e. by pushing highly tagged pages further up the list.

A recent and excellent paper by Paul Heymann, Georgia Koutrika and Hector Garcia-Molina called "Can Social Bookmarking Improve Web Search?" suggests that social bookmarking data may indeed (albeit in certain limited cases) be useful for improving search results.

They find that URLs in del.icio.us are disproportionately common in search results, given the relative size of del.icio.us compared to Yahoo's search engine index (which they estimate to be at least two orders of magnitude larger): looking at the results for some 30,000 popular queries, they found that 19% of the top 10 results and 9% of the top 100 results were present in del.icio.us. These statistics are hardly surprising given that users must first find a web page in order to add it to del.icio.us - and they do that using a search engine!

Far more interesting to me is the question as to whether the users are adding useful information by bookmarking a page once they've found it. The search engines record which search results users click on and can already use this information to improve their algorithms. Presumably, however, a user will often click on lots of different results and save only the very best (most relevant) ones to del.icio.us. Thus the fact that a page appears in del.icio.us should carry more meaning than a user clicking on a particular search engine result and therefore may be useful for reordering search results.
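
To illustrate the kind of reordering I mean (this is my own toy sketch, not anything Yahoo has said they do), one could blend the engine's relevance score with a log-scaled bookmark count and sort by the result; the scores, counts and weighting below are all invented.

```python
import math

# Invented example: (url, engine relevance score, del.icio.us bookmark count)
results = [
    ("http://example.com/a", 2.4, 0),
    ("http://example.com/b", 2.1, 350),
    ("http://example.com/c", 1.9, 40),
]

def rerank(results, weight=0.5):
    """Blend the original score with log-scaled bookmark counts."""
    def blended(item):
        url, score, bookmarks = item
        return score + weight * math.log1p(bookmarks)
    return sorted(results, key=blended, reverse=True)

for url, score, bookmarks in rerank(results):
    print(url, score, bookmarks)   # heavily bookmarked pages move up
```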

The authors don't deal directly with the question of whether a page's bookmark count is on its own a good indication of quality. They do, however, study the freshness of URLs on del.icio.us and conclude that the stream of URLs being added is at least as fresh as search engine results. The authors also estimate that 12.5% of URLs posted to del.icio.us were not present in Yahoo's index at the time they were added, implying that del.icio.us could serve as a small source of highly relevant content for seeding Web crawls.

(Thanks to Greg Linden for pointing out the paper on his blog.)

Friday, January 4, 2008

Staying anonymous online

In an online world is it possible to keep private data private?

I've been working for a while using data that is publicly available online. It has struck me recently how much information about themselves people are willing to put online - me included. I have a list of public bookmarks, photos, a web page, a blog ....

I'm posting about this because I read an interesting article by Arvind Narayanan and Vitaly Shmatikov on discovering the identity of users in anonymized datasets. The article is called "Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)" .

Before continuing, I should say that I think it is great what Netflix is doing in releasing anonymized data and creating a prize for the best movie recommendation algorithm. It can be very difficult to find good data for doing research in information retrieval (especially personalization) if you work for a University and not for Google/Yahoo/Amazon/etc. Moreover, I think competitions are a really good approach for both motivating research and for advancing the state of the art by allowing truly objective comparisons between different approaches.

In the paper the authors show that if you have some public and (crucially) non-anonymous data (i.e. information that can be traced to known individuals), and that data overlaps with some public but anonymous data from another source, then the natural sparsity in the data can be used to identify corresponding users in the anonymous data. Data integration researchers will note the similarity to the record linkage problem.

The example they use is to take movie ratings on the Internet Movie Database (IMDb), which the authors argue is not really anonymous (users often log in with their real names), and use them to find corresponding users in the anonymized Netflix dataset. This technique for "de-anonymizing" data relies on the existence of a non-anonymous data source that overlaps with the anonymous source over sparse attributes. In this case, the sparsity results from the fact that it is extremely unlikely that two users watch the same movies in the same order as one another. Thus the order is sufficient to de-anonymize the data even when it is noisy or has been deliberately blurred in some way.
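
Very roughly, the linkage can be thought of as scoring how well the anonymous record matches each candidate public profile, with matches on rare items counting for much more. The sketch below is a drastic simplification of the paper's actual scoring function, and all of the data (profiles, popularity counts) is invented.

```python
import math
from collections import Counter

# Invented data: each profile is the set of movies a user rated.
public_profiles = {                       # e.g. scraped IMDb accounts
    "alice_imdb": {"Movie A", "Obscure Film X", "Movie C"},
    "bob_imdb":   {"Movie A", "Movie B", "Movie C"},
}
anonymous_record = {"Movie A", "Obscure Film X", "Movie D"}   # e.g. a Netflix row

# How many users rated each movie (rarer movies carry more weight).
popularity = Counter({"Movie A": 100000, "Movie B": 80000, "Movie C": 90000,
                      "Movie D": 50000, "Obscure Film X": 12})

def match_score(anon, profile):
    # Sum of inverse-log-popularity weights over the shared items.
    return sum(1.0 / math.log(popularity[m]) for m in anon & profile)

scores = {name: match_score(anonymous_record, prof)
          for name, prof in public_profiles.items()}
print(scores)   # the rare 'Obscure Film X' makes alice_imdb the clear match
```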

The fact that the two researchers manage to discover the "identity" (IMDb login) of some users on Netflix is in itself somewhat worrying. When a user rates a movie on IMDb they are aware that they are making information public, while on Netflix they may rightfully assume their individual ratings are private (and that only aggregate data is public). One could imagine a user rating only "politically correct" movies on their public IMDb profile and films relating to their "personal fetishes" on the private Netflix one (the latter being useful for receiving personalized film recommendations).

The lesson is, however, more general than the Netflix dataset. Using similar techniques, it should also be possible to link people's profiles on many different sites (not just movie ratings). For example, we could link somebody's del.icio.us bookmarks, to their ratings on amazon.com, to their photos on flickr (and their location through geotagging), to their emails in newsgroups, to their profile on LinkedIn and perhaps even their page on facebook. All this data is public and together it would give a somewhat detailed description of the individual.

None of this is really very alarming per se. We are out in public all the time - our actions, appearance, etc. can be observed when we walk down the street. In some sense, our (public) activities online shouldn't be any different. One should just take the lack of anonymity into account when posting information to del.icio.us/flickr/IMDb/etc. - As Narayanan and Shmatikov point out, it is possible to guess at personal details like political leanings, age and sexual orientation, based solely on the seemingly innocuous information contained in movie ratings.

Friday, December 28, 2007

Don't believe all claims you read in research

I stumbled upon a very interesting article today titled "Why Most Published Research Findings are False" by John Ioannidis. He's talking about empirical studies in Medicine and why so many articles in prestigious journals refute one another's claims. His conclusion is that this should be expected, given a number of contributing factors, such as:
  • the natural tendency of researchers to investigate those hypotheses, which support their own prior research;
  • time pressure to publish in "hot topic" areas causing researchers to publish whatever results appear statistically significant;
  • small and insufficient sample sizes;
  • flexibility in the way experiments are conducted including (bias in) the way the data is collected;
  • financial incentives causing researchers to investigate certain hypotheses over others.
While Ioannidis is talking about experiments in Medicine, I think his analysis applies equally to research in Computer Science, where the pressure to publish new research results quickly can be very high and many results (especially in fields like Information Retrieval) can be best explained by biases in the data set.

Regarding the dangers of searching for statistical significance in data, I particularly like the warning given in Moore and McCabe's handbook "Introduction to the Practice of Statistics, Volume 3":
"We will state it as a law that any large set of data - even several pages of a table of random digits - contains some unusual pattern, and when you test specifically for the pattern that turned up, the result will be significant. It will also mean exactly nothing."
They also give an intuitive example of a psychiatry study in which 77 variables (features) were measured and, not surprisingly, 2 were found to be statistically significant at the 5% level (which was to be expected by chance). - Obviously one cannot publish results regarding freshly discovered significant features before running a new experiment (with different data) to test the hypothesis that these features are indeed important.
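
The arithmetic behind "to be expected by chance" is worth spelling out: with 77 tests at the 5% level you expect about four spurious "significant" results, and (assuming the tests are independent) finding at least two purely by chance is very likely. A quick check:

```python
from math import comb

n_tests, alpha = 77, 0.05

# Expected number of spurious 'significant' results under the null.
print("expected false positives:", n_tests * alpha)          # ~3.85

# Probability of at least 2 significant results purely by chance,
# assuming the 77 tests are independent.
p_at_least_2 = 1 - sum(comb(n_tests, k) * alpha**k * (1 - alpha)**(n_tests - k)
                       for k in (0, 1))
print("P(>= 2 significant by chance):", round(p_at_least_2, 3))
```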

In Computer Science the use of standard test collections (like TREC) can go some way to remove errors due to the flexibility in the way experiments are devised. Of crucial importance in Computer Science, however, is the sample size (aka the data set size).

Peter Norvig gave a very interesting presentation on research at Google in which he emphasized the importance of dataset size. He cited work on word sense disambiguation by Banko and Brill in 2001 that showed that algorithms which performed well at one order of magnitude of training data didn't necessarily perform well at the next. Moreover, the worst-performing algorithm with an order of magnitude more data often outperformed the best-performing algorithm with less data. - In short, unless you are using the largest amount of data possible (or there is a rationale for limiting your data), it is hard to draw useful/lasting conclusions from your experiments.

Anyway, the point of this post was to say: when reading research papers it's best to keep in mind that the claims made by the authors may be influenced (often unintentionally) by biases in their own experiments.