Friday, April 18, 2008

Google starts indexing behind web forms

A lot of interesting content on the Web is hidden behind Web forms and search interfaces. These pages are usually generated dynamically by database queries built using the form's inputs. Such dynamic sites are sometimes referred to as the Deep Web. Dynamic sites present a problem for Web crawlers as there are often not enough "static links" from the outside Web into the dynamic content (and between the dynamic pages) in order for the crawler to discover all the content on a site. The end result is that the search engine indexes only part of a website's content.

I just saw an interesting post stating that Google has started to automatically "fill in" the inputs on some Web forms so as to better index content behind the forms. It seems that Google's crawler (affectionately termed the Googlebot) has been "taught" how to generate input for search boxes on some sites. Thus the crawler can discover pages that are not linked to from the "Surface Web".

In general it is a difficult problem to automatically generate reasonable and relevant input values for a previously unseen Web form. Some forms contain drop-down boxes (multiple-choice inputs), which simplifies the problem since the set of possible options is enumerated explicitly in the form. In most cases, however, the form requires an unrestricted free-text query typed into a search box. The difficulty with the latter is that for many syntactically valid inputs the site will return nothing, or at best an error page will be "discovered". - Imagine sending the query "space-shuttle" to a website providing information on cancer research, for instance.
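To give a concrete feel for the difference between the two kinds of inputs, here is a minimal sketch (using BeautifulSoup; the sample form and the candidate vocabulary are entirely made up) of how a crawler might enumerate values for drop-down fields versus free-text fields:

```python
# A sketch only: drop-downs enumerate their legal values, free-text boxes don't.
from bs4 import BeautifulSoup

html = """
<form action="/search">
  <select name="category">
    <option value="trials">Clinical trials</option>
    <option value="drugs">Drug information</option>
  </select>
  <input type="text" name="q">
</form>
"""

soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")

# Drop-down inputs: the full set of legal values is given explicitly in the form.
for select in form.find_all("select"):
    values = [opt.get("value") for opt in select.find_all("option")]
    print(select.get("name"), "->", values)

# Free-text inputs: the crawler has to guess, e.g. by drawing terms from a
# vocabulary believed to be relevant to the site's topic (here: invented).
candidate_terms = ["chemotherapy", "melanoma", "clinical trial"]
for text_input in form.find_all("input", attrs={"type": "text"}):
    print(text_input.get("name"), "-> try:", candidate_terms)
```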

Generating relevant inputs so as to probe and characterize a dynamic website is a research problem that I am very much interested in and is central to the area of Distributed Information Retrieval. The best work I've seen recently on this was done by Panagiotis Ipeirotis and Luis Gravano in their article "Classification-Aware Hidden-Web Text Database Selection". In that work, while probing an online database with single-term queries, they try to assign the database to (progressively deeper) nodes in a topic hierarchy. They can then use a distribution of relevant terms for each node in the hierarchy to choose reasonable query terms for effective probing.
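As a toy illustration of the basic idea only (this is not the authors' actual algorithm; the search_site function, the topic hierarchy and the probe terms are all invented), probing with single-term queries and picking the best-covered topic might look something like this:

```python
# Toy sketch: classify a hidden-web database by probing it with single-term queries.

def search_site(term):
    # Stand-in for submitting the probe query to the site's search form and
    # reading back the reported number of matches; here we fake the counts
    # for an imaginary site about cancer research.
    fake_counts = {"cancer": 5400, "diabetes": 120, "surgery": 800,
                   "chemotherapy": 3100, "melanoma": 950, "tumor": 2700,
                   "soccer": 0, "marathon": 2, "playoff": 0}
    return fake_counts.get(term, 0)

TOPIC_PROBES = {
    "Health": ["cancer", "diabetes", "surgery"],
    "Health/Oncology": ["chemotherapy", "melanoma", "tumor"],
    "Sports": ["soccer", "marathon", "playoff"],
}

def coverage(topic):
    # How many matches the site reports, summed over the topic's probe terms.
    return sum(search_site(term) for term in TOPIC_PROBES[topic])

# Assign the database to the topic whose probes it covers best; that topic's
# term distribution can then supply further, more effective probe queries.
best_topic = max(TOPIC_PROBES, key=coverage)
print(best_topic)  # -> "Health/Oncology"
```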

The problem is even more complicated for forms requiring multiple input fields. Consider for example a yellow-pages service that takes a company name (free text) and a partially filled address (street name, etc.) as input. Randomly generating plausible inputs for each field separately won't work in this case because the different fields must make sense together (i.e. a business with that name needs to exist on the given street). Thus invoking the interface without receiving an empty or error page can be very difficult. - Even just detecting the error page is hard on its own; it was an issue I had to deal with (not very satisfactorily) in my thesis.
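One simple heuristic for the error-page detection part (a sketch, not what I actually did in the thesis; fetch_results is a hypothetical function that submits the form and returns the page's text) is to submit a query that is almost certainly meaningless, keep the resulting page as an "error template", and flag any later response that looks too similar to it:

```python
# Sketch: treat pages that look like the response to a nonsense query as error pages.
import difflib
import uuid

def build_error_template(fetch_results):
    # A random token is very unlikely to match anything in the database,
    # so the response should be the site's "no results" / error page.
    nonsense_query = uuid.uuid4().hex
    return fetch_results(nonsense_query)

def looks_like_error_page(page_text, error_template, threshold=0.9):
    # Pages that are near-identical to the template are assumed to be errors.
    ratio = difflib.SequenceMatcher(None, page_text, error_template).ratio()
    return ratio >= threshold
```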

Monday, April 7, 2008

Women are better at judging personality

Psychology researchers have performed an interesting study on the ability of social network users to understand/judge different aspects of other users' personalities based on their social network profile/page. The study is called "What Elements of an Online Social Networking Profile Predict Target-Rater Agreement in Personality Impressions?" by David C. Evans, Samuel D. Gosling and Anthony Carroll. [Thanks to William Cohen for pointing me to it.]

The attributes of a user's personality that they investigated included being disciplined/casual, alternative/traditional, neurotic/unemotional, cooperative/competitive, and extroverted/introverted.

The study found that women are both better at assessing the personality of others and better at describing their own personality (presumably by providing more meaningful answers to profile questions on their pages). In order to check the accuracy of the personality judgments, the authors compared each assessment to the user's own assessment of themselves. - I wonder whether the conjecture that males are not as good judges of their own personality might also explain the second result?

Another interesting finding is that a user's personality is best described by: i) a video they find funny, ii) a description of "what makes them glad to be alive", iii) the most embarrassing thing they ever did, or iv) the proudest thing they ever did. Things like their favorite (or least favorite) book, movie, etc., on the other hand, were found to be not very useful for assessing one's personality.

Sunday, April 6, 2008

Using mechanical turk for research

Panos Ipeirotis has been posting some very interesting information on his blog regarding a study of users on Amazon's Mechanical Turk (MT). For those not in the know, MT is basically a crowd-sourcing system, where "requesters" define small tasks that can be performed online (like adding text labels to a group of images) and offer a certain amount of money per task. "Workers" then choose which jobs they want to perform and receive small payments (in their US bank accounts) whenever they complete them. Most of the jobs are pretty menial, so the payments are quite low (a few cents per task) and workers are unlikely to get rich doing them. But the point is that all the tasks require humans to perform them, i.e. they can't be automated well by a computer program. The trick for the requesters is to design jobs in such a way that noisy results are removed automatically (e.g. by comparing results from different workers) - as the workers may be motivated more by speed than by quality when completing tasks.
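As a minimal sketch of the "compare results from different workers" idea (the item IDs and labels below are made up, not real Mechanical Turk output), a requester could collect several labels per item, keep the majority answer, and discard items where the workers disagree too much:

```python
# Sketch: majority-vote aggregation over redundant worker labels.
from collections import Counter

# item_id -> labels collected from (say) three different workers
raw_labels = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}

def aggregate(labels, min_agreement=2):
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_agreement else None  # None = too noisy, re-post the task

clean_labels = {item: aggregate(labels) for item, labels in raw_labels.items()}
print(clean_labels)  # {'img_001': 'cat', 'img_002': 'dog', 'img_003': None}
```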

Anyway, Panos has started using MT to help him with his research. While contracting out the research itself doesn't make sense (of course!), there are a number of very interesting ways that different research-related tasks can be done using the Mechanical Turk. The most obvious is to get MT users to label training data for Machine Learning experiments. But MT can also be used to set up cheap user studies relatively easily, such as those required for assessing result quality and user satisfaction in Information Retrieval. I'm not exactly sure what the ethics issues are for the latter use, but it does sound like a very good idea to take advantage of all the (bored) Internet users out there. Here is a simple example of a user study, in which the authors ask MT users to name different colors presented to them [via Anonymous Prof].

Friday, April 4, 2008

The AOL dataset

I'm currently doing research on personalization of Distributed Information Retrieval (DIR) and need access to personalized query logs for my experiments. Since I don't work for a major search engine, I don't have access to their logs and am unlikely to be granted access anytime soon. There is of course the AOL query log dataset that was released in 2006 for research use and then quickly retracted by the company due to privacy complaints. The dataset can still easily be found online. - As anybody who's ever released sensitive data online knows, it's impossible to "remove" information from the Net once it's been released.

The AOL data is exactly what I need, and I wish to use it for its intended purpose: to support research in Information Retrieval. I don't intend to pry into peoples' private lives and of course am not interested in identifying users from their logs.
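If I do end up using it, the basic preprocessing I need is simple enough. Here is a sketch, assuming the log's tab-separated layout (AnonID, Query, QueryTime, ItemRank, ClickURL); the filename is just a placeholder:

```python
# Sketch: group the (anonymized) AOL queries by user for personalization experiments.
import csv
from collections import defaultdict

queries_by_user = defaultdict(list)

with open("user-ct-test-collection-01.txt", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        queries_by_user[row["AnonID"]].append((row["QueryTime"], row["Query"]))

# Each user's query history, in time order, is the raw material for
# personalizing retrieval.
for user, history in list(queries_by_user.items())[:3]:
    print(user, sorted(history)[:5])
```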

My question is whether or not I am allowed (from a practical perspective) to use the data. Some conferences, like SIGIR, seem a little worried about authors using it from what I hear, while others don't seem to mind. For example, I saw a paper at WSDM'08 that used the data as a source of common queries to a Web search engine.

And from a legal perspective, does the fact that AOL retracted the data have any implications? Obviously AOL owns the copyright to their own data. Do I need permission from AOL in order to use their copyrighted data? Or is it considered "fair use" if I make use of the data only for research and don't distribute the raw data, just my analysis of it?

Thursday, April 3, 2008

Conferences starting to provide videos of talks online

I realized recently that I'm watching many more research presentations online than I ever did in the past. It seems that where once you needed to locate and read the paper (and maybe download the slides if available), you can now often find a video online of the author presenting their work. While I'm not suggesting that one no longer needs to read research papers (as I think that is critical for understanding the details of previous work), I am suggesting that the video presentations are a good place to start to understand a new topic. Presentations are often much better at conveying the overall picture of what the research is about and how it fits in with the rest of the literature.

As far as video sites go, I'm a big fan of Google's TechTalks, which are particularly good for their research focus and quality. Often even better, though, are the videos on videolectures.net, which show the slides in a separate window and allow you to jump directly to the part of the talk you are most interested in. This makes it possible to "speed-view" a number of different presentations from the same conference, workshop, summer school, etc.

Some examples of great content that I have found lately include lectures from the 2007 Machine Learning Summer School (particularly good is the lecture on Graphical Models), the 2007 Bootcamp in Machine Learning, and the 2006 Autumn School on Machine Learning (which includes seminars by two of my favorite machine learning researchers: William Cohen and Tom Mitchell).

I think in the very near future all of the top conferences will start recording and streaming their talks online. Some of the more progressive conferences are already doing it - see for example the presentations at WSDM'08. Once all the talks are available online, attending conferences should become even more about the networking and even less about attending talks than it is now!