Friday, April 18, 2008

Google starts indexing behind web forms

A lot of interesting content on the Web is hidden behind Web forms and search interfaces. These pages are usually generated dynamically by database queries built from the form's inputs. Such dynamic sites are sometimes referred to as the Deep Web. Dynamic sites present a problem for Web crawlers because there are often not enough "static links" from the outside Web into the dynamic content (or between the dynamic pages) for a crawler to discover all the content on a site. The end result is that the search engine indexes only part of a website's content.

I just saw an interesting post stating that Google has started to automatically "fill in" the inputs on some Web forms so as to better index content behind the forms. It seems that Google's crawler (affectionately termed the Googlebot) has been "taught" how to generate input for search boxes on some sites. Thus the crawler can discover pages that are not linked to from the "Surface Web".

In general it is a difficult problem to automatically generate reasonable and relevant input values for a previously unseen Web form. Some forms contain drop-down boxes (multiple-choice inputs), which simplifies the problem because the set of possible options is enumerated explicitly in the form. In most cases, however, the form requires an unrestricted free-text query typed into a search box. The difficulty with free text is that for many syntactically valid inputs the site returns nothing, or at best an error page is "discovered". Imagine sending the query "space-shuttle" to a website providing information on cancer research, for instance.
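To make the distinction concrete, here is a minimal Python sketch that pulls the enumerated options out of a drop-down and flags the free-text inputs that would need generated query terms. The form HTML is made up for illustration; a real crawler would of course fetch and parse pages at much larger scale.

    # Minimal sketch: distinguish enumerable form inputs (drop-downs) from
    # free-text ones, using only the standard library. The HTML below is a
    # made-up example form, not any real site.
    from html.parser import HTMLParser

    class FormInputScanner(HTMLParser):
        def __init__(self):
            super().__init__()
            self.selects = {}          # select name -> list of option values
            self.free_text = []        # names of unrestricted text inputs
            self._current_select = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "select":
                self._current_select = attrs.get("name", "")
                self.selects[self._current_select] = []
            elif tag == "option" and self._current_select is not None:
                if "value" in attrs:
                    self.selects[self._current_select].append(attrs["value"])
            elif tag == "input" and attrs.get("type", "text") == "text":
                self.free_text.append(attrs.get("name", ""))

        def handle_endtag(self, tag):
            if tag == "select":
                self._current_select = None

    html = """
    <form action="/search">
      <input type="text" name="q">
      <select name="category">
        <option value="research">Research</option>
        <option value="treatment">Treatment</option>
      </select>
    </form>
    """
    scanner = FormInputScanner()
    scanner.feed(html)
    print(scanner.selects)    # {'category': ['research', 'treatment']}
    print(scanner.free_text)  # ['q'] -- needs generated query terms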

Generating relevant inputs so as to probe and characterize a dynamic website is a research problem that I am very much interested in, and it is central to the area of Distributed Information Retrieval. The best work I've seen recently on this was done by Panagiotis Ipeirotis and Luis Gravano in their article "Classification-Aware Hidden-Web Text Database Selection". In that work, while probing an online database with single-term queries, they try to assign the database to (progressively deeper) nodes in a topic hierarchy. They can then use the distribution of relevant terms at each node of the hierarchy to choose reasonable query terms for effective probing.
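Here is a rough Python sketch of the general idea, not the authors' actual algorithm: descend the hierarchy by probing with each child node's characteristic terms and following the child that gets the most hits. The tiny hierarchy, the term lists and the probe function are all placeholders for illustration.

    # Rough sketch of hierarchy-guided query probing (not the authors'
    # actual algorithm). probe(term) is a placeholder that would submit a
    # single-term query to the site and return the reported result count.
    TOPIC_TERMS = {
        "Health":            ["disease", "treatment", "patient"],
        "Health/Cancer":     ["tumor", "chemotherapy", "oncology"],
        "Health/Cardiology": ["artery", "cardiac", "stent"],
    }

    def classify(probe, root="Health", threshold=5):
        """Assign the database to progressively deeper topic nodes."""
        node = root
        while True:
            children = [t for t in TOPIC_TERMS if t.startswith(node + "/")]
            if not children:
                return node
            # Pick the child whose terms get the most hits from the database.
            scores = {c: sum(probe(t) for t in TOPIC_TERMS[c]) for c in children}
            best = max(scores, key=scores.get)
            if scores[best] < threshold:
                return node          # no child is a convincing match
            node = best              # descend and keep probing

    # Example with a fake probe simulating a cancer-research site.
    fake_counts = {"tumor": 120, "chemotherapy": 80, "oncology": 95}
    print(classify(lambda t: fake_counts.get(t, 0)))   # Health/Cancer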

The problem is even more complicated for forms requiring multiple input fields. Consider for example a yellow-pages service that takes a company name (free text) and a partially filled address (street name, etc.) as input. Randomly generating plausible inputs for each field separately won't work in this case, because the different fields must make sense together (i.e. a business with that name needs to exist on the given street). Thus managing to invoke the interface so as not to receive an empty/error page can be very difficult. Just detecting the error page is a hard problem in its own right, and one that I had to deal with (not very satisfactorily) in my thesis.
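One common heuristic for error-page detection is to submit a query that cannot possibly match anything, keep the response as an "error signature", and treat later responses that look too similar to it as empty/error pages. Below is a minimal Python sketch of that idea; the fetch function, the nonsense query and the similarity threshold are all placeholders, and real sites often need more robust page comparison than this.

    # Minimal sketch of one heuristic for error-page detection. fetch(query)
    # is a placeholder for whatever submits the form and returns the page HTML.
    import difflib

    def make_error_detector(fetch, nonsense="qzxjvk0987nonexistent"):
        signature = fetch(nonsense)   # response to an unanswerable query
        def looks_like_error(page, threshold=0.9):
            ratio = difflib.SequenceMatcher(None, signature, page).ratio()
            return ratio >= threshold
        return looks_like_error

    # Usage with a fake fetch function standing in for a real site.
    def fake_fetch(query):
        if query in ("chemotherapy", "oncology"):
            return "<html>42 results for " + query + " ...</html>"
        return "<html>Sorry, no results were found.</html>"

    is_error = make_error_detector(fake_fetch)
    print(is_error(fake_fetch("chemotherapy")))   # False -- real results
    print(is_error(fake_fetch("space-shuttle")))  # True  -- error page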
