Friday, December 28, 2007

Don't believe all claims you read in research

I stumbled upon a very interesting article today titled "Why Most Published Research Findings Are False" by John Ioannidis. He's writing about empirical studies in Medicine and why so many articles in prestigious journals refute one another's claims. His conclusion is that this should be expected, given a number of contributing factors, such as:
  • the natural tendency of researchers to investigate hypotheses that support their own prior research;
  • time pressure to publish in "hot topic" areas, causing researchers to publish whatever results appear statistically significant;
  • small or insufficient sample sizes;
  • flexibility in the way experiments are conducted, including (bias in) the way the data is collected;
  • financial incentives causing researchers to investigate certain hypotheses over others.
While Ioannidis is talking about experiments in Medicine, I think his analysis applies equally to research in Computer Science, where the pressure to publish new results quickly can be very high and where many results (especially in fields like Information Retrieval) are best explained by biases in the data set.

Regarding the dangers of searching for statistical significance in data, I particularly like the warning given in Moore and McCabe's handbook "Introduction to the Practice of Statistics" (3rd edition):
"We will state it as a law that any large set of data - even several pages of a table of random digits - contains some unusual pattern, and when you test specifically for the pattern that turned up, the result will be significant. It will also mean exactly nothing."
They also give an intuitive example of a psychiatry study in which 77 variables (features) were measured and, not surprisingly, 2 were found to be statistically significant at the 5% level. That is roughly what chance alone would produce: with 77 tests at the 5% level, one would expect about 77 × 0.05 ≈ 4 spurious "significant" results. Obviously one cannot publish results about freshly discovered significant features before running a new experiment (with different data) to test the hypothesis that these features are indeed important.
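To make the point concrete, here is a minimal sketch (synthetic data, a standard two-sample t-test from SciPy; the subject count and feature count are assumptions chosen for illustration) of how testing many unrelated features routinely turns up a few "significant" ones at the 5% level:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_subjects, n_features = 50, 77

    # Two groups of purely random measurements: none of the 77 features
    # has any real relationship with group membership.
    group_a = rng.normal(size=(n_subjects, n_features))
    group_b = rng.normal(size=(n_subjects, n_features))

    # Test each feature separately with a two-sample t-test.
    _, p_values = stats.ttest_ind(group_a, group_b, axis=0)
    n_significant = int(np.sum(p_values < 0.05))

    print(f"{n_significant} of {n_features} features 'significant' at the 5% level")
    # Expected by chance alone: 77 * 0.05, i.e. about 4

Re-running with a different seed produces a different handful of "significant" features each time, which is exactly why a fresh data set is needed before any of them can be believed.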

In Computer Science the use of standard test collections (like TREC) can go some way toward removing errors due to flexibility in the way experiments are devised. Of crucial importance, however, is the sample size (i.e., the data set size).

Peter Norvig gave a very interesting presentation on research at Google in which he emphasized the importance of data set size. He cited work on word sense disambiguation by Banko and Brill in 2001 showing that algorithms that performed well at one order of magnitude of training data didn't necessarily perform well at the next. Moreover, the worst-performing algorithm given an order of magnitude more data often outperformed the best-performing algorithm given less data. In short, unless you are using the largest amount of data possible (or there is a rationale for limiting your data), it is hard to draw useful, lasting conclusions from your experiments.
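This is not Banko and Brill's actual experiment, just a minimal sketch of the kind of comparison they describe: train two off-the-shelf scikit-learn classifiers (chosen arbitrarily here) on increasing amounts of synthetic data and check whether their ranking holds across orders of magnitude:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    # Synthetic stand-in for a real task; hold out the last 10,000 examples for testing.
    X, y = make_classification(n_samples=110_000, n_features=20, random_state=0)
    X_test, y_test = X[-10_000:], y[-10_000:]

    # Compare the two learners at training-set sizes spanning three orders of magnitude.
    for n in (100, 1_000, 10_000, 100_000):
        for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
            model.fit(X[:n], y[:n])
            acc = model.score(X_test, y_test)
            print(f"n={n:>7}  {type(model).__name__:<18}  test accuracy={acc:.3f}")

If the algorithm that wins at n=100 is no longer the winner at n=100,000, then conclusions drawn from the small sample simply don't transfer.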

Anyway, the point of this post was to say: when reading research papers, it's best to keep in mind that the claims made by the authors may be influenced (often unintentionally) by biases in their own experiments.

Thursday, November 29, 2007

Life during and after a PhD

I gave a quick talk today (along with Giovanni Toffetti and Monica Landoni) to the new PhD students at the University of Lugano on life during and after a PhD. Finally, my chance to tell the new students all the things I wish somebody had told me when I started! I'm posting my advice/musings online in case somebody else finds them useful.

Thursday, November 22, 2007

Lecture slides on Web 2.0

Today I gave a lecture on Web 2.0 to 3rd year undergraduate students in Informatics at the University of Lugano.

I concentrated on the themes of user-generated content (wikis, blogging, tagging), applications (mash-ups), social networks and personalization. For each area I gave an overview of current techniques and described some examples, before discussing some research areas (and providing pointers to further information).

The slides of the presentation are available online. Since it was a presentation on Web 2.0, I thought using an online office suite was appropriate...

Wednesday, September 12, 2007

JAIR Article

Craig Knoblock and I published an article "Learning Semantic Definitions of Online Information Sources" in the Journal of Artificial Intelligence Research (JAIR). The article provides a more detailed description of our work on inducing service descriptions that we presented at IJCAI.

Archived News

6 Aug, 2007:
I have just taken up a PostDoc position in the Informatics Faculty of the University of Lugano, Switzerland. I will be working with Fabio Crestani on the problem of discovering, modeling and providing personalized access to news feeds, blogs, and other online data sources. I am very excited about the work and see a great opportunity for combining Data Integration, Personalization and Distributed Information Retrieval techniques.

14 Feb, 2007:
The software I wrote for my thesis for learning definitions of web sources has just been made available on the ISI website! The package is royalty-free for research purposes and comes with all the source code. Documentation is "in progress", so feel free to contact me with installation questions.

9 Jan, 2007:
Here are the slides that I presented today at IJCAI-07.

18 Sep, 2006:
A paper I submitted with Craig Knoblock to IJCAI-07 entitled "Learning Semantic Descriptions of Web Information Sources" was accepted for presentation at the conference.

15 Sep, 2006:
I just gave a talk at ISI on the work I did in my thesis. Be sure to check out the video of the talk! (Unfortunately it's missing the first 4 minutes of audio.)

31 Jul, 2006:
I graduated! My defense was in Trento, Italy. I am now (finally!) a Ph.D. graduate. You can read all about my thesis here.