- the natural tendency of researchers to investigate hypotheses that support their own prior research;
- time pressure to publish in "hot topic" areas, causing researchers to publish whatever results appear statistically significant;
- small and insufficient sample sizes;
- flexibility in the way experiments are conducted, including (bias in) the way the data is collected;
- financial incentives causing researchers to investigate certain hypotheses over others.
Regarding the dangers of searching for statistical significance in data, I particularly like the warning given in Moore and McCabe's textbook "Introduction to the Practice of Statistics" (3rd edition):
"We will state it as a law that any large set of data - even several pages of a table of random digits - contains some unusual pattern, and when you test specifically for the pattern that turned up, the result will be significant. It will also mean exactly nothing."They also give an intuitive example of a psychiatry study in which 77 variables (features) were measured and not surprisingly 2 were found to be statistically significant at the 5% level, (which was to be expected by chance). - Obviously one cannot publish results regarding freshly discovered significant features before running a new experiment (with different data) to test the hypothesis that these features are indeed important.
In Computer Science, the use of standard test collections (like TREC) can go some way toward removing errors due to flexibility in the way experiments are devised. Of crucial importance in Computer Science, however, is the sample size (i.e. the size of the data set).
Peter Norvig gave a very interesting presentation on research at Google in which he emphasized the importance of dataset size. He cited work on word sense disambiguation by Banko and Brill (2001) showing that algorithms which performed well at one order of magnitude of training data didn't necessarily perform well at the next; moreover, the worst-performing algorithm given an order of magnitude more data often outperformed the best-performing algorithm with less data. In short, unless you are using the largest amount of data available (or have a rationale for limiting your data), it is hard to draw useful, lasting conclusions from your experiments.
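As an illustration (this is not a reproduction of Banko and Brill's experiment - the synthetic data, the two models, and the training-set sizes below are all stand-ins I chose), here is a sketch of the kind of learning-curve comparison Norvig was describing: evaluate each algorithm at several orders of magnitude of training data before concluding that one is better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a disambiguation-style task; not Banko and Brill's data.
X, y = make_classification(n_samples=200_000, n_features=50,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

models = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Evaluate each model at increasing orders of magnitude of training data.
for n in (1_000, 10_000, 100_000):
    for name, model in models.items():
        model.fit(X_train[:n], y_train[:n])
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"n={n:>7}  {name:<20} accuracy={acc:.3f}")
```

Whether the ranking flips on this particular synthetic data is beside the point; the habit of checking performance across data sizes is what guards against conclusions that only hold at one scale.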
Anyway, the point of this post was to say: when reading research papers, it's best to keep in mind that the claims made by the authors may be influenced (often unintentionally) by biases in their own experiments.