Peer review of statistics in medical research: the other problem
http://bmj.com/cgi/content/full/324/7348/1271
Education and debate
Peter Bacchetti
Department of Epidemiology and Biostatistics, University of California, San Francisco, CA 94143-0560, USA
Peer review has long been criticised for failing to identify flaws in research. Here Peter Bacchetti argues that it is also guilty of the opposite: finding flaws that are not there.
The process of peer review before publication has long been criticised for failing to prevent the publication of statistics that are wrong, unclear, or suboptimal.1,2 My concern here, however, is not with failing to find flaws, but with the complementary problem of finding flaws that are not really there. My impression as a collaborating and consulting statistician is that spurious
criticism of sound statistics is increasingly common, mainly from
subject matter reviewers with limited statistical knowledge. Of the
subject matter manuscript reviews I see that raise statistical
issues, perhaps half include a mistaken criticism. In grant reviews
unhelpful statistical comments seem to be a near certainty, mainly
due to unrealistic expectations concerning sample size planning.
While funding or publication of bad research is clearly undesirable,
so is preventing the funding or publication of good research.
Responding to misguided comments requires considerable time and
effort, and poor reviews are demoralising.
This paper discusses the problem, its causes, and what might improve the
situation. Although the main focus is on statistics, many of the
causes and potential improvements apply to peer review generally.
Mistaken criticism is a general problem, but may be especially acute for statistics. The examples below illustrate this, including commonly abused areas (examples 1 and 2), non-constructiveness (1), quirkiness and unpredictability (3 and 4), and the potential difficulty of successful rebuttal (3 and 4).
Example 1: Grant review, US National Institutes of Health
Because of uncertainties inherent in sample size planning, reviewers can
always quibble with sample size justifications. Unfortunately, reviewers usually expect a "sample size calculation," with all the precision that "calculation" implies. This may be reasonable for studies based on extensive previous data but is unrealistic in many situations, particularly pilot or exploratory studies. In this particular example the request for proposals specifically asked for pilot studies and prohibited phase III clinical trials, and the review provided no reasoning for the quoted criticism.
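To see why a sample size "calculation" carries less precision than the word implies, a minimal sketch may help. It uses the standard normal-approximation formula for comparing two means, n per group = 2(z_(1-alpha/2) + z_power)^2 (sigma/delta)^2. The function name, the assumed standard deviation, and the three guessed differences are illustrative assumptions, not figures from the proposal discussed above.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal-approximation formula:
    n = 2 * (z_(1-alpha/2) + z_power)^2 * (sigma / delta)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * (sigma / delta) ** 2)


# Hypothetical planning inputs: the same study under three plausible guesses
# at the true difference (delta), with an assumed standard deviation of 10.
for delta in (4.0, 5.0, 6.0):
    print(delta, n_per_group(delta, sigma=10.0))
```

Under these assumed inputs the required size per group falls from 99 to 63 to 44 as the guessed difference moves from 4 to 5 to 6. When preliminary data cannot distinguish between such guesses, any single number can be quibbled with.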
Example 2: Review for a leading bench science journal
The authors were puzzled by this comment, because two groups differed
substantially and a third was intermediate, all in keeping with their
biological theory. They had not expected that gathering data on the
intermediate condition would be interpreted as weakening their
results. Because it is rarely acceptable to perform only a single
statistical analysis in a study, this type of objection can usually
be raised. Whether to adjust P values for multiple comparisons is
controversial,4,5 but reviews usually state the need for adjustment as accepted dogma. More
importantly, I have rarely seen the issue raised in the classic
situation where only one result of many has a small P value. Instead,
some reviewers object routinely, even when most results have small P
values and there is even a coherent pattern (for example, a treatment
showing benefit by many different measures). In such situations, the
results reinforce each other, rather than detracting from each other
as required by the methods (usually Bonferroni adjustment) that
reviewers often suggest.
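As a rough illustration of this point, the sketch below applies a Bonferroni adjustment (each P value multiplied by the number of tests, capped at 1) to two sets of P values: one where a single result of many is small, and one where most are small. The P values and the function name are hypothetical and are not taken from the manuscript under review.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted P values: min(1, m * p) for m tests."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


# Hypothetical P values for five outcome measures.
lone_small = [0.03, 0.40, 0.55, 0.62, 0.81]   # classic case: one small P among many
coherent = [0.004, 0.02, 0.03, 0.04, 0.20]    # most outcomes point the same way

for label, ps in (("lone", lone_small), ("coherent", coherent)):
    adjusted = bonferroni(ps)
    n_below = sum(a < 0.05 for a in adjusted)  # results still below 0.05 after adjustment
    print(label, [round(a, 2) for a in adjusted], n_below)
```

The adjustment depends only on the number of tests, not on whether the results agree with each other. In the second set, four of five P values are below 0.05 before adjustment, yet only one remains below 0.05 afterwards.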
Example 3: Review for a clinical specialty journal
For those not familiar with receiver operating characteristic (ROC) curves, this is a self contradicting criticism because such curves display the tradeoffs from all possible cut offs of a prediction rule. The paper was an excellent first effort by a very junior lead author, but the deputy editor explicitly endorsed this and many other spurious and demoralising comments and rejected the revised paper despite our diplomatic attempts to rebut the errors. Another journal published essentially the same paper.
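For readers who want to see how an ROC curve covers all possible cut offs, here is a minimal sketch. It sweeps every distinct score as a cut off and records the resulting false positive and true positive rates. The scores, labels, and function name are hypothetical and are not data from the paper in this example.

```python
def roc_points(scores, labels):
    """Return (cutoff, false positive rate, true positive rate) for every cut off.

    A case is called positive when its score is >= the cut off.
    labels: 1 = has the condition, 0 = does not.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Each distinct score is a candidate cut off; infinity means "call nobody positive".
    for cutoff in [float("inf")] + sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((cutoff, fp / negatives, tp / positives))
    return points


# Hypothetical prediction-rule scores for six patients.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for cutoff, fpr, tpr in roc_points(scores, labels):
    print(cutoff, round(fpr, 2), round(tpr, 2))
```

Each printed row is one point on the curve, so the curve as a whole already shows what every choice of cut off would give.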
Example 4: Grant review for a disease specific foundation
We found this nearly indecipherable even in the context of the entire review, but the criticism concerned a published study with a matched design and a corresponding statistical analysis (Wilcoxon signed-rank tests). Such methods boil down to testing whether within-pair differences are centred at zero, so the reviewer seemed to be objecting to this general strategy, an objection so spurious that it is almost impossible to rebut. How does one argue that no difference implies a difference of zero when a reviewer believes that empirical research is needed to verify or refute this? The study was not funded, even though the above comment was the only substantive criticism of the proposal. Essentially the same proposal was funded nine months later, after that reviewer had rotated off the committee.
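The idea that a signed-rank test asks whether within-pair differences are centred at zero can be made concrete with a small sketch. The code below computes the exact two-sided Wilcoxon signed-rank P value by enumerating every assignment of signs to the ranks. The data and function name are hypothetical, the code assumes no zero or tied differences, and it is not the analysis from the study the reviewer criticised.

```python
from itertools import product


def signed_rank_exact(before, after):
    """Exact two-sided Wilcoxon signed-rank test on paired data.

    Returns (W_plus, p_value). Assumes no zero differences and no tied
    absolute differences, which keeps the ranking step simple.
    """
    diffs = [a - b for a, b in zip(after, before)]
    ordered = sorted(diffs, key=abs)
    # Rank absolute differences 1..n, then sum the ranks of positive differences.
    w_plus = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
    n = len(diffs)
    expected = n * (n + 1) / 4  # mean of W_plus when differences are centred at zero
    # Under the null each rank is equally likely to carry a + or - sign,
    # so enumerate all 2^n sign patterns.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in enumerate(signs, start=1) if s)
        if abs(w - expected) >= abs(w_plus - expected):
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical matched pairs (one measurement for each member of a pair).
before = [101, 94, 112, 87, 109, 99]
after = [110, 99, 126, 85, 120, 106]
print(signed_rank_exact(before, after))
```

The null hypothesis enters only through the symmetry of signs around a zero difference, which is what "no difference" means for matched data.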
Several factors may contribute to this problem, some common to all peer review. A pervasive factor is the desire to find something to criticise. Tannen recently documented the overvaluation of criticism and conflict both generally in Western popular culture and specifically in academia.6 In addition, the notion that finding flaws is the key to high quality peer review is fairly explicit in some writings,7-9 and developers of an instrument for rating review quality recently focused only on "completeness" and not on "whether the reviewer's judgment was correct."10 A panel on peer review for the US National Institutes of Health acknowledged an overly critical climate, stating "Peer reviewers should eschew the common current tendency to find fault."11 Finding flaws is certainly important, and scepticism and disputation are revered in scientific tradition. But when criticism is an end in itself rather than a tool for advancing knowledge, when finding flaws is imperative rather than the natural result of careful review with an open mind, then mistaken criticisms will arise.
The problem may be more acute in statistics because of two factors that are synergistic both with each other and with the need to criticise. The first is that reviewers see statistics as a rich area for finding mistakes. This perception is correct, because statistical errors are common. But areas such as sample size and multiple comparisons can be reflexively subjected to unfair and unhelpful criticism. In the case of clinical trials Meinert lists many other "universal" criticisms.12
The second factor is many reviewers' poor understanding of statistics,2 especially the belief that rules must be blindly followed. I am dismayed by how often my clients ask whether a particular approach would be "legal" or "against the rules" rather than "accurate" or "misleading." This misunderstanding of statistics as a body of seemingly arbitrary dogma leads many reviewers to perceive violations even when the research has not actually been harmed.
Finally, another pair of synergistic factors apply to peer review generally.
The first is the frequent need to rush reviewing. This seems unlikely
to improve, given increasing emphasis on documented productivity and
the accelerating pace of life generally.13
The second, perhaps more important, is the lack of incentives.
One recent editorial noted, "It is generally admitted that being a
good referee does not lead to any tangible rewards with respect to
career advancement."14 Another
noted that "the integrity of the scientific review process requires
that the performance of reviewers be appropriately rewarded" and
ended, "We do thank you."15 This gratitude,
while sincere, is emblematic of the inadequate rewards that reviewers
can expect. The only likely concrete consequence of good reviewing is
future requests for more reviews.
Aside from widespread improvement in understanding of statistical methods (a worthy goal), care by reviewers and changes in peer review systems and culture may reduce mistaken statistical criticisms and improve peer review generally.
Changing the system
A change that would perhaps improve peer review even more would be to evaluate its quality and reward good performance. Because meaningful grading must reflect the substance of the reviews, including whether criticisms are correct and whether serious flaws have been overlooked, fellow reviewers of the same paper are perhaps best positioned to rate each other's performance. This would also promote reflection on one's own performance. A simple form of reward would be to supplement long annual lists of all reviewers with much shorter honour rolls of those who have provided high quality reviews. Multiple honour rolls could address different aspects, such as helpfulness to editors, high ratings from fellow reviewers, or good marks from rejected authors on constructiveness. Paying attention to review quality might result in cultural changes. For example, top academic institutions may come to see failure to make at least one honour roll of a relevant journal as a serious weakness.
Peer review is a key part of the collective scientific process. Expecting it
to work well on donated time, with little training and even less
accountability or incentives, seems unrealistic. Changes in the
systems and culture of peer review might improve things, notably less
pressure to criticise, more training in reviewing skills, and less
statistical dogmatism. The most promising change might be to better
reward good performance.
I thank Professors Douglas G Altman and Steven N Goodman for helpful comments on an early draft of this paper.
Competing interests: None declared.