Peer review has long been criticised for failing to identify flaws in
research. Here Peter Bacchetti argues that it is also guilty of the
opposite: finding flaws that are not there.
The process of peer review before publication has long been criticised for
failing to prevent the publication of statistics that are wrong,
unclear, or suboptimal.1 2
My concern here, however, is not with failing to find flaws, but with
the complementary problem of finding flaws that are not really there.
My impression as a collaborating and consulting statistician is that spurious
criticism of sound statistics is increasingly common, mainly from
subject matter reviewers with limited statistical knowledge. Of the
subject matter manuscript reviews I see that raise statistical
issues, perhaps half include a mistaken criticism. In grant reviews
unhelpful statistical comments seem to be a near certainty, mainly
due to unrealistic expectations concerning sample size planning.
While funding or publication of bad research is clearly undesirable,
so is preventing the funding or publication of good research.
Responding to misguided comments requires considerable time and
effort, and poor reviews are demoralising, a
subtler but possibly more serious cost.
This paper discusses the problem, its causes, and what might improve the
situation. Although the main focus is on statistics, many of the
causes and potential improvements apply to peer review generally.
Summary points
Peer reviewers often make unfounded statistical criticisms,
particularly in difficult areas such as sample size and multiple
comparisons
These spurious statistical comments waste time and sap morale
Reasons include overvaluation of criticism for its own sake,
inappropriate statistical dogmatism, time pressure, and lack of rewards
for good peer reviewing
Changes in the culture of peer review could improve things,
particularly honouring good performance
The problem
Mistaken criticism is a general problem, but may be especially acute for
statistics. The examples below illustrate this, including commonly
abused areas (examples 1 and 2), non-constructiveness (1), quirkiness
and unpredictability (3 and 4), and the potential difficulty of
successful rebuttal (3 and 4).
Example 1: Grant review, US National Institutes of Health
"There is a flaw in the study design with regard to statistical
preparation. The sample size appears small."
Because of uncertainties inherent in sample size planning, reviewers can
always quibble with sample size justifications, and they usually do. The
information needed to determine accurately
the "right" sample size (a murky concept in itself) is often much
more than available preliminary information. For example, even
directly relevant preliminary information from 30 subjects provides
an estimated variance with a threefold difference between the lower
and upper ends of its 95% confidence interval, resulting in threefold
uncertainty in a corresponding sample size calculation.3 Often
considerable uncertainty also exists about other relevant
factors such as the size of the effect or association, an outcome's
prevalence, confounding variables, adherence to study medication, and
so on. Such uncertainties can be especially acute for highly
innovative research.
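The threefold figure can be checked with a short calculation. The sketch below is illustrative only: it assumes normally distributed data, uses the Wilson-Hilferty approximation to chi-square quantiles (so that only the Python standard library is needed), and the function names are hypothetical rather than taken from the cited reference.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile_wh(p, df):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * sqrt(c)) ** 3


def variance_ci_ratio(n, level=0.95):
    """Ratio of the upper to the lower confidence limit for a variance
    estimated from n independent normal observations (assumption)."""
    df = n - 1
    alpha = 1.0 - level
    lower_q = chi2_quantile_wh(alpha / 2, df)
    upper_q = chi2_quantile_wh(1 - alpha / 2, df)
    # The interval is [df * s2 / upper_q, df * s2 / lower_q], so the
    # sample variance s2 cancels when the two limits are divided.
    return upper_q / lower_q


ratio = variance_ci_ratio(30)
print(f"upper limit / lower limit for n = 30: {ratio:.2f}")
```

For 30 subjects the ratio comes out a little under 3. In standard calculations for comparing means the required sample size is proportional to the variance, so the same ratio carries over to the sample size.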
Unfortunately, reviewers usually expect a "sample size calculation," with all
the precision that "calculation" implies. This may be reasonable for
studies based on extensive previous data but is unrealistic in many
situations, particularly pilot or exploratory studies. In this
particular example the request for proposals specifically asked for
pilot studies and prohibited phase III clinical trials, and the
review provided no reasoning for the quoted criticism.
Example 2: Review for a leading bench science journal
"The statistical test used . . . is not appropriate for the
multiple comparisons necessitated by this experimental design."
The authors were puzzled by this comment, because two groups differed
substantially and a third was intermediate, all in keeping with their
biological theory. They had not expected that gathering data on the
intermediate condition would be interpreted as weakening their
results. Because it is rarely acceptable to perform only a single
statistical analysis in a study, this type of objection can usually
be raised. Whether to adjust P values for multiple comparisons is
controversial,4 5 but
reviews usually state the need for adjustment as accepted dogma. More
importantly, I have rarely seen the issue raised in the classic
situation where only one result of many has a small P value. Instead,
some reviewers object routinely, even when most results have small P
values and there is even a coherent pattern (for example, a treatment
showing benefit by many different measures). In such situations, the
results reinforce each other, rather than detracting from each other
as required by the methods (usually Bonferroni adjustment) that
reviewers often suggest.
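A small numerical illustration may make the point about Bonferroni adjustment concrete. The P values below are invented for illustration and do not come from any study discussed here; the sketch only shows that the adjustment multiplies each P value by the number of comparisons, whatever the other results look like.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each P value by the number of
    comparisons, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical P values, invented for illustration only.
coherent = [0.02, 0.03, 0.02, 0.04, 0.03]  # benefit on all five measures
isolated = [0.02, 0.61, 0.45, 0.83, 0.57]  # one small value among five

adj_coherent = bonferroni(coherent)
adj_isolated = bonferroni(isolated)

print("coherent pattern:", [round(p, 2) for p in adj_coherent])
print("isolated result: ", [round(p, 2) for p in adj_isolated])
```

The P value of 0.02 receives the same adjusted value in both scenarios, and every result in the coherent pattern ends up above 0.05. The adjustment cannot distinguish five results that support each other from one small P value among unremarkable ones.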
Example 3: Review for a clinical specialty journal
"Figure 1 appears to be a ROC Curve at a 50% threshold. . . . it
is not clear how well the system would have worked had other
thresholds been chosen."
For those not familiar with receiver operating characteristic (ROC) curves,
this is a self contradicting criticism because such curves display
the tradeoffs from all possible cut offs of a prediction rule. The
paper was an excellent first effort by a very junior lead author, but
the deputy editor explicitly endorsed this and many other spurious
and demoralising comments and rejected the revised paper despite our
attempts diplomatically to rebut the errors. Another journal
published essentially the same paper.
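For readers who want to see why a ROC curve cannot be "at a 50% threshold", the sketch below builds one from scratch. The scores and outcomes are invented for illustration, and the function name is hypothetical; the code simply evaluates the prediction rule at every possible cut off and records one point per cut off.

```python
def roc_points(scores, labels):
    """Return one (false positive rate, true positive rate) pair for
    every possible cut off of the prediction score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start with a cut off above every score, then use each distinct score.
    cutoffs = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical predicted probabilities and true outcomes (1 = event).
scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"false positive rate {fpr:.2f}, true positive rate {tpr:.2f}")
```

A cut off of 0.5 contributes exactly one of the points on this curve. The curve as a whole runs from (0, 0) to (1, 1) and already answers the question of how the rule would have performed at other thresholds.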
Example 4: Grant review for a disease specific foundation
"It is questionable whether a theoretical baseline value of zero
should the use for statistical analysis of differences in the
measurements of median matched difference." (sic)
We found this nearly indecipherable even in the context of the entire review,
but the criticism concerned a published study with a matched design
and a corresponding statistical analysis (Wilcoxon signed-rank
tests). Such methods boil down to testing whether within-pair
differences are centred at zero, so the reviewer seemed to be
objecting to this general strategy, an objection so spurious that it
is almost impossible to rebut. How does one argue that no difference
implies a difference of zero when a reviewer believes that empirical
research is needed to verify or refute this? The study was not
funded, even though the above comment was the only substantive
criticism of the proposal. Essentially the same proposal was funded
nine months later, after that reviewer had rotated off the
committee.
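The role of zero in a matched analysis can be seen in a minimal sketch of the exact Wilcoxon signed-rank test. The within-pair differences are invented for illustration, the function name is hypothetical, and the code assumes no zero differences and no tied absolute values.

```python
from itertools import product


def signed_rank_exact(differences):
    """Exact two sided Wilcoxon signed-rank test that within-pair
    differences are centred at zero.

    Assumes no zero differences and no ties among absolute values.
    Returns the sum of positive ranks and the exact P value.
    """
    d = list(differences)
    n = len(d)
    # Rank the differences by absolute size (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)

    # If there is truly no difference, each rank is equally likely to
    # carry a plus or a minus sign; zero is built into this null
    # hypothesis rather than estimated from data.
    expected = n * (n + 1) / 4
    observed_dev = abs(w_plus - expected)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if abs(w - expected) >= observed_dev:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical within-pair differences, invented for illustration.
diffs = [1.2, 2.5, 0.8, 3.1, -0.4, 1.9, 2.2, 0.6]
w_plus, p_value = signed_rank_exact(diffs)
print(f"sum of positive ranks = {w_plus}, exact P = {p_value:.4f}")
```

The reference distribution is generated purely by assuming that "no difference" means positive and negative differences are equally likely. No baseline value has to be chosen or verified empirically.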
Causes
Several factors may contribute to this problem, some common to all peer
review. A pervasive factor is the desire to find something to
criticise. Tannen recently documented the overvaluation of criticism
and conflict both generally in Western popular culture and
specifically in academia.6 In addition, the notion
that finding flaws is the key to high quality peer review is fairly
explicit in some writings,7-9 and developers
of an instrument for rating review quality recently focused only on
"completeness" and not on "whether the reviewer's judgment was
correct."10 A panel on peer review for
the US National Institutes of Health acknowledged an overly critical
climate, stating "Peer reviewers should eschew the common current
tendency to find fault."11 Finding
flaws is certainly important, and scepticism and disputation are
revered in scientific tradition. But when criticism is an end in
itself rather than a tool for advancing knowledge, when finding flaws
is imperative rather than the natural result of careful review with
an open mind, then mistaken criticisms will arise.
The problem may be more acute in statistics because of two factors that are
synergistic both with each other and with the need to criticise. The
first is that reviewers see statistics as a rich area for finding
mistakes. This perception is correct, because statistical errors are
common. But areas such as sample size and multiple comparisons can be
reflexively subjected to unfair and unhelpful criticism. In the case
of clinical trials Meinert lists many other "universal" criticisms.12
The second factor is many reviewers' poor understanding of
statistics,2 especially the belief that
rules must be blindly followed. I am dismayed by how often my clients
ask whether a particular approach would be "legal" or "against the
rules" rather than "accurate" or "misleading." This misunderstanding
of statistics as a body of seemingly arbitrary dogma leads many
reviewers to perceive violations even when the research has not
actually been harmed.
Finally, another pair of synergistic factors apply to peer review generally.
The first is the frequent need to rush reviewing. This seems unlikely
to improve, given increasing emphasis on documented productivity and
the accelerating pace of life generally.13 The second, perhaps more
important, is the lack of incentives.
One recent editorial noted, "It is generally admitted that being a
good referee does not lead to any tangible rewards with respect to
career advancement."14 Another noted that "the integrity of the
scientific review process requires that the performance of reviewers be
appropriately rewarded" and ended, "We do thank you."15 This gratitude,
while sincere, is emblematic of the inadequate rewards that reviewers
can expect. The only likely concrete consequence of good reviewing is
future requests for more reviews.
What might help?
Aside from widespread improvement in understanding of statistical methods (a
worthy goal), care by reviewers and changes in peer review systems
and culture may reduce mistaken statistical criticisms and improve
peer review generally.
As a reviewer, criticise statistical flaws only when you can explain how
they specifically detract from the study. If you suspect a problem
but cannot meet this criterion, recommend further review by an
expert statistician.
Raise criticisms in swampy areas such as sample size and multiple
comparisons only when unavoidable. For sample size, this will
usually mean that a proposal requires a clearly unrealistic
scenario to achieve the stated goals. Keep in mind that a meaningful
increase in sample size may be impossible, so the alternative
is a sample size of zero, that
is, not doing the study. Also remember that research in new areas
must start somewhere, even when there are no preliminary data for
sample size calculations. Concerns about sample size after a study
is done can generally be refocused more directly on whether the
authors have properly presented and interpreted the uncertainty in
their results, particularly negative findings.
Changing the system
Any substantial improvement will probably require changes in the
manuscript and grant review processes. Statistical reviewers are
already used to some extent.2 While much
wider use may not be possible, obtaining a statistical review
whenever subject reviewers raise statistical concerns might be
workable. Research on double blinded and non-anonymous peer review
has found that these variations in who knows whose
identity have little effect,16-19 but mistaken criticism has not
been directly addressed. Training or guidelines may help,
particularly if they warn against stretching to find
criticisms.
A change that would perhaps improve peer review even more would be to
evaluate its quality and reward good performance. Because meaningful
grading must reflect the substance of the reviews, including whether
criticisms are correct and whether serious flaws have been
overlooked, fellow reviewers of the same paper are perhaps best
positioned to rate each other's performance. This would also promote
reflection on one's own performance.
A simple form of reward would be to supplement long annual lists of all
reviewers with much shorter honour rolls of those who have provided
high quality reviews. Multiple honour rolls could address different
aspects, such as helpfulness to editors, high ratings from fellow
reviewers, or good marks from rejected authors on constructiveness.
Paying attention to review quality might result in cultural changes.
For example, top academic institutions may come to see failure to
make at least one honour roll of a relevant journal as a serious
weakness.
Conclusion
Peer review is a key part of the collective scientific process. Expecting it
to work well on donated time, with little training and even less
accountability or incentives, seems unrealistic. Changes in the
systems and culture of peer review might improve things, notably less
pressure to criticise, more training in reviewing skills, and less
statistical dogmatism. The most promising change might be to better
reward good performance.
Acknowledgments
I thank Professors Douglas G Altman and Steven N Goodman for helpful comments
on an early draft of this paper.
References
Godlee F, Gale CR, Martyn CN. Effect on the quality of peer
review of blinding reviewers and asking them to sign their reports. JAMA
1998; 280: 237-240.
Black N, van Rooyen S, Godlee F, Smith R, Evans S. What
makes a good reviewer and a good review for a general medical journal?
JAMA 1998; 280: 231-233.
Alberts BM, Ayala FJ, Botstein D, Frank E, Holmes EW, Lee
RD, et al. Recommendations for change at the NIH's Center for Scientific
Review: phase 1 report.
www.csr.nih.gov/bioopp/select.htm (accessed 17 September 2001).
van Rooyen S, Godlee F, Evans S, Black N, Smith R. Effect
of open peer review on quality of reviews and on reviewers' recommendations:
a randomised trial. BMJ 1999; 318: 23-27.