Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative

Sander Greenland, Department of Epidemiology and Department of Statistics, University of California, Los Angeles, Los Angeles, CA
This article summarizes arguments against the use of power to analyze data, and illustrates a key pitfall: Lack of statistical significance (e.g., p > .05) combined with high power (e.g., 90%) can occur even if the data support the alternative more than the null. This problem arises via selective choice of parameters at which power is calculated, but can also arise if one computes power at a prespecified alternative. As noted by earlier authors, power computed using sample estimates ("observed power") replaces this problem with even more counterintuitive behavior, because observed power effectively double counts the data and increases as the P value declines. Use of power to analyze and interpret data thus needs more extensive discouragement. Ann Epidemiol 2012;22:364–368. © 2012 Elsevier Inc. All rights reserved.
KEY WORDS: Counternull, Power, Significance, Statistical Methods, Statistical Testing.
Use of power for data analysis (post hoc power) has a long history in epidemiology. Over the decades, however, many authors have criticized such use, noting that power provides no valid information beyond that seen in P values and confidence limits. Despite these criticisms, recommendations favoring post hoc power have appeared in many textbooks, articles, and journal instructions, especially as a purported aid for interpreting a "nonsignificant" test of the null. Although such recommendations have dwindled in mainstream journals, as Hoenig and Heisey (6) note, a search on "power" through journal archives reveals that the practice and its encouragement survive. Furthermore, it is still common in internal reports, especially for litigation, where it may be used to buttress claims of study adequacy when in fact the study has inadequate numbers to reach any conclusion.

Statistical power is the probability of rejection ("significance") when a given non-null value (the alternative) is correct. That is, power is the probability that p < α under the alternative, where α is a given maximum allowable type I error (false-positive) rate. Among the problems with power computed from completed studies are these:

1. Irrelevance: Power refers only to future studies done on populations that look exactly like our sample with respect to the estimates from the sample used in the power calculation; for a study as completed (observed), it is analogous to giving odds on a horse race after seeing the outcome.

2. Arbitrariness: There is no convention governing the free parameters (parameters that must be specified by the analyst) in power calculations beyond the α-level.

3. Opacity: Power is more counterintuitive to interpret correctly than P values and confidence limits. In particular, high power plus "nonsignificance" does not imply that the data or evidence favor the null.

The charge of irrelevance can be made against all frequentist statistics (which refer to frequencies in hypothetical repetitions), but can be deflected somewhat by noting that confidence intervals and one-sided p values have straightforward single-sample likelihood and Bayesian posterior interpretations. I therefore review the arbitrariness and opacity issues with the goal of illustrating them in simple numerical terms. I then review how "observed power" (power computed using sample estimates), which is supposed to address the arbitrariness issue, aggravates the opacity issue. Like many predecessors, I conclude that post hoc power is unsalvageable as an analytic tool, despite any value it has for study planning.
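To make the definition of power given above concrete, here is a minimal sketch in Python (my own illustration, not from the article; it assumes NumPy and SciPy are available, and the trial sizes and rates are illustrative, chosen to resemble the hypothetical example used later). It estimates power by simulating two-arm trials under an assumed alternative and counting how often p < 0.05, then compares the result with the normal-approximation formula given in the Appendix.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def wald_p_null(a, n1, c, n0):
    """Two-sided Wald p value for RR = 1 from a 2x2 table:
    a cases among n1 treated, c cases among n0 controls."""
    if a == 0 or c == 0:                     # continuity correction for empty cells
        a, c, n1, n0 = a + 0.5, c + 0.5, n1 + 0.5, n0 + 0.5
    b_hat = np.log((a / n1) / (c / n0))      # log risk-ratio estimate
    s = np.sqrt(1/a + 1/c - 1/n1 - 1/n0)     # its standard error
    return 2 * stats.norm.sf(abs(b_hat) / s)

# Illustrative alternative: baseline risk 3.2%, true RR = 2, 1000 patients per arm.
n1 = n0 = 1000
r0, rr = 0.032, 2.0
sims = 20_000
rejections = sum(
    wald_p_null(rng.binomial(n1, r0 * rr), n1, rng.binomial(n0, r0), n0) < 0.05
    for _ in range(sims)
)
print("simulated power:   ", rejections / sims)

# Closed-form normal approximation (Appendix formula 5), using expected case counts;
# the two numbers agree to within simulation error (roughly 0.9 here).
b = np.log(rr)
s = np.sqrt(1/(n1*r0*rr) + 1/(n0*r0) - 1/n1 - 1/n0)
print("approximate power: ", stats.norm.cdf(b/s - 1.96) + stats.norm.cdf(-b/s - 1.96))
```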
A P value has no free parameter and a confidence interval has only one, α, which is inevitably taken to be 0.05. In contrast, power calculations require specification of the alternative and at least one background parameter (e.g., baseline incidence); because there is no convention regarding their choice, power can be manipulated far more easily than a p value or a confidence interval. The reason for lack of convention is not hard to understand: The alternative and any background parameter are too context specific (even more context specific than an α-level).
The following example, although extreme, is real and illustrates the plasticity of power calculations compared with P values and confidence intervals. While serving as a plaintiff statistical expert concerning data on the relation of gabapentin to suicidality, I was asked to review pooled data from randomized trials as used in a U.S. Food and Drug Administration (FDA) alert and report regarding suicidality risk from anti-epileptics (the class of drugs to which gabapentin belongs) and defense expert calculations. The defense expert statistician (a full professor of biostatistics at a major university and ASA Fellow) wrote:

    Assuming that the base-rate of suicidality among placebo controlled subjects is 0.22% as stated in the FDA alert, we would have power of 80% to detect a statistically significant effect of gabapentin relative to placebo for gabapentin alone in the 4932 subjects (2903 on drug and 2029 on placebo) used by FDA in their analysis, once the rate for gabapentin reached 0.70%, or a relative risk of 3.18. This computation reveals that even for the subset of gabapentin data used by FDA in their analysis, a significant difference between gabapentin and placebo would have been consistently detected for gabapentin alone, once the incidence was approximately three times higher in gabapentin-treated subjects.
The computation and conclusion do not withstand scrutiny. With regard to problem 2 above, note that

(a) There were only 3 cases observed in the 28 placebo-controlled gabapentin trials contributing to these numbers, and only one case among the placebo groups; thus, the actual observed baseline rate in the gabapentin trials was 1/2029 = 0.05%. The figure of 0.22% used in the expert's calculation was more than four times this rate; it is not from placebo-controlled trials of gabapentin, but is instead from all 16,029 placebo controls in 199 randomized trials of all types of anti-epileptics. The gabapentin trial controls are only 2029 of 16,029 or 13% of these controls; furthermore, only 7% of the gabapentin trial patients were psychiatric (high suicide risk), compared with 29% of patients in other trials (13, Table 8), so the lower rate in gabapentin controls is unsurprising.

(b) The value of the relative risk (RR) as 3.18 in the power calculation is back-calculated to produce 80% power, rather than determined from context; for example, there was no plaintiff claim that an effect this large was present. In many legal contexts, a guideline used for tort decisions is instead RR = 2, based on the common notion that this represents a (2 − 1)/2 = 50% individual probability of causation. This notion is incorrect in general, but tends to err on the low side of the actual probability of causation at RR = 2; thus, RR = 2 is still useful as a pragmatic upper bound on the RR needed to yield 50% probability of causation.

If one uses the baseline rate of 0.22% cited by the expert, the power for detecting RR = 2 is under 25%; if one uses instead the 0.05% seen in the gabapentin trials, the power for detecting RR = 2 is under 10%. Thus the power reported by the defense expert was maximized by first taking the higher-risk population as the source of the baseline rate, and then finding an RR that would yield the desired power.
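As a rough check on these figures, the sketch below recomputes the power for detecting RR = 2 under each assumed baseline rate, using expected case counts and the normal approximation to the log risk ratio described in the Appendix. The exact method behind the quoted numbers is not stated, so this is only an approximation, but it lands in the same range (roughly 24% and 9%).

```python
from math import log, sqrt
from scipy.stats import norm

def power_rr(rr_alt, baseline, n_drug, n_placebo, alpha=0.05):
    """Approximate power of the two-sided Wald test of RR = 1 against RR = rr_alt,
    using expected case counts and the normal approximation to the log risk ratio
    (Appendix formula 5)."""
    a = n_drug * baseline * rr_alt        # expected cases on drug
    c = n_placebo * baseline              # expected cases on placebo
    s = sqrt(1/a + 1/c - 1/n_drug - 1/n_placebo)
    z = norm.ppf(1 - alpha/2)
    b = log(rr_alt)
    return norm.cdf(b/s - z) + norm.cdf(-b/s - z)

# 2903 subjects on gabapentin, 2029 on placebo (figures quoted in the text).
print(power_rr(2.0, 0.0022, 2903, 2029))   # ~0.24 with the 0.22% baseline
print(power_rr(2.0, 0.0005, 2903, 2029))   # ~0.09 with the 0.05% baseline
```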
Regardless of one's preference, the figures illustrate the dramatic sensitivity of the power calculations to debatable choices. Of course, all the powers are arguably irrelevant to inference (problem 1). The mid-P 95% odds-ratio confidence limits (8, Ch. 14) from the same combined data are 0.11 and 41, whereas the approximate risk-ratio limits (8, Ch. 14) after adding ½ to each cell are 0.15 and 8.8, both showing that there is almost no information in the gabapentin trials about the side effect at issue.

In the previous example, the low adverse event rate in controls severely limited the actual (before-trial) power and after-trial precision. However, genuinely high power can coincide with nonsignificance, regardless of whether the power is computed before the study or from the data under analysis. This phenomenon seems to especially challenge intuitions. Hence, I provide a simple, hypothetical example (with reasonable rates for common safety-evaluation settings) in which there is high power for RR = 2 and the P value for testing RR = 1 (the null P value) exceeds the usual significance cutoff α of 0.05, yet standard statistical measures of evidence favor the alternative (RR = 2) over the null (RR = 1). The example is designed to exclude other issues such as bias, with a rare outcome and large case numbers to keep the computations simple (although the figures resemble those seen in large postmarketing studies).

Suppose a series of balanced trials randomize 1000 patients to a new treatment and 1000 to placebo treatment, with no protocol violations, losses, unmasking, and so on, and that the combined results are as shown in Table 1.

TABLE 1. Hypothetical randomized trial data exhibiting "nonsignificance" and high power, yet evidential measures favor the alternative over the null

                  Events   Patients   Risk
New treatment         48       1000   4.8%
Placebo               32       1000   3.2%

From conventional 2 × 2 table formulas treating the log RR estimate as an approximately normal variate (see Appendix), P = .07 (and thus "not significant at the .05 level") for the null hypothesis that the RR is 1. Assuming the 32 events observed in the placebo arm were as expected, the power for RR = 2 at α = 0.05 computed from these data is over 85%.
Based on these results, do the data favor RR = 1 over RR = 2? Here are some relevant statistics to answer the question:

a) The RR estimate is 1.50; in proportional terms, 1.50 is closer to 2 than to 1.

b) The 95% confidence limits are 0.97 and 2.33; in proportional terms, 1 is closer to the lower limit than 2 is to the upper limit.

c) The likelihood ratio comparing RR = 2 vs. RR = 1 is about 2.3 (see Appendix), favoring RR = 2.

d) The P value for RR = 2 is 0.20, 3 times the p value for RR = 1.

e) The value of RR having the same p value and likelihood as the null (the "counternull") is about 1.5² = 2.25, which is further from the RR estimate than is 2.

Thus, despite "nonsignificance" (p > .05 for RR = 1) and power approaching 90% for RR = 2 at α = 0.05, the results favor RR = 2 over RR = 1 whether one compares them using the point estimate, the confidence interval, their likelihoods, their p values, or the counternull value.
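These figures can be reproduced (to rounding) from the normal approximation described in the Appendix; the sketch below does so for the Table 1 data. The helper names are mine, and SciPy is assumed.

```python
from math import exp, log, sqrt
from scipy.stats import norm

# Table 1 data: 48 events among 1000 treated, 32 events among 1000 placebo patients.
a, n1, c, n0 = 48, 1000, 32, 1000
b_hat = log((a / n1) / (c / n0))          # log risk-ratio estimate, ln(1.5)
s = sqrt(1/a + 1/c - 1/n1 - 1/n0)         # its estimated standard deviation

def p_two_sided(rr):                      # Appendix formula 3
    return 2 * norm.cdf(-abs(b_hat - log(rr)) / s)

def power(rr, z=1.96):                    # Appendix formula 5
    return norm.cdf(log(rr)/s - z) + norm.cdf(-log(rr)/s - z)

def likelihood_ratio(rr2, rr1):           # Appendix formula 6
    return exp(-((log(rr2) - b_hat)**2 - (log(rr1) - b_hat)**2) / (2 * s**2))

print("RR estimate:        ", round(exp(b_hat), 2))                    # 1.50
print("95% CI:             ", round(exp(b_hat - 1.96*s), 2),
                              round(exp(b_hat + 1.96*s), 2))           # 0.97, 2.33
print("p for RR = 1:       ", round(p_two_sided(1.0), 2))              # 0.07
print("p for RR = 2:       ", round(p_two_sided(2.0), 2))              # 0.20
print("LR, RR = 2 vs RR = 1:", round(likelihood_ratio(2.0, 1.0), 1))   # ~2.3
print("counternull RR:     ", round(exp(2 * b_hat), 2))                # 2.25
print("power at RR = 2:    ", round(power(2.0), 2))                    # ~0.87
print("observed power:     ", round(power(exp(b_hat)), 2))             # ~0.44 (next section)
```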
To avoid the arbitrariness problem, post hoc power analyses often focus on "observed power," that is, the power computed using the point estimates of the parameters in the calculation (the baseline rate and effect size). One problem with observed power is that it will make most any study look underpowered: In approximately normal situations with α = 0.05, such as those common in epidemiologic studies and clinical trials, the observed power will usually be less than 50% when p > α (although moderate exceptions can occur). In the hypothetical example, the observed power is only about 45%.

Observed power is plagued by nonintuitive behavior, traceable to the fact that the alternative used in an observed power calculation varies randomly and may be contextually irrelevant; hence, the observed power is also random like a p value, rather than fixed in advance as in ordinary power calculations. One consequence is that, just as a p value can be far from the false-positive (type I error) rate of the test, so observed power can be far from the true-positive rate (sensitivity) of the test.

Even more startling is the "power approach paradox" detailed by Hoenig and Heisey (6): Among nonsignificant results, those with higher observed power are commonly interpreted as stronger evidence for the null, when in fact just the opposite is the case. Observed power is merely a fixed transform of the p value, which grows as the p value shrinks; thus, higher observed power corresponds with a lower P value and lower relative likelihood for the null. In other words, higher observed power implies more evidence against the null by common evidence measures, even if the evidence is "nonsignificant" by ordinary testing conventions.

Observed power also involves and encourages a double counting of data. To illustrate, consider the following statement: "We observed no significant difference (p = .10) despite high power." Introducing observed power alongside p gives the impression that one has two pieces of information relevant to the null. But because observed power is merely a fixed transform of the null p value, it adds no new statistical information; it is just an awkward rescaling of the null p value that is even harder to interpret correctly than that p value (which is notorious for its misinterpretation, even though one-sided p values do have simple Bayesian interpretations). In contrast, confidence limits cannot be constructed from a single p value, and thus do supply additional and more easily interpreted information beyond the null p value.
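Under the normal approximation, observed power is a fixed, monotone transform of the two-sided null p value, so it can be written directly as a function of p. The sketch below (my own illustration, not from the article) shows that p = 0.05 corresponds to observed power of about 50% and that observed power only rises as p falls, which is why reporting it alongside p adds nothing.

```python
from scipy.stats import norm

def observed_power(p, alpha=0.05):
    """Observed power implied by a two-sided p value under the normal
    approximation: substitute |b_hat|/s = z_p into Appendix formula 5."""
    z_alpha = norm.ppf(1 - alpha/2)
    z_p = norm.ppf(1 - p/2)
    return norm.cdf(z_p - z_alpha) + norm.cdf(-z_p - z_alpha)

for p in (0.30, 0.10, 0.07, 0.05, 0.01):
    print(f"p = {p:.2f}  ->  observed power = {observed_power(p):.2f}")
# p = 0.05 gives observed power of about 0.50; smaller p values always give
# higher observed power, never lower.
```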
There are elements of arbitrariness in all analyses. For all their problems, conventions are an obstacle to manipulation of results. Thus, although a p value can vary tremendously depending on what value of a measure (such as RR) is being tested, convention has decreed the null p value (e.g., for RR = 1) as one that must be included if testing is done. Of course, such conventions have side effects, and arguably many of the objections to statistical testing and p values stem from the focus on null testing. But, as with power, these objections would be partially addressed if a conventional alternative value were always tested as well (e.g., RR = ½ or RR = 2 depending on the direction observed).

Likewise, the convention of fixing the test criterion α at 0.05 is arbitrary, but has likely prevented its manipulation. This convention has carried over into interval estimation as the nearly universal 95% level seen in both confidence intervals and posterior intervals, and has remained in place despite attempts to unseat it by using a 90% level. From a precision perspective, however, shifting to 90% has modest implications, as it narrows approximate normal intervals by only 1 − 1.645/1.960 = 16%; furthermore, the reader is warned of this narrowing by the statement of 90% accompanying the interval. In contrast, power changes arising from shifts in the baseline rate or alternative can have far more spectacular impact, and yet come with no reference point, simple calculation, or even intuition to warn of this impact.
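A small numerical illustration of this contrast (my own, reusing the gabapentin trial sizes from the earlier example):

```python
from math import log, sqrt
from scipy.stats import norm

# Moving from a 95% to a 90% interval narrows a normal-approximation
# interval by a fixed, visible fraction:
print(1 - norm.ppf(0.95) / norm.ppf(0.975))   # ~0.16, i.e., about 16%

# Changing an unstated background assumption moves power far more.
def power_rr2(baseline, n_drug=2903, n_placebo=2029):
    a, c = n_drug * baseline * 2, n_placebo * baseline   # expected cases at RR = 2
    s = sqrt(1/a + 1/c - 1/n_drug - 1/n_placebo)
    return norm.cdf(log(2)/s - 1.96) + norm.cdf(-log(2)/s - 1.96)

print(power_rr2(0.0022), power_rr2(0.0005))   # ~0.24 vs ~0.09
```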
The latter arbitrariness problem has led to use of observed power, which brings a host of its own problems. Nonetheless, one might ask if observed power or the like remains useful for speculating how much power a future study would have. I would question even that much utility: The observed data are almost never the only source of information on which to base such a forecast. The alternative of interest should be at least partly determined by what effect size is considered important or worth detecting, rather than the noisy and possibly biased estimate observed from existing data.

Calculating power from data using a fixed alternative of genuine interest is a partial answer to the problems of observed power, but brings back the arbitrariness issue. And it still depends on study-peculiar features (such as the observed baseline rate and exposure allocation ratio or prevalence) that would be unlikely to apply to a different study population. In fact, it could be advantageous to alter these features for future studies, as power can be sensitive to design choices like allocation ratios (or case-control ratios in case-control studies), which can be improved relative to past studies.

In sum, use of power in data analysis and interpretation (as opposed to research proposals) is more prone to grave misinterpretation than are other statistics. Chief among these is the mistake that "high power" in the face of nonsignificance means the null is better supported than the alternative, a mistake still exploited in unpublished reports even if no longer common in epidemiologic articles. Thus, contrary to some articles but in agreement with many others, I argue that power analysis is only useful in discussing sample-size requirements of further studies; if there are specific alternatives of interest in an analysis, the P value for those alternatives should be given in place of power. This means, in particular, that we need to accustom ourselves and students to concepts (such as power and smallest detectable effect) that can be detrimental to inference from existing data even if they are useful for study planning.

The problem of "underpowered studies" that post hoc power is supposed to address is an artifact of focusing on whether p < α (fixed-level testing) in individual studies. A study can contribute useful data no matter how small and underpowered it is, as long as it is interpreted with proper accounting for its final imprecision. Once its data are in, "underpowered" needs to be replaced by its post-trial analog, imprecision, a problem immediately evident and addressed when using confidence intervals. Unlike p values and power, those intervals also supply the minimum information needed to combine individual study results in a meta-analysis, which is the most direct way of addressing imprecision.
REFERENCES

1. Beaumont JJ, Breslow NE. Power considerations in epidemiologic studies of vinyl chloride workers. Am J Epidemiol. 1981;114:725–734.
2. Cox DR. The planning of experiments. New York: Wiley; 1958.
3. Greenland S. On sample-size and power calculations for studies using confidence intervals. Am J Epidemiol. 1988;128:231–237.
4. Smith AH, Bates M. Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology.
5. Goodman SN, Berlin J. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results.
6. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat. 2001;55:19–24.
7. Senn S. Power is indeed irrelevant in interpreting completed studies.
8. Rothman KJ, Greenland S, Lash TL, eds. Modern epidemiology. 3rd ed. Philadelphia: Lippincott-Wolters-Kluwer; 2008.
9. Hooper R. The Bayesian interpretation of a P-value depends only weakly on statistical power in realistic situations. J Clin Epidemiol. 2009;62.
10. Halpern SD, Barton TD, Gross R, Hennessy S, Berlin JA, Strom BL. Epidemiologic studies of adverse effects of anti-retroviral drugs: how well is statistical power reported? Pharmacoepidemiol Drug Safety. 2005;14.
11. Cox DR, Hinkley DV. Theoretical statistics. New York: Chapman and Hall; 1974.
12. Casella G, Berger RL. Reconciling Bayesian and frequentist evidence in the one-sided testing problem. J Am Stat Assoc. 1987;82:106–111.
13. Office of Biostatistics. Statistical review and evaluation: antiepileptic drugs and suicidality. Bethesda, MD: U.S. Food and Drug Administration; 2008.
14. Gibbons RD. Supplemental expert report of March 19, 2009 in re: Neurontin Marketing, Sales and Liability Litigation, U.S. District Court of Massachusetts (Case 1:04-cv-10981-PBS).
15. Robins JM, Greenland S. The probability of causation under a stochastic model for individual risks. Biometrics. 1989;46:1125–1138 [erratum].
16. Greenland S. The relation of the probability of causation to the relative risk and the doubling dose: a methodologic error that has become a social problem. Am J Public Health. 1999;89:1166–1169.
17. Greenland S, Robins JM. Epidemiology, justice, and the probability of causation. Jurimetrics. 2000;40:321–340.
18. Rosenthal R, Rubin DB. The counternull value of an effect size: a new statistic. Psychol Sci. 1994;5:329–334.
19. Sellke T, Bayarri MJ, Berger JO. Calibration of p values for testing precise null hypotheses. Am Stat. 2001;55:62–71.
20. Goodman SN. A dirty dozen: twelve P-value misconceptions. Semin Hematol.
21. Greenland S, Poole C. Problems in common interpretations of statistics in scientific articles, expert reports, and testimony. Jurimetrics. 2011;51.
22. Rothman KJ. Modern epidemiology. Boston: Little Brown; 1986.
23. Moher D, Schulz KF, Altman DG. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA. 2001;285:1987–1991.
24. Poole C. Low P values or narrow confidence intervals: which are more durable? Epidemiology. 2001;12:291–294.

APPENDIX

Statistics for Table 1 were computed from the usual normal approximation to the log risk-ratio estimator b̂, where b is the log risk-ratio parameter ln(RR) (8, Ch. 14). Suppose the sample (observed) log risk ratio is b̂ and the estimated asymptotic standard deviation of b̂ is s. Let Φ(z) be the standard cumulative normal distribution (area below z); then Φ(−z) = 1 − Φ(z) is its complement, and the following approximations are useful for tables with large case numbers:

1) The 95% confidence limits for RR are exp(b̂ ± 1.96s).
2) The one-sided P values for RR ≤ e^b and RR ≥ e^b are Φ((b − b̂)/s) and Φ((b̂ − b)/s), respectively.
3) The two-sided P value for RR = e^b is 2Φ(−|b̂ − b|/s).
4) The rejection rates of the one-sided 0.025-level tests of RR ≤ 1 and RR ≥ 1 given RR = e^b are Φ(b/s − 1.96) and Φ(−b/s − 1.96), respectively.
5) The power of the two-sided 0.05-level test of RR = 1 given RR = e^b is the sum of the one-sided 0.025-level rejection rates, Φ(b/s − 1.96) + Φ(−b/s − 1.96).
6) The likelihood ratio for RR2 = exp(b2) relative to RR1 = exp(b1) is exp(−[(b2 − b̂)² − (b1 − b̂)²]/(2s²)).

Statistics for Table 1 were computed using b̂ = ln(1.5) and s = (1/48 + 1/32 − 2/1000)^½ in these formulas. Because of the large case numbers, using the two-binomial likelihood for the table instead of the normal approximation changes the answers only slightly; for example, the approximate ratio of likelihoods for RR = 2 versus RR = 1 is 2.3, and the two-binomial ratio is close to this value.
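For readers who want to apply these formulas directly, here is a compact transcription into Python (the function names are mine; SciPy supplies the normal distribution), evaluated at the Table 1 values b̂ = ln(1.5) and s = (1/48 + 1/32 − 2/1000)^½.

```python
from math import exp, log, sqrt
from scipy.stats import norm

def ci95(b_hat, s):                          # formula 1
    return exp(b_hat - 1.96*s), exp(b_hat + 1.96*s)

def p_one_sided(b_hat, s, rr):               # formula 2: (P for RR <= rr, P for RR >= rr)
    z = (b_hat - log(rr)) / s
    return norm.cdf(-z), norm.cdf(z)

def p_two_sided(b_hat, s, rr):               # formula 3
    return 2 * norm.cdf(-abs(b_hat - log(rr)) / s)

def rejection_rates(b, s):                   # formula 4 (alternative RR = exp(b))
    return norm.cdf(b/s - 1.96), norm.cdf(-b/s - 1.96)

def power_two_sided(b, s):                   # formula 5: sum of the one-sided rates
    return sum(rejection_rates(b, s))

def likelihood_ratio(b_hat, s, rr2, rr1):    # formula 6
    return exp(-((log(rr2) - b_hat)**2 - (log(rr1) - b_hat)**2) / (2 * s**2))

# Table 1 values:
b_hat, s = log(1.5), sqrt(1/48 + 1/32 - 2/1000)
print(ci95(b_hat, s))                        # ~(0.97, 2.33)
print(p_two_sided(b_hat, s, 1.0))            # ~0.07
print(power_two_sided(log(2), s))            # ~0.87
print(likelihood_ratio(b_hat, s, 2.0, 1.0))  # ~2.26, i.e., about 2.3
```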