
## The Insignificance of Statistical Significance Testing (1999)

Venue: Journal of Wildlife Management

Citations: 92 (0 self)

### Citations

11779 | Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale: Lawrence Erlbaum. - Cohen - 1988

2531 | The Foundations of Statistics. - Savage - 1954
Citation Context ...tions of the investigator (Berger and Berry 1988). Hence, P, the outcome of a statistical hypothesis test, depends on results that were not obtained, that is, something that did not happen, and what the intentions of the investigator were. Are Null Hypotheses Really True? P is calculated under the assumption that the null hypothesis is true. Most null hypotheses tested, however, state that some parameter equals zero, or that some set of parameters are all equal. These hypotheses, called point null hypotheses, are almost invariably known to be false before any data are collected (Berkson 1938, Savage 1957, Johnson 1995). If such hypotheses are not rejected, it is usually because the sample size is too small (Nunnally 1960). [Source: http://www.npwrc.usgs.gov/resource/1999/statsig/stathyp.htm, retrieved 12/6/2002] To see if the null hypotheses being tested in The Journal of Wildlife Management can validly be considered to be true, I arbitrarily selected two issues: an issue from the 1996 volume, the other from 1998. I scanned the results section of each paper, looking for P-values. For each P-value I found, I looked back to see what hypothesis was being tested...

2091 | The Logic of Scientific Discovery. - Popper - 1968
Citation Context ...known a priori to be false. The confusion of the 2 types of hypotheses has been attributed to the pervasive influence of R. A. Fisher, who did not distinguish them (Schmidt and Hunter 1997). Scientific hypothesis testing dates back at least to the 17th century: in 1620, Francis Bacon discussed the role of proposing alternative explanations and conducting explicit tests to distinguish between them as the most direct route to scientific understanding (Quinn and Dunham 1983). This concept is related to Popperian inference, which seeks to develop and test hypotheses that can clearly be falsified (Popper 1959), because a falsified hypothesis provides greater advance in understanding than does a hypothesis that is supported. Also similar is Platt's (1964) notion of strong inference, which emphasizes developing alternative hypotheses that lead to different predictions. In such a case, results inconsistent with predictions from a hypothesis cast doubt on its validity. Examples of scientific hypotheses, which were considered credible, include Copernicus' notion HA: the Earth revolves around the sun, versus the conventional wisdom of the time H0: the sun revolves around the Earth. Another example is Fer...

1886 | Statistical Decision Theory and Bayesian Analysis. - Berger - 1985
Citation Context ...othesis) fall where it may. In ecological situations, however, a Type II error may be far more costly than a Type I error (Toft and Shea 1983). As an example, approving a pesticide that reduces the survival rate of an endangered species by 5% may be disastrous to that species, even if that change is not statistically detectable. As another, continued overharvest in marine fisheries may result in the collapse of the ecosystem even while statistical tests are unable to reject the null hypothesis that fishing has no effect (Dayton 1998). Details on decision theory can be found in DeGroot (1970), Berger (1985), and Pratt et al. (1995). Model Selection Statistical tests can play a useful role in diagnostic checks and evaluations of tentative statistical models (Box 1980). But even for this application, competing tools are superior. Information criteria, such as Akaike's, provide objective measures for selecting among different models fitted to a data set. Burnham and Anderson (1998) provided a detailed overview of model selection procedures based on information criteria. In addition, for many applications it is not advisable to select a "best" model and then proceed as if that model was correct. The... |

1156 | Optimal Statistical Decisions. - DeGroot - 1970

746 | Bayesian Inference in Statistical Analysis. - Box, Tiao - 1992
Citation Context ...considering the entire set of models, each perhaps weighted by its own strength of evidence (Buckland et al. 1997). Bayesian Approaches Bayesian approaches offer some alternatives preferable to the ordinary (often called frequentist, because they invoke the idea of the long-term frequency of outcomes in imagined repeats of experiments or samples) methods for hypothesis testing as well as for estimation and decision-making. Space limitations preclude a detailed review of the approach here; see Box and Tiao (1973), Berger (1985), and Carlin and Louis (1996) for longer expositions, and Schmitt (1969) for an elementary introduction. Sometimes the value of a parameter is predicted from theory, and it is more reasonable to test whether or not that value is consistent with the observed data than to calculate a confidence interval (Berger and Delampady 1987, Zellner 1987). For testing such hypotheses, what is usually desired (and what is sometimes believed to be provided by a statistical hypothesis test) is Pr[H0 | data]. What is obtained, as pointed out earlier, is P = Pr[observed or more extreme data | H0]. ...

511 | Applied Statistical Decision Theory. - Raiffa, Schlaifer - 1961
Citation Context ... it may. In ecological situations, however, a Type II error may be far more costly than a Type I error (Toft and Shea 1983). As an example, approving a pesticide that reduces the survival rate of an endangered species by 5% may be disastrous to that species, even if that change is not statistically detectable. As another, continued overharvest in marine fisheries may result in the collapse of the ecosystem even while statistical tests are unable to reject the null hypothesis that fishing has no effect (Dayton 1998). Details on decision theory can be found in DeGroot (1970), Berger (1985), and Pratt et al. (1995). Model Selection Statistical tests can play a useful role in diagnostic checks and evaluations of tentative statistical models (Box 1980). But even for this application, competing tools are superior. Information criteria, such as Akaike's, provide objective measures for selecting among different models fitted to a data set. Burnham and Anderson (1998) provided a detailed overview of model selection procedures based on information criteria. In addition, for many applications it is not advisable to select a "best" model and then proceed as if that model was correct. There may be a group of mode... |

477 | Bayes and Empirical Bayes Methods for Data Analysis. - Carlin, Louis - 1996
Citation Context ...considering the entire set of models, each perhaps weighted by its own strength of evidence (Buckland et al. 1997). Bayesian Approaches Bayesian approaches offer some alternatives preferable to the ordinary (often called frequentist, because they invoke the idea of the long-term frequency of outcomes in imagined repeats of experiments or samples) methods for hypothesis testing as well as for estimation and decision-making. Space limitations preclude a detailed review of the approach here; see Box and Tiao (1973), Berger (1985), and Carlin and Louis (1996) for longer expositions, and Schmitt (1969) for an elementary introduction. Sometimes the value of a parameter is predicted from theory, and it is more reasonable to test whether or not that value is consistent with the observed data than to calculate a confidence interval (Berger and Delampady 1987, Zellner 1987). For testing such hypotheses, what is usually desired (and what is sometimes believed to be provided by a statistical hypothesis test) is Pr[H0 | data]. What is obtained, as pointed out earlier, is P = Pr[observed or more extreme data | H0]. Bayes' theorem offers a formula for converti...

454 | Adaptive management of renewable resources. - Walters - 1986
Citation Context ...nsider parameter estimation. Suppose you want to estimate a parameter θ. Then replacing H0 by θ in the above formula yields Pr[θ | data] = Pr[data | θ] Pr[θ] / Pr[data], an expression that shows how initial knowledge about the value of a parameter, reflected in the prior probability function Pr[θ], is modified by data obtained from a study, Pr[data | θ], to yield a final probability function, Pr[θ | data]. This process of updating beliefs leads in a natural way to adaptive resource management (Holling 1978, Walters 1986), a recent favorite topic in our field (e.g., Walters and Green 1997). [Source: http://www.npwrc.usgs.gov/resource/1999/statsig/whatalt.htm, retrieved 12/6/2002] Bayesian confidence intervals are much more natural than their frequentist counterparts. A frequentist 95% confidence interval for a parameter θ, denoted (θ_L, θ_U), is interpreted as follows: if the study were repeated an infinite number of times, 95% of the confidence intervals that resulted would contain the true value θ. It says nothing about the particular study that was actually conducted, which led Howson and Urbach (1991:373) to comment that "statisticians regularly say that one can be '95 per cent confident' that the pa...
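The update rule quoted above (prior belief combined with the likelihood of the data yields the posterior) can be illustrated with a conjugate Beta-binomial model. This is a minimal sketch; the survival counts are invented for illustration and do not come from the paper:

```python
def beta_binomial_update(a, b, successes, failures):
    # Conjugate Bayesian update: a Beta(a, b) prior on a survival rate,
    # combined with binomial data, gives a Beta(a + s, b + f) posterior.
    return a + successes, b + failures

a, b = 1.0, 1.0                              # uniform prior on the rate
a, b = beta_binomial_update(a, b, 45, 5)     # year 1: 45 of 50 animals survive
a, b = beta_binomial_update(a, b, 38, 12)    # year 2: 38 of 50 survive
posterior_mean = a / (a + b)                 # updated estimate of survival
print(round(posterior_mean, 3))              # 0.824
```

Each year's data simply re-enter the same formula, with last year's posterior serving as this year's prior, which is why this style of updating meshes so naturally with adaptive resource management.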

370 | The earth is round (p < .05). - Cohen - 1994
Citation Context ...sential mindlessness in the conduct of research." The famed quality guru W. Edwards Deming (1975) commented that the reason students have problems understanding hypothesis tests is that they may be trying to think. Carver (1978) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else. Guttman (1985) recognized that "In practice, of course, tests of significance are not taken seriously." Loftus (1991) found it difficult to imagine a less insightful way to translate data into conclusions. Cohen (1994:997) noted that statistical testing of the null hypothesis "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" Barnard (1998:47) argued that "... simple P-values are not now used by the best statisticians." These examples are but a fraction of the comments made by statisticians and users of statistics about the role of statistical hypothesis testing. While many of the arguments against significance tests stem from their misuse, rather than their intrinsic values (Mulaik et al. 1997), I believ... |

306 | Experiments in ecology: their logical design and interpretation using analysis of variance. Cambridge. - Underwood - 1997
Citation Context ...iables in an ecological system are intercorrelated, and that any null hypothesis postulating no effect of a variable on another will in fact be false; a statistical test of that hypothesis will be rejected, as long as the sample is sufficiently large. This line of reasoning does not denigrate the value of experimentation in real systems; ecologists should seek situations in which variables thought to be influential can be manipulated and the results carefully monitored (Underwood 1997). Too often, however, experimentation in natural systems is very difficult if not impossible. [Source: http://www.npwrc.usgs.gov/resource/1999/statsig/whyused.htm, retrieved 12/6/2002] Replication is a cornerstone of science. If results from a study cannot be reproduced, they have no credibility...

180 | Testing a point null hypothesis: the irreconcilability of P-values and evidence. - Berger, Sellke - 1987
Citation Context ...eliable" under this interpretation. Alternatively, P can be treated as the probability that the null hypothesis is true. This interpretation is the most direct one, as it addresses head-on the question that interests the investigator. These 3 interpretations are what Carver (1978) termed fantasies about statistical significance. None of them is true, although they are treated as if they were true in some statistical textbooks and applications papers. Small values of P are taken to represent strong evidence that the null hypothesis is false, but workers demonstrated long ago (see references in Berger and Sellke 1987) that such is not the case. In fact, Berger and Sellke (1987) gave an example for which a P-value of 0.05 was attained with a sample of n = 50, but the probability that the null hypothesis was true was 0.52. Further, the disparity between P and Pr[H0 | data], the probability of the null hypothesis given the observed data, increases as samples become larger. In reality, P is Pr[observed or more extreme data | H0], the probability of the observed data or data more extrem...
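The 0.52 figure quoted here can be reproduced under one common set of assumptions (not spelled out in this excerpt): prior probability 1/2 on the point null, a N(0, 1) prior on the mean under the alternative, and unit-variance observations. A sketch of that calculation:

```python
import math

def posterior_prob_null(z, n, prior_null=0.5, tau2=1.0, sigma2=1.0):
    # Bayes factor BF01 for H0: mu = 0 against H1: mu ~ N(0, tau2),
    # where the sample mean is distributed N(mu, sigma2 / n).
    r = n * tau2 / sigma2
    bf01 = math.sqrt(1 + r) * math.exp(-0.5 * z * z * r / (1 + r))
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bf01
    return posterior_odds / (1 + posterior_odds)

# z = 1.96 corresponds to a two-sided P-value of 0.05
print(round(posterior_prob_null(z=1.96, n=50), 2))  # 0.52
```

So data that are "significant at the 0.05 level" can still leave the null hypothesis more likely true than false under a reasonable prior, which is exactly the irreconcilability the title refers to.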

161 | Sampling and Bayes inference in scientific modeling and robustness (with discussion). - Box - 1980
Citation Context ...g a pesticide that reduces the survival rate of an endangered species by 5% may be disastrous to that species, even if that change is not statistically detectable. As another, continued overharvest in marine fisheries may result in the collapse of the ecosystem even while statistical tests are unable to reject the null hypothesis that fishing has no effect (Dayton 1998). Details on decision theory can be found in DeGroot (1970), Berger (1985), and Pratt et al. (1995). Model Selection Statistical tests can play a useful role in diagnostic checks and evaluations of tentative statistical models (Box 1980). But even for this application, competing tools are superior. Information criteria, such as Akaike's, provide objective measures for selecting among different models fitted to a data set. Burnham and Anderson (1998) provided a detailed overview of model selection procedures based on information criteria. In addition, for many applications it is not advisable to select a "best" model and then proceed as if that model was correct. There may be a group of models entertained, and the data will provide different strength of evidence for each model. Rather than basing decisions or conclusions on th... |

151 | The case against statistical significance testing revisited. - Carver - 1993
Citation Context ...est that µ = 0, would suggest that the mean actually recorded was due to chance, and µ could be assumed to be zero (Schmidt and Hunter 1997). Other times, 1 - P is considered the reliability of the result, that is, the probability of getting the same result if the experiment were repeated. Significant differences are often termed "reliable" under this interpretation. Alternatively, P can be treated as the probability that the null hypothesis is true. This interpretation is the most direct one, as it addresses head-on the question that interests the investigator. These 3 interpretations are what Carver (1978) termed fantasies about statistical significance. None of them is true, although they are treated as if they were true in some statistical textbooks and applications papers. Small values of P are taken to represent strong evidence that the null hypothesis is false, but workers demonstrated long ago (see references in Berger and Sellke 1987) that such is not the case. In fact, Berger and Sellke (1987) gave an example for which a P-value of 0.05 was attained with a sample...

125 | Model selection: an integral part of inference. - Buckland, Burnham, et al. - 1997
Citation Context ...applications it is not advisable to select a "best" model and then proceed as if that model was correct. There may be a group of models entertained, and the data will provide different strength of evidence for each model. Rather than basing decisions or conclusions on the single model most strongly supported by the data, one should acknowledge the uncertainty about the model by considering the entire set of models, each perhaps weighted by its own strength of evidence (Buckland et al. 1997). Bayesian Approaches Bayesian approaches offer some alternatives preferable to the ordinary (often called frequentist, because they invoke the idea of the long-term frequency of outcomes in imagined repeats of experiments or samples) methods for hypothesis testing as well as for estimation and decision-making. Space limitations preclude a detailed review of the approach here; see Box and Tiao (1973), Berger (1985), and Carlin and Louis (1996) for longer expositions, and Schmitt (1969) for an elementary introduction. Sometimes the value of a parameter is predicted from theory, and it is more r...
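The model-averaging idea described above (weight each candidate model by its strength of evidence rather than betting on a single "best" one) is often implemented with Akaike weights. A minimal sketch; the AIC values below are hypothetical, not from any study cited here:

```python
import math

def akaike_weights(aic_values):
    # Akaike weights: normalized relative likelihoods of the candidate
    # models, usable for weighting predictions across the whole set.
    best = min(aic_values)
    rel_likelihoods = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel_likelihoods)
    return [r / total for r in rel_likelihoods]

# hypothetical AIC scores for three candidate models
weights = akaike_weights([230.1, 232.4, 238.0])
print([round(w, 3) for w in weights])  # [0.749, 0.237, 0.014]
```

Here even the "best" model carries only about three-quarters of the evidence, so a model-averaged prediction differs meaningfully from one based on that model alone.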

102 | Testing precise hypotheses. - Berger, Delampady - 1987
Citation Context ...e collected that there are real differences in the other examples, which are what Abelson (1997) referred to as "gratuitous" significance testing: testing what is already known. Three comments can be made in favor of the point null hypothesis, such as µ = µ0. First, while such hypotheses are virtually always false for sampling studies, they may be reasonable for experimental studies in which subjects are randomly assigned to treatment groups (Mulaik et al. 1997). Second, testing a point null hypothesis in fact does provide a reasonable approximation to a more appropriate question: is µ nearly equal to µ0 (Berger and Delampady 1987, Berger and Sellke 1987), if the sample size is modest (Rindskopf 1997). Large sample sizes will result in small P-values even if µ is nearly equal to µ0. Third, testing the point null hypothesis is mathematically much easier than testing composite null hypotheses, which involve noncentrality parameters (Steiger and Fouladi 1997). The bottom line on P-values is that they relate to data that were not observed under a model that is known to be false. How meaningful can they be? But they are objective, at least; or are they? P is Arbitrary If the null hypothesis truly is false (as most of those ...

96 | The test of significance in psychological research. - Bakan - 1966
Citation Context ...Table 1 (tail), interpretation of results of a statistical significance test:

| Practical importance of observed difference | Not significant | Significant |
| --- | --- | --- |
| Not important | Happy | Annoyed |
| Important | Very sad | Elated |

Table 2. Interpretation of sample size as related to results of a statistical significance test.

| Practical importance of observed difference | Not significant | Significant |
| --- | --- | --- |
| Not important | n okay | n too big |
| Important | n too small | n okay |

Other Comments on Hypothesis Tests. Statistical hypothesis testing has received an enormous amount of criticism, and for a rather long time. In 1963, Clark (1963:466) noted that it was "no longer a sound or fruitful basis for statistical investigation." Bakan (1966:436) called it "essential mindlessness in the conduct of research." The famed quality guru W. Edwards Deming (1975) commented that the reason students have problems understanding hypothesis tests is that they may be trying to think. Carver (1978) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else. Guttman (1985) recognized that "In practice, of course, tests of significance are not taken seriously." Loftus (1991) found it difficult to imagine a less insightful way to translate data...

86 | Statistical power analysis can improve fisheries research and management. - Peterman - 1990
Citation Context ...(Fig. 1B). Fig 1. Results of a test that failed to reject the null hypothesis that a mean equals 0. Shaded areas indicate regions for which the hypothesis would be rejected. (A) suggests the null hypothesis may well be false, but the sample was too small to indicate significance; there is a lack of power. (B) suggests the data truly were consistent with the null hypothesis. Power Analysis. Power analysis is an adjunct to hypothesis testing that has become increasingly popular (Peterman 1990, Thomas and Krebs 1997). The procedure can be used to estimate the sample size needed to have a specified probability (power = 1 - β) of declaring as significant (at the α level) a particular difference or effect (effect size). As such, the process can usefully be used to design a survey or experiment (Gerard et al. 1998). Its use is sometimes recommended to ascertain the power of the test after a study has been conducted and nonsignificant results obtained (The Wildlife Society 1995). The notion is to guard against wrongly declaring the null hypothesis to be true. Such retrospective power analy...
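The prospective use of power analysis endorsed in this passage reduces, for a two-sample comparison, to a closed-form sample-size formula. This sketch uses the normal (z-test) approximation, so exact t-based software will report slightly larger n:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    # Sample size per group for a two-sided, two-sample z-test to detect
    # a standardized difference `effect_size` with the requested power.
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)           # quantile for desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.5))  # 63 per group for a "medium" effect (Cohen 1988)
```

Used before the study, the same formula shows how quickly required sample sizes grow as the effect one hopes to detect shrinks.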

73 | An introduction to Bayesian inference for ecological research and environmental decision making. - Ellison - 1996
Citation Context ... obtained across valid replications. Whether any or all of the results are statistically significant is irrelevant." Replicated results automatically make statistical significance testing unnecessary (Bauernfeind 1968). Individual studies rarely contain sufficient information to support a final conclusion about the truth or value of a hypothesis (Schmidt and Hunter 1997). Studies differ in design, measurement devices, samples included, weather conditions, and many other ways. This variability among studies is more pervasive in ecological situations than in, for example, the physical sciences (Ellison 1996). To have generality, results should be consistent under a wide variety of circumstances. Meta-analysis provides some tools for combining information from repeated studies (e.g., Hedges and Olkin 1985) and can reduce dependence on significance testing by examining replicated studies (Schmidt and Hunter 1997). Meta-analysis can be dangerously misleading, however, if nonsignificant results or results that did not conform to the conventional wisdom were less likely to have been published. Previous Section -- Why are Hypothesis Tests Used? Return to Contents Next Section -- What are the Alternativ... |

54 | Reversal of the burden of proof in fishery management. - Dayton - 1998
Citation Context ...le letting the probability of a Type II error (accepting a false null hypothesis) fall where it may. In ecological situations, however, a Type II error may be far more costly than a Type I error (Toft and Shea 1983). As an example, approving a pesticide that reduces the survival rate of an endangered species by 5% may be disastrous to that species, even if that change is not statistically detectable. As another, continued overharvest in marine fisheries may result in the collapse of the ecosystem even while statistical tests are unable to reject the null hypothesis that fishing has no effect (Dayton 1998). Details on decision theory can be found in DeGroot (1970), Berger (1985), and Pratt et al. (1995). Model Selection Statistical tests can play a useful role in diagnostic checks and evaluations of tentative statistical models (Box 1980). But even for this application, competing tools are superior. Information criteria, such as Akaike's, provide objective measures for selecting among different models fitted to a data set. Burnham and Anderson (1998) provided a detailed overview of model selection procedures based on information criteria. In addition, for many applications it is not advisable t... |

54 | Noncentral interval estimation and the evaluation of statistical models. In - Steiger, Fouladi - 1997
Citation Context ...y be reasonable for experimental studies in which subjects are randomly assigned to treatment groups (Mulaik et al. 1997). Second, testing a point null hypothesis in fact does provide a reasonable approximation to a more appropriate question: is µ nearly equal to µ0 (Berger and Delampady 1987, Berger and Sellke 1987), if the sample size is modest (Rindskopf 1997). Large sample sizes will result in small P-values even if µ is nearly equal to µ0. Third, testing the point null hypothesis is mathematically much easier than testing composite null hypotheses, which involve noncentrality parameters (Steiger and Fouladi 1997). The bottom line on P-values is that they relate to data that were not observed under a model that is known to be false. How meaningful can they be? But they are objective, at least; or are they? P is Arbitrary If the null hypothesis truly is false (as most of those tested really are), then P can be made as small as one wishes, by getting a large enough sample. P is a function of (1) the difference between reality and the null hypothesis and (2) the sample size. Suppose, for example, that you are testing to see if the mean of a population (µ) is, say, 100. The null hypothesis then is H0: µ = ... |
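The claim in this passage that P is a function of both the true effect and the sample size can be seen directly for a z-test: hold the observed standardized effect fixed and P falls below any threshold once n is large enough. A minimal illustration (assumptions: known unit variance, and a sample mean that lands exactly on the true effect):

```python
from statistics import NormalDist

def p_value(effect, n):
    # Two-sided z-test P-value for an observed standardized effect
    # with known unit variance: z = effect * sqrt(n).
    z = abs(effect) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(z))

# the same small effect (0.1 standard deviations) at growing sample sizes
for n in (25, 100, 400, 1600):
    print(n, round(p_value(0.1, n), 4))
# P falls from about 0.62 to about 0.0001 as n grows, with no change in effect
```

Nothing about the biology changes across these rows; only n does, which is the sense in which a small P by itself is arbitrary.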

52 | Statistical analysis and the illusion of objectivity. - Berger, Berry - 1988
Citation Context ... males and 3 females, 12 males and 3 females, 13 males and 3 females, etc. Alternatively, the investigator might have collected data until the difference between the numbers of males and females was 7, or until the difference was significant at some level. Each set of more extreme outcomes has its own probability, which, along with the probability of the result actually obtained, constitutes P. The point is that determining which outcomes of an experiment or survey are more extreme than the observed one, so a P-value can be calculated, requires knowledge of the intentions of the investigator (Berger and Berry 1988). Hence, P, the outcome of a statistical hypothesis test, depends on results that were not obtained, that is, something that did not happen, and what the intentions of the investigator were. Are Null Hypotheses Really True? P is calculated under the assumption that the null hypothesis is true. Most null hypotheses tested, however, state that some parameter equals zero, or that some set of parameters are all equal. These hypotheses, called point null hypotheses, are almost invariably known to be false before any data are collected (Berkson 1938, Savage 1957, Johnson 1995). If such hypotheses ar... |

47 | The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In - Meehl - 1997
Citation Context ...sting Introduction Statistical testing of hypotheses in the wildlife field has increased dramatically in recent years. Even more recent is an emphasis on power analysis associated with hypothesis testing (The Wildlife Society 1995). While this trend was occurring, statistical hypothesis testing was being deemphasized in some other disciplines. As an example, the American Psychological Association seriously debated a ban on presenting results of such tests in the Association's scientific journals. That proposal was rejected, not because it lacked merit, but due to its appearance of censorship (Meehl 1997). The issue was highlighted at the 1998 annual conference of The Wildlife Society, in Buffalo, New York, where the Biometrics Working Group sponsored a half-day symposium on Evaluating the Role of Hypothesis Testing–Power Analysis in Wildlife Science. Speakers at that session who addressed statistical hypothesis testing were virtually unanimous in their opinion that the tool was overused, misused, and often inappropriate. My objectives are to briefly describe statistical hypothesis testing, discuss common but incorrect interpretations of resulting P-values, mention some shortcomings of hypothe... |

47 | Eight common but false objections to the discontinuation of significance testing in the analysis of research data. - Schmidt, Hunter - 1997
Citation Context ...d have been rummaged through, but that's another topic.) A statistical test of the null hypothesis then is conducted, which generates a P-value. Finally, the question of what that value means relative to the null hypothesis is considered. Several interpretations of P often are made. Sometimes P is viewed as the probability that the results obtained were due to chance. Small values are taken to indicate that the results were not just a happenstance. A large value of P, say for a test that µ = 0, would suggest that the mean actually recorded was due to chance, and µ could be assumed to be zero (Schmidt and Hunter 1997). Other times, 1-P is considered the reliability of the result, that is, the probability of getting the same result if the experiment were repeated. Significant differences are often termed "reliable" under this interpretation. Alternatively, P can be treated as the probability that the null hypothesis is true. This interpretation is the most direct one, as it addresses head-on the question that interests the investigator. These 3 interpretations are what Carver (1978) termed fantasies about statistical significance. None of them is true, although they are treated as if they were true in some ... |

44 | Statistical power analysis in wildlife research. - Steidl, Hayes, et al. - 1997
Citation Context ...ocedure can be used to estimate the sample size needed to have a specified probability (power = 1 - β) of declaring as significant (at the α level) a particular difference or effect (effect size). As such, the process can usefully be used to design a survey or experiment (Gerard et al. 1998). Its use is sometimes recommended to ascertain the power of the test after a study has been conducted and nonsignificant results obtained (The Wildlife Society 1995). The notion is to guard against wrongly declaring the null hypothesis to be true. Such retrospective power analysis can be misleading, however. Steidl et al. (1997:274) noted that power estimated with the data used to test the null hypothesis and the observed effect size is meaningless, as a high P-value will invariably result in low estimated power. Retrospective power estimates may be meaningful if they are computed with effect sizes different from the observed effect size. Power analysis programs, however, assume the input values for effect and variance are known, rather than estimated, so they give misleadingly high estimates of power (Steidl et al. 1997, Gerard et al. 1998). In addition, although statistical hypothesis testing invokes what I believ...

38 | Some difficulties of interpretation encountered in the application of the chi-square test. - Berkson - 1938
Citation Context ...e of the intentions of the investigator (Berger and Berry 1988). Hence, P, the outcome of a statistical hypothesis test, depends on results that were not obtained, that is, something that did not happen, and what the intentions of the investigator were. Are Null Hypotheses Really True? P is calculated under the assumption that the null hypothesis is true. Most null hypotheses tested, however, state that some parameter equals zero, or that some set of parameters are all equal. These hypotheses, called point null hypotheses, are almost invariably known to be false before any data are collected (Berkson 1938, Savage 1957, Johnson 1995). If such hypotheses are not rejected, it is usually because the sample size is too small (Nunnally 1960). To see if the null hypotheses being tested in The Journal of Wildlife Management can validly be considered to be true, I arbitrarily selected two issues: an issue from the 1996 volume, the other from 1998. I scanned the results section of each paper, looking for P-values. For each P-value I found, I looked back to see what hypothesis was ...

35 |
On the tyranny of hypothesis testing in the social sciences.
- Loftus
- 1991
(Show Context)
Citation Context ...as "no longer a sound or fruitful basis for statistical investigation." Bakan (1966:436) called it "essential mindlessness in the conduct of research." The famed quality guru W. Edwards Deming (1975) commented that the reason students have problems understanding hypothesis tests is that they may be trying to think. Carver (1978) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else. Guttman (1985) recognized that "In practice, of course, tests of significance are not taken seriously." Loftus (1991) found it difficult to imagine a less insightful way to translate data into conclusions. Cohen (1994:997) noted that statistical testing of the null hypothesis "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" Barnard (1998:47) argued that "... simple P-values are not now used by the best statisticians." These examples are but a fraction of the comments made by statisticians and users of statistics about the role of statistical hypothesis testing. While many of the arguments against signific... |

31 |
What statistical significance testing is and what it is not.
- Shaver
- 1993
(Show Context)
Citation Context ... high P-value will invariably result in low estimated power. Retrospective power estimates may be meaningful if they are computed with effect sizes different from the observed effect size. Power analysis programs, however, assume the input values for effect and variance are known, rather than estimated, so they give misleadingly high estimates of power (Steidl et al. 1997, Gerard et al. 1998). In addition, although statistical hypothesis testing invokes what I believe to be 1 rather arbitrary parameter (α or P), power analysis requires three of them (α, β, effect size). For further comments see Shaver (1993:309), who termed power analysis "a vacuous intellectual game," and who noted that the tendency to use criteria, such as Cohen's (1988) standards for small, medium, and large effect sizes, is as mindless as the practice of using the α = 0.05 criterion in statistical significance testing. Questions about the likely size of true effects can be better addressed with confidence intervals than with retrospective power analyses (e.g., Steidl et al. 1997, Steiger and Fouladi 1997). ... |

29 |
Analyzing data: sanctification or detective work?
- Tukey
- 1969
(Show Context)
Citation Context ... Significance Testing Replication Replication is a cornerstone of science. If results from a study cannot be reproduced, they have no credibility. Scale is important here. Conducting the same study at the same time but at several different sites and getting comparable results is reassuring, but not nearly so convincing as having different investigators achieve similar results using different methods in different areas at different times. R. A. Fisher's idea of solid knowledge was not a single extremely significant result, but rather the ability of repeatedly getting results significant at 5% (Tukey 1969). Shaver (1993:304) observed that "The question of interest is whether an effect size of a magnitude judged to be important has been consistently obtained across valid replications. Whether any or all of the results are statistically significant is irrelevant." Replicated results automatically make statistical significance testing unnecessary (Bauernfeind 1968). Individual studies rarely contain sufficient information to support a final conclusion about the truth or value of a hypothesis (Schmidt and Hunter 1997). Studies differ in design, measurement devices, samples included, weather conditi... |

26 |
There is a time and place for significance testing. In
- Mulaik, Raju, et al.
- 1997
(Show Context)
Citation Context ...61.1% in 1 season and 61.2% in another. The only question was whether or not the sample size was sufficient to detect the difference. Likewise, we know before data are collected that there are real differences in the other examples, which are what Abelson (1997) referred to as "gratuitous" significance testing—testing what is already known. Three comments can be made in favor of testing a point null hypothesis such as H0: µ = µ0. First, while such hypotheses are virtually always false for sampling studies, they may be reasonable for experimental studies in which subjects are randomly assigned to treatment groups (Mulaik et al. 1997). Second, testing a point null hypothesis in fact does provide a reasonable approximation to a more appropriate question: is µ nearly equal to µ0 (Berger and Delampady 1987, Berger and Sellke 1987), if the sample size is modest (Rindskopf 1997). Large sample sizes will result in small P-values even if µ is nearly equal to µ0. Third, testing the point null hypothesis is mathematically much easier than testing composite null hypotheses, which involve noncentrality parameters (Steiger and Fouladi 1997). The bottom line on P-values is that they relate to data that were not observed under a model t... |
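The large-sample point in this excerpt can be made concrete with a short computation (illustrative only; the numbers are invented): for a fixed, practically trivial standardized effect, the two-sided z-test P-value shrinks toward zero as n grows.

```python
# Two-sided z-test P-value for a fixed, tiny standardized effect (0.02),
# shown at increasing sample sizes. Numbers are hypothetical.
from math import erf, sqrt

def two_sided_p(effect, n):
    z = abs(effect) * sqrt(n)
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

for n in (100, 10_000, 1_000_000):
    print(n, round(two_sided_p(0.02, n), 4))
# 100      0.8415  (nowhere near significant)
# 10000    0.0455  ("significant" at the 0.05 level)
# 1000000  0.0     (vanishingly small)
```

The effect never changes; only the sample size does, which is exactly why µ nearly equal to µ0 is eventually "rejected."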

24 |
Historical origins of statistical testing practices: the treatment of Fisher versus Neyman-Pearson views in textbooks.
- Huberty
- 1993
(Show Context)
Citation Context ...tabilize at the true parameter values. Hence, a large value of n translates into a large value of t. This strong dependence of P on the sample size led Good (1982) to suggest that P-values be standardized to a sample size of 100, by replacing P by P√(n/100) (or 0.5, if that is smaller). Even more arbitrary in a sense than P is the use of a standard cutoff value, usually denoted α. P-values less than or equal to α are deemed significant; those greater than α are nonsignificant. Use of α was advocated by Jerzy Neyman and Egon Pearson, whereas R. A. Fisher recommended presentation of observed P-values instead (Huberty 1993). Use of a fixed α level, say α = 0.05, promotes the seemingly nonsensical distinction between a significant finding if P = 0.049, and a nonsignificant finding if P = 0.051. Such minor differences are illusory anyway, as they derive from tests whose assumptions often are only approximately met (Preece 1990). Fisher objected to the Neyman-Pearson procedure because of its mechanical, automated nature (Mulaik et al. 1997). Proving the Null Hypothesis Discourses on hypothesis testing emphasize that null hypotheses cannot be proved; they can only be disproved (rejected). Failing to reject a null hypoth... |
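Good (1982) standardized P-values to a reference sample size of 100. The rule used below, replacing P by P·√(n/100) capped at 0.5, is reconstructed from this excerpt and should be treated as an assumption, not a definitive statement of Good's formula.

```python
# Standardized P-value rescaled to a reference sample size of 100,
# capped at 0.5 (rule reconstructed from the excerpt; an assumption).
from math import sqrt

def standardized_p(p, n):
    return min(0.5, p * sqrt(n / 100.0))

# A P of 0.01 from n = 400 is no more impressive than 0.02 from n = 100.
print(standardized_p(0.01, 400))  # 0.02
```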

21 | On probability as a basis for action.
- Deming
- 1975
(Show Context)
Citation Context ...on of sample size as related to results of a statistical significance test.
Practical importance of observed difference | Not significant | Significant
Not important | n okay | n too big
Important | n too small | n okay
Other Comments on Hypothesis Tests Statistical hypothesis testing has received an enormous amount of criticism, and for a rather long time. In 1963, Clark (1963:466) noted that it was "no longer a sound or fruitful basis for statistical investigation." Bakan (1966:436) called it "essential mindlessness in the conduct of research." The famed quality guru W. Edwards Deming (1975) commented that the reason students have problems understanding hypothesis tests is that they may be trying to think. Carver (1978) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else. Guttman (1985) recognized that "In practice, of course, tests of significance are not taken seriously." Loftus (1991) found it difficult to imagine a less insightful way to translate data into conclusions. Cohen (1994:997) noted that statistical testing of the null hypothesis "does not tell us what we ... |

21 |
Measuring uncertainty: an elementary introduction to Bayesian statistics.
- Schmitt
- 1969
(Show Context)
Citation Context ...e Testing considering the entire set of models, each perhaps weighted by its own strength of evidence (Buckland et al. 1997). Bayesian Approaches Bayesian approaches offer some alternatives preferable to the ordinary (often called frequentist, because they invoke the idea of the long-term frequency of outcomes in imagined repeats of experiments or samples) methods for hypothesis testing as well as for estimation and decision-making. Space limitations preclude a detailed review of the approach here; see Box and Tiao (1973), Berger (1985), and Carlin and Louis (1996) for longer expositions, and Schmitt (1969) for an elementary introduction. Sometimes the value of a parameter is predicted from theory, and it is more reasonable to test whether or not that value is consistent with the observed data than to calculate a confidence interval (Berger and Delampady 1987, Zellner 1987). For testing such hypotheses, what is usually desired (and what is sometimes believed to be provided by a statistical hypothesis test) is Pr[H0 | data]. What is obtained, as pointed out earlier, is P = Pr[observed or more extreme data | H0]. Bayes' theorem offers a formula for converting between them. This is an old (Bayes 1763... |
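The conversion Bayes' theorem performs can be shown with two simple hypotheses (all numbers invented): the posterior probability of H0 depends on the prior and on how likely the data are under the alternative, not on Pr[data | H0] alone.

```python
# Bayes' theorem for two simple hypotheses: converting likelihoods
# into Pr[H0 | data]. All probabilities here are hypothetical.
def posterior_h0(prior_h0, like_h0, like_h1):
    evidence = prior_h0 * like_h0 + (1.0 - prior_h0) * like_h1
    return prior_h0 * like_h0 / evidence

# Data with a "significant-looking" likelihood of 0.05 under H0 still
# leave H0 with posterior probability 1/3 when the data are also
# unlikely (0.10) under the alternative.
print(posterior_h0(0.5, 0.05, 0.10))
```

This is the gap between P and Pr[H0 | data] that the excerpt describes: a small likelihood under H0 does not by itself make H0 improbable.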

20 |
The role of statistical significance testing
- McLean, Ernest
- 1998
(Show Context)
Citation Context ...timates of the treatment effects, together with estimates of the errors to which they are subject, are the quantities of primary interest." Further, because wildlife ecologists want to influence management practices, Johnson (1995) noted that, "If ecologists are to be taken seriously by decision makers, they must provide information useful for deciding on a course of action, as opposed to addressing purely academic questions." To reinforce that point, several education and psychological journals have adopted editorial policies requiring that parameter estimates accompany any P-values presented (McLean and Ernest 1998). Ordinary confidence intervals provide more information than do P-values. Knowing that a 95% confidence interval includes zero tells one that, if a test of the hypothesis that the parameter equals zero is conducted, the resulting P-value will be greater than 0.05. A confidence interval provides both an estimate of the effect size and a measure of its uncertainty. A 95% confidence interval of, say, (-50, 300) suggests the parameter is less well estimated than would a con... |
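The relationship stated in this excerpt (a 95% interval containing zero corresponds exactly to P greater than 0.05) holds for normal-theory intervals; a hypothetical sketch:

```python
# Normal-theory 95% confidence interval and the matching two-sided
# P-value for H0: parameter = 0. Estimates and SEs are hypothetical.
from math import erf, sqrt

Z95 = 1.959964  # standard normal quantile for a 95% interval

def ci_and_p(estimate, se):
    interval = (estimate - Z95 * se, estimate + Z95 * se)
    z = abs(estimate) / se
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))
    return interval, p

# The interval excludes zero exactly when P < 0.05, and it also
# reports the effect size and its precision, which P alone does not.
print(ci_and_p(5.0, 2.0))  # interval above zero, P about 0.012
print(ci_and_p(5.0, 3.0))  # interval straddles zero, P about 0.096
```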

19 |
An applied statistician's creed.
- Nester
- 1996
(Show Context)
Citation Context ...lling our earlier comments about the effect of sample size on P-values, the two outcomes that please the researcher suggest the sample size was about right (Table 2). The annoying unimportant differences that were significant indicate that too large a sample was obtained. Further, if an important difference was not significant, the investigator concludes that the sample was insufficient and calls for further research. This schizophrenic nature of the interpretation of significance greatly reduces its value.
Table 1. Reaction of investigator to results of a statistical significance test (after Nester 1996).
Practical importance of observed difference | Not significant | Significant
Not important | Happy | Annoyed
Important | Very sad | Elated
Table 2. Interpretation of sample size as related to results of a statistical significance test.
Practical importance of observed difference | Not significant | Significant
Not important | n okay | n too big
Important | n too small | n okay
Other Comments on Hypothesis Tests Statistical hypothesis testing has received an enormous amount of criticism, and for a rather long time. In 1963, Clark (1963:466) noted that it was "no longe... |

18 |
Limits of retrospective power analysis.
- Gerard, Smith, et al.
- 1998
(Show Context)
Citation Context ... hypothesis may well be false, but the sample was too small to indicate significance; there is a lack of power. (B) suggests the data truly were consistent with the null hypothesis Power Analysis Power analysis is an adjunct to hypothesis testing that has become increasingly popular (Peterman 1990, Thomas and Krebs 1997). The procedure can be used to estimate the sample size needed to have a specified probability (power = 1 - β) of declaring as significant (at the α level) a particular difference or effect (effect size). As such, the process can usefully be used to design a survey or experiment (Gerard et al. 1998). Its use is sometimes recommended to ascertain the power of the test after a study has been conducted and nonsignificant results obtained (The Wildlife Society 1995). The notion is to guard against wrongly declaring the null hypothesis to be true. Such retrospective power analysis can be misleading, however. Steidl et al. (1997:274) noted that power estimated with the data used to test the null hypothesis and the observed effect size is meaningless, as a high P-value will invariably result in low estimated power. Retrospective power estimates may be meaningful if they are computed with effect... |

18 |
Detecting community-wide patterns: estimating power strengthens statistical inference.
- Toft, Shea
- 1983
(Show Context)
Citation Context ...othesis testing is inadequate, for it does not take into consideration the costs of alternative actions. Here a useful tool is statistical decision theory: the theory of acting rationally with respect to anticipated gains and losses, in the face of uncertainty. Hypothesis testing generally limits the probability of a Type I error (rejecting a true null hypothesis), often arbitrarily set at α = 0.05, while letting the probability of a Type II error (accepting a false null hypothesis) fall where it may. In ecological situations, however, a Type II error may be far more costly than a Type I error (Toft and Shea 1983). As an example, approving a pesticide that reduces the survival rate of an endangered species by 5% may be disastrous to that species, even if that change is not statistically detectable. As another, continued overharvest in marine fisheries may result in the collapse of the ecosystem even while statistical tests are unable to reject the null hypothesis that fishing has no effect (Dayton 1998). Details on decision theory can be found in DeGroot (1970), Berger (1985), and Pratt et al. (1995). Model Selection Statistical tests can play a useful role in diagnostic checks and evaluations of tenta... |
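The pesticide example in this excerpt can be cast as a minimal expected-loss comparison, which is the calculation decision theory substitutes for a significance test. All losses and probabilities below are invented for illustration.

```python
# Choosing the action with the smaller expected loss, instead of
# asking whether an effect is statistically significant.
# Probabilities and losses are hypothetical.
def best_action(p_harm, loss_restrict, loss_approve_if_harm):
    """Approve or restrict a pesticide under uncertainty about harm."""
    expected_restrict = loss_restrict                 # paid regardless
    expected_approve = p_harm * loss_approve_if_harm  # paid only if harm is real
    return "restrict" if expected_restrict < expected_approve else "approve"

# A 20% chance of a 1000-unit loss outweighs a certain 50-unit cost,
# even though a 5% survival effect might never reach significance.
print(best_action(0.20, 50.0, 1000.0))  # restrict
print(best_action(0.01, 50.0, 1000.0))  # approve
```

Note that the decision turns on the asymmetry of the losses, which a fixed α = 0.05 rule never sees.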

16 |
On hypothesis testing in ecology and evolution.
- Quinn, Dunham
- 1983
(Show Context)
Citation Context ...operties of populations; Simberloff 1990). Unlike scientific hypotheses, the truth of which is truly in question, most statistical hypotheses are known a priori to be false. The confusion of the 2 types of hypotheses has been attributed to the pervasive influence of R. A. Fisher, who did not distinguish them (Schmidt and Hunter 1997). Scientific hypothesis testing dates back at least to the 17th century: in 1620, Francis Bacon discussed the role of proposing alternative explanations and conducting explicit tests to distinguish between them as the most direct route to scientific understanding (Quinn and Dunham 1983). This concept is related to Popperian inference, which seeks to develop and test hypotheses that can clearly be falsified (Popper 1959), because a falsified hypothesis provides greater advance in understanding than does a hypothesis that is supported. Also similar is Platt's (1964) notion of strong inference, which emphasizes developing alternative hypotheses that lead to different predictions. In such a case, results inconsistent with predictions from a hypothesis cast doubt on its validity. Examples of scientific hypotheses, which were considered credible, include Copernicus' notion HA: the... |

16 |
Valuation of experimental management options for ecological systems.
- Walters, Green
- 1997
(Show Context)
Citation Context ... parameter θ. Then replacing H0 by θ in the above formula yields Pr[θ | data] = Pr[data | θ] Pr[θ] / Pr[data], which provides an expression that shows how initial knowledge about the value of a parameter, reflected in the prior probability function Pr[θ], is modified by data obtained from a study, Pr[data | θ], to yield a final probability function, Pr[θ | data]. This process of updating beliefs leads in a natural way to adaptive resource management (Holling 1978, Walters 1986), a recent favorite topic in our field (e.g., Walters and Green 1997). Bayesian confidence intervals are much more natural than their frequentist counterparts. A frequentist 95% confidence interval for a parameter θ, denoted (θL, θU), is interpreted as follows: if the study were repeated an infinite number of times, 95% of the confidence intervals that resulted would contain the true value θ. It says nothing about the particular study that was actually conducted, which led Howson and Urbach (1991:373) to comment that "statisticians regularly say that one can be '95 per cent confident' that the parameter lies in the confidence interval. They never say why." In cont... |
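The updating step from Pr[θ] to Pr[θ | data] has a closed form when a survival rate is given a Beta prior and binomial data, which makes the sequential, adaptive-management flavor easy to see. The survival counts below are invented.

```python
# Conjugate Beta-binomial updating of a survival rate theta:
# the prior Beta(a, b) becomes Beta(a + survived, b + died).
# Survival counts are hypothetical.
def update_beta(a, b, survived, died):
    return a + survived, b + died

a, b = 1, 1                      # flat prior on theta
a, b = update_beta(a, b, 18, 2)  # year 1: 18 of 20 marked animals survive
a, b = update_beta(a, b, 15, 5)  # year 2: 15 of 20 survive
print(a, b, round(a / (a + b), 3))  # 34 8 0.81
```

Each year's posterior becomes the next year's prior, which is exactly the belief-updating loop the excerpt connects to adaptive resource management.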

16 |
Bayesian environmental policy decisions: two case studies.
- Wolfson, Kadane, et al.
- 1996
(Show Context)
Citation Context ...r say why." In contrast, a Bayesian confidence interval, sometimes called a credible interval, is interpreted to mean that the probability that the true value of the parameter lies in the interval is 95%. That statement is much more natural, and is what people think a confidence interval is, until they get the notion drummed out of their heads in statistics courses. For decision analysis, Bayes' theorem offers a very logical way to make decisions in the face of uncertainty. It allows for incorporating beliefs, data, and the gains or losses expected from possible consequences of decisions. See Wolfson et al. (1996) and Ellison (1996) for recent overviews of Bayesian methods with an ecological orientation. Conclusions Editors of scientific journals, along with the referees they rely on, are really the arbiters of scientific practice. They need to understan... |

14 |
When confidence intervals should be used instead of statistical tests, and vice versa.
- Reichardt, Gollob
- 1997
(Show Context)
Citation Context ...es both an estimate of the effect size and a measure of its uncertainty. A 95% confidence interval of, say, (-50, 300) suggests the parameter is less well estimated than would a confidence interval of (120, 130). Perhaps surprisingly, confidence intervals have a longer history than statistical hypothesis tests (Schmidt and Hunter 1997). With its advantages and longer history, why have confidence intervals not been used more than they have? Steiger and Fouladi (1997) and Reichardt and Gollob (1997) posited several explanations: (1) hypothesis testing has become a tradition; (2) the advantages of confidence intervals are not recognized; (3) there is some ignorance of the procedures available; (4) major statistical packages do not include many confidence interval estimates; (5) sizes of parameter estimates are often disappointingly small even though they may be very significantly different from zero; (6) the wide confidence intervals that often result from a study are embarrassing; (7) some hypothesis tests (e.g., chi square contingency table) have no uniquely defined parameter associated... |

9 |
Bayesian reasoning in science.
- Howson, Urbach
- 1991
(Show Context)
Citation Context ...resource management (Holling 1978, Walters 1986), a recent favorite topic in our field (e.g., Walters and Green 1997). Bayesian confidence intervals are much more natural than their frequentist counterparts. A frequentist 95% confidence interval for a parameter θ, denoted (θL, θU), is interpreted as follows: if the study were repeated an infinite number of times, 95% of the confidence intervals that resulted would contain the true value θ. It says nothing about the particular study that was actually conducted, which led Howson and Urbach (1991:373) to comment that "statisticians regularly say that one can be '95 per cent confident' that the parameter lies in the confidence interval. They never say why." In contrast, a Bayesian confidence interval, sometimes called a credible interval, is interpreted to mean that the probability that the true value of the parameter lies in the interval is 95%. That statement is much more natural, and is what people think a confidence interval is, until they get the notion drummed out of their heads in statistics courses. For decision analysis, Bayes' theorem offers a very logical way to make decisio... |

9 |
Statistical sirens: the allure of nonparametrics.
- Johnson
- 1995
(Show Context)
Citation Context ...investigator (Berger and Berry 1988). Hence, P, the outcome of a statistical hypothesis test, depends on results that were not obtained, that is, something that did not happen, and what the intentions of the investigator were. Are Null Hypotheses Really True? P is calculated under the assumption that the null hypothesis is true. Most null hypotheses tested, however, state that some parameter equals zero, or that some set of parameters are all equal. These hypotheses, called point null hypotheses, are almost invariably known to be false before any data are collected (Berkson 1938, Savage 1957, Johnson 1995). If such hypotheses are not rejected, it is usually because the sample size is too small (Nunnally 1960). To see if the null hypotheses being tested in The Journal of Wildlife Management can validly be considered to be true, I arbitrarily selected two issues: an issue from the 1996 volume, the other from 1998. I scanned the results section of each paper, looking for P-values. For each P-value I found, I looked back to see what hypothesis was being tested. I made a very ... |

8 |
Hypothesis testing in relation to statistical methodology.
- Clark
- 1963
(Show Context)
Citation Context ...stical significance test (after Nester 1996). Statistical significance Practical importance of observed difference Not significant Significant Not important Happy Annoyed Important Very sad Elated Table 2. Interpretation of sample size as related to results of a statistical significance test. Statistical significance Practical importance of observed difference Not significant Significant Not important n okay n too big Important n too small n okay Other Comments on Hypothesis Tests Statistical hypothesis testing has received an enormous amount of criticism, and for a rather long time. In 1963, Clark (1963:466) noted that it was "no longer a sound or fruitful basis for statistical investigation." Bakan (1966:436) called it "essential mindlessness in the conduct of research." The famed quality guru W. Edwards Deming (1975) commented that the reason students have problems understanding hypothesis tests is that they may be trying to think. Carver (1978) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else. Guttman (1985) recognized that "In practice, of course, tests of significance are n... |

6 |
Model selection and inference: a practical information-theoretic approach.
- Burnham, Anderson
- 1998
(Show Context)
Citation Context ...in marine fisheries may result in the collapse of the ecosystem even while statistical tests are unable to reject the null hypothesis that fishing has no effect (Dayton 1998). Details on decision theory can be found in DeGroot (1970), Berger (1985), and Pratt et al. (1995). Model Selection Statistical tests can play a useful role in diagnostic checks and evaluations of tentative statistical models (Box 1980). But even for this application, competing tools are superior. Information criteria, such as Akaike's, provide objective measures for selecting among different models fitted to a data set. Burnham and Anderson (1998) provided a detailed overview of model selection procedures based on information criteria. In addition, for many applications it is not advisable to select a "best" model and then proceed as if that model was correct. There may be a group of models entertained, and the data will provide different strength of evidence for each model. Rather than basing decisions or conclusions on the single model most strongly supported by the data, one should acknowledge the uncertainty about the model by ... |
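Weighting each model by its strength of evidence, as the excerpt suggests, is commonly done with Akaike weights; a minimal sketch with made-up AIC values:

```python
# Akaike weights: relative strength of evidence for each model in a
# candidate set, computed from (hypothetical) AIC values.
from math import exp

def akaike_weights(aic_values):
    best = min(aic_values)
    rel_likelihoods = [exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel_likelihoods)
    return [r / total for r in rel_likelihoods]

weights = akaike_weights([310.2, 312.2, 318.9])
print([round(w, 3) for w in weights])  # [0.724, 0.266, 0.009]
```

Rather than crowning the first model as "correct," the weights quantify how the evidence is spread across the set, supporting the model-averaging attitude described above.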

6 |
The place of statistics
- Nunnally
- 1960
(Show Context)
Citation Context ... results that were not obtained, that is, something that did not happen, and what the intentions of the investigator were. Are Null Hypotheses Really True? P is calculated under the assumption that the null hypothesis is true. Most null hypotheses tested, however, state that some parameter equals zero, or that some set of parameters are all equal. These hypotheses, called point null hypotheses, are almost invariably known to be false before any data are collected (Berkson 1938, Savage 1957, Johnson 1995). If such hypotheses are not rejected, it is usually because the sample size is too small (Nunnally 1960). To see if the null hypotheses being tested in The Journal of Wildlife Management can validly be considered to be true, I arbitrarily selected two issues: an issue from the 1996 volume, the other from 1998. I scanned the results section of each paper, looking for P-values. For each P-value I found, I looked back to see what hypothesis was being tested. I made a very biased selection of some conclusions reached by rejecting null hypotheses; these include: (1) the occurre... |

6 |
Testing "small," not null, hypotheses: classical and Bayesian approaches.
- Rindskopf
- 1997
(Show Context)
Citation Context ...Abelson (1997) referred to as "gratuitous" significance testing—testing what is already known. Three comments can be made in favor of testing a point null hypothesis such as H0: µ = µ0. First, while such hypotheses are virtually always false for sampling studies, they may be reasonable for experimental studies in which subjects are randomly assigned to treatment groups (Mulaik et al. 1997). Second, testing a point null hypothesis in fact does provide a reasonable approximation to a more appropriate question: is µ nearly equal to µ0 (Berger and Delampady 1987, Berger and Sellke 1987), if the sample size is modest (Rindskopf 1997). Large sample sizes will result in small P-values even if µ is nearly equal to µ0. Third, testing the point null hypothesis is mathematically much easier than testing composite null hypotheses, which involve noncentrality parameters (Steiger and Fouladi 1997). The bottom line on P-values is that they relate to data that were not observed under a model that is known to be false. How meaningful can they be? But they are objective, at least; or are they? P is Arbitrary If the null hypothesis truly is false (as most of those tested really are), then P can be made as small as one wishes, by gettin... |

5 |
Standardized tail-area probabilities.
- Good
- 1982
(Show Context)
Citation Context ...t-test, which is t = (x̄ - 100) / (S / √n), where x̄ is the mean of the sample, S is the standard deviation of the sample, and n is the sample size. Clearly, t can be made arbitrarily large (and the P-value associated with it arbitrarily small) by making either (x̄ - 100) or n large enough. As the sample size increases, (x̄ - 100) and S will approximately stabilize at the true parameter values. Hence, a large value of n translates into a large value of t. This strong dependence of P on the sample size led Good (1982) to suggest that P-values be standardized to a sample size of 100, by replacing P by P√(n/100) (or 0.5, if that is smaller). Even more arbitrary in a sense than P is the use of a standard cutoff value, usually denoted α. P-values less than or equal to α are deemed significant; those greater than α are nonsignificant. Use of α was advocated by Jerzy Neyman and Egon Pearson, whereas R. A. Fisher recommended presentation of observed P-values instead (Huberty 1993). Use of a fixed α level, say α = 0.05, promotes the seemingly nonsensical distinction between a significant finding if P = 0.049, and a nonsignificant fi... |
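The dependence of t on n described in this excerpt is easy to see numerically. The one-sample t statistic, t = (x̄ - 100)/(S/√n), grows like √n when the sample mean and standard deviation are held fixed; the sample values below are invented.

```python
# t = (xbar - 100) / (S / sqrt(n)): with the sample mean and standard
# deviation held fixed, t grows without bound as n increases.
# The sample values are hypothetical.
from math import sqrt

def t_stat(xbar, s, n, mu0=100.0):
    return (xbar - mu0) / (s / sqrt(n))

for n in (10, 100, 1000):
    print(n, round(t_stat(100.5, 5.0, n), 2))
# 10    0.32
# 100   1.0
# 1000  3.16
```

The same one-tenth-of-a-standard-deviation discrepancy goes from negligible to "highly significant" purely by collecting more data.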

5 |
Hypotheses, errors, and statistical assumptions.
- Simberloff
- 1990
(Show Context)
Citation Context ...efute the hypothesis, that outcome implies that the theory is incorrect and should be modified or scrapped. If the results do not refute the hypothesis, the theory stands and may gain support, depending on how critical the experiment was. In contrast, the hypotheses usually tested by wildlife ecologists do not devolve from general theories about how the real world operates. More typically they are statistical hypotheses (i.e., statements about properties of populations; Simberloff 1990). Unlike scientific hypotheses, the truth of which is truly in question, most statistical hypotheses are known a priori to be false. The confusion of the 2 types of hypotheses has been attributed to the pervasive influence of R. A. Fisher, who did not distinguish them (Schmidt and Hunter 1997). Scientific hypothesis testing dates back at least to the 17th century: in 1620, Francis Bacon discussed the role of proposing alternative explanations and conducting explicit tests to distinguish between them as the most direct route to scientific understanding (Quinn and Dunham 1983). This concept is r... |

5 |
Sir Ronald Fisher and the design of experiments.
- Yates
- 1964
(Show Context)
Citation Context ...theory is an appropriate tool. For either of these applications, as well as for hypothesis testing itself, the Bayesian approach offers some distinct advantages over the traditional methods. These alternatives are briefly outlined below. Although the alternatives will not meet all potential needs, they do offer attractive choices in many frequently encountered situations. Estimates and Confidence Intervals Four decades ago, Anscombe (1956) observed that statistical hypothesis tests were totally irrelevant, and that what was needed were estimates of magnitudes of effects, with standard errors. Yates (1964) indicated that "The most commonly occurring weakness in the application of Fisherian methods is undue emphasis on tests of significance, and failure to recognize that in many types of experimental work estimates of the treatment effects, together with estimates of the errors to which they are subject, are the quantities of primary interest." Further, because wildlife ecologists want to influence management practices, Johnson (1995) noted that, "If ecologists are to be taken seriously by decision makers, they must provide information useful for deciding on a course of action, as opposed to add... |

4 |
The need for replication in educational research.
- Bauernfeind
- 1968
(Show Context)
Citation Context ...hieve similar results using different methods in different areas at different times. R. A. Fisher's idea of solid knowledge was not a single extremely significant result, but rather the ability of repeatedly getting results significant at 5% (Tukey 1969). Shaver (1993:304) observed that "The question of interest is whether an effect size of a magnitude judged to be important has been consistently obtained across valid replications. Whether any or all of the results are statistically significant is irrelevant." Replicated results automatically make statistical significance testing unnecessary (Bauernfeind 1968). Individual studies rarely contain sufficient information to support a final conclusion about the truth or value of a hypothesis (Schmidt and Hunter 1997). Studies differ in design, measurement devices, samples included, weather conditions, and many other ways. This variability among studies is more pervasive in ecological situations than in, for example, the physical sciences (Ellison 1996). To have generality, results should be consistent under a wide variety of circumstances. Meta-analysis provides some tools for combining information from repeated studies (e.g., Hedges and Olkin 1985) and... |
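The inverse-variance pooling that underlies such meta-analytic combination of replicated studies (as in Hedges and Olkin 1985) can be sketched in a few lines. The effect sizes and standard errors below are made-up illustrative numbers, not data from the paper or any cited study:

```python
import math

def pool_fixed_effect(effects, ses):
    """Fixed-effect inverse-variance pooling of independent effect
    estimates (illustrative sketch, not code from the paper)."""
    weights = [1.0 / se ** 2 for se in ses]           # precision weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = 1.0 / math.sqrt(sum(weights))         # SE of the pooled estimate
    return pooled, pooled_se

# Three hypothetical replicated studies estimating the same effect
effects = [0.30, 0.10, 0.25]
ses = [0.10, 0.15, 0.12]
est, se = pool_fixed_effect(effects, ses)
print(f"pooled estimate = {est:.3f}, SE = {se:.3f}")
```

Each study is weighted by the inverse of its sampling variance, so precise studies dominate the pooled estimate; a random-effects model would additionally allow for between-study variability, which Ellison (1996) suggests is large in ecological work.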

2 |
Pooling probabilities.
- Barnard
- 1998
(Show Context)
Citation Context ...78) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else. Guttman (1985) recognized that "In practice, of course, tests of significance are not taken seriously." Loftus (1991) found it difficult to imagine a less insightful way to translate data into conclusions. Cohen (1994:997) noted that statistical testing of the null hypothesis "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" Barnard (1998:47) argued that "... simple P-values are not now used by the best statisticians." These examples are but a fraction of the comments made by statisticians and users of statistics about the role of statistical hypothesis testing. While many of the arguments against significance tests stem from their misuse, rather than their intrinsic values (Mulaik et al. 1997), I believe that 1 of their intrinsic problems is that they do encourage misuse. |

2 |
The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis
- Guttman
- 1985
(Show Context)
Citation Context ...d an enormous amount of criticism, and for a rather long time. In 1963, Clark (1963:466) noted that it was "no longer a sound or fruitful basis for statistical investigation." Bakan (1966:436) called it "essential mindlessness in the conduct of research." The famed quality guru W. Edwards Deming (1975) commented that the reason students have problems understanding hypothesis tests is that they may be trying to think. Carver (1978) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else. Guttman (1985) recognized that "In practice, of course, tests of significance are not taken seriously." Loftus (1991) found it difficult to imagine a less insightful way to translate data into conclusions. Cohen (1994:997) noted that statistical testing of the null hypothesis "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" Barnard (1998:47) argued that "... simple P-values are not now used by the best statisticians." These examples are but a fraction of the comments made by statisticians and users of st... |

2 |
Fisher and experimental design: a review.
- Preece
- 1990
(Show Context)
Citation Context ...sense than P is the use of a standard cutoff value, usually denoted α. P-values less than or equal to α are deemed significant; those greater than α are nonsignificant. Use of α was advocated by Jerzy Neyman and Egon Pearson, whereas R. A. Fisher recommended presentation of observed P-values instead (Huberty 1993). Use of a fixed α level, say α = 0.05, promotes the seemingly nonsensical distinction between a significant finding if P = 0.049, and a nonsignificant finding if P = 0.051. Such minor differences are illusory anyway, as they derive from tests whose assumptions often are only approximately met (Preece 1990). Fisher objected to the Neyman-Pearson procedure because of its mechanical, automated nature (Mulaik et al. 1997). Proving the Null Hypothesis Discourses on hypothesis testing emphasize that null hypotheses cannot be proved; they can only be disproved (rejected). Failing to reject a null hypothesis does not mean that it is true. Especially with small samples, one must be careful not to accept the null hypothesis. Consider a test of the null hypothesis that a mean µ equals µ0. The situations illustrated in Figure 1 both reflect a failure to reject that hypothesis. Figure 1A suggests the null h... |
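The fragility of that cutoff is easy to show numerically. A minimal sketch, assuming a two-sided one-sample z-test with known unit standard deviation and a true standardized effect of 0.2 (hypothetical numbers, not from the paper): adding a single observation moves the identical effect from one side of the 0.05 line to the other.

```python
import math

def two_sided_p(effect, n):
    """P-value of a two-sided z-test for a standardized effect
    observed with sample size n (sd assumed known and equal to 1)."""
    z = effect * math.sqrt(n)
    # Phi(z) via the error function; P = 2 * (1 - Phi(|z|))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# The same effect size of 0.2 straddles the conventional cutoff
print(f"n = 96: P = {two_sided_p(0.2, 96):.4f}")   # just above 0.05
print(f"n = 97: P = {two_sided_p(0.2, 97):.4f}")   # just below 0.05
```

Nothing about the underlying effect changed between the two calls; only the sample size did, which is the arbitrariness the passage describes.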

1 |
Discussion on Dr. David's and Dr. Johnson's Paper.
- Anscombe
- 1956
(Show Context)
Citation Context ...ctions between, a number of processes. For this purpose, estimation is far more appropriate than hypothesis testing (Campbell 1992). For certain other situations, decision theory is an appropriate tool. For either of these applications, as well as for hypothesis testing itself, the Bayesian approach offers some distinct advantages over the traditional methods. These alternatives are briefly outlined below. Although the alternatives will not meet all potential needs, they do offer attractive choices in many frequently encountered situations. Estimates and Confidence Intervals Four decades ago, Anscombe (1956) observed that statistical hypothesis tests were totally irrelevant, and that what was needed were estimates of magnitudes of effects, with standard errors. Yates (1964) indicated that "The most commonly occurring weakness in the application of Fisherian methods is undue emphasis on tests of significance, and failure to recognize that in many types of experimental work estimates of the treatment effects, together with estimates of the errors to which they are subject, are the quantities of primary interest." Further, because wildlife ecologists want to influence management practices, Johnson (... |

1 |
Confidence intervals.
- Campbell
- 1992
(Show Context)
Citation Context ...What are the Alternatives? What should we do instead of testing hypotheses? As Quinn and Dunham (1983) pointed out, it is more fruitful to determine the relative importance of the contributions of, and interactions between, a number of processes. For this purpose, estimation is far more appropriate than hypothesis testing (Campbell 1992). For certain other situations, decision theory is an appropriate tool. For either of these applications, as well as for hypothesis testing itself, the Bayesian approach offers some distinct advantages over the traditional methods. These alternatives are briefly outlined below. Although the alternatives will not meet all potential needs, they do offer attractive choices in many frequently encountered situations. Estimates and Confidence Intervals Four decades ago, Anscombe (1956) observed that statistical hypothesis tests were totally irrelevant, and that what was needed were estimates of magn... |
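Reporting an estimate of the treatment effect with its standard error, as Anscombe and Yates recommend, rather than a verdict on a null hypothesis, can be sketched as follows (the sample values are hypothetical):

```python
import math
import statistics

# Hypothetical paired differences in some measured response
diffs = [1.2, 0.8, 1.5, 0.3, 1.1, 0.9, 1.4, 0.6]

mean = statistics.mean(diffs)
se = statistics.stdev(diffs) / math.sqrt(len(diffs))   # standard error of the mean
half = 1.96 * se                                       # large-sample 95% half-width

print(f"estimated effect = {mean:.3f}, SE = {se:.3f}")
print(f"approx. 95% CI: ({mean - half:.2f}, {mean + half:.2f})")
```

The interval conveys both the magnitude of the effect and its uncertainty; with only 8 observations a t-multiplier (about 2.36 for 7 df) would be more defensible than the normal 1.96 used here for simplicity.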

1 |
Faith, hope and statistics.
- Matthews
- 1997
(Show Context)
Citation Context ...istics packages; (3) everyone else seems to use them; (4) students, statisticians, and scientists are taught to use them; and (5) some journal editors and thesis supervisors demand them. Carver (1978) recognized that statistical significance is generally interpreted as having some relationship to replication, which is the cornerstone of science. More cynically, Carver (1978) suggested that complicated mathematical procedures lend an air of scientific objectivity to conclusions. Shaver (1993) noted that social scientists equate being quantitative with being scientific. D. V. Lindley (quoted in Matthews 1997) observed that "People like conventional hypothesis tests because it's so easy to get significant results from them." I attribute the heavy use of statistical hypothesis testing, not just in the wildlife field but in other "soft" sciences such as psychology, sociology, and education, to "physics envy." Physicists and other researchers in the "hard" sciences are widely respected for their ability to learn things about the real world (and universe) that are solid and incontrovertible, and also yield results that translate into products that we see daily. Psychologists, for one group, have diffic... |

1 |
Technological tools.
- Thomas, Krebs
- 1997
(Show Context)
Citation Context ...Fig 1. Results of a test that failed to reject the null hypothesis that a mean equals 0. Shaded areas indicate regions for which the hypothesis would be rejected. (A) suggests the null hypothesis may well be false, but the sample was too small to indicate significance; there is a lack of power. (B) suggests the data truly were consistent with the null hypothesis. Power Analysis Power analysis is an adjunct to hypothesis testing that has become increasingly popular (Peterman 1990, Thomas and Krebs 1997). The procedure can be used to estimate the sample size needed to have a specified probability (power = 1 - β) of declaring as significant (at the α level) a particular difference or effect (effect size). As such, the process can usefully be used to design a survey or experiment (Gerard et al. 1998). Its use is sometimes recommended to ascertain the power of the test after a study has been conducted and nonsignificant results obtained (The Wildlife Society 1995). The notion is to guard against wrongly declaring the null hypothesis to be true. Such retrospective power analysis can be misleading, h... |
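The prospective use described above (choosing n before the study) can be sketched for a two-sided one-sample z-test; the α, power, and effect size below are illustrative choices, not values from the paper:

```python
import math
from statistics import NormalDist

def sample_size(effect, alpha=0.05, power=0.80):
    """Smallest n giving the stated power to detect a standardized
    effect with a two-sided one-sample z-test (sketch; ignores the
    negligible chance of rejecting on the wrong side)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)               # about 0.84 for power = 0.80
    return math.ceil(((z_alpha + z_beta) / effect) ** 2)

# Sample size to detect a medium standardized effect of 0.5
# with 80% power at alpha = 0.05
print(sample_size(0.5))
```

Used prospectively this way, the calculation designs the study; the passage goes on to caution that running it retrospectively on an already-nonsignificant result can be misleading.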