### Citations

1164 | Statistical power analysis for the behavioural sciences (revised ed.)
- Cohen
- 1977
Citation Context ...e for a threshold of three is halved (see first column in Table 2 compared to Table 1). B will be most variable early on in testing, because B... (Z. Dienes / Journal of Mathematical Psychology 72 (2016) 78–89; footnote 4: http://dx.doi.org/10.1016/j.jmp.2015.10.003)

Table 1. Per cent decision rates for accepting/rejecting H0 for BH(0,1) (i.e. a Bayes factor in which H1 has been represented as a half-normal, with mode = 0, and SD = 1). Each participant provides a single difference score, sampled from a normal distribution with a standard deviation of 1. Thus, the specified population effect sizes are dz’s (Cohen, 1988). Maximum number of participants before stopping (MaxN) = 100; minimum number of participants before checking after every trial (MinN) = 1. H0 is rejected if B exceeds the stated threshold, and accepted if B goes below 1/threshold.

Threshold:              3    4    5    6    7    8    9   10
dz = 0   Reject H0     14   12   11   11    7    7    6    5
         Accept H0     86   87   86   86   85   79   74   69
dz = 1   Reject H0     97  100  100  100  100  100  100  100
         Accept H0      1    0    0    0    0    0    0    0

Table 2. Per cent decision rates for accepting/rejecting H0 for BH(0,1) as for Table 1, except that MinN = 10. Threshold: 3 4 5 6 7 8 9 10. Population effect: dz = 0, Reject H0 7 7...
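The quoted Table 1 caption fully specifies the Bayes factor BH(0,1): H1 modelled as a half-normal with mode 0 and SD 1, each participant contributing one difference score with known population SD 1. The sketch below is my own illustration (function names `bh01` and `decide`, the numerical integration scheme, and the grid bounds are assumptions), not the paper's code.

```python
# Sketch of B_H(0,1) from the quoted Table 1 caption; known observation SD = 1.
import math

def _normpdf(x, mu, sd):
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def bh01(xbar, n, steps=800):
    """B_H(0,1) = P(D|H1)/P(D|H0) for the mean of n difference scores, SD 1."""
    se = 1.0 / math.sqrt(n)                      # standard error of the mean
    h = 6.0 / steps                              # integrate effect d over [0, 6]
    marg = 0.0
    for i in range(steps + 1):
        d = i * h
        w = 0.5 if i in (0, steps) else 1.0      # trapezoid-rule weights
        # half-normal prior (mode 0, SD 1) times the likelihood of xbar
        marg += w * 2.0 * _normpdf(d, 0.0, 1.0) * _normpdf(xbar, d, se)
    return (marg * h) / _normpdf(xbar, 0.0, se)

def decide(xbar, n, threshold=3.0):
    """Table 1's rule: reject H0 if B > threshold, accept H0 if B < 1/threshold."""
    b = bh01(xbar, n)
    if b > threshold:
        return "reject H0"
    if b < 1.0 / threshold:
        return "accept H0"
    return "no decision"
```

Running `decide` on many simulated datasets, with checking after every participant, is what produces the per cent decision rates tabulated above.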

913 | Probability Theory: The Logic of Science
- Jaynes
- 2003
Citation Context ...the evidence to be stronger than it actually was for the general theory that priming ‘‘closing’’ speeds the closing of sales. (Pre-registration of studies, by itself neither Bayesian nor non-Bayesian, is a key solution to this problem; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012.) That is, bias introduced by the judicious dropping of conditions, discussed by Simmons et al. (2011), is not in itself solved by using Bayesian methods on the data that remains. Bayesian analyses will not solve all forms of bias (and then only the ones that are part of the formal statistical problem; Jaynes, 2003). A Bayesian analysis requires that all relevant data are included. In fact, say that the authors report all studies, so cherry picking is avoided. What is the evidential value of the final study showing an effect? The orthodox approach corrects for multiple testing. Thus, if all studies are taken as a family, a threshold of 0.05/6 = 0.008 may be used for p-values, now rendering the final study nonsignificant at the 5% level: The presence of an effect cannot be asserted for the embodied priming manipulation. However, from a Bayesian point of view, the evidence provided by the data from specific...
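The quoted correction is simple arithmetic: with a family of six tests, the orthodox per-test threshold shrinks to 0.05/6, while a Bayes factor for the final study alone is untouched by family size. A minimal sketch (the p-value 0.02 is a hypothetical illustration of mine, not a value from the paper):

```python
# Bonferroni-style family correction, as in the quoted "0.05/6 = 0.008".
alpha, family_size = 0.05, 6
per_test_alpha = alpha / family_size              # 0.05 / 6 ≈ 0.0083
p_final = 0.02                                    # hypothetical final-study p-value
significant_alone = p_final < alpha               # significant at the 5% level
significant_in_family = p_final < per_test_alpha  # no longer significant once corrected
```

The point of the quoted passage is that the Bayesian evidential value of the final study's data does not change with how many other procedures happened to be tested.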

879 | Theory of probability
- Jeffreys
- 1961
Citation Context ...one happened to have in different theories prior to data (which will be different for different people), that belief should be updated by the same amount, B, for everyone.1 What this equation tells us is that if we measure strength of evidence of data as the amount by which anyone should change their strength of belief in the two theories in the light of the data, then the only relevant information is provided by the Bayes factor, B (cf Birnbaum, 1962). Conventional approximate guidelines for strength of evidence were provided by Jeffreys (1939, though Bayes factors stand on their own as continuous measures of degrees of evidence). If B > 3 then there is substantial evidence for H1 rather than H0; if B < 1/3 then there is substantial evidence for H0 rather than H1; and if B is in between 1/3 and 3 then the evidence is insensitive. The term ‘prior’ has two meanings in the context of Bayes factors. P(H1) is a prior probability of H1, i.e. how much you believe in H1 before seeing the data. But the term ‘prior’ is also used to refer to setting up the model of H1, i.e. to state what the theory predicts, used for obtaining P(D|H1), the probabi...

704 | The design of experiments
- Fisher
- 1935
Citation Context ...mpted to hack in just one direction; b) as a measure of evidence they are insensitive to the stopping rule; c) families of tests cannot be arbitrarily defined; and d) falsely implying a contrast is planned rather than post hoc becomes irrelevant (though the value of pre-registration is not mitigated). © 2015 Elsevier Inc. All rights reserved. 1. Introduction A Bayes factor is a form of statistical inference in which one model, say H1, is pitted against another, say H0. Both models need to be specified, even if in a default way. Significance testing (using only the p-value for inference, as per Fisher, 1935) involves setting up a model for H0 alone—and yet is typically still used to pit H0 against H1. I will argue that significance testing is in this way flawed, with harmful consequences for the practice of science (Wagenmakers, 2007). Bayes factors, by specifying two models, resolve several key problems (though not all problems). (∗ Correspondence to: School of Psychology, University of Sussex, Brighton, BN1 9QH, UK. E-mail address: dienes@sussex.ac.uk.) After defining a Bayes factor, the introduction fi...

417 | Scientific reasoning: The Bayesian approach (2nd ed.)
- Howson, Urbach
- 1993
Citation Context ...searcher can be taken as randomly sampling from the same population as before), each time stopping the experiment after three participants in a row were happier on the drug than on placebo. Even if the drug were ineffective, each estimate would have a tendency to indicate that people were happier on the drug; that is, the mean of all the estimates would show greater happiness on the drug than on the placebo. Is not this a problem for an experiment, even if analysed by Bayesian statistics? The clue to the solution is that bias is inherently a frequentist notion, with need of a reference class (Howson & Urbach, 2006); yet it is the use of reference classes that leads to the inferential paradoxes in significance testing that do not apply to Bayesian analyses (Dienes, 2011; Lindley, 1993). Our researcher, as a Bayesian, would not simply average the results of the different experiments together (in an unweighted way). The experiments are all basic events in the reference class; but a Bayesian does not recognize the reference class as relevant to inference. Note that each experiment would have a different number of participants. The events in the reference class are just one arbitrary way of carving up the fu... |

405 | Conjectures and Refutations
- Popper
- 1963
Citation Context ...to simple theory. A useful rule of thumb is that confirming novel rather than post hoc predictions is more likely to provide strong evidence for a simple theory. But that is not to do with some magic about when someone thought of a theory (someone’s brilliance in mentally penetrating the structure of Mother Nature in advance may be relevant to their self-esteem but such personal brilliance does not transfer to the evidential support of the data for the theory: In science it does not matter who you are). The objective properties of theory and data as entities in their own right (Feynman, 1998; Popper, 1963) need to be separated from accidental facts concerning when certain brains thought of the theory. Gelman and Loken (2013) illustrate this beautifully by considering how, in a range of real examples, different results would have more simply confirmed a general theory than the results on offer. The metaphysics and the epistemology get put in their right place by Bayesian inference (getting a prediction right in advance has no metaphysical status as an indication of good theory; but it does help us know when we have one). In considering what a general theory predicts in order to calculate the Bay... |

275 | Why most published research findings are false
- Ioannidis
- 2005
Citation Context ...n as one thinks what level of raw effect size would be predicted in one’s study, one has to carefully consider the literature with eyes one may not have had before, to estimate how well effect sizes in one paper might apply to one’s own, given a change in context. Once effect sizes become relevant to the conclusions one draws, people may pay attention to them. In conclusion, I argue that the use of Bayes factors is a crucial part of the solution to the crisis in which psychology (and other disciplines) find themselves. Now that the problems of what we have been doing up to now are evident (e.g. Ioannidis, 2005; John et al., 2012; Open Science Collaboration, 2015; Pashler & Harris, 2012), I hope Bayes is seriously considered as part of the solution—along with, for example, full transparency and online availability of materials, data and analysis (Nosek et al., 2015); greater emphasis on direct replications as well as multi-experiment theory building (Asendorpf et al., 2013); and increasing use of pre-registration (Chambers, Dienes, McIntosh, Rotshtein, & Willmes, 2015). Appendix A. Comparing error properties of a Bayes factor with inference by intervals One way of distinguishing H1 from H0 is by use...

214 | How persuasive is a good fit? A comment on theory testing.
- Roberts, Pashler
- 2000
Citation Context ...ffect sizes past studies using coffee. If the model is supported how much does that support bear on the theory? That is a matter of scientific judgement, not statistics per se, and will depend on the full context (cf Gelman & Rubin, 1995). The art of science is partly setting up experiments where interesting theories can be compared using simple models, so that the Bayes factor is informative in discriminating the theories. Thus, one should set up a test of a theory, that when translated into a model, makes a risky prediction, i.e. one contradicted by other background knowledge (Popper, 1963; Roberts & Pashler, 2000; Vanpaemel, 2014) so that the Bayes factor is likely to be discriminating if used to compare the contrasting theories. One problem with using Bayes factors is precisely that the psychological theory could be translated to several models; yet the support indicated by any given Bayes factor strictly refers to the model not the theory. Thus, the distribution in the model needs to have those properties that capture relevant predictions of the theory in context, while the distribution’s other properties should not alter the qualitative conclusion drawn from the resulting Bayes factor. If the outco... |

147 | False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant.
- Simmons, Nelson, et al.
- 2011
Citation Context ...is non-significant, papers are less likely to be published (Rosenthal, 1979; Simonsohn, Nelson, & Simmons, 2014). The research record becomes a misleading representation of the evidence. Because the p-value is asymmetric, people seek to get the evidence in the only way it can appear to be strong—as against H0. Thus, apart from failure to publish relevant evidence concerning a theory, another outcome is p-hacking: Pushing the data in the one direction it can for it to be recognized as strong evidence, by use of analytic flexibility (John, Loewenstein, & Prelec, 2012; Masicampo & Lalande, 2012; Simmons, Nelson, & Simonsohn, 2011). No wonder there is a crisis in the credibility of our published results. How Bayes factors help. Bayes factors partly solve the problem by allowing the evidence to go both ways. This means you can tell when there is evidence for the null hypothesis and against the alternative. You can tell when there is good evidence against there being a treatment side effect (and when the evidence is just weak); you can tell when the data count against a theory (and when they count for nothing). There is every reason to publish evidence supporting the null as going against it, because the evidence can be ... |

135 | A statistical paradox
- Lindley
- 1999
Citation Context ...cate (i) strong evidence for H1 and against H0; or (ii) strong evidence for H0 and against H1; or (iii) not much evidence either way. That is, a Bayes factor can make a three-way distinction. A p-value, by contrast, is asymmetric. A small p-value (often) indicates evidence against H0 and for the H1 of interest; but a large p-value does not distinguish evidence for H0 from not much evidence for anything. A p-value only tries to make a two-way distinction: evidence against H0 (i.e. (i)) versus anything else (i.e. (ii) or (iii), without distinguishing them) (and even this it does not do very well; Lindley, 1957). A large p-value is, therefore, never in itself evidence for H0. The asymmetry of p-values leads to many problems that are part of the ‘credibility crisis’ in science (Pashler & Wagenmakers, 2012). The reason why p-values are asymmetric is that they specify only one model: H0. This is their simplicity and hence their beguiling beauty. But their simplicity is simplistic. This paper will argue that using Bayes factors will therefore help solve some (but not all) of the problems leading to the credibility crisis, by changing scientific practice. 1 In symbols: P(H1|D)/P(H0|D) = P(D|H1)/P(D|H0) × ...
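The quoted passage describes, in words, the odds form of Bayes' theorem (the truncated footnote 1 completes to posterior odds = B × prior odds) and the Jeffreys B > 3 / B < 1/3 guidelines quoted elsewhere on this page. A minimal sketch, with function names of my own:

```python
# Odds updating and the conventional three-way reading of a Bayes factor B.
def update_odds(prior_odds, b):
    """P(H1|D)/P(H0|D) = [P(D|H1)/P(D|H0)] * [P(H1)/P(H0)] = B * prior odds."""
    return b * prior_odds

def evidence_label(b):
    """Three-way distinction a Bayes factor makes, per the quoted guidelines."""
    if b > 3:
        return "substantial evidence for H1 over H0"
    if b < 1 / 3:
        return "substantial evidence for H0 over H1"
    return "insensitive (not much evidence either way)"
```

Note that `update_odds` applies the same multiplier B for everyone, whatever their prior odds, which is exactly the sense in which B measures the evidence itself.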

117 | Good thinking: The foundations of probability and its applications
- Good
- 1983
Citation Context ...translated to several models; yet the support indicated by any given Bayes factor strictly refers to the model not the theory. Thus, the distribution in the model needs to have those properties that capture relevant predictions of the theory in context, while the distribution’s other properties should not alter the qualitative conclusion drawn from the resulting Bayes factor. If the outcome is robust to large distributional changes (while respecting the implementation of the same theory), the distributions are acceptable for use in Bayes factors, and the conclusion transfers to the theory (cf Good, 1983). This is referred to as robustness checking. For example if the application of a theory to an experiment indicates that the raw maximum difference should not be more than about m, then try simple distributions that satisfy this judgement yet change their shapes in other ways: Dienes (2014) suggests a uniform from 0 to m; a half-normal with mode 0 and standard deviation m/2; and a normal with mean m/2 and standard deviation m/4. In all cases the (at least rough) maximum is m yet in one case the distribution is flat, in another the probability is pushed up to one side, and in another peaked in ... |
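The robustness check quoted above can be sketched directly: compute B under the three candidate H1 distributions with rough maximum m (uniform on [0, m]; half-normal with SD m/2; normal with mean m/2 and SD m/4) and check that the qualitative conclusion agrees. This is my own illustration (the helper names, integration scheme, and numeric example are assumptions), not code from the paper.

```python
# Robustness check across three H1 shapes sharing a rough maximum m.
import math

def _normpdf(x, mu, sd):
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def bayes_factor(xbar, se, prior_pdf, lo, hi, steps=2000):
    """B = P(D|H1)/P(D|H0) for a sample mean xbar with standard error se."""
    h = (hi - lo) / steps
    marg = 0.0
    for i in range(steps + 1):
        d = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0      # trapezoid-rule weights
        marg += w * prior_pdf(d) * _normpdf(xbar, d, se)
    return (marg * h) / _normpdf(xbar, 0.0, se)

def robustness_region(xbar, se, m):
    """Bayes factors under the three shapes; the conclusions should agree."""
    priors = {
        "uniform":     lambda d: 1.0 / m if 0.0 <= d <= m else 0.0,
        "half-normal": lambda d: 2.0 * _normpdf(d, 0.0, m / 2) if d >= 0.0 else 0.0,
        "normal":      lambda d: _normpdf(d, m / 2, m / 4),
    }
    return {name: bayes_factor(xbar, se, pdf, -2.0 * m, 2.0 * m)
            for name, pdf in priors.items()}
```

If all three Bayes factors fall on the same side of the 3 (or 1/3) threshold, the conclusion is robust to the distributional choice and, per the quote, transfers from the model to the theory.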

96 | A practical solution to the pervasive problems of p values.
- Wagenmakers
- 2007
Citation Context ... becomes irrelevant (though the value of pre-registration is not mitigated). 1. Introduction A Bayes factor is a form of statistical inference in which one model, say H1, is pitted against another, say H0. Both models need to be specified, even if in a default way. Significance testing (using only the p-value for inference, as per Fisher, 1935) involves setting up a model for H0 alone—and yet is typically still used to pit H0 against H1. I will argue that significance testing is in this way flawed, with harmful consequences for the practice of science (Wagenmakers, 2007). Bayes factors, by specifying two models, resolve several key problems (though not all problems). After defining a Bayes factor, the introduction first indicates the general consequences of having two models (namely, the ability to obtain evidence for the null hypothesis; and the fact the alternative has to be specified well enough to make predictions). Then the body of the pa...

93 | Bayesian t tests for accepting and rejecting the null hypothesis.
- Rouder, Speckman, et al.
- 2009
Citation Context ...aving two models The specification of two models in a Bayesian approach, rather than one in significance testing, has two direct consequences: One is that Bayes factors are symmetric in a way that p-values are asymmetric; and, second, Bayes factors relate theory to data in a direct way that is not possible with p-values. Here I clarify what these two properties mean; then the paper will consider in detail how these properties are important for how we do science. First, a Bayes factor, unlike a p-value, is a continuous degree of evidence that can symmetrically favour one model or another (e.g. Rouder, Speckman, Sun, Morey, & Iverson, 2009). Let us call the models H1 and H0. By using conventional criteria, the Bayes factor can indicate whether evidence is weak or strong. Thus, the Bayes factor may indicate (i) strong evidence for H1 and against H0; or (ii) strong evidence for H0 and against H1; or (iii) not much evidence either way. That is, a Bayes factor can make a three-way distinction. A p-value, by contrast, is asymmetric. A small p-value (often) indicates evidence against H0 and for the H1 of interest; but a large p-value does not distinguish evidence for H0 from not much evidence for anything. A p-value only tries to make ...

91 | The likelihood principle.
- Berger, Wolpert
- 1988
Citation Context ...ayesian analysis of the experiments in the Reproducibility Project.) In sum, Bayes factors would enable a more informed evaluation of replications than p-values allow. The need for more direct replications is clear (Pashler & Harris, 2012); but replications are no good if one cannot properly evaluate the results. Now we will consider some inferential paradoxes. The asymmetry of p-values leads to a sensitivity to stopping rules which is inferentially paradoxical, because the same data and theories can be evaluated differently depending on the intentions inside the head of the experimenter (e.g. Berger & Wolpert, 1988). We now consider this and other inferential paradoxes that allow p-hacking. The paradoxes mean that inferential outcome depends on more than the actual data obtained, and may depend on things which are in practice unknowable (the intentions and thoughts of experimenters; see Dienes, 2011 for explanation). The need to correct for multiple testing with significance testing is a paradox in that theories may pass or fail tests on data collected that was irrelevant to the theory, but corrected for anyway. Instead Bayesian approaches in which the model of H1 is informed by scientific context focus on...

63 | Doing Bayesian data analysis: a tutorial with R and BUGS.
- Kruschke
- 2010
Citation Context ...at evidence informs the posterior probabilities for the different theories. The posterior probability that embodied priming of closure is effective may be affected by the evidence for priming using words; that is, if there is priming for words it increases the probability that there could be priming from gestures, and vice versa. The evidence from the other studies, using different priming procedures, may rationally affect the posterior probability of any one of the priming techniques working. This is because these specific theories fall under the same general theory. Gelman et al. (2013) and Kruschke (2010) describe how to set up hierarchical models whereby the posterior distributions of the means of different conditions is automatically influenced by the data from all conditions. This has the effect of making it harder to detect an effect of embodied priming if there were no priming in any other condition (cf correction for multiple testing); but easier if there were priming in other conditions. This rational adjustment cannot be done with non-Bayesian approaches. In essence the procedure provides a sort of correction for multiple testing—but not for the sake of correcting for multiple testing,... |

60 | On the Foundations of Statistical Inference
- Birnbaum
- 1962
Citation Context ...es factor, B × prior belief in one theory versus another. That is, whatever strength of belief one happened to have in different theories prior to data (which will be different for different people), that belief should be updated by the same amount, B, for everyone.1 What this equation tells us is that if we measure strength of evidence of data as the amount by which anyone should change their strength of belief in the two theories in the light of the data, then the only relevant information is provided by the Bayes factor, B (cf Birnbaum, 1962). Conventional approximate guidelines for strength of evidence were provided by Jeffreys (1939, though Bayes factors stand on their own as continuous measures of degrees of evidence). If B > 3 then there is substantial evidence for H1 rather than H0; if B < 1/3 then there is substantial evidence for H0 rather than H1; and if B is in between 1/3 and 3 then the evidence is insensitive. The term ‘prior’ has two meanings in the context of Bayes factors. P(H1) is a prior probability of H1, i.e. how much you believe in H1 before seeing the data. But the term ‘prior’ is also used to refer to setting up th...

56 | Bayesian versus Orthodox statistics: Which side are you on?
- Dienes
- 2011
Citation Context ... the results. Now we will consider some inferential paradoxes. The asymmetry of p-values leads to a sensitivity to stopping rules which is inferentially paradoxical, because the same data and theories can be evaluated differently depending on the intentions inside the head of the experimenter (e.g. Berger & Wolpert, 1988). We now consider this and other inferential paradoxes that allow p-hacking. The paradoxes mean that inferential outcome depends on more than the actual data obtained, and may depend on things which are in practice unknowable (the intentions and thoughts of experimenters; see Dienes, 2011 for explanation). The need to correct for multiple testing with significance testing is a paradox in that theories may pass or fail tests on data collected that was irrelevant to the theory, but corrected for anyway. Instead Bayesian approaches in which the model of H1 is informed by scientific context focus only on the relation between theory and the data that bear on specifically that theory. Similarly, the use of timing of theory versus data as inferentially relevant in itself disguises what is actually very important about pre-registration of studies, as we will discuss. 3 Meta-analytic com...

55 | Measuring the prevalence of questionable research practices with incentives for truth-telling
- John, Loewenstein, et al.
- 2012
Citation Context ...lso are also key significant results. But where the key result is non-significant, papers are less likely to be published (Rosenthal, 1979; Simonsohn, Nelson, & Simmons, 2014). The research record becomes a misleading representation of the evidence. Because the p-value is asymmetric, people seek to get the evidence in the only way it can appear to be strong—as against H0. Thus, apart from failure to publish relevant evidence concerning a theory, another outcome is p-hacking: Pushing the data in the one direction it can for it to be recognized as strong evidence, by use of analytic flexibility (John, Loewenstein, & Prelec, 2012; Masicampo & Lalande, 2012; Simmons, Nelson, & Simonsohn, 2011). No wonder there is a crisis in the credibility of our published results. How Bayes factors help. Bayes factors partly solve the problem by allowing the evidence to go both ways. This means you can tell when there is evidence for the null hypothesis and against the alternative. You can tell when there is good evidence against there being a treatment side effect (and when the evidence is just weak); you can tell when the data count against a theory (and when they count for nothing). There is every reason to publish evidence suppor...

55 | On the psychology of experimental surprises.
- Slovic, Fischhoff
- 1977
Citation Context ...tistical inference, Bayesian or otherwise (Goldacre, 2013). This alone is sufficient to justify an extensive use of pre-registered reports. In addition, pre-registration may help us judge such things as simplicity and elegance of theory more objectively. How much judgements of the properties of theory and their relation to predictions are affected by knowing the results in a naturalistic scientific context needs to be investigated further, but it is likely to be a substantial factor, perhaps moderated by experience (Arkes, 2013; Slovic & Fischhoff, 1977). This is an extra-statistical consideration that does not undermine the direct conclusions that follow from a Bayesian analysis (how well a theory is supported by data relative to another theory), but does raise the issue of the context of scientific judgements within which those conclusions are embedded. Finally, and very importantly, pre-registration helps deal with the problem of analytic flexibility (Chambers, 2015). There are generally various ways of analysing a given data set, each roughly equally justified. What should the cut off for outliers be—two or three SD, or something else? Wh...

52 | Statistical analysis and the illusion of objectivity.
- Berger, Berry
- 1988
Citation Context ...n the conditions one would stop in advance of collecting data—and then stop at that point. By contrast, a Bayes factor B is symmetric. If H0 is false, then, in the long run, B is driven upwards. If H0 is true, B is driven towards zero. Because B is driven in opposite directions dependent on which theory is true, when using a Bayes factor one can stop collecting data whenever one likes (Savage, 1962). Thus, use of Bayes factors respects the ‘‘stopping rule principle’’ according to which the only evidence about a parameter is contained in the data and not the stopping rule used to collect them (Berger & Berry, 1988a,b; Berger & Wolpert, 1988). A useful rule would be to stop collecting data when either B is greater than 3 or less than 1/3; then one has guaranteed an informative conclusion with a minimum number of participants (cf. Schoenbrodt, Wagenmakers, Zehetleitner, & Perugini, in press). (Something which power cannot guarantee: A study can be high-powered but still the data do not discriminate between the models.) While significance testing allows p-hacking by optional stopping, one cannot B-hack by optional stopping. The possibility that one can legitimately ignore the stopping rule would be such a d...

47 | Data analysis for research designs: Analysis of variance and multiple regression/correlation approaches.
- Keppel, Zedeck
- 1989
Citation Context ...ically the final study for the hypothesis that embodied primes are effective remains the same, no matter what other procedures are tested. The Bayes factor remains 3.72 for the evidential worth of the data from the final study, noteworthy evidence for H1 concerning this particular procedure. Is this not a problem for Bayes? Before considering the Bayesian solution, first note the flexibility in the frequentist one. Families do not have to be defined by theoretical question in frequentist statistics, and indeed often are not (they may e.g. be defined by degrees of freedom in an omnibus test, e.g. Keppel & Zedeck, 1989). By contrast, the Bayesian solution is to consider the evidence for each theory. In frequentist terms there is no reason why families could not be made of various subsets of the studies. In frequentist terms if the sixth study was treated as planned it could be tested separately from the others, which are then each corrected at the 0.05/5 level as one family. We will consider planned vs. post hoc tests below. For now we consider how Bayes just depends on the relation of the data to theories. Why multiple comparisons are not a problem for Bayes factors. The priming technique used in the final s...

43 | Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence?
- Pashler, Wagenmakers
- 2012

35 | Positive results increase down the hierarchy of the sciences.
- Fanelli
- 2010
Citation Context ... families cannot be defined arbitrarily, but only by reference to theories of scientific interest. 2.4. Planned versus post hoc tests First we consider the problem, that the timing of theory relative to data intuitively feels important, yet correcting for it introduces inferential arbitrariness; then we consider the Bayesian solution, which removes the arbitrariness. The problem. One intuition is that it is desirable to predict the precise results one obtained in advance of obtaining them. Indeed, in an estimated 92% of papers in psychology and psychiatry, the results confirm the predictions (Fanelli, 2010). Yet when the predictions are made in advance of seeing the data, the confirmation rate is considerably less (Open Science Collaboration, 2015). Scientists feel a pressure to obtain confirmatory results. For significance testing it makes a difference whether one thought of one’s theory before analysing the data or afterwards (planned versus post hoc comparisons). In Bayesian inference all that matters are the data and the theory, not their timing (because the Bayes factor depends just on the probability of the data given the theory). At first, the Bayesian answer might seem strange. We have a... |

33 | Is the replicability crisis overblown? Three arguments examined.
- Pashler, Harris
- 2012
Citation Context ....2 ms, SE = 6 ms, F(1, 21) = 0.001, p = 0.5), and also BH(0,37) = 0.19 (i.e. B < 1/3), with 22 subjects, indicating substantial support for the null. The point is that knowing power alone is not enough; once the data are in, the obtained evidence needs to be assessed for how sensitively H0 is distinguished from H1, and power cannot do this (Dienes, 2014). (Compare Etz, 2015, for a Bayesian analysis of the experiments in the Reproducibility Project.) In sum, Bayes factors would enable a more informed evaluation of replications than p-values allow. The need for more direct replications is clear (Pashler & Harris, 2012); but replications are no good if one cannot properly evaluate the results. Now we will consider some inferential paradoxes. The asymmetry of p-values leads to a sensitivity to stopping rules which is inferentially paradoxical, because the same data and theories can be evaluated differently depending on the intentions inside the head of the experimenter (e.g. Berger & Wolpert, 1988). We now consider this and other inferential paradoxes that allow p-hacking. The paradoxes mean that inferential outcome depends on more than the actual data obtained, and may depend on things which are in practice ...

26 | Recommendations for increasing replicability in psychology
- Asendorpf, Conner, et al.
- 2013

19 | Head up, foot down: object words orient attention to the objects’ typical location.
- Estes, Verges, et al.
- 2008

19 | The foundations of statistical inference: a discussion.
- Savage
- 1962
Citation Context ...rticipants after initially failing to get a significant result. If one decides to continue running until a significant result is obtained, significance is guaranteed even if H0 is true. Thus, one has to decide on the conditions one would stop in advance of collecting data—and then stop at that point. By contrast, a Bayes factor B is symmetric. If H0 is false, then, in the long run, B is driven upwards. If H0 is true, B is driven towards zero. Because B is driven in opposite directions dependent on which theory is true, when using a Bayes factor one can stop collecting data whenever one likes (Savage, 1962). Thus, use of Bayes factors respects the ‘‘stopping rule principle’’ according to which the only evidence about a parameter is contained in the data and not the stopping rule used to collect them (Berger & Berry, 1988a,b; Berger & Wolpert, 1988). A useful rule would be to stop collecting data when either B is greater than 3 or less than 1/3; then one has guaranteed an informative conclusion with a minimum number of participants (cf. Schoenbrodt, Wagenmakers, Zehetleitner, & Perugini, in press). (Something which power cannot guarantee: A study can be high-powered but still the data do not discri...
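The stopping rule quoted above (stop once B > 3 or B < 1/3) can be sketched as a sequential simulation. This is my own illustration, not the paper's code: it assumes a half-normal H1 with SD 1, observations with known SD 1, and hypothetical function names.

```python
# Sequential optional stopping with a Bayes factor, per the quoted rule.
import math
import random

def _normpdf(x, mu, sd):
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def bf_halfnormal(xbar, n, prior_sd=1.0, steps=400):
    """B for the mean of n observations (SD 1), half-normal H1 prior."""
    se = 1.0 / math.sqrt(n)
    h = 6.0 * prior_sd / steps                   # integrate effect over [0, 6*SD]
    marg = sum((0.5 if i in (0, steps) else 1.0)
               * 2.0 * _normpdf(i * h, 0.0, prior_sd)
               * _normpdf(xbar, i * h, se)
               for i in range(steps + 1)) * h
    return marg / _normpdf(xbar, 0.0, se)

def run_until_informative(true_dz, threshold=3.0, max_n=200, seed=1):
    """Add one participant at a time; stop when B crosses either threshold."""
    rng = random.Random(seed)
    xs = []
    for n in range(1, max_n + 1):
        xs.append(rng.gauss(true_dz, 1.0))
        b = bf_halfnormal(sum(xs) / n, n)
        if b > threshold:
            return "H1", n, b
        if b < 1.0 / threshold:
            return "H0", n, b
    return "undecided", max_n, b
```

Because B is driven up when H0 is false and down when H0 is true, stopping on this rule is not "B-hacking": either outcome, when it arrives, is an informative one.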

19 |
Prior sensitivity in theory testing: An apologia for the Bayes factor.
- Vanpaemel
- 2010
Citation Context ...non-significant result might not actually be evidence for the null hypothesis, as we shall see. Further, it involves (or should involve) specifying only the minimally interesting effect size, which is a rather incomplete specification of H1 (and it is the aspect of H1 most difficult to make in many cases). In practice, psychologists are happy to assert null hypotheses even when power has not been calculated, and inference is based on p-values alone (as we shall see). The second consequence of having to specify H1 as well as H0 is that thought must be given to what one's theory actually predicts (Vanpaemel, 2010). In this way, Bayes factors allow a more intimate connection between theory and data than p-values allow. This issue is particularly important for dealing with issues of multiple testing and the timing of theorizing versus collecting data. I conjecture that a Bayesian view of these issues will lead to a more probing exploration of theory than significance testing encourages, a point taken up at the end. The paper now considers in detail the specific changes to scientific practice the use of Bayes factors may bring about. Specifically it considers, in order, issues of obtaining support for the...

16 |
The secret lives of experiments: Methods reporting in the fMRI literature.
- Carp
- 2012
Citation Context ...analysis method was the one method that did not obtain support for H0, the Bayesian analysis now fails to reflect the overall message of the data, even though the method was pre-registered. Thus, having all data transparently available must also be part of the solution. Then anyone with the time can check different ways of analysing the data for themselves. And in any argument that ensues, it may be worth bearing in mind that the pre-registered method is likely, but is not guaranteed, to reflect what the data say on balance. The argument for pre-registration is particularly compelling for fMRI. Carp (2012a; see also 2012b) considers common analytic decisions made in the fMRI literature and shows these lead to 34,560 significance maps, which can be substantially different from each other. Considering all of these for each experiment is not feasible. Pre-registration would mean the Bayes factors calculated are likely to be reflective of the data. In sum, using Bayes factors would change scientific practice by focusing attention on what matters—the relation of data to theory. No one would have pressure to pretend when they thought of the theory. People can focus on how simple and elegant the theo...

14 |
Bayesian cognitive modeling: A practical course. Cambridge, England:
- Lee, Wagenmakers
- 2013
Citation Context ...0 0 0 0 0 Table 4 Decision rates for accepting/rejecting H0 for NR[−0.1, 0.1] MaxN = 1000 MinN = 1. Minimum width: None 10 ∗ NRW 5 ∗ NRW 4 ∗ NRW 3 ∗ NRW 2 ∗ NRW NRW 0.5 ∗ NRW Actual effect: dz = 0 Reject 19 12 6 5 2 1 0 0 Accept 72 78 83 84 86 88 89 0 dz = 1 Reject 100 100 100 100 100 100 100 0 Accept 0 0 0 0 0 0 0 0 used (with reasons) on the grounds such a distribution is likely to reflect the conclusion from most simple representations of the theory (see Section 2.4). An example of Bayes factors motivating a closer consideration of theory is provided by Dienes (2015); see also the examples in Lee and Wagenmakers (2014): Sometimes Bayes requires that extra data are gathered on a different condition in order to interpret another condition, data not demanded by p-value calculations. For example, in order to claim that a measure of conscious knowledge shows chance performance, we need data to estimate what level of conscious performance could be expected if the priming or learning performance claimed to be unconscious had actually been based on conscious knowledge. Further, as soon as one thinks what level of raw effect size would be predicted in one's study, one has to carefully consider the literature with eye...

13 |
The meaning of it all.
- Feynman
- 1998
Citation Context ...rarily related to simple theory. A useful rule of thumb is that confirming novel rather than post hoc predictions is more likely to provide strong evidence for a simple theory. But that is not to do with some magic about when someone thought of a theory (someone’s brilliance in mentally penetrating the structure of Mother Nature in advance may be relevant to their self-esteem but such personal brilliance does not transfer to the evidential support of the data for the theory: In science it does not matter who you are). The objective properties of theory and data as entities in their own right (Feynman, 1998; Popper, 1963) need to be separated from accidental facts concerning when certain brains thought of the theory. Gelman and Loken (2013) illustrate this beautifully by considering how, in a range of real examples, different results would have more simply confirmed a general theory than the results on offer. The metaphysics and the epistemology get put in their right place by Bayesian inference (getting a prediction right in advance has no metaphysical status as an indication of good theory; but it does help us know when we have one). In considering what a general theory predicts in order to ca... |

12 | Avoiding model selection in Bayesian social research.
- Gelman, Rubin
- 1995
Citation Context ... model versus H0; it is an extra-statistical matter to decide what work the theory did and how much credit it should get. For example, the theory that caffeine improves concentration because it is a placebo predicts that a cup of coffee should enhance performance on a concentration task. The exact model for how much a cup of coffee enhances concentration could be informed by the effect sizes of past studies using coffee. If the model is supported how much does that support bear on the theory? That is a matter of scientific judgement, not statistics per se, and will depend on the full context (cf. Gelman & Rubin, 1995). The art of science is partly setting up experiments where interesting theories can be compared using simple models, so that the Bayes factor is informative in discriminating the theories. Thus, one should set up a test of a theory, that when translated into a model, makes a risky prediction, i.e. one contradicted by other background knowledge (Popper, 1963; Roberts & Pashler, 2000; Vanpaemel, 2014) so that the Bayes factor is likely to be discriminating if used to compare the contrasting theories. One problem with using Bayes factors is precisely that the psychological theory could be transl...

11 |
On the plurality of (methodological) worlds: estimating the analytic flexibility of fMRI experiments. Frontiers in Neuroscience,
- Carp
- 2012
Citation Context ...analysis method was the one method that did not obtain support for H0, the Bayesian analysis now fails to reflect the overall message of the data, even though the method was pre-registered. Thus, having all data transparently available must also be part of the solution. Then anyone with the time can check different ways of analysing the data for themselves. And in any argument that ensues, it may be worth bearing in mind that the pre-registered method is likely, but is not guaranteed, to reflect what the data say on balance. The argument for pre-registration is particularly compelling for fMRI. Carp (2012a; see also 2012b) considers common analytic decisions made in the fMRI literature and shows these lead to 34,560 significance maps, which can be substantially different from each other. Considering all of these for each experiment is not feasible. Pre-registration would mean the Bayes factors calculated are likely to be reflective of the data. In sum, using Bayes factors would change scientific practice by focusing attention on what matters—the relation of data to theory. No one would have pressure to pretend when they thought of the theory. People can focus on how simple and elegant the theo...

10 |
The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ''fishing expedition'' or ''p-hacking'' and the research hypothesis was posited ahead of time. Unpublished paper,
- Gelman, Loken
- 2013
Citation Context ...ly to provide strong evidence for a simple theory. But that is not to do with some magic about when someone thought of a theory (someone’s brilliance in mentally penetrating the structure of Mother Nature in advance may be relevant to their self-esteem but such personal brilliance does not transfer to the evidential support of the data for the theory: In science it does not matter who you are). The objective properties of theory and data as entities in their own right (Feynman, 1998; Popper, 1963) need to be separated from accidental facts concerning when certain brains thought of the theory. Gelman and Loken (2013) illustrate this beautifully by considering how, in a range of real examples, different results would have more simply confirmed a general theory than the results on offer. The metaphysics and the epistemology get put in their right place by Bayesian inference (getting a prediction right in advance has no metaphysical status as an indication of good theory; but it does help us know when we have one). In considering what a general theory predicts in order to calculate the Bayes factor, one might be tempted to use the obtained data to refine the estimate of the magnitude of the prediction for th... |

10 |
Bayes factor approaches for testing interval null hypotheses.
- Morey, Rouder
- 2011
Citation Context ... alternative. You can tell when there is good evidence against there being a treatment side effect (and when the evidence is just weak); you can tell when the data count against a theory (and when they count for nothing). There is as much reason to publish evidence supporting the null as evidence going against it, because the evidence can be measured to be just as strong either way (thus the published record can be balanced). In fact, the Bayes factor is the only way for indicating the strength of evidence for a point null hypothesis (though for a Bayes factor H0 need not be a point value; Dienes, 2014; Morey & Rouder, 2011). People can still ''B-hack'' (i.e. massage data to get a Bayes factor just beyond some conventional threshold by the use of analytic flexibility), but we will explore how options are more limited than for p-hacking in important ways. Power and replication. Replications are hard to evaluate by reference to p-values. If an original result was significant, and a direct replication non-significant, it might feel like a failure to replicate. But as p-values cannot indicate whether the null hypothesis is supported, a non-significant replication tells one nothing in itself. This is even true for high...

9 |
P-curve: A key to the file drawer.
- Simonsohn, Nelson, et al.
- 2014
Citation Context ... non-significant (it was later shown to have such a risk, Goldacre, 2013). Human death aside, do we want to guide our theory development partly on conclusions that are groundless? Researchers may know that inferring the null hypothesis from a non-significant result is suspect. That obviously does not stop the practice from happening, it just makes sure it happens freely in papers where there are also key significant results. But where the key result is non-significant, papers are less likely to be published (Rosenthal, 1979; Simonsohn, Nelson, & Simmons, 2014). The research record becomes a misleading representation of the evidence. Because the p-value is asymmetric, people seek to get the evidence in the only way it can appear to be strong—as against H0. Thus, apart from failure to publish relevant evidence concerning a theory, another outcome is p-hacking: Pushing the data in the one direction that allows it to be recognized as strong evidence, by use of analytic flexibility (John, Loewenstein, & Prelec, 2012; Masicampo & Lalande, 2012; Simmons, Nelson, & Simonsohn, 2011). No wonder there is a crisis in the credibility of our published results. H...

8 |
The relevance of stopping rules in statistical inference. In
- Berger, Berry
- 1988
Citation Context ...n the conditions one would stop in advance of collecting data—and then stop at that point. By contrast, a Bayes factor B is symmetric. If H0 is false, then, in the long run, B is driven upwards. If H0 is true, B is driven towards zero. Because B is driven in opposite directions dependent on which theory is true, when using a Bayes factor one can stop collecting data whenever one likes (Savage, 1962). Thus, use of Bayes factors respects the ''stopping rule principle'' according to which the only evidence about a parameter is contained in the data and not the stopping rule used to collect them (Berger & Berry, 1988a,b; Berger & Wolpert, 1988). A useful rule would be to stop collecting data when either B is greater than 3 or less than 1/3; then one has guaranteed an informative conclusion with a minimum number of participants (cf. Schoenbrodt, Wagenmakers, Zehetleitner, & Perugini, in press). (Something which power cannot guarantee: A study can be high-powered but still the data do not discriminate between the models.) While significance testing allows p-hacking by optional stopping, one cannot B-hack by optional stopping. The possibility that one can legitimately ignore the stopping rule would be such a d...

8 |
1/f noise and effort on implicit measures of bias.
- Correll
- 2008
Citation Context ...d the experiment. Thus, it is rational to consider how the data actually come out to consider what they say, and power cannot do this. Most theories allow more than just one point value; then Bayes factors can be used to specify the strength of evidence. For example, consider the Reproducibility Project (https://osf.io/ezcuj/) spearheaded by Brian Nosek (Open Science Collaboration, 2015). The aim was to establish how well 100 experiments published in 2008 in high impact journals in psychology replicate, when the exact methods specified are followed as closely as possible. In the replication of Correll (2008) by LeBel (https://osf.io/fejxb/ wiki/home/), the original ''PSD slope'' reported in the Correll paper (Study 2) was 0.18, SE = 0.077, F(1, 68) = 5.52, p < 0.02. The attempted direct replication doubled sample size to achieve a power of 85%. The slope in the replication was 0.05, SE = 0.056, F(1, 145) = 0.79, p = 0.37. (Footnote: The sample difference being small, zero, or in the wrong direction does not in itself provide sufficient grounds either; see Dienes (2014) for examples.) This looks like a ''failure'' to replicate. In fact, calculating a Bayes factor (see Dienes, 2014, 2015, for details of how t...

8 | The analysis of experimental data: The appreciation of tea and wine.
- Lindley
- 1993
Citation Context ...placebo. Even if the drug were ineffective, each estimate would have a tendency to indicate that people were happier on the drug; that is, the mean of all the estimates would show greater happiness on the drug than on the placebo. Is not this a problem for an experiment, even if analysed by Bayesian statistics? The clue to the solution is that bias is inherently a frequentist notion, with need of a reference class (Howson & Urbach, 2006); yet it is the use of reference classes that leads to the inferential paradoxes in significance testing that do not apply to Bayesian analyses (Dienes, 2011; Lindley, 1993). Our researcher, as a Bayesian, would not simply average the results of the different experiments together (in an unweighted way). The experiments are all basic events in the reference class; but a Bayesian does not recognize the reference class as relevant to inference. Note that each experiment would have a different number of participants. The events in the reference class are just one arbitrary way of carving up the full set of data (as given by stringing together the infinite number of experiments the researcher runs). Different stopping rules (defining different reference classes) would... |

7 |
Informative hypotheses: theory and practice for behavioral and social scientists. Chapman and Hall/CRC.
- Hoijtink
- 2011
Citation Context ...theory about what is going on, consider what other voxels are now implicated in testing the theory. Only when all evidence relevant to the superordinate theory has been taken into account can the superordinate theory be evaluated.) The strategy suggested so far relies on using a Bayes factor to test a single degree-of-freedom hypothesis. This provides a simple broadly applicable strategy but the use of Bayes factors is not limited to this strategy. A superordinate theory that specifies a rank ordering of means in different conditions can also be tested with a Bayes factor using the methods of Hoijtink (2011). For example, a theory that specified that the mean for the first and second conditions should be the same but higher than those from a third, specifies a set of ordinal constraints which together are richer than a single degree-of-freedom comparison. An editor might be especially prepared to accept a paper in favour of or against a superordinate theory if the theory received substantial evidence as a whole (either for or against), regardless of the direction of specific cherry-picked comparisons. Of course, the single degree of freedom comparisons (first mean versus second mean; their average...
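One way to see how a rank-order constraint can be tested with a Bayes factor is the encompassing-prior idea behind methods like Hoijtink's: B is roughly the posterior probability of the constraint divided by its prior probability. The sketch below is my own simplified illustration, not Hoijtink's actual procedure; it uses a strict ordering of three means, with posteriors approximated as independent normals centred on the observed means (a vague-prior assumption of this sketch).

```python
# Simplified illustration of an order-constrained Bayes factor:
# B ~= Pr(constraint | posterior) / Pr(constraint | prior).
import random

def ordered_bayes_factor(means, ses, draws=20000, seed=7):
    """BF for 'mean1 > mean2 > mean3' against an unconstrained model."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # Draw one sample per condition from its approximate posterior.
        m = [rng.gauss(mu, se) for mu, se in zip(means, ses)]
        if m[0] > m[1] > m[2]:
            hits += 1
    posterior_prob = hits / draws
    prior_prob = 1.0 / 6.0  # exchangeable priors: all 6 orderings equally likely
    return posterior_prob / prior_prob
```

With three exchangeable means the prior probability of any strict ordering is 1/6, so B for this constraint is capped at 6; clearly ordered data push it near that cap, while reversed data push it towards 0.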

6 | The frequentist implications of optional stopping on Bayesian hypothesis tests. - Sanborn & Hills - 2014 |

5 |
An introduction to the file drawer problem.
- Rosenthal
- 1979
Citation Context ...rease of risk was non-significant (it was later shown to have such a risk, Goldacre, 2013). Human death aside, do we want to guide our theory development partly on conclusions that are groundless? Researchers may know that inferring the null hypothesis from a non-significant result is suspect. That obviously does not stop the practice from happening, it just makes sure it happens freely in papers where there are also key significant results. But where the key result is non-significant, papers are less likely to be published (Rosenthal, 1979; Simonsohn, Nelson, & Simmons, 2014). The research record becomes a misleading representation of the evidence. Because the p-value is asymmetric, people seek to get the evidence in the only way it can appear to be strong—as against H0. Thus, apart from failure to publish relevant evidence concerning a theory, another outcome is p-hacking: Pushing the data in the one direction that allows it to be recognized as strong evidence, by use of analytic flexibility (John, Loewenstein, & Prelec, 2012; Masicampo & Lalande, 2012; Simmons, Nelson, & Simonsohn, 2011). No wonder there is a crisis in the cre...

5 | Optional stopping: No problem for Bayesians.
- Rouder
- 2014
Citation Context ...ons). Each number in Table 1 is the outcome of 200 simulations. Appendix B gives the R code. Appendix A shows the results for different types of Bayes factors. Notice that when the same threshold for B (i.e. three/a third) is used as for our example with a fixed number of subjects in the last paragraph, the false alarm rate for when H0 was true increased from 1% to 14%. That is, the stopping rule affected the false alarm rate of the Bayes factor. Does this not contradict the claim that inference using B is immune to the stopping rule? Why the stopping rule is not a problem for Bayes factors. Rouder (2014) argued elegantly for why the sensitivity of the false alarm rate to the stopping rule is consistent with inference from B remaining immune to the stopping rule. Here the same argument will be put slightly differently. First notice that the equation 'posterior odds = B ∗ prior odds' follows from the axioms of probability. That is, given that the axioms normatively specify how the strength of belief should be changed, B is normatively the amount by which the strength of belief should be changed regardless of the stopping rule. If strength of evidence is measured by how much in principle beliefs ...

5 | A Bayes-factor meta-analysis of Bem's ESP claim. - Rouder & Morey - 2011 |

3 |
The humble Bayesian: Model checking from a fully Bayesian perspective.
- Morey, Romeijn, et al.
- 2013
Citation Context ...minimum number of participants is used. Nonetheless if one wanted to control false alarm rate, in addition to discriminability, the Supplementary Materials (see Appendix B) would allow the reader to work out how to do so by, for example, changing the minimum number of participants, or raising the threshold of B. (There is another reason to run a minimum number of participants: the validity of the Bayes factor, as for any statistical test, depends on the assumptions of the statistical model of the data being approximately true. A minimum number of participants allows assumptions to be checked; Morey, Romeijn, & Rouder, 2013). The Supplementary Materials (see Appendix B) also provide results for different types of Bayes factors. Appendix A illustrates how Bayes factors have better error properties as a function of the stopping rule not only than significance testing, but also than the use of confidence or credibility intervals. In sum, Bayes factors can be used as a measure of evidence irrespective of the stopping rule, and hence optional stopping is not a form of B-hacking. In fact stopping when B > 3 or < 1/3 (or any other threshold) would enable stopping when the data are just as discriminating as needed. This... |

3 |
Promoting an open research culture.
- Nosek, Alter, et al.
- 2015
Citation Context ...ntext. Once effect sizes become relevant to the conclusions one draws, people may pay attention to them. In conclusion, I argue that the use of Bayes factors is a crucial part of the solution to the crisis in which psychology (and other disciplines) find themselves. Now that the problems of what we have been doing up to now are evident (e.g. Ioannidis, 2005; John et al., 2012; Open Science Collaboration, 2015; Pashler & Harris, 2012), I hope Bayes is seriously considered as part of the solution—along with, for example, full transparency and online availability of materials, data and analysis (Nosek et al., 2015); greater emphasis on direct replications as well as multi-experiment theory building (Asendorpf et al., 2013); and increasing use of pre-registration (Chambers, Dienes, McIntosh, Rotshtein, & Willmes, 2015). Appendix A. Comparing error properties of a Bayes factor with inference by intervals One way of distinguishing H1 from H0 is by use of inference by intervals (Dienes, 2014). This requires specifying not a point null, but a null region, whose limits are the minimally interesting effect size. According to the rules of inference by intervals, if the interval (confidence, credibility, or like...
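The decision rules for inference by intervals quoted at the end of this context translate directly into code. A small sketch (my own naming, not from the paper): accept H0 when the interval lies inside the null region, accept H1 when it lies entirely outside, and withhold judgement otherwise.

```python
# Sketch of the interval-based decision rule described in the context above:
# compare an interval (confidence, credibility, or likelihood) against a null
# region whose limits are the minimally interesting effect size.
def interval_decision(interval, null_region):
    """Return 'accept H0', 'accept H1', or 'no decision'."""
    lo, hi = interval
    null_lo, null_hi = null_region
    if null_lo <= lo and hi <= null_hi:
        return "accept H0"   # interval contained within the null region
    if hi < null_lo or lo > null_hi:
        return "accept H1"   # interval entirely outside the null region
    return "no decision"

# Example with the null region NR[-0.1, 0.1] used in Table 4:
decision = interval_decision((0.2, 0.6), (-0.1, 0.1))   # "accept H1"
```

The third outcome is what distinguishes this rule from a forced accept/reject: an interval straddling a null-region boundary licenses no decision, which is why minimum-width requirements (the NRW columns of Table 4) matter.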

3 |
An ethical approach to peeking at data.
- Sagarin, Ambler, et al.
- 2014
Citation Context ...rm rate was only 1%. Table 1 indicates what happened when the stopping rule was as follows: After every participant, check to see if either B > 3 or else B < 1/3. If so, stop. Otherwise run another participant and continue until either the threshold is crossed or else 100 subjects are reached. In terms of researcher practice, this is a worst case scenario; researchers do not typically check after every participant, but maybe only two or three times when the initial result is non-significant; see Dienes (2011) for why the latter practice is wrong when uncorrected for orthodox statistics (and see Sagarin, Ambler, & Lee, 2014, for appropriate corrections). Each number in Table 1 is the outcome of 200 simulations. Appendix B gives the R code. Appendix A shows the results for different types of Bayes factors. Notice that when the same threshold for B (i.e. three/a third) is used as for our example with a fixed number of subjects in the last paragraph, the false alarm rate for when H0 was true increased from 1% to 14%. That is, the stopping rule affected the false alarm rate of the Bayes factor. Does this not contradict the claim that inference using B is immune to the stopping rule? Why the stopping rule is not a p...
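The optional-stopping simulation this context describes (check BH(0, 1) after every participant, stop at B > 3 or B < 1/3, MaxN = 100) can be roughly re-created as follows. This is my own sketch, not the paper's R code from Appendix B: it assumes a known population SD of 1 and summary data as mean plus standard error, so the rates it produces will only approximate the paper's Tables 1 and 2.

```python
# Rough re-creation (my sketch, not the paper's Appendix B code) of the
# optional-stopping simulation: under a true H0, check BH(0, 1) after every
# participant from min_n onwards and stop at B > 3 (reject) or B < 1/3 (accept).
import math
import random

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(x, mu, sd):
    """Normal density."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf_half_normal(mean, se, h1_sd=1.0):
    """Closed-form B for H1: effect ~ half-normal(0, h1_sd) vs a point H0."""
    s2 = h1_sd ** 2 + se ** 2
    post_mean = mean * h1_sd ** 2 / s2
    post_sd = h1_sd * se / math.sqrt(s2)
    marginal_h1 = 2.0 * phi(mean, 0.0, math.sqrt(s2)) * Phi(post_mean / post_sd)
    return marginal_h1 / phi(mean, 0.0, se)

def false_alarm_rate(min_n, runs=500, max_n=100, threshold=3.0, seed=42):
    """Proportion of H0-true runs ending with B > threshold (false rejection)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)          # H0 is true
            if n < min_n:
                continue                          # no peeking before MinN
            b = bf_half_normal(total / n, 1.0 / math.sqrt(n))
            if b > threshold:
                rejections += 1
                break
            if b < 1.0 / threshold:
                break                             # H0 accepted; run ends
        # runs reaching max_n undecided count as neither accept nor reject
    return rejections / runs
```

Raising `min_n` delays the first check, which is the manipulation the paper reports as roughly halving the threshold-3 false-alarm rate between Table 1 (MinN = 1, 14%) and Table 2 (MinN = 10).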

2 |
Instead of ‘‘playing the game’’ it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond.
- Chambers, Feredoes, et al.
- 2014
Citation Context ...t to us beyond any other arbitrary version. While there may be evidence for the specific theory that priming occurs in this context with magnitude 12.63 s, there may be evidence against the general theory that priming occurs in this context (cf. Section 2.3). Further, if a mechanism of priming occurred to you after looking at the data, and for reasons independent of the data that mechanism would predict a likely priming effect of 12 ms, the data provide support for that theory. The Bayesian answer helps show why pre-registered reports, such as used in Cortex and now at least 16 other journals (Chambers, Feredoes, Muthukumaraswamy, & Etchells, 2014; Wagenmakers et al., 2012; see the website Registered Reports, 2015, for regular updates) are valuable. It is not due to the magical power of guessing Nature in advance. Rather, pre-registration ensures the public availability of all results that are pre-registered, regardless of the pattern, which is important for all approaches to statistical inference, Bayesian or otherwise (Goldacre, 2013). This alone is sufficient to justify an extensive use of pre-registered reports. In addition, pre-registration may help us judge such things as simplicity and elegance of theory more objectively. How mu...

2 |
How Bayesian statistics are needed to determine whether mental states are unconscious. In
- Dienes
- 2015
Citation Context ...00 100 100 100 100 100 0 0 Accept 0 0 0 0 0 0 0 0 Table 4 Decision rates for accepting/rejecting H0 for NR[−0.1, 0.1] MaxN = 1000 MinN = 1. Minimum width: None 10 ∗ NRW 5 ∗ NRW 4 ∗ NRW 3 ∗ NRW 2 ∗ NRW NRW 0.5 ∗ NRW Actual effect: dz = 0 Reject 19 12 6 5 2 1 0 0 Accept 72 78 83 84 86 88 89 0 dz = 1 Reject 100 100 100 100 100 100 100 0 Accept 0 0 0 0 0 0 0 0 used (with reasons) on the grounds such a distribution is likely to reflect the conclusion from most simple representations of the theory (see Section 2.4). An example of Bayes factors motivating a closer consideration of theory is provided by Dienes (2015); see also the examples in Lee and Wagenmakers (2014): Sometimes Bayes requires that extra data are gathered on a different condition in order to interpret another condition, data not demanded by p-value calculations. For example, in order to claim that a measure of conscious knowledge shows chance performance, we need data to estimate what level of conscious performance could be expected if the priming or learning performance claimed to be unconscious had actually been based on conscious knowledge. Further, as soon as one thinks what level of raw effect size would be predicted in one's study, ...

2 |
Bayes’ rule: a tutorial introduction to Bayesian analysis.
- Stone
- 2013
Citation Context ...defining a Bayes factor, the introduction first indicates the general consequences of having two models (namely, the ability to obtain evidence for the null hypothesis; and the fact the alternative has to be specified well enough to make predictions). Then the body of the paper explores four ways in which these consequences may change the practice of science for the better. 1.1. What is a Bayes factor? In order to define a Bayes factor, the following equation can be derived with a few steps from the axioms of probability (e.g. Stone, 2013): Normative posterior belief in one theory versus another in the light of data = a Bayes factor, B × prior belief in one theory versus another. That is, whatever strength of belief one happened to have in different theories prior to data (which will be different for different people), that belief should be updated by the same amount, B, for everyone. What this equation tells us is that if we measure strength of evidence of data as the amount by which anyone should change their strength of belief in the two theories in the light o...
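The equation quoted in this context (posterior odds = B × prior odds) can be made concrete with a toy calculation; the names and the prior odds below are illustrative only.

```python
# Worked instance of the identity quoted above: the same B updates every
# prior by the same multiplicative factor, whatever that prior is.
def update_odds(prior_odds, bayes_factor):
    """Posterior odds on H1 over H0: posterior odds = B * prior odds."""
    return bayes_factor * prior_odds

def odds_to_probability(odds):
    """Convert odds on H1 to a probability of H1."""
    return odds / (1.0 + odds)

# Two researchers see the same data, for which B = 3 in favour of H1:
sceptic_posterior = update_odds(prior_odds=0.25, bayes_factor=3.0)   # 0.75
believer_posterior = update_odds(prior_odds=4.0, bayes_factor=3.0)   # 12.0
```

The two posteriors differ (odds of 0.75 versus 12), but both moved by the same factor B = 3; that is the sense in which B measures the evidence in the data independently of anyone's prior.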

2 | When decision heuristics and science collide. - Yu, Sprenger, et al. - 2014 |

1 |
The consequences of the hindsight bias in medical decision making.
- Arkes
- 2013
Citation Context ...oaches to statistical inference, Bayesian or otherwise (Goldacre, 2013). This alone is sufficient to justify an extensive use of pre-registered reports. In addition, pre-registration may help us judge such things as simplicity and elegance of theory more objectively. How much judgements of the properties of theory and their relation to predictions are affected by knowing the results in a naturalistic scientific context needs to be investigated further, but it is likely to be a substantial factor, perhaps moderated by experience (Arkes, 2013; Slovic & Fischhoff, 1977). This is an extra-statistical consideration that does not undermine the direct conclusions that follow from a Bayesian analysis (how well a theory is supported by data relative to another theory), but does raise the issue of the context of scientific judgements within which those conclusions are embedded. Finally, and very importantly, pre-registration helps deal with the problem of analytic flexibility (Chambers, 2015). There are generally various ways of analysing a given data set, each roughly equally justified. What should the cut off for outliers be—two or thre...

1 |
Ten reasons why journals must review manuscripts before results are known.
- Chambers
- 2015
Citation Context ... further, but it is likely 86 Z. Dienes / Journal of Mathematical Psychology 72 (2016) 78–89to be a substantial factor, perhaps moderated by experience (Arkes, 2013; Slovic & Fischhoff, 1977). This is an extra-statistical consideration that does not undermine the direct conclusions that follow from a Bayesian analysis (how well a theory is supported by data relative to another theory), but does raise the issue of the context of scientific judgements within which those conclusions are embedded. Finally, and very importantly, pre-registration helps deal with the problem of analytic flexibility (Chambers, 2015). There are generally various ways of analysing a given data set, each roughly equally justified. What should the cut off for outliers be—two or three SD, or something else? What transformation might be used, if the data look roughly equally normal with several? Should a covariate be added? Should the dependent variables be combined, or one of them dropped? And so on. Such considerations can affect Bayes factors as much as t-tests. It is possible to B-hack. Imagine that out of 10 roughly equally valid analysis methods, nine indicate support for H0 and one does not, as shown by a Bayes factor i... |

1 |
Registered reports: realigning incentives in scientific publishing.
- Chambers, Dienes, et al.
- 2015
Citation Context ...part of the solution to the crisis in which psychology (and other disciplines) find themselves. Now that the problems of what we have been doing up to now are evident (e.g. Ioannidis, 2005; John et al., 2012; Open Science Collaboration, 2015; Pashler & Harris, 2012), I hope Bayes is seriously considered as part of the solution—along with, for example, full transparency and online availability of materials, data and analysis (Nosek et al., 2015); greater emphasis on direct replications as well as multi-experiment theory building (Asendorpf et al., 2013); and increasing use of pre-registration (Chambers, Dienes, McIntosh, Rotshtein, & Willmes, 2015). Appendix A. Comparing error properties of a Bayes factor with inference by intervals One way of distinguishing H1 from H0 is by use of inference by intervals (Dienes, 2014). This requires specifying not a point null, but a null region, whose limits are the minimally interesting effect size. According to the rules of inference by intervals, if the interval (confidence, credibility, or likelihood) is contained within the null region, then the null region hypothesis can be accepted. If the interval is entirely outside the null region, then H1 can be accepted. (Though see Morey et al. (in press...

1 |
Using Bayes to get the most out of non-significant results.
- Dienes
- 2014
Citation Context ...h states that the data are B times more probable under H1 than under H0. Briefly, posterior odds = B × prior odds. The symmetry is particularly important in determining support for the null hypothesis, interpreting replications, and p-hacking by optional stopping, all practical issues discussed below. The strict use of only one model is Fisherian; Neyman and Pearson (1967) argued that two models should be used, and introduced the concept of power, which helps introduce symmetry in inference, in that it provides grounds for asserting the null hypothesis. Unfortunately, power is a flawed solution (Dienes, 2014), and that might explain why it is not always taken up. Power cannot be determined based on the actual data in order to assess their sensitivity; hence, a high-powered non-significant result might not actually be evidence for the null hypothesis, as we shall see. Further, it involves (or should involve) specifying only the minimally interesting effect size, which is a rather incomplete specification of H1 (and it is the aspect of H1 that is most difficult to specify in many cases). In practice, psychologists are happy to assert null hypotheses even when power has not been calculated, and inference is based on ...
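The odds form of Bayes' rule quoted here (posterior odds = B × prior odds) and the symmetric reading of the Bayes factor can be sketched as follows. The function names, and the conventional threshold of 3 for labelling the evidence, are illustrative assumptions rather than the paper's code:

```python
def update_odds(prior_odds, bayes_factor):
    """Bayes' rule in odds form: posterior odds = B x prior odds."""
    return bayes_factor * prior_odds

def evidence_label(b, threshold=3):
    """Symmetric reading of a Bayes factor: the evidence can favour either
    model, or be too weak to favour one over the other."""
    if b >= threshold:
        return "supports H1"
    if b <= 1 / threshold:
        return "supports H0"
    return "insensitive"
```

Unlike a p-value, B = 0.2 and B = 10 are both informative outcomes here: the first counts as evidence for H0, the second as evidence for H1, and values near 1 flag data that are simply insensitive.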

1 |
Bad pharma: how medicine is broken, and how we can fix it. Fourth Estate.
- Goldacre
- 2013
Citation Context ...the authors claiming no effect), with no further grounds given for accepting the null other than that the p-value was greater than 0.05. That is, in the vast majority of the articles where there were no grounds for accepting the null hypothesis at all, the null hypothesis was nonetheless accepted, often in order to draw important theoretical conclusions. The effect of this practice can be disastrous. For example, the drug paroxetine was originally declared to carry no risk of increased suicide in children because the increase in risk was non-significant (it was later shown to carry such a risk; Goldacre, 2013). Human death aside, do we want to guide our theory development partly on conclusions that are groundless? Researchers may know that inferring the null hypothesis from a non-significant result is suspect. That obviously does not stop the practice from happening; it just ensures it happens freely in papers where there are also key significant results. But where the key result is non-significant, papers are less likely to be published (Rosenthal, 1979; Simonsohn, Nelson, & Simmons, 2014). The research record becomes a mislea...

1 |
The Einstein decade (1905–1915). London: Elek Science.
- Lanczos
- 1974
Citation Context ...t met by our example paper. The paper would be flawed just as much even if, in fact, the authors had thought of their predictions before looking at the data. The data are not actually likely given any stated general theory—that's the problem. Opposite or different predictions could just as well be generated from the stated general theory (if any theory were stated). Consider an opposite case: Einstein's finding that his theory of general relativity, developed around 1915, explained the anomalous orbit of Mercury, known since 1859. It was a key result that helped win scientists over to his theory (Lanczos, 1974). First the result was known; then the theory was developed. But the theory had its own independent, elegant motivation. What is important is the theory's simplicity and elegance, both in itself and in application to the results, not which came first. Thus, using this procedure amounts to a different assumption than just using the Bayes factor based on the embodiment data. What this evidence does is change the prior distribution for the embodiment prime; the posterior is thereby affected. Naturally, different scientific judgements concerning relevance can affect the Bayesian outcome. The rol...

1 |
A peculiar prevalence of p values just below 0.05.
- Masicampo, Lalande
- 2012
Citation Context ...s. But where the key result is non-significant, papers are less likely to be published (Rosenthal, 1979; Simonsohn, Nelson, & Simmons, 2014). The research record becomes a misleading representation of the evidence. Because the p-value is asymmetric, people seek to get the evidence in the only way it can appear to be strong—as against H0. Thus, apart from failure to publish relevant evidence concerning a theory, another outcome is p-hacking: pushing the data in the only direction in which they can be recognized as strong evidence, by use of analytic flexibility (John, Loewenstein, & Prelec, 2012; Masicampo & Lalande, 2012; Simmons, Nelson, & Simonsohn, 2011). No wonder there is a crisis in the credibility of our published results. How Bayes factors help. Bayes factors partly solve the problem by allowing the evidence to go both ways. This means you can tell when there is evidence for the null hypothesis and against the alternative. You can tell when there is good evidence against there being a treatment side effect (and when the evidence is merely weak); you can tell when the data count against a theory (and when they count for nothing). There is every reason to publish evidence supporting the null as going agai...

1 |
Joint statistical papers. Hodder Arnold.
- Neyman, Pearson
- 1967
(Show Context)
Citation Context ...ties (or strength of belief) in H1 versus H0, i.e. the prior odds of H1 versus H0. P(H1|D)/P(H0|D) is the ratio of the probabilities of the two theories in the light of the data; i.e. the posterior odds. The remaining term is the Bayes factor, B, which states that the data are B times more probable under H1 than under H0. Briefly, posterior odds = B × prior odds. The symmetry is particularly important in determining support for the null hypothesis, interpreting replications, and p-hacking by optional stopping, all practical issues discussed below. The strict use of only one model is Fisherian; Neyman and Pearson (1967) argued that two models should be used, and introduced the concept of power, which helps introduce symmetry in inference, in that it provides grounds for asserting the null hypothesis. Unfortunately power is a flawed solution (Dienes, 2014) and that might explain why it is not always taken up. Power cannot be determined based on the actual data in order to assess their sensitivity; hence, a high-powered non-significant result might not actually be evidence for the null hypothesis, as we shall see. Further, it involves (or should involve) specifying only the minimal interesting effect size, which i...
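The "p-hacking by optional stopping" mentioned in this excerpt can be demonstrated with a small simulation (a sketch of the standard phenomenon, not code from the paper): under H0 (true effect 0, SD 1, as in the paper's tables), running a z-test after every added participant and stopping at the first |z| > 1.96 inflates the nominal 5% false-positive rate several-fold. The parameter values and function name are illustrative:

```python
import math
import random

def false_positive_rate(n_sims=2000, max_n=100, min_n=10, crit_z=1.96, seed=1):
    """Simulate optional stopping under H0: sample difference scores from
    N(0, 1), test after each new participant from min_n onward, and stop at
    the first |z| exceeding crit_z.  Returns the false-positive rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)                 # one more participant
            if n >= min_n:
                z = (total / n) / (1 / math.sqrt(n))  # mean / SE, known SD = 1
                if abs(z) > crit_z:
                    hits += 1                         # "significant" under H0
                    break
    return hits / n_sims
```

With checks after every participant from n = 10 to n = 100, the long-run false-positive rate lands well above the nominal 5%, which is the asymmetry-driven abuse that the Bayes-factor stopping rules in the paper's Tables 1 and 2 are designed to keep in check.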

1 |
Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods.
- Schoenbrodt, Wagenmakers, et al.
- 2015

1 |
Theory testing with the prior predictive. Oral presentation at the 26th annual convention of the Association for Psychological Science, 22–25 May.
- Dienes
- 2016