On Tuesday, July 23rd, I posted a rough schematic of a 0.05 P-value cutoff on a curve and a hypothesis testing table, with the title and legend "marriage of inconvenience". This is Part 1 of a potential series of narratives introducing some of the reasons why this phrase may be a good description of the relationship between the two, and of the alternatives. When I searched whether "marriage of inconvenience" had been used in some other context, I found that indeed it has. In 1963, Lionel Gelber (a Canadian diplomat who wrote about foreign affairs) wrote an essay on the relationship between Europe and the US describing it as a marriage of inconvenience, "a union in which partners who are incompatible in many respects yet are welded indissolubly together". On reflection, this applies well to the relationship between P-values and hypothesis testing, especially in biomedical research.
One of the very first questions reviewers of biomedical research papers submitted to clinical journals ask is whether the results are statistically significant (usually judged against the fixed P-value cutoff of 0.05). [I actually find this peculiar, especially when the big picture is missed.] What does that really mean, and how important is it to distinguish statistical significance from clinical/meaningful significance, or even from the significance of the research question being asked? This topic is by all means old (and I mean very old): writings about misconceptions of P-values have appeared in the literature for decades, and people still either do not believe them or simply do not know what the alternative is.
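To make the distinction concrete, here is a minimal simulation sketch in Python; the 0.5 mmHg blood pressure difference, sample size, and standard deviation are hypothetical numbers chosen purely for illustration. With a large enough sample, a clinically trivial difference can easily come out as "statistically significant".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical example: two groups of 20,000 patients whose systolic blood
# pressure differs by only 0.5 mmHg on average (SD = 15 mmHg).
n = 20_000
control = rng.normal(loc=120.0, scale=15.0, size=n)
treated = rng.normal(loc=119.5, scale=15.0, size=n)

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"mean difference = {treated.mean() - control.mean():.2f} mmHg")
print(f"P-value = {p_value:.4f}")
# With samples this large the P-value is very likely below 0.05, yet a
# 0.5 mmHg difference is clinically negligible: statistical significance
# is not the same as clinical importance.
```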
The P-value is a double-edged sword: great to have, but potentially a tricky problem if not interpreted properly (which is the case most of the time). An article by Steven Goodman in 2008 lists twelve misconceptions of the P-value (calling them a Dirty Dozen, listed below), and I agree they are "dirty"! While I thought I would be the first to compare significance testing using P-values, hypothesis testing (null and alternative) with fixed error probabilities, and posterior probabilities, I am not. James O. Berger published an article in Statistical Science titled "Could Fisher, Jeffreys, and Neyman Have Agreed on Testing?" discussing these methods. There are several other articles that explain in detail the confusion that the P-value has generated in significance and hypothesis testing. Different articles blame different people for the fixed value of 0.05 (Lehmann 1993, Berger 2003). Regardless of who came up with it, it is important to understand the uses of P-values and the available alternatives. The question becomes: what would a well-intentioned researcher do? Is it a mixed approach of P-values and/or Type I and Type II errors and/or Bayesian measures? Would different methods be used in different contexts (e.g., Type I and Type II errors for screening)?
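As a concrete illustration of how the three schools Berger contrasts can disagree on the very same data, here is a small Python sketch; the prior probability P(H0) = 0.5 and the N(0, tau^2) prior on the effect under the alternative are my own illustrative assumptions, not anything prescribed by the papers cited here.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: n observations with known sigma, testing H0: mu = 0.
n, sigma = 50, 1.0
xbar = 0.30                      # observed sample mean (illustrative)
se = sigma / np.sqrt(n)
z = xbar / se

# 1) Fisher: the two-sided P-value as a continuous measure of evidence.
p_value = 2 * norm.sf(abs(z))

# 2) Neyman-Pearson: a fixed alpha = 0.05 decision rule (Type I error control).
reject_h0 = abs(z) > norm.ppf(0.975)

# 3) Jeffreys/Bayes: posterior probability of H0, assuming P(H0) = 0.5 and
#    mu ~ N(0, tau^2) under H1 (both choices are for illustration only).
tau = 1.0
m0 = norm.pdf(xbar, loc=0.0, scale=se)                       # marginal under H0
m1 = norm.pdf(xbar, loc=0.0, scale=np.sqrt(tau**2 + se**2))  # marginal under H1
posterior_h0 = m0 / (m0 + m1)

print(f"z = {z:.2f}, P-value = {p_value:.3f}, reject H0 at 0.05: {reject_h0}")
print(f"posterior P(H0 | data) = {posterior_h0:.2f}")
# The same data can be 'significant' at 0.05 yet leave a sizable posterior
# probability on the null -- exactly the gap behind misconception #1 below.
```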
------------------------------------------------------------------------------------------------------
Twelve P-value Misconceptions (taken from Table 1 in Goodman S, 2008; a short simulation sketch after the list illustrates the second and fourth items):
- If P = 0.05, the null hypothesis has only a 5% chance of being true.
- A nonsignificant difference (e.g., P > 0.05) means there is no difference between groups.
- A statistically significant finding is clinically important.
- Studies with P-values on opposite sides of 0.05 are conflicting.
- Studies with the same P value provide the same evidence against the null hypothesis.
- P = 0.05 means that we have observed data that would occur only 5% of the time under the null hypothesis.
- P = 0.05 and P < 0.05 mean the same thing.
- P-values are properly written as inequalities (e.g., "P < 0.02" when P = 0.015).
- P = 0.05 means that if you reject the null hypothesis, the probability of a type I error is only 5%.
- With a P = 0.05 threshold for significance, the chance of a type I error will be 5%.
- You should use a one-sided P value when you don’t care about a result in one direction, or a difference in that direction is impossible.
- A scientific conclusion or treatment policy should be based on whether or not the P-value is significant.
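To ground the second and fourth misconceptions above, here is a minimal simulation sketch in Python; the 0.4 SD effect size and n = 25 per arm are hypothetical numbers chosen only to mimic a typical underpowered study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical setup: a real treatment effect of 0.4 SD, studied with small
# trials (n = 25 per arm), repeated 10,000 times.
n, effect, n_sims = 25, 0.4, 10_000
p_values = np.empty(n_sims)
for i in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(effect, 1.0, n)
    p_values[i] = stats.ttest_ind(treated, control).pvalue

print(f"share of studies with P < 0.05: {np.mean(p_values < 0.05):.2f}")
# Despite a genuine effect, roughly 7 in 10 such studies come out
# 'nonsignificant' (misconception #2), and two identical studies can easily
# land on opposite sides of 0.05 without being in conflict (misconception #4).
```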
Readings:
Gelber L. A Marriage of Inconvenience. http://www.foreignaffairs.com/articles/23478/lionel-gelber/a-marriage-of-inconvenience. January 1963 (last accessed 9/20/2013).
Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008 Jul;45(3):135-40.
Lehmann EL. The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? J Am Stat Assoc. 1993 Dec;88(424):1242-1249.
Berger JO. Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Stat Sci. 2003;18(1):1-32. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167.4064&rep=rep1&type=pdf (last accessed 9/20/2013).