Thursday, March 27, 2014

Part 1z. The P-value: A surviving 'mosquito'


Extremists do not see the world in black or white.

Prior datapede posts have discussed the P-value: its origin, its inconvenient marriage to hypothesis testing, and its common misconceptions.

This week, I read a paper titled "Scientific method: Statistical errors". This article sheds light on how the "P-value was never meant to be used the way it's used today" and how we should be very aware of the limits of conventional statistics.

The following points are a 'selective' summary mixed with additional details:
  1. On reproducibility: Reproducibility is like a ghost that keeps coming back to haunt you. Most published findings have been shown to be false, and scientists who have tried to reproduce results have found it immensely challenging. The article opens with the example of a psychology student testing the hypothesis that extremists quite literally see the world in black and white. With the initial data, the P-value was < 0.01, very significant. Upon replication with additional data, the P-value dramatically changed to 0.59. The investigators ended up not publishing their findings, and instead wrote an article about Scientific Utopia.
  2. On Fisher: Fisher really did not intend the P-value to be a definitive test; it was just one part of a fluid, "non-numerical" process of scientific reasoning. It was later, and inconveniently, married to hypothesis testing, despite the bitter rivalry between Fisher and Neyman, to give scientists a seemingly objective working mechanism.
  3. On confusion: Yes, the P-value can be confusing. A significant P-value can cloud our thinking: we get too excited and forget to look at the actual effect size. Does that < 0.05 really matter when the effect is small? The author gives the example of a study which concluded that the "internet is changing the dynamics and outcomes of marriage itself". It found that those who meet their spouses online are less likely to divorce and more likely to report high marital satisfaction (with very significant P-values, of course). However, the effect sizes were tiny; happiness, for example, barely moved from 5.48 to 5.64. So do not sign up for match.com thinking that you will be happier with your spouse.
  4. On the future: The future should hold a change of culture. In the interim, the following measures may help: (i) always report the effect size and confidence interval, because the P-value conveys neither; (ii) take advantage of Bayes' rule; (iii) disclose all your methods, assumptions, manipulations, and measures in the paper; (iv) adopt two-stage analysis, now known as 'preregistered replication'. In the first, exploratory stage, investigators perform their study and preregister in a public database their hypotheses and their plan for confirming the findings. In the second stage, they perform the replication study and publish it alongside the exploratory study. I really hope this becomes the norm.
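To see point 3 in action, here is a quick back-of-the-envelope sketch in Python. The happiness means 5.48 and 5.64 come from the article; the standard deviation and sample sizes are assumptions I made up for illustration. With samples this large, even a trivial difference produces a "very significant" P-value, which is exactly why the effect size and confidence interval (point i above) deserve top billing:

```python
import math

# Means inspired by the marriage study (happiness barely moved from
# 5.48 to 5.64). The standard deviation and per-group sample size
# below are ASSUMPTIONS for illustration, not figures from the study.
mean_offline, mean_online = 5.48, 5.64
sd = 1.5          # assumed common standard deviation
n = 10_000        # assumed per-group sample size (large survey)

diff = mean_online - mean_offline
se = sd * math.sqrt(2 / n)               # standard error of the difference
z = diff / se
# two-sided P-value from the normal approximation
p = math.erfc(abs(z) / math.sqrt(2))

cohens_d = diff / sd                     # standardized effect size
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"P-value   = {p:.2e}")            # 'very significant'
print(f"Cohen's d = {cohens_d:.3f}")     # yet a trivially small effect
print(f"95% CI for the difference: ({ci_low:.3f}, {ci_high:.3f})")
```

The P-value comes out astronomically small while Cohen's d sits around 0.1, which most rules of thumb would call negligible. Same data, two very different stories.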
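And a tiny illustration of point (ii): Bayes' rule lets us ask what a "significant" result actually buys us. The prior probability, power, and alpha below are assumed values chosen for illustration, not numbers from the article:

```python
# Sketch of Bayes' rule applied to a significant finding. All three
# inputs are ASSUMPTIONS for illustration.
prior = 0.10   # assumed prior probability that the hypothesis is true
power = 0.80   # assumed probability of detecting a true effect
alpha = 0.05   # conventional false-positive rate

# P(effect is real | significant result) via Bayes' rule
posterior = (power * prior) / (power * prior + alpha * (1 - prior))
print(f"P(effect is real | p < 0.05) = {posterior:.2f}")
```

Under these assumptions, a result that clears the p < 0.05 bar is real only about 64% of the time, a long way from the 95% certainty that many readers take away from that threshold.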
------------------------------------------------------------------------------------------------------------
References and additional reading:  
Nuzzo R. Scientific method: statistical errors. Nature. 2014 Feb 13;506(7487):150-2.
Nosek, B. A., Spies, J. R. & Motyl, M. Perspect. Psychol. Sci. 7, 615–631 (2012).
Cacioppo, J. T., Cacioppo, S., Gonzaga, G. C., Ogburn, E. L. & VanderWeele, T. J. Proc. Natl Acad. Sci. USA 110, 10135–10140 (2013).