Tuesday, May 20, 2014

Migration of posts to QS-2

Current and future datapede posts will be on Quantitative Scientific Solutions, LLC.

www.QS-2.com


"Quantitative Scientific Solutions LLC is a technical consulting and data analytics firm based in Washington, D.C.


QS-2 believes that complicated problems require innovative ideas and fresh strategies, and leverages strong technical capabilities to provide comprehensive and creative solutions to our client’s most challenging needs.


Building off of our deep expertise in the technical consulting sector, hard scientific and data driven approaches are at the heart of every service that we provide to our clients. QS-2 services some of the world’s leading institutions, working together to provide support and guidance in addressing the most interesting and challenging problems they face. We are focused on innovation, and work with our clients in both the development and utilization of advanced, disruptive technologies."


Posts will also appear on the QS-2 Facebook Page:

https://www.facebook.com/quantitativescientific

Thursday, March 27, 2014

Part 1z. The P-value: A surviving 'mosquito'


Extremists do not see the world in black or white.

Prior datapede posts have included discussions on the P-value: its origin, its inconvenient marriage with hypothesis testing, and its misconceptions.

This week, I read a paper titled "Scientific method: Statistical errors". This article sheds light on how the "P-value was never meant to be used the way it's used today" and how we should be very aware of the limits of conventional statistics.

The following points are a 'selective' summary mixed with additional details:
  1. On reproducibility: Reproducibility is like the ghost that will always come back to haunt you. It has been argued that most published findings are false, and scientists who have tried to reproduce results have found it immensely challenging. The article opens with the example of a psychology student who wanted to test the hypothesis that extremists quite literally see the world in black and white. The P-value with the initial data was < 0.01, very significant. Upon replication with additional data, the P-value changed dramatically to 0.59. The investigators ended up not publishing their findings, and instead wrote an article about Scientific Utopia.
  2. On Fisher: Fisher really did not intend for the P-value to be a definitive test. It was just one part of a "non-numerical" process, and it was inconveniently married to hypothesis testing, in the midst of the rivalry between Fisher and Neyman, to give scientists a working mechanism.
  3. On confusion: Yes, the P-value can be confusing. A significant P-value can cloud our thinking: we get too excited and forget to look at the actual effect size. Does that < 0.05 really matter when the effect size is small? The author gives the example of a study which concluded that the "internet is changing the dynamics and outcomes of marriage itself". This study showed that those who meet their spouses online are less likely to divorce and more likely to have high marital satisfaction (with very significant P-values, of course). However, the effect size was very small: happiness, for example, barely moved from 5.48 to 5.64. So, do not sign up for match.com expecting to be noticeably happier with your spouse.
  4. On the future: The future should hold a change of culture. In the interim, the following measures may help: (i) always report the effect size and confidence interval, because the P-value conveys neither; (ii) take advantage of Bayes' rule; (iii) disclose all your methods in the paper, including the assumptions used, the manipulations and all the measures; (iv) use two-stage analysis, newly known as 'preregistered replication'. The first stage is exploratory: investigators perform their study, then preregister their ideas, and how they plan to confirm their findings, in a public database. The second stage is to perform the replication study and publish it along with the exploratory study. I really hope this becomes the norm.
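
To make the effect-size point concrete, here is a rough sketch of how a tiny difference can still produce a very small P-value when the sample is large. Only the two means (5.48 and 5.64) come from the example above. The standard deviation, the group size and the function name are hypothetical values chosen for illustration, and the test is a simple normal approximation, not the analysis used in the original study.

```python
import math

def two_sample_z(mean_a, mean_b, sd, n_per_group):
    """Compare two equal-sized groups that share one standard deviation.

    Returns (cohens_d, z, two_sided_p, ci95_for_difference).
    Uses a normal approximation, which is reasonable for large groups.
    """
    diff = mean_b - mean_a
    se = sd * math.sqrt(2.0 / n_per_group)      # standard error of the difference
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))      # two-sided P-value
    d = diff / sd                               # standardized effect size
    ci = (diff - 1.96 * se, diff + 1.96 * se)   # approximate 95% interval
    return d, z, p, ci

# Means from the marriage example; sd and n_per_group are assumptions.
d, z, p, ci = two_sample_z(5.48, 5.64, sd=1.0, n_per_group=5000)
print(f"effect size d = {d:.2f}, z = {z:.1f}, P = {p:.2e}")
print(f"95% CI for the difference: {ci[0]:.3f} to {ci[1]:.3f}")
```

Under these assumed inputs the P-value is far below 0.05 even though the standardized effect is small, which is why the effect size and interval deserve to be reported next to it.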
------------------------------------------------------------------------------------------------------------
References and additional reading:  
Nuzzo R. Scientific method: statistical errors. Nature. 2014 Feb 13;506(7487):150-2.
Nosek, B. A., Spies, J. R. & Motyl, M. Perspect. Psychol. Sci. 7, 615–631 (2012).
Cacioppo, J. T., Cacioppo, S., Gonzaga, G. C., Ogburn, E. L. & VanderWeele, T. J. Proc. Natl Acad. Sci. USA 110, 10135–10140 (2013).

Monday, February 17, 2014

The color green: Where incidence meets prevalence

Figure 1. Money Flow
Does Figure 1 (right) look familiar?
Most of us should recognize it as something we run into all the time.

A similar image, with pebbles standing in for cases, has been used to describe incidence, prevalence and the relationship between the two. Incidence measures the frequency of new events (such as the onset of illness), while prevalence measures the proportion of people who have the illness right now.

The two are linked through the duration of disease: prevalence approximates incidence when the duration of disease is short.

Prevalence ≈ (incidence rate) × (average duration of illness).

So, if the duration of disease is short (like the common cold) prevalence approximates the incidence rate. Specifically, the inflow of disease approximates the outflow. Outflow is usually due to two main reasons: death or cure.
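
As a rough illustration of the formula, the sketch below treats cases as a simple inflow and outflow process. The function names and the numbers are hypothetical, and the model assumes a constant number of new cases per day and a constant recovery rate of 1/duration, which is a simplification.

```python
def steady_state_prevalence(incidence_per_day, mean_duration_days):
    """Prevalence ≈ (incidence rate) × (average duration of illness)."""
    return incidence_per_day * mean_duration_days

def simulate_prevalence(incidence_per_day, mean_duration_days, days):
    """Track current cases day by day: new cases flow in, and a fraction
    1/duration of existing cases flows out (through death or cure)."""
    prevalence = 0.0
    for _ in range(days):
        outflow = prevalence / mean_duration_days
        prevalence = prevalence + incidence_per_day - outflow
    return prevalence

# Hypothetical illness: 10 new cases per day, lasting 5 days on average.
print(steady_state_prevalence(10, 5))
print(simulate_prevalence(10, 5, days=200))

# With a duration of one day, current cases equal the daily new cases.
print(simulate_prevalence(10, 1, days=3))
```

In this toy model the simulated count settles at incidence times duration, and when the duration shrinks to a single time step the current count is just the new cases for that step.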

Now let's apply this concept to money flowing into individual bank accounts. After some pondering, the idea that prevalence approximates incidence turns out to be surprisingly applicable.

Figure 2. Bank account before payday
Many people live paycheck to paycheck, and most of our account balances (including mine) look like Figure 2.

Using the following assumptions:

Dollar Incidence: The number of new dollars entering the balance on payday (the new disease). An incidence rate is usually calculated as the number of new cases within a specified time divided by the population at risk. I am not sure what the $$ at risk would be here, or how to even think about computing it, so calculating a rate would be quite challenging. For the sake of this analogy, let's simply call the new dollars incidence (the unit of analysis is the individual account).

Dollar Prevalence: Basically, how much your bank account holds at this moment. Prevalence is usually calculated by dividing the number of people who have a condition by the total number of people studied. Again, I will not dwell on whether it is possible to calculate a proportion here. The unit we are looking at is the individual bank account, so let us again simply call the amount of $$ in it right now prevalence.

Applying the relationship between prevalence and incidence above, inflow is your paycheck being posted, and outflow is, again, due to two main reasons: expenses or investments. As such, a labeled Figure 1 would look like Figure 3:

Figure 3.



Given that expenses strike on payday, if not the very day after, the time each dollar spends in your account is really short. For the average person's bank account, dollar prevalence would therefore approximate dollar incidence at that point in time.

Is it true that "it doesn't matter how fast color travels, it is how fast you can see it"? I am not sure about the source of this saying, but green, I think, is very fast.