Thursday, November 21, 2013

The Intraclass Correlation Coefficient (ICC) and the Z Fisher Transformation for the 95% Confidence Interval


Case Scenario
Suppose that there are 10 subjects with eyelid measurements from three different observers (making the total number of measurements = 30), who used a new computerized technique to capture eyelid images and later measured the parameters using ImageJ software. What would you do to test the reliability of the measurements, and how would you go about performing this analysis? Please note that the dependent variable (eyelid measurement) is a continuous variable. This is a real example.

Approach
The best measure of reliability for continuous data is the Intraclass Correlation Coefficient (ICC). The ICC informs you about inter-rater reliability: the higher the ICC, the stronger the agreement among raters and the less unique information each additional rater's measurement provides.
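To make this concrete, below is a minimal Python sketch (not the analysis from the actual study) of a one-way ICC with an approximate 95% confidence interval via the Fisher z transformation. The data are simulated stand-ins for the eyelid measurements, and the Pearson-style standard error 1/sqrt(n − 3) is only a rough approximation when applied to an ICC; exact F-based intervals, or a dedicated package such as pingouin, would be preferable in practice.

```python
import numpy as np
from scipy import stats

def icc_oneway(ratings):
    """One-way random-effects ICC, ICC(1), from an (n subjects x k raters) array."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    subject_means = ratings.mean(axis=1)
    # Mean squares from a one-way ANOVA with subject as the grouping factor
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def fisher_z_ci(icc, n, alpha=0.05):
    """Approximate CI via Fisher's z; the Pearson-style SE is an assumption here."""
    z = np.arctanh(icc)                      # 0.5 * ln((1 + r) / (1 - r))
    se = 1 / np.sqrt(n - 3)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

# Simulated stand-in data: 10 subjects, 3 observers
rng = np.random.default_rng(1)
truth = rng.normal(10, 2, size=(10, 1))              # true eyelid measurements
ratings = truth + rng.normal(0, 0.5, size=(10, 3))   # plus observer noise

icc = icc_oneway(ratings)
lo, hi = fisher_z_ci(icc, n=10)
print(f"ICC = {icc:.2f}, approximate 95% CI ({lo:.2f}, {hi:.2f})")
```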


Friday, September 20, 2013

P-value: A double-edged sword? (Part 1)

On Tuesday, July 23rd, I posted a rough schematic of a 0.05 P-value cutoff point on a curve and a hypothesis-testing table, with the title and legend "marriage of inconvenience". This is Part 1 of a potential series of narratives introducing why this phrase may be a good description of the relationship between the two, and what the alternatives are. When searching whether "marriage of inconvenience" has been applied in some other context, I found that indeed it has. In 1963, Lionel Gelber (a Canadian diplomat who wrote about foreign affairs) wrote an essay on the relationship between Europe and the US describing it as a marriage of inconvenience, "a union in which partners who are incompatible in many respects yet are welded indissolubly together". On reflection, this applies well to the relationship between P-values and hypothesis testing, especially in biomedical research.

One of the very first questions reviewers of biomedical research papers submitted to clinical journals ask is whether your results are statistically significant (usually against the fixed P-value cutoff of <0.05). [I actually find this peculiar, especially when the big picture is missed.] What does that really mean, and how important is it to differentiate between statistical significance and clinical/meaningful significance, or even the significance of the research question asked? This topic is by all means old (and I mean very old): articles about misconceptions of P-values have appeared in the literature for decades, and people still either do not believe them or simply don't know what the alternative is.

The P-value is a double-edged sword: great to have, but potentially a tricky problem if not interpreted properly (which is the case most of the time). An article by Steven Goodman in 2008 lists twelve misconceptions of the P-value (calling them a Dirty Dozen, listed below), and I agree they are "dirty"! While I thought I would be the first to compare statistical testing using P-values, hypothesis testing (null and alternative) with fixed error probabilities, and posterior probabilities, I am not. James O. Berger published an article in Statistical Science titled "Could Fisher, Jeffreys, and Neyman Have Agreed on Testing?" discussing these methods. There are several other articles that explain in detail the confusion that the P-value in significance testing or hypothesis testing has generated. Different articles blame different people for the fixed value of 0.05 (Lehmann 1993, Berger 2003). Regardless of who came up with it, it is important to understand the uses of P-values and the available alternatives. The question becomes: what would a well-intentioned researcher do? Is it a mixed approach of P-values and/or Type I and Type II errors and/or Bayesian measures? Would different methods be used in different contexts (e.g., Type I and Type II errors for screening)?

 ------------------------------------------------------------------------------------------------------
Twelve P-value Misconceptions: (Taken from Table 1. In: Goodman S, 2008):

  1. If P = 0.05, the null hypothesis has only a 5% chance of being true.
  2. A nonsignificant difference (eg, P > .05) means there is no difference between groups.
  3. A statistically significant finding is clinically important.
  4. Studies with P values on opposite sides of .05 are conflicting.
  5. Studies with the same P value provide the same evidence against the null hypothesis.
  6. P = 0.05 means that we have observed data that would occur only 5% of the time under the null hypothesis.
  7. P = 0.05 and P < .05 mean the same thing.
  8. P values are properly written as inequalities (eg, "P < .02" when P = 0.015).
  9. P = 0.05 means that if you reject the null hypothesis, the probability of a type I error is only 5%.
  10. With a P = 0.05 threshold for significance, the chance of a type I error will be 5%.
  11. You should use a one-sided P value when you don't care about a result in one direction, or a difference in that direction is impossible.
  12. A scientific conclusion or treatment policy should be based on whether or not the P value is significant.
-------------------------------------------------------------------------------------------------------------
Readings:

Gelber L. A Marriage of Inconvenience. http://www.foreignaffairs.com/articles/23478/lionel-gelber/a-marriage-of-inconvenience. January 1963 (last accessed 9/20/2013).
 
Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008 Jul;45(3):135-40.

Lehmann EL. The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? J Am Stat Assoc. 1993 Dec;88(424):1242-1249.

Berger JO. Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Statistical Science. 2003;18(1):1-32. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167.4064&rep=rep1&type=pdf (last accessed 9/20/2013).

Wednesday, August 21, 2013

"It's all about the content. Get better content or use sparklines". Edward Tufte

Yesterday I went to Edward Tufte's one-day course on data visualization: Presenting Data and Information (http://www.edwardtufte.com/tufte/courses).

I am sharing below some rough take-home notes:
  • New methods of presenting will be sans PowerPoint. Provide readings, discuss, and explain with visuals as you go along.
  • Get better content
  • Use sparklines. Sparklines are datawords: data-intense, design-simple, word-sized graphics. They have applications for financial and economic data by tracking changes over time. Sparklines also reduce recency bias and may aid better decision making (it was one of the first times I had heard of a bias termed "recency"). You can easily create sparklines in Excel; the sparkline feature was added in Microsoft Excel 2010. I created a sparkline of my satisfaction with Tufte's style and lecture over the course of the day (Figure; see also the Python sketch below).
    Data points varied over the course of the day. The first data point was 8.89 at 10:15 am and the last was 9.99 at 4:15 pm. The lowest point is highlighted in red because the course stopped and I had to go find food. The highest point is highlighted in green because the course ended 2 minutes early and satisfaction was high given the feeling of wanting more. Galileo and Euclid were also mentioned, which contributed to higher satisfaction, given my fascination with both. Note that this is just an example to illustrate a sparkline; satisfaction is subjective. Also, Tufte strongly encouraged links to raw data, so I am pasting the original table at the end.
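For readers without Excel 2010, a sparkline is also easy to sketch in Python with matplotlib. This is a minimal sketch; the ratings below are illustrative stand-ins for my raw table, not the actual values.

```python
import matplotlib.pyplot as plt

# Hypothetical satisfaction ratings over the day (stand-ins for the raw table)
ratings = [8.89, 9.20, 7.50, 9.00, 9.50, 9.99]

fig, ax = plt.subplots(figsize=(2.5, 0.4))       # word-sized: small and wide
ax.plot(range(len(ratings)), ratings, color="gray", linewidth=1)
lo = ratings.index(min(ratings))                 # lowest point in red
hi = ratings.index(max(ratings))                 # highest point in green
ax.plot(lo, ratings[lo], "ro", markersize=3)
ax.plot(hi, ratings[hi], "go", markersize=3)
ax.axis("off")                                   # data-intense, design-simple
fig.savefig("sparkline.png", dpi=300, bbox_inches="tight")
```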



Monday, August 19, 2013

Datapede leisure time



These past two weeks were spent with family and friends. During this leisure time, the first Datapede tote bag was developed at T-Gallery, a custom T-shirt store in Greenwich Village in New York. I recommend it if you are ever interested in designing T-shirts and/or tote bags.

On another note, I will be attending a one-day course tomorrow with Edward Tufte on the visual display of quantitative information (http://www.edwardtufte.com/tufte/courses). Edward Tufte is a statistician and professor emeritus of statistics, political science, and computer science at Yale. He is a pioneer in data visualization. I will be sharing take-home notes from the course. Additionally, upcoming posts on Datapede will be about:

P-value: A double-edged sword?
More on the confidence interval

I have also gotten some requests to discuss more epidemiological principles. I will weave those in future posts.

Tuesday, July 30, 2013

The lady tasting vodka: The Null Hypothesis

This past Saturday night, in the company of two good friends (Dana and Zahi), I contemplated designing an experiment similar to the one Sir Ronald Fisher described in his 1935 book "The Design of Experiments", known as the Lady Tasting Tea. However (and I am only blaming it on the time of day), iced chamomile tea and vodka were used. I am sharing the thought just as a fun analogy, because the overpowering taste of vodka makes it challenging to differentiate between what was poured first, chamomile tea or vodka. The Lady Tasting Tea experiment was one of the first experiments designed with randomization. To my knowledge (derived from readings), it is a true story describing the null hypothesis and randomization.

Fisher only worked with null hypotheses; there is no alternative in his experiments (those were the works of Jerzy Neyman and Egon Pearson). The null here and in every scenario is the "default position": the "no difference" between two methods, treatment groups, measurements, etc. Fisher used his P-value as a rough guide of the strength of evidence against the null.

The Lady Tasting Tea
I first read about the lady tasting tea in David Salsburg's book, which includes stories of how statistics revolutionized science. You can easily find the book because "The Lady Tasting Tea" is in the title (I personally received it as a gift from Dr. Wallace Chamon, an ophthalmologist and professor in Brazil). I found several references to the story online and even a full lecture about the topic by Deborah Nolan at UC Berkeley.

The story, as described, goes back to a summer afternoon in Cambridge, England, in the 1920s. A group of university scholars and their spouses were gathered for afternoon tea. A lady known as Dr. Muriel Bristol, an algologist, was being served tea when she said: "No thank you, I prefer my tea poured with milk first." Fisher responded: "Nonsense, it all tastes the same." William Roach in the background (who probably had his eye on Bristol, as he later married her) yelled: "Let's test her," and so the preliminary preparations began.

The Experiment
The null hypothesis was that the Lady would have no ability to differentiate between cups with milk poured first and cups with tea poured first. The experiment considered: (i) the number of cups (more than 2, because with only 2 the Lady would have a 50/50 chance of getting it right); (ii) whether they should be paired; (iii) in what order they should be presented; (iv) who prepares them, the portions, the right temperature, and so on.

Example of randomly ordered cups. T: tea poured first; M: milk poured first
The Lady was then provided with 8 cups of tea (randomly ordered): 4 prepared by first adding milk and 4 prepared by first adding tea. Any software can easily generate random numbers; the RAND() function in Excel can assign numbers to an ordered list of cups with milk or tea poured first. I took a snapshot of how Excel can do this (table on the right), where I assigned M for milk poured first and T for tea poured first.
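For those without Excel, the same randomization is a few lines of Python; a minimal sketch (the seed is arbitrary, for reproducibility):

```python
import random

random.seed(8)                   # arbitrary seed, for reproducibility
cups = ["M"] * 4 + ["T"] * 4     # 4 milk-first, 4 tea-first
random.shuffle(cups)             # random presentation order
print(cups)                      # one random ordering of the 8 cups
```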

The Lady was then asked to identify the cups. Fisher was only willing to reject the null hypothesis if the Lady categorized all the cups correctly, recognizing her ability at a 1.4% significance level. Here is where the 1.4% came from:
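There are 8!/(4! × 4!) = 70 equally likely ways to choose which 4 of the 8 cups had milk poured first (the number of combinations of 8 items taken 4 at a time), so guessing identifies all the cups correctly with probability 1/70 ≈ 0.014, i.e. 1.4%.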


I am not sure if the results of the experiment were presented, but the conclusion in the end was that Dr. Bristol was indeed able to differentiate between the cups. While my attempt to design an experiment with chamomile tea and vodka was fun, alcohol has an overpowering taste and should be tested on its own. I should probably run this experiment comparing two types of vodka (for example, British and Russian vodka) and see if anyone can really tell the difference, similar to how Deborah Nolan ran the experiment comparing Mexican and American Coca-Cola. (Note: this is only a tasting experiment; the subject will not drink the 8 cups.)

The main reason I decided to write about this topic is to lay the foundation for future discussions of the null hypothesis, the alternative, and significance tests, and how they are inconveniently married. When discussing this story with Dr. Sandeep Jain (the Director of the Corneal Neurobiology Laboratory at the University of Illinois at Chicago), his first reaction was that "the greatest discoveries are observational and they come about by a fair degree of luck and chance. They come about by ways you don't expect them to." This one came about from a lady wanting to drink her tea poured with milk first.

_____________________________________________________________________________
References 

David Salsburg. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York: W.H. Freeman and Company, 2001 (paperback May 2002).

Deborah Nolan. Lecture 10: Fisher's "Lady Tasting Tea" Experiment. Lecture given at UC Berkeley. CosmoLearning. http://www.cosmolearning.com/video-lectures/lecture-10-fishers-lady-tasting-tea-experiment-10081/

Box, Joan Fisher (1978). R.A. Fisher, the Life of a Scientist. New York: Wiley. p. 134.

Charlie Gibbons. Fisher's Exact Test and its Extensions. University of California, Berkeley. Fall 2012. http://cgibbons.us/courses/are210/NPTestsNotes.pdf (last accessed July 31, 2013).

Sunday, July 14, 2013

Ockham's razor, frequentist-Bayesian, and parsimonious regression models


Occam's razor (after William of Ockham) holds that, all else being equal, a simpler explanation is more likely to be the correct one. In other words, simpler models are favored until the data justify, with as few assumptions as possible, more complex ones. This philosophical notion is applied in several scientific disciplines.

Given Occam's principle, how does one go about reaching a simple, parsimonious model to predict disease risk? What is the difference between adopting a probabilistic frequentist approach and a subjective (Bayesian) one; in other words, should there be a Bayesian viewpoint in epidemiological research?

Frequentist and subjective probability
Frequentist methods are the Fisherian P-values (R.A. Fisher) and the confidence intervals that remain the norm in biostatistics and epidemiology (what we see in published clinical and epidemiological studies). They are based on notions of objectivity and on likelihood functions that help in drawing conclusions. Frequentist techniques are highly effective in randomized trials. However, in observational studies a frequentist model may become more questionable (potentially misleading), as we are more likely to be confronted with confounding, selection bias, and measurement error. That is when Bayesian methods may be worth looking into. Even though Bayesian methods have been criticized for their imprecision, their reliance on prior parameter distributions, and the fact that they are largely based on subjective and arbitrary elements, it has been suggested that they may be useful where prior estimates can be generated by applying the same formulas that frequentists use. An article by Sander Greenland in 2006 provides a clear explanation of this topic with clear examples.

What are Bayesian methods (subjective probability)?
Subjective probability can be defined as the degree of belief that x is true. The probability in this context does not represent the external world but rather features of personal subjective interpretations. In subjective probability, we are not interested in any kind of long run frequency behaviors.

For example, what is the probability that your flight to Hawaii on January 28, 2014 will be cancelled?
In this case, you are not interested in a long-run frequency behavior; you are interested in predicting whether your flight will be cancelled on this specific day (one single occasion). There is a certain degree of belief in whether this event will occur: a subjective attitude toward the belief that the flight will be cancelled. Knowing that flights are more likely to be cancelled in the middle of "storm season", you may determine in this case that the probability is high. Another example: what is the probability of your having a heart attack on your 80th birthday? Betting games are widely known to have developed based on subjective probability. In larger contexts and data sets, Bayesian methods are applied by subjectively determining prior distributions and applying them to current models.

The parallel between Bayesian and frequentist methods is the conditional model: the likelihood P(data | parameters). Bayes' rule inverts this conditioning:

P(A | B) = P(A) × P(B | A) / P(B)

where P(A | B), the probability that A is true given that B is true, is known as the posterior probability of A.
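As a toy numeric illustration of Bayes' rule (the prevalence, sensitivity, and specificity below are made-up numbers, not from any study):

```python
# Toy illustration of Bayes' rule: P(disease | positive test).
# All three inputs are hypothetical numbers chosen for the example.
prevalence = 0.01        # P(A): prior probability of disease
sensitivity = 0.95       # P(B|A): probability of a positive test given disease
specificity = 0.90       # probability of a negative test given no disease

# Total probability of a positive test, P(B)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Posterior, P(A|B) = P(A) * P(B|A) / P(B)
posterior = prevalence * sensitivity / p_positive
print(f"P(disease | positive test) = {posterior:.3f}")   # ~0.088
```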

If you had observational data and wanted to model the outcome Y of breast cancer given several parameters x, you would need to build a logistic model using either automated variable-selection methods or assumptions about confounding and interaction. In certain cases, the epidemiological model will be based on statistical cutoff points potentially conflicting with contextual information. Most models are criticized as being biased, with too many assumptions. In this case, can one consider developing a model using Bayesian priors that is no more arbitrary than a frequentist data model? Can one replace arbitrary variable selection with prior distributions? The articles by Greenland suggest 'yes'. The concept of pooling studies (a hypothetical prior with the current study) is suggested (adding results from the hypothetical study of priors as a new stratum...).
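A minimal sketch of the pooling idea, in the spirit of Greenland's "prior as an extra stratum" approach: represent the prior as a normal distribution on the log odds-ratio scale and combine it with the data estimate by inverse-variance weighting. All the numbers are hypothetical, and this is only the normal-approximation shortcut, not a full Bayesian analysis.

```python
import numpy as np

# Prior belief (hypothetical): the odds ratio most likely lies between 0.5 and 4.
# Encode it as a normal distribution on the log odds-ratio scale.
prior_mean = (np.log(0.5) + np.log(4)) / 2                   # center of the interval
prior_var = ((np.log(4) - np.log(0.5)) / (2 * 1.96)) ** 2    # from the 95% width

# Data estimate from a hypothetical 2x2 table (a, b exposed; c, d unexposed)
a, b, c, d = 20, 80, 10, 90
data_mean = np.log((a * d) / (b * c))        # observed log odds ratio
data_var = 1/a + 1/b + 1/c + 1/d             # Woolf variance of the log OR

# Inverse-variance (precision) weighted average = approximate posterior
w_prior, w_data = 1 / prior_var, 1 / data_var
post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
post_se = np.sqrt(1 / (w_prior + w_data))
print(f"posterior OR = {np.exp(post_mean):.2f} "
      f"(95% CI {np.exp(post_mean - 1.96 * post_se):.2f} to "
      f"{np.exp(post_mean + 1.96 * post_se):.2f})")
```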

In conclusion, would the recipe for reaching a parsimonious model (making Ockham happy) using observational data include:
1) Common sense and ingenuity
2) A frequency model with few assumptions (frequentist approach)
3) A priors model (Bayesian perspective)

Should this become common practice?
_____________________________________________________________________
References and good reads:

Savage LJ. Subjective Probability and Statistical Practice. The Foundations of Statistical Inference. 1962

Greenland S. Bayesian perspectives for epidemiological research: I. Foundations and basic methods. Int J Epidemiol. 2006 Jun;35(3):765-75. 

Greenland S. Bayesian perspectives for epidemiological research. II. Regression analysis. Int J Epidemiol. 2007 Feb;36(1):195-202.

Friday, July 5, 2013

The Normal Distribution-Bell-Shaped Curve-Central Limit Theorem


The role Galileo played in data distributions is that he was the first to suggest that "measurement errors are deserving of a systematic and scientific treatment". Yes, he did it through his observations of the distances between stars and the distance of a star from the center of the earth. All observations we make are burdened with errors, and those observations are distributed symmetrically about the true value; the errors are distributed symmetrically about zero.

The bell-shaped curve

Abraham de Moivre (1667-1754) proved that the central limit theorem holds for simple collections of numbers from games of chance. He is known as the father of the normal distribution, and the first appearance of a bell-shaped curve was in his book The Doctrine of Chances, although the curve has often been attributed to Carl Friedrich Gauss (1777-1855).

A normal distribution simply means that observations of a certain variable have a continuous probability distribution characterized by a mean and a standard deviation (the dispersion of data from the mean, calculated as the square root of the average squared deviation from the mean or center). If the mean equals zero and the standard deviation equals one, then we have what is known as a standard normal distribution. The important thing to know about a normal distribution is that about 68% of the area under the curve lies within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. You may also hear that a normal distribution is symmetric about its mean. In a perfect world, you might see plotted continuous values distributed with complete symmetry around the mean. However, we live in a messy world, and it is rarely the case that you will ever see complete/perfect symmetry (maybe only in the stars).
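These coverage figures are easy to verify from the standard normal cumulative distribution function; a quick check in Python:

```python
from scipy import stats

# Area under the standard normal curve within k standard deviations of the mean
for k in (1, 2, 3):
    coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} SD: {coverage:.1%}")
# within 1 SD: 68.3%; within 2 SD: 95.4%; within 3 SD: 99.7%
```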

The normal distribution is famous because of the central limit theorem, which is a fascinating phenomenon. It is always refreshing to be able to attribute the existence of one phenomenon to another, and that is the relationship between the central limit theorem and normal distributions. The central limit theorem simply means that if you take samples from an original randomly distributed variable (which could be discrete), calculate the average of each sample, and plot the means, the plot will follow a normal distribution. The larger and more numerous the samples you take, the more normally distributed your plot will be.

Example:
Imagine Gauss, Fisher, Pearson, Cox, and Student (the pen name of William Sealy Gosset, the statistician who developed Student's t-test) were all participating in a show called the World Idol of Statistics (disregarding time and space here). Voters get to choose their idol by dialing in their votes and pressing 1 for Gauss, 2 for Fisher, 3 for Pearson, 4 for Cox, and 5 for Student (yes, somewhat similar to the singing Idol talent shows). Plotting the results of these votes would reveal a discrete probability distribution. Now take 50 samples, each of n = 10 for example, and plot the frequency of the means of these samples; you will start to see a normal distribution pattern.
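A minimal simulation of this voting example in Python (the vote shares are made up; any discrete distribution works):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
candidates = [1, 2, 3, 4, 5]                 # Gauss, Fisher, Pearson, Cox, Student
shares = [0.30, 0.25, 0.20, 0.15, 0.10]      # hypothetical vote shares
votes = rng.choice(candidates, size=100_000, p=shares)   # discrete, skewed

# Take 50 samples of n = 10 votes each and record each sample's mean
sample_means = [rng.choice(votes, size=10).mean() for _ in range(50)]

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(votes, bins=np.arange(0.5, 6.5), rwidth=0.8)
axes[0].set_title("Discrete distribution of votes")
axes[1].hist(sample_means, bins=10)
axes[1].set_title("Means of 50 samples (n = 10)")
plt.tight_layout()
plt.show()
```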

_____________________________________________________________________________
References and Further Reading:

G. Galilei, Dialogue Concerning the Two Chief World Systems—Ptolemaic & Copernican (S. Drake translator), 2nd ed., Berkeley, Univ. California Press, 1967.

Hald, Anders (1990), "De Moivre and the Doctrine of Chances, 1718, 1738, and 1756", History of Probability and Statistics and Their Applications before 1750, Wiley Series in Probability and Statistics.

Stahl S. The evolution of the normal distribution. http://mathdl.maa.org/images/upload_library/22/Allendoerfer/stahl96.pdf. last accessed on July 5, 2013.

Salsburg D. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. Holt Paperbacks. First published in hardcover in 2001 by WH Freeman and Company. 

http://www.stat.uchicago.edu/events/normal/fatherND.html

http://www.robertnowlan.com/pdfs/de%20Moivre,%20Abraham.pdf

Thursday, June 27, 2013

Galileo and normal distributions?

What role did Galileo play in developing the characteristics of a normal distribution? And why should we have a party if our data are normally distributed?

Wednesday, June 26, 2013

The world is messy and so is data

The world is messy; a pure study does not exist. Adopting a piecemeal approach is a logical method...
This blog is about scientific data and methods of analysis in academic medicine. While medical and biological data are the main focus, the application of methods for data analysis in other fields may be referenced and discussed. 
"Coming together is a beginning; keeping together is progress; working together is success."
   Henry Ford