Evaluating the Quality of Data Through Statistical Analysis

By Aristidis Veves, MD, PhD, and Damanpreet Singh Bedi

Translating data that has been gathered from a research project into a meaningful language is a critical step in evaluating whether a study provides useful information. This is done through statistical analysis, which allows us to analyze and interpret numerical or categorical data. When statistical analysis is applied specifically to the biological or health sciences, we call it biostatistics. There are two basic functions for which biostatistics is used in the health sciences. First, it is used to compare the same set of variables between two different groups of individuals. Second, it is used to correlate a set of variables with one another or with a particular outcome that is being studied. It is critical to be able to present and interpret data in a manner that shows statistical validity.

Understanding the P-Value
One of the most prevalent evaluative terms seen in statistics is the p-value of a test. To understand the meaning of the p-value, it is necessary to briefly discuss hypothesis testing. The baseline, or null, hypothesis of a study is typically that there is no appreciable relationship between the variables being studied, or no difference between the groups being compared. The alternative hypothesis contradicts this. By performing a statistical hypothesis test, we can obtain a p-value. The p-value tells us the probability of obtaining a result at least as extreme as the observed result if the null hypothesis were in fact true. It allows us to determine whether we should reject the null hypothesis. Usually, a value of 0.05 is established as the significance level of a test; when the p-value is less than or equal to 0.05, we reject the null hypothesis. A p-value of 0.05 indicates that there would be only a 5% chance of observing results at least this extreme if there really were no relationship between the variables being studied. It is not the probability that the null hypothesis is true. When the p-value falls at or below the significance level, we call the result statistically significant. For an even more rigorous test, a significance level of 0.025 or 0.01 can be used.
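To make the definition concrete, the following minimal sketch computes a p-value directly as "the proportion of outcomes at least as extreme as the one observed, assuming the null hypothesis is true." It uses an exact permutation test, which is our choice for illustration and is not a method named in this article. The measurements and the function name `permutation_p_value` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    Under the null hypothesis the group labels are interchangeable, so we
    enumerate every way of splitting the pooled values into two groups and
    count how often the difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total_sum = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    total = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total_sum - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if difference >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical measurements, invented for illustration only.
treated = [12.1, 13.4, 14.0, 15.2, 16.3]
control = [8.0, 9.1, 10.2, 10.9, 11.5]

p = permutation_p_value(treated, control)
print(f"p-value = {p:.4f}")
print("reject null hypothesis" if p <= 0.05 else "do not reject null hypothesis")
```

Here every treated value exceeds every control value, so only 2 of the 252 possible splits are as extreme as the observed one, and the p-value is 2/252, or about 0.008.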

Selecting the Appropriate Statistical Tool
Now, let us explore the types of statistical tests available when attempting to compare the same set of variables between two different groups of individuals. If we wanted to determine whether a continuous variable differs between two categorical groups, we would use a two-sample t-test. A continuous variable is one whose data points are not restricted to particular values (like integers). Fractional values are possible, such as in the measurement of height. A categorical, or nominal, variable is one whose data points fall into unordered categories, such as when recording gender or ABO blood types. The two-sample t-test allows us to determine whether the means of the two groups are statistically the same, or whether there is a significant difference between them. For instance, if we were trying to see whether there was a relationship between gender and height, we would use this t-test to compare the average height of males to the average height of females in our study. Recalling our earlier discussion, we would look for the t-test to yield a p-value less than or equal to our significance level, usually set at 0.05, in order to reject the null hypothesis. Notably, when we compare the same group of individuals before and after a particular treatment, we use a paired t-test, which allows us to evaluate paired samples.
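As a rough sketch of what these two tests compute, the code below calculates the t statistic for a two-sample comparison (using the pooled-variance form, which assumes similar variances in the two groups) and for a paired comparison. All data are hypothetical, and the function names are ours. Rather than computing a p-value, the sketch compares each statistic with the tabulated two-sided 5% critical value for the corresponding degrees of freedom; statistical software would normally report the p-value directly.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


def paired_t(before, after):
    """Paired t statistic, computed on the per-subject differences."""
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / sqrt(variance(diffs) / n)
    return t, n - 1


# Hypothetical heights in centimeters.
males = [178, 182, 175, 180, 185]
females = [165, 170, 162, 168, 160]

# Hypothetical systolic blood pressure before and after a treatment.
before = [140, 135, 150, 145, 138]
after = [132, 130, 141, 140, 135]

# Tabulated two-sided 5% critical values of the t distribution.
T_CRIT_DF8 = 2.306
T_CRIT_DF4 = 2.776

t_two, df_two = two_sample_t(males, females)
t_pair, df_pair = paired_t(before, after)

print(f"two-sample: t = {t_two:.2f}, df = {df_two}, significant = {abs(t_two) > T_CRIT_DF8}")
print(f"paired:     t = {t_pair:.2f}, df = {df_pair}, significant = {abs(t_pair) > T_CRIT_DF4}")
```

The paired version reduces each subject to a single difference, which is why it has fewer degrees of freedom than a two-sample test on the same number of measurements.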

We perform the t-test with an assumption that the variable is distributed normally within each group, which means that its values fall along a symmetrical bell curve. In scenarios where this assumption cannot be made, we use a nonparametric analog of the t-test called the Wilcoxon-Mann-Whitney test, which works with the ranks of the observations and is commonly described as comparing the medians instead of the means of the two groups.
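The rank-based idea can be sketched in a few lines. The U statistic below counts, over all pairs formed by taking one value from each group, how often the value from the first group is the larger one. The p-value uses a normal approximation with no tie or continuity correction, which is only rough for samples this small; that simplification, the data, and the function names are our assumptions.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a's value exceeds b's (ties count 0.5)."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def approx_two_sided_p(u, n_a, n_b):
    """Normal approximation to the two-sided p-value (no tie or continuity correction)."""
    mu = n_a * n_b / 2
    sigma = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical skewed measurements; each group contains one large outlier.
group_a = [1.1, 2.3, 2.9, 3.8, 4.4, 15.0]
group_b = [5.2, 6.1, 7.3, 8.8, 9.4, 30.0]

u = mann_whitney_u(group_a, group_b)
p_approx = approx_two_sided_p(u, len(group_a), len(group_b))
print(f"U = {u}, approximate p-value = {p_approx:.3f}")
```

Because only the ordering of the values matters, the outliers of 15.0 and 30.0 influence the result no more than any other observation would, which is the practical appeal of this test for non-normal data.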

Often, we are interested in comparing more than two groups to see if there is any impact on a variable that we are studying. Instead of doing multiple t-tests, each of which adds to the overall chance of a false-positive result, we would perform an Analysis of Variance, or ANOVA. This test allows us to compare the means of more than two groups to check for any statistically significant variation between the groups.
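A minimal sketch of a one-way ANOVA follows. It computes the F statistic as the ratio of the variation between group means to the variation within groups. The three groups are hypothetical, the function name is ours, and the result is compared with the tabulated 5% critical value of the F distribution for 2 and 12 degrees of freedom rather than converted to a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic with its between- and within-group degrees of freedom."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of individual values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical outcome scores for three treatment groups.
groups = [
    [4, 5, 6, 5, 5],
    [6, 7, 8, 7, 7],
    [9, 10, 11, 10, 10],
]

F_CRIT_2_12 = 3.89  # tabulated 5% critical value for (2, 12) degrees of freedom

f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
print("significant variation between groups" if f_stat > F_CRIT_2_12 else "no significant variation")
```

A significant F statistic indicates only that at least one group mean differs from the others; it does not say which one.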

When the variables being studied are categorical, we use the chi-squared test. This allows us to determine whether the occurrence of a particular categorical variable is indeed associated with a particular group. For example, we would use the chi-squared test if we were trying to determine whether there was an association between gender and handedness. Under the null hypothesis, we would expect the proportion of left-handed individuals to be the same among males and females. The chi-squared test would allow us to determine if our observations were consistent with this hypothesis, or if in fact there was an association between gender and handedness.
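The gender and handedness example can be sketched as a 2 x 2 table. The code compares each observed count with the count expected if handedness were independent of gender. The counts are hypothetical and the function name is ours. For a 2 x 2 table there is 1 degree of freedom, in which case the p-value can be written exactly with the complementary error function; no continuity correction is applied in this sketch.

```python
from math import erfc, sqrt


def chi_squared_2x2(table):
    """Chi-squared statistic and p-value for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are males and females, columns are left- and right-handed.
observed_counts = [
    [12, 88],
    [10, 90],
]

chi2, p_value = chi_squared_2x2(observed_counts)
print(f"chi-squared = {chi2:.3f}, p-value = {p_value:.3f}")
print("reject null hypothesis" if p_value <= 0.05 else "do not reject null hypothesis")
```

With these counts the p-value is well above 0.05, so the observations are consistent with handedness being unrelated to gender.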

Another common analytical technique involves trying to determine the strength of a linear relationship between two continuous variables (such as height and weight). In this situation, we calculate a correlation coefficient, denoted r. Values close to r = 1 indicate a strong positive correlation, values close to r = -1 indicate a strong negative correlation, and values close to r = 0 indicate no linear correlation between the variables. Most statistical software also reports a p-value to indicate whether the correlation coefficient obtained is statistically significant. It is important to note that correlation does not imply causation. The two variables could both be influenced by a third variable, causing them to change together. In addition, pooling unlike populations may yield misleading results, so this should be avoided.
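The correlation coefficient described here is the Pearson coefficient, which can be computed from the deviations of each variable from its mean, as in the sketch below. The height and weight values are hypothetical and the function name is ours.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mean_x, mean_y = mean(x), mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical heights (cm) and weights (kg) for five individuals.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]

r = pearson_r(heights, weights)
print(f"r = {r:.3f}")
```

A value this close to 1 reflects a nearly straight-line relationship in the sample; it says nothing about whether height causes weight or whether both depend on something else.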

When it appears that one variable (the explanatory variable, x) affects the outcome of the other variable (the response variable, y), a simple linear regression model is used to estimate the linear relationship between the two variables. The goal of using simple linear regression is to determine how changes in the explanatory variable affect the outcome of the response. Frequently, we are given a value represented as R². This is known as the coefficient of determination, and represents the proportion of variation in y that is explained by the linear regression. In other words, it tells us how well the linear model we have constructed fits the observed data. The closer R² is to 1, the stronger the linear relationship between the two variables we are studying. More often than not, we are interested in the effects of multiple explanatory variables on a particular response variable. In these situations, we use a multiple regression model to identify the relationship between many explanatory variables and one response. For instance, multiple regression can be used to link height, weight, and bone density to BMI.
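For the simple (one explanatory variable) case, the fitted line and the coefficient of determination can be sketched with ordinary least squares as follows. The data points are hypothetical and the function names are ours.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept for the model y = intercept + slope * x."""
    mean_x, mean_y = mean(x), mean(y)
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Proportion of the variation in y explained by the fitted line."""
    mean_y = mean(y)
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_residual / ss_total


# Hypothetical explanatory (x) and response (y) values.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
print(f"y = {intercept:.2f} + {slope:.2f} * x, R^2 = {r2:.4f}")
```

The slope estimates how much the response changes for each one-unit increase in the explanatory variable. With a single explanatory variable, the coefficient of determination equals the square of the correlation coefficient r.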

In an analogous situation, when we are trying to determine how a binary response variable (such as disease versus no disease) is affected by a series of explanatory variables, a logistic regression analysis is used. Hence, a logistic regression analysis would be used if we were performing a study that aimed to link blood pressure, cholesterol level, and BMI to the occurrence of heart disease.
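As a simplified sketch, the code below fits a logistic regression with a single explanatory variable by gradient ascent on the log-likelihood. The model expresses the probability of disease as a sigmoid function of the explanatory variable. The data (a standardized blood pressure score and a disease indicator), the function names, the learning rate, and the number of steps are all hypothetical choices made for illustration; statistical software typically uses a different fitting procedure and handles several explanatory variables at once.

```python
from math import exp


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(x, y, learning_rate=0.1, steps=5000):
    """Fit P(y = 1) = sigmoid(intercept + slope * x) by gradient ascent."""
    intercept, slope = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        grad_intercept = 0.0
        grad_slope = 0.0
        for xi, yi in zip(x, y):
            # Difference between the observed outcome and the predicted probability.
            error = yi - sigmoid(intercept + slope * xi)
            grad_intercept += error
            grad_slope += error * xi
        intercept += learning_rate * grad_intercept / n
        slope += learning_rate * grad_slope / n
    return intercept, slope


# Hypothetical standardized blood pressure scores and disease status (1 = disease).
pressure = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
disease = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

intercept, slope = fit_logistic(pressure, disease)
low_risk = sigmoid(intercept + slope * -2.0)
high_risk = sigmoid(intercept + slope * 2.0)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"predicted probability at -2.0: {low_risk:.2f}, at 2.0: {high_risk:.2f}")
```

A positive slope means that higher values of the explanatory variable are associated with a higher predicted probability of the outcome.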

Understanding Statistics is the Key to Understanding and Evaluating Research
With the growing number of studies promoting revolutionary procedures and touting other valuable findings, it is important to be able to sift through them in an organized and efficient manner. Developing an understanding of biostatistics can notably improve your ability to do so. This tool allows one to establish some degree of confidence in how reliable a set of data is. This will enable you to present data in a meaningful way and ultimately to make educated choices about which findings to incorporate into your own repertoire of knowledge and practice.


Aristidis Veves, MD, PhD, is an endocrinologist with a special interest in diabetic foot disorders. He currently serves as the Director of Research at the Joslin Beth Israel Deaconess Medical Center.

Damanpreet Singh Bedi is a medical student at the University of Rochester School of Medicine and Dentistry in New York.


Source: ACFAS Bulletin, July/August 2001


 

 

 
 

Copyright © 2008 American College of Foot and Ankle Surgeons, All Rights Reserved