Evaluating the Quality of Data Through Statistical Analysis
Aristidis Veves, MD, PhD, and Damanpreet Singh Bedi
Research is the cornerstone of the medical profession, providing important
data about illness, injury and biological processes. But we cannot begin to
interpret that data until we establish a context into which it can be placed.
For this reason, statistical analysis is one of a researcher’s most valuable
tools.
Statistical analysis is used to analyze and interpret numerical or
categorical data. There are two basic functions for which biological statistical
analysis, or “biostatistics,” is used in the health sciences:
- To compare the same set of variables between two different groups of
- To correlate sets of variables to one another or to a particular outcome
that is being studied
Presenting and interpreting data in a manner that shows statistical validity
is a vital part of this process.
Understanding the P-Value
One of the most recognized ways to evaluate a statistical test is to look at
its p-value. The p-value is the probability of obtaining results at least as
extreme as those observed if the baseline, or null, hypothesis were true. The
p-value allows us to determine whether we should reject the null hypothesis in
favor of the alternative hypothesis being tested, or fail to reject it.
A value of 0.05 is the conventional “significance level” of a test: when the
p-value is less than or equal to 0.05, we reject the null hypothesis. Why? A
p-value of 0.05 means there is only a 5 percent chance that results at least
this extreme would have occurred if the null hypothesis were true (for example,
if there was zero correlation between the variables). This standard allows us
to be fairly confident that the results of a particular test are statistically
significant. For an even more rigorous test, a significance level of 0.025 or
0.01 can be used.
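To make the definition concrete, the sketch below estimates a p-value by simulation. It uses a permutation test, which is a swapped-in technique chosen here because it shows directly what "results at least this extreme under the null hypothesis" means; it is not one of the tests described in this article. The function name and the measurements are hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Under the null hypothesis the group labels carry no information, so we
    shuffle the pooled values and count how often a random relabeling gives
    a difference at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical measurements for two groups (illustrative values only).
treated = [5.1, 5.8, 6.2, 6.0, 5.9, 6.4, 6.1, 5.7]
control = [4.8, 5.0, 5.2, 4.9, 5.3, 5.1, 4.7, 5.0]

p_value = permutation_p_value(treated, control)
reject_null = p_value <= 0.05  # compare against the 0.05 significance level
print(p_value, reject_null)
```

Because these two hypothetical groups barely overlap, very few random relabelings reproduce a difference as large as the observed one, so the estimated p-value falls well below 0.05.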
Selecting the Appropriate Statistical Tool
Depending on the nature of the study, researchers have several statistical
tools at their disposal.
If we wanted to determine whether a continuous variable differed between the
two groups of a categorical variable, we would use a two-sample t-test. A continuous
variable is one whose data points are not restricted to particular values, such
as whole numbers (the measurement of height, weight or body fat percentage, for
example). A categorical, or nominal, variable is one whose data points fall into
unordered categories, such as gender or ABO blood types.
The two-sample t-test allows us to determine whether the means of two
categorical groups are statistically the same, or if there is a significant
difference between them.
For instance, if we were trying to determine a correlation between gender and
height, we would use this t-test to compare the average height of males to the
average height of females in our study. We would look for the t-test to yield a
p-value that was less than or equal to our 0.05 significance level, to reject
the null hypothesis.
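As a rough sketch of what the two-sample t-test computes, the code below calculates the t statistic for two groups of heights using the pooled-variance (Student's) form, which assumes both groups share a similar variance. The function name and the height values are hypothetical. The standard library has no t distribution, so the statistic is compared against the tabulated two-sided critical value instead of being converted to a p-value; a statistics package would normally report the p-value directly.

```python
import math
import statistics

def two_sample_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df

# Hypothetical heights in centimeters (illustrative values only).
heights_male = [178, 182, 175, 180, 185, 177]
heights_female = [165, 160, 168, 162, 170, 163]

t_stat, df = two_sample_t(heights_male, heights_female)

# For 10 degrees of freedom, the two-sided critical value at the 0.05
# significance level is 2.228 (from a standard t table).
significant = abs(t_stat) > 2.228
print(t_stat, df, significant)
```

An absolute t statistic above the critical value corresponds to a p-value below 0.05, so for these hypothetical samples we would reject the null hypothesis that the two mean heights are equal.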
Notably, when we compare the same group of individuals before and after a
particular treatment, we use a paired t-test, which allows us to evaluate
When we perform a t-test, there is the assumption that the variables are
distributed normally, which means they fall into a symmetrical bell curve. In
scenarios where this assumption cannot be made, we use a nonparametric analog
of the t-test called the Wilcoxon-Mann-Whitney test, which works on the ranks
of the observations and is commonly interpreted as comparing the medians
instead of the means of the two groups.
Analysis of Variance (ANOVA)
If we are interested in comparing multiple groups to see if there is any
impact on a variable that we are studying, we would perform an Analysis of
Variance, or ANOVA. This test allows us to compare the means of more than two
groups to check for any statistically significant variation between them.
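The calculation behind a one-way ANOVA can be sketched as follows: the F statistic is the ratio of the variation between group means to the variation within groups. The function name and the three groups of values are hypothetical, and the sketch stops at the F statistic; converting it to a p-value requires the F distribution, which a statistics package would supply.

```python
import statistics

def one_way_anova_f(*groups):
    """Return (F statistic, df between groups, df within groups)."""
    all_values = [value for group in groups for value in group]
    grand_mean = statistics.mean(all_values)
    group_means = [statistics.mean(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical measurements for three groups (illustrative values only).
group_1 = [1, 2, 3]
group_2 = [2, 3, 4]
group_3 = [5, 6, 7]

f_stat, df_between, df_within = one_way_anova_f(group_1, group_2, group_3)
print(f_stat, df_between, df_within)
```

A large F statistic means the group means differ by more than the spread within each group would suggest. For these values F is approximately 13, with 2 and 6 degrees of freedom.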
When the variables being studied are categorical, we use the chi-squared
test. This allows us to determine whether the distribution of one categorical
variable is associated with membership in a particular group.
For example, we would use the chi-squared test if we wanted to know if there
was an association between gender and handedness. Under the null hypothesis, we
would expect the same proportions of left- and right-handedness among males and
females. The chi-squared test would allow us to determine if our observations
were consistent with this hypothesis, or if in fact there was an association
between gender and handedness.
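The gender and handedness example can be sketched as a 2x2 table. The code below computes the Pearson chi-squared statistic by comparing each observed count with the count expected if the two variables were independent. The counts and the function name are hypothetical, no continuity correction is applied, and the p-value formula used here is valid only for a 2x2 table (one degree of freedom).

```python
import math

def chi_squared_2x2(table):
    """Return (chi-squared statistic, p-value) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-squared tail probability equals
    # erfc(sqrt(chi2 / 2)); this shortcut does not apply to larger tables.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: rows are male/female, columns are right/left-handed.
observed_counts = [
    [43, 9],   # males
    [44, 4],   # females
]

chi2, p_value = chi_squared_2x2(observed_counts)
print(chi2, p_value)
```

For these hypothetical counts the p-value is well above 0.05, so we would not reject the hypothesis that handedness is distributed the same way in both groups.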
Another common analytical technique involves trying to determine the strength
of a linear relationship between two continuous variables (such as height and
weight). In this situation, we calculate a correlation coefficient, denoted r.
Values close to r = 1 indicate a strong positive correlation, values close to
r = -1 indicate a strong negative correlation, and values close to r = 0
indicate little or no linear correlation between the variables.
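A minimal sketch of how the correlation coefficient is computed, assuming the Pearson form of r (the usual choice for two continuous variables). The function name and the height and weight values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical paired measurements (illustrative values only).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]

r = pearson_r(heights_cm, weights_kg)
print(r)  # close to 1: a strong positive linear relationship
```

Reversing the order of one of the lists would flip the sign of r, giving a value close to -1.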
It is important to note that correlation does not imply causation. The
two variables could both be influenced by a third variable, causing them to
change together. In addition, comparing unlike populations may yield misleading
results, so this should be avoided.
Linear Regression Models
When it appears that one variable (the explanatory variable, x) affects the
outcome of the other variable (the response variable, y), a simple linear
regression model is used to estimate the linear relationship between the two.
The goal of using simple linear regression is to determine how changes in the
explanatory variable affect the outcome of the response. How well the model
does this is frequently summarized by the value R². This is known as the
coefficient of determination, and represents the proportion of variation in y
that is explained by the linear regression.
In other words, it tells us how well the linear model we have constructed
fits the observed data. The closer R² is to 1, the stronger the linear
relationship between the two variables we are studying.
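The fit and its coefficient of determination can be sketched with an ordinary least-squares line. The function name and the data are hypothetical; R² is computed as one minus the ratio of unexplained variation to total variation in y.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)

    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x

    # Variation left unexplained by the line versus total variation in y.
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_residual / ss_total
    return slope, intercept, r_squared

# Hypothetical data: x is height in cm, y is weight in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]

slope, intercept, r_squared = fit_line(heights_cm, weights_kg)
print(slope, intercept, r_squared)
```

For these values the slope is about 0.91 kg per cm and R² is above 0.99, meaning the fitted line explains nearly all of the variation in the hypothetical weights.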
More often than not, we are interested in the effects of multiple explanatory
variables on a particular response variable. In these situations, we use a
multiple regression model to identify the relationship between several
explanatory variables and one response. For instance, multiple regression can be
used to link height, weight and bone density to BMI.
Logistic Regression Analysis
In an analogous situation, when we are trying to determine how a binary
response variable (such as disease versus no disease) is affected by a series of
explanatory variables, a logistic regression analysis is used. Hence, a logistic
regression analysis would be used if we were performing a study that aimed to
link blood pressure, cholesterol level, and BMI to the occurrence of heart
Understanding Statistics is the Key to Understanding and Evaluating
As more and more studies emerge promoting revolutionary procedures and
touting other valuable findings, it is important to be able to sift through
these claims in an organized and efficient manner, comparing apples to apples,
if you will. An understanding of, and familiarity with, biostatistics can
notably improve your ability to do so. Biostatistics allows researchers to
determine, with confidence, the reliability of given data sets.
Biostatistics enables you to present data in a meaningful way and to
ultimately make educated choices about which findings to incorporate into your
own repertoire of knowledge and practice.