# Evaluating the Quality of Data Through Statistical Analysis

Aristidis Veves, MD, PhD, and Damanpreet Singh Bedi

Research is the cornerstone of the medical profession, providing important data about illness, injury and biological processes. But we cannot begin to interpret that data until we establish a context into which it can be placed. For this reason, statistical analysis is one of a researcher’s most valuable tools.

Statistical analysis is used to analyze and interpret numerical or categorical data. There are two basic functions for which biological statistical analysis, or “biostatistics,” is used in the health sciences:

- To compare the same set of variables between two different groups of individuals
- To correlate sets of variables to one another or to a particular outcome that is being studied

Presenting and interpreting data in a manner that shows statistical validity is a vital part of this process.

**Understanding the P-Value**

One of the most widely recognized ways to evaluate a statistical test is to look at its p-value. The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the baseline, or null, hypothesis is true. It allows us to decide whether to reject the null hypothesis in favor of the alternative hypothesis being tested, or to fail to reject it.

A value of 0.05 is the conventional “significance level” of a test: when the p-value is less than or equal to 0.05, we reject the null hypothesis. Why? A p-value of 0.05 means that, if the null hypothesis were true (for example, if there were no real relationship between the variables), results at least this extreme would occur only 5 percent of the time by chance. Results that meet this standard are called statistically significant. For an even more rigorous test, a significance level of 0.025 or 0.01 can be used.
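To make the idea of "results at least this extreme under the null hypothesis" concrete, here is a minimal sketch of an exact permutation test. It is only one way to obtain a p-value and is not a method named in this article; the height values and the helper names `mean_difference` and `exact_permutation_p_value` are hypothetical, chosen for illustration.

```python
from itertools import combinations

# Hypothetical heights (cm) for two small groups; illustrative values only.
group_a = [178, 182, 175, 180, 185]
group_b = [165, 160, 170, 162, 168]


def mean_difference(values, indices_a):
    """Mean of the values at indices_a minus the mean of all other values."""
    chosen = set(indices_a)
    a = [v for i, v in enumerate(values) if i in chosen]
    b = [v for i, v in enumerate(values) if i not in chosen]
    return sum(a) / len(a) - sum(b) / len(b)


def exact_permutation_p_value(a, b):
    """Two-sided p-value: share of all relabelings of the pooled data whose
    absolute mean difference is at least as large as the observed one."""
    pooled = a + b
    observed = abs(mean_difference(pooled, range(len(a))))
    extreme = 0
    total = 0
    for indices in combinations(range(len(pooled)), len(a)):
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean_difference(pooled, indices)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


alpha = 0.05  # significance level
p_value = exact_permutation_p_value(group_a, group_b)
print(f"p = {p_value:.4f}; reject null hypothesis: {p_value <= alpha}")
```

With these made-up values the two groups do not overlap at all, so only 2 of the 252 possible relabelings are as extreme as the observed split, giving a p-value of about 0.008, below the 0.05 significance level.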

**Selecting the Appropriate Statistical Tool**

Depending on the nature of the study, researchers have several statistical tools at their disposal.

*The t-Test*

To test whether a continuous variable differs between two categorical groups, we would use a two-sample t-test. A continuous variable is one whose data points are not restricted to particular values, such as whole numbers (height, weight or body fat percentage, for example). A categorical, or nominal, variable is one whose data points fall into unordered categories, such as gender or ABO blood type.

The two-sample t-test allows us to determine whether the **means** of two
categorical groups are statistically the same, or if there is a significant
difference between them.

For instance, if we were trying to determine whether height is related to gender, we would use this t-test to compare the average height of males to the average height of females in our study. To reject the null hypothesis that the two average heights are equal, we would look for the t-test to yield a p-value less than or equal to our 0.05 significance level.

Notably, when we compare the same group of individuals before and after a
particular treatment, we use a **paired t-test**, which allows us to evaluate
paired samples.
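The two t-tests described above can be sketched with the Python standard library alone. This is a minimal illustration, assuming the equal-variance (pooled) form of the two-sample test; the data are hypothetical, and the critical values are the usual two-sided 0.05 entries from a t table for 8 and 4 degrees of freedom. In practice a statistics package would report the p-value directly.

```python
import math
from statistics import mean, variance


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error, df


def paired_t(before, after):
    """Paired t statistic computed on the within-subject differences."""
    diffs = [x - y for x, y in zip(before, after)]
    n = len(diffs)
    standard_error = math.sqrt(variance(diffs) / n)
    return mean(diffs) / standard_error, n - 1


# Two-sided critical values at the 0.05 level, from a standard t table.
T_CRITICAL = {4: 2.776, 8: 2.306}

# Hypothetical heights (cm) of two independent groups.
males = [178, 182, 175, 180, 185]
females = [165, 160, 170, 162, 168]
t_ind, df_ind = two_sample_t(males, females)

# Hypothetical systolic blood pressure (mmHg) for the same five people,
# measured before and after a treatment.
before = [140, 152, 138, 145, 150]
after = [132, 147, 135, 138, 141]
t_pair, df_pair = paired_t(before, after)

for label, t, df in [("two-sample", t_ind, df_ind), ("paired", t_pair, df_pair)]:
    significant = abs(t) > T_CRITICAL[df]
    print(f"{label}: t = {t:.3f}, df = {df}, significant at 0.05: {significant}")
```

The paired version works on the differences within each person rather than on the two sets of measurements separately, which is why it suits before-and-after designs.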

*Wilcoxon-Mann-Whitney Test*

When we perform a t-test, we assume that the variable is normally distributed
within each group, which means its values follow a symmetrical bell curve. In
scenarios where this assumption cannot be made, we use a rank-based analog of
the t-test called the Wilcoxon-Mann-Whitney test, which is commonly interpreted
as comparing the **medians** instead of the means of the two groups.
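As a rough sketch of how this test works, the code below computes the Mann-Whitney U statistic by counting pairs and derives an exact two-sided p-value by enumerating every way of splitting the pooled data into two groups. The data are hypothetical and include one extreme value that would distort a comparison of means; the function names are illustrative only, and the enumeration is practical only for small samples.

```python
from itertools import combinations


def u_statistic(a, b):
    """Count the (a, b) pairs in which the a value is larger; ties count half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def exact_mann_whitney(a, b):
    """Return the smaller U statistic and its exact two-sided p-value."""
    pooled = a + b
    n_a, n_b = len(a), len(b)
    observed = min(u_statistic(a, b), u_statistic(b, a))
    extreme = 0
    total = 0
    for indices in combinations(range(len(pooled)), n_a):
        chosen = set(indices)
        x = [pooled[i] for i in indices]
        y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        u = u_statistic(x, y)
        total += 1
        # A split is "as extreme" if its smaller U is no larger than observed.
        if min(u, n_a * n_b - u) <= observed:
            extreme += 1
    return observed, extreme / total


# Hypothetical skewed measurements; 15.0 is a deliberate outlier.
treated = [1.2, 3.4, 2.2, 15.0, 2.9]
control = [0.8, 1.1, 0.9, 2.0, 1.0]

u_min, p_value = exact_mann_whitney(treated, control)
print(f"U = {u_min}, exact two-sided p = {p_value:.4f}")
```

Because the statistic depends only on the ordering of the values, replacing 15.0 with 150.0 would leave the result unchanged.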

*Analysis of Variance (ANOVA)*

If we are interested in comparing multiple groups to see if there is any impact on a variable that we are studying, we would perform an Analysis of Variance, or ANOVA. This test allows us to compare the means of more than two groups to check for any statistically significant variation between them.
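The calculation behind a one-way ANOVA can be sketched as the ratio of between-group to within-group variability (the F statistic). The example below is a minimal illustration with hypothetical scores for three groups; the critical value 3.89 is the standard F-table entry for 2 and 12 degrees of freedom at the 0.05 level, and the function name is illustrative.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic with its between- and within-group df."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical outcome scores for three treatment groups.
group_1 = [5, 6, 7, 6, 6]
group_2 = [8, 9, 7, 8, 8]
group_3 = [10, 11, 9, 10, 10]

f_stat, df_b, df_w = one_way_anova_f([group_1, group_2, group_3])
F_CRITICAL_2_12 = 3.89  # F table, alpha = 0.05, df = (2, 12)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}; significant: {f_stat > F_CRITICAL_2_12}")
```

A large F indicates that the group means differ by more than the spread within the groups would suggest; it does not say which groups differ from one another.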

*Chi-Squared Test*

When both variables being studied are categorical, we use the chi-squared test. This allows us to determine whether the distribution of one categorical variable differs between the groups defined by the other.

For example, we would use the chi-squared test if we wanted to know if there was a correlation between gender and handedness. Under the null hypothesis, handedness is independent of gender, so we would expect the same proportion of left-handed individuals among males and females. The chi-squared test would allow us to determine whether our observations were consistent with this hypothesis, or whether there was in fact an association between gender and handedness.
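The gender and handedness example can be sketched as a 2×2 table. The counts below are hypothetical, the function name is illustrative, and no continuity correction is applied. For a 2×2 table the test has one degree of freedom, in which case the p-value can be obtained from the complementary error function in the standard library.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table (df = 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts. Rows: male, female. Columns: right-handed, left-handed.
observed_counts = [[43, 9],
                   [44, 4]]

chi2, p_value = chi_squared_2x2(observed_counts)
print(f"chi-squared = {chi2:.3f}, p = {p_value:.3f}")
print("reject independence at 0.05:", p_value <= 0.05)
```

With these made-up counts the p-value is well above 0.05, so the data would be considered consistent with handedness being independent of gender.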

*Correlation Coefficients*

Another common analytical technique involves trying to determine the strength
of a linear relationship between two continuous variables (such as height and
weight). In this situation, we calculate a correlation coefficient, denoted r.
Values close to r = 1 indicate a strong positive correlation, values close to
r = -1 indicate a strong negative correlation, and values close to r = 0
indicate no linear correlation between the variables.
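As an illustration, the sketch below computes the Pearson correlation coefficient, the most common form of r, for hypothetical height and weight measurements. The function name `pearson_r` and the data are assumptions made for this example.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical heights (cm) and weights (kg) for five people.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]

r = pearson_r(heights, weights)
print(f"r = {r:.3f}")
```

Here weight rises almost in step with height, so r comes out close to 1.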

**It is important to note that correlation does not imply causation**. The
two variables could both be influenced by a third variable, causing them to
change together. In addition, comparing unlike populations may yield misleading
results, so this should be avoided.

*Linear Regression Models*

When it appears that one variable (the explanatory variable, x) affects the outcome of the other variable (the response variable, y), a simple linear regression model is used to estimate the linear relationship between the two variables.

The goal of simple linear regression is to determine how changes in the
explanatory variable affect the response. How well the fitted line accounts for
the response is frequently summarized by R², known as the **coefficient of
determination**, which represents the proportion of variation in y that is
explained by the linear regression.

In other words, it tells us how well the linear model we have constructed
fits the observed data. The closer R² is to 1, the stronger the linear
relationship between the two variables we are studying.
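A least-squares fit and its coefficient of determination can be sketched in a few lines. The example below assumes ordinary least squares with a single explanatory variable; the data and the function name are hypothetical.

```python
from statistics import mean


def simple_linear_regression(x, y):
    """Least-squares slope, intercept and coefficient of determination."""
    mean_x, mean_y = mean(x), mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the line versus total variation in y.
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    r_squared = 1 - ss_residual / ss_total
    return slope, intercept, r_squared


# Hypothetical data: height (cm) as explanatory, weight (kg) as response.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]

slope, intercept, r_squared = simple_linear_regression(heights, weights)
print(f"weight = {intercept:.1f} + {slope:.2f} * height, R^2 = {r_squared:.3f}")
```

The slope estimates how much the response changes for each one-unit increase in the explanatory variable, while the coefficient of determination summarizes how closely the points follow the fitted line.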

More often than not, we are interested in the effects of multiple explanatory
variables on a particular response variable. In these situations, we use a
**multiple regression model** to identify the relationship between many
explanatory variables and one response. For instance, multiple regression can be
used to link height, weight and bone density to BMI.
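One way to fit such a model is ordinary least squares via the normal equations, sketched below with the standard library only. The two explanatory variables and the response are synthetic, generated without noise from y = 1 + 2·x1 + 3·x2 so that the recovered coefficients can be checked by eye; the function names are illustrative, and real analyses would normally rely on a statistics package.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple_regression(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y_i for d, y_i in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free data: y = 1 + 2 * x1 + 3 * x2.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
response = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

coefficients = fit_multiple_regression(predictors, response)
print([round(c, 6) for c in coefficients])
```

Each coefficient estimates the change in the response for a one-unit change in that explanatory variable while the others are held fixed.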

*Logistic Regression Analysis*

In an analogous situation, when we are trying to determine how a binary response variable (such as disease versus no disease) is affected by a series of explanatory variables, a logistic regression analysis is used. For example, a logistic regression analysis would be used if we were performing a study that aimed to link blood pressure, cholesterol level, and BMI to the occurrence of heart disease.
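A logistic regression models the probability of the binary outcome as a function of the explanatory variables. The sketch below is deliberately simplified: it uses a single hypothetical, standardized risk score as the only explanatory variable and fits the model by plain gradient ascent on the log-likelihood. The data, the learning rate and the iteration count are assumptions chosen for illustration, not recommendations.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, learning_rate=0.5, iterations=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient ascent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(iterations):
        # Difference between observed outcome and predicted probability.
        errors = [y_i - sigmoid(b0 + b1 * x_i) for x_i, y_i in zip(x, y)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x_i for e, x_i in zip(errors, x)) / n
    return b0, b1


# Hypothetical standardized risk scores and disease status (1 = disease).
risk_score = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
disease = [0, 0, 0, 1, 0, 0, 1, 1, 1]

b0, b1 = fit_logistic(risk_score, disease)
low_risk = sigmoid(b0 + b1 * -2.0)
high_risk = sigmoid(b0 + b1 * 2.0)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print(f"P(disease) at score -2: {low_risk:.2f}; at score +2: {high_risk:.2f}")
```

A positive slope means that higher scores are associated with a higher probability of disease; with several explanatory variables, each would receive its own coefficient.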

**Understanding Statistics is the Key to Understanding and Evaluating Research**

As more and more studies emerge promoting revolutionary procedures and touting other valuable findings, it is important to be able to sift through these claims in an organized and efficient manner, comparing apples to apples, if you will. An understanding of, and familiarity with, biostatistics can notably improve your ability to do so. Biostatistics allows researchers to determine, with confidence, the reliability of given data sets.

Biostatistics enables you to present data in a meaningful way and to ultimately make educated choices about which findings to incorporate into your own repertoire of knowledge and practice.