In the media, in professional analyses, and in academic research, we are constantly confronted with statistics intended to sell a product, prove a hypothesis, or evaluate alternatives. In defense of the field of mathematics, some statisticians and some statistics are actually objective, rigorous, and useful.
But, there remains a nagging doubt in our minds about the statistics we are presented with. Part of the problem is the jargon, and part is a failure to adequately answer these questions: Are the statistical results significant? Should we believe them?
In this article, I want to emphasize the importance of this topic for students conducting doctoral research and writing their dissertations. I will clarify a couple of the relevant concepts, using a real-world example to illustrate. With these concepts in hand, you will be better equipped to conduct valid research and to interpret the research of others.
The two concepts I will cover are statistical significance and practical significance (also called operational significance).
Significance: Some Background
Let’s start with some background. In most statistical analyses, including graduate research and analysis for theses and dissertations, the intent is to answer some simple questions:
- Is there a difference in the average values of an attribute between two separate groups?
- Or, is the average value of an attribute different from some specification?
- Or, does the predicted value of a group attribute change depending on the value of some factor (we call this an effect)?
The tools we use to answer these kinds of questions are the hypothesis and a statistical test (or hypothesis test). The result of the test provides the evidence to justify one of two conclusions (a brief example in code follows the list below):
- Yes, there is sufficient evidence to conclude that a difference or an effect exists. This is what is referred to as statistical significance.
- Or, there is insufficient evidence to conclude that a difference or effect exists. This outcome is statistically non-significant.
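To make this concrete, here is a minimal sketch in Python of what such a hypothesis test looks like in practice. The data, group labels, and significance level are invented for illustration, and a two-sample t-test stands in for whatever test your design actually calls for.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical attribute measured in two separate groups (invented data)
group_a = rng.normal(loc=100.0, scale=15.0, size=50)
group_b = rng.normal(loc=108.0, scale=15.0, size=50)

# Two-sample t-test; the null hypothesis is "no difference in means"
result = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # conventional significance level
if result.pvalue < alpha:
    print(f"p = {result.pvalue:.4f}: sufficient evidence of a difference (statistically significant)")
else:
    print(f"p = {result.pvalue:.4f}: insufficient evidence of a difference (not statistically significant)")
```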
Here is the problem, often subtle, that we must confront. The statistical test might be significant, indicating the presence of a difference or an effect. But, is that difference or that effect meaningful to us, in practical or operational terms? It is not enough to say that a difference or an effect is statistically significant. In graduate research, we expect a bit more insight into the outcome.
Statistical and Practical Significance
Now, we have just introduced the concept of practical (or operational) significance. And there is a huge distinction here from the concept of statistical significance. Statistical significance refers to whether the test provides sufficient evidence that a difference or an effect exists, that is, whether the observed result is unlikely to be due to chance alone. But even if there is statistical significance, the difference or the effect may or may not have any practical consequences or meaning.
It is possible that if our sample size is large enough, and variation in the attribute we care about is small enough, our statistical test will be able to detect just about any difference or effect, no matter how small or trivial.
For example, think about it this way:
Suppose for some practical reason we asked: is there a difference in height between American men with last names beginning with A-M and those with last names beginning with N-Z? We are pretty sure that, for practical purposes, the two groups are the same. But, depending on how precisely we measure a sample of them, there is almost surely some difference between the two groups, though it may be very small (perhaps 1/1000 of an inch). A very large sample may enable the statistical test to detect a very small difference in average height, but the difference is likely meaningless to our lives or work.
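Here is a small Python simulation of that idea. The numbers are invented: a true difference of one-twentieth of an inch (a bit larger than the 1/1000-inch figure above, so that a modest simulation can detect it), a 3-inch standard deviation, and a quarter of a million men per group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 250_000  # a very large (invented) sample per group
heights_a_to_m = rng.normal(loc=69.00, scale=3.0, size=n)  # mean 69.00 in.
heights_n_to_z = rng.normal(loc=69.05, scale=3.0, size=n)  # mean 69.05 in.

result = stats.ttest_ind(heights_a_to_m, heights_n_to_z)
diff = abs(heights_n_to_z.mean() - heights_a_to_m.mean())

print(f"observed difference in means: {diff:.3f} inches")
print(f"p-value: {result.pvalue:.2e}")
# The p-value is tiny (statistically significant), yet a twentieth of an
# inch of height is of no practical consequence to anyone.
```

With a large enough sample, the test flags even this trivial difference as highly significant.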
To summarize: Statistical significance relates to the hypothesis test we employ, and its results, using a sample of the population. Practical significance refers to differences or effects that are operationally meaningful and useful, whether or not the statistics tell us there is a difference or an effect.
Let’s illustrate this using a real-world example.
Auto Reliability by Manufacturer: Significant Differences?
There are many agencies out there telling us how reliable vehicles are, by manufacturer and by model. There are many different measures of reliability. And, there are many methods for collecting and analyzing the data, and not all of them are mathematically sound.
But, no matter, consumers pay attention to reliability statistics, and they make costly purchasing decisions based on what the “data” show. I will show how our concepts of significance need to be considered carefully so we do not make big decisions on shaky evidence.
One common measure of reliability is consumer-reported problems per 100 vehicles, for a given model year (provided via questionnaire). Call this variable PPH. Here is a link to an online article, What You Should Know About Vehicle Reliability, that shows a listing of 31 manufacturers with their reliability in one model year, along with the “best” (top 3) and the “worst” (bottom 3). For this particular year, PPH ranged from 106 to 249, averaging 136 (Carley, 2019; https://www.aa1car.com/library/vehicle_reliability.htm; also, see J.D. Power and U.S. News).
So, using a data set that replicates this analysis, with identical mean and standard deviation for PPH, we can create a notional sample of about 20,000 vehicle reports for the 31 manufacturers (about 620 per manufacturer) for a single model year. I am assuming, for this exercise, equal samples for each manufacturer. By the way, many consumer advocacy and product websites claim sample sizes on the order of 100,000 consumers providing data.
We want to know if there is a difference in average PPH by manufacturer. I used one-way analysis of variance (ANOVA) to perform the statistical hypothesis test. The result of that analysis provides evidence that mean reliability differs among the 31 manufacturers. This is a statistically significant outcome: a difference (between manufacturers) and an effect (reliability associated with manufacturer).
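For readers who want to see what such a test looks like, here is a sketch of a one-way ANOVA on notional data. The per-manufacturer means (spread between roughly 1.06 and 2.49 problems per vehicle, i.e., PPH of 106 to 249) and the Poisson assumption for individual problem counts are my own illustrative choices, not the actual data set.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_manufacturers = 31
n_per_manufacturer = 620

# Notional mean problems per vehicle for each manufacturer (PPH / 100)
means = np.linspace(1.06, 2.49, n_manufacturers)

# Simulated problem counts for each manufacturer's vehicles
# (Poisson counts are an assumption made purely for illustration)
groups = [rng.poisson(lam=m, size=n_per_manufacturer) for m in means]

result = stats.f_oneway(*groups)
print(f"F = {result.statistic:.1f}, p = {result.pvalue:.2e}")
# A tiny p-value: mean reliability differs among manufacturers,
# i.e., a statistically significant effect.
```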
However, is the difference among manufacturers meaningful in practical terms? Is it an operationally significant result?
Remember that the reliability measure is “problems per 100 vehicles.” The difference between any adjoining pair in the ranked list is less than 5 problems reported for every 100 vehicles. At a glance, this does not seem like a huge difference.
Let’s investigate further. Consumers do not own 100 vehicles; they own only one, two, or maybe three. Showing reliability per 100 vehicles may somewhat exaggerate the differences among manufacturers. It is a different story when we look at problems per vehicle. Here, we find that the difference between adjoining pairs averages less than 0.05 problems per vehicle. The difference between the best (1.1 problems per vehicle) and the worst (2.5) in the list of 31 manufacturers is only 1.4 problems per vehicle.
Between the best and #9 on the list, there is a difference of 0.18 problems per vehicle. In fact, the difference between any pair of adjoining manufacturers in the list is not even statistically significant, let alone practically significant. This includes the difference between #28 and #29 (the first of the three in the “worst” category).
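A quick, deterministic way to check one such adjoining pair is to work from summary statistics. The 0.05 problems-per-vehicle gap and the 620 reports per manufacturer come from the discussion above; the within-manufacturer standard deviation of about 1.2 problems per vehicle is my assumption, made only for illustration.

```python
from scipy import stats

# Two adjoining manufacturers, 0.05 problems per vehicle apart
result = stats.ttest_ind_from_stats(
    mean1=1.30, std1=1.2, nobs1=620,   # std1/std2 are assumed values
    mean2=1.35, std2=1.2, nobs2=620,
)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.2f}")
# p is far above 0.05: under these assumptions the adjoining-pair gap
# is not statistically significant, let alone practically significant.
```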
What Can We Learn From This?
So, for this analysis, and many others like it, what is reported as important information, and what our initial analysis shows to be statistically significant, may not be what it seems at first glance.
While the outcome shows that there is a statistically significant difference based on manufacturer using problems per 100 vehicles, the practical differences might not be considered meaningful to the consumer. Understanding this, and looking beyond the claims, a consumer might be more careful about making a multi-thousand-dollar decision based on the reliability figures provided.
Statistical significance does not guarantee that the differences or effects are operationally meaningful—that they mean anything useful for the decision-maker.
Final Thoughts
There are some significant lessons from this discussion and illustration.
- Be wary of claims of differences, or of “significant” differences. Know what those claims actually mean.
- Some advertised differences or effects may not even be statistically significant, even with large samples, despite someone’s claims.
- Be wary of claims of statistical significance, especially with large sample sizes, and of the assumption that statistical significance implies practical significance. With a large enough sample size and small enough variation, almost any effect or difference can be detected and shown to be statistically significant.
- Even with statistical significance, the actual differences or effects may not be operationally or practically important to the decision-maker or consumer.
- When planning your own research, be mindful of effect size as a driver of sample size. Effect size reflects the degree of precision you seek: how much of a difference, or what magnitude of an effect, relative to population variation, you wish or need to detect. The more precision you require (the smaller the effect you wish to detect), the larger the sample size required (see the sketch after this list).
- The effect size should be driven by an operational or practical need.
- Strive to understand the methodologies used by organizations and analysts to produce statistical results and claims of significance. Look for evidence of questionable sampling techniques, grouping of continuous numerical data into categorical or ordinal data, and shaky techniques for handling missing data or subgroups with small sample sizes.
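The effect-size point above can be made concrete with a standard power calculation. Here is a minimal sketch using statsmodels for a two-sample t-test; the effect sizes (Cohen’s d), significance level, and power are conventional textbook values, not figures from this article.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.8, 0.5, 0.2):  # large, medium, small standardized effect sizes
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"effect size d = {d}: about {n:.0f} subjects per group")
# The smaller the effect you need to detect, the larger the required sample.
```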
By all means, look very carefully at claims of statistical significance. Find out for yourself whether the claim is true, and whether it is relevant to your weighty decisions: whether the differences or effects are also practically significant.
And, when performing your own analysis, as you might for a graduate-level thesis or dissertation, understand and report on both statistical and practical significance.