
Previous articles on common multivariate tools (linear regression, binary logistic regression, and ANOVA) highlighted their benefits in real-world analysis and in academic dissertations and theses. Those articles emphasized that the techniques are properly used as components of a comprehensive predictive model-building process. In this article, I want to place our understanding of multivariate analysis within that model-building framework. I will explain the process, its stages, and the rationale, so that you will be able to execute scholarly quantitative research and analysis.

What Do We Mean by Predictive Model-Building?

Predictive model-building is a method for finding the mathematical model that “best” predicts a response variable: the model that minimizes bias and best fits the data. Model-building involves the rigorous use of multivariate tools. It also employs various strategies for model specification, that is, for selecting (adding and eliminating) predictive terms, which we will discuss below.

Complex Real-World Phenomena


Most real-world phenomena are incompletely understood. That’s why we study them.

Many causes and influences in the operational environment potentially affect the behavior of real phenomena. Some are known; some are unknown or undiscovered. We use multivariate tools precisely because of this complexity: the multitude of possible and poorly understood causes, influences, and predictors, and the interactions among them.

We rely on samples to assess behavior. We try to identify the true causes, influences, and predictors from samples. 

This is a stochastic process, with probabilistic answers, involving a measurable risk of not getting it right. Our challenge is to minimize the probability of drawing wrong conclusions from the sample and the analysis.

A single run of a multivariate tool will not get the job done. There is a process, part science and part art, for using the tools to find the best predictive models—to reliably identify the causes, influences, and predictors.

What Decision-Makers Ask

Decision-makers often ask first: Is factor A a significant influence on, or predictor of, response Y? Or: What factors are significant influences on, or predictors of, performance? These questions are incomplete because the influence of A on Y depends on the presence of other factors in the predictive model, and on the interactions and interdependencies among many predictors—some known and some unknown.

The objective is to understand the system well enough to identify all of the predictors, but that is not always possible. Because of this complexity, we cannot make an unqualified claim that A (in isolation) is a significant influence on Y without considering all of the predictors in the model.

Art and Science


The Science of Model-Building

Model-building uses several regression techniques, collaboratively, to find the best model to predict the response variable. These techniques (best-subsets, statistical, and purposeful sequential regression) are covered in my article on regression.

The science relies on reliable statistics focused on goodness-of-fit: adjusted R2 (and its analogs in binary logistic regression), Mallows' Cp, and Akaike's information criterion (AIC). The last two are used with best-subsets regression and focus on avoiding over-specified models (too many nuisance predictors).
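
To make these criteria concrete, here is a minimal sketch of how they might be computed, using Python's statsmodels library and simulated data (the variable names and values are illustrative assumptions, not from any particular study). Mallows' Cp is computed by hand from its definition, because it compares a candidate subset against the full model:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: 100 observations, 5 candidate predictors, 2 real effects.
rng = np.random.default_rng(42)
n = 100
X_full = rng.normal(size=(n, 5))
y = 2.0 + 1.5 * X_full[:, 0] - 0.8 * X_full[:, 2] + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(X_full)).fit()     # full model, used for Cp

def mallows_cp(subset_fit, full_fit):
    """Mallows' Cp: SSE_p / MSE_full - (n - 2p), where p counts the
    subset's parameters including the intercept."""
    n_obs = int(full_fit.nobs)
    p = int(subset_fit.df_model) + 1
    return subset_fit.ssr / full_fit.mse_resid - (n_obs - 2 * p)

# A candidate subset: only the two truly influential predictors.
subset = sm.OLS(y, sm.add_constant(X_full[:, [0, 2]])).fit()

print(f"adjusted R2: {subset.rsquared_adj:.3f}")    # higher is better
print(f"AIC:         {subset.aic:.1f}")             # lower is better
print(f"Mallows' Cp: {mallows_cp(subset, full):.2f} (want Cp close to p = 3)")
```

A subset model is doing well when its adjusted R2 is high, its AIC is low relative to competing subsets, and its Cp is close to its parameter count p.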


The Art of Model-Building

The art of model-building combines techniques to generate a breadth of evidence for making good analytical decisions about model specification. This approach relies on subject matter expert (SME) insights, analyst judgment, trial and error, and the assessment of combinations of predictors against various goodness-of-fit criteria.

The sequence of model runs is not serial. It is iterative: adding and eliminating predictors, and adding them back in various combinations. It also involves repetition, for example, re-running best-subsets regression after successive stages of variable screening (more on that to come).

Pitfalls and Challenges


Two Competing Goals

Inherent in this holistic model-building process is the need to avoid over-specifying models—including predictors that appear significant only because of random sampling variation but are really just noise.

The second, competing challenge is excluding predictors that are, in fact, influential, because the variable selection criteria are overly restrictive. Much of the scientific literature argues against fixating on an arbitrary, stringent level of significance for individual predictors, which leads to an underspecified model.

An underspecified model does not include all of the predictors that influence the response variable, which results in a misleading model. This is the outcome that must be avoided. The emphasis should be on goodness-of-fit—even with predictors that are not individually statistically significant under an overly stringent level of significance. An over-specified model is most likely more accurate but less precise (more variation caused by noisy predictors), while an under-specified model is potentially less accurate (biased) but more precise (less variation). There are reasonable approaches to overcoming these competing challenges during model-building.

Pitfalls with Statistical Regression (Automated Stepwise)

Many analysts rely exclusively on statistical regression, that is, automated stepwise regression (for example, backward elimination), without considering its pitfalls. The inherent problem with stepwise techniques is that the final model (the set of predictors) depends on the order in which predictors are added or eliminated. We need to avoid the temptation to use an automated (statistical) regression approach exclusively. We also need to avoid proceeding through a single, serial, inflexible sequence of manual stepwise analyses. The right approach is comprehensive, thoughtful, and agile: the purposeful sequential regression process, which captures both the art and the science of model-building.
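
To illustrate the order-dependence problem, here is a minimal sketch of naive p-value-driven forward selection and backward elimination run on the same simulated, deliberately collinear data. Everything here (the data, the thresholds, the helper names) is a hypothetical illustration, not a recommended production procedure:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 60
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # x2 nearly collinear with x1
x3 = rng.normal(size=n)
y = 1.0 + 1.2 * x1 + 0.8 * x3 + rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

def backward_elimination(X, y, alpha=0.05):
    """Repeatedly drop the least significant predictor until all remaining
    p-values fall below alpha."""
    kept = list(X.columns)
    while kept:
        fit = sm.OLS(y, sm.add_constant(X[kept])).fit()
        pvals = fit.pvalues.drop("const")
        if pvals.max() < alpha:
            break
        kept.remove(pvals.idxmax())
    return kept

def forward_selection(X, y, alpha=0.05):
    """Repeatedly add the most significant remaining predictor while its
    p-value (in the grown model) stays below alpha."""
    kept, remaining = [], list(X.columns)
    while remaining:
        pvals = {c: sm.OLS(y, sm.add_constant(X[kept + [c]])).fit().pvalues[c]
                 for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        kept.append(best)
        remaining.remove(best)
    return kept

print("backward elimination kept:", backward_elimination(X, y))
print("forward selection kept:  ", forward_selection(X, y))
```

With correlated candidates such as x1 and x2, the two routes can keep different surrogates for the same underlying effect; neither path, on its own, is evidence that the right model was found.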

Overcoming the Challenges

There are several ways we overcome the challenges in model-building, which I have discussed in my other articles. I will elaborate on two of them (marked below) later in this article:

  • Avoiding considering predictors in isolation
  • Power analysis and thoughtful sample size calculations
  • Thoughtful choice of effect size
  • Emphasis on avoiding missing variable bias (Type II statistical errors; elaborated below)
  • Thoughtful variable selection criteria (elaborated below)
  • Combining regression model-building techniques
  • Model-building in stages

The Workhorses 


Purposeful Sequential Regression

Purposeful sequential regression, variously called hierarchical, simultaneous, standard, or user-determined regression, is the workhorse of this holistic approach to model-building. 

Purposeful sequential regression executes a series of manual, individual regression analyses. The analyst explores various models, using an iterative, sometimes repetitive process of fine-tuning the model: adding and removing predictors to achieve maximum goodness-of-fit (predictability). We compare the significance of each predictor (its p value) to our variable selection criterion, and we track goodness-of-fit (for example, adjusted R2). We may remove and then add back predictors based on their effect on adjusted R2 and on knowledge obtained from best-subsets and statistical regression. We try different combinations of predictors, with some trial and error. A predictor added at an early stage may subsequently be removed after other predictors are considered, and predictors removed early may be considered for re-entry at a later stage.

To emphasize: the foundation of this approach is that the influence of any single predictor cannot be considered in isolation from the other predictors or factor-interactions. The overall goal is to obtain the best possible predictive model—the best goodness-of-fit.

To be sure, we want model terms that are truly significant predictors, not terms that merely add noise to the model. This is where judgment applies. We use a liberal variable inclusion criterion at each step where we are deciding whether to add or eliminate a predictor.
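
Here is a minimal sketch of one such decision step, assuming simulated data and a hypothetical decide() helper: a candidate predictor is retained if it clears a liberal p-value criterion (α = .20) and improves adjusted R2. This is one illustrative rule, not the whole purposeful sequential process:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 80
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
# Hypothetical response: x4 has a real but weak effect; x3 is pure noise.
y = 1.0 + 1.5 * X["x1"] + 0.9 * X["x2"] + 0.2 * X["x4"] + rng.normal(size=n)

def decide(base, candidate, alpha=0.20):
    """One fine-tuning step: keep `candidate` only if it clears the liberal
    p-value criterion AND improves adjusted R-squared."""
    without = sm.OLS(y, sm.add_constant(X[base])).fit()
    with_c = sm.OLS(y, sm.add_constant(X[base + [candidate]])).fit()
    keep = (with_c.pvalues[candidate] < alpha
            and with_c.rsquared_adj > without.rsquared_adj)
    print(f"{candidate}: p={with_c.pvalues[candidate]:.3f}, "
          f"adj R2 {without.rsquared_adj:.3f} -> {with_c.rsquared_adj:.3f}, "
          f"{'keep' if keep else 'drop'}")
    return keep

base = ["x1", "x2"]                     # predictors already in the model
for cand in ["x3", "x4"]:               # borderline candidates
    if decide(base, cand):
        base = base + [cand]
print("working model:", base)
```

Depending on the sample, a weak-but-real predictor like x4 may fail a conventional α = .05 screen yet clear the liberal criterion while improving adjusted R2, which is exactly the judgment call described above.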

Model-Building in Stages

To handle the complexity, model-building is most effectively and efficiently performed in five stages: 

  • Stage 1 relies on theory, previous research, empirical results, and subject matter expertise to identify candidate predictors. 
  • Stage 2 addresses missing or corrupt data, outliers, and statistical assumptions including multicollinearity.
  • Stage 3 is a screening stage, to identify and eliminate candidate predictors that are highly unlikely to be significant predictors—that do not contribute to the goodness-of-fit of the model. Stage 3 is performed in multiple segments:
    • Segment A: best-subsets regression analysis to evaluate all combinations of predictors based on Mallows' Cp statistic, AIC, and adjusted R2 (see the sketch after this list).
    • Segment B: statistical regression analyses using automated methods (stepwise, backward, and forward), noting which predictors are consistently included or excluded.
    • Segment C: purposeful sequential regression analysis.
    • Segment D: listing the model compositions from all three segments and selecting the best preliminary model based on the overall evidence.
  • Stage 4 adds the two-factor interactions of the remaining predictors and then repeats Segments A through D. A new Segment E uses graphical analysis to supplement the analysis of the factor-interactions.
  • Stage 5 compares the cumulative body of evidence to select the final model of predictors and factor-interactions.
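
Here is a minimal sketch of Segment A's best-subsets screening, extended with Stage 4's two-factor interactions, again using statsmodels and simulated data. The predictor names and the brute-force enumeration are illustrative assumptions; real candidate pools may be too large to enumerate exhaustively:

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 120
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
# Hypothetical response with one real two-factor interaction, x1:x2.
y = (1.0 + 1.2 * X["x1"] + 0.7 * X["x2"]
     + 0.9 * X["x1"] * X["x2"] + rng.normal(size=n))

# Stage 4: augment the surviving predictors with their two-factor interactions.
terms = X.copy()
for a, b in combinations(X.columns, 2):
    terms[f"{a}:{b}"] = X[a] * X[b]

# Segment A: evaluate every subset, ranking by AIC (lower is better).
results = []
for k in range(1, len(terms.columns) + 1):
    for subset in combinations(terms.columns, k):
        fit = sm.OLS(y, sm.add_constant(terms[list(subset)])).fit()
        results.append((fit.aic, fit.rsquared_adj, subset))

for aic, adj_r2, subset in sorted(results)[:5]:    # five best-ranked models
    print(f"AIC {aic:8.1f}   adj R2 {adj_r2:.3f}   {subset}")
```

In this simulation, the top-ranked subsets should usually include x1, x2, and the x1:x2 interaction, mirroring Segment D's goal of converging on a preliminary model supported by multiple criteria.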

Emphasize Avoiding Missing Variable Bias

There is a tendency to use the “generally accepted” level of significance of .05 and a power of .80 for all problems. Those defaults imply that a Type II error (excluding a truly significant predictor, at β = .20) is treated as four times more tolerable than a Type I error (including a nonsignificant predictor, at α = .05).

Excluding a significant predictor leads to missing variable bias, and this is more serious than including a nonsignificant predictor: it biases the coefficients of the predictors that remain in the model. We should be more concerned with avoiding missing variable bias, even at the cost of losing precision (adding noise to the model).
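
A small simulation makes the asymmetry visible. In this hypothetical setup, x2 truly influences y and is correlated with x1; omitting x2 noticeably distorts the coefficient of x1, while the full model recovers it:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)          # x2 correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
reduced = sm.OLS(y, sm.add_constant(x1)).fit()   # x2 wrongly omitted

print("true coefficient of x1:  2.00")
print(f"full-model estimate:     {full.params[1]:.2f}")
print(f"x2-omitted estimate:     {reduced.params[1]:.2f}  (biased: x1 absorbs x2's effect)")
```

The reduced model's x1 coefficient absorbs part of x2's effect (roughly 2.0 + 1.5 × 0.7 ≈ 3.05 in expectation), which is exactly the missing variable bias described above. Including a pure-noise predictor, by contrast, would typically leave the coefficients near their true values and only widen their standard errors.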

Thoughtful Variable Selection Criteria

Certainly, there needs to be some standard for including predictors; otherwise we get a collection of terms that contribute mostly noise. But an overly stringent variable selection criterion, especially given the issues with stepwise regression, can eliminate good predictors.

I recommend a relatively liberal variable selection criterion during the model-development process, perhaps the reverse of conventional practice: β = .05 (statistical power of .95) and α (the variable selection criterion) = .20. Then, focus on the contribution of each term to goodness-of-fit. We accept a higher risk of a false positive but increase the likelihood of finding a truly influential predictor.
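
As a rough planning sketch, the power of the overall F test in multiple regression can be computed from Cohen's effect size f2 via the noncentral F distribution, implemented here with scipy; the effect size, predictor count, and sample sizes are hypothetical planning values:

```python
from scipy.stats import f as f_dist, ncf

def regression_power(f2, n_predictors, n, alpha):
    """Power of the overall F test in multiple regression, using Cohen's
    noncentrality parameter lambda = f2 * (u + v + 1)."""
    u = n_predictors              # numerator df
    v = n - u - 1                 # denominator df
    lam = f2 * (u + v + 1)
    f_crit = f_dist.ppf(1 - alpha, u, v)
    return 1 - ncf.cdf(f_crit, u, v, lam)

# Hypothetical planning values: medium effect (f2 = .15), 6 candidate predictors.
f2, k = 0.15, 6
for n in (60, 80, 100, 120, 140):
    print(f"n = {n:3d}: power = {regression_power(f2, k, n, alpha=0.05):.2f} "
          f"at alpha = .05, {regression_power(f2, k, n, alpha=0.20):.2f} "
          f"at alpha = .20")
```

Reading down the output shows the sample size at which power reaches the .95 target under the liberal α = .20, versus the lower power available at the conventional α = .05 for the same n.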


Final Thoughts

Problems worth studying in dissertations and theses, and in the real world, are complex and largely poorly understood. There are usually multiple influences, and they interact. We have an array of superb multivariate tools to handle this complexity. But the tools are used most effectively as part of a predictive model-building approach. We strive to find mathematical models with the best fit to the data, accepting a bit of noise in exchange for the most accurate predictions. This cannot be done with a single run of a multivariate tool, or by using only automated stepwise regression. Model-building performed rigorously, using a purposeful sequential approach, is our best path to accurate predictive models and reliable research.


Branford McAllister

Branford McAllister received his PhD from Walden University in 2005. He has been an instructor and PhD mentor for the University of Phoenix, Baker College, and Walden University; and a professor and lecturer on military strategy and operations at the National Defense University. He has technical and management experience in the military and private sector, has research interests related to leadership, and is an expert in advanced quantitative analysis techniques. He is passionately committed to mentoring students in post-secondary educational programs.