Last Updated on: 3rd February 2024, 01:32 am

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” Sir R. A. Fisher, English statistician, First Session of the Indian Statistical Conference, Calcutta, 1938.[1]

Any worthy research and analysis, whether for a dissertation or thesis in academia or for a project in the real world, is well planned. The first major task that requires planning is data collection. Its criticality is often overlooked because we tend to focus on statistical analysis and interpretation.

But, it stands to reason—if the data are not collected properly and carefully, the analysis will be flawed. And, the process will be inefficient and costly, with unreliable results.

In this article, I will identify the major steps and considerations for quantitative data collection.

What Are We Trying to Accomplish?

african american woman carefully looking at notes in front of laptop

In quantitative analysis, our objective is usually one of these: 

  • Compare numerical attributes of different groups
  • Compare the attribute of a group to a standard or specification
  • Determine the relationship between variables
  • Determine whether one or more variables are causes of, or significant predictors of, a response

Basic Premises of Good Data Collection

Here are the imperatives for rigorous data collection; the details follow in subsequent sections:

  • Use reliable instruments and scientific sampling procedures.
  • Control the collection parameters, factors, and independent variables, if possible.
  • Control the collection as much as possible, even if you cannot control the parameters.
  • Collect all the information you can feasibly obtain, and at least as much information as you need.
  • Collect the right information.
  • Count and measure as precisely as is feasible.
  • Be organized, meticulous, and careful. Document the process as it unfolds.

Instruments

There are many ways to collect data:

  • Inspection or observation: The analyst witnesses, collects, counts or measures, and records the data first-hand.
  • Experimentation: Purposeful changes are made to input variables to observe corresponding changes in an output.[2]
  • Survey: Data are collected using a questionnaire and a sampling process.
  • Secondary data: Data are obtained from another, available source.
  • Modeling and simulation: Data are collected from a simulation using a model that replicates a real system or phenomenon.

Why Do We Sample?

Generally, we are trying to learn something about the attributes of our population, or to compare groups, or to identify causes of responses. In an ideal world, we’d prefer to count or measure every element of the population (a census). But, that is rarely feasible.

So, to conduct the analysis within the limits of our resources (time and money), we rely on samples of the population. We then infer attributes of the population from the sample.

Control the Parameters

We control the sampling process so that the data we collect are suitable to the task. Here are some ways and reasons we control the collection:

  • Sample size: Enough data to minimize the probability of statistical errors (false positive and false negative conclusions).
  • Independence: Avoid autocorrelation—when the value of a data point is quantitatively related to the previous and subsequent data points. Autocorrelation confounds the comparison of groups and the identification of predictors of responses. It might occur in time-series data with inherent cycles, or where there is a time-related influence such as a learning curve or heating effect.
  • Stratification: Sampling so that the sample is representative of and proportional to the population (e.g., demographics). 
  • Robust process: If we are trying to determine the causes or predictors of variations in a response, we want the process minimally affected by external sources of variation.
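To make the stratification described above concrete, here is a minimal sketch of proportional allocation using only the Python standard library. The student population, strata, and counts are invented for illustration:

```python
import random
from collections import Counter

def stratified_sample(population, strata_key, n):
    """Draw a sample of size n whose strata proportions mirror the population's."""
    random.seed(42)  # reproducible for illustration
    groups = {}
    for item in population:
        groups.setdefault(strata_key(item), []).append(item)
    total = len(population)
    sample = []
    for stratum, items in groups.items():
        k = round(n * len(items) / total)  # proportional allocation
        sample.extend(random.sample(items, min(k, len(items))))
    return sample

# Hypothetical population: 600 undergraduates, 300 graduates, 100 doctoral students
population = ([("undergrad", i) for i in range(600)]
              + [("grad", i) for i in range(300)]
              + [("doctoral", i) for i in range(100)])

sample = stratified_sample(population, strata_key=lambda x: x[0], n=100)
print(Counter(s[0] for s in sample))  # 60 undergrad, 30 grad, 10 doctoral
```

A simple random sample of 100 could easily over- or under-represent the small doctoral stratum; proportional allocation guarantees the sample mirrors the population's demographics.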

Planning for Data Collection

A poorly designed collection effort, without careful consideration of sample size, the order of the collection, the response variables, and the predictor variables, will yield plenty of data. But much of the benefit of the data will have been wasted due to confounded predictors, low sample size, and missing or sparse data. These kinds of problems are avoided by meticulous planning.

Planning is an upfront investment. In the planning phase, we 

  • Identify our objectives.
  • Define the attribute of the population we care about.
  • Identify a numerical measure or characterization of that attribute—a response or dependent variable.
  • Define the input/independent variables postulated to influence the response.
  • Postulate the relationship between independent variables and response variables.
  • Identify the target population from which we will sample.
  • Plan specifically how we will control the samples (i.e., stratification) or the factors postulated to influence the outcome (i.e., experimental design). 

Sampling and Experimental Design

Controlling data collection is part of the science of sampling and experimental design.

Experimental design (or design of experiments [DOE]) is a structured, logical, and mathematically sound process. It facilitates a rigorous statistical analysis to determine how a change in inputs results in a corresponding change in outputs. DOE provides the most efficient coverage of the population of factor combinations.

Even when our instrument is non-experimental (e.g., observation or survey), we should follow some of the principles of rigorous experimental design:

  • Controlling the independent variables of the sample items.
  • Balanced distribution of independent variables.
  • Calculating a minimum sample size based on level of significance, statistical power, and precision.
  • Randomization of collection.
  • Measuring variables as precisely as possible.

A Word About Variables

In quantitative analysis, we want the most statistical power, confidence, and precision we can afford. We achieve this objective by our definition and selection of variables.

We strive for continuous numerical variables first. We settle for discrete numerical variables when appropriate. We might use ordinal numerical variables from questionnaires, especially if we can inform those variables with a mean response from multiple items. And finally, we can use categorical variables, when the predictors are not quantifiable.

If we have or can obtain continuous or discrete numerical variables, we should never convert them to categorical form before performing statistical analysis (e.g., converting academic test scores to categories such as quartiles or nominal descriptions [excellent, good, fair, poor]). There is simply too much loss of information when we do that, and a corresponding decrease in statistical power, confidence, and precision. We can categorize our results, but not before we perform the analysis.

What About Sample Size?

The minimum desired sample size is calculated, generally with the assistance of an app such as G*Power.[3] Sample size is a function of the acceptable risk of Type I (false positive) and Type II (false negative) statistical errors. It also depends on our choice of precision, captured in effect size, which depends on the statistical test.

Confidence is the complement of the level of significance, α, which is the probability of a Type I error. Power is the complement of β, the probability of a Type II error. One error may be more important than the other. Consider a medical test for cancer, and which is more critical: a false positive or a false negative. Compare that with a drug test for employment. These considerations drive the analyst’s decision about the relative values of α and β.
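As a sanity check on what a tool like G*Power produces, the classic normal-approximation formula for a two-sided, two-sample comparison can be computed with the Python standard library. The values below assume Cohen's d as the effect size; because this sketch uses the normal rather than the t distribution, G*Power's answer runs a point or two higher, so treat it as a floor:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample t-test, with effect_size given as Cohen's d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = .80
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.5))  # medium effect: 63 per group (G*Power reports 64)
print(n_per_group(0.8))  # large effect: 25 per group
```

The formula makes the trade-offs visible: tightening α, raising power, or demanding detection of a smaller effect all drive the required sample size up, exactly the balancing act described above.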

Precision is driven by the magnitude of the difference in attributes between two groups, or between a group and a specification: the smallest difference that matters to a decision-maker in a practical sense. To illustrate, we might find that the reliability of two cars is statistically different, yet the difference is not meaningful in practice; the cars are virtually the same to the driver. There is no point in paying for a sample size that achieves a result more precise than a decision-maker needs.

We can trade off power, confidence, and precision, depending on what is important to the decision-maker. And, out of necessity, we can balance these parameters and the resulting sample size, with the resources available (e.g., money and time). 

It is remarkable that many sample size tools and sample size calculations do not consider statistical power. They simply compute a sample size based on a level of significance, α. This is to guard against a false positive. But, failing to consider power in the sample size calculation presents an unacceptable risk of a false negative—failing to detect a difference or a statistical effect. 

Sample size must also account for 

  • Response rate for surveys (which may be far lower than we expect). 
  • Missing or corrupted data, which might render an individual sample useless. 
  • Outliers within the data set, which might exert unacceptable influence on the analysis. 

For that reason, sample size should be increased significantly so that when the data set is cleansed and prepared, we are left with a number of valid records that exceeds the minimum sample size calculated based on power, confidence, and precision.
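The inflation arithmetic is simple. In this sketch, the response and invalid-record rates are hypothetical placeholders you would replace with your own estimates:

```python
from math import ceil

def inflated_sample_size(n_min, response_rate, invalid_rate):
    """Invitations needed so that, after nonresponse and data cleansing,
    the expected number of valid records still meets n_min."""
    return ceil(n_min / (response_rate * (1 - invalid_rate)))

# Hypothetical survey: need 128 valid records, expect a 30% response rate,
# and expect 10% of returned records lost to missing/corrupt values or outliers.
print(inflated_sample_size(128, response_rate=0.30, invalid_rate=0.10))  # 475
```

Seeing the multiplier spelled out is sobering: a modest minimum of 128 valid records can require several hundred invitations once realistic attrition is priced in.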

Data Preparation and Cleansing

Data cleansing and preparation comprise the procedures to identify outliers, to handle missing or corrupt data, and to organize data for analysis. 

Outliers

An outlier is an extreme value for a quantitative variable. Outliers can exert a disproportionate effect on the outcome, particularly in small sample sizes. 

There is no universal definition of an outlier. One criterion is a value falling more than 3 standard deviations from the mean. Another method is the boxplot, generated by statistical software such as SPSS, which gives a visual indication of extreme values relative to the other data points. 
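Both criteria can be applied in a few lines with the Python standard library; the scores below are invented. Note how, in this small sample, the extreme value inflates the standard deviation enough that the 3-SD rule misses it, while the boxplot (IQR) rule flags it, which is one reason outliers are especially troublesome at small sample sizes:

```python
from statistics import mean, stdev, quantiles

def sd_outliers(data, k_sd=3.0):
    """Flag values more than k_sd sample standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > k_sd * s]

def boxplot_outliers(data, k_iqr=1.5):
    """Tukey's boxplot rule: values beyond k_iqr * IQR from the quartiles."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return [x for x in data if x < q1 - k_iqr * iqr or x > q3 + k_iqr * iqr]

scores = [72, 75, 78, 80, 81, 83, 85, 88, 90, 19]  # one suspicious value
print(sd_outliers(scores))       # [] -- the 19 inflates the SD and hides itself
print(boxplot_outliers(scores))  # [19] -- the IQR rule catches it
```

This is why the choice of detection method is itself a judgment call that should be made, and documented, in the data collection plan.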

Missing or Corrupt Data

Missing or corrupt data discovered in the analysis phase are challenging, if not impossible, to remedy. The data collection plan should provide procedures to avoid a pattern of missing or corrupt values. It must also provide procedures to identify missing and corrupt data, and the remedies to apply. 

Remedies

There are, generally, two remedies for outliers and missing or corrupt data. The first is to exclude records with unusable values. This is not catastrophic when sample size is large. However, there may be an impact on orthogonality of a designed experiment or representation of the population in a sample. 

A second option is to substitute a reasonable, estimated value for a missing or corrupt value. We might use a mean value; or substitute an estimated value using a predictive regression technique (imputation). This is not an option for an outlier.
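A minimal sketch of the mean-substitution option, assuming missing values are coded as None; the survey responses are invented:

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical survey item with two nonresponses
responses = [4, 5, None, 3, 4, None, 5]
print(impute_mean(responses))  # the two gaps are filled with the mean, 4.2
```

Mean substitution preserves sample size and the variable's mean, but it shrinks the variance, which is why regression-based imputation is often preferred when the missing values are numerous.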

The decision to remove data points with outliers is a judgment call that considers the nature of the analysis and the data, especially sample size. The first step is to determine whether the outlier is aberrant, expected, or reasonable, and what its cause might be. One technique is to run the analysis with and without the outlier to determine the impact it has on the results. If the results differ significantly, a decision must be made about which provides the more accurate description of the situation.
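The run-it-both-ways check can be as simple as recomputing a statistic with and without the suspect point. The repair-time data here are invented:

```python
from statistics import mean

# Hypothetical repair times (hours) with one extreme value at the end
times = [2.1, 2.4, 1.9, 2.2, 2.0, 9.8]

with_outlier = mean(times)
without_outlier = mean(times[:-1])
print(round(with_outlier, 2))     # 3.4  -- the single extreme value
print(round(without_outlier, 2))  # 2.12 -- drags the mean up by over 60%
```

A gap that large between the two results signals that the conclusion hinges on one observation, which is exactly the situation that warrants investigating the outlier's root cause before deciding.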

Whether or not a record with an outlier is included, the presence of an outlier should generate some investigation. There may be some root cause or explanation indicating the possibility of behavior that is worthy of understanding.

When eliminating records for either outliers or missing/corrupt data, it is important to ensure that the minimum sample size is met after remedies are applied. That is why we plan for some percentage of invalid records upfront, and ensure our collection is sufficient to handle the loss of some records.

Final Thoughts

It is not uncommon for analysts to focus on statistical analysis and fail to plan adequately for data collection. The fact is, with a valid, clean data set, the analysis is usually straightforward and reliable. 

Planning data collection is money and effort well spent. It is an investment that pays off by minimizing problems with the data set that commonly arise when it is too late to remedy them:

  • Insufficient sample size, with less-than-desirable statistical confidence, power, and precision.
  • Inadequate compensation for outliers and missing or corrupt data.
  • Inadequate coverage and proportionality, leading to samples that are not representative of the population.
  • Variables that are not the best measures of a population attribute.

These problems are overcome, upfront, by making the investment in time, effort, and thinking. 

This investment prevents placing the statistical analyst in the predicament of having to tell a decision-maker why their study was dead on arrival.

References

[1] Oxford Reference. (2022). https://www.oxfordreference.com/display/10.1093/acref/9780191866692.001.0001/q-oro-ed6-00004418;jsessionid=9B5E160CDF0FE4B0F8D105336930216F

[2] Montgomery, D. C. (2019). Design and analysis of experiments. Wiley.

[3] Faul, F., Erdfelder, E., Buchner, A., & Lang, A. G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160. https://link.springer.com/article/10.3758/BRM.41.4.1149


Branford McAllister

Branford McAllister received his PhD from Walden University in 2005. He has been an instructor and PhD mentor for the University of Phoenix, Baker College, and Walden University; and a professor and lecturer on military strategy and operations at the National Defense University. He has technical and management experience in the military and private sector, has research interests related to leadership, and is an expert in advanced quantitative analysis techniques. He is passionately committed to mentoring students in post-secondary educational programs.