How to Read Your DOE Results: Simplified Reading of DOE Output (Without P-Hacking)

July 2026 · 13 min read · Bioprocess Engineering

Contents

  1. The effects Pareto
  2. ANOVA & p-values
  3. R², R²-adj, R²-pred
  4. Lack-of-fit
  5. Residual & Q-Q plots
  6. Model reduction (keep hierarchy)
  7. Reading DOE output without p-hacking
  8. Frequently Asked Questions

You ran the design, the software spat out a wall of tables, and now what? Reading DOE output is a learnable, repeatable skill — you read the same six panels in the same order every time, and each answers one question. This guide walks that order with a worked bioprocess example, so that by the end you know exactly how to read DOE results: which terms are real, whether the model fits, whether you can trust it, and how to avoid the trap of p-hacking your way to a false conclusion. If you have not run the design yet, build it first in a free design of experiments calculator; this article picks up where the run sheet comes back full of numbers.

The effects Pareto

The effects Pareto is the first thing to read: a bar chart of the absolute standardized effect of every term, sorted largest to smallest, with a reference line marking significance. Bars past the line are the factors that drive your response; bars short of it are noise. It answers "which factors matter?" in a single glance, which is why it is the workhorse view for reading DOE output.

Each bar's height is the effect divided by its standard error, a standardized effect, so main effects and interactions sit on the same scale. (Some packages plot the raw effect in response units instead, as Figure 1 does; the reading is the same.) The reference line comes from a t-test when you have an error estimate (replicates or center points), or from Lenth's method when the design is unreplicated and saturated. A bar that extends past the line is unlikely to be chance alone.

Figure 1. An effects Pareto for a 2³ titer study. Three terms clear the ~0.10 g/L significance threshold (teal); the rest are noise (grey). Read this first.

In the example above, temperature (A), feed rate (C), and a temperature×pH interaction (AB) clear the line; pH's own main effect (B) and the small interactions do not. That is your shortlist before you even open the ANOVA table. Notice that pH barely registers on its own yet appears in a significant interaction — a pattern that decides which terms you must keep later. The presence of a significant interaction is the payoff of running a factorial rather than changing one factor at a time — a contrast drawn out in our full factorial design guide.
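As a rough illustration of where the Pareto bars come from, the sketch below estimates each effect as the mean response at the term's +1 contrast minus the mean at its -1 contrast, then sorts by absolute size. The responses are synthetic and noise-free: they are built from the effect sizes quoted in this article, while the grand mean (2.70 g/L) and the exact values of the three small interactions are hypothetical placeholders, not results from the study.

```python
from itertools import product

# Coded 2^3 full factorial: every combination of A, B, C at -1/+1.
runs = list(product([-1, 1], repeat=3))

# Effect sizes from the article; the grand mean and the three small
# interaction values are hypothetical placeholders for illustration.
GRAND_MEAN = 2.70
assumed = {"A": 0.30, "B": 0.06, "C": 0.22, "AB": 0.19,
           "AC": 0.04, "BC": -0.03, "ABC": 0.02}

def contrast(term, run):
    """Product of the coded levels of the factors named in `term`."""
    levels = dict(zip("ABC", run))
    value = 1
    for factor in term:
        value *= levels[factor]
    return value

# Synthetic responses: grand mean plus half of each effect times its contrast.
y = [GRAND_MEAN + sum(e / 2 * contrast(t, r) for t, e in assumed.items())
     for r in runs]

def estimate_effect(term):
    """Mean response where the contrast is +1 minus the mean where it is -1."""
    hi = [yi for yi, r in zip(y, runs) if contrast(term, r) == 1]
    lo = [yi for yi, r in zip(y, runs) if contrast(term, r) == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

effects = {t: estimate_effect(t) for t in assumed}

# Pareto order: largest absolute effect first, drawn as a crude text bar chart.
pareto = sorted(effects, key=lambda t: abs(effects[t]), reverse=True)
for term in pareto:
    bar = "#" * round(abs(effects[term]) * 100)
    print(f"{term:>3}  {effects[term]:+.2f}  {bar}")
```

Because the design is orthogonal, each estimate recovers exactly the effect that was put in. With real, noisy data the same calculation applies, and the reference line then has to come from an error estimate or from Lenth's method.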

ANOVA & p-values

ANOVA (analysis of variance) puts a p-value on each effect, confirming what the Pareto suggested. The ANOVA DOE table partitions the total variation into the part explained by each term and the part left as error, then tests each term with an F-ratio. A term with p < 0.05 is statistically significant at the usual 5% level.

Read the ANOVA table top to bottom: the overall model p-value should be small (the model as a whole explains real variation), each retained term's p-value should be below 0.05 unless the term is kept for hierarchy, and the lack-of-fit p-value (covered below) should be large. Do not stop at the p-values, though: a tiny p-value on a trivially small effect is statistically real but practically useless. Always cross-read the effect size against the p-value.

Table 1. ANOVA for the 2³ titer study (response in g/L). The Verdict column marks the significant terms.
Term             | Effect (g/L) | F-value | p-value | Verdict
A: Temperature   | +0.30        | 96.4    | <0.001  | Significant
C: Feed rate     | +0.22        | 51.8    | 0.002   | Significant
AB: Temp×pH      | +0.19        | 38.6    | 0.006   | Significant
B: pH            | +0.06        | 3.8     | 0.098   | Keep (hierarchy)
AC, BC, ABC      | ≤0.04        | <2      | >0.20   | Drop
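One way to cross-read effect size against the F-ratio: in a balanced two-level design every term carries one degree of freedom and its sum of squares is N × effect² / 4, so with a shared error mean square the F-ratios scale with the square of the effects. The sketch below applies that relationship to the Table 1 values. It assumes all terms were tested against the same error term, which the article does not state.

```python
# Effects (g/L) and F-values as reported in Table 1.
effects = {"A": 0.30, "C": 0.22, "AB": 0.19, "B": 0.06}
reported_f = {"A": 96.4, "C": 51.8, "AB": 38.6, "B": 3.8}

# With SS_term = N * effect**2 / 4 and a common error mean square,
# F_term / F_A = (effect_term / effect_A) ** 2.
implied_f = {t: reported_f["A"] * (e / effects["A"]) ** 2
             for t, e in effects.items()}

for term in effects:
    print(f"{term:>2}: reported F = {reported_f[term]:5.1f}, "
          f"implied F = {implied_f[term]:5.1f}")
```

The implied values land within about 0.1 of the reported ones, the size of discrepancy you would expect from effects rounded to two decimals. A table where they disagree badly is worth a second look.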

R², R²-adj, R²-pred

The three R² values tell you how well the model fits and whether it will predict new data. They are the most misread numbers in DOE, because a high R² alone proves almost nothing.

R² is the fraction of variation the model explains. Its flaw: it never decreases when you add a term, even a meaningless one, so a saturated model can hit R² = 1.00 while predicting nothing. Adjusted R² penalizes each extra term, so it drops when a term explains less than its one degree of freedom is worth (an F-ratio below 1), which makes it a more honest measure of fit. Predicted R² is the one that matters most: computed by leaving each run out in turn, predicting it, and summing the squared prediction errors (the PRESS statistic), it estimates performance on new data.

The diagnostic to watch is the gap between adjusted and predicted R². If they are close, the model generalizes. If predicted R² is more than about 0.2 below adjusted R² — or worse, negative — the model is overfit: it memorized your runs and will not predict the next one. For the worked example, R² = 0.96, adjusted R² = 0.93, predicted R² = 0.87 — a healthy, generalizable model.
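The three statistics can be sketched in a few lines. The data below are hypothetical (a 2³ factorial run twice, with invented titer values), and the fit leans on a shortcut that only holds for balanced, orthogonal ±1 designs: coefficients are simple averages and every run has the same leverage p/N. For a general design you would use a least-squares routine and per-run leverages instead.

```python
from itertools import product

def contrast(term, run):
    """Product of the coded levels of the factors named in `term`."""
    levels = dict(zip("ABC", run))
    value = 1
    for factor in term:
        value *= levels[factor]
    return value

# Hypothetical data: a 2^3 factorial run twice (16 runs), titer in g/L.
runs = list(product([-1, 1], repeat=3)) * 2
y = [2.53, 2.70, 2.35, 2.62, 2.60, 2.87, 2.84, 3.10,
     2.48, 2.74, 2.41, 2.57, 2.64, 2.81, 2.90, 3.06]

terms = ["A", "B", "C", "AB"]       # reduced model discussed in the article
n, p = len(y), len(terms) + 1       # p counts the intercept

# Orthogonal +/-1 columns turn least squares into simple averages.
b0 = sum(y) / n
coef = {t: sum(contrast(t, r) * yi for r, yi in zip(runs, y)) / n
        for t in terms}
fitted = [b0 + sum(coef[t] * contrast(t, r) for t in terms) for r in runs]
resid = [yi - fi for yi, fi in zip(y, fitted)]

sst = sum((yi - b0) ** 2 for yi in y)        # total sum of squares
sse = sum(e ** 2 for e in resid)             # residual sum of squares
leverage = p / n                             # same for every run in this design
press = sum((e / (1 - leverage)) ** 2 for e in resid)  # leave-one-out errors

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))
r2_pred = 1 - press / sst

print(f"R2 = {r2:.3f}, adjusted = {r2_adj:.3f}, predicted = {r2_pred:.3f}")
print(f"adjusted - predicted gap = {r2_adj - r2_pred:.3f}")
```

The `e / (1 - leverage)` term is the standard shortcut for a leave-one-out residual in linear least squares, so no refitting loop is needed.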

Lack-of-fit

The lack-of-fit test asks whether the model misses real structure beyond random noise, and you want it to be NOT significant (p > 0.05). It works by splitting the leftover error into two parts: pure error, measured from replicate runs (or center points) at identical settings, and lack-of-fit error, the systematic mismatch between model and data.

If lack-of-fit is significant (p < 0.05), the model is missing something — almost always curvature that a straight-line two-level model cannot bend to capture. The fix is not to keep torturing the current data; it is to add center points to confirm the curvature and then move to a response surface methodology design (central composite or Box-Behnken) that can fit the quadratic. In the worked example, lack-of-fit p = 0.42 — comfortably non-significant, so the linear-plus-interaction model is adequate over the tested range.
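The split of residual error into pure error and lack-of-fit can be sketched directly. The data are hypothetical (a duplicated 2³ factorial with invented titers), and the averaging shortcut for the fit again assumes a balanced, orthogonal ±1 design. The p-value is left out because it needs an F-distribution tail probability; with SciPy available, `scipy.stats.f.sf(f_lof, df_lof, df_pe)` would supply it.

```python
from collections import defaultdict
from itertools import product

def contrast(term, run):
    """Product of the coded levels of the factors named in `term`."""
    levels = dict(zip("ABC", run))
    value = 1
    for factor in term:
        value *= levels[factor]
    return value

# Hypothetical data: a 2^3 factorial run twice (16 runs), titer in g/L.
runs = list(product([-1, 1], repeat=3)) * 2
y = [2.53, 2.70, 2.35, 2.62, 2.60, 2.87, 2.84, 3.10,
     2.48, 2.74, 2.41, 2.57, 2.64, 2.81, 2.90, 3.06]

terms = ["A", "B", "C", "AB"]
n, p = len(y), len(terms) + 1

# Least-squares fit; orthogonal +/-1 columns reduce it to averages.
b0 = sum(y) / n
coef = {t: sum(contrast(t, r) * yi for r, yi in zip(runs, y)) / n
        for t in terms}

def predict(run):
    return b0 + sum(coef[t] * contrast(t, run) for t in terms)

# Group replicate responses by design point.
groups = defaultdict(list)
for r, yi in zip(runs, y):
    groups[r].append(yi)

# Total residual error, then its two components.
sse = sum((yi - predict(r)) ** 2 for r, yi in zip(runs, y))
ss_pe = sum(sum((yi - sum(ys) / len(ys)) ** 2 for yi in ys)
            for ys in groups.values())               # scatter within replicates
ss_lof = sum(len(ys) * (sum(ys) / len(ys) - predict(r)) ** 2
             for r, ys in groups.items())            # replicate mean vs model

df_pe = n - len(groups)     # runs minus distinct design points
df_lof = len(groups) - p    # distinct design points minus model parameters
f_lof = (ss_lof / df_lof) / (ss_pe / df_pe)

print(f"SS pure error = {ss_pe:.4f} (df {df_pe})")
print(f"SS lack-of-fit = {ss_lof:.4f} (df {df_lof})")
print(f"Lack-of-fit F = {f_lof:.2f}")
```

Without replicates `df_pe` is zero and the test cannot be run at all, which is why center points or repeated runs belong in the design from the start.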

Residual & Q-Q plots

Residual plots check the assumptions the p-values rely on: constant variance, normality, and independence. A model can pass every p-value test and still be wrong if its residuals misbehave, so this panel is not optional.

Read two plots. Residuals vs fitted values should be a shapeless, horizontal cloud — a funnel (spread growing with the fitted value) signals non-constant variance and often calls for a transform (log, square-root). The normal Q-Q plot should have points hugging the diagonal — systematic curvature means the residuals are not normal, undermining the F-tests. A third quick check: residuals vs run order should show no drift or trend, which would flag a time or batch effect.

[Figure: Residual diagnostic plots for a DOE model. Left panel: residuals vs fitted value (random cloud, constant variance). Right panel: normal Q-Q plot vs theoretical quantile (points on the line, normal).]
Healthy residual diagnostics: a structureless cloud on the left (constant variance) and points on the diagonal on the right (normality). A funnel shape or a curved Q-Q line means the model assumptions are violated.
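The plots are judged by eye, but the quantities behind them are easy to compute. The sketch below builds the normal Q-Q coordinates using only the Python standard library and adds two rough numeric summaries: the correlation of the Q-Q points (near 1 when they hug the diagonal) and the correlation of residuals with run order (near 0 when there is no drift). The residuals are hypothetical, and the plotting positions (i - 0.5)/n are one common convention among several.

```python
from statistics import NormalDist, mean

# Hypothetical residuals (g/L) from a fitted DOE model, listed in run order.
resid = [0.012, -0.031, 0.004, 0.027, -0.008, -0.015, 0.041, -0.022,
         0.018, -0.003, -0.047, 0.009, 0.033, -0.011, 0.001, -0.008]

n = len(resid)
ordered = sorted(resid)

# Theoretical standard-normal quantiles at plotting positions (i - 0.5) / n.
theo = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (v - my) for x, v in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((v - my) ** 2 for v in ys)
    return sxy / (sxx * syy) ** 0.5

# Q-Q straightness: correlation between theoretical and observed quantiles.
qq_corr = pearson(theo, ordered)

# Drift check: correlation between run order and residual.
drift_corr = pearson(list(range(1, n + 1)), resid)

for t, r in zip(theo, ordered):
    print(f"{t:+.2f}  {r:+.3f}")      # (x, y) points of the Q-Q plot
print(f"Q-Q correlation = {qq_corr:.3f}, run-order correlation = {drift_corr:.3f}")
```

These numbers are a supplement to the plots, not a replacement: a single outlier or a funnel shape is far easier to spot in the picture than in a correlation coefficient.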

Model reduction (keep hierarchy)

Model reduction means dropping the non-significant terms to get a simpler, better-predicting model — but you must respect effect hierarchy. The hierarchy rule: if an interaction (say AB) is in the model, both of its parents (A and B) must stay, even if a parent is individually non-significant.

In the worked example, pH (B) has p = 0.098, which is not significant on its own. It is tempting to drop it. But the temperature×pH interaction (AB) is significant, and pH is one of its parents; removing a lower-order parent of a retained interaction produces a non-hierarchical model, which is harder to interpret and whose coefficients change meaning when you convert between coded and actual factor units. So pH stays in the model even though its own main effect is small. The standard practice is backward elimination: remove the least significant term that is not protected by hierarchy, refit, and repeat, checking that predicted R² rises as clutter falls. Stop when every remaining term is either significant or required by hierarchy.
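The hierarchy rule is simple to encode. The sketch below picks, at each step, the least significant term that is not a parent of any term still in the model. The p-values for B, C and AB come from Table 1; the specific values for A, AC, BC and ABC are hypothetical stand-ins consistent with the "<0.001" and ">0.20" entries. The loop also holds the p-values fixed between steps, which a real backward elimination would not do: it would refit after each removal.

```python
from itertools import combinations

def parents(term):
    """All lower-order terms contained in `term`, e.g. 'AB' -> {'A', 'B'}."""
    return {"".join(c)
            for k in range(1, len(term))
            for c in combinations(term, k)}

def next_to_drop(p_values, alpha=0.05):
    """Least significant term that hierarchy allows removing, or None."""
    protected = set()
    for term in p_values:
        protected |= parents(term)    # parents of any term still in the model
    candidates = {t: p for t, p in p_values.items()
                  if p > alpha and t not in protected}
    return max(candidates, key=candidates.get) if candidates else None

# B, C, AB from Table 1; A, AC, BC, ABC are hypothetical illustrative values.
p_values = {"A": 0.0005, "B": 0.098, "C": 0.002, "AB": 0.006,
            "AC": 0.25, "BC": 0.40, "ABC": 0.60}

model = dict(p_values)
dropped = []
while (term := next_to_drop(model)) is not None:
    dropped.append(term)
    del model[term]    # a real analysis would refit here and update p-values

print("dropped:", dropped)
print("kept:", sorted(model))
```

Note that the three-way interaction has to go first, because while it is in the model it protects every two-way interaction. B survives to the end purely because AB is still present.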

Worked read-through: the 2³ titer study

  1. Pareto: A (temp, +0.30), C (feed, +0.22), and AB (+0.19) clear the ~0.10 line; B, AC, BC, ABC do not.
  2. ANOVA: model p < 0.001; A (p < 0.001), C (p = 0.002), AB (p = 0.006) significant; B (p = 0.098) kept for hierarchy; AC/BC/ABC dropped.
  3. Fit: R² = 0.96, adjusted R² = 0.93, predicted R² = 0.87 — small adj-to-pred gap, generalizes well.
  4. Lack-of-fit: p = 0.42 — not significant; the linear+interaction model is adequate over the range.
  5. Residuals: random cloud vs fitted, points on the Q-Q line — assumptions hold.
  6. Conclusion & confirm: temperature and feed rate are the dominant positive main effects, so set both high; the significant temperature×pH interaction (AB) means the best pH depends on temperature, so pH must be tuned jointly with temperature rather than in isolation. Predicted best-corner titer ≈ 3.1 g/L. Run that condition to confirm before believing it.
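The joint tuning of pH and temperature in step 6 can be made concrete with the reduced model in coded units (each factor at -1 or +1). In a two-level design the regression coefficient is half the effect, so the Table 1 effects give the coefficients below. The intercept β₀ (the grand mean) is not reported in this article, so it is left symbolic.

```latex
\begin{align*}
\hat{y} &= \beta_0 + 0.15\,x_A + 0.03\,x_B + 0.11\,x_C + 0.095\,x_A x_B \\
\left.\frac{\partial \hat{y}}{\partial x_B}\right|_{x_A=+1} &= 0.03 + 0.095 = +0.125 \\
\left.\frac{\partial \hat{y}}{\partial x_B}\right|_{x_A=-1} &= 0.03 - 0.095 = -0.065 \\
\hat{y}\,(x_A = x_B = x_C = +1) &= \beta_0 + 0.15 + 0.03 + 0.11 + 0.095 = \beta_0 + 0.385
\end{align*}
```

At high temperature, raising pH helps (about +0.125 g/L per coded unit); at low temperature it slightly hurts. That sign flip is what a significant AB interaction means in practice, and it is why the best corner under this model sets all three factors high.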

Reading DOE output without p-hacking

P-hacking is manipulating the analysis until something crosses p < 0.05, and it is the single biggest threat to a trustworthy DOE conclusion. With enough models, responses, and subsets to try, a false positive is almost guaranteed — and DOE output, with its many terms and responses, is fertile ground for it.

The classic p-hacking moves in reading DOE output: fitting a dozen candidate models and reporting only the one where your favored factor is significant; measuring ten responses and highlighting only the one with p < 0.05; adding and removing terms while watching the p-value until it dips under the line; or treating an exploratory data-driven finding as if it had been hypothesized in advance. Each inflates the false-positive rate far above the nominal 5%.
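The inflation is easy to quantify under a simplifying assumption. If each of k tests is independent and run at level α, the chance of at least one false positive is 1 - (1 - α)^k. Tests on models fitted to the same runs are not independent, so treat the numbers from this sketch as an order-of-magnitude guide rather than an exact rate.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 10, 20):
    print(f"{k:>2} tests at alpha = 0.05 -> {familywise_error(0.05, k):.1%}")
```

Ten responses at the 5% level already give roughly a 40% chance of at least one spurious "hit" when nothing real is going on.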

Five habits keep you honest. Pre-specify the model form and the responses before you run. Respect hierarchy rather than surgically keeping only the terms you like. Judge fit with predicted R², which punishes overfitting the way in-sample R² cannot. Correct for multiplicity when you genuinely test many responses. And above all, confirm the predicted optimum with a fresh run — a real confirmation experiment is the one test p-hacking cannot survive, because a spurious effect will not reproduce. The discipline behind these habits is documented in the wider science-reproducibility literature on p-hacking, and it maps directly onto the way you should read every DOE. For the analysis-to-optimum workflow in a bioprocess setting, see our DOE for bioprocess optimization guide, and to make sure the study was powered to find a real effect in the first place, size it with how many experiments a DOE needs.
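For the multiplicity habit, one widely used option is Holm's step-down procedure, sketched below. It is offered as an example of a correction, not as the method this article prescribes, and the response names and p-values are hypothetical.

```python
def holm(p_values, alpha=0.05):
    """Return {name: rejected?} using Holm's step-down procedure."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = rank + 1) is compared with
        # alpha / (m - k + 1); after the first failure nothing else is rejected.
        still_rejecting = still_rejecting and p <= alpha / (m - rank)
        decisions[name] = still_rejecting
    return decisions

# Hypothetical p-values for one factor tested against five responses.
responses = {"titer": 0.004, "viability": 0.030, "lactate": 0.040,
             "aggregation": 0.200, "glycan": 0.600}

for name, significant in holm(responses).items():
    print(f"{name:<12} p = {responses[name]:.3f}  significant: {significant}")
```

Three of these five responses have raw p-values under 0.05, but only titer survives the correction. That gap is the difference between an honest read and highlighting whatever dipped under the line.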

Table 2. Reading DOE output cheat-sheet — what each statistic means, the good value, and the red flag.
Read this                | It tells you                    | Good value               | Red flag
Effects Pareto           | Which terms are large           | A few bars past the line | Nothing clears the line
Model p-value            | Does the model explain anything | < 0.05                   | > 0.05 (no signal)
Term p-value (ANOVA)     | Is this effect real             | < 0.05 (or hierarchy)    | Cherry-picked models
Adjusted vs predicted R² | Will it predict new data        | Gap < 0.2                | Predicted R² low or negative
Lack-of-fit              | Is structure missing            | p > 0.05                 | p < 0.05 (curvature)
Residual / Q-Q plots     | Are assumptions met             | Random cloud, on-line    | Funnel or curved Q-Q

Frequently Asked Questions

How do I read my DOE results?

Read a DOE in a fixed order. Start with the effects Pareto to see which terms are large. Confirm significance with the ANOVA p-values (keep terms with p < 0.05, plus any parent terms required by hierarchy). Check the fit quality with R-squared, adjusted R-squared, and predicted R-squared, and confirm the lack-of-fit test is not significant. Finally, validate the assumptions with residual and normal Q-Q plots. Only then trust the model's predicted optimum, and confirm it with a real run.

What is an effects Pareto in DOE?

An effects Pareto is a bar chart of the absolute standardized effect of each term (main effects and interactions), sorted from largest to smallest. A reference line (from a t-test or Lenth's method) marks the significance threshold; bars past the line are the terms that matter. It is the fastest single view for reading DOE output because it shows at a glance which few factors drive the response and which are noise.

What is the difference between R-squared, adjusted R-squared, and predicted R-squared?

R-squared is the fraction of variation the model explains, but it always rises when you add terms, even useless ones. Adjusted R-squared penalizes extra terms, so it can fall if a term adds nothing. Predicted R-squared, computed by leaving each run out in turn (PRESS), estimates how well the model predicts new data. A large gap between adjusted and predicted R-squared (more than about 0.2) signals overfitting: the model fits the runs you have but will not predict new ones.

What is p-hacking in DOE and how do I avoid it?

P-hacking is torturing a dataset until something looks significant: trying many models, dropping and re-adding terms, or testing dozens of responses and reporting only the p < 0.05 hits. In DOE it inflates false positives. Avoid it by pre-specifying the model and responses before you run, respecting effect hierarchy, using predicted R-squared rather than R-squared to judge fit, and always confirming the predicted optimum with a fresh experiment. Reading DOE output without p-hacking means letting the pre-planned analysis decide, not the other way around.

What does a significant lack-of-fit mean?

A significant lack-of-fit (p < 0.05) means the model misses real structure in the data beyond pure noise, usually curvature that a straight-line model cannot capture. The test compares the lack-of-fit portion of the residual error with the pure error estimated from replicate runs. You want lack-of-fit to be NOT significant (p > 0.05). If it is significant, add higher-order terms or move from a two-level factorial to a response-surface design.

References

  1. Head, M.L., Holman, L., Lanfear, R., Kahn, A.T. & Jennions, M.D. (2015). The extent and consequences of p-hacking in science. PLOS Biology, 13(3), e1002106. DOI: 10.1371/journal.pbio.1002106
  2. Montgomery, D.C. (2017). Design and Analysis of Experiments, 9th ed. Wiley. ISBN 978-1119113478.
  3. NIST/SEMATECH (2012). e-Handbook of Statistical Methods, Section 5.4: Analysis of DOE data. itl.nist.gov
