Should you use Frequentist or Bayesian for A/B testing?
I'm often asked whether businesses should use frequentist or Bayesian statistics for their A/B testing.
There are good arguments on both sides, and smart people can disagree on the best approach. However, my view is that it often doesn't make a huge difference in practice.
TLDR version - for high-traffic A/B tests allowed to run to completion, frequentist and Bayesian statistics often lead to the same conclusion.
Here's my perspective as someone who leans slightly towards the frequentist camp.
First, a quick refresher on the difference between the two schools of statistics:
- Frequentist statistics relies solely on the data from the current experiment. It calculates probabilities based on the frequency of observed outcomes. For A/B testing, this means looking at the results only from the current test, without considering any prior information.
- Bayesian statistics incorporates prior beliefs or assumptions along with the current data. It uses Bayes' theorem to update probabilities as more information becomes available. For A/B testing, this means using learnings from previous tests to inform the analysis of the current test.
So in a nutshell: Frequentists ignore prior information while Bayesians embrace it.
The Core Debate
The frequentist vs Bayesian debate essentially boils down to whether probability describes a fixed, objective reality, or our evolving beliefs about that reality as we gather more data and information.
Frequentists hold the perspective that there is an objective, fixed reality. Probabilities express our uncertainty about that reality. As we run experiments and gather data, we gain a better understanding of true probabilities.
Bayesians, on the other hand, treat probabilities as degrees of belief that evolve as we obtain more evidence. All beliefs are provisional and open to updating. Prior knowledge informs current probabilities, which are revised as new data arrives.
Do Fixed Truths Exist?
For most A/B testing situations, I would argue the frequentist perspective makes more sense. We generally believe there is a fixed, true conversion rate for each version that is unchanging over the experiment timeframe. Our goal is to estimate that true underlying conversion rate as accurately as possible based on experiment data.
The Bayesian view suggests that conversion rates evolve over time as visitor behavior changes. However, most A/B tests run for a short period where major shifts in customer behavior are unlikely. Plus, longer-term trends affect both versions and don't necessarily change the difference between them.
So for these reasons, I think assuming a fixed, unchanging truth that we are estimating via data is reasonable for the constrained timeframe of most A/B tests.
Given the above, frequentist statistics has some advantages for A/B testing purposes:
- It is intuitively simple - if version A converted 100 of 1,000 visitors (10%) and version B converted 110 of 1,000 (11%), the comparison is direct: did B's observed lift exceed what chance alone would produce?
- It provides objectivity - results are driven entirely by current data without room for subjective priors or assumptions
- It is conservative - it avoids prematurely declaring an ineffective change a winner or overstating confidence in the results
- It detects long-term changes - as data accumulates, real differences will eventually reveal themselves regardless of priors
These qualities make frequentist statistics a good fit for A/B testing's purpose of identifying which version truly performs better.
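To make this concrete, here's a minimal sketch of the frequentist calculation using the counts from the bullet above (100/1,000 vs 110/1,000). It's a standard two-proportion z-test with a pooled standard error, built only on the Python standard library:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Frequentist two-proportion z-test with a pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# The 100/1,000 vs 110/1,000 example from the bullet above.
z, p = two_proportion_z_test(100, 1000, 110, 1000)
print(f"z = {z:.2f}, p-value = {p:.3f}")  # roughly z = 0.73, p ≈ 0.47
```

Notably, with these counts the test does not reach significance (p ≈ 0.47), which illustrates the conservatism point: the observed rates favor B, but a frequentist withholds judgment until more data arrives.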
Bayesian statistics also has appealing characteristics:
- It allows incorporating helpful prior data - learnings from past tests can provide useful context
- It can help make decisions faster - particularly when sample sizes are very small
- It accounts for uncertainty - priors quantify the initial uncertainty and its evolution as data arrives
- It provides a common-sense interpretation - probabilities express degree of belief rather than just frequency
These traits can be useful in some A/B testing scenarios, especially early in a test or for niche segments. Overall though, I don't find them essential for most standard A/B tests.
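For comparison, here's a sketch of a common Bayesian approach: a Beta-Binomial model that answers "what is the probability that B's true rate beats A's?" The counts, the flat Beta(1, 1) prior, the fixed seed, and the Monte Carlo method are all illustrative assumptions:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, alpha=1.0, beta=1.0, draws=20_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under a Beta-Binomial model.

    alpha and beta are the shared Beta prior's parameters;
    Beta(1, 1) is the flat, uninformative prior.
    """
    wins = 0
    for _ in range(draws):
        a = random.betavariate(alpha + conv_a, beta + n_a - conv_a)
        b = random.betavariate(alpha + conv_b, beta + n_b - conv_b)
        if b > a:
            wins += 1
    return wins / draws

# Same illustrative counts: 100/1,000 for A vs 110/1,000 for B.
prob = prob_b_beats_a(100, 1000, 110, 1000)
print(f"P(B beats A) ≈ {prob:.2f}")  # around 0.77 with a flat prior
```

Note how the two framings answer different questions with the same data: the z-test asks "how surprising is this difference under no effect?", while the Bayesian model directly states a probability that B is better.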
It Doesn't Matter
Here's the key point many debates ignore: for high-traffic A/B tests allowed to run to completion, frequentist and Bayesian statistics often lead to the same conclusion.
While they calculate probabilities differently, as the sample size grows large, random variability shrinks and the data overwhelm any reasonable prior. By the law of large numbers, frequentist and Bayesian methods converge on similar estimates of the underlying rates.
For example, say we have a test where the frequentist confidence interval is 5.0% +/- 1.0%. The Bayesian credible interval with weak priors might be 5.1% +/- 0.9%. Such minor differences have no practical impact on decision making.
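A rough sketch of this convergence, with illustrative numbers (100,000 visitors, a 5% conversion rate, and a weak Beta(2, 38) prior chosen to have a 5% mean):

```python
from math import sqrt

n, conversions = 100_000, 5_000  # a large, completed test

# Frequentist: 95% Wald confidence interval.
p_hat = conversions / n
freq_hw = 1.96 * sqrt(p_hat * (1 - p_hat) / n)

# Bayesian: weak Beta(2, 38) prior (prior mean 5%), with a normal
# approximation to the Beta posterior's 95% credible interval.
a, b = 2 + conversions, 38 + (n - conversions)
post_mean = a / (a + b)
bayes_hw = 1.96 * sqrt(post_mean * (1 - post_mean) / (a + b + 1))

print(f"frequentist: {p_hat:.4f} ± {freq_hw:.4f}")
print(f"bayesian:    {post_mean:.4f} ± {bayes_hw:.4f}")
```

At this sample size the two intervals agree to four decimal places; the 40 pseudo-observations in the prior are swamped by 100,000 real ones.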
So especially if you follow best practices like setting appropriate power targets and allowing tests to reach statistical significance, the choice of statistical philosophy doesn't greatly affect the outcome.
When Does It Matter?
However, there are a few A/B testing scenarios where the frequentist vs Bayesian choice is more consequential:
- Very low traffic - with only a few hundred visitors, priors' influence is much greater
- Attempting to end tests early - fewer data points means priors sway results more
- Niche segments - subgroup analyses have small samples, so results are noisier and priors carry more weight
- Radical changes - a large observed jump can conflict with priors built from past incremental tests
- Multiple testing - running many quick tests in sequence compounds any bias introduced by the priors
In these cases, the right statistical school depends on your perspective and risk tolerance. But for most standard, high-traffic A/B tests, either philosophy leads to materially similar results.
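A small sketch of the low-traffic case, with hypothetical numbers: 3/20 vs 6/20 conversions, analyzed once with a flat Beta(1, 1) prior and once with a skeptical Beta(10, 90) prior (prior mean 10%). With so little data, the prior visibly moves the answer:

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, alpha, beta, draws=20_000):
    """P(rate_B > rate_A) under a shared Beta(alpha, beta) prior."""
    wins = sum(
        random.betavariate(alpha + conv_b, beta + n_b - conv_b)
        > random.betavariate(alpha + conv_a, beta + n_a - conv_a)
        for _ in range(draws)
    )
    return wins / draws

# Hypothetical tiny test: 3/20 conversions for A vs 6/20 for B.
flat = prob_b_beats_a(3, 20, 6, 20, alpha=1, beta=1)
skeptical = prob_b_beats_a(3, 20, 6, 20, alpha=10, beta=90)
print(f"flat prior: {flat:.2f}, skeptical prior: {skeptical:.2f}")
# The flat prior is noticeably more confident that B wins.
```

With only 40 visitors, the skeptical prior pulls both estimates toward 10% and shrinks the apparent gap; the same prior would have been irrelevant at 40,000 visitors.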
What's More Important?
Rather than obsessing over frequentist vs Bayesian, most A/B testing practitioners should focus on other aspects:
- Crafting a high-quality test design: clear hypothesis, key metrics, big enough sample, etc.
- Implementing changes correctly: no confounding factors, proper randomization, etc.
- Analyzing the impact over time: learning effects, seasonality, post-test performance, etc.
- Combining A/B testing with other data: surveys, multi-touch attribution, observational data, etc.
- Building a testing roadmap: priority questions, key segments, power analyses, etc.
- Interpreting results thoughtfully: within context, acknowledging limitations, etc.
- Maintaining website performance: page speed and stability during the testing period, variant load overhead, etc.
These factors often have a much bigger influence on A/B testing success than the core statistical approach taken.
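As one example of the "big enough sample" point above, here's a sketch of a standard sample-size calculation for a two-proportion test using the normal approximation; the 5% baseline and 10% relative lift are hypothetical inputs:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(base_rate, rel_lift, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p1 = base_rate
    p2 = base_rate * (1 + rel_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical: 5% baseline conversion, aiming to detect a 10% relative lift.
n = sample_size_per_arm(0.05, 0.10)
print(n)  # roughly 31,000 visitors per arm
```

Running this kind of calculation before launch does more to protect a test's conclusions than the choice of statistical school does.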
In summary, here is my take as someone who leans frequentist:
- For standard A/B tests, either statistical philosophy often leads to similar results
- Frequentist has advantages in simplicity, conservatism, and detecting long-term changes
- Bayesian offers benefits like faster learning and expressing uncertainty
- But other aspects of testing have far more impact on success
So don't get too hung up on being a devoted frequentist or Bayesian. Take a pragmatic approach, use the techniques that make sense for each situation, and focus on all the other areas that create great A/B testing practices.
Key Takeaways
- Frequentist vs Bayesian is endlessly debated but often inconsequential in practice
- For typical A/B tests, assuming a fixed truth estimated by data is reasonable
- Frequentist has strengths like simplicity, objectivity and conservatism
- Bayesian offers the ability to incorporate priors and quantify uncertainty
- But for high-traffic tests, the results usually converge over time
- Choice matters more for niche segments, early stopping, or radical changes
- Many other test design factors have more impact on A/B testing success
So take a balanced approach and don't overindex on statistics philosophy. There are bigger fish to fry in creating an excellent A/B testing program.