A/B Test Significance Calculator: Bayesian & Chi-Squared

Results assume a properly randomized, fixed-horizon experiment. Decide at a pre-committed sample size; repeatedly checking and stopping when it looks significant inflates false positives.

What this A/B test calculator does

An A/B test (or split test) shows two versions of something (a page, an email, a button) to two randomly assigned groups, then measures which performs better. The hard part isn’t collecting the numbers; it’s deciding whether the difference you see is real or just luck. This tool turns your raw counts into a clear, defensible decision.

It works for two kinds of metric. Pick Conversion rate when the outcome is yes/no (signups, purchases, clicks) and enter the visitors and conversions for each group. Pick Continuous mean when the outcome is a number per user (revenue, order value, time on page) and enter the sample size, average and standard deviation.

How to read the results

P(B beats A): the probability that variant B is genuinely better than the control, given your data. This is the number to act on. A common bar is 95%.
Expected uplift: how much better B is likely to be, with a 95% credible interval showing the plausible range. If the interval includes zero, the data can’t rule out that B is no better than A, or even worse.
Expected loss: if you ship the apparent winner and it turns out you were wrong, how much you’d lose on average. It puts a price on the risk, so you can ship when that risk is acceptably small.
Significance cross-check: the classic p-value from a two-proportion z-test (equivalent to a chi-squared test on a 2×2 table without continuity correction) or Welch’s t-test for means. Below 0.05 is the usual “statistically significant” threshold.
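
As a rough illustration of how the first three numbers can be derived for a conversion test, here is a minimal Python sketch. It assumes a uniform Beta(1, 1) prior, measures uplift as the absolute difference in conversion rate, and computes expected loss for the case where you ship B. The function name `bayes_ab`, the draw count and the seed are hypothetical choices for this example, not the tool’s actual settings.

```python
import random

def bayes_ab(visitors_a, conv_a, visitors_b, conv_b, draws=20000, seed=42):
    """Monte Carlo summary of a two-variant conversion test.

    Assumption: uniform Beta(1, 1) prior on each conversion rate.
    """
    rng = random.Random(seed)  # fixed seed, so repeated calls agree
    wins_b = 0
    uplifts = []
    loss_if_ship_b = 0.0
    for _ in range(draws):
        # Posterior of each rate: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + visitors_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + visitors_b - conv_b)
        wins_b += rate_b > rate_a
        uplifts.append(rate_b - rate_a)
        # Loss is only incurred in draws where A is actually better
        loss_if_ship_b += max(rate_a - rate_b, 0.0)
    uplifts.sort()
    return {
        "p_b_beats_a": wins_b / draws,
        "expected_uplift": sum(uplifts) / draws,  # absolute rate difference
        "ci95": (uplifts[int(0.025 * draws)], uplifts[int(0.975 * draws)]),
        "expected_loss_b": loss_if_ship_b / draws,
    }

# Example: 500/10,000 conversions for A versus 570/10,000 for B
result = bayes_ab(10000, 500, 10000, 570)
```

Read `result["p_b_beats_a"]` against your chosen bar (for example 95%) and check that `result["expected_loss_b"]` is small enough for you to accept.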

Testing more than two variants (A/B/n)

Add variants with + Add variant and the tool switches to multi-variant mode. Instead of a single “B beats A”, each variant gets a probability of being the best and an expected loss; you ship the leader once its expected loss is small enough. Because comparing many variants at once raises the odds of a fluke “winner”, the significance cross-check first runs an omnibus test (chi-squared or ANOVA: is anything different?) and then applies a Bonferroni correction to each variant-vs-control comparison. The planner increases the required sample size to account for the extra comparisons.
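
The multi-variant numbers can be sketched the same way. The Python below is an illustration only: it assumes a uniform Beta(1, 1) prior, defines each variant’s loss as its shortfall against the best variant in each draw, and divides the significance level by the number of variant-vs-control comparisons. The names `prob_best` and `bonferroni_alpha` are hypothetical.

```python
import random

def prob_best(variants, draws=20000, seed=7):
    """variants maps a name to (visitors, conversions).

    Returns two dicts: P(variant is best) and expected loss per variant.
    Assumption: uniform Beta(1, 1) prior on every conversion rate.
    """
    rng = random.Random(seed)
    names = list(variants)
    best_counts = dict.fromkeys(names, 0)
    loss = dict.fromkeys(names, 0.0)
    for _ in range(draws):
        sample = {n: rng.betavariate(1 + c, 1 + v - c)
                  for n, (v, c) in variants.items()}
        winner = max(sample, key=sample.get)
        best_counts[winner] += 1
        top = sample[winner]
        for n in names:
            loss[n] += top - sample[n]  # zero for the winner of this draw
    p_best = {n: best_counts[n] / draws for n in names}
    exp_loss = {n: loss[n] / draws for n in names}
    return p_best, exp_loss

def bonferroni_alpha(alpha, n_variants):
    """Per-comparison threshold for the n_variants - 1 tests against control."""
    return alpha / (n_variants - 1)

p_best, exp_loss = prob_best({"A": (5000, 250), "B": (5000, 260), "C": (5000, 310)})
threshold = bonferroni_alpha(0.05, 3)  # two comparisons: B vs A and C vs A
```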

Bayesian vs. chi-squared: why we show both

A chi-squared or t-test answers one narrow question: “assuming there were no real difference, how surprising is this data?” That p-value doesn’t tell you which version is better, by how much, or how confident to be in your choice. The Bayesian approach answers the question you actually have: “what’s the chance B is better, and what do I risk by switching?” We lead with it and keep the familiar significance test alongside as a sanity check. When the two disagree, the tool says so: treat that as borderline and collect more data.
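
For reference, the frequentist side for conversion data can be sketched in a few lines of Python. This is a textbook pooled two-proportion z-test next to an uncorrected chi-squared test on the same 2×2 table, written to show that the two give the same p-value; the function names are hypothetical and the code is not taken from the tool.

```python
from math import erfc, sqrt
from statistics import NormalDist

def z_test_p_value(visitors_a, conv_a, visitors_b, conv_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (conv_b / visitors_b - conv_a / visitors_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def chi_squared_p_value(visitors_a, conv_a, visitors_b, conv_b):
    """P-value of a chi-squared test on the 2x2 table, no continuity correction."""
    total = visitors_a + visitors_b
    col_totals = (conv_a + conv_b, total - conv_a - conv_b)
    rows = (
        (visitors_a, (conv_a, visitors_a - conv_a)),
        (visitors_b, (conv_b, visitors_b - conv_b)),
    )
    chi2 = 0.0
    for row_total, observed in rows:
        for obs, col_total in zip(observed, col_totals):
            expected = row_total * col_total / total
            chi2 += (obs - expected) ** 2 / expected
    # Survival function of a chi-squared distribution with 1 degree of freedom
    return erfc(sqrt(chi2 / 2))

p_z = z_test_p_value(10000, 500, 10000, 570)
p_chi = chi_squared_p_value(10000, 500, 10000, 570)
```

Both values say how surprising the data would be if A and B truly converted at the same rate; neither is the probability that B is better.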

Plan the test first (sample size)

The single biggest mistake in A/B testing is peeking: watching the numbers and stopping the moment they look significant. Do that and you “find” winners that are pure noise. The fix is to decide your sample size before you start. The Plan tab does this from four inputs (your baseline rate or mean, the smallest improvement worth detecting, your confidence level, and statistical power) and tells you how many visitors per group you need. Run to that number, then decide once.
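
To give a feel for the arithmetic, here is a minimal Python sketch of a common textbook approximation for a two-sided test on two proportions. It assumes the smallest improvement is entered as an absolute change in conversion rate and that groups are equal in size; the Plan tab may use a different formula, and the name `sample_size_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline, mde_abs, confidence=0.95, power=0.80):
    """Approximate visitors per group needed to detect baseline -> baseline + mde_abs."""
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    p1 = baseline
    p2 = baseline + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)

# 5% baseline, aiming to detect a lift to 6% or to 5.5%
n_one_point = sample_size_per_group(0.05, 0.01)
n_half_point = sample_size_per_group(0.05, 0.005)
```

Because the effect size is squared in the denominator, halving the detectable lift roughly quadruples the traffic you need.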

Built-in guardrails

After analysis the tool runs a sample-ratio-mismatch (SRM) check. If you intended a 50/50 split but the groups came in noticeably uneven, something in your setup is likely broken (bad randomization, a redirect, a tracking bug) and the result can’t be trusted, so the tool flags the mismatch. It also reminds you to reach your pre-committed sample size before calling a winner.
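
An SRM check is usually a chi-squared goodness-of-fit test on the group sizes. The Python sketch below shows one way to do it for two groups; the function name `srm_p_value` is hypothetical, and the cut-off of 0.001 used in the comments is a commonly used convention assumed for this example, not a value taken from the tool.

```python
from math import erfc, sqrt

def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """P-value that the observed split is consistent with the intended one."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # Survival function of a chi-squared distribution with 1 degree of freedom
    return erfc(sqrt(chi2 / 2))

small = srm_p_value(52, 48)      # 52/48 on 100 visitors: easily chance
large = srm_p_value(5200, 4800)  # 52/48 on 10,000 visitors: far below 0.001
```

The same 52/48 split is unremarkable on 100 visitors and a strong warning sign on 10,000, which is why the check depends on traffic volume and not just on the ratio.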

Privacy & method

Everything runs in your browser: your numbers never leave your device and there are no ads or trackers. Probabilities and credible intervals are computed by Monte Carlo simulation with a fixed seed, so the same inputs always give the same answer. Conversion metrics use a Beta-Binomial model; continuous metrics use a Normal / Student-t model.
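
For continuous metrics, a simplified version of the calculation can be written in closed form. The Python sketch below swaps in a plain Normal approximation for the difference in means (reasonable for large samples with a flat prior) in place of the Student-t model and Monte Carlo draws described above, so treat it as an illustration under those assumptions. The name `prob_b_beats_a_means` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_b_beats_a_means(n_a, mean_a, sd_a, n_b, mean_b, sd_b):
    """Approximate P(mean of B > mean of A) and a 95% interval for the difference."""
    diff = mean_b - mean_a
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)  # unpooled standard error
    posterior = NormalDist(mu=diff, sigma=se)
    p_b_better = 1 - posterior.cdf(0.0)
    interval = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))
    return p_b_better, interval

# Example: average order value of 50 versus 52, standard deviation 30, 2,000 users each
p_better, interval = prob_b_beats_a_means(2000, 50.0, 30.0, 2000, 52.0, 30.0)
```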

Frequently asked questions

How many visitors do I need for an A/B test?

It depends on your baseline rate and the smallest effect you care about: smaller effects need far more traffic. Use the Plan tab: enter your baseline, minimum detectable effect, confidence and power, and it returns the required sample size per group.

Is a Bayesian A/B test better than chi-squared?

For making a decision, yes: it directly gives the probability one variant beats the other and the expected cost of being wrong, instead of a p-value about a hypothetical no-difference world. Chi-squared isn’t wrong; it just answers a narrower question. This tool shows both.

What is a good probability to call a winner?

A common bar is 95% probability that B beats A, combined with a small expected loss. If you’re only at, say, 80–90%, the result is promising but not yet conclusive, so keep the test running.

Why shouldn’t I stop the test as soon as it’s significant?

Because if you check repeatedly and stop on the first significant reading, you dramatically inflate your false-positive rate, so you’ll declare winners that are really noise. Commit to a sample size up front and evaluate once you reach it.
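
You can see the inflation for yourself with a small simulation. The Python sketch below runs A/A experiments, where both groups share the same true conversion rate and every “significant” result is therefore a false positive, and compares stopping at any significant look against testing once at the end. The number of experiments, looks, batch size, conversion rate and seed are arbitrary values chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_p_value(n, conv_a, conv_b):
    """Two-sided pooled z-test p-value for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation yet, nothing to test
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_peeking(experiments=400, looks=10, batch=100, rate=0.10,
                     alpha=0.05, seed=1):
    """Return (false-positive rate with peeking, false-positive rate with one final test)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        significant_at_any_look = False
        for look in range(1, looks + 1):
            # Both groups convert at the same true rate: there is no real winner
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            p = z_p_value(look * batch, conv_a, conv_b)
            if p < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        fixed_hits += p < alpha  # p now holds the final-look value
    return peeking_hits / experiments, fixed_hits / experiments

peeking_rate, fixed_rate = simulate_peeking()
```

The fixed-horizon rate should sit near the nominal 5%, while the peeking rate comes out noticeably higher and keeps growing as you add more looks.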

What does sample ratio mismatch (SRM) mean?

It means your traffic didn’t split the way you intended (e.g. 52/48 when you set 50/50, on enough traffic that chance can’t explain it). That usually signals a broken experiment, so investigate before trusting the numbers.

Can I test more than two variants at once?

Yes. Add A/B/C/D… and the tool reports each variant’s probability of being best and its expected loss, runs an omnibus test for any overall difference, and Bonferroni-corrects the per-variant comparisons so testing many options doesn’t produce false winners.