Bayesian decision metrics with a frequentist cross-check, not just a p-value.
Results assume a properly randomized, fixed-horizon experiment. Decide at a pre-committed sample size; repeatedly checking the results and stopping as soon as they look significant inflates the false-positive rate.
An A/B test (or split test) shows two versions of something (a page, an email, a button) to two random groups, then measures which performs better. The hard part isn’t collecting the numbers; it’s deciding whether the difference you see is real or just luck. This tool turns your raw counts into a clear, defensible decision.
It works for two kinds of metric. Pick Conversion rate when the outcome is yes/no (signups, purchases, clicks) and enter the visitors and conversions for each group. Pick Continuous mean when the outcome is a number per user (revenue, order value, time on page) and enter the sample size, average and standard deviation.
Add variants with + Add variant and the tool switches to multi-variant mode. Instead of a single “B beats A”, each variant gets a probability of being the best and an expected loss; you ship the leader once its expected loss is small enough. Because comparing many variants at once raises the odds of a fluke “winner”, the significance cross-check first runs an omnibus test (chi-squared or ANOVA: is anything different?) and then applies a Bonferroni correction to each variant-vs-control comparison. The planner raises the required sample size to account for the extra comparisons.
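As a rough illustration of the multi-variant metrics described above, here is a minimal Python sketch that estimates each variant’s probability of being best and its expected loss for a conversion metric. It is not the tool’s own code: the function name, the uniform Beta(1, 1) prior, the number of draws and the seed are all assumptions made for the example.

```python
import random


def prob_best_and_expected_loss(variants, draws=20000, seed=42):
    """variants maps a name to (visitors, conversions).

    Returns {name: (probability_of_being_best, expected_loss)}.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    best_count = {name: 0 for name in variants}
    loss_sum = {name: 0.0 for name in variants}
    for _ in range(draws):
        # One joint draw of every variant's conversion rate from its
        # Beta posterior (the uniform Beta(1, 1) prior is an assumption).
        sample = {
            name: rng.betavariate(1 + conv, 1 + visitors - conv)
            for name, (visitors, conv) in variants.items()
        }
        winner = max(sample, key=sample.get)
        best_count[winner] += 1
        for name, rate in sample.items():
            # Loss of shipping `name`: how far it trails the best rate in this draw.
            loss_sum[name] += sample[winner] - rate
    return {
        name: (best_count[name] / draws, loss_sum[name] / draws)
        for name in variants
    }


# Hypothetical counts for three variants.
result = prob_best_and_expected_loss(
    {"A": (1000, 100), "B": (1000, 110), "C": (1000, 140)}
)
leader = max(result, key=lambda name: result[name][0])
```

The probabilities of being best sum to one across variants, and the expected loss is in the same units as the conversion rate, so a "small enough" loss can be judged against the smallest lift you would care about.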
A chi-squared or t-test answers one narrow question: “assuming there were no real difference, how surprising is this data?” That p-value doesn’t tell you which version is better, by how much, or how confident to be in your choice. The Bayesian approach answers the question you actually have, “what’s the chance B is better, and what do I risk by switching?”, so we lead with it and keep the familiar significance test alongside as a sanity check. When the two disagree, the tool says so: treat that as borderline and collect more data.
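To make the contrast concrete, the sketch below computes both answers for the same hypothetical counts: a Monte Carlo estimate of the chance that B beats A (with the expected loss of switching to B), and a two-sided p-value from a pooled two-proportion z-test, which for two groups matches the chi-squared test without continuity correction. The function names, the uniform prior and the example counts are assumptions, not the tool’s internals.

```python
import math
import random


def prob_b_beats_a(visitors_a, conv_a, visitors_b, conv_b, draws=20000, seed=42):
    """Return (P(B > A), expected loss of shipping B) by Monte Carlo."""
    rng = random.Random(seed)
    wins = 0
    loss_b = 0.0
    for _ in range(draws):
        # Posterior draws under an assumed uniform Beta(1, 1) prior.
        a = rng.betavariate(1 + conv_a, 1 + visitors_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + visitors_b - conv_b)
        wins += b > a
        loss_b += max(a - b, 0.0)  # what you give up when A was actually better
    return wins / draws, loss_b / draws


def two_proportion_p_value(visitors_a, conv_a, visitors_b, conv_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (conv_b / visitors_b - conv_a / visitors_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Hypothetical counts: A converts 100 of 1000, B converts 130 of 1000.
p_better, expected_loss = prob_b_beats_a(1000, 100, 1000, 130)
p_value = two_proportion_p_value(1000, 100, 1000, 130)
```

The first pair of numbers speaks to the decision (how likely B is better, and the cost if it is not); the p-value only says how surprising the data would be if there were no difference.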
The single biggest mistake in A/B testing is peeking: watching the numbers and stopping the moment they look significant. Do that and you “find” winners that are pure noise. The fix is to decide your sample size before you start. The Plan tab does this from four inputs (your baseline rate or mean, the smallest improvement worth detecting, your confidence level, and statistical power) and tells you how many visitors per group you need. Run to that number, then decide once.
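For a conversion metric, the calculation behind those four inputs can be sketched with the standard normal-approximation formula for comparing two proportions. This is an illustrative version, not necessarily the exact formula the Plan tab uses; the function name, the absolute (rather than relative) lift, and the `comparisons` parameter for Bonferroni-adjusted multi-variant planning are assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, min_lift, alpha=0.05, power=0.80, comparisons=1):
    """Visitors needed per group to detect an absolute lift of `min_lift`.

    alpha is 1 minus the confidence level; `comparisons` divides alpha
    Bonferroni-style when several variants are compared with control.
    """
    p1, p2 = baseline, baseline + min_lift
    z_alpha = NormalDist().inv_cdf(1 - (alpha / comparisons) / 2)  # two-sided
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / min_lift ** 2)


# Hypothetical plan: 10% baseline, detect a lift to 11%, 95% confidence, 80% power.
n_two_groups = sample_size_per_group(0.10, 0.01)
# Halving the detectable lift roughly quadruples the requirement.
n_smaller_effect = sample_size_per_group(0.10, 0.005)
# Three variants against control: a stricter per-comparison alpha needs more traffic.
n_three_variants = sample_size_per_group(0.10, 0.01, comparisons=3)
```

With these example inputs the requirement is on the order of fifteen thousand visitors per group, which shows why small lifts on low baseline rates take so long to confirm.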
After analysis the tool runs a sample-ratio-mismatch (SRM) check. If you intended a 50/50 split but the groups came in noticeably uneven, something in your setup is likely broken (bad randomization, a redirect, a tracking bug) and the result can’t be trusted, so it flags it. It also reminds you to reach your pre-committed sample size before calling a winner.
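An SRM check of this kind is commonly done as a chi-squared goodness-of-fit test on the group sizes. The sketch below covers the two-group case using only the standard library; the function name and the 0.001 alert threshold are assumptions for illustration, not the tool’s documented settings.

```python
import math


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """p-value for the observed split against the intended split (two groups)."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # Survival function of chi-squared with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert level; a strict threshold keeps false alarms rare

# The same 52/48 split at two traffic levels (hypothetical counts).
small_test = srm_p_value(520, 480)    # plausible by chance
large_test = srm_p_value(5200, 4800)  # very unlikely by chance
srm_flagged = large_test < SRM_THRESHOLD
```

The same percentage imbalance is harmless on a thousand visitors and alarming on ten thousand, which is why the check looks at counts rather than at the ratio alone.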
Everything runs in your browser: your numbers never leave your device and there are no ads or trackers. Probabilities and credible intervals are computed by Monte Carlo simulation with a fixed seed, so the same inputs always give the same answer. Conversion metrics use a Beta-Binomial model; continuous metrics use a Normal / Student-t model.
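The effect of the fixed seed can be shown with a small sketch for a continuous metric. It approximates each group’s posterior mean as Normal with standard deviation sd / sqrt(n), a large-sample simplification of the Normal / Student-t model mentioned above; the function name, seed and draw count are assumptions.

```python
import math
import random


def prob_b_beats_a_continuous(n_a, mean_a, sd_a, n_b, mean_b, sd_b,
                              draws=20000, seed=42):
    """Return (P(mean_B > mean_A), 95% credible interval for the difference)."""
    rng = random.Random(seed)  # seeded generator: same inputs, same output
    se_a = sd_a / math.sqrt(n_a)
    se_b = sd_b / math.sqrt(n_b)
    diffs = []
    for _ in range(draws):
        # Draw each group's mean from its approximate posterior, then difference.
        diffs.append(rng.gauss(mean_b, se_b) - rng.gauss(mean_a, se_a))
    diffs.sort()
    prob = sum(d > 0 for d in diffs) / draws
    lower = diffs[int(0.025 * draws)]
    upper = diffs[int(0.975 * draws) - 1]
    return prob, (lower, upper)


# Hypothetical order values: A averages 50, B averages 53, both with sd 20, n = 500.
first_run = prob_b_beats_a_continuous(500, 50.0, 20.0, 500, 53.0, 20.0)
second_run = prob_b_beats_a_continuous(500, 50.0, 20.0, 500, 53.0, 20.0)
identical = first_run == second_run
```

Because the generator is seeded inside the function, re-running the analysis on the same inputs reproduces the same probability and interval exactly, so a result can be shared and rechecked.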
The sample size you need depends on your baseline rate and the smallest effect you care about: smaller effects need far more traffic. In the Plan tab, enter your baseline, minimum detectable effect, confidence and power, and it returns the required sample size per group.
For making a decision, yes: the Bayesian analysis directly gives the probability that one variant beats the other and the expected cost of being wrong, instead of a p-value about a hypothetical no-difference world. Chi-squared isn’t wrong; it just answers a narrower question. This tool shows both.
A common bar is 95% probability that B beats A, combined with a small expected loss. If you’re only at, say, 80–90%, the result is promising but not yet conclusive, so keep the test running.
Because checking repeatedly and stopping at the first significant reading dramatically inflates your false-positive rate, so you’ll declare winners that are really noise. Commit to a sample size up front and evaluate once you reach it.
A sample-ratio mismatch means your traffic didn’t split the way you intended (e.g. 52/48 when you set 50/50, on enough traffic that chance is a very unlikely explanation). That usually signals a broken experiment, so investigate before trusting the numbers.
Yes. Add A/B/C/D… and the tool reports each variant’s probability of being best and its expected loss, runs an omnibus test for any overall difference, and Bonferroni-corrects the per-variant comparisons so testing many options doesn’t produce false winners.
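The Bonferroni step can be sketched as follows: each variant is compared with control, and the significance threshold is divided by the number of comparisons. The omnibus test is left out for brevity. The function names, the pooled z-test used for each comparison, and the example counts are assumptions rather than the tool’s actual implementation.

```python
import math


def z_test_p_value(control, variant):
    """Two-sided pooled two-proportion z-test; inputs are (visitors, conversions)."""
    (n_c, x_c), (n_v, x_v) = control, variant
    pooled = (x_c + x_v) / (n_c + n_v)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    z = (x_v / n_v - x_c / n_c) / se
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_vs_control(control, variants, alpha=0.05):
    """Return (per-comparison threshold, {name: (p_value, significant)})."""
    threshold = alpha / len(variants)  # split alpha across the comparisons
    results = {}
    for name, counts in variants.items():
        p = z_test_p_value(control, counts)
        results[name] = (p, p < threshold)
    return threshold, results


# Hypothetical test: control A against variants B and C.
threshold, results = bonferroni_vs_control(
    control=(1000, 100),
    variants={"B": (1000, 130), "C": (1000, 150)},
)
```

In this example B would pass an uncorrected 0.05 threshold but not the corrected one, while C clears both, which is the kind of fluke "winner" the correction is meant to filter out.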