A/B test sample size calculation is the step most CRO teams skip — and it is why so many split tests get called as winners on day four and quietly underperform once shipped. This guide walks through baseline conversion rate, minimum detectable effect, alpha, statistical power and the two-proportion formula, then runs the math against a real funnel scenario so you can plan tests that reach statistical significance instead of guessing.
Most A/B Tests Never Reach Statistical Significance
The pattern is the same across most teams. A new variant goes live on Monday. By Thursday it is up 12% on conversion rate. Someone screenshots the dashboard, declares a winner, ships the change to 100% of traffic and moves on. Three weeks later the overall conversion rate is exactly where it started or slightly lower, and nobody can explain why.
The reason is that the split test never had the statistical power to detect anything real in the first place. The 12% lift on Thursday was noise, not statistical significance. Run the same coin flip a thousand times and you will hit a four-day stretch where heads beats tails by a wide margin. That stretch does not mean the coin is biased.
This is what conversion rate optimisation actually requires before you touch a page: an A/B test sample size calculation. Four numbers go in — baseline conversion rate, minimum detectable effect, alpha (the statistical significance threshold) and power — and one number comes out: how many visitors per variant the test needs before the result means anything. Most teams skip this entirely. The ones that run it through a sample size calculator find out their planned test would need three months of traffic to detect the lift they were hoping for, and they go back and pick a more aggressive change.
This guide walks through each of the four inputs, then shows the formula and a worked scenario using a real funnel. By the end you will be able to plan a split test, calculate the sample size, convert it into a duration and decide whether the test is worth running at all.
The Scenario We Will Use Throughout
Here is the funnel we are going to test against. Last 30 days, a B2C SaaS subscription product running paid acquisition to one landing page:
| Step | Volume (30d) | Rate |
|---|---|---|
| Landing page sessions | 180,000 | — |
| Reached pricing | 54,000 | 30.0% of LP |
| Reached signup form | 7,560 | 14.0% of pricing |
| Completed signup | 3,250 | 43.0% of signup · 1.8% of LP |
| Activated within 7 days | 780 | 24.0% of signups |
Segment splits matter and we will come back to them: mobile is 58% of sessions at 1.3% LP-to-signup, desktop is 42% at 2.5%. Paid is 65% at 1.5%, organic is 35% at 2.4%. Average revenue per paying user is around $79/month and recurring revenue added in the last 30 days is roughly $257K.
Eyeballing the funnel, the biggest leak is pricing-to-signup at 14.0%. Out of 54,000 people who reached pricing, only 7,560 clicked through to start signup. That is the step worth testing first because the absolute volume of people lost there is the largest of any single step in the funnel. Our hypothesis: a simplified pricing page lifts pricing-to-signup from 14.0% to 15.4%, a 10% relative lift.
Step 1: Calculate the Right Baseline
Baseline is the current conversion rate of the step you are testing. The most common mistake here is using the end-to-end funnel rate as the baseline for a single-step test. If you do that for the pricing page test above, you would use 1.8% (LP to completed signup) instead of 14.0% (pricing to signup). For the same 10% relative MDE, the required sample size jumps from roughly ten thousand per variant to nearly ninety thousand, because tiny baselines need huge samples to detect the same relative lift.
The rule is: baseline is the conversion rate of the specific transition that the change can influence. A pricing page change cannot make people land on the site, so LP traffic does not belong in the denominator. A pricing page change can move people from pricing to signup, so pricing sessions are the denominator and signup-form entries are the numerator.
The baseline should be measured over a stable recent window, typically 28 to 90 days, with any anomalies excluded. A promotional week or a tracking outage will skew the baseline and make every downstream calculation wrong. For our scenario, baseline is p₁ = 14.0%, measured over the last 30 days, with no campaigns or outages distorting the period.
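If you have the funnel counts exported, the check takes a few lines. A minimal sketch in Python using the scenario's 30-day figures — the variable names are illustrative, not from any particular analytics API:

```python
# Compute the baseline for the step the change can actually influence,
# and contrast it with the (wrong) end-to-end rate.
pricing_sessions = 54_000
signup_form_entries = 7_560
landing_page_sessions = 180_000
completed_signups = 3_250

step_baseline = signup_form_entries / pricing_sessions        # 14.0% — correct for a pricing-page test
end_to_end_rate = completed_signups / landing_page_sessions   # ~1.8% — wrong denominator for this test

print(f"pricing -> signup baseline: {step_baseline:.1%}")
print(f"LP -> completed signup:     {end_to_end_rate:.1%}")
```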
Step 2: Choose Your Minimum Detectable Effect (MDE)
MDE is the smallest lift you want the test to be able to detect. It is a planning input, not a measurement. You are saying in advance: if the variant produces at least this much improvement over the baseline, I want to be confident the test will catch it. If the variant produces a smaller improvement than that, I am willing to miss it.
MDE is usually expressed as a relative lift over the baseline, not an absolute one. A 10% relative MDE on a 14.0% baseline means you want to detect a new rate of 15.4% (14.0 × 1.10). A 5% relative MDE on the same baseline means you want to detect 14.7% (14.0 × 1.05).
The mathematical relationship between MDE and sample size is brutal. Halving the MDE roughly quadruples the sample size required. This is the reason "let us see if anything moves" tests almost never finish — they implicitly demand a tiny MDE and the sample size for that runs into months of traffic. In CRO the practical floor for MDE is usually 5% relative for high-traffic sites and 10 to 15% relative for everything else. Below 5% you are testing for changes that are smaller than the natural week-to-week variance of the metric and the test cannot tell signal from noise.
For our scenario, MDE = 10% relative. We want to detect p₂ = 15.4% or better.
Step 3: Set Alpha — Your Statistical Significance Threshold
Alpha (α) is the probability you are willing to accept of declaring a winner when there is actually no difference. This is a false positive — also called Type I error. Alpha is what people are referring to when they say a test "reached statistical significance" — a result is significant when its p-value is below α. The conventional value is α = 0.05, meaning you accept a 5% chance of being wrong in that direction. Lower alpha (0.01) means stricter evidence required and larger sample size. Higher alpha (0.10) is looser and rarely used outside very early-stage exploration.
Two-tailed vs one-tailed is a separate decision and one that most CRO posts get wrong by reflex. A two-tailed test asks "is the variant different from the control in either direction?" A one-tailed test asks "is the variant better than the control?" and treats "worse" as the same outcome as "no difference".
One-tailed tests cut the required sample size by roughly 20% which is why they are tempting. The problem is that they assume you genuinely do not care if the variant turns out worse — that you would ship it either way, or that the consequences of shipping a losing variant are zero. In real CRO neither of those is true. If your simplified pricing page actually drops signup entries by 8%, that is roughly $20,500 of monthly recurring revenue (MRR) you need to know about, not absorb silently. Use two-tailed unless you have a very specific reason not to.
For our scenario, α = 0.05 two-tailed. The corresponding critical value is Zα/2 = 1.96.
Step 4: Set Power — and Why 80% Is the Floor
Power is the probability of detecting a real effect of the size you specified, if it actually exists. Power = 1 − β, where β is the probability of a false negative (Type II error — missing a real winner).
The convention is power = 0.80, meaning if the true lift is exactly your MDE, the test will return a significant result 80% of the time. Power = 0.90 is stricter — roughly 30% more sample size — and is worth using when the cost of missing a real winner is high. Power below 70% means you are running tests that mostly cannot succeed even when the variant is genuinely better, which is the silent reason most CRO programs stall: a backlog of "inconclusive" results that were actually winners the test was never going to detect.
For our scenario, power = 0.80. The corresponding critical value is Zβ = 0.8416.
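If you would rather derive those two critical values than memorise them, the inverse of the standard normal CDF gives both — a quick sketch, assuming scipy is available:

```python
# Derive the critical values quoted above from alpha and power.
from scipy.stats import norm

alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed: 1.9600
z_beta = norm.ppf(power)            # 0.8416

print(f"Z_alpha/2 = {z_alpha:.4f}, Z_beta = {z_beta:.4f}")
```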
The Sample Size Formula for Two Proportions
For an A/B test comparing two conversion rates, the standard sample size per variant is:
n = ( Zα/2 · √(2·p̄·q̄) + Zβ · √(p₁·q₁ + p₂·q₂) )² / (p₁ − p₂)²
Where:
- p₁ = baseline conversion rate, q₁ = 1 − p₁
- p₂ = baseline × (1 + MDE), q₂ = 1 − p₂
- p̄ = (p₁ + p₂) / 2 (pooled rate), q̄ = 1 − p̄
- Zα/2 = 1.96 for α = 0.05 two-tailed
- Zβ = 0.8416 for power = 0.80
The formula gives sample size per variant. Total exposures needed is 2n for an A/B test, 3n for an A/B/C, and so on. The (p₁ − p₂)² in the denominator is the part that punishes small MDEs — halve the difference and the denominator shrinks fourfold, which means n grows fourfold.
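The formula translates directly into a short function. A sketch in Python — the function and argument names are ours, and it assumes scipy for the critical values:

```python
# Two-proportion sample size per variant, following the formula above.
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift over the baseline."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    q1, q2 = 1 - p1, 1 - p2
    p_bar = (p1 + p2) / 2
    q_bar = 1 - p_bar

    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed critical value
    z_beta = norm.ppf(power)

    numerator = (z_alpha * sqrt(2 * p_bar * q_bar)
                 + z_beta * sqrt(p1 * q1 + p2 * q2)) ** 2
    return ceil(numerator / (p1 - p2) ** 2)
```

Calling `sample_size_per_variant(0.14, 0.10)` should reproduce the ~10,042 figure from the worked calculation in the next section.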
Worked Calculation: The Pricing Page Test
Plugging the scenario into the formula:
- p₁ = 0.140, q₁ = 0.860
- p₂ = 0.140 × 1.10 = 0.154, q₂ = 0.846
- p̄ = (0.140 + 0.154) / 2 = 0.147, q̄ = 0.853
- p₁ − p₂ = −0.014, (p₁ − p₂)² = 0.000196
- p₁·q₁ = 0.1204, p₂·q₂ = 0.1303, sum = 0.2507
- 2·p̄·q̄ = 0.2508
Computing:
n = ( 1.96 · √0.2508 + 0.8416 · √0.2507 )² / 0.000196
n = ( 1.96 · 0.5008 + 0.8416 · 0.5007 )² / 0.000196
n = ( 0.9815 + 0.4214 )² / 0.000196
n = ( 1.4029 )² / 0.000196
n = 1.968 / 0.000196
n ≈ 10,042 per variant
So we need about 10,042 visitors who reach the pricing page on each variant — roughly 20,084 total pricing-page exposures across the two variants.
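If you want a second opinion on the arithmetic, statsmodels' power machinery lands in roughly the same place. It uses Cohen's h (an arcsine transform) rather than the pooled-variance formula above, so expect a figure within about a percent of 10,042 rather than an exact match — a quick sketch:

```python
# Cross-check with statsmodels (arcsine effect size, not the exact pooled formula).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p1, p2 = 0.140, 0.154
effect = proportion_effectsize(p2, p1)   # Cohen's h for the two rates
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.80, ratio=1.0,
                                 alternative="two-sided")
print(round(n))   # roughly 10,000 per variant, close to the hand calculation
```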
Convert Sample Size Into Test Duration
Sample size in isolation is just a number. What matters operationally is how long the test will take to reach it given the current traffic to the test surface.
The test surface here is the pricing page. From the funnel, that is 54,000 visitors per month, which is roughly 1,800 per day. Splitting traffic 50/50 between control and variant gives 900 per variant per day. Reaching 10,042 per variant takes 10,042 / 900 ≈ 11.2 days mathematically.
Mathematically. In practice you should always run for full weeks, with a 14-day floor, regardless of what the math says. Day-of-week effects are real and a test that ran Monday to Friday is not measuring the same population as a test that ran across two full weeks. Visitor behaviour, traffic mix and even page speed differ by day of week. Testing across one full Monday-to-Sunday cycle is the minimum that absorbs that variance; two full cycles is better.
So: math says 11.2 days, run it for 14. The extra days do not cost you anything and they protect you from shipping a result that was an artefact of two strong Wednesdays in a row.
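The duration conversion is simple enough to fold into the same script — a sketch that applies the full-week rounding and the 14-day floor described above (names are ours):

```python
from math import ceil

def test_duration_days(n_per_variant, daily_traffic, variants=2, min_days=14):
    """Days to reach the required sample, rounded up to full weeks, with a 14-day floor."""
    per_variant_per_day = daily_traffic / variants
    raw_days = n_per_variant / per_variant_per_day
    full_weeks = ceil(raw_days / 7) * 7
    return max(full_weeks, min_days)

print(test_duration_days(10_042, 1_800))   # -> 14 (11.2 days of math, run two full weeks)
```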
What Happens When You Reduce the MDE
To make the squared-relationship concrete, here is the same scenario at different MDEs:
| Relative MDE | Target rate (p₂) | Sample / variant | Test duration (50/50, 1,800/day) |
|---|---|---|---|
| 15% | 16.1% | ~4,550 | ~6 days math · run 14 |
| 10% | 15.4% | ~10,040 | ~12 days math · run 14 |
| 5% | 14.7% | ~39,380 | ~44 days · run 49 |
| 2% | 14.3% | ~246,300 | ~274 days — do not run this |
Two takeaways. First, halving the MDE roughly quadruples the sample size — the math is unforgiving. Second, very small MDEs are not testable on most pages. If you cannot articulate a hypothesis that you believe will lift the metric by at least 5 to 10% relative, the test is not worth designing because you do not have enough traffic to find the answer.
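To see the quadrupling for yourself, the table above can be regenerated by looping over the MDEs. This sketch reuses the `sample_size_per_variant` and `test_duration_days` functions sketched earlier, so treat the exact figures as approximate:

```python
# Reproduce the MDE table: sample size and run length at each relative MDE.
for mde in (0.15, 0.10, 0.05, 0.02):
    n = sample_size_per_variant(0.14, mde)
    days = test_duration_days(n, 1_800)
    print(f"{mde:>4.0%} MDE -> target {0.14 * (1 + mde):.1%}, "
          f"~{n:,} per variant, run {days} days")
```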
The Segment Trap: Mobile vs Desktop, Paid vs Organic
Looking at the scenario again: mobile is 58% of sessions at 1.3% LP-to-signup, desktop is 42% at 2.5%. That is roughly a 2x difference in conversion rate between devices. The instinct is to test only on desktop where the rate is higher, or only on mobile where the volume is.
The math problem with segmenting is that you are now testing on a smaller population. Desktop is 42% of pricing-page traffic, so roughly 756 desktop visitors per day. The 10,042-per-variant requirement now takes 27 days of mathematical sample, which means a 28-day test in practice (four full weeks) to absorb day-of-week effects. That can still be the right call — testing only on desktop avoids contaminating the result with a mobile experience that may not benefit from the same change — but you have to plan for the longer duration.
The bigger trap is making the segmenting decision after seeing the result. If you run a sitewide test, see that desktop won and mobile lost, and then declare the desktop result a winner, you have just inflated your false positive rate by an unknown amount. Pre-register which segment you are testing, run the test, and treat any post-hoc segment splits as hypothesis generation for the next test, not as decisions you ship from.
Where Most CRO Tests Go Wrong
The four mistakes that account for most false-positive shipped variants:
Peeking and stopping early. Looking at the dashboard daily and stopping the test when significance appears inflates your alpha from 5% to 20% or higher, depending on how often you peek. Either commit to the predetermined sample size and only check at the end, or use a sequential testing method (Bayesian, group sequential) designed to handle interim looks. Most CRO tools support both — pick one and follow its rules.
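If the inflation claim sounds abstract, a small A/A simulation makes it concrete: both arms share the identical true rate, yet checking for significance every day and stopping at the first p < 0.05 produces a false positive far more often than 5%. A rough Monte Carlo sketch — assumes numpy and scipy, and the exact rate you get depends on traffic and how many times you peek:

```python
# A/A simulation: same true rate in both arms, peek daily, stop at first p < 0.05.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_rate, daily_per_arm, days, sims = 0.14, 900, 14, 2_000
false_positives = 0

for _ in range(sims):
    a = rng.binomial(daily_per_arm, true_rate, size=days).cumsum()
    b = rng.binomial(daily_per_arm, true_rate, size=days).cumsum()
    n = daily_per_arm * np.arange(1, days + 1)
    p_pool = (a + b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (a / n - b / n) / se
    p_values = 2 * (1 - norm.cdf(np.abs(z)))
    if (p_values < 0.05).any():          # would have stopped early at some daily peek
        false_positives += 1

print(false_positives / sims)            # typically well above 0.05
```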
Novelty effects. A new variant often gets a temporary lift just because it is new to returning visitors. This decays over a few weeks. Tests shorter than two full weeks will overstate the lift of any visually distinctive change. If your variant is a major redesign rather than a small change, run for at least three weeks.
Multiple variants without correction. Running A/B/C/D on the same page with α = 0.05 per comparison gives you a combined false positive rate well above 5%. With three treatment variants tested against the control, the chance that at least one shows a fake lift is roughly 14%, and it climbs past 18% with four. Either reduce alpha per variant (Bonferroni: α / number of comparisons) or commit to head-to-head tests.
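The arithmetic behind both numbers is one line each — a quick sketch, treating each treatment arm as one comparison against the control:

```python
alpha, comparisons = 0.05, 3                      # A/B/C/D = three treatment arms vs control
family_wise = 1 - (1 - alpha) ** comparisons      # ~14.3% chance of at least one fake lift
bonferroni_alpha = alpha / comparisons            # ~0.0167 per-comparison threshold
print(f"{family_wise:.1%}, {bonferroni_alpha:.4f}")
```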
Ignoring the post-test funnel. A pricing page change might lift pricing-to-signup by 12% but drop signup-to-activation by 15% because the variant attracts more curious browsers who never actually use the product. Always measure the downstream step too. The metric you are optimising should be activated users per pricing-page visitor, not pricing-to-signup in isolation.
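The guardrail check is just two ratios multiplied together. A sketch using the hypothetical lift and drop from the example above — a 12% gain on the tested step and a 15% loss downstream nets out negative:

```python
# Net effect on activated users per pricing-page visitor when the tested step
# gains 12% but the downstream step loses 15% (hypothetical figures from above).
step_lift = 1.12          # pricing -> signup entries
downstream_change = 0.85  # signup -> activation
net = step_lift * downstream_change - 1
print(f"{net:+.1%}")      # -> -4.8%: the "winner" actually loses activations
```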
A/B Test Sample Size Calculators We Recommend
You do not need to run the formula by hand once you understand what it is doing. The A/B test sample size calculator on this site takes the same inputs walked through above and adds a test-duration estimator that uses your actual daily traffic — so you get sample size per variant and the number of days the split test will need to run, in one place.
If you want a second opinion or prefer a different interface, most third-party statistical significance calculators use the same underlying math, so the inputs above transfer directly.
The CRO Test Planning Checklist
Before any A/B test goes live, you should be able to answer all of the following in writing:
- What step in the funnel does this change affect? That step is your baseline denominator.
- What is the baseline rate over the last 28–90 days? Excluding any anomalies.
- What MDE are you committing to detect? Stated as a relative lift over baseline.
- What alpha and power? Default to 0.05 two-tailed and 0.80.
- What sample size per variant does that produce? From the formula or a calculator.
- How many days at current traffic? Then round up to the nearest full week, with a 14-day floor.
- Are you testing the whole population or a segment? Decide before, not after.
- What is the downstream metric? The step you are optimising should not cannibalise the next step.
If you cannot answer these eight questions before the test launches, the test is not designed yet. The cost of writing them down is fifteen minutes. The cost of not writing them down is six months of inconclusive results and shipped false positives that quietly drag the metric you were trying to improve. The teams that compound CRO wins are the ones that treat the planning step as the actual work and the test itself as the easy part. User behaviour data tells you what to test; the math in this post tells you whether the test you designed can actually answer the question.
Frequently Asked Questions
What is baseline conversion rate in an A/B test?
The current conversion rate of the specific step your change can influence, measured over a stable recent window (28–90 days) with anomalies excluded. For the pricing page test above, that is pricing sessions to signup-form entries (14.0%), not the end-to-end LP-to-signup rate.
What is Minimum Detectable Effect (MDE) in CRO testing?
The smallest relative lift over the baseline you want the test to be able to catch. It is a planning input chosen before the test launches, and halving it roughly quadruples the sample size required.
What does alpha 0.05 two-tailed actually mean?
You accept a 5% chance of declaring a winner when there is no real difference, and you test for movement in either direction, so a variant that is actually worse shows up as worse rather than being lumped in with "no difference".
What is statistical power and why does 80% matter?
Power is the probability of detecting a real effect of at least the MDE if one exists. At 80%, a true lift of the specified size produces a significant result four times out of five; below roughly 70%, most tests cannot succeed even when the variant is genuinely better.
How long should an A/B test run?
Until it reaches the calculated sample size per variant, rounded up to full weeks with a 14-day floor so day-of-week effects are absorbed.
Can I stop an A/B test early if a variant is clearly winning?
Not with a fixed-sample test. Peeking daily and stopping at the first significant reading inflates the false positive rate well above 5%. Either commit to the planned sample size or use a sequential method designed for interim looks.
