A/B Testing
A/B testing compares two versions of something — a webpage, email, pricing page, or product feature — by randomly showing each version to different user groups and measuring which performs better. It replaces opinions with evidence. The method is straightforward; the hard part is ensuring the data behind the test is trustworthy.
The approach works because randomization balances confounding variables across the two groups. If 10,000 users are randomly split between a blue button (control) and a green button (variant), and the green button produces a 12% higher click-through rate with statistical significance, you can confidently attribute the difference to the color change — not to time of day, user demographics, or anything else.
A/B testing splits traffic between two variants (A and B) and measures which drives better outcomes — conversion rates, revenue, engagement. Statistical significance (typically 95% confidence) ensures results are not due to chance. The practice requires clean, consistent data: if your analytics pipeline double-counts conversions or your user segmentation is inconsistent, A/B test results will be misleading regardless of sample size.
How A/B Testing Works
The process follows six steps, each with a specific purpose.
1. Form a hypothesis. "Changing the CTA button from blue to green will increase click-through rate on the pricing page." A hypothesis without a specific, measurable prediction is not testable.
2. Define the metric. Click-through rate on the pricing page CTA. This is the primary metric — the one that determines whether the test wins or loses. Secondary metrics (bounce rate, time on page) provide context but do not drive the decision. Ideally, these metrics map to governed KPIs.
3. Calculate sample size. Based on the baseline conversion rate, the minimum detectable effect (how small a difference matters), and the desired statistical power (typically 80%). Running with too few users produces unreliable results. Running with too many wastes time and traffic.
4. Split traffic randomly. 50/50 between control and variant. Randomization must be user-level — the same user sees the same variant every time they visit. Cookie-based or user-ID-based assignment handles this.
5. Wait for significance. Run the test until the pre-calculated sample size is reached. Do not peek at results early and stop when they look good — this inflates false positive rates dramatically.
6. Analyze and decide. Did the variant beat the control by a statistically significant margin? If yes, implement. If no, iterate or discard. If the result is ambiguous, the test was underpowered and needs to be rerun with a larger sample.
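Step 4's requirement — the same user always sees the same variant — is commonly met by hashing a stable user ID rather than storing assignments. A minimal sketch (the function and variant names are illustrative, not from any particular platform):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "green_button")) -> str:
    """Deterministically assign a user to a variant.

    Hashing the user ID together with the experiment name gives a
    stable, approximately uniform split: the same user always lands
    in the same bucket, and different experiments randomize
    independently of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user sees the same variant on every visit:
assert assign_variant("user-42", "cta-color") == assign_variant("user-42", "cta-color")
```

Because assignment is a pure function of the ID, it works identically on every server with no shared state, which is why most experimentation platforms use some variation of this scheme.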
Statistical Foundations
Four concepts separate rigorous A/B testing from guesswork.
Statistical significance. A p-value below 0.05 means that if there were truly no difference between the variants, a result at least this extreme would occur less than 5% of the time. This is the standard threshold. It does not mean the result is "95% correct" or that there is a 95% chance the variant is better; it is a statement about how surprising the data would be under the assumption of no effect.
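For two conversion rates, the standard frequentist check is a pooled two-proportion z-test. A self-contained sketch using only the standard library (the example counts are made up):

```python
from math import erf, sqrt

def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion
    rates, using a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 500/10,000 clicks for control vs 560/10,000 for the variant:
p = two_proportion_p_value(500, 10_000, 560, 10_000)
```

Note that the 5.0% vs 5.6% example above comes out just short of significance at the 0.05 threshold — a reminder that lifts which look meaningful on a dashboard often are not, at realistic sample sizes.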
Sample size. Calculated before the test starts, based on three inputs: the baseline conversion rate, the minimum detectable effect (the smallest improvement worth detecting), and the desired statistical power. A page with a 2% conversion rate needs far more users to detect a 0.5% improvement than a page with a 20% conversion rate. Online calculators (Evan Miller, Optimizely) make this easy.
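The calculation behind those online tools is the standard two-proportion sample size formula. A sketch with the z-values hard-coded for the common defaults (alpha = 0.05 two-sided, power = 0.80):

```python
from math import ceil, sqrt

def sample_size_per_group(baseline: float, mde: float) -> int:
    """Users needed per group for a two-proportion test.

    baseline: current conversion rate (e.g. 0.02)
    mde: minimum detectable effect as an absolute lift (e.g. 0.005)

    Uses the normal approximation; the z-values below are for
    alpha = 0.05 (two-sided) and 80% power only.
    """
    z_alpha, z_beta = 1.96, 0.84
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)
```

For the scenario in the text — a 2% baseline and a 0.5-point minimum detectable effect — this comes out to roughly 13,800 users per group, which illustrates why low-conversion pages need so much traffic to test.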
Type I errors (false positives). Declaring a winner when there is no real difference. The main cause: peeking at results before the test reaches its planned sample size. Each peek is an implicit statistical test, and running many tests inflates the false positive rate. A test checked daily for 14 days has a much higher false positive rate than the nominal 5%.
Type II errors (false negatives). Missing a real effect because the sample is too small. If a test has 60% power (common when sample sizes are not calculated upfront), there is a 40% chance of missing a real improvement. This is why underpowered tests are worse than no test at all — they create false confidence that a change "doesn't work."
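The inflation from peeking is easy to demonstrate by simulation. The sketch below runs repeated A/A tests — both arms convert at the same rate, so every "significant" result is by construction a false positive — and stops at the first daily peek with p < 0.05 (all parameters here are illustrative):

```python
import random
from math import erf, sqrt

def z_test_p(c_a, n_a, c_b, n_b):
    """Two-sided pooled two-proportion z-test p-value."""
    pooled = (c_a + c_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def peeking_false_positive_rate(days=14, users_per_day=500,
                                true_rate=0.05, trials=500, seed=7):
    """A/A simulation: both arms share the same true conversion rate,
    so any declared winner is a false positive. Stopping at the first
    daily peek with p < 0.05 pushes the error rate well above the
    nominal 5%."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        c_a = n_a = c_b = n_b = 0
        for _ in range(days):
            c_a += sum(rng.random() < true_rate for _ in range(users_per_day))
            c_b += sum(rng.random() < true_rate for _ in range(users_per_day))
            n_a += users_per_day
            n_b += users_per_day
            if z_test_p(c_a, n_a, c_b, n_b) < 0.05:
                false_positives += 1
                break
    return false_positives / trials
```

With 14 daily peeks, the simulated false positive rate typically lands in the 15–25% range rather than 5% — exactly the inflation described above.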
Booking.com runs over 1,000 A/B tests simultaneously. Their key learning: most tests show no significant difference. Teams that expect every test to "win" either stop testing or, worse, cherry-pick results. The value of A/B testing is the rigor it brings to decision-making, not the win rate.
— Kaufman et al., Democratizing Online Controlled Experiments at Booking.com
Common Mistakes That Invalidate Tests
Stopping early. The most common and most damaging mistake. A product manager checks the dashboard on day 3, sees a 15% lift with p=0.03, and calls the test. But statistical significance fluctuates wildly in small samples. Many "significant" results at day 3 disappear by day 14. The fix: commit to a sample size before the test starts and do not stop early.
Testing too many variables at once. Changing the headline, hero image, and CTA button in a single test makes it impossible to attribute the result. If the variant wins, which change drove it? If it loses, which change hurt? Run one change per test, or use multivariate testing with a sample size large enough to detect interaction effects.
Ignoring segment effects. A test shows a 5% overall improvement, but mobile users actually saw a 10% decline while desktop users saw a 15% lift. The overall number looks positive, but the mobile experience got worse. Always check results across key segments: device, geography, new vs. returning users.
Sample ratio mismatch. If a 50/50 split shows 48/52 across tens of thousands of users, an imbalance far larger than chance allows at that scale, something is wrong with the randomization — a bot, a caching layer, or a redirect that selectively drops users from one group. Sample ratio mismatch invalidates the entire test. Check it before analyzing results.
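An SRM check is a statistical test in its own right: how likely is a split this uneven if assignment were a fair coin? A minimal sketch using the normal approximation to the binomial:

```python
from math import erf, sqrt

def srm_p_value(n_control: int, n_variant: int,
                expected_ratio: float = 0.5) -> float:
    """Two-sided p-value for the observed group sizes under the
    expected split. Very small values signal sample ratio mismatch;
    practitioners commonly treat p < 0.001 as a failed check."""
    n = n_control + n_variant
    se = sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (n_control / n - expected_ratio) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 48,000 vs 52,000 users: essentially impossible under a fair 50/50 split.
p = srm_p_value(48_000, 52_000)
```

The same 48/52 ratio with only 100 users would be unremarkable, which is why the check must account for sample size rather than eyeballing percentages.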
Novelty and primacy effects. Users may react to change itself, not to the variant. A new design gets more clicks for a week because it looks different, then engagement returns to baseline. Run tests long enough to capture steady-state behavior, not just the novelty spike.
Beyond Basic A/B Testing
Multivariate testing evaluates multiple elements simultaneously — headline, image, and CTA in all combinations. This reveals interaction effects (the blue headline works better with the large image but worse with the small image). The cost: sample size requirements grow multiplicatively. A test with 3 headlines and 3 images needs 9 variants, and each needs enough traffic for significance.
Bandit algorithms dynamically allocate more traffic to the variant that appears to be winning. Instead of a fixed 50/50 split, the algorithm might shift to 70/30 as evidence accumulates. This reduces the "cost of learning" — fewer users see the losing variant — but makes statistical analysis more complex. Best for short-lived decisions like ad creative rotation.
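Thompson sampling is one common bandit strategy: each arm's conversion rate gets a Beta posterior, and traffic goes to whichever arm wins a random draw from those posteriors. A toy simulation (the CTR values and pull count are made up for illustration):

```python
import random

def thompson_pick(successes, failures, rng=random):
    """Choose the arm with the highest draw from its Beta posterior.
    Arms that look better get served more often, but every arm keeps
    a nonzero chance of being explored."""
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return samples.index(max(samples))

# Simulated rotation between two ad creatives with true CTRs of 5% and 10%.
rng = random.Random(0)
true_ctr = [0.05, 0.10]
wins, losses = [0, 0], [0, 0]
for _ in range(5_000):
    arm = thompson_pick(wins, losses, rng)
    if rng.random() < true_ctr[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1
# Traffic drifts toward the better creative as evidence accumulates.
```

This is the "cost of learning" trade-off in miniature: most of the 5,000 impressions end up on the stronger creative, but the fixed-split statistics of a classic A/B test no longer apply directly.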
Bayesian A/B testing provides a probability that one variant is better than another ("there is an 89% probability that B outperforms A by at least 2%") instead of a binary significant/not-significant verdict. This is often more useful for business decisions, where stakeholders want to know "how likely is this better?" rather than "can we reject the null hypothesis?"
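The quantity "probability that B beats A" falls out of the same Beta posteriors. A minimal Monte Carlo sketch under flat Beta(1, 1) priors (the conversion counts are illustrative):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   draws=100_000, seed=1):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1, 1) priors updated with the observed conversions."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if b > a:
            wins += 1
    return wins / draws

# 500/10,000 vs 560/10,000 conversions:
p_better = prob_b_beats_a(500, 10_000, 560, 10_000)
```

For these counts the same data that misses the p < 0.05 bar still yields roughly a 97% probability that B is better — the kind of directly interpretable number stakeholders usually want.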
Netflix uses a contextual bandit approach for personalizing artwork — rather than running fixed A/B tests, the system continuously allocates traffic to the artwork variant most likely to drive engagement for each user segment. This approach shortened the learning cycle from weeks to days.
— Netflix Technology Blog, Artwork Personalization
Data Quality: The Hidden Dependency
A/B test results are only as reliable as the data pipeline that feeds them. Many "failed" experiments are not failed experiments at all — they are experiments run on corrupted data.
Duplicate event tracking. A JavaScript event fires twice on page load due to a race condition. Conversion rates appear doubled. The test declares a winner based on inflated numbers. The real conversion rate is unchanged. This is a data observability problem, not a testing problem.
Inconsistent user identity. A user visits on their phone, gets assigned to variant A. The same user visits on their laptop, gets a new cookie, and is assigned to variant B. They convert on the laptop. The test attributes the conversion to variant B, but the user's initial exposure was variant A. Cross-device identity resolution is a prerequisite for accurate experimentation.
Delayed data. The analytics pipeline has a 24-hour lag. A test is stopped "at significance" based on data that is actually one day stale. The true numbers — including the last day of traffic — might tell a different story. Real-time or near-real-time data pipelines matter for test integrity.
Metric definition disagreements. Marketing defines "conversion" as a form submission. Product defines it as a completed purchase. Both teams analyze the same test and draw opposite conclusions. This is a data governance problem: the organization needs a single, governed definition for each metric used in experimentation.
Tools and Platforms
Full platforms: Optimizely (enterprise-grade, strong statistical engine), LaunchDarkly (feature flags with built-in experimentation), Statsig (product analytics and experimentation combined). These handle randomization, metric tracking, and statistical analysis end-to-end.
Analytics-integrated: Google Analytics experiments (free, basic), Amplitude Experiment (ties experiments to product analytics events). Good for organizations already using these analytics tools and wanting to add experimentation without a separate platform.
Open source: GrowthBook (full-featured experimentation platform with Bayesian statistics) and Wasabi (A/B testing service originally built by Intuit). Best for organizations that want control over their experimentation infrastructure and data.
How Dawiso Supports Experimentation
Reliable A/B testing requires consistent metric definitions across tools. Dawiso's business glossary provides governed definitions for key metrics — conversion rate, revenue per user, engagement score — so that experimentation platforms and analytics dashboards use the same calculations. When marketing and product disagree about what "conversion" means, the glossary settles it.
The data catalog documents which datasets feed experimentation pipelines, their freshness, and known quality issues. Before trusting an A/B test result, teams can check whether the underlying event data was complete, timely, and correctly deduplicated.
When an A/B test produces surprising results, data lineage helps teams verify that the underlying data is correct before acting on the finding. Tracing from the experiment metric back through the analytics pipeline to the raw event source reveals whether the result reflects a real user behavior change or a data pipeline artifact.
Conclusion
A/B testing is the simplest form of causal inference available to product and marketing teams. The statistical method is well-established. The tooling is mature. What most organizations underestimate is the data quality dependency: an experiment is only as trustworthy as the data pipeline that measures it. Governed metric definitions, reliable event tracking, and consistent user identity are the prerequisites that make experimentation work.