B Testing

Your team has just merged a significant change to the checkout flow. The pull request looked clean, the tests passed, staging was fine. But the last time you pushed a change like this straight to production, conversion dropped 12% for three hours before anyone noticed. You want to ship it — but not to everyone at once. You want a way to expose it to a small slice of real traffic, watch the numbers, and either expand it or pull it back.

That instinct is the origin of all three techniques covered here. Canary releases, blue/green deployments, and A/B testing all involve running more than one version of something simultaneously and directing traffic between them. The mechanisms overlap. The intent does not — and that distinction is where most of the confusion lives.

The same lever, three different questions

Before digging into each technique, it helps to see the shared skeleton. In all three cases, you have two versions of something — a service, a page, a feature — and a routing layer that decides which version a given request sees. What differs is the question you're trying to answer with that routing decision.

Canary asks: is this new version safe to roll out to everyone? Blue/green asks: can I switch production instantly and roll back just as fast if something goes wrong? A/B testing asks: which version produces better outcomes for the user?

Safety, operational control, and product optimisation. Superficially similar plumbing; fundamentally different goals.

Canary releases — rolling out with a safety net

The name comes from the coal mining practice of carrying a caged canary underground. The bird reacted to toxic gas before the miners did: if the canary collapsed, gas was present; if it stayed healthy, it was safe to proceed. A canary release works on the same principle: you send a small percentage of real traffic to the new version first, and you watch what happens before committing to a full rollout.

Imagine your team deploys version 2 of the checkout service alongside the existing version 1. Your load balancer routes 5% of traffic to v2. Everything else still hits v1. For the next 30 minutes, you monitor error rates, p99 latency, downstream service errors, and payment gateway timeouts. If the numbers look clean, you shift to 20%, then 50%, then 100%. If something looks wrong at 5%, you shift that 5% back to v1. Only a fraction of users were ever affected.

The key characteristic of a canary release is that the traffic split is temporary and progressive. You start narrow and expand as confidence grows. The new version is the candidate under evaluation. The old version is the safety net. When the canary completes, the old version is decommissioned.
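The progressive rollout described above can be sketched as a small control loop. This is a minimal illustration, not a production controller: `HealthSnapshot`, `is_healthy`, `run_canary`, and the thresholds are hypothetical names and values chosen for the example, and the `set_weight` and `observe` callbacks stand in for whatever your load balancer and monitoring system actually expose.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class HealthSnapshot:
    """Operational metrics observed for the canary during one soak period."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency in milliseconds


def is_healthy(snapshot: HealthSnapshot,
               max_error_rate: float = 0.01,
               max_p99_ms: float = 800.0) -> bool:
    # Thresholds are illustrative assumptions; real ones come from your SLOs.
    return (snapshot.error_rate <= max_error_rate
            and snapshot.p99_latency_ms <= max_p99_ms)


def run_canary(set_weight: Callable[[int], None],
               observe: Callable[[int], HealthSnapshot],
               steps: Sequence[int] = (5, 20, 50, 100)) -> bool:
    """Expand the canary step by step; send all traffic back to v1 on failure.

    set_weight(percent) tells the router what share of traffic goes to v2.
    observe(percent) is expected to wait out the soak period (for example
    30 minutes) and return the metrics gathered at that traffic share.
    """
    for percent in steps:
        set_weight(percent)
        if not is_healthy(observe(percent)):
            set_weight(0)   # roll back: v1 serves everything again
            return False
    return True             # v2 now serves 100% of traffic
```

Note that the loop only ever looks at operational metrics. Nothing in it asks whether users prefer v2.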

What you're watching during a canary is primarily operational health: errors, latency, crash rates, resource consumption. You're not asking "do users prefer v2?" You're asking "does v2 work correctly under real production load?" That's a different question, and it matters for understanding what a canary can and cannot tell you.

A canary release is not the right tool if the question is which version users prefer. That's A/B testing. A canary exists specifically to de-risk deployment. The answer it produces is binary: safe to proceed, or not safe to proceed.

Blue/green deployments — instant switch, instant escape hatch

You've just deployed a new version of your API service. Two minutes later, a monitoring alert fires: database connection pool exhaustion, cascading into timeouts across dependent services. Every second matters. How quickly can you get back to the last known good state?

Blue/green deployment is built for exactly this situation. The idea is straightforward: you maintain two identical production environments, called blue and green. At any given time, one is live (serving all traffic) and the other is idle (standing by). When you deploy a new version, you deploy it to the idle environment. You run your smoke tests against it there. Then you flip the router: the idle environment becomes live, and the live environment becomes idle. If something goes wrong, you flip the router back. The previous version never stopped running.

The critical difference from a canary is that there is no gradual traffic shift. In a blue/green deployment, the switch is binary: 0% or 100%. One version serves all traffic; the other serves none. The safety mechanism is the instant reversal, not the limited blast radius.
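As a rough sketch of that flip, the router only needs to remember which environment is live. `BlueGreenRouter` and the `deploy` and `smoke_test` callbacks are hypothetical names for this example; in a real system the flip would be a DNS change, a load balancer target swap, or similar.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which of two identical environments receives all traffic."""

    def __init__(self, live: str = "blue") -> None:
        if live not in ("blue", "green"):
            raise ValueError("live must be 'blue' or 'green'")
        self.live = live

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_and_switch(self,
                          version: str,
                          deploy: Callable[[str, str], None],
                          smoke_test: Callable[[str], bool]) -> bool:
        """Deploy to the idle environment, test it there, then flip."""
        target = self.idle
        deploy(target, version)
        if not smoke_test(target):
            return False        # live environment was never touched
        self.live = target      # the switch: 0% to 100% in one step
        return True

    def rollback(self) -> None:
        """Flip back; the previous version is still running on the idle side."""
        self.live = self.idle
```

There is no percentage anywhere in this class, which is the point: the only states are "blue is live" and "green is live".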

This makes blue/green deployments particularly well suited to services where partial rollouts are difficult or undesirable: services with session state that would be corrupted by split traffic, releases that ship alongside backward-compatible database schema changes (backward-compatible so that the old version still works if you flip back), or any situation where you want the cleanest possible cut between old and new.

The cost of blue/green is infrastructure: you are running two full production environments simultaneously, even though only one is handling traffic. For large-scale services, this is a meaningful operational expense. That cost buys you the fastest possible rollback path — and for services where downtime or a bad deployment has severe consequences, that trade is often worth it.

A/B testing — letting users decide

Your design team has a theory: changing the checkout button from grey to green will increase conversions. Your engineering team has a different theory: it will make no difference. Neither team is wrong to have an opinion, but opinions are not data. You need to run the experiment against real users and measure the outcome.

A/B testing routes different users to different versions of something, not to validate operational stability but to measure behavioral difference. Version A might be the current checkout button. Version B is the green one. You split incoming users, assign them consistently to one variant, run the experiment until it has collected the sample size you planned for, and compare conversion rates. If the difference is statistically significant, the version that produces better outcomes wins.

The traffic split in an A/B test is intentional and sustained. It is not temporary, and it does not expand as confidence grows. Both variants run in parallel for the duration of the experiment. Users are assigned to a variant and kept there so their behavior is measured consistently. If a user switched between variants mid-experiment, their data would corrupt the results.

This is the sharpest distinction from canary releases. A canary narrows exposure to protect users from a potentially broken version. An A/B test intentionally exposes different groups of users to different experiences because that is the measurement. Both use traffic splitting — but one is risk mitigation and the other is experimentation.

A/B testing has its own discipline that the other two techniques don't require. You need a hypothesis, a primary metric, a minimum detectable effect, and enough traffic to reach statistical significance in a reasonable timeframe. Running an A/B test for too short a period, or stopping it early when one variant looks like it's winning, produces unreliable results. This is known as the peeking problem: looking at results before the sample size is sufficient inflates the false positive rate. A canary release doesn't care about sample size; you're watching for error spikes, which are visible quickly. An A/B test measuring a 2% improvement in conversion requires far more data before the result is trustworthy.
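To get a feel for "far more data", here is a back-of-the-envelope sample size estimate. It uses the common normal-approximation formula for comparing two proportions, which is one standard choice rather than anything specific to this article. The function name `sample_size_per_variant`, the 3% baseline conversion rate in the usage note, and the defaults of 5% significance and 80% power are all assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float,
                            relative_lift: float,
                            alpha: float = 0.05,
                            power: float = 0.8) -> int:
    """Approximate users needed per variant for a two-sided test.

    baseline_rate: conversion rate of variant A (e.g. 0.03 for 3%).
    relative_lift: minimum detectable effect, relative (e.g. 0.02 for +2%).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

With an assumed 3% baseline, `sample_size_per_variant(0.03, 0.02)` comes out at more than a million users per variant under this approximation. The required sample grows with the inverse square of the effect size, so halving the effect you want to detect roughly quadruples the traffic you need.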

Where they overlap — and where they don't

Canary releases and A/B tests look similar from the outside. Both split traffic. Both run two versions simultaneously. But the overlap is shallow.

A canary uses a temporary, expanding split. An A/B test uses a fixed, sustained split. A canary measures operational metrics: errors, latency, CPU. An A/B test measures behavioral metrics: clicks, conversions, session length. A canary is a deployment mechanism. An A/B test is an experiment framework. In a canary, you want version 2 to perform identically to version 1 (just without the bugs). In an A/B test, you want the two versions to produce measurably different user behavior.

Blue/green and canary overlap more in intent, since both are risk-mitigation strategies for deployments, but diverge on mechanism. Blue/green switches all traffic at once and relies on instant reversal. Canary shifts traffic gradually and relies on limited exposure. Blue/green is better when rollback has to be a single, near-instant router flip (a hard requirement, not just a preference). Canary is better when you want to validate the new version under partial real-world load before committing.

It's also common to use canary and A/B testing at the same time, through the same routing layer. You shift 10% of traffic to a new version (canary). Within that 10%, you split users between two UI variants (A/B test). The canary answers "is it safe?" and the A/B test answers "which version works better?" simultaneously. The mechanisms share infrastructure but serve separate purposes.

What the routing layer actually does

All three techniques depend on a routing layer that can direct requests to different service versions. In practice this is implemented at several different points in the stack depending on the architecture.

At the infrastructure level, a load balancer or reverse proxy (NGINX, HAProxy, AWS ALB) routes incoming requests based on weighted target groups. This is sufficient for simple canary releases and blue/green switches where all users are treated the same.

One layer up, a service mesh like Istio or Linkerd can apply traffic rules per service, based on request attributes like headers, cookies, or source identity. This is more flexible: you can route requests to the canary only if they carry a specific header, which lets internal teams test the new version before real users see it.
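The two routing styles just described, a weighted split and a header-based override, can be combined in a few lines. This is a sketch of the decision logic only, not of any real proxy or mesh configuration: `choose_backend`, the `x-canary` header, and the `v1`/`v2` labels are hypothetical.

```python
import random
from typing import Mapping, Optional


def choose_backend(headers: Mapping[str, str],
                   canary_weight: float,
                   rng: Optional[random.Random] = None) -> str:
    """Pick a backend version for one request.

    headers: request headers, assumed to have lower-cased keys.
    canary_weight: share of ordinary traffic sent to v2, from 0.0 to 1.0.
    rng: optional random source, injectable so the split can be tested.
    """
    # Attribute-based rule: an opt-in header always reaches the canary,
    # which lets internal testers see v2 even at 0% public exposure.
    if headers.get("x-canary", "").lower() == "true":
        return "v2"
    # Weighted rule: every other request is split by percentage.
    draw = (rng or random).random()
    return "v2" if draw < canary_weight else "v1"
```

Because the weighted branch draws a fresh random number per request, the same user can land on different versions from one request to the next. That is acceptable for a canary or a blue/green switch, where all users are treated the same.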

For A/B testing specifically, the routing needs to be sticky: the same user must always see the same variant. This is typically implemented via a hashed user identifier (user ID, session ID, or a persistent cookie). The hash determines which bucket a user falls into, and that assignment stays fixed for the duration of the experiment. Without stickiness, a user could see variant A on one request and variant B on the next, which corrupts the measurement.
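A minimal sketch of hash-based sticky assignment follows. The function name `assign_variant` and the experiment label are hypothetical, and SHA-256 is just one reasonable choice of hash; what matters is that the result depends only on stable inputs.

```python
import hashlib
from typing import Sequence


def assign_variant(user_id: str,
                   experiment: str,
                   variants: Sequence[str] = ("A", "B")) -> str:
    """Deterministically map a user to one variant of one experiment.

    The experiment name is mixed into the hash so that the same user can
    fall into different buckets in different experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Since nothing is stored, any server can compute the same answer for the same user, and the assignment survives restarts and redeployments for as long as the identifier itself is stable.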

Feature flags are a closely related mechanism. A feature flag is a runtime switch in code that enables or disables a feature for a subset of users. They can implement any of the three techniques depending on how they're configured: a feature flag that expands from 1% to 100% is a canary. A feature flag that holds at 50%/50% for a measurement period is an A/B test. Feature flags decouple deployment from release: you can deploy code to production with the flag off, then turn it on without another deployment, which gives you even finer-grained control than infrastructure-level routing.
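Here is a sketch of a percentage-based flag that can play either role depending on how its rollout value changes over time. `PercentageFlag` is a hypothetical class written for this example, not the API of any particular feature flag product.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PercentageFlag:
    """A runtime switch that enables a feature for a stable share of users."""
    name: str
    rollout_percent: float = 0.0   # 0 to 100, changeable without a deploy

    def is_enabled(self, user_id: str) -> bool:
        key = f"{self.name}:{user_id}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        # Each user gets a fixed bucket from 0 to 9999 for this flag.
        bucket = int.from_bytes(digest[:8], "big") % 10_000
        # Users whose bucket falls below the threshold see the feature.
        return bucket < self.rollout_percent * 100
```

Because each user's bucket never changes, raising `rollout_percent` from 1 to 100 only ever adds users, which is canary-style expansion. Holding it at 50 for a measurement period gives a stable half-and-half split, which is what an A/B test needs.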

Choosing between them

The question you're trying to answer should drive the choice.

Use a canary release when you're shipping a change and want to validate it against real production traffic before full exposure. The goal is confidence. The measure of success is that operational metrics stay within normal bounds as the rollout expands. Use it for code changes, dependency upgrades, configuration changes — anything where correctness under real load is the concern.

Use blue/green deployment when the primary risk is the deployment mechanism itself rather than the code change. The goal is operational safety. If your deployment process has historically caused outages during the switch, or if you need a guaranteed rollback path that doesn't depend on shifting traffic percentages, blue/green gives you that. It's also the right model when you have strict requirements around not serving mixed versions simultaneously.

Use A/B testing when the question is a product question: does this change improve outcomes? The goal is learning. The measure of success is a statistically significant difference in the primary metric, not error rates. Use it when you have a hypothesis about user behavior and you want to validate it against real data rather than intuition or design reviews.

In mature systems you will use all three, often at the same time, through a shared routing and flag infrastructure. The techniques are not alternatives to each other so much as tools for different jobs. The trap is conflating them — running a canary and treating a conversion drop as a rollback signal (when conversion might have dropped for unrelated reasons), or running an A/B test and evaluating it based on error rate (when both variants could be equally buggy).

Summary

The confusion between these three techniques comes from the fact that they share plumbing. All three split traffic between versions. But traffic splitting is just the mechanism; the intent is what separates them.

Canary releases answer an operational question: is this safe? The traffic split is temporary, expands progressively, and the measurement is system health. Blue/green deployment answers a risk management question: can I switch instantly and reverse just as fast? The split is binary, the safety net is the idle environment, and the measurement is speed of recovery. A/B testing answers a product question: which version produces better outcomes? The split is fixed and sustained, the measurement is user behavior, and you need statistical rigor to trust the result.

Using the wrong tool produces the wrong answer. Treating a canary as an A/B test conflates stability signals with product signals. Treating an A/B test as a deployment mechanism skips the operational validation entirely. The overlap is real, but it's superficial. Once you know which question you're asking, the right tool becomes obvious.

Part of the Explained series — concepts in tech, clearly.