Why EKS Blue-Green Upgrades Scare Teams More Than They Should

I've interviewed with companies of all sizes, from small startups to large enterprises, and the conversation around EKS tends to follow a familiar pattern. We talk about control plane upgrades, node groups, Karpenter, keeping add-ons aligned with cluster versions, and everyone is comfortable. Then, while we're on Kubernetes upgrades, I bring up blue-green cluster upgrades, and the tone often changes. The same engineers who are happy to discuss Kubernetes nuances become more cautious when the conversation shifts to provisioning a second cluster and migrating production workloads onto it. Even among teams that regularly perform EKS upgrades, the idea is often viewed as introducing more risk rather than reducing it. The more I talk about it, the more I sound like a madman! You don't want your interviewer to think you're a madman.

I get why. Moving live traffic between clusters sounds like the sort of thing that ends up in an incident review. But most of that fear is bigger than the real risk, and the part that makes it scary has almost nothing to do with EKS. It's about whether you can actually rebuild your platform from scratch, and most engineers don't like to hear that.

I did one of these exactly once, on purpose, in our shared services cluster, just to understand it properly and to 'audit' the platform along the way. I came away thinking it's worth knowing for the right situations and a bad default for most of them. It's slower, and it costs more while both clusters are alive. It needs a level of discipline around automation, GitOps, DNS, observability, secrets, ingress and stateful stuff that a lot of teams assume they have and don't. But when the version gap is large, or the cluster has years of drift baked into it, or you're dealing with Kubernetes APIs that got removed, it's one of the cleanest ways to take the risk out of the upgrade.

There are really two ways to move an EKS cluster to a newer Kubernetes version. You upgrade it where it sits (in-place), or you build a new one and move across (blue-green). Most teams do the first. The second is the one that makes people nervous, so both are worth understanding before you pick.

In-place upgrades

In-place is the default and for most clusters it's the right call. You do it in three layers and the order matters: control plane first, then the data plane, then the managed add-ons.

The three layers of an in-place upgrade, and who owns each one.

The control plane is the easy part, because AWS owns it. When you bump the version it upgrades the API server, the scheduler and the controller manager behind the scenes. You don't touch etcd, you don't worry about API server availability, and you can only move one minor version at a time. That last bit catches people out. You can't jump straight from 1.29 to 1.36. You step through it, 1.29 to 1.30 to 1.31 and so on through to 1.36, validating at each stop, which is exactly why a big version gap turns into a project instead of an afternoon.
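To make the one-minor-at-a-time rule concrete, here's a minimal sketch that lists the stops between two versions. The function names are mine, not anything from AWS tooling, and it only handles the plain "major.minor" strings used in this article.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Split a 'major.minor' Kubernetes version string into integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current: str, target: str) -> list[str]:
    """Return every control plane version to step through, in order."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles a shared major version")
    if tgt_minor < cur_minor:
        raise ValueError("the control plane cannot be moved to an older version")
    # One entry per minor version, because each one is its own upgrade.
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    steps = upgrade_path("1.29", "1.36")
    print(" -> ".join(steps))
    print(f"{len(steps)} separate control plane upgrades, each with its own validation")
```

Seven upgrades for the 1.29 to 1.36 example, and each one drags the data plane and add-on work along with it.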

The data plane is the opposite. It's yours. Your nodes, your kubelet versions, your runtime, your workloads, your manifests, your controllers, your add-on compatibility, all of it. AWS will happily run a newer control plane while your nodes sit a version behind, inside the supported skew, but it isn't going to fix a workload that falls over because an API it depended on is gone.
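As a rough illustration of "inside the supported skew", the check below compares a node's kubelet minor version with the control plane's. The default of three minor versions is my reading of the upstream Kubernetes version skew policy for recent releases, so treat `max_skew` as an assumption and confirm it against the policy for the versions you actually run.

```python
def minor_of(version: str) -> int:
    """Extract the minor number from strings like '1.32' or 'v1.30.4-eks-abc'."""
    return int(version.lstrip("v").split(".")[1])


def skew_status(control_plane: str, kubelet: str, max_skew: int = 3) -> str:
    """Classify a kubelet version relative to the control plane version.

    max_skew is an assumption: how many minor versions a kubelet may lag.
    """
    gap = minor_of(control_plane) - minor_of(kubelet)
    if gap < 0:
        return "newer-than-control-plane"  # kubelet must never lead the API server
    if gap > max_skew:
        return "too-old"
    return "ok"


if __name__ == "__main__":
    for kubelet in ("v1.32.1", "v1.30.4-eks-abc", "v1.28.9"):
        print(kubelet, skew_status("1.32", kubelet))
```

Passing the skew check only says the node is allowed to talk to the control plane. It says nothing about whether your manifests still use APIs the new version dropped.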

How you actually roll the data plane depends on how you run nodes. Managed node groups are the common case and AWS does the rolling replacement for you, cordoning nodes and draining them while it respects PodDisruptionBudgets, then bringing up replacements on the new AMI. Self-managed node groups hand you more control and more rope, you own the rollout. Fargate skips the whole node question since there's nothing to upgrade, though you still recycle the pods to pick up the new version. And Karpenter has changed how a lot of teams think about this, because instead of upgrading a fixed pool you let it bring up new nodes at the target version and drop the old ones through drift and consolidation. With Karpenter a data plane upgrade can be little more than changing the AMI or the node class and letting it churn.

Then the add-ons, which is the layer people forget. CoreDNS, kube-proxy, the VPC CNI, the EBS CSI driver. Each one has versions pinned to Kubernetes versions and each one breaks differently if you let it fall too far behind. kube-proxy and the VPC CNI especially need to stay inside a supported range of the control plane. EKS will tell you which add-on versions go with the target version and mostly it's a matter of moving them up alongside the control plane. The nasty failure mode is the quiet one, where a cluster upgrades the control plane and the nodes but leaves the add-ons sitting there, and everything looks fine for months until DNS or pod networking starts doing something weird under load.
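One way to catch the quiet failure before it happens is to compare what's installed against what EKS lists as compatible with the target version. The sketch below works on a dictionary shaped like the response from boto3's `describe_addon_versions` call (`addons`, `addonVersions`, `compatibilities`, `clusterVersion`). The function names and the sample data are made up for illustration, and the version strings are placeholders, not real compatibility data.

```python
def compatible_versions(response: dict, cluster_version: str) -> dict[str, set[str]]:
    """Map each add-on name to the versions listed for cluster_version."""
    result: dict[str, set[str]] = {}
    for addon in response.get("addons", []):
        versions = set()
        for entry in addon.get("addonVersions", []):
            listed = {c.get("clusterVersion") for c in entry.get("compatibilities", [])}
            if cluster_version in listed:
                versions.add(entry["addonVersion"])
        result[addon["addonName"]] = versions
    return result


def addons_needing_upgrade(installed: dict[str, str], response: dict, target: str) -> list[str]:
    """Return installed add-ons whose current version is not listed for target."""
    compatible = compatible_versions(response, target)
    return sorted(
        name for name, version in installed.items()
        if version not in compatible.get(name, set())
    )


if __name__ == "__main__":
    # Hypothetical response. In real use it would come from something like
    # boto3.client("eks").describe_addon_versions(kubernetesVersion="1.30")
    sample = {
        "addons": [
            {"addonName": "coredns", "addonVersions": [
                {"addonVersion": "v-old", "compatibilities": [{"clusterVersion": "1.29"}]},
                {"addonVersion": "v-new", "compatibilities": [{"clusterVersion": "1.30"}]},
            ]},
            {"addonName": "kube-proxy", "addonVersions": [
                {"addonVersion": "v-both", "compatibilities": [
                    {"clusterVersion": "1.29"}, {"clusterVersion": "1.30"}]},
            ]},
        ]
    }
    installed = {"coredns": "v-old", "kube-proxy": "v-both"}
    print(addons_needing_upgrade(installed, sample, "1.30"))
```

Running something like this as part of every upgrade, in-place or blue-green, is cheap insurance against the add-on that got left behind.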

In-place is the simplest thing in the world when three things are true: the cluster's healthy, the jump is small, and your workloads are already compatible with the target version. If you upgrade regularly, stay a version or two off the latest, and keep add-ons current, you'll do in-place upgrades for years and never need anything fancier. The problems start when those stop being true.

Blue-green upgrades

A blue-green upgrade doesn't really upgrade the cluster. You replace it. It sounds radical. It is.

You stand up a second EKS cluster on the target version, bootstrap it with the same platform pieces the old one runs, deploy your workloads onto it, and check it works while it's carrying no real traffic. Then you move traffic over from old to new, keep the old one running as your rollback, and only tear it down once you trust the new one. Old is blue, new is green, traffic goes blue to green.

The blue-green setup. Route 53 out front, two clusters kept in sync by Argo CD, durable state living outside the cluster.

Put like that it sounds simple, and the traffic mechanics usually are simple. The hard part is everything that has to line up for green to be a faithful copy of blue. This isn't really an EKS task. It's a test of whether your company can rebuild its platform from code. If green comes up and settles into the same working state as blue, your platform's in good shape. If it doesn't, blue-green is going to show you exactly where the holes are, usually at the worst possible time.

What needs to be true first

Before any of this is realistic there's some groundwork that has to already exist. Exercised, not knocked together in a panic the week of the upgrade.

You need a repeatable way to create the cluster. Terraform, CDK, Pulumi, a bash script with eksctl, whatever you use, the point is the cluster and the AWS stuff around it come from code and you can spin up an identical second one without clicking around the console. You need GitOps, Argo CD or Flux, so workloads are defined in Git and applied by reconciliation instead of by whoever ran kubectl apply last. You need secrets pulled out of the cluster, through the External Secrets Operator backed by Secrets Manager or SSM Parameter Store or something like it, so green can fetch the same secrets blue uses instead of you relying on values somebody pasted in by hand a year ago.

You need ingress that can point at either cluster, and you need to actually know DNS and traffic routing. If the answer to "how does traffic get into this cluster" is a shrug, or the name of someone who left, you're not ready. You need observability working before the cutover and not after, because the only way to trust green is to watch it next to blue under real conditions. You need a plan for persistent data, which is the bit everyone skips and then regrets. And you need a rollback plan, plus some way to compare the two clusters side by side before you move a single request.

Why GitOps and Argo CD do the heavy lifting

GitOps is what makes the rebuild not terrible, and Argo CD is what I reach for. The idea's simple and it changes the whole feel of the migration. Instead of hand-recreating deployments, services, ingress objects, configmaps, RBAC, service accounts and Helm releases on the new cluster, you register green as a target in Argo CD and let it reconcile to whatever's defined in Git. Green becomes a thing that converges to desired state, not a pile of manual work.

In practice you set this up with an app-of-apps pattern, or with ApplicationSets, so pointing Argo CD at a new cluster pulls the entire platform across rather than one app at a time. ApplicationSets are handy here because you can template across clusters and have one definition produce the right apps for green.
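ApplicationSets do this templating inside Argo CD itself. The sketch below only imitates the idea in plain Python, so you can see how one definition fans out into one Application per app per cluster. The Application fields are the standard ones from the `argoproj.io/v1alpha1` API, but the repo URL, cluster endpoints, app names and path layout are all hypothetical.

```python
import json


def render_applications(apps: list[str], clusters: dict[str, str], repo_url: str) -> list[dict]:
    """Produce one Argo CD Application manifest per (cluster, app) pair."""
    manifests = []
    for cluster_name, api_server in clusters.items():
        for app in apps:
            manifests.append({
                "apiVersion": "argoproj.io/v1alpha1",
                "kind": "Application",
                "metadata": {"name": f"{app}-{cluster_name}", "namespace": "argocd"},
                "spec": {
                    "project": "default",
                    # Assumed repo layout: one directory per app under apps/.
                    "source": {"repoURL": repo_url, "path": f"apps/{app}", "targetRevision": "main"},
                    "destination": {"server": api_server, "namespace": app},
                    "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
                },
            })
    return manifests


if __name__ == "__main__":
    clusters = {
        "blue": "https://blue.example.invalid",    # hypothetical API endpoints
        "green": "https://green.example.invalid",
    }
    rendered = render_applications(["payments", "search"], clusters, "https://git.example.invalid/platform.git")
    print(json.dumps(rendered[0], indent=2))
    print(f"{len(rendered)} applications from one definition")
```

Adding green is one new entry in `clusters`, which is the whole appeal: the platform comes across as a set, not as a checklist of apps.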

What bites people is that the cluster-specific values still matter. Most of your manifests are identical between blue and green, but a few things aren't, and those few things are exactly what cause a silent failure when you get them wrong. Ingress hostnames differ while you're mid-migration. IAM roles for service accounts have to be wired up again for green either way: with IRSA the role's trust policy is tied to a specific cluster's OIDC provider, and with the newer EKS Pod Identity the associations are created per cluster. Storage classes can be different. External secret references and per-environment config have to resolve against the new cluster. GitOps gets you most of the way. The rest is cluster-specific config, and that's where your attention actually goes.

What the switchover looks like

Blue's live and carrying everything. You create green on the target version. You put the core platform components on first, before any workloads: AWS Load Balancer Controller, ExternalDNS, cert-manager, External Secrets Operator or Secrets Store CSI Driver or whatever secrets tool you use, Argo CD or Flux, your metrics and logging, the ingress controller, a service mesh if you run one. Then you sync workloads into green through Argo CD.

Somewhere before workloads land you fix the deprecated APIs, and this is much better done ahead of time than discovered live. kubent (kube-no-trouble) or pluto will scan your manifests and live objects for APIs that got removed in the target version, and green, running the newer Kubernetes, will just reject anything you missed. Fixing these in the old cluster first, where the old versions still work, beats finding out when green refuses the manifest.
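kubent and pluto are the tools to actually use. To show what they're checking, here's a toy version that takes already-parsed manifests (plain dictionaries) and flags the ones using an API removed at or before the target version. The table is deliberately tiny and only lists a handful of well-known removals, so it's an illustration of the technique and not a substitute for the real scanners.

```python
# A small, incomplete table: (apiVersion, kind) -> version in which it was removed.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def find_removed(manifests: list[dict], target: str) -> list[str]:
    """Return a finding for each manifest the target version would reject."""
    major, minor = (int(part) for part in target.split(".")[:2])
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        if removed_in is not None and removed_in <= (major, minor):
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(
                f"{key[1]}/{name}: {key[0]} was removed in {removed_in[0]}.{removed_in[1]}"
            )
    return findings


if __name__ == "__main__":
    manifests = [
        {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "nightly-report"}},
        {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "api"}},
    ]
    for finding in find_removed(manifests, "1.30"):
        print(finding)
```

The fix for each finding lands in Git against blue, where both the old and new API versions are usually still served, so green never sees the old manifest at all.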

Once green's up you test it without sending it real users. Smoke tests, synthetic traffic, health checks, and a careful look at metrics and logs next to whatever blue is doing. When green looks right, you start adjusting DNS or load balancer routing to move traffic.

How traffic moves is down to your architecture. Some setups just flip. Most of the time you want it gradual, and Route 53 weighted records are the easy way to do that. You point a weighted record set at both clusters' load balancers and shift the weights over time. 100/0 to start, then 80/20, 50/50, 20/80, and finally 0/100 once green's earned it. At each step you're watching error rates, latency and saturation before you push more across.
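Here's a sketch of what one step of that shift looks like as data. It builds the `ChangeBatch` structure that Route 53's `change_resource_record_sets` call expects, with two weighted records sharing one name. I've used CNAME records to keep it short; in practice you'd often use alias A records pointing at the load balancers instead. The hostnames, the `blue`/`green` set identifiers and the TTL are all assumptions.

```python
# The schedule from the text: (blue weight, green weight) at each step.
SHIFT_STEPS = [(100, 0), (80, 20), (50, 50), (20, 80), (0, 100)]


def weighted_change_batch(record_name: str, blue_target: str, green_target: str,
                          blue_weight: int, green_weight: int, ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch that sets both weighted records at once."""
    def upsert(set_id: str, target: str, weight: int) -> dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,  # distinguishes the two weighted records
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": f"blue={blue_weight} green={green_weight}",
        "Changes": [
            upsert("blue", blue_target, blue_weight),
            upsert("green", green_target, green_weight),
        ],
    }


if __name__ == "__main__":
    blue_w, green_w = SHIFT_STEPS[1]
    batch = weighted_change_batch(
        "app.example.com.", "blue-lb.example.com", "green-lb.example.com", blue_w, green_w
    )
    print(batch["Comment"])
    # Applying it would look something like:
    # boto3.client("route53").change_resource_record_sets(
    #     HostedZoneId="<your zone id>", ChangeBatch=batch)
```

Keep the TTL short while you're shifting, since resolvers that cached the old answer won't see a new weight until it expires. Rolling back is the same call with an earlier entry from `SHIFT_STEPS`.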

Shifting Route 53 weights from blue to green a step at a time. Rollback is the same move in reverse.

Rollback, which is the entire point of doing it this way, is just moving the weight back toward blue. Blue's still running. It's still got the workloads. It still works. If green starts misbehaving at 20 percent, you put the weight back to 100/0 on blue and you've lost basically nothing. No frantic restore, no rebuild under pressure, because the old cluster never went anywhere.

Two questions always come up here. The first is whether this is just Argo Rollouts, since that also does blue-green. It isn't, and the clash of names is most of the confusion. Argo Rollouts works inside one cluster: it swaps a Deployment for a Rollout, runs an old and a new ReplicaSet side by side, and moves users between them by flipping a Service or shaping traffic through your ingress or mesh. It has no idea two EKS clusters exist and it never touches Route 53. So it's the right tool for app deploys inside green, and the wrong layer for moving traffic between blue and green. The cluster cutover sits one level up, at Route 53 or a global load balancer or a mesh that spans both clusters, which is the layer this whole section is about.

The second question is how you know green's misbehaving in the first place, and mostly the answer is observability and a human watching. Before green sees a single real user you've already hit it with smoke tests and synthetic traffic, so that part happens off to the side. Then during the weight shift you watch green next to blue: 5xx rate, p95 and p99 latency, pod restarts, saturation, and a couple of real SLOs, in CloudWatch or Prometheus and Grafana or whatever you run. The weight bumps stay manual, gated on those dashboards.
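Even with a human making the call, it helps to write the gate down so everyone agrees what "green looks fine" means before the shift starts. The sketch below is one hypothetical way to encode it: the thresholds, the two metrics and the three outcomes are my assumptions, and the numbers would come from whatever your dashboards already show.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics for one cluster over the current observation window."""
    error_rate: float       # fraction of responses that were 5xx, 0.0 to 1.0
    p99_latency_ms: float


def next_action(blue: Snapshot, green: Snapshot,
                max_error_rate: float = 0.01,
                max_latency_ratio: float = 1.25) -> str:
    """Suggest 'rollback', 'hold' or 'advance' for the next weight change.

    Both thresholds are illustrative defaults, not recommendations.
    """
    if green.error_rate > max_error_rate:
        return "rollback"  # errors are the hard stop: move weight back to blue
    if green.p99_latency_ms > blue.p99_latency_ms * max_latency_ratio:
        return "hold"      # slower than blue: stay at this weight and investigate
    return "advance"


if __name__ == "__main__":
    blue = Snapshot(error_rate=0.001, p99_latency_ms=200)
    print(next_action(blue, Snapshot(error_rate=0.001, p99_latency_ms=210)))
    print(next_action(blue, Snapshot(error_rate=0.002, p99_latency_ms=400)))
    print(next_action(blue, Snapshot(error_rate=0.050, p99_latency_ms=210)))
```

The function only suggests. Someone still looks at the dashboards and runs the weight change, which is the point of keeping the bumps manual.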

After green's carried full traffic through a bake long enough to cover your real patterns, daily and weekly peaks included, you drain blue, freeze it, then destroy it. Freezing before destroying matters: keep blue reachable but idle for a while in case something only shows up days later.

Where it gets hard: state

The thing that actually makes blue-green hard is state.

Stateless services are easy. They hold nothing, so moving them is just moving where they run. Most enterprise workloads aren't stateless at all. Databases, queues, persistent volumes, caches, sticky sessions, long-running jobs, anything that assumes it owns a fixed cluster identity: that's the hard part. None of it moves cleanly just because you pointed Argo CD somewhere new.

The cleanest setups dodge the whole problem by keeping durable state out of Kubernetes entirely. If your databases live in RDS, caches in ElastiCache, objects in S3, queues in SQS, tables in DynamoDB, streams in MSK, shared files in EFS, then both clusters talk to the same external state and the migration stays a compute problem. Green points at the same RDS instance blue does. Nothing has to move.

It's worth being precise about which storage this actually means, because not all persistent volumes are the same. The painful one is EBS. A normal EBS volume is single-AZ and ReadWriteOnce, so it attaches to one node at a time, and its PV and PVC are per-cluster objects bound to an AZ-locked disk. There's io2 Multi-Attach, which does let several nodes share one volume, but it's same-AZ only and hands you a raw block device the app has to coordinate writes on, since a normal ext4 or XFS filesystem corrupts with two writers. That's a tool for clustered databases that fence their own writes, not a casual cross-cluster share, and the EBS CSI driver only supports it for io2 in block mode anyway. For the ordinary filesystem volume a StatefulSet uses, you can't have a blue pod and a green pod writing the same EBS volume, so moving it means detach from blue, attach to green, and that turns into a data migration: snapshots, replication, and consistency during the cutover.

The fear is really an audit of DRIFT!

This is the actual reason blue-green makes teams nervous, and again, it isn't about EKS.

A blue-green upgrade quietly audits your platform and surfaces things people would rather not look at. If your cluster can't be rebuilt from code, then it's not really infrastructure as code, it's a cluster you're scared to touch. If your workloads can't be redeployed onto a fresh cluster, you don't really have GitOps, you've got a Git repo that happens to match production by luck. The CronJob somebody created with kubectl apply during an incident three years ago. The ConfigMap that got hotfixed in production and never made it back to Git. The service account with permissions nobody remembers granting. The ingress rule that only exists because someone was debugging a problem at 2am. None of these things show up as a problem while the existing cluster keeps running. The moment you build green from code and compare it to blue, they become obvious.
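The comparison itself is nothing clever. A minimal sketch, assuming you've already exported what Git defines and what blue is actually running into two dictionaries keyed by (kind, namespace, name), with some hash of each object's spec as the value. How you build those dictionaries, and every name in the sample data, is left as an assumption.

```python
def drift_report(git_state: dict, live_state: dict) -> dict:
    """Compare desired state from Git with what the live cluster holds."""
    git_keys, live_keys = set(git_state), set(live_state)
    return {
        # Created by hand and never committed, like the incident CronJob.
        "only_in_cluster": sorted(live_keys - git_keys),
        # Defined in Git but missing from the cluster.
        "only_in_git": sorted(git_keys - live_keys),
        # Present in both but different, like the hotfixed ConfigMap.
        "changed": sorted(k for k in git_keys & live_keys if git_state[k] != live_state[k]),
    }


if __name__ == "__main__":
    git_state = {
        ("ConfigMap", "payments", "settings"): "hash-a",
        ("Deployment", "payments", "api"): "hash-b",
    }
    live_state = {
        ("ConfigMap", "payments", "settings"): "hash-hotfixed",
        ("Deployment", "payments", "api"): "hash-b",
        ("CronJob", "payments", "incident-cleanup"): "hash-c",
    }
    for category, items in drift_report(git_state, live_state).items():
        print(category, items)
```

Everything under `only_in_cluster` and `changed` is something green won't have unless someone decides, on purpose, that it belongs in Git.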

That's why it scares teams. It isn't only an upgrade strategy, it's a maturity check, run in production, with traffic on the line. A team that can do a blue-green upgrade without sweating it has a platform that genuinely rebuilds, and most teams aren't as sure of that as they'd like to be.

When I'd reach for it, and when I wouldn't

I think blue-green is worth learning even if you almost never use it.

Doing it forces better habits, because you can't fake your way through it. The rollback story is far safer than anything in-place gives you, since the previous version of the whole cluster is just sitting there. You get to test the new cluster properly before a single user touches it. And it's the natural pick for a specific set of jobs: jumping across several Kubernetes versions at once, cleaning up years of technical debt you'd rather not carry forward, swapping out a service mesh or ingress pattern (I'm looking at you, Gateway API) that doesn't fit anymore, or proving out a platform rebuild end to end.

It is not the default though. For a small version bump on a healthy cluster, in-place is almost certainly right, and standing up a second cluster is wasted effort and wasted money. What changes the maths is when the cluster's business-critical and carries years of drift, when it's full of deprecated APIs, when the dependencies aren't fully understood anymore, or when the version jump is big enough that an in-place upgrade carries real risk on its own. If you're part of a team that does blue-green upgrades, your platform team is already mature.

What the AWS guide gets right

AWS has published a proper guide and a working sample for this, and it's worth a read. The EKS Blueprints for Terraform blue/green upgrade pattern spins up two clusters in a shared VPC, bootstraps them with Argo CD using the app-of-apps and ApplicationSets patterns, exposes workloads with the AWS Load Balancer Controller and ExternalDNS, and shifts traffic with Route 53 weighted routing. There's a companion writeup on the AWS containers blog about blue/green and canary migration for stateless Argo CD workloads. I'm not going to reproduce either here, go read them.

What I like about the AWS take is that it treats the cutover as a traffic problem once the new cluster is ready, which is the right way to think about it. There's no magic in it. You build green, you validate green, you shift traffic slowly while you watch the metrics, and you keep blue alive until you're confident. One detail worth stealing is how the sample passes cluster-specific metadata from Terraform into Argo CD so each cluster's add-ons and workloads adapt to their own context, which is a clean answer to that cluster-specific values problem from earlier. The value of the guide is seeing a concrete, tested version of the flow, not any one clever trick.

Summary: The upgrade is rarely the real problem

Blue-green EKS upgrades aren't scary because they're technically impossible. The traffic mechanics are well understood and AWS has them documented. They're scary because they show, out in the open and under load, whether your platform can actually be rebuilt.

If your infrastructure is in code, your workloads run through GitOps, your secrets are external, your ingress and DNS are understood, and your observability works against any cluster you point it at, then blue-green is a controlled migration with a clean rollback. Pretty boring, really. If those things aren't true, then the upgrade was never your real problem. The platform is. And the most useful thing about trying a blue-green upgrade even once, even just to learn it, is that it tells you which of those two you're sitting in.