Progressive Delivery on Kubernetes with Argo CD and Argo Rollouts

Argo CD and Argo Rollouts solve different problems in the release path. This guide shows how to use them together for safer canary and blue-green delivery on Kubernetes.

TL;DR

Progressive delivery on Kubernetes is not just a nicer rolling update. Argo CD reconciles Git against the cluster and keeps the desired state honest, while Argo Rollouts adds first-class release strategies such as canary and blue-green, with analysis gates and traffic-aware promotion. When you combine them, you get a clear control boundary: Git defines intent, Argo CD applies it, and Argo Rollouts manages staged exposure and rollback decisions. That split makes release behavior more predictable, especially when you need metric-based promotion instead of blind full-cluster cutovers.

Argo Rollouts is the control plane that adds staged promotion and analysis on top of GitOps-driven delivery.

Rolling Updates Are Not Progressive Delivery

Kubernetes Deployments are good at one thing: replacing old Pods with new Pods while keeping the workload available. That is valuable, but it is not the same as progressive delivery. A rolling update gradually shifts replicas, yet it does not give you a first-class release controller that can pause, analyze, split traffic, or promote only after the new version proves itself.
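
For comparison, here is a minimal sketch of what a native rolling update can express. The probe path is a hypothetical endpoint, and the surge settings are illustrative values rather than a recommendation.

```yaml
# Baseline for comparison: a plain Deployment with a rolling update.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 6
  selector:
    matchLabels:
      app: checkout
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra Pod above the desired count
      maxUnavailable: 0    # never drop below the desired count
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/example-org/checkout:1.8.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz   # hypothetical health endpoint
              port: 8080
```

The only gates here are Pod readiness and the surge and unavailability budgets. Nothing in this manifest can hold exposure at a fixed percentage or wait on a metric.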

That difference matters when the release itself is risky. If your service has meaningful traffic, an external dependency, or a history of regressions, a blind rollout is a weak control system. You want the rollout to be observable, reversible, and governed by a policy that says when to continue.

What to watch during a release:

Error rate by version
Latency percentiles by version
Availability of the new ReplicaSet or Pod group
Customer-facing SLO indicators
Any business KPI that should block promotion

Argo CD and Argo Rollouts Are Not the Same Tool

Argo CD and Argo Rollouts sit at different layers.

Argo CD owns reconciliation. It watches the Application definition in Git, renders the desired manifests, and makes the cluster match that desired state. The Application spec gives you the source, destination, and sync policy, including automated sync, prune, self-heal, and retry refresh behavior.

Argo Rollouts owns release strategy. It replaces the plain Deployment release mechanism with a Rollout resource that can move through canary or blue-green phases, run analysis, and use supported traffic routers when you need weighted exposure.

That split keeps the model clean:

Git defines what should exist.
Argo CD makes sure it exists.
Argo Rollouts decides how the new version should be exposed.

The Kubernetes Deployment controller is the baseline comparison, not the target design. Deployments are strong for rolling updates and readiness-gated replacement, but they do not give you a first-class control plane for staged traffic shifts, analysis-driven promotion, or rollback policy tied to metrics. That is the line progressive delivery crosses.

Figure: Argo CD keeps Git and cluster state aligned, while Argo Rollouts and the traffic provider handle staged exposure and release decisions.

A practical control-plane split

The important design question is not "Can we run Argo CD and Argo Rollouts together?" They are designed to work together. The important question is which layer owns which behavior.

Argo CD owns desired-state reconciliation.
Argo Rollouts owns release strategy and version exposure.
Kubernetes Deployment remains the baseline if you do not need a higher-order rollout controller.

If you blur those boundaries, the system becomes hard to debug. If Argo CD is expected to manage traffic weights, or if Argo Rollouts is expected to author Git, you have already lost the clean model.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-gitops.git
    targetRevision: main
    path: apps/checkout
  destination:
    server: https://kubernetes.default.svc
    namespace: checkout
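  syncPolicy:
    automated:
      enabled: true    # explicit on/off switch; only recognized by recent Argo CD releases
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift in the cluster
    retry:
      limit: 3         # assumed value: retry a failed sync up to 3 times
      refresh: true    # only recognized by recent Argo CD releases; check your version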


Canary Releases Need a Real Control Loop

Canary is not just "send a little traffic to the new version." The useful part is the loop: expose a small slice, observe metrics, compare them against a threshold, and only then increase exposure. If you are not measuring and gating, you are not really doing progressive delivery.

Argo Rollouts supports that with step-based canary strategies. The docs show canary steps, pause durations, traffic routing integrations, and analysis hooks. That makes the rollout readable in code instead of spread across dashboards and runbooks.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 6
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/example-org/checkout:1.8.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        nginx:
          stableIngress: checkout-stable
      steps:
        - setWeight: 10
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: checkout-success-rate
            args:
              - name: service-name
                value: checkout
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100

The important part is not the YAML itself. It is the behavior:

You can pause before full exposure.
You can inspect metrics while real traffic is flowing.
You can abort before the rollout reaches the entire user base.
You can keep the stable ReplicaSet available while you verify the canary.

If you need traffic-aware canarying, Argo Rollouts integrates with the traffic router you already run. The exact fields depend on that router, so keep the rollout strategy and the router-specific config aligned with the provider you actually operate. The traffic-management docs currently call out providers such as AWS ALB, Istio, NGINX, Gateway API, SMI, Traefik, Kong, Ambassador, Google Cloud, and APISIX. Note that the SMI specification itself has been archived by the CNCF, so treat SMI examples as legacy and check the current docs before adopting it for new work. That is the key operational point: the rollout model is stable, but the data-plane integration is provider-specific.
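
As a rough sketch of the router-specific side, these are the companion objects the NGINX canary example above refers to by name. The hostname, Service port, and ingress class are assumptions, not values from a real cluster.

```yaml
# Stable and canary Services use the same selector in Git. The Rollout
# controller is expected to narrow each one to a specific ReplicaSet at runtime.
apiVersion: v1
kind: Service
metadata:
  name: checkout-stable
spec:
  selector:
    app: checkout
  ports:
    - port: 80            # assumed Service port
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: checkout-canary
spec:
  selector:
    app: checkout
  ports:
    - port: 80
      targetPort: 8080
---
# The Ingress named in trafficRouting.nginx.stableIngress.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-stable
spec:
  ingressClassName: nginx            # assumed ingress class
  rules:
    - host: checkout.example.com     # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-stable
                port:
                  number: 80
```

With other providers the Services usually look the same, but the routing object changes (for example a VirtualService or an HTTPRoute), which is why the integration has to be validated per router.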

Canary or blue-green?

Use this as a decision table instead of a preference argument.

Question	Canary	Blue-Green
Need gradual exposure?	Strong fit	Weak fit
Need a fast cutover?	Moderate fit	Strong fit
Need to limit blast radius?	Strong fit	Moderate fit
Need a stable preview stack?	Possible, but more moving parts	Strong fit
Need routing integration?	Usually yes for weighted traffic	Often yes for clean switchover
Want the simplest rollback?	Not always	Often easier if the previous stack is preserved

Canary is usually the better default when your main concern is risk reduction under real traffic. Blue-green is often better when your main concern is a clean transition and a very explicit promotion point.

Blue-Green Is About Clean Cutover

Blue-green solves a different problem. Instead of gradually increasing exposure, you run two versions side by side, keep the new one on a preview path, and switch production traffic when the new version is ready.

That is a better fit when the cost of a partial rollout is high and the rollback model should be a simple switch back to the previous stable stack.

Argo Rollouts exposes that model directly with activeService, previewService, and promotion controls.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: ghcr.io/example-org/payments:2.4.0
          ports:
            - containerPort: 8080
  strategy:
    blueGreen:
      activeService: payments-active
      previewService: payments-preview
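      autoPromotionEnabled: false   # pause before switching the active Service
      autoPromotionSeconds: 30      # caution: promotes automatically after 30s of pause; omit for a strict manual gate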
      prePromotionAnalysis:
        templates:
          - templateName: payments-prep-check
      postPromotionAnalysis:
        templates:
          - templateName: payments-post-check
      scaleDownDelaySeconds: 60
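
The Rollout above references a payments-prep-check template without showing it. One possible shape, sketched here with the Job metric provider, is a smoke test against the preview Service. The container image tag, the /healthz path, and the assumption that payments-preview listens on port 8080 are all illustrative.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-prep-check
spec:
  metrics:
    - name: preview-smoke-test
      provider:
        job:
          spec:
            backoffLimit: 0              # do not retry a failed smoke test
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: curlimages/curl:8.8.0   # assumed image tag
                    command:
                      - curl
                      - --fail
                      - --silent
                      - --show-error
                      # hypothetical health endpoint on the preview Service
                      - http://payments-preview:8080/healthz
```

A Job-based metric succeeds when the Job exits with code zero, so this check passes only if the preview stack answers the request successfully.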

Blue-green is not "safer by default" than canary. It is safer for a different class of change. If the failure mode is bad new code plus partial traffic exposure, canary is usually better. If the failure mode is bad cutover and you want a clean preview path, blue-green is usually better.

Argo Rollouts also documents scaleDownDelaySeconds because service selector changes can take time to propagate through node networking. That delay is part of the release safety model, not an optional polish setting. The blue-green docs also call out abortScaleDownDelaySeconds, which controls how long the preview ReplicaSet stays up after an abort. A related canary option, dynamicStableScale, changes how the stable ReplicaSet is scaled as traffic shifts during the rollout.
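
A short sketch of where these fields sit. Both documents are fragments of a Rollout spec rather than complete manifests, and the delay values are placeholders.

```yaml
# Fragment 1: blue-green delays (under spec of a Rollout).
strategy:
  blueGreen:
    activeService: payments-active
    previewService: payments-preview
    autoPromotionEnabled: false
    scaleDownDelaySeconds: 60        # keep the previous ReplicaSet for 60s after the switch
    abortScaleDownDelaySeconds: 30   # keep the preview ReplicaSet for 30s after an abort
---
# Fragment 2: canary with traffic routing (under spec of a Rollout).
strategy:
  canary:
    canaryService: checkout-canary
    stableService: checkout-stable
    dynamicStableScale: true         # shrink the stable ReplicaSet as canary weight grows
    abortScaleDownDelaySeconds: 30   # keep the canary ReplicaSet for 30s after an abort
    trafficRouting:
      nginx:
        stableIngress: checkout-stable
    steps:
      - setWeight: 20
      - pause: { duration: 5m }
```

Shrinking the stable ReplicaSet saves capacity, but it also means an abort has to wait for stable Pods to scale back up before all traffic can return to them.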

Analysis Gates Turn Metrics Into Policy

Analysis is what makes progressive delivery more than a ceremonial rollout.

An AnalysisTemplate defines what to measure and what success looks like. The controller then creates an AnalysisRun and uses the outcome to continue, pause, or abort. The docs support Prometheus-based analysis, other providers, secret arguments, and explicit success and failure conditions.

That also means analysis is not binary in practice. The controller can surface Successful, Failed, or Inconclusive outcomes, and that difference matters. A failed run should abort. An inconclusive run is a signal that the measurement was not strong enough to decide automatically, which often means human judgment is required.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-success-rate
spec:
  args:
    - name: service-name
    - name: prometheus-port
      value: "9090"
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
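      successCondition: result[0] >= 0.95
      failureCondition: result[0] < 0.90   # values from 0.90 up to 0.95 are inconclusive
      # The query assumes Istio telemetry (istio_requests_total); adapt it to your metrics source.
      provider: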
        prometheus:
          address: http://prometheus.example.com:{{args.prometheus-port}}
          query: |
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
            )) /
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
            ))

Two details are worth calling out:

A metric can be successful, failed, or inconclusive depending on the conditions you define.
The docs explicitly note that metric providers can return NaN or infinity, so your conditions should be written with those cases in mind.
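
One way to make the no-data case explicit is sketched below as a fragment of an AnalysisTemplate spec. It assumes the expression helpers isNaN and len are available in your Argo Rollouts version, and it encodes one particular policy: an empty or NaN result counts as a pass.

```yaml
# Fragment: metrics section of an AnalysisTemplate.
metrics:
  - name: success-rate
    interval: 1m
    count: 5
    # Policy choice (assumption): no traffic yet is treated as success.
    successCondition: "len(result) == 0 || isNaN(result[0]) || result[0] >= 0.95"
    # Fail only on a real measurement below the floor.
    failureCondition: "len(result) > 0 && !isNaN(result[0]) && result[0] < 0.90"
```

Treating missing data as success is convenient for low-traffic services but can hide a canary that receives no requests at all. If that risk matters, drop the first two clauses of successCondition so the measurement stays inconclusive instead.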

The analysis docs also show background analysis, where a rollout keeps progressing while the analysis keeps sampling, and the rollout aborts if the analysis crosses a failure threshold. That is often the right pattern for production canaries because it measures the real traffic path instead of a synthetic preflight only.
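
Background analysis is declared on the canary strategy itself rather than as a step. The fragment below is a sketch that reuses the checkout-success-rate template; the startingStep value is an assumption about when sampling should begin.

```yaml
# Fragment: strategy section of a Rollout with background analysis.
strategy:
  canary:
    analysis:
      templates:
        - templateName: checkout-success-rate
      startingStep: 1          # zero-based step index; start sampling at the first pause
      args:
        - name: service-name
          value: checkout
    steps:
      - setWeight: 10
      - pause: { duration: 2m }
      - setWeight: 50
      - pause: { duration: 5m }
```

Delaying the start avoids judging the canary before it has received any traffic.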

That is the right mental model for progressive delivery. Analysis does not "watch the rollout." Analysis decides whether the rollout is allowed to continue.

Rollout Lifecycle

The lifecycle is what turns the separate tools into one operational system.

A developer changes the manifest in Git.
Argo CD sees the Application as OutOfSync and applies the new desired state.
The Rollout controller creates a ReplicaSet for the new version while the stable ReplicaSet keeps serving traffic.
The traffic router shifts exposure according to the canary or blue-green plan.
Analysis runs against the live version and produces a pass, fail, or inconclusive result.
The rollout either promotes the new version or aborts back to the stable version.

That sequence is why the roles matter. Argo CD is not the traffic shifter. Argo Rollouts is not the Git system. The traffic router is not the policy engine. Each layer has one job.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-with-smi
spec:
  replicas: 4
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/example-org/checkout:1.9.0
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        smi:
          trafficSplitName: checkout-split
      steps:
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100

Why Native Deployments Are Not Enough

Native Deployments are excellent for the 80 percent case. If you only need a standard rolling update with basic rollout history, they are enough.

They are not enough when you need one or more of these:

Weighted traffic shifting
A preview environment before production cutover
Pre-promotion or post-promotion analysis
Explicit pause and resume behavior
Rollback logic tied to release health instead of manual judgment

That is why progressive delivery is usually a controller choice, not a manifest tweak.

The release path becomes easier to operate when each layer has one job:

Argo CD keeps the desired manifests in sync.
Argo Rollouts controls the release strategy.
Prometheus or another metrics backend supplies evidence.
The router, if used, handles traffic shifting.

Choosing The Traffic Router

Traffic management is the part most teams underestimate. Argo Rollouts supports traffic-aware delivery, but the support is not generic. It is provider-specific, and the manifest shape depends on the router you actually run.

That means the right question is not "Does Argo Rollouts support traffic?" The right question is "Does my router support the release pattern I need, and does it do so in the way I expect?"

In practice, verify all of the following before standardizing on a router integration:

Whether the provider supports weighted traffic for canary releases.
Whether the provider supports a preview path for blue-green promotion.
Whether the traffic shift is request-level, route-level, or service-selector-based.
How long it takes for selector or routing changes to propagate.
Whether aborting a rollout returns traffic to the stable version immediately or after a delay.

That last point matters because some integrations rely on service selector propagation, which is not instantaneous. The release controller can request a change, but the cluster data plane still has to honor it. That is why fields like scaleDownDelaySeconds exist and why you should treat them as safety controls rather than optional decoration.

If you do not have traffic routing at all, canary still has value, but only as a best-effort exposure model. In that mode, replica weighting is not the same as exact request weighting, and you should not present it to operators as if it were.
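
A sketch of that router-less mode follows. The Rollout name is hypothetical, and the replica count is chosen so the arithmetic is easy to follow.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-basic-canary   # hypothetical name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/example-org/checkout:1.9.0
  strategy:
    canary:
      # No trafficRouting block: weights are approximated by Pod counts.
      maxSurge: 1
      maxUnavailable: 0
      steps:
        - setWeight: 20               # roughly 2 of 10 Pods run the new version
        - pause: {}                   # wait indefinitely for a manual promote
        - setWeight: 50               # roughly 5 of 10 Pods
        - pause: { duration: 5m }
```

Because a single Service load-balances across all ten Pods, the share of requests reaching the new version depends on connection reuse and client behavior, not only on the Pod ratio.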

For critical services, prefer a router configuration that you can test end to end in staging with the same ingress or service-mesh behavior you expect in production. The gap between "supported in docs" and "works under load" is where most rollout surprises live.

What To Validate In Staging

Before you let a rollout strategy touch production, validate the whole release loop in an environment that looks enough like production to expose the failures you care about.

Confirm Argo CD applies the correct manifest revision and respects sync ordering.
Confirm the Rollout controller moves through the intended canary or blue-green phases.
Confirm the traffic router actually shifts exposure the way the manifest says it should.
Confirm analysis can succeed, fail, and remain inconclusive without leaving the rollout in a confusing state.
Confirm abort behavior returns service to the stable revision in the time window you expect.
Confirm the rollback story is still acceptable when automated sync is enabled.
Confirm your alerting can tell the difference between a blocked rollout and a broken rollout.

This is the point where good teams learn whether the strategy is safe enough for their traffic profile. A staged rollout only helps if the failure path is just as deliberate as the happy path.

Safety Caveats

Progressive delivery still has sharp edges, and the docs are explicit about them.

scaleDownDelaySeconds matters because traffic propagation is not instantaneous.
dynamicStableScale changes how much capacity the stable ReplicaSet keeps during rollout.
abortScaleDownDelaySeconds affects how long the new (canary or preview) ReplicaSet stays up after an abort.
Weighted canary without a traffic router is only a best-effort model based on replica counts, not an exact request-level split.
Argo CD automated sync does not make rollback trivial; the Argo CD docs note that rollback cannot be performed against an application with automated sync enabled.

If you treat those as footnotes, you will eventually get surprised in production.

What Argo Rollouts Does Not Do

Argo Rollouts is easy to misunderstand because it sits near GitOps and traffic management.

It does not read or write Git.
It does not replace Argo CD.
It does not collect metrics itself.
It does not guarantee zero-downtime releases.
It does not magically make every traffic provider behave the same way.

The FAQ is clear on the Git boundary: Argo Rollouts changes the cluster state for the release, but it has no Git repository knowledge of its own. Argo CD owns the Git link.

A Practical Operating Model

If you are introducing progressive delivery to a real cluster, start with a narrow policy:

Keep the Application in Argo CD automated sync mode.
Use Rollouts for only the services that justify the extra control plane.
Start with canary if you need gradual exposure, or blue-green if you need a clean cutover path.
Add analysis gates before you trust the rollout to auto-promote.
Expand router-specific traffic management only after the basic control loop is stable.

That approach avoids the two common failure modes: over-engineering every service and under-engineering the risky ones.

If you also use Argo CD sync waves, keep them for dependency ordering and bootstrap sequencing, not for traffic promotion. Sync waves solve apply order. Rollouts solve release exposure.
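
As an illustration of that division of labor, sync-wave annotations can make sure the objects a Rollout depends on exist before the Rollout itself is applied. The wave numbers are arbitrary, and the last two documents are metadata-only fragments whose specs are omitted here.

```yaml
# Wave 0: Services the Rollout refers to by name.
apiVersion: v1
kind: Service
metadata:
  name: checkout-stable
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  selector:
    app: checkout
  ports:
    - port: 80            # assumed Service port
      targetPort: 8080
---
# Wave 0: the analysis template (metadata fragment, spec omitted).
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-success-rate
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# Wave 1: the Rollout itself (metadata fragment, spec omitted).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  annotations:
    argocd.argoproj.io/sync-wave: "1"
```

The waves only order the apply. Once the Rollout object is updated, the canary steps and analysis run on the Rollouts controller's schedule, independent of the sync.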

Frequently Asked Questions

Q: Should Argo CD or Argo Rollouts own the release? A: Argo CD owns Git reconciliation and application sync. Argo Rollouts owns the staged exposure and promotion strategy for the new version. They work together, but they are not interchangeable.

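Q: Can I use Argo Rollouts without traffic routing? A: Yes. Blue-green needs only Services, and a canary without a router still honors pauses and analysis, with setWeight approximated by the ratio of canary to stable replicas. If you want exact weighted traffic shifting, you need a supported traffic router integration.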

Q: What does pre-promotion analysis do in blue-green? A: It blocks promotion until the analysis succeeds. That gives you a gate before production traffic moves to the new stack.

Q: When is blue-green a better choice than canary? A: Blue-green is better when you want a clean preview stack and a fast, explicit switch to production. Canary is better when you want smaller blast radius and gradual exposure.

Q: What does Argo Rollouts do when analysis is inconclusive? A: An inconclusive result means the controller could not confidently mark the run as pass or fail. In practice, that usually leaves the rollout paused and requires human judgment or a policy adjustment.
