Argo CD Auto-Sync and Health Checks: An Operator's Guide to Safe GitOps Reconciliation

Auto-sync and health are not the same thing in Argo CD. This guide shows how reconciliation, pruning, self-healing, retries, health checks, hooks, and sync waves fit together in real clusters.

TL;DR

Argo CD auto-sync and health checks solve different problems. Auto-sync decides when the controller reconciles Git with the cluster, while health checks decide whether the resulting resources are ready, degraded, or suspended. In practice, you need both: automated sync with prune and self-heal keeps desired state converged, retry refresh helps failed syncs recover on new revisions, and health checks plus custom Lua logic make readiness visible to operators. Sync waves and hooks add ordering when resources depend on each other.

The control loop most teams misunderstand

Argo CD has two separate jobs that often get blurred together in day-to-day platform talk.

The first is reconciliation. Should Argo CD apply the desired state from Git to the cluster right now, or should it wait?

The second is health. After the desired state has been applied, are the resources actually ready, still progressing, degraded, or suspended?

That distinction matters. A workload can be synced and still unhealthy. It can also be out of sync and still temporarily healthy. If you collapse those two ideas into one, you end up with confusing operational behavior: dashboards that look "green" when they should not, or automation that treats a successful apply as proof that the service is ready.
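
To make the two-signal idea concrete, here is a minimal Python sketch of a release gate that requires both signals instead of treating a successful apply as proof of readiness. The AppStatus type and release_is_ready function are hypothetical names for illustration; only the status strings come from Argo CD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppStatus:
    sync: str    # "Synced" or "OutOfSync"
    health: str  # "Healthy", "Progressing", "Degraded", "Suspended"


def release_is_ready(status: AppStatus) -> bool:
    """Treat a release as ready only when both signals agree."""
    return status.sync == "Synced" and status.health == "Healthy"


# Synced but not ready: the apply succeeded, the workload did not come up.
print(release_is_ready(AppStatus(sync="Synced", health="Degraded")))    # False
# Healthy but drifted: the old version still serves while Git has moved on.
print(release_is_ready(AppStatus(sync="OutOfSync", health="Healthy")))  # False
print(release_is_ready(AppStatus(sync="Synced", health="Healthy")))     # True
```

A pipeline step or alert rule built this way cannot go green on sync alone.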

On EKS, that boundary is still the right one even when AWS manages Argo CD for you. AWS describes Argo CD as a declarative GitOps CD tool and notes that with EKS capabilities, Argo CD can be fully managed by AWS. That changes who runs the controller, not the meaning of sync and health. AWS EKS Argo CD docs

Figure: Argo CD sync waves diagram showing ordered application of resources by phase and wave. Sync waves are about order. Health is about readiness after order has been applied.

What auto-sync actually does

Auto-sync tells Argo CD to reconcile an application automatically when it detects drift between Git and the live cluster. The official docs are explicit: the controller compares the desired manifests in Git with the cluster state and syncs when the application is OutOfSync. Argo CD auto-sync docs

That is useful because it removes a lot of human coordination from the deployment path. CI can commit changes to Git, and Argo CD can apply them without a manual sync button click or direct API call from the pipeline.

Here is a practical Application manifest that turns on the common operating features most platform teams want:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: storefront
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-manifests.git
    targetRevision: main
    path: apps/storefront
  destination:
    server: https://kubernetes.default.svc
    namespace: storefront
  syncPolicy:
    automated:
      enabled: true
      prune: true
      selfHeal: true
      allowEmpty: false
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
      refresh: true

Three details matter here:

  • prune: true removes resources from the cluster when they disappear from Git.
  • selfHeal: true tells Argo CD to correct live drift, not just Git drift.
  • retry.refresh: true tells Argo CD to run each retry against the latest Git revision instead of the revision that originally failed, which is useful when a fix lands while an earlier sync is still retrying.
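
The retry block in the manifest is easier to reason about with the arithmetic written out. The Python sketch below assumes the usual exponential backoff reading of those fields: the wait before each retry is duration multiplied by factor once per earlier attempt, capped at maxDuration. The function names are hypothetical and the formula is an assumption, not a quote from the controller source.

```python
def parse_seconds(value: str) -> int:
    """Parse the simple '5s' / '3m' / '1h' forms used in the manifest."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]


def backoff_schedule(limit: int, duration: str, factor: int, max_duration: str) -> list:
    """Return the assumed wait, in seconds, before each retry attempt."""
    base = parse_seconds(duration)
    cap = parse_seconds(max_duration)
    return [min(base * factor ** attempt, cap) for attempt in range(limit)]


# Values from the storefront manifest: limit 5, 5s, factor 2, cap 3m.
print(backoff_schedule(5, "5s", 2, "3m"))  # [5, 10, 20, 40, 80]
```

Under that assumption the five retries span roughly two and a half minutes in total, and the 3m cap only starts to matter if the limit is raised to seven or more.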

What auto-sync does not do

Auto-sync does not guarantee readiness.

It does not mean the deployment is healthy. It does not mean every dependent controller has finished reacting. It does not mean every custom resource has the correct health logic attached.

The Argo CD docs also note important semantics:

  • Automated sync only runs when the application is OutOfSync.
  • Argo CD only attempts one automated sync per unique commit SHA and parameter combination unless self-heal is involved.
  • Automated sync will not reattempt a failed sync against the same commit and parameters.
  • Rollback cannot be performed against an application with automated sync enabled.
  • The automated sync interval follows the controller's reconciliation setting in argocd-cm (timeout.reconciliation), which defaults to about two minutes plus jitter. Argo CD auto-sync docs

Those rules are why auto-sync is best treated as a reconciliation policy, not as a deployment success signal.
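
The first three rules can be read as a small decision function. The Python sketch below is a simplified model of those bullets, not the controller's code; the Attempt type, its fields, and should_auto_sync are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attempt:
    commit_sha: str
    parameters: tuple  # hypothetical stand-in for the app's parameter set
    succeeded: bool


def should_auto_sync(sync_status: str, commit_sha: str, parameters: tuple,
                     last: Optional[Attempt], self_heal: bool) -> bool:
    """Decide whether an automated sync should start under the listed rules."""
    if sync_status != "OutOfSync":
        return False  # rule 1: only OutOfSync apps are synced
    same_target = (last is not None
                   and (last.commit_sha, last.parameters) == (commit_sha, parameters))
    if not same_target:
        return True   # a new commit or parameter set always gets one attempt
    if not last.succeeded:
        return False  # rule 3: no automatic retry against the same target
    return self_heal  # rule 2: a repeat sync needs self-heal


failed = Attempt("abc123", ("env=prod",), succeeded=False)
print(should_auto_sync("OutOfSync", "abc123", ("env=prod",), failed, True))  # False
print(should_auto_sync("OutOfSync", "def456", ("env=prod",), failed, True))  # True
```

The second call shows why a failed sync usually recovers through a new commit rather than through waiting.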

Why prune and self-heal are separate knobs

Teams often enable auto-sync and stop there. That usually leaves two operational gaps.

The first is garbage collection. If a manifest is removed from Git, should Argo CD delete the live object too? That is what prune controls. Without it, you can accumulate stale resources and wonder why old config still exists.

The second is drift correction. If someone patches a live Deployment or ConfigMap by hand, should Argo CD restore the Git version? That is what self-heal controls.

Those are different failure modes:

  • Prune handles "Git no longer wants this object."
  • Self-heal handles "The live object changed outside Git."

In other words, prune is about deletion intent and self-heal is about drift intent.
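
The difference shows up clearly as set logic. The Python sketch below compares a desired-state map with a live-state map and sorts objects into the two buckets. It assumes that Git has not changed since the last sync (so any difference came from the live side) and that the live map contains only objects tracked by the application. The function name and the sample objects are hypothetical.

```python
def classify_drift(git: dict, live: dict) -> dict:
    """Split differences into prune candidates and self-heal candidates."""
    return {
        # Git no longer wants this object: only prune removes it.
        "prune": sorted(key for key in live if key not in git),
        # The live object changed outside Git: only self-heal restores it.
        "self_heal": sorted(key for key in git
                            if key in live and git[key] != live[key]),
    }


git_state = {
    "Deployment/api": {"replicas": 3},
    "ConfigMap/settings": {"mode": "prod"},
}
live_state = {
    "Deployment/api": {"replicas": 5},        # scaled by hand
    "ConfigMap/settings": {"mode": "prod"},
    "ConfigMap/legacy": {"mode": "old"},      # manifest deleted from Git
}

print(classify_drift(git_state, live_state))
# {'prune': ['ConfigMap/legacy'], 'self_heal': ['Deployment/api']}
```

With prune off, ConfigMap/legacy stays in the cluster; with self-heal off, the hand-scaled Deployment keeps five replicas until the next Git change.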

Health is a separate signal

Argo CD surfaces health at the resource level and then aggregates it into the Application view. Built-in health checks exist for common Kubernetes objects such as Deployments, StatefulSets, DaemonSets, Services, Ingresses, Jobs, CronJobs, and PVCs. The health states you will typically see are Healthy, Progressing, Degraded, and Suspended; Missing and Unknown also exist for resources that are absent or cannot be assessed. Argo CD health docs

This is where a lot of platform teams make a conceptual mistake. They assume "synced" means "good." It does not. Sync says Git and cluster match. Health says whether the resource is actually ready or degraded according to its status.

That distinction is why an app-of-apps tree or a dependency chain may need health-based gating even after sync has completed. Argo CD can only make good orchestration decisions if the resource type has a useful health model.
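
One way to picture the aggregation step is "worst status wins." The Python sketch below rolls per-resource health up into one Application-level value. The severity ordering and the function name are assumptions made for illustration, limited to the four states named above; check the Argo CD health docs for the exact rules.

```python
# Assumed ordering from least to most severe, for illustration only.
SEVERITY = ["Healthy", "Suspended", "Progressing", "Degraded"]


def aggregate_health(resource_health: dict) -> str:
    """Return the most severe health status among the given resources."""
    if not resource_health:
        return "Healthy"
    return max(resource_health.values(), key=SEVERITY.index)


app = {
    "Deployment/payments-api": "Healthy",
    "Certificate/payments-tls": "Progressing",
    "Service/payments-api": "Healthy",
}
print(aggregate_health(app))  # Progressing
```

A single resource without a useful health model can therefore hold the whole Application in Progressing, which is the usual motivation for a custom check.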

Custom health checks

When a built-in check is not enough, Argo CD supports custom health checks written in Lua. The docs describe defining them in argocd-cm, and the script returns a health status and optional message. This is the right tool when a CRD exposes meaningful readiness in its status fields but Argo CD does not know how to interpret it yet. Argo CD health docs

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.cert-manager.io_Certificate: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for i, condition in ipairs(obj.status.conditions) do
          if condition.type == "Ready" and condition.status == "False" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          if condition.type == "Ready" and condition.status == "True" then
            hs.status = "Healthy"
            hs.message = condition.message
            return hs
          end
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for Ready condition"
    return hs

That pattern is reusable. The exact status path will change per CRD, but the operating principle does not: expose controller truth to Argo CD instead of relying on a generic fallback.
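
Because the check is just a mapping from status conditions to a health value, it can be prototyped and unit-tested outside the cluster before it goes into argocd-cm. The Python function below is a hand-written port of the Lua script above for that purpose; Argo CD itself only runs the Lua version, and the function name is hypothetical.

```python
def certificate_health(obj: dict) -> dict:
    """Mirror the Lua check: map the Ready condition to a health status."""
    conditions = (obj.get("status") or {}).get("conditions") or []
    for condition in conditions:
        if condition.get("type") != "Ready":
            continue
        if condition.get("status") == "False":
            return {"status": "Degraded", "message": condition.get("message")}
        if condition.get("status") == "True":
            return {"status": "Healthy", "message": condition.get("message")}
    # No status yet, or no decisive Ready condition: still converging.
    return {"status": "Progressing", "message": "Waiting for Ready condition"}


issued = {"status": {"conditions": [
    {"type": "Ready", "status": "True", "message": "Certificate is up to date"},
]}}
print(certificate_health(issued)["status"])  # Healthy
print(certificate_health({})["status"])      # Progressing
```

Feeding it sample status payloads captured from real objects is a cheap way to catch a wrong status path before operators see a resource stuck in Progressing.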

Sync waves and hooks are the ordering layer

Health tells you whether a resource is ready. Sync waves and hooks tell Argo CD what to apply first.

Argo CD orders sync by phase, then by wave, then by kind, then by name. Lower waves run first, and the next wave is applied only once the resources in the current wave are in sync and healthy. The controller also pauses briefly between waves so other controllers can react to the updated spec before the next wave proceeds. Argo CD sync waves docs

That matters for real bootstrap flows:

  • Apply a Namespace before namespaced workloads.
  • Install CRDs before custom resources.
  • Run schema migrations before the app Deployment.
  • Gate a rollout on a PreSync or Sync hook when a job must finish first.

Here is a minimal example:

apiVersion: v1
kind: Namespace
metadata:
  name: payments
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  namespace: payments
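  annotations:
    # Sync rather than PreSync: PreSync hooks run before the Sync phase,
    # so the wave -1 Namespace above would not exist yet.
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/sync-wave: "0"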
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: ghcr.io/example-org/payments-migrate:1.8.0
          command: ["./migrate.sh"]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: api
          image: ghcr.io/example-org/payments-api:2.4.1
          ports:
            - containerPort: 8080

Two caveats keep this honest:

  • Hooks are lifecycle tools, not a substitute for good health checks.
  • Waves do not make unhealthy resources healthy; they only prevent later resources from running too early.
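
The ordering rule from this section (phase, then wave, then kind, then name) maps directly onto a tuple sort key. The Python sketch below is a simplified model: the resource dictionaries, the notify-start hook, and the short KIND_PRIORITY list are hypothetical, and the real controller uses a longer built-in kind ordering.

```python
PHASES = ["PreSync", "Sync", "PostSync"]
# Hypothetical short list; kinds not named here sort after these.
KIND_PRIORITY = ["Namespace", "CustomResourceDefinition"]


def kind_rank(kind: str) -> int:
    return KIND_PRIORITY.index(kind) if kind in KIND_PRIORITY else len(KIND_PRIORITY)


def sync_order(resources: list) -> list:
    """Sort resources by phase, then wave, then kind, then name."""
    return sorted(
        resources,
        key=lambda r: (PHASES.index(r["phase"]), r["wave"], kind_rank(r["kind"]), r["name"]),
    )


resources = [
    {"phase": "Sync", "wave": 1, "kind": "Deployment", "name": "payments-api"},
    {"phase": "Sync", "wave": -1, "kind": "Namespace", "name": "payments"},
    {"phase": "PreSync", "wave": 5, "kind": "Job", "name": "notify-start"},
    {"phase": "Sync", "wave": 0, "kind": "Job", "name": "db-migrate"},
]
print([r["name"] for r in sync_order(resources)])
# ['notify-start', 'payments', 'db-migrate', 'payments-api']
```

Note that the PreSync hook sorts first despite its wave of 5: phase outranks wave, so a wave number only orders resources within the same phase.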

A workable operator model

The safest pattern is to treat Argo CD like two coordinated systems inside one controller:

  • Reconciliation policy: auto-sync, prune, self-heal, and retry refresh.
  • Readiness policy: built-in health checks, custom Lua health checks, and the status of dependent resources.

Once you think that way, the operational model gets simpler:

  1. Use auto-sync for the steady-state path.
  2. Use prune only when deletion from Git should really delete from the cluster.
  3. Use self-heal for the classes of drift you actually want corrected.
  4. Use sync waves and hooks only when ordering matters.
  5. Add custom health checks when the default status model is too shallow for your CRDs.

That approach is especially useful on EKS because the controller may be managed for you, but the application lifecycle problems remain the same.
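
The reconciliation half of that model can be reviewed mechanically. The Python sketch below reads a syncPolicy block, shaped like the one in the storefront manifest, and spells out in plain language what it opts into. The function name and the wording of the notes are hypothetical; treat it as a review aid, not a validator.

```python
def review_sync_policy(sync_policy: dict) -> list:
    """Return plain-language notes about what a syncPolicy block opts into."""
    automated = sync_policy.get("automated")
    if automated is None or automated.get("enabled") is False:
        return ["auto-sync off: every sync is a manual or API-triggered action"]
    notes = ["auto-sync on: OutOfSync apps are reconciled automatically"]
    if automated.get("prune"):
        notes.append("prune on: objects removed from Git are deleted from the cluster")
    else:
        notes.append("prune off: objects removed from Git stay in the cluster")
    if automated.get("selfHeal"):
        notes.append("self-heal on: manual changes to live objects are reverted")
    else:
        notes.append("self-heal off: manual changes persist until the next Git change")
    if "retry" not in sync_policy:
        notes.append("no retry block: a failed sync waits for a new commit")
    return notes


policy = {"automated": {"enabled": True, "prune": True, "selfHeal": False}}
for note in review_sync_policy(policy):
    print(note)
```

Running it over every Application in a repository makes it obvious which teams enabled prune or self-heal deliberately and which inherited them from a template.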

Why this matters on EKS

AWS now positions Argo CD as part of its managed EKS capabilities, which removes controller operations from your day-to-day burden. That is useful, but it does not remove the need for precise GitOps semantics. AWS EKS Argo CD docs

If your platform team is responsible for dozens of services, the cost of getting these semantics wrong is usually not a crash. It is ambiguity:

  • Why did Argo CD sync but the service still failed readiness?
  • Why did a deleted manifest remain in the cluster?
  • Why did the rollout wait on the wrong resource?
  • Why does a custom resource stay stuck in Progressing forever?

The answer is usually that sync policy, health modeling, and orchestration order were treated as one thing when they are really three.

Frequently Asked Questions

Q: Can a synced application still be unhealthy?
A: Yes. Sync means the live resources match the desired manifests in Git. Health means the resources are ready or degraded according to their status. Those are separate signals, and you should alert or gate on them differently.

Q: Should I always enable prune and self-heal?
A: No. Enable prune when deleting a manifest from Git should delete the live object. Enable self-heal when you want Argo CD to correct drift caused by manual cluster changes. Many teams want both, but neither should be a reflex.

Q: When should I use retry refresh?
A: Use retry.refresh: true when you want a failed sync retry to pick up newer Git revisions during the retry cycle. It is useful in active delivery pipelines where commits may land while an earlier sync is still retrying.

Q: What is the best way to model readiness for a CRD?
A: Define a custom health check in argocd-cm and map the CRD's status fields to Argo CD health. That gives you a resource-specific readiness signal instead of relying on a generic Progressing fallback.

Q: Do sync waves replace hooks?
A: No. Waves are for ordering. Hooks are for lifecycle phases such as PreSync, Sync, PostSync, and SyncFail. They work together, and the right choice depends on whether you need sequencing or side effects.
