Multi-Tenant GitOps with Argo CD: Isolation Patterns That Survive Production

Multi-tenant Argo CD is not just RBAC. This deep dive shows how to combine AppProjects, namespaces, OIDC groups, sync controls, and policy gates into a safer GitOps platform.

TL;DR

Multi-tenant Argo CD works only when tenancy is enforced at several layers at once: Git repository boundaries, AppProject source and destination rules, Kubernetes namespaces, OIDC-backed RBAC, admission policy, and deletion controls. The practical model is to give each tenant a narrow AppProject, generate applications from approved repository paths, constrain sync and prune behavior, and let Kubernetes enforce runtime quotas and policy. This keeps GitOps self-service useful in production without turning the Argo CD control plane into a shared cluster-admin escape hatch.

Figure: tenant Git repositories flow into Argo CD AppProjects, RBAC groups, ApplicationSets, namespace quotas, admission policies, and scoped cluster resources. A production Argo CD tenant boundary is layered: Git path, AppProject, identity group, namespace, policy, and runtime quota all carry part of the isolation model.

The Real Multi-Tenant Argo CD Problem

The easiest way to make Argo CD "multi-tenant" is to create accounts and call it done. That is also the easiest way to create a shared deployment plane where every team can accidentally route around the platform's guardrails.

Argo CD is powerful because it reconciles desired state from Git into Kubernetes. In a single-team cluster, that usually feels straightforward: one Git repository, one Argo CD instance, one cluster, and a small group of trusted operators. In a platform environment, the trust model changes. Multiple teams want self-service deployment, each team owns different repositories, some workloads are production, and some namespaces contain shared infrastructure. A mistake in Git can become a cluster mutation minutes later.

That means Argo CD tenancy cannot live in one control. It needs layered enforcement:

  • Git controls which manifests a tenant can submit.
  • AppProjects control which source repositories, clusters, namespaces, and resource kinds Argo CD can deploy.
  • Argo CD RBAC controls which users and groups can create, sync, override, or delete applications.
  • Kubernetes namespaces, quotas, service accounts, and admission policies control what the applied workload can do at runtime.
  • Sync windows, prune rules, finalizers, and progressive rollout gates control change timing and blast radius.

The architecture goal is not "every tenant gets cluster admin through Git." The goal is narrower: every tenant gets enough autonomy to ship their own workloads, while the platform team keeps strong control over cluster-scoped resources, shared ingress, policy, identity, and deletion paths.

Start With The Isolation Boundary

A practical tenant boundary usually maps to one of these units:

Boundary | Use When | Risk If Too Broad
Team | One engineering team owns several services | Shared permissions can hide service-level ownership gaps
Product | Multiple teams operate one product surface | Emergency access can sprawl across unrelated services
Environment | Staging and production need different controls | Lower environments may inherit production friction
Compliance zone | PCI, regulated, or data-sensitive workloads | Too many exceptions make policy hard to reason about

For most platform teams, the clean starting point is one AppProject per team per environment class. For example, payments-nonprod and payments-prod are easier to reason about than one global payments project. The production project can block manual sync overrides during deny windows, restrict sync timing, and require tighter admission policy while the non-production project stays more permissive.

Argo CD's project model supports this shape. An AppProject can restrict allowed source repositories, destinations, namespace-scoped resource kinds, cluster-scoped resource kinds, orphaned resource monitoring, sync windows, and project-local roles. The important design choice is to treat AppProjects as security boundaries, not folder labels.

AppProject As The First Hard Gate

The default project is convenient for demos, but it is too permissive for a shared platform. A tenant project should pin all of the following:

  • Source repositories the tenant may deploy from.
  • Destination clusters and namespaces the tenant may deploy into.
  • Namespace-scoped Kubernetes kinds the tenant may manage.
  • Cluster-scoped kinds the tenant may manage, usually none by default.
  • Project roles mapped to identity-provider groups.
  • Sync windows for sensitive environments.

A production tenant project can look like this:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: payments-prod
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  description: Production GitOps boundary for the payments team
  sourceRepos:
    - https://github.com/example-org/payments-platform.git
    - https://github.com/example-org/payments-services.git
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments-prod
  clusterResourceWhitelist: []
  namespaceResourceWhitelist:
    - group: ""
      kind: ConfigMap
    - group: ""
      kind: Service
    - group: apps
      kind: Deployment
    - group: autoscaling
      kind: HorizontalPodAutoscaler
    - group: networking.k8s.io
      kind: NetworkPolicy
  orphanedResources:
    warn: true
  syncWindows:
    - kind: deny
      schedule: "0 9 * * 1-5"
      duration: 8h
      applications:
        - "*"
      manualSync: false
  roles:
    - name: deployer
      description: Payments production deploy access
      groups:
        - oidc:payments-prod-deployers
      policies:
        - p, proj:payments-prod:deployer, applications, get, payments-prod/*, allow
        - p, proj:payments-prod:deployer, applications, sync, payments-prod/*, allow

This example is intentionally restrictive. It does not allow tenants to create Namespace, ClusterRole, CustomResourceDefinition, MutatingWebhookConfiguration, or storage classes. Those resources are platform-owned because one tenant's cluster-scoped object can change behavior for another tenant.

The key implementation detail is that the destination namespace and source repository restrictions must match the repository structure. If payments-prod can deploy from a shared repository path where another team can write manifests, the AppProject boundary is weaker than it looks.

RBAC: Bind Groups To Capabilities, Not To Good Intentions

Argo CD RBAC uses policy rules and identity information from local accounts or SSO/OIDC. In SSO setups, groups or other configured scopes are read from the identity token and matched to Argo CD policies. That gives you a clean way to map platform groups to project capabilities.

Keep global Argo CD roles small. Most tenant permissions should live in project roles or project-scoped policies. A common pattern is:

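  • role:readonly: view-only access. The built-in role:readonly can read every application, so use it as policy.default only if cross-tenant visibility is acceptable.
  • proj:<project>:developer: view and sync apps in a non-production project, for example proj:payments-nonprod:developer.
  • proj:<project>:deployer: sync production apps during approved windows, for example proj:payments-prod:deployer.
  • role:platform-admin: tightly held platform team role for cluster-scoped controls.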

Example argocd-rbac-cm fragment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
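  policy.default: role:readonly
  scopes: '[groups, email]'
  policy.csv: |
    # These two rules restate part of the built-in role:readonly so the default is visible here.
    p, role:readonly, applications, get, */*, allow
    p, role:readonly, projects, get, *, allow

    # Map identity-provider groups to roles. Group names must match the token claim exactly.
    g, oidc:platform-admins, role:admin
    g, oidc:payments-prod-deployers, proj:payments-prod:deployer

    # Non-production self-service: create and update are allowed, delete is explicitly denied.
    p, role:tenant-app-admin, applications, create, payments-nonprod/*, allow
    p, role:tenant-app-admin, applications, update, payments-nonprod/*, allow
    p, role:tenant-app-admin, applications, delete, payments-nonprod/*, deny
    # Assumed group name. Without a binding like this, role:tenant-app-admin applies to no one.
    g, oidc:payments-developers, role:tenant-app-admin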

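The explicit deny on delete is not paranoia. Argo CD already refuses any action that no rule allows, but an explicit deny also overrides an allow that a user inherits from another role or group. In GitOps, delete can mean removing the Argo CD Application object, deleting tracked Kubernetes objects, or triggering finalizer behavior depending on how the application is configured. The example denies delete even in non-production; production deletion should usually be a separate break-glass or platform-mediated operation.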

Also check the identity token, not just the Argo CD policy. If your OIDC provider does not emit the expected group claim, the policy may look correct while users fall back to a default role. Treat group-claim verification as part of platform onboarding.
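One way to make the group claim explicit is to request it in the OIDC configuration. The fragment below is a minimal sketch, assuming a generic OIDC provider: the Argo CD URL, issuer URL, client ID, and secret key name are placeholders, and the claim name groups has to match what your provider actually emits.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  url: https://argocd.example.com            # placeholder external URL
  oidc.config: |
    name: SSO
    issuer: https://idp.example.com          # placeholder issuer
    clientID: argocd                         # placeholder client ID
    clientSecret: $oidc.clientSecret         # assumed key name in argocd-secret
    # Ask the provider for the groups scope and mark the claim as essential,
    # so a token without groups is easier to spot during onboarding.
    requestedScopes: ["openid", "profile", "email", "groups"]
    requestedIDTokenClaims:
      groups:
        essential: true
```

After a test login, running argocd account get-user-info should list the groups Argo CD received, which is a quick check that a tenant is not silently landing on policy.default.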

Kubernetes Still Enforces Runtime Reality

Argo CD controls desired state application. Kubernetes controls workload execution. You need both.

At minimum, every tenant namespace should have:

  • ResourceQuota to cap aggregate CPU, memory, object counts, and persistent volume claims.
  • LimitRange to force default requests and limits.
  • NetworkPolicy to block default east-west traffic when the CNI supports it.
  • Dedicated service accounts with minimal RBAC.
  • Admission policy to block privileged pods, hostPath mounts, unsafe capabilities, mutable tags, and unapproved registries.

Example namespace baseline:

apiVersion: v1
kind: Namespace
metadata:
  name: payments-prod
  labels:
    tenant.platform.example.com/name: payments
    pod-security.kubernetes.io/enforce: restricted
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-prod-quota
  namespace: payments-prod
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 80Gi
    limits.cpu: "40"
    limits.memory: 160Gi
    pods: "80"
    services.loadbalancers: "0"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress

This namespace baseline should be platform-owned. Tenants can deploy workloads into the namespace, but they should not be able to remove the guardrails that make the namespace safe to share on the same cluster.
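The baseline above does not include the LimitRange from the checklist. It matters in practice: once a ResourceQuota tracks CPU and memory requests and limits, pods that omit them can be rejected by the quota, so defaults keep tenant deployments working. A minimal sketch, with values that are illustrative assumptions rather than recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: payments-prod-defaults     # hypothetical name
  namespace: payments-prod
spec:
  limits:
    - type: Container
      # Applied when a container does not declare requests.
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      # Applied when a container does not declare limits.
      default:
        cpu: 500m
        memory: 512Mi
      # Upper bound for any single container in the namespace.
      max:
        cpu: "4"
        memory: 8Gi
```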

ApplicationSet Tenancy: Useful, But Easy To Over-Permission

ApplicationSet is attractive in platform environments because it can generate many Argo CD Applications from Git directories, cluster lists, pull requests, or files. For tenancy, the Git generator and matrix generator are especially useful. They let platform teams create an application factory where each service gets a predictable Application object.

The dangerous pattern is templating the project field from tenant-controlled data. Argo CD's own ApplicationSet documentation warns that Git generators can be used by non-admin developers to create Applications, and that templated project fields can let users create Applications under Projects with excessive permissions.
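To make the risk concrete, here is a sketch of the pattern to avoid. The ApplicationSet name and the project and namespace keys read from the tenant's file are hypothetical; the point is only that whoever can edit service.yaml chooses the project and the namespace.

```yaml
# ANTI-PATTERN: shown only to illustrate the risk. Do not copy.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: tenant-services            # hypothetical name
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example-org/payments-services.git
        revision: main
        files:
          - path: services/*/service.yaml
  template:
    metadata:
      name: '{{name}}'
    spec:
      project: '{{project}}'       # tenant-controlled: could name any AppProject
      source:
        repoURL: https://github.com/example-org/payments-services.git
        targetRevision: main
        path: 'services/{{name}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{namespace}}' # tenant-controlled: could name any namespace
```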

A safer pattern is to fix the project in the template and let repository data control only low-risk fields such as app name, path, and namespace within an approved prefix.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-services-prod
  namespace: argocd
spec:
  syncPolicy:
    preserveResourcesOnDeletion: true
  generators:
    - git:
        repoURL: https://github.com/example-org/payments-services.git
        revision: main
        files:
          - path: services/*/service.yaml
  template:
    metadata:
      name: 'payments-{{name}}-prod'
      labels:
        tenant.platform.example.com/name: payments
    spec:
      project: payments-prod
      source:
        repoURL: https://github.com/example-org/payments-services.git
        targetRevision: main
        path: 'services/{{name}}/overlays/prod'
      destination:
        server: https://kubernetes.default.svc
        namespace: payments-prod
      syncPolicy:
        automated:
          prune: false
          selfHeal: true
        syncOptions:
          - CreateNamespace=false
          - PruneLast=true

Notice the guardrails:

  • project is fixed to payments-prod.
  • repoURL is fixed to the tenant's approved repository.
  • destination.namespace is fixed to the tenant namespace.
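  • CreateNamespace=false keeps Argo CD from creating the destination namespace during sync, so namespace creation stays with platform onboarding. This is the default, but stating it makes the intent explicit; Namespace manifests in the repository are blocked separately by the AppProject's empty clusterResourceWhitelist.
  • prune: false avoids automatic deletion until the team has stronger operational maturity.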
  • preserveResourcesOnDeletion=true prevents generated Applications from receiving the Argo CD resources finalizer, which is useful when deleting the generator should not delete production workloads.

You can later enable automated prune, but do it per project and after you have tested orphaned resource reporting, finalizer behavior, and recovery paths.
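For reference, the Git files generator in the ApplicationSet above takes its template parameters from the contents of each matched file. A minimal sketch of one such file, assuming the service is called api and that name is the only key the template reads:

```yaml
# services/api/service.yaml (hypothetical parameter file owned by the tenant)
# The top-level key below is what the template references as {{name}}.
name: api
```

Because the tenant controls this file, it is worth keeping the set of keys the template consumes small. If you would rather not trust file contents at all, the generator also exposes path-derived parameters such as path.basename, which come from the directory layout instead.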

Resource Exclusions And Allow-Lists

AppProject resource allow-lists decide what a tenant Application is allowed to manage. Argo CD also has controller-level resource.exclusions and resource.inclusions settings in argocd-cm. Use them sparingly: they change what Argo CD discovers and syncs, so they are platform-wide behavior rather than tenant-specific authorization.

A shared-control-plane configuration can exclude noisy or controller-owned resources while keeping the inclusion list limited to resource families the platform expects Argo CD to reconcile:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.exclusions: |
    - apiGroups:
        - events.k8s.io
      kinds:
        - Event
      clusters:
        - "*"
    - apiGroups:
        - coordination.k8s.io
      kinds:
        - Lease
      clusters:
        - "*"
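  # When resource.inclusions is set, Argo CD manages only the listed API groups,
  # so every group a tenant AppProject allows (autoscaling here) must appear below.
  resource.inclusions: |
    - apiGroups:
        - ""
        - apps
        - autoscaling
        - batch
        - networking.k8s.io
        - external-secrets.io
        - argoproj.io
      kinds:
        - "*"
      clusters:
        - "*"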

Do not use exclusions to hide a security problem. If tenants must not create ClusterRole, MutatingWebhookConfiguration, or raw Secret objects, enforce that with AppProject allow-lists, Kubernetes RBAC, and admission policy. Resource exclusions are most useful when Argo CD should ignore resources that are high-volume, generated by another controller, or intentionally outside the GitOps ownership model.

Sync, Prune, And Deletion Blast Radius

Automated sync is not a binary good or bad choice. It is a control loop. The risk depends on what the loop is allowed to change.

For non-production tenant workloads, automated sync with self-heal often gives useful feedback. For production, the platform should decide whether changes deploy immediately, during windows, or after a separate progressive-delivery controller has evaluated metrics. Argo CD supports sync windows at the project level, and sync waves/hooks can order resources inside a sync operation.

Prune deserves special handling. If a manifest disappears from Git and prune is enabled, Argo CD may delete the corresponding Kubernetes resource. That is desirable for cleaning up stale objects, but dangerous for shared resources, stateful workloads, and resources that were moved between applications.

A safer production rollout often uses this progression:

  1. Enable automated sync without prune.
  2. Enable orphaned resource warnings.
  3. Add policy that blocks tenants from managing platform-owned kinds.
  4. Test deletion behavior in staging with the same AppProject rules.
  5. Enable prune only for specific application classes.

Example application sync policy:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-prod
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: payments-prod
  source:
    repoURL: https://github.com/example-org/payments-services.git
    targetRevision: main
    path: services/api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-prod
  syncPolicy:
    automated:
      prune: false
      selfHeal: true
    syncOptions:
      - ApplyOutOfSyncOnly=true
      - PruneLast=true

PruneLast=true is not a substitute for review, but it can reduce ordering surprises by pruning after other resources are applied. For workloads with database migrations, use sync waves or hooks deliberately rather than hoping lexical manifest order will save you.
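As an illustration of using hooks deliberately, a database migration can run as a PreSync hook so it finishes before the Deployment is updated. This is a sketch under assumptions: the Job name, image, and arguments are hypothetical, the digest is a placeholder, and the payments-prod AppProject shown earlier would likely need batch/Job added to its namespaceResourceWhitelist before this could sync.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: payments-api-migrate       # hypothetical name
  namespace: payments-prod
  annotations:
    # Run before the main sync; a failed hook fails the sync operation.
    argocd.argoproj.io/hook: PreSync
    # Delete the previous hook Job just before creating a new one.
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/payments/api-migrate@sha256:<digest>  # placeholder
          args: ["migrate", "up"]  # hypothetical migration command
          resources:
            requests: {cpu: 100m, memory: 128Mi}
            limits: {cpu: 500m, memory: 256Mi}
          securityContext:         # settings expected by the restricted Pod Security level
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            capabilities: {drop: ["ALL"]}
            seccompProfile: {type: RuntimeDefault}
```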

The finalizer is the sharper control. Argo CD's deletion docs state that resources-finalizer.argocd.argoproj.io makes the application controller cascade-delete the Application's managed resources when the Application is deleted. ApplicationSet-generated Applications add another layer: if preserveResourcesOnDeletion is false, generated Applications can receive that finalizer, so deleting an ApplicationSet can delete generated Applications and then the workloads those Applications manage.

That behavior is useful for preview environments. It is dangerous for production if teams do not understand ownership. A good rule is: preview ApplicationSets may cascade, production ApplicationSets should preserve by default, and production decommissioning should be a reviewed workflow rather than a side effect of removing a generator file.

Secrets And Workload Identity

Tenant Git repositories should not contain plaintext credentials. Even if the repository is private, Git history, forks, local clones, CI logs, and code search can turn a leaked secret into a long-lived incident.

Argo CD's secret-management guidance is intentionally unopinionated: teams use External Secrets Operator, Sealed Secrets, Vault-backed workflows, SOPS, and plugins. In a multi-tenant platform, the safest default is to keep secret values outside tenant Git and let tenant manifests reference a namespace-scoped secret source.

Common production patterns include:

  • External Secrets Operator or a similar controller that syncs secrets from a cloud secret manager into tenant namespaces.
  • Sealed or encrypted secrets when the team explicitly accepts Git-stored encrypted values and key rotation processes.
  • Cloud workload identity where pods get short-lived access to cloud APIs without static keys.
  • Admission policy that blocks raw Secret manifests unless they match an approved exception.

Only the platform-owned secret controller should hold cloud secret-manager permissions. Tenant workloads should get only the runtime secret values or workload identity bindings they need. Avoid handing each tenant a broad cloud credential and calling it GitOps.
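With External Secrets Operator, the tenant side of this split can be a manifest that names a secret without containing it. The sketch below assumes a platform-owned SecretStore already exists in the namespace; the store name, remote key, and property are hypothetical, and the API version depends on the operator release you run. The tenant AppProject would also need to allow the external-secrets.io ExternalSecret kind.

```yaml
apiVersion: external-secrets.io/v1beta1   # check which version your operator serves
kind: ExternalSecret
metadata:
  name: payments-api-db
  namespace: payments-prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore                # namespace-scoped store, owned by the platform team
    name: payments-prod-store        # hypothetical store name
  target:
    name: payments-api-db            # Kubernetes Secret the controller creates
  data:
    - secretKey: DATABASE_PASSWORD   # key inside the generated Secret
      remoteRef:
        key: payments/prod/api/database   # hypothetical path in the secret manager
        property: password
```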

Policy Gates Close The Gap Between Git And Cluster

Argo CD can say "this Application is allowed to deploy a Deployment into payments-prod." It cannot by itself prove that the Deployment uses a signed image, has resource requests, avoids privileged mode, and references only approved registries. That is the job of admission policy.

Kyverno, Gatekeeper, Kubernetes ValidatingAdmissionPolicy, and Sigstore policy-controller all fit into this layer depending on your standardization. The important point is placement: policy should run in the cluster admission path, not only in a CI job. CI catches issues earlier; admission prevents bypass.

Example policy intent:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-digest-and-resources
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-image-digest
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Container images must be pinned by digest in production."
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"
    - name: require-requests-and-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"

Use this as a policy shape, not a universal copy-paste. Real production policies need exceptions, test coverage, and compatibility checks for init containers, ephemeral containers, and generated pods.
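One compatibility check is scope. As written, both rules match Pods in every namespace, including system namespaces, even though the message mentions production. A hedged sketch of a narrower match block, assuming production namespaces carry a hypothetical platform.example.com/environment label that the namespace baseline shown earlier does not yet set:

```yaml
# Fragment: replaces the match block of a rule in the ClusterPolicy above.
match:
  any:
    - resources:
        kinds:
          - Pod
        # Only evaluate Pods in namespaces labeled as production.
        namespaceSelector:
          matchLabels:
            platform.example.com/environment: prod   # hypothetical label
```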

Why We Built It This Way

The model holds up because no single control has to be perfect.

If a developer pushes a risky manifest, AppProject rules can block forbidden destinations or resource kinds. If the AppProject is too broad, Kubernetes admission policy can still reject privileged pods or unsigned images. If a user has sync permission but not delete permission, they can reconcile approved applications but cannot delete them or trigger cascade deletion of their workloads. If a repository path changes unexpectedly, the fixed project name in the ApplicationSet template prevents a tenant from silently moving into a more privileged project.

This is the same reason production Kubernetes platforms separate control-plane operations from tenant workload operations. Argo CD should be a deployment interface, not a universal privilege escalator.

Tenant Onboarding Flow

A repeatable onboarding process keeps the platform from accumulating one-off exceptions:

  1. Create the tenant namespace, quota, limit range, default network policy, and service accounts.
  2. Create the tenant AppProject with explicit source repositories and destinations.
  3. Map identity-provider groups to project roles and verify token claims.
  4. Create an ApplicationSet from approved repository paths.
  5. Add admission policy labels or namespace selectors.
  6. Run a dry-run sync in non-production.
  7. Enable production sync windows and decide prune behavior.
  8. Document break-glass access and deletion approvals.

If this flow takes days, automate it. A platform repository can own tenant bootstrap manifests, while application repositories own service manifests. That separation keeps tenants productive without letting application teams rewrite the platform boundary.

Common Failure Modes

The most common failure is a broad AppProject with sourceRepos: ["*"] and destinations that include every namespace. That is not multi-tenancy; it is shared admin through a nicer UI.

The second failure is trusting repository structure without repository permissions. If a tenant can merge a change to another tenant's production path, Argo CD will not know that the organizational boundary was violated. Git ownership rules, branch protection, and code owners matter.

The third failure is enabling automated prune before understanding ownership. Prune is safe only when the Application is the clear owner of the resources it tracks.

The fourth failure is letting tenants deploy cluster-scoped resources because one application needed a CRD. CRDs, webhooks, cluster roles, and storage classes are platform lifecycle concerns. Put them in a platform project with stricter review.
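A platform project for those kinds might look like the sketch below. The project name, repository URL, destination namespace, and the exact list of kinds are assumptions for illustration. If resource.inclusions is set in argocd-cm, the API groups listed here would also need to be included there.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: platform-cluster             # hypothetical project name
  namespace: argocd
spec:
  description: Platform-owned cluster-scoped resources
  sourceRepos:
    - https://github.com/example-org/platform-cluster.git   # hypothetical repository
  destinations:
    - server: https://kubernetes.default.svc
      namespace: platform-system     # hypothetical namespace for platform apps
  # Only the cluster-scoped kinds the platform team has agreed to manage through GitOps.
  clusterResourceWhitelist:
    - group: apiextensions.k8s.io
      kind: CustomResourceDefinition
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
    - group: rbac.authorization.k8s.io
      kind: ClusterRoleBinding
    - group: storage.k8s.io
      kind: StorageClass
  roles:
    - name: admin
      description: Platform team access to cluster-scoped applications
      groups:
        - oidc:platform-admins
      policies:
        - p, proj:platform-cluster:admin, applications, *, platform-cluster/*, allow
```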

Frequently Asked Questions

Q: Should every team get its own Argo CD instance? A: Not always. Separate Argo CD instances provide stronger operational isolation, but they increase controller, upgrade, and repository-management overhead. A shared instance with strict AppProjects, RBAC, and Kubernetes policy can work for many internal platforms; regulated or high-risk tenants may still justify dedicated instances.

Q: Can AppProjects stop a tenant from deploying privileged pods? A: AppProjects can limit destinations and resource kinds, but they do not express every pod-level security rule. Use Kubernetes admission policy and Pod Security Admission labels to enforce runtime restrictions such as privileged mode, host networking, hostPath mounts, and image provenance.

Q: Should tenants be allowed to create namespaces? A: Usually no in production. Namespace creation should be part of platform onboarding because quotas, network policy, labels, service accounts, and policy selectors need to be created together. Letting applications create arbitrary namespaces weakens the tenancy model.

Q: How do sync windows help multi-tenancy? A: Sync windows let a project allow or deny sync operations during scheduled periods. They are useful when production changes need change-freeze windows, business-hour controls, or coordinated release periods without disabling GitOps entirely.

Q: What is the fastest way to audit an existing shared Argo CD instance? A: List every AppProject, find projects using wildcard source repositories or wildcard destinations, map RBAC groups to project roles, identify applications with automated prune, and check which projects can deploy cluster-scoped resource kinds. Those four checks expose most tenancy risks quickly.
