DevOpsDreams

Posts

Autoscaling EKS Clusters with Karpenter: A Policy-First Model That Holds in Production

May 15, 2026

Karpenter can improve EKS scaling speed and flexibility, but reliable outcomes depend on NodePool policy, EC2NodeClass boundaries, and disruption controls.

TL;DR: Karpenter works best in production when autoscaling is treated as policy, not only capacity automation. Modern Karpenter workflows are built around NodePool, EC2NodeClass, and NodeClaim resources. Teams should enforce explicit requirements, limits, and disruption budgets, and run the Karpenter controller outside Karpenter-managed capacity. Cost and reliability improvements come from combining scaling policy with workload resource discipline and clear observability through NodeClaim lifecycle and metrics.

Production autoscaling starts with explicit NodePool and EC2NodeClass policy.

Karpenter Succeeds in Production Only When Scaling Policy Is Explicit

Karpenter can scale EKS clusters faster and with wider instance selection than static-no...

Kubernetes Gateway API vs Ingress: A Practical Production Model for Platform Teams

April 20, 2026

Ingress still works, but new routing requirements in shared clusters are better served by Gateway API. This guide explains what changes operationally, what to migrate first, and how to validate support safely.

TL;DR: Ingress is not removed from Kubernetes, but its API is frozen and Kubernetes recommends Gateway API for future evolution. The practical win for platform teams is governance: GatewayClass and Gateway can be owned by infrastructure teams, while HTTPRoute and related route objects can be owned by application teams. Migration should be incremental, with Ingress and Gateway resources coexisting while behavior is validated against your specific implementation. Production success depends on conformance checks, supported feature verification, and explicit cross-namespace attachment policy rather than direct YAML translation.

Gateway API maps platform ownership and application ownership more cleanl...

Improving Kubernetes Cost Visibility with OpenCost

April 07, 2026

OpenCost gives Kubernetes teams a practical way to see allocation, idle cost, and cloud billing in one place. This guide shows how to install it and read the numbers correctly.

TL;DR: OpenCost is useful when you need more than a cloud bill and less than a full financial model. It turns Kubernetes telemetry, Prometheus data, and cloud pricing inputs into allocation views that help teams understand who is using what and how much of the cluster is idle or shared. The important caveat is that the numbers are only as good as the telemetry and pricing data behind them, so the right goal is trustworthy cost visibility, not magical accounting precision.

Cost Visibility Is Not Cost Guessing

Most Kubernetes cost discussions start with the cloud bill and end with spreadsheet politics. That works until you need to answer a more useful question: which namespace, workload, team, or service is actually consuming the cluster, and how much of the pl...

Building an Internal Developer Platform on EKS

April 05, 2026

An internal developer platform is not just a cluster plus CI/CD. This guide shows how Backstage, GitOps, and EKS fit together as a product layer for self-service delivery.

TL;DR: An internal developer platform on EKS works best when you treat it as a product, not a cluster project. EKS provides the runtime substrate, but the platform is the contract layer that turns infrastructure into self-service capabilities: catalog, templates, deployment paths, health visibility, and guardrails. Backstage is useful because its catalog and software templates expose those capabilities in a developer-facing interface, while GitOps keeps the actual platform state declarative and auditable. If you want adoption, focus on what developers can request and understand, not only on what the cluster can run.

An IDP Is A Product Layer, Not A Cluster Project

The easiest way to build the wrong internal developer platform is to treat it like an infrastructure chec...

Monorepo vs Multirepo for GitOps Manifests: A Practical Guide for Flux and Argo CD

March 30, 2026

GitOps repository structure is not a style choice anymore. The right answer depends on how you split ownership, reconcile paths, and scale teams, controllers, and environments.

TL;DR: Monorepo versus multirepo is no longer a simple Git preference debate. GitOps tools care about ownership boundaries, reconciliation scope, and how quickly changes should move through the system. Flux can structure monorepos, repo-per-team, and repo-per-app setups, while source decomposition can split a single repository into smaller deployable artifacts with independent lifecycles. Argo CD uses Application paths, automated sync, health assessment, and sync waves to control how Git state becomes cluster state. The best design is the one that keeps ownership explicit, keeps blast radius small, and matches how your platform team actually operates.

The Real Debate Is Not Monorepo Versus Multirepo

GitOps has changed the quest...

Argo CD Auto-Sync and Health Checks: An Operator's Guide to Safe GitOps Reconciliation

March 30, 2026

Auto-sync and health are not the same thing in Argo CD. This guide shows how reconciliation, pruning, self-healing, retries, health checks, hooks, and sync waves fit together in real clusters.

TL;DR: Argo CD auto-sync and health checks solve different problems. Auto-sync decides when the controller reconciles Git with the cluster, while health checks decide whether the resulting resources are ready, degraded, or suspended. In practice, you need both: automated sync with prune and self-heal keeps desired state converged, retry refresh helps failed syncs recover on new revisions, and health checks plus custom Lua logic make readiness visible to operators. Sync waves and hooks add ordering when resources depend on each other.

The control loop most teams misunderstand

Argo CD has two separate jobs that often get blurred together in day-to-day platform talk. The first is reconciliation. Should Arg...