It’s all too easy to kill a k3s cluster. I’ve been using k3s for years now and I’ve had plenty of adventures tweaking various aspects of running it. Before, a small change to an Argo Application could be enough to trigger a cascading failure. Hopefully now it’s a bit more resilient. Just a bit.
Have a dedicated master node (control plane)
In my cluster the k3s server usually eats anywhere between 0.3 and 0.7 CPU. That’s a fair load by itself, and if you add some other heavy pods (like Argo’s application controller) it can easily overload a small node (like a Linode Nanode). And if you’re running k3s by yourself, you’re probably using smaller nodes.
This is really easy to achieve with node taints. The official docs mention this in the high-availability install instructions too. I use the CriticalAddonsOnly=true:NoSchedule taint because I have a pod that can’t be evicted due to a local-path volume bind. With NoSchedule, no new pods (other than “critical addons”) will get scheduled on the node, but already running pods are left intact.

I wanted to evict everything I could though, which I achieved with kubectl drain control-node --ignore-daemonsets --pod-selector=app.kubernetes.io/name!=pehelypress (where “pehelypress” is the name of the minimal WordPress Helm chart of the pod I wanted to leave there). Once all the other pods were evicted, I un-cordoned the node with kubectl uncordon control-node to make it schedulable again for everything that tolerates the above NoSchedule taint.
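For reference, a sketch of how the taint gets applied (control-node is just the node name assumed from the drain example above). On an existing cluster it’s a single kubectl command:

    kubectl taint nodes control-node CriticalAddonsOnly=true:NoSchedule

Or, to have k3s apply it at install time, add something like this to /etc/rancher/k3s/config.yaml on the server:

    node-taint:
      - "CriticalAddonsOnly=true:NoSchedule"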
Pay attention to FreeDiskSpaceFailed events
I’ve had trouble with these. The Nanodes I use only have 25GB of disk space, which can fill up pretty quickly if something goes wrong. Sometimes even if things go well. Limiting the space the systemd journal is allowed to use can help if that’s the root cause. But sometimes there are just many things on a node, or something just takes up space naturally (like the photos on this blog).
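If the journal turns out to be the culprit, capping it is quick. A minimal sketch (the 200M figure is just an example, pick whatever your disk can spare):

    # One-off cleanup of existing journal files
    sudo journalctl --vacuum-size=200M

    # Persistent cap: set SystemMaxUse=200M under [Journal] in /etc/systemd/journald.conf,
    # then restart the journal service
    sudo systemctl restart systemd-journald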
On my cluster the symptoms are high CPU use on the control plane node and on the disk-starved node, plus lots of FreeDiskSpaceFailed events telling me that the kubelet wanted to free some space but could only free much less (usually zero). I don’t know at this point what the ideal way to alert on events would be. The simple way would be to have something log every event to stdout, which Loki could then collect, and which I could then alert on from Grafana. eventrouter looked nice, except it doesn’t seem to be well maintained right now. I’ll check back later. (Hi me from 2024. Yes, I considered this. It was pretty much abandonware. Use it if it isn’t anymore.)
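Until there’s a proper alerting pipeline, even a manual check helps; a minimal sketch:

    # List FreeDiskSpaceFailed events across all namespaces, oldest first
    kubectl get events --all-namespaces \
      --field-selector reason=FreeDiskSpaceFailed \
      --sort-by=.metadata.creationTimestamp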
There are a few ways to remedy disk pressure on the nodes. One obvious option is to stop using the local-path provisioner for persistent volumes. While local-path is easy and simple for sure, it means that a pod using a local-path volume can only live on the node it was first scheduled to. Which might not be ideal (see above with the master node).
I think most providers at this point will give you some way to create external volumes from Kubernetes (Linode and DigitalOcean definitely do). These volumes will cost money, but they can be attached to any node and can be pretty big.
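As an illustration, a PersistentVolumeClaim against such a provider-managed class might look like this (linode-block-storage is an assumed class name, check kubectl get storageclass for what your provider’s CSI driver actually installs):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: blog-media                          # hypothetical claim name
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: linode-block-storage    # assumed; verify on your cluster
      resources:
        requests:
          storage: 10Gi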
Another option is Longhorn. While it has its quirks, it utilizes the disk space of the nodes themselves without binding any volume to a particular node. It does eat a bit of CPU and is sensitive to node reboots, but it doesn’t cost money. I haven’t used it long enough yet to comment on its details.
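Installing it is a couple of Helm commands; a sketch based on the public chart repo (check the Longhorn docs for the current prerequisites, like open-iscsi being present on every node):

    helm repo add longhorn https://charts.longhorn.io
    helm repo update
    helm install longhorn longhorn/longhorn --namespace longhorn-system --create-namespace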
Consider resources
Most Helm charts will install without resource requests or limits specified. Which means it’s the wild west. Grafana helps a lot with setting meaningful requests and limits. You can see on graphs how much CPU and memory each pod or each container uses, and get a general sense of how much it needs while “idling” and how much it needs when actively used.
On the one hand this can help the Kubernetes scheduler figure out which node to use for a pod (it only schedules based on requests, so limits can still over-commit a node). On the other hand it’ll make you realize beforehand that maybe adding that bitcoin miner will be too much strain on your cluster (it is definitely too much strain on the planet, so cut it off).
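For most charts this boils down to a values snippet like the one below, assuming the chart exposes the conventional resources block (the numbers are made up; take yours from the Grafana graphs):

    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi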
Pods that can get quite heavy (so you definitely want to keep them off the master node and preferably on separate nodes each) include the Prometheus server (the more you’re scraping in your cluster, the heavier it gets) and Argo’s controllers (they fetch repos, expand Helm charts, check for diffs and apply manifests at the very least). Even in the case of “proper” (not k3s) Kubernetes clusters (for example running on EKS Fargate), Argo can become quite a lumbering beast if the cluster has a lot of pods and changes often.
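To nudge such heavy pods onto separate nodes, a preferred podAntiAffinity rule in their pod templates is usually enough. A sketch, assuming you label the heavy workloads yourself with a hypothetical workload-weight: heavy label (most charts expose an affinity value to plug this into):

    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  workload-weight: heavy    # hypothetical label marking the heavy pods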