
My cluster is now running on k3s 1.20.6 and Argo CD 2.0.0 with its Helm chart at 3.2.2. Actually, upgrading Argo itself wasn’t much of a problem. I just changed the targetRevision of the Application and it was up and running in a few minutes. Then a few days later things got interesting.
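For context, upgrading an Argo CD instance that manages itself via the Helm chart amounts to a one-line bump of targetRevision in the Application spec. A minimal sketch, assuming the chart comes from the official argo-helm repo (names and namespaces are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://argoproj.github.io/argo-helm
    chart: argo-cd
    targetRevision: 3.2.2   # bump this line to upgrade
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
```

Once the change is committed, Argo picks it up on the next sync and upgrades itself.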

There was no downtime, but I noticed that Argo started failing to sync itself. Apparently a new minor version of the Helm chart had come out (still the same application version) that added support for the networking.k8s.io/v1 version of Ingress. However, it also accidentally broke clusters running Kubernetes versions before 1.19. And mine was one of them.
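The incompatibility comes down to the Ingress apiVersion: networking.k8s.io/v1 only exists on Kubernetes 1.19 and later, and it also restructured the backend fields. A sketch of what the chart now renders (names are illustrative):

```yaml
# networking.k8s.io/v1 requires Kubernetes 1.19+; older clusters only
# understand extensions/v1beta1 or networking.k8s.io/v1beta1.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
spec:
  rules:
    - host: argocd.example.com
      http:
        paths:
          - path: /
            pathType: Prefix      # required field, new in v1
            backend:
              service:            # v1beta1 used serviceName/servicePort instead
                name: argocd-server
                port:
                  number: 443
```

A cluster that predates the v1 API rejects this manifest at validation time, which is why the sync fails.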

While the Argo people figure out how to fix this (if they do), I decided to take the opportunity to upgrade my cluster. This wasn't as painless as it should've been, though.


First I had to upgrade k3s itself. This alone shouldn't have been much of an issue: set the K3S_TOKEN, K3S_NODE_NAME and K3S_URL (for the agent nodes) environment variables, pipe curl into shell (with the desired options), and be done. That was the easy part. The problem was that, for some reason, my agent nodes just wouldn't appear.
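For reference, joining an agent node with the official installer looks roughly like this; the hostname and node name are placeholders for your own:

```shell
# Values specific to my cluster; substitute your own.
export K3S_URL="https://server-node:6443"   # address of the server node
export K3S_TOKEN="<token from the server node>"
export K3S_NODE_NAME="agent-1"

# The installer sees K3S_URL and sets up k3s in agent mode.
curl -sfL https://get.k3s.io | sh -
```

Running the same command on an already-installed node re-runs the installation, which is also how you upgrade an agent in place.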

I checked the k3s logs with journalctl -f -u k3s.service and noticed errors saying “starting kubernetes: preparing server: https://&lt;server node&gt;:6443/v1-k3s/server-bootstrap: 401 Unauthorized.” Unauthorized? I double-checked that K3S_TOKEN was set to the same value as on the server node. Turns out that when I upgraded the server node, it generated a new, different token.

The token lives on the server node in a file called “node-token”, in my case /var/lib/rancher/k3s/server/node-token. Once I re-ran the agent nodes’ installation with that token, they checked in and everything was back in order… Or so I hoped.
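Concretely, the fix was to read the regenerated token off the server and feed it back into the agent install (server hostname is a placeholder):

```shell
# On the server node: print the join token the upgrade regenerated.
sudo cat /var/lib/rancher/k3s/server/node-token

# On each agent node: re-run the installer with the new token.
curl -sfL https://get.k3s.io | \
  K3S_URL="https://server-node:6443" \
  K3S_TOKEN="<value of node-token>" \
  sh -
```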

Apparently Argo didn’t like this change of scenery and things got stuck. Argo tried to sync itself to the new chart version (3.2.2), which no longer failed at resource validation like before but instead timed out (my cluster isn’t exactly beefy, you see). Then other Applications started going Unknown/Progressing, with resources getting recreated. After a while I got fed up and just wiped Argo altogether. Once I confirmed that everything else was less dead, I re-installed Argo and things were alright again.

Not exactly, though. I noticed that the reason resources kept getting recreated in the first place was that they were being evicted, and the nodes were running at very high CPU usage. Looking closer, it turned out the nodes were on the verge of getting the disk-pressure taint (though they never actually got tainted), so k3s was burning CPU trying to free disk space by garbage-collecting images (which didn’t work, since I barely have any). I checked with ncdu and noticed that journald‘s logs were eating up gigabytes of space. I didn’t look closely enough to see whether it was all k3s 401 errors, but I cleaned up with journalctl --vacuum-size=100M and changed its SystemMaxUse config so this won’t happen again.
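The cleanup boils down to two steps: shrink the journal now, and cap it permanently. The 100M limit here mirrors what I used; pick whatever fits your disks:

```shell
# One-off: shrink the journal to roughly 100 MB right now.
sudo journalctl --vacuum-size=100M

# Permanent: cap journal growth in /etc/systemd/journald.conf:
#   [Journal]
#   SystemMaxUse=100M
# then restart journald so the new limit takes effect.
sudo systemctl restart systemd-journald
```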