I ran into a really weird problem today. I noticed some strange config drift on one of my nodes (shit happens when I manually experiment in “production”), so I decided to reinstall/upgrade the k3s agent. For a while now I’ve been connecting my nodes through tailscale, so that I can have my homelab machine join the cluster from my home network as well. k3s has a(n experimental) feature for “natively” integrating with tailscale and it’s been working just fine so far.
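For context, the way I have this wired up (a rough sketch from memory of the k3s docs, so treat the exact install invocation and flag values as assumptions to double-check) is that each node has tailscale installed and the k3s agent gets a vpn-auth option pointing at a Tailscale auth key:

```sh
# Hypothetical sketch: install tailscale on the node, then join the cluster
# using k3s's experimental Tailscale integration. <server>, <node-token> and
# <tailscale-auth-key> are placeholders.
curl -fsSL https://tailscale.com/install.sh | sh

curl -sfL https://get.k3s.io | \
  K3S_URL="https://<server>:6443" K3S_TOKEN="<node-token>" \
  sh -s - agent --vpn-auth="name=tailscale,joinKey=<tailscale-auth-key>"
```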

The weirdness today was that the restarted node wasn’t able to talk to my homelab node for some reason. They showed up in each other’s tailscale status output as active; relay instead of direct. This in turn meant that pods on the restarted k3s node couldn’t talk to CoreDNS, which had unluckily been scheduled on the homelab node.
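If you want to check this state yourself, tailscale status lists each peer along with whether traffic is flowing directly or through a DERP relay; the peer name here is a placeholder:

```sh
# Each peer line ends with something like "active; direct <endpoint>" or
# "active; relay <region>"; grep for the peer you care about.
tailscale status | grep homelab-node
```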

It took a while to figure this out, since the only pod having issues was this blog: it wasn’t able to connect to the database (the other two linodes had direct tailscale connections to the homelab node). At first I suspected the host’s DNS was broken, since dig couldn’t talk to systemd-resolved on 127.0.0.53 for some reason. (In retrospect I guess that was maybe because it, too, was trying to talk to CoreDNS and failing?) Then again, resolvectl query worked just fine…
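For reference, this is roughly the comparison I was doing on the host (the queried name is a placeholder):

```sh
# Ask systemd-resolved's stub listener directly; this was timing out.
dig db.example.internal @127.0.0.53

# Query via systemd-resolved's own client interface; this worked fine.
resolvectl query db.example.internal
```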

So I decided to kubectl exec into the blog’s WordPress pod and see if I could dig the database’s hostname, and indeed it failed. At first I suspected some issue with tailscale not advertising routes properly, but then the other nodes should have been failing too. That’s when it occurred to me to check what this 10.43.0.10 (the nameserver the pod was trying to use) actually was (it’s the CoreDNS service), and one more kubectl describe revealed that the CoreDNS pod was running on the homelab node; tailscale status gave me the final hint.
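Roughly the sequence, with placeholder names for the pod, namespace and database service (in k3s the cluster DNS service lives in kube-system and is named kube-dns, with the ClusterIP 10.43.0.10):

```sh
# From inside the affected WordPress pod: does cluster DNS answer at all?
kubectl -n blog exec -it <wordpress-pod> -- dig mariadb.blog.svc.cluster.local

# What is 10.43.0.10? It's the ClusterIP of the cluster DNS (CoreDNS) service.
kubectl -n kube-system get svc kube-dns

# Which node did the CoreDNS pod get scheduled on?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
```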

In my case, upgrading tailscale to the latest version and running tailscale ping from the homelab node toward the restarted node resolved the active; relay issue and turned it into a direct connection.
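Concretely, it was something along these lines; the exact upgrade command depends on how tailscale was installed (apt assumed here), and the peer name is a placeholder:

```sh
# On the nodes stuck on a relayed path: bring tailscale up to the latest version.
sudo apt-get update && sudo apt-get install -y tailscale

# From the homelab node: actively probe the restarted node; the probes were
# enough to move the path from "active; relay" to a direct connection.
tailscale ping restarted-node
```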