I was thinking about how nice it would be if I could see on my main Grafana dashboard (the only one I use at this point, actually) when a new version of something gets deployed. This way, if there happens to be a problem with something afterwards, I can see at a glance what could have gone wrong. Also, I really like just looking at that Grafana dashboard and seeing that everything is alive and well. (Except when it isn’t.)
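One way this could work is through Grafana’s annotations HTTP API (POST /api/annotations), which a dashboard can then overlay on its graphs. What follows is only a rough sketch of building such a request in Ruby, not necessarily how I ended up doing it. The method name deploy_annotation_request and every value passed to it (URL, token, app name, version) are made up for illustration.

```ruby
require "json"
require "net/http"
require "uri"

# Build (but do not send) a request asking Grafana to record a deploy
# annotation. All argument values are hypothetical placeholders.
def deploy_annotation_request(grafana_url:, token:, app:, version:, at: Time.now)
  uri = URI.join(grafana_url, "/api/annotations")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{token}"
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(
    time: (at.to_r * 1000).to_i, # the API takes epoch milliseconds
    tags: ["deploy", app],
    text: "Deployed #{app} #{version}"
  )
  request
end

# Sending it would look roughly like this (not run here):
#   req = deploy_annotation_request(grafana_url: "https://grafana.example",
#                                   token: ENV.fetch("GRAFANA_TOKEN"),
#                                   app: "blog", version: "1.2.3")
#   Net::HTTP.start(req.uri.host, req.uri.port, use_ssl: true) { |http| http.request(req) }
```

Tagging each annotation with the app name means a dashboard annotation query can filter on the deploy tag and still show which service changed.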
The other day I was struggling with a very weird error when upgrading to Ruby 3. The initial migrations for a Rails app would fail with “Mysql2::Error: Unknown MySQL error (ActiveRecord::StatementInvalid)”, but only in certain environments. The error would occur when Rails tried to check which migrations had already been applied by looking at the schema_migrations table.
2022-01-19 09:12:59.205 +0000 [DEBUG] (1.7ms) SELECT GET_LOCK('4252831219231700070', 0)
/usr/local/bundle/gems/mysql2-0.5.3/lib/mysql2/client.rb:131: warning: rb_tainted_str_new_cstr is deprecated and will be removed in Ruby 3.2
2022-01-19 09:12:59.222 +0000 [DEBUG] (2.0ms) SELECT `schema_migrations`.`version` FROM `schema_migrations` ORDER BY `schema_migrations`.`version` ASC
2022-01-19 09:12:59.224 +0000 [DEBUG] (1.8ms) SELECT RELEASE_LOCK('4252831219231700070')
rails aborted!
ActiveRecord::StatementInvalid: Mysql2::Error: Unknown MySQL error
/usr/local/bundle/gems/mysql2-0.5.3/lib/mysql2/client.rb:131:in `_query'
Symptoms: CPU load on all the nodes, but not the pods.
Looking at Grafana, I noticed that CPU load on some of my nodes was constantly very high. At the same time, even the total CPU use of all the pods summed wasn’t above 0.4. What gives? This usually means that the control plane is getting fried by something. It may be trying to relieve disk pressure, or in this case, trying to revive CSI.
Trying to figure out what was causing problems, I checked the pods in kube-system with kubectl get pods -n kube-system. It quickly became apparent that there was a problem: disk-related pods like csi-resizer, csi-snapshotter and csi-provisioner were in CrashLoopBackOff.
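If eyeballing the table gets tedious, the same check can be scripted against the JSON output of kubectl. This is a minimal sketch, assuming the usual pod list shape (items, status.containerStatuses, state.waiting.reason); the method name crash_looping_pods is my own invention.

```ruby
require "json"

# Return the names of pods that have at least one container waiting in
# CrashLoopBackOff, given the parsed output of
# `kubectl get pods -n kube-system -o json`.
def crash_looping_pods(pod_list)
  crashing = pod_list.fetch("items", []).select do |pod|
    statuses = pod.dig("status", "containerStatuses") || []
    statuses.any? { |s| s.dig("state", "waiting", "reason") == "CrashLoopBackOff" }
  end
  crashing.map { |pod| pod.dig("metadata", "name") }
end

# Against a live cluster (not run here):
#   pods = JSON.parse(`kubectl get pods -n kube-system -o json`)
#   puts crash_looping_pods(pods)
```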
I’ll be quite honest in that I’m not sure what the problem was. A few searches later I came to the conclusion that an earlier node reboot had left the pods with a corrupted DNS cache or something along those lines. Basically every issue I found with the symptoms I was seeing came down to DNS problems (longhorn/longhorn#2225, longhorn/longhorn#3109, rancher/k3os#811).
Alas I haven’t touched any of the networking machinery of Kubernetes (nor configured any of it for k3s) so my first idea was just the good old one from IT Crowd: “have you tried turning it off and on again?” So I did. Luckily another restart of the afflicted nodes solved the issue. I’m glad it did because I dread what I’d have had to do otherwise.
It’s all too easy to kill a k3s cluster. I’ve been using k3s for years now and I’ve had plenty of adventures tweaking various aspects of running it. Before, it’d take just a small change to an Argo Application to trigger a cascading failure. Hopefully now it’s a bit more resilient. Just a bit.
You might happen to use the wicked_pdf gem for PDF output in your Rails app. You might happen to use the wkhtmltopdf-binary gem to provide the required binaries. You might want to get the above to work on the latest (at this point 3.0.3-bullseye) Ruby docker image. Short answer: give up. A bit longer answer: it’s easier than you think.
Monkey patching is bad. That’s where you should start from. It can cause trouble where you’d least expect it, conflicts with libraries you’d least expect in ways you’d least expect. And yet here I am sharing code for patching the delayed_job gem to (more or less) work with Ruby 3. Doesn’t this violate my own policies? There are a few choices.
- give up upgrading to Ruby 3 altogether
- monkey patch delayed_job as an emergency fix and make time to figure out what to do
- contribute to delayed_job making sure the gem is solid on Ruby 3
- get rid of all the .delay calls and switch to another async job library
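To give an idea of the kind of thing that can break in a .delay-style proxy, here is a toy example. It assumes the trouble comes from Ruby 3 separating keyword arguments from positional ones, which this excerpt does not actually confirm. DelayProxy and Mailer are hypothetical stand-ins and this is not delayed_job’s real code, only a sketch of forwarding keywords correctly.

```ruby
# Toy stand-in for a `.delay`-style proxy; this is NOT delayed_job's code.
class DelayProxy
  attr_reader :jobs

  def initialize(target)
    @target = target
    @jobs = []
  end

  # Capture positional and keyword arguments separately. With only `*args`,
  # Ruby 3 would store the keywords as a trailing positional Hash, and the
  # replayed call would fail with an ArgumentError.
  def method_missing(name, *args, **kwargs, &block)
    return super unless @target.respond_to?(name)

    @jobs << [name, args, kwargs]
    self
  end

  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name, include_private) || super
  end

  # Replay every recorded call, splatting keywords back in as keywords.
  def work_off
    @jobs.shift(@jobs.size).map do |name, args, kwargs|
      @target.public_send(name, *args, **kwargs)
    end
  end
end

# Hypothetical class whose method takes a keyword argument.
class Mailer
  def welcome(user, locale: "en")
    "welcome #{user} (#{locale})"
  end
end

# Usage:
#   proxy = DelayProxy.new(Mailer.new)
#   proxy.welcome("ann", locale: "hu")
#   proxy.work_off
```

A real job library also has to serialize the recorded call to a database, and that is where keeping args and kwargs apart gets harder than it looks here.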
Rails uses a “shifted” “semantic” “versioning” which pretty much comes down to the following. Major version: “we’ll most definitely break everything you ever depended on, half of it without warning.” Minor version: “we’ll probably break a lot of the stuff you depend on, some of it without warning.” Patch version: “we might accidentally some core APIs, but we promise it’s not intentional (or documented).” Knowing that, I still embarked on the grand endeavor of upgrading from Ruby on Rails 18.104.22.168 to 22.214.171.124. What could possibly go wrong, right?
Rich Hickey will tell you that breaking changes are horrible and versioning is stupid. The idea is nice. No breaking changes, ever. You get the API design of whatever you’re building perfectly right on the first try. Oh wait. Obviously no one can do that, and no one ever could.
The question then becomes just how long exactly are you willing to carry the dead weight of code you don’t really want to carry anymore. Or rather even, how long exactly are you able to pay the costs of maintaining a possibly very problematic old API design.
I’ve never used Haskell. I won’t claim I’m good at Rust. I mostly work with Ruby and Clojure, both dynamic languages where you don’t really need to worry about types. But then of course that’s not true. Even if you put Rails’s magic aside, it’s way too easy to write code that accidentally works (in an absolutely unintended fashion).
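A tiny, made-up Ruby illustration of “accidentally works”: the method below was only ever meant to take an Array, yet nothing stops it from returning something for other kinds of input. The name first_tag is purely hypothetical.

```ruby
# Meant to take an Array of tags and return the first one.
def first_tag(tags)
  tags[0]
end

first_tag(["ruby", "rails"]) # => "ruby" (the intended use)
first_tag("ruby")            # => "r"    (String also responds to [])
first_tag(5)                 # => 1      (Integer#[] reads bit 0 of the number)
```

None of these calls raises, so the wrong-type cases can sit in a codebase for a long time before anyone notices.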