How I took down and brought back prod
- hosting
- server
- servers
- infrastructure
- kubernetes
This is not a paid advertisement. All views and opinions are my own. That said, use Omni.
I fucked up
I'm running Kubernetes in my homelab, which hosts this website and is the main ingress for all the services I'm running. It started when I was installing KubeVirt. It seemed like a really good idea to install KubeVirt so I could migrate every last VM out of Proxmox except the Kubernetes ones. There are many reasons why it would make things better, but mostly it would massively improve reboot times for the server if Kubernetes is the only thing that has to start and the kube scheduler can handle launching everything else. It currently takes around 2 hours for everything to come back up, which is not ideal.
How I fucked up
One of the blog posts I was reading about KubeVirt suggested that the way to add VMs to the LAN is by adding a Linux bridge network to the hosts. So I opened Omni and made a cluster-wide machine patch that added a bridge network to every node (roughly the shape sketched below). The Talos machines at 1.9.3 didn't complain, at first. Then I wanted to migrate the machines to the latest Talos version, 1.10.2, so I started the upgrade rollout from within Omni, and this is where things went bad.
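For context, a Talos machine patch for a bridge looks roughly like this. This is a sketch, not my actual patch; the interface names and addressing are placeholders:

```yaml
machine:
  network:
    interfaces:
      - interface: br0        # the new Linux bridge the VMs attach to
        dhcp: true            # or static addresses, depending on the node
        bridge:
          stp:
            enabled: false
          interfaces:
            - eth0            # enslave the node's physical NIC to the bridge
```

Applied as a cluster-wide machine patch in Omni, every node picks this up.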
Omni updates one machine at a time, precisely to contain an update failure like the one that happened here. The update failed because of the bridge patch. No big deal, I thought, I'll simply roll back the upgrade, revert the patch, and try again. Not so much. Omni updates the control planes first, so one of my control planes was now down. Whatever. I left it like that for a couple of days and everything still seemed to work for me. Then a friend who was using my GitLab to store his project told me he couldn't access it from outside.
Investigation time. I can still access it from inside the network, so this is weird. I check the Ingress objects; they're still fine. I check BGP on the switch; it seems fine. I update Traefik to the latest version and set its external traffic policy back to the reliable Cluster mode instead of Local to see if that does anything. Still can't load anything from outside. Maybe my ISP is suddenly blocking ports? I change my MAC address, remove the hostname announcement to my ISP, and cycle my IP. Still nothing. All I can think at this point is that maybe it's that downed control plane and the routing somewhere, somehow, is messed up. So I'll try to fix that control plane.
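The Cluster vs Local bit is the Service's external traffic policy. My guess at the time was that this knob was involved; here's a sketch of the relevant field (not my actual Traefik values, which are set through its Helm chart):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster   # Local only routes to pods on the node that received
                                   # the traffic; Cluster forwards to any node (an extra
                                   # hop, but more forgiving when some nodes are unhealthy)
```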
I'll need to remove the old one, so first I start adding a new control plane to take the place of the old one. It never becomes available. I delete the old control plane, but still, nothing's coming up. I'm no Kubernetes master yet; I don't know what could be failing if it's not an obvious error dumped to the system logs. I do what seemed logical at the time and start adding more and more control planes. I had it up to 9 at one point. Nine is far too many control planes; etcd wants a small odd number like 3 or 5, and every extra member just adds quorum overhead.
At this point I start looking into the Omni, Talos, and Kubernetes disaster recovery posts online. I manage to use talosctl to take an etcd snapshot; even though none of the control planes were healthy, any snapshot is better than nothing at this point. I was also able to take a snapshot in Omni. Now I needed to secure this snapshot. Proxmox came in clutch here: I used it to make a backup of the VM running Omni inside Docker, so that even if I screwed up Omni I'd be able to restore it (and yes, I did have to). Finally, I used omnictl to dump the YAML configuration of the cluster. The commands looked something like the sketch below.
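A rough sketch of that step; the node IP and file names here are placeholders:

```bash
# Grab an etcd snapshot straight from a (possibly unhealthy) control plane via the Talos API
talosctl -n <control-plane-ip> etcd snapshot etcd-backup.db

# And dump the cluster definition that Omni manages; I used omnictl for this
# (something along the lines of `omnictl cluster template export` -- check --help
# for the exact flags, I'm writing this from memory)
```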
Port Blocked?
Just in case my ISP was port-blocking me, I also enabled the Cloudflare proxy for the domain for 12 hours. Not Tunnels or anything, just that quick toggle of the proxy. In those 12 hours overnight, 1.55k unique visitors made over 15k requests to my not-online server. I didn't leave it on, but it was very interesting to see that much traffic hitting my server. It also didn't help things. I still have no way to be absolutely sure it wasn't the ISP; I could have spun up a server on the gateway to test and make sure, but until I get the cluster back up I don't really care what caused Traefik to be unable to reach the GitLab. (I'm 95% sure now it was just something wrong with the NAT64 gateway. Layers upon layers, yo.)
Website's down
At this point all the things are failing, and I'm scrambling to bring it back up without any data loss. Here I did something very stupid: I changed the CIDR ranges in the cluster configuration YAML. I'm not entirely sure why I did it, but I did. I saved the changed CIDRs and continued trying to recover. (For the curious, those ranges live in the cluster's network settings, roughly as sketched below.)
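In the Talos cluster config carried by the template, the ranges in question sit here; the values shown are the common Kubernetes defaults, not necessarily mine:

```yaml
cluster:
  network:
    podSubnets:
      - 10.244.0.0/16     # where pod IPs are allocated from
    serviceSubnets:
      - 10.96.0.0/12      # where ClusterIP services are allocated from
```

Changing these on a cluster whose etcd state already describes the old ranges is the mistake the rest of this post pays for.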
I follow the disaster recovery steps to a T (except for having changed the CIDRs... this bites me in the ass), and the cluster won't come back. I try creating a new cluster using the etcd snapshot to bootstrap it; it still won't come up. I create a support ticket on the Omni GitHub issue tracker, and one of the main devs responds almost immediately, trying to help me figure this out. I'm sending in logs and support bundles, and it makes no sense. etcd is bootstrapping, and it logs things that it could only log if it had recovered the old configuration.
I'm thinking my server is very slow because the CPUs are old and slow, my network isn't great, and I'm using Ceph, so storage is extra slow too. Maybe it's just taking too long; I leave it overnight and hope it just needs time. That's not it. I'm making new clusters.
What happened to Day 2?
On Day 1 I was reading everything and trying everything; I don't want to bug devs unless I'm really stuck. It wasn't until Day 2 that I reached out to the Omni project for help. I went there because, even though the environment is all "just Kubernetes", Omni manages a whole hell of a lot of the deployment details for me, making it far easier to work with. I don't have to track the SSL certs manually anymore; Omni does all that. It also means that to recover this cluster using the old keys and whatnot, I'd have to do it through Omni.
Day 3
The Omni dev suggested I try to bootstrap a 'cluster' of one node, a single control plane node, and send in the logs for that. So I bootstrap that single-node cluster, and before I send in any logs I might as well take a look myself. There are three containers running: kube-apiserver, kube-controller-manager, and kube-scheduler. I'm watching kube-controller-manager, at this point thinking some certificate is probably the issue, because it's usually SSL when something goes really bad. But then I see the most curious thing: warnings about the CIDRs, and at the end it crashes on startup saying the node IP isn't in the CIDR range. Didn't I change the CIDRs in the recovery template? I sure as shit did. Dumbass.

Now, I had completely forgotten the original CIDRs, but I knew the services had a /12 and the pods a /24. Using the error messages I was able to calculate what the original services /12 was, and the /24 was obvious (see the worked example below). I put these back, tried again to bootstrap a single-node cluster, and what do you know? It's coming online. One more time I erase the test recovery cluster and start to rebuild the main cluster. I say one more time... let's just say thank fuck I had that Proxmox backup. Omni had released 0.50.0 literally 4 hours before I started doing all this, so I updated, and that was a great move. Some of the resources in the previous release weren't... releasing, but after the update they finally did.
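The "calculate the original /12" bit is nothing fancy: take an address the errors show (a service ClusterIP, for example) and mask it down to its /12 network. A quick way to do that; the IP here is a made-up example, not the one from my logs:

```bash
# Mask an address down to its /12 network to recover the original service subnet
python3 -c "import ipaddress; print(ipaddress.ip_interface('10.109.4.17/12').network)"
# -> 10.96.0.0/12
```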
Relief
With the main cluster now restoring from the etcd backup and the proper fucking CIDRs, I could see the control planes coming back fully. Now I just had to put everything back and switch the IPs around for the control planes and worker nodes so that I didn't have to go modify the Cisco switch's BGP settings to add more IPs (I had changed them so the test clusters could use the same IPs... funny how I was so careful about this set of IP addresses but was an idiot with the CIDRs). With bated breath I waited. Green lights started to come on. CP1 says it's fully running, now CP2 and CP3. I try to access the GitLab, because it runs in a VM, not in Kubernetes, but Kubernetes runs the ingress for it. It's... loading?! Yes, things really are coming back online.
Rancher? Not yet. I wait another half hour, try again, and Rancher is back! Now I can log in and check on everything. I knew it would take a while to fully come back up, since there are a few docker.io images and they rate-limit the shit out of pulls. I can see ImagePullBackOffs due to rate limiting popping up in the events, but everything is coming back. Everything except Keycloak, but that's another story that involves default settings and Out Of Memory killings.
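For anyone following along at home, watching this phase is mostly standard kubectl, nothing Omni-specific:

```bash
# Pods stuck pulling images (docker.io rate limiting shows up here)
kubectl get pods -A | grep -E 'ImagePullBackOff|ErrImagePull'

# Recent events across the cluster, rate-limit messages included
kubectl get events -A --sort-by=.lastTimestamp | tail -n 30
```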
The cluster is back online, no data is lost, but it did take me 3 days. Thank god this was just my homelab and not some company's prod server.
What I've learned
Back up, back up, and when you think you don't need any more backups, back up again. Back up your etcd cluster, back up your PVs, back up EVERYTHING. Even if it's just a backup on the same storage network, it's better than nothing. Take snapshots regularly. Make backups your religion: have a plan to back up, restore, and test your restores. Everything has to work.
Also, don't screw with things like CIDRs after a cluster is bootstrapped. It doesn't end well. If you need different CIDR ranges for whatever reason (like needing more addresses or going IPv6-only), make a new cluster and use something like Velero to migrate workloads over (sketched below). Learn from my mistakes.
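A minimal sketch of that migration path, assuming Velero is already installed on both clusters and pointed at the same object storage; the backup name is a placeholder:

```bash
# On the old cluster: back up everything (add volume snapshot or file-system
# backup support if you need PV data to come along)
velero backup create pre-cidr-migration --wait

# On the new cluster, built with the CIDRs you actually want: restore from
# the shared backup location
velero restore create --from-backup pre-cidr-migration
```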
Shoutouts
Massive shoutout and huge thank-yous to the Omni/Talos folks. They took the time to respond to my request for help without me being a paying customer, and led me to try the single control plane node, which made it much easier to track down what the actual problem was. If I ever find a job and they want suggestions for infra management solutions, I know what I'll be recommending.