
Bare Metal K8s with HA

·1315 words·7 mins·
Kubernetes Tutorial
Author
Markus Ongyerth

If you’ve looked at some of my previous posts (or repositories) you might be aware that I’m running my own k8s cluster at home for fun and absolutely no profit. Since the current setup is such a no-profit thing, I decided that I need to upgrade.

Thus the idea was born to re-do my cluster with two goals in mind:

  • Infrastructure as Code from bare metal (mostly ansible)
  • 1-node failover capabilities for any node

This blog post details how I set this up. As per usual, it’s a mix of blogging, documenting for myself, and providing a tutorial for others. Both goals have a slight caveat in the end, but I think the caveats are natural in the real world.

Hardware
#

Since the goals already established that this should be automated from bare metal, let’s take inventory of the relevant metal.

Control Plane
#

The control plane is going to run on two Odroid M1 boards, each equipped with a 1TB SATA SSD for boot and operational storage, and a 1TB NVMe SSD for “cloud native” storage.

Both drives were chosen because they showed up in my feed during Amazon’s Black Friday sales.

Workers
#

The worker nodes in my system are a mess of cobbled-together machines, ranked roughly in order of compute power and quality of their storage medium.

Storage
#

To allow proper any-node failover, the storage needs to be distributed across multiple nodes as well.

In my old setup, the “Old Computer” has a raidz1 pool across four 1TB SSDs. Most workloads get that storage NFS-mounted via the k8s CSI mechanisms and are happy. This was fine, since that node also acted as the single-node control plane and was the primary worker of my old cluster.

The new cluster is supposed to tolerate single-node outages without exceptions. So I’ll use rook to set up a Ceph pool for storage. In this first iteration, it’s a simple mirror (~raid1) on the boot drives of the control-plane nodes, though I have plans to create a larger erasure-coded pool at some point.

Infrastructure as Code
#

With the hardware covered, I’ll first spell out the accepted caveats to “Infrastructure as Code”.

The goal is:

Starting on freshly installed systems, a single command (running on a controller) can get the cluster running

The somewhat mushy part in this definition is “freshly installed”. To save myself some horrors, I define this as:

  • Ubuntu Linux is running, partitions are defined, and filesystems are resized and set up.

It might be surprising that the resizing and partitioning step is separate from installation. This is an artifact of the common install method for SBCs: the easy way to install an OS on the Odroid systems is via pre-built images simply dded to a drive. These then need some extra work to use the entire drive. Or, in this case, to use half the drive and leave an empty partition for the Ceph pool later on.
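As a rough sketch, the flashing and repartitioning can look like this. The image file name, device, sizes, and partition numbers are placeholders for my setup, not literal values; check the actual image layout with sgdisk -p before touching anything:

```shell
# Flash the vendor image (hypothetical file name) to the SSD.
xz -dc ubuntu-odroid-m1.img.xz | dd of=/dev/sda bs=4M conv=fsync status=progress

# Grow the root partition to roughly half the 1TB drive. Deleting and
# recreating a partition keeps the data as long as the start sector is
# unchanged, so verify the layout first.
sgdisk --delete=2 /dev/sda
sgdisk --new=2:0:+450G --typecode=2:8300 /dev/sda  # recreated root
sgdisk --new=3:0:0     --typecode=3:8300 /dev/sda  # empty, for Ceph later
partprobe /dev/sda
resize2fs /dev/sda2  # grow the filesystem into the enlarged partition
```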

Ubuntu
#

While I’m not the biggest fan of Ubuntu for servers, it’s often the supported OS for SBCs, so the choice comes down to ease of installation.

Accordingly, the ansible playbook does some basic cleanup of the images: it removes some annoying bits that Ubuntu ships with (like snapd) and makes sure my user exists and can act as pseudo-admin via asymmetric-cryptography-based SSH login.
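What those tasks boil down to, expressed as plain commands. The user name and key file here are placeholders, not the actual playbook values:

```shell
# Remove snapd and its leftover state.
apt-get purge -y snapd
rm -rf /var/lib/snapd /snap

# Create the pseudo-admin user with key-only SSH access.
useradd -m -G sudo -s /bin/bash ops
install -d -m 700 -o ops -g ops /home/ops/.ssh
install -m 600 -o ops -g ops ops_key.pub /home/ops/.ssh/authorized_keys
```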

After that, the OS-specific bits really boil down to getting k8s from the correct sources. Or, if you are adventurous, maybe some systemd bits, though most distributions share those by now.

Shared K8s Preparations
#

There are some basic tunables we need to set up. To follow IaC principles, this is done via an ansible role. Every node in the k8s cluster needs to run the kubelet and, at least during the setup phase, requires kubeadm. In my case, I’ve chosen containerd as the container runtime.

While I might build my own packages at some point, the bootstrap is easier with upstream releases. Due to packaging cycles, containerd comes from Docker and the Kubernetes components come from the official upstream binary releases.
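Concretely, that means adding Docker’s apt repository for containerd and the pkgs.k8s.io repository for the Kubernetes components. A sketch, with the Kubernetes minor version as an example pin:

```shell
# Docker repository, for containerd.io.
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  -o /etc/apt/keyrings/docker.asc
echo "deb [signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" \
  > /etc/apt/sources.list.d/docker.list

# Upstream Kubernetes packages (version pin is an example).
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key \
  | gpg --dearmor -o /etc/apt/keyrings/kubernetes.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes.gpg] \
  https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /" \
  > /etc/apt/sources.list.d/kubernetes.list

apt-get update
apt-get install -y containerd.io kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl  # avoid accidental upgrades
```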

The only slightly non-standard bit during this step is the configuration of containerd. The Docker-provided .deb file comes with a configuration I don’t quite understand. To get a predictable state, we run containerd config default to generate a known default configuration.

This configuration only needs minimal adjustments for now.
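The regeneration step, plus the one adjustment I’d expect most setups to need: enabling the systemd cgroup driver, assuming a systemd-based host where the kubelet also uses it.

```shell
# Replace the shipped config with a known default.
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml

# Recent kubelets default to the systemd cgroup driver,
# so containerd should match.
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
systemctl restart containerd
```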

Control Plane Setup
#

The hardware section starts with the control-plane nodes. While they aren’t running on any special hardware in my case, they are the brains of the cluster, and they also need to be the “root” of the cluster setup since they hold the knowledge about all other nodes.

“Leader”
#

While the control plane as a whole is the root of the k8s cluster, in practice a single host needs to act as that root. The other control-plane hosts then join this host with a special role.

To make my life easy, the ansible playbook just selects the first control-plane node in the inventory as the leader.

The role then initializes the cluster with kubeadm, installs Cilium and the Gateway API resources, and deploys some bootstrapping resources for software running inside the cluster.
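Boiled down to commands, the leader role does something like this. The shared address is a placeholder for the keepalived VIP, and the Gateway API release is just an example:

```shell
# Initialize against the shared (keepalived) address and upload the
# control-plane certificates so the second node can join later.
kubeadm init \
  --control-plane-endpoint "192.168.1.10:6443" \
  --upload-certs

# CNI: Cilium, installed via its CLI.
cilium install

# Gateway API resources (example release, pin the one you want).
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml
```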

Control Plane
#

All control-plane nodes also run the role for control-plane nodes. The method it uses to avoid re-running also makes sure that the leader just keeps running untouched.

The leader is important again for obtaining the join command, even if that command is then used to join more than one other node.
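Getting the join command from the leader is a one-liner; a control-plane join additionally needs a certificate key. A sketch of the commands involved, with the endpoint as a placeholder:

```shell
# On the leader: print a join command with a fresh token.
kubeadm token create --print-join-command

# Control-plane joins also need the certificate key; re-upload the
# certificates to get a fresh one (it expires after two hours).
kubeadm init phase upload-certs --upload-certs

# On the joining control-plane node (values from the commands above):
kubeadm join 192.168.1.10:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <key>
```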

keepalived
#

A Kubernetes high-availability setup with kubeadm requires a shared address for all control-plane nodes. This allows bootstrapping the knowledge about the real addresses of cluster nodes without a discovery node that would be a single point of failure.

While this can be done with DNS, I’ve chosen keepalived to configure my control-plane nodes with a “shared” IP address. While keepalived supports load-balancing configurations on a single address, in my case I just use it for rapid failover, so the second node can take over the functions of the first one if it fails.

The setup is pretty simple and contained in an ansible role. The only slightly special thing about it: the leader runs the role before its k8s setup, while all other control-plane nodes run it after joining the cluster. This ensures the shared address always routes to a useful node during setup.
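A minimal keepalived.conf for this failover-only use might look like the following. Interface, router id, password, and the VIP are placeholders; the second node uses state BACKUP and a lower priority:

```
vrrp_instance control_plane_vip {
    state MASTER            # BACKUP on the second node
    interface eth0          # placeholder interface name
    virtual_router_id 51
    priority 100            # lower on the backup node
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme  # placeholder
    }
    virtual_ipaddress {
        192.168.1.10/24     # the shared control-plane address
    }
}
```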

Worker Setup
#

Worker nodes are even simpler than non-leader control-plane nodes. They just need to run a single command to join the cluster. Of course, that’s done via an ansible role.

ETCD
#

At this point, the cluster is running. It has two control-plane nodes and a couple of workers. Everything is set up with ansible. So this should be the point to make a note: Huge Success.

Sadly, not quite. Turning off one of the nodes breaks the cluster 😔. This happens due to the backing storage of the kube-apiserver, namely etcd. The fault-tolerance table in the etcd documentation provides the answer.

To get 1-node fault tolerance, we need 3 etcd members in the cluster, and with kubeadm each control-plane node provides exactly one etcd instance. Simply scaling etcd on the existing control-plane nodes won’t work either, since etcd needs strictly more than half of its members to reach quorum: with two members, losing either one takes down the cluster.

To fix this, I’ve elected a single worker node to also host an etcd instance. The ansible role sets this up: it does some manual copying of certificates, signs them on the leader, and then joins the node’s etcd member to the cluster.
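kubeadm has no built-in flow for an etcd-only member, so the role essentially replays the manual steps. Very roughly, with endpoints, names, and paths as placeholders and the certificate generation elided:

```shell
# On the leader: announce the new member to the existing etcd cluster.
etcdctl --endpoints=https://192.168.1.11:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member add etcd-worker --peer-urls=https://192.168.1.20:2380

# On the worker: start etcd (e.g. as a static pod) with
# --initial-cluster-state=existing and the certificates signed
# by the leader's etcd CA.
```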

Once this is done, any single node going down still leaves 2 etcd members alive, so we retain quorum.

CEPH
#

This story continues in Declarative k8s Storage.