If you’ve looked at some of my previous posts (or repositories) you might be aware that I’m running my own k8s cluster at home for fun and absolutely no profit. Since the current setup is such a no-profit thing, I decided I needed to upgrade.
Thus the idea was born to re-do my cluster with two goals in mind:
- Infrastructure as Code from bare metal (mostly ansible)
- 1-node failover capabilities for any node
This blog post details how I set this up. As per usual, it’s a mix of blogging, documenting for myself, and providing a tutorial for others. Both goals have a slight caveat in the end, but I think the caveats are natural in the real world.
## Hardware
Since the goals already established that this should be automated from bare metal, let’s take inventory of the relevant metal.
### Control Plane
The control plane is going to run on two Odroid M1s. Both of them are equipped with a 1TB SATA SSD for boot and operational storage, and a 1TB NVMe SSD for “cloud native” storage.
Both drives were chosen because they showed up in my feed during Amazon’s Black Friday sales.
### Workers
The worker nodes on my system are a mess of cobbled-together systems:
- Old Computer (i7-2600k)
- Odroid M2
- Repurposed Thin Client on USB storage
- “Industrial” Mini-PC
They are ranked roughly in order of compute power and quality of their storage medium.
### Storage
To allow proper any-node failover, the storage needs to be distributed across multiple nodes as well.
In my old setup, the “Old Computer” has a `raidz1` over 4 1TB SSDs. Most workloads get that storage NFS-mounted via the k8s CSI mechanisms and are happy. This was ok, since this node also acted as single-node control plane and was the primary worker of my old cluster.
The new cluster is supposed to tolerate single-node outages without exception. So I’ll use rook to set up a ceph pool for storage. In this first iteration, it’s a simple mirror (~`raid1`) on the boot drives of the control-plane nodes, though I have some plans to create a larger erasure-coded pool at some point.
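As a rough sketch, such a mirrored pool can be declared through rook with a `CephBlockPool` resource. The name and namespace below are the common rook defaults, not necessarily what my cluster uses:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool   # illustrative name
  namespace: rook-ceph
spec:
  # Spread replicas across hosts, so losing one node keeps a copy alive.
  failureDomain: host
  replicated:
    # Two copies of each object: the "mirror" (~raid1) setup.
    size: 2
```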
## Infrastructure as Code
With this settled, I’ll first spell out the accepted caveats to “Infrastructure as Code”.
The goal is:
Starting on freshly installed systems, a single command (running on a controller) can get the cluster running
The somewhat mushy part in this definition is “freshly installed”. To save myself some horrors, I define this as:
- Ubuntu Linux is running, partitions are defined and filesystems are resized + set up.
The resized + properly partitioned step being separate from installation might be surprising. This is an artifact of the common install method for SBCs: the easy way to install an OS on the Odroid systems is via pre-defined images simply `dd`ed to a drive. These then need some extra work to use the entire drive. Or, in this case, to use half the drive and offer an empty partition for the ceph pool later on.
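In ansible, that resizing step could look roughly like the sketch below; the device name and partition layout are assumptions for illustration:

```yaml
# Hypothetical sketch: grow the root partition to half the disk, leaving the
# rest of the drive free for a later ceph partition.
- name: Grow the root partition to 50% of the disk
  community.general.parted:
    device: /dev/sda      # assumed boot drive
    number: 2             # assumed root partition number
    part_end: "50%"
    resize: true

- name: Grow the filesystem to fill the resized partition
  ansible.builtin.filesystem:
    fstype: ext4
    dev: /dev/sda2
    resizefs: true
```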
### Ubuntu
While I’m not the biggest fan of Ubuntu for servers, it’s often the supported OS for SBCs, so the choice comes down to ease of installation.
Accordingly, the ansible playbook does some basic cleanup of the images: remove some annoying bits that Ubuntu comes with (like `snapd`) and make sure my user exists and can be used as pseudo-admin via asymmetric-cryptography-based SSH login.
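The cleanup tasks might look something like this sketch; the username and key path are placeholders, not the playbook’s actual values:

```yaml
- name: Remove snapd and its packages
  ansible.builtin.apt:
    name: snapd
    state: absent
    purge: true

- name: Ensure my admin user exists
  ansible.builtin.user:
    name: admin          # placeholder username
    groups: sudo
    append: true
    shell: /bin/bash

- name: Install the SSH public key for key-based login
  ansible.posix.authorized_key:
    user: admin
    key: "{{ lookup('file', 'files/admin.pub') }}"  # placeholder key file
```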
After that, the OS-specific bits really boil down to getting k8s from the correct sources. Or, if you are adventurous, maybe some `systemd` bits. Though most distributions share these by now.
### Shared K8s Preparations
There are some basic tunables we need to set up. To follow IaC principles, this is done via an ansible role.
Every node in the k8s cluster needs to run the `kubelet`, and at least during the setup phase requires `kubeadm`. In my case, I’ve chosen `containerd` as container runtime.
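The tunables are the usual kubeadm prerequisites; a sketch of the typical ones follows, though the role’s exact list may differ:

```yaml
- name: Load kernel modules needed by containerd and kubernetes networking
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop:
    - overlay
    - br_netfilter

- name: Enable IP forwarding and bridged traffic filtering
  ansible.posix.sysctl:
    name: "{{ item }}"
    value: "1"
    state: present
  loop:
    - net.ipv4.ip_forward
    - net.bridge.bridge-nf-call-iptables

- name: Turn off swap, which the kubelet refuses by default
  ansible.builtin.command: swapoff -a
  changed_when: false
```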
While I might build my own packages at some point, the bootstrap is easier with upstream releases. Due to packaging cycles, the `containerd` comes from docker and the `kubernetes` components come from the official upstream binary releases.
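Sketched out, the sourcing could look like this; the version and architecture are placeholder values, the docker apt repository is assumed to be configured already, and the kubelet systemd units are omitted for brevity:

```yaml
- name: Install containerd from the docker repository
  ansible.builtin.apt:
    name: containerd.io
    update_cache: true

- name: Download kubeadm, kubelet, and kubectl from the upstream releases
  ansible.builtin.get_url:
    url: "https://dl.k8s.io/release/v1.30.0/bin/linux/arm64/{{ item }}"
    dest: "/usr/local/bin/{{ item }}"
    mode: "0755"
  loop:
    - kubeadm
    - kubelet
    - kubectl
```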
The only slightly non-standard bit during this step is the configuration of `containerd`. The docker-provided `.deb` file comes with a configuration I don’t quite understand. To get a predictable state, we use `containerd config default` and generate a default configuration. This configuration only needs minimal adjustments for now.
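Regenerating the config can be a one-liner. The `SystemdCgroup` tweak below is the usual minimal adjustment for kubeadm clusters, included here as an assumption rather than a quote from my role:

```yaml
- name: Generate a predictable default containerd configuration
  ansible.builtin.shell: containerd config default > /etc/containerd/config.toml
  args:
    creates: /etc/containerd/config.toml

- name: Switch containerd to the systemd cgroup driver, as kubeadm expects
  ansible.builtin.lineinfile:
    path: /etc/containerd/config.toml
    regexp: '^(\s*)SystemdCgroup = false'
    line: '\1SystemdCgroup = true'
    backrefs: true
  notify: Restart containerd   # assumes a matching handler exists
```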
## Control Plane Setup
The hardware section starts with the control-plane nodes. While they aren’t running on any special hardware in my case, they are the brains of the cluster, and they also need to be the “root” of a cluster setup since they contain knowledge about all other nodes.
### “Leader”
While the control-plane in general needs to be the root of the k8s cluster, in practice there needs to be a single host that acts as root. The other hosts of the control-plane then join this host with a special role.
To make my life easy, the ansible playbook just selects the first control-plane node in the inventory as leader.
The role then initializes the cluster with `kubeadm`, installs cilium and the gateway API resources, and deploys some bootstrapping resources for software running inside the cluster.
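The core of that role boils down to something like this sketch, where the shared-address variable is a placeholder and the cilium CLI is assumed to be available:

```yaml
- name: Initialize the cluster on the leader
  ansible.builtin.command: >
    kubeadm init
    --control-plane-endpoint "{{ control_plane_vip }}:6443"
    --upload-certs
  args:
    creates: /etc/kubernetes/admin.conf   # makes the task idempotent

- name: Install cilium as the CNI
  ansible.builtin.command: cilium install
  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf
```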
### Control Plane
All control-plane nodes also run the role for control-plane nodes. The check the role uses to avoid re-running on already-joined nodes also makes sure that the leader just continues running. The leader is important again to get the join command, even if that command is then used to join more than one other node.
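Fetching and using the join command could look like this; the group and variable names are made up for illustration:

```yaml
- name: Upload the control-plane certificates and capture the certificate key
  ansible.builtin.command: kubeadm init phase upload-certs --upload-certs
  register: upload_certs
  delegate_to: "{{ groups['control_plane'][0] }}"  # the leader

- name: Generate the join command on the leader
  ansible.builtin.command: kubeadm token create --print-join-command
  register: join_command
  delegate_to: "{{ groups['control_plane'][0] }}"

- name: Join this node as an additional control-plane node
  ansible.builtin.command: >
    {{ join_command.stdout }}
    --control-plane
    --certificate-key {{ upload_certs.stdout_lines[-1] }}
  args:
    creates: /etc/kubernetes/kubelet.conf
```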
### keepalived
kubernetes high availability setup with `kubeadm` requires a shared address for all control plane nodes. This allows bootstrapping information about the real addresses of cluster nodes without a single-point-of-failure discovery node.
While this can be done with DNS, I’ve chosen Keepalived to configure my control plane nodes with a “shared” IP address. While `keepalived` supports configurations with load balancing on a single address, in my case I just use it as rapid failover so the second node can take over the functions of the first one if it fails.
The setup is pretty simple and contained in an ansible role. The only slightly special thing about this setup: the leader runs the role before the leader k8s setup, while all other control-plane nodes run it after joining the cluster. This ensures that the shared address always routes to a useful node during setup.
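A minimal VRRP failover configuration, written here via a plain copy task for brevity, could look like this; the interface, address, and password are placeholders:

```yaml
- name: Write the keepalived configuration
  ansible.builtin.copy:
    dest: /etc/keepalived/keepalived.conf
    content: |
      vrrp_instance control_plane_vip {
          state BACKUP
          interface eth0            # placeholder interface
          virtual_router_id 51
          priority 100              # should differ per node
          advert_int 1
          authentication {
              auth_type PASS
              auth_pass changeme    # placeholder secret
          }
          virtual_ipaddress {
              192.168.0.100/24      # the "shared" address
          }
      }
  notify: Restart keepalived        # assumes a matching handler
```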
## Worker Setup
Worker nodes are even simpler than non-leader control plane nodes. They just need to run a single command to join the cluster. Of course that’s done via an ansible role.
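That single command is again generated on the leader; roughly:

```yaml
- name: Generate the join command on the leader
  ansible.builtin.command: kubeadm token create --print-join-command
  register: join_command
  delegate_to: "{{ groups['control_plane'][0] }}"

- name: Join the worker to the cluster
  ansible.builtin.command: "{{ join_command.stdout }}"
  args:
    creates: /etc/kubernetes/kubelet.conf   # skip nodes that already joined
```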
## ETCD
At this point, the cluster is running. It has two control plane nodes and a couple of workers. Everything is set up with ansible. So this should be the point to make a note: Huge Success.
Sadly, not quite. Turning off one of the nodes breaks the cluster 😔. This happens due to the backing storage for the `kube-apiserver`, namely `etcd`. This table provides the answer.
To get 1-node fault tolerance, we need 3 nodes in the `etcd` cluster. And with `kubeadm`, each control-plane node provides a single `etcd` instance. It also won’t work to simply scale `etcd` instances on the existing control-plane nodes, since `etcd` needs strictly more than half of its nodes to work. I.e. with only two nodes, losing either one will always take down the cluster.
To fix this, I’ve elected a single worker node to also host an `etcd` instance. The ansible role sets this up. It does some manual copying of certificates and signs them on the leader, then joins the node to the cluster.
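The key step, registering the new member with the running cluster, is roughly the following; the member name is made up, the certificate handling is omitted, and the cert paths are kubeadm’s defaults:

```yaml
- name: Register the worker as a new etcd member (run on the leader)
  ansible.builtin.command: >
    etcdctl member add worker-etcd
    --peer-urls=https://{{ ansible_default_ipv4.address }}:2380
  environment:
    ETCDCTL_API: "3"
    ETCDCTL_ENDPOINTS: https://127.0.0.1:2379
    ETCDCTL_CACERT: /etc/kubernetes/pki/etcd/ca.crt
    ETCDCTL_CERT: /etc/kubernetes/pki/etcd/peer.crt
    ETCDCTL_KEY: /etc/kubernetes/pki/etcd/peer.key
  delegate_to: "{{ groups['control_plane'][0] }}"
```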
Once this is done, any single node going down still leaves 2 of the 3 `etcd` members alive, so we retain quorum.
## CEPH
This story continues in Declarative k8s Storage.