
Chaos Engineering with Toddlers and Kubernetes

The Gremlin

When designing infrastructure, we often focus on theoretical failures such as power loss, network outages, or misconfigured deployments. But what happens when chaos strikes in an unexpected form? This article explores key principles of Site Reliability Engineering, chaos engineering, and the value of resilient storage, through a real-world (and surprisingly effective) test conducted by an unlikely agent of chaos: a toddler.


The eldest gremlin that lurks in my home is still a young human. She was not yet 3 years of age at the time of this story and was still largely dependent on her caretakers for supplies, shelter, nutrition, crackers in the shape of a fish, and occasional video entertainment. Her matriarch left her to watch a family home video while her younger sister was fetched from a nap. Alone now for only minutes, the gremlin grew restless after being left on the couch. She began to wander into other rooms in my basement. Since the HVAC unit and adjacent bathroom failed to pique her interest, she wandered into the guest bedroom.

"Oh? What is this?" she must have thought.

There, on the floor, she found something curious. It was a pile of laptops, Raspberry Pis, and Libre Computer single-board computers. She didn't know it, but she had stumbled across the family's handy-dandy Kubernetes installation. At the time of this discovery, her often-declared favorite color was blue. What a coincidence this was, as blue also happened to be the color of the Dell laptop power plugs, glowing there in the otherwise dark guest bedroom.

Curious, the gremlin pulled the power plug from the laptop and held it in her hand. She followed the other end until it went to a box where she couldn't trace the cable further. So she pulled a second blue plug and a third, and she went on until each laptop ran on battery power. Her amusement soon faded, and so she wandered into the playroom for a ball pit party for one.

Cross-Section of Discipline

In this article, we'll briefly discuss a cross-section of several related concepts within the cloud engineering discipline.

  1. Chaos Engineering: The practice of deliberately introducing failures and unpredictable conditions into a system to test its resilience and reliability by simulating real-world disruptions.
  2. Site Reliability Engineering: The application of software engineering principles to IT operations to create scalable and highly reliable systems. The primary focus is on automation, monitoring, incident response, and performance optimization to ensure services remain available and within agreed service level objective targets.
  3. Distributed Systems Design: The focus is on building systems that operate efficiently and consistently across multiple machines, handling failures gracefully while maintaining high availability and performance.
  4. Cloud-Native Storage: In modern computing, cloud-native storage ensures that data remains available and resilient across distributed nodes through technologies like Ceph, Rook, and persistent volume claims (within the context of Kubernetes).
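
To make the chaos engineering concept concrete: tools such as Chaos Mesh let you declare failure experiments as Kubernetes resources. The manifest below is a sketch (the name, namespace, and target label are illustrative, and field details may vary by Chaos Mesh version) that kills one random pod matching a label, simulating exactly the kind of disruption the gremlin provided for free.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: random-pod-kill      # illustrative name
  namespace: chaos-testing
spec:
  action: pod-kill           # terminate the selected pod
  mode: one                  # pick a single random victim
  selector:
    namespaces:
      - default
    labelSelectors:
      app: example           # target pods carrying this label

A resilient system should absorb this without user-visible impact; if it can't, you've found a weakness before production traffic did.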

Evolution of Compute Orchestration

As companies have embraced technology, the infrastructure spanning their environments has grown significantly. As an industry, we've moved through several infrastructure phases:

  1. Mainframe: Very few, large machines that run many workloads
  2. Client-Server Era: Individual physical machines that each run a dedicated workload
  3. Virtual Machines: Entire tenant operating systems running under host operating systems on hypervisors (VMware)
  4. Containerization: OS-level virtualization using abstractions such as kernel namespaces and cgroups (containerd, Docker)
  5. Container Orchestration: The orchestration of containers across datacenters and clouds (Kubernetes, Docker Swarm)

Each of these phases introduced new solutions and new problems to overcome. The most recent transition, to container orchestration platforms, has made it possible to create fault-tolerant workloads that are automatically migrated when an outage occurs. Kubernetes and similar systems provide self-healing capabilities, but workload resilience still depends on how applications are designed (distributed state, persistent storage handling). Now that we're in this latest container orchestration phase, let's dig into how configuring these environments can make us more resistant to chaotic events.
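
The self-healing capability comes from Kubernetes's declarative model: you state a desired number of replicas, and controllers continuously reconcile toward it. As a minimal sketch (names are illustrative), a Deployment like the following keeps three copies of a workload running; if a node fails, its pods are rescheduled elsewhere, provided nothing pins them to that node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  replicas: 3                # desired state: three running copies
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: example-container
        image: nginx:latest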

Configuration Prevents Migration

You may be tempted to believe that as long as you use Kubernetes in your company, you will automatically benefit from having workloads jump between hosts seamlessly whenever an underlying system component becomes unavailable, whether due to network connectivity, power, patching, etc. But beware: there are common implementation mistakes that prevent workloads from migrating between cluster nodes.

hostPath Storage

The most straightforward way to attach storage to a running container workload is to borrow it from one of the nodes Kubernetes runs on. The hostPath storage implementation is often paired with node affinity; this ensures pod replacements are scheduled onto the same host machine that still contains the pod's application data.

Here's an example implementation of hostPath storage, which persists data between pod recreation events.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  nodeSelector:
    role: sandwiches  # This can be any key/value pair
  containers:
  - name: example-container
    image: nginx:latest
    volumeMounts:
    - mountPath: /data
      name: host-storage
  volumes:
  - name: host-storage
    hostPath:
      path: /var/opt/my_app
      type: Directory

Apply the label to the relevant node. Assume our node name is dell-04:

$ kubectl label node dell-04 role=sandwiches # Use key/value pair

While simple to implement, this approach is sorely lacking in data resilience. Every time example-pod needs a machine to run on, Kubernetes will only be able to schedule it onto dell-04! We should take this a step further to protect against inevitable chaos. In a bare-metal cluster, I reach first for a Rook-Ceph implementation.

Upgrading from hostPath to Rook-Ceph

You may already be familiar with Ceph, which provides a scalable and distributed storage system designed for high availability and fault tolerance.

Rook makes Ceph Cluster management Kubernetes-friendly by automating the deployment, configuration, and scaling of Ceph clusters within k8s. Integrating and operating cloud-native storage within on-prem environments is easier than ever.
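
Before a workload can claim Rook-Ceph storage, the cluster needs a block pool and a StorageClass that provisions from it. The abridged sketch below follows the shape of the upstream Rook examples (a production StorageClass needs additional CSI secret parameters; consult the Rook documentation for the full set). Replicating data across three hosts is what lets a volume survive any single node going dark.

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host        # spread replicas across distinct hosts
  replicated:
    size: 3                  # keep three copies of every block
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete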

At the time of the gremlin attack, my home cluster was using Rook-Ceph-provided storage. Below is an example Kubernetes manifest for the Rook-Ceph PVC (PersistentVolumeClaim). Note that there's no need for a formal PV (PersistentVolume) manifest, as that part is now handled automatically by Rook-Ceph.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-block-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: rook-ceph-block

And here's our example k8s pod manifest again, with hostPath-provided storage swapped for Rook-Ceph-provided storage. The nodeSelector section has been removed entirely.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: example-container
    image: nginx:latest
    volumeMounts:
    - mountPath: /data
      name: ceph-storage
  volumes:
  - name: ceph-storage
    persistentVolumeClaim:
      claimName: ceph-block-pvc

There are, of course, legitimate uses for nodeSelector and hostPath. nodeSelector can be leveraged to schedule pods onto specific nodes containing GPUs or specialized storage. hostPath allows Kubernetes workloads to access host filesystems for configurations, logs, or persistent storage that exists outside Kubernetes volumes. However, these should be considered the exception, not the rule, for providing durable storage to pods.
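
As an example of a legitimate nodeSelector use, here's a sketch of pinning a GPU workload to labeled GPU nodes (the label, pod name, and image are illustrative; the nvidia.com/gpu resource assumes the NVIDIA device plugin is installed):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod              # illustrative name
spec:
  nodeSelector:
    gpu: "true"              # label you'd apply to GPU-equipped nodes
  containers:
  - name: trainer
    image: example/trainer:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1    # request one GPU via the device plugin

Here the scheduling constraint is the point, unlike with hostPath storage, where it's an unfortunate side effect.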

Chaos Ensues

So what happened with the gremlin? Did the power plug pulls ruin everything?

No! To my amazement, nearly all the workloads migrated during this live, accidental test of our resilience. The pods found new homes on the cluster's single-board computers as the laptop batteries ran down. The only actual failure recorded was the media server. That's right; the only service impacted was the one the gremlin herself was consuming. The AppleTV stream from the home-hosted Jellyfin in the adjacent room shut down, and she could no longer reminisce about the good old days of being 90-something days younger.

Blameless Postmortem

It's time to put on our Site Reliability Engineering hats and review the event! When interviewed, the gremlin shared that she had found the door to the bedroom containing the cluster open. While I usually keep it closed, I had earlier been drawn into the room to handle an unrelated power issue and left the door open.

As physical access is the first security layer of any system, I've since moved the cluster under the stairs, routing Ethernet there, for better physical security and cooling. The cluster was also lifted off the floor onto a small rack just big enough for my servers.
