
Chaos Engineering with Toddlers and Kubernetes
Chaos Engineering on Kubernetes, a real-life scenario
When designing infrastructure, we often focus on theoretical failures such as power loss, network changes, or misconfigured deployments. But what happens when chaos strikes in an unexpected form? This article explores key engineering principles around Site Reliability Engineering, chaos engineering, and the value of resilient storage. We do this through a real-world (and surprisingly effective) test conducted by an unlikely agent of chaos: a toddler.
The eldest gremlin that lurks in my home is still a young human. She was not yet 3 years of age at the time of this story and was still largely dependent on her caretakers for supplies, shelter, nutrition, crackers in the shape of a fish, and occasional video entertainment. Her matriarch left her to watch a family home video while her younger sister was fetched from a nap. Alone now for only minutes, the gremlin grew restless after being left on the couch. She began to wander into other rooms in my basement. Since the HVAC unit and adjacent bathroom failed to pique her interest, she wandered into the guest bedroom.
"Oh? What is this?" she must have thought.
There, on the floor, she found something curious. It was a pile of laptops, Raspberry Pis, and Libre single-board computers. She didn't know it, but she had stumbled across the family's handy-dandy Kubernetes installation. At the time of this discovery her often-declared favorite color was blue. What a coincidence this was, as blue also happened to be the color of the Dell laptop power plugs, glowing there in the otherwise dark guest bedroom.
Curious, the gremlin pulled the power plug from the laptop and held it in her hand. She followed the other end until it went to a box where she couldn't trace the cable further. So she pulled a second blue plug and a third, and she went on until each laptop ran on battery power. Her amusement soon faded, and so she wandered into the playroom for a ball pit party for one.
In this article, we'll briefly discuss a cross-section of several related concepts within the cloud engineering discipline.
As companies have embraced technology, the infrastructure spanning their environments has grown significantly. As an industry, we've moved through several infrastructure phases: physical servers in on-premises data centers, virtual machines, containers, and now container orchestration platforms.
Each of these phases brought about new solutions and problems to overcome. The most recent transition to container orchestration platforms has brought about the ability to create fault-tolerant workloads that can be automatically migrated when an outage occurs. Kubernetes and similar systems provide self-healing capabilities, but workload resilience still depends on how applications are designed (distributed state, persistent storage handling). Now that we're in this latest Container Orchestration phase, let us dig into how configuring these types of environments can make us more resistant to chaotic events.
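As a minimal sketch of that baseline self-healing behavior, consider a stateless Deployment with more than one replica; when a node disappears, the scheduler simply recreates the missing pods on any healthy node (the names and image below are purely illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web
spec:
  replicas: 2                  # enough copies to survive the loss of a single node
  selector:
    matchLabels:
      app: example-web
  template:
    metadata:
      labels:
        app: example-web
    spec:
      containers:
        - name: web
          image: nginx:latest
          ports:
            - containerPort: 80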
You may be tempted to believe that as long as you use Kubernetes in your company, you will automatically benefit from having workloads jump between hosts seamlessly whenever there's unavailability with an underlying system component, such as network connectivity, power, patching, etc. But beware, as there are common mistakes in implementation that prevent workloads from migrating between cluster nodes.
The most straightforward way to attach storage to your running container workload is to borrow from one of the nodes you're running Kubernetes on. The hostPath storage implementation is often paired with node affinity (or, as below, its simpler cousin nodeSelector); this ensures pod replacements are scheduled onto the same host machine that still contains the pod's application data.
Here's an example of hostPath storage, which persists data between pod recreation events.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  nodeSelector:
    role: sandwiches # This can be any key/value pair
  containers:
    - name: example-container
      image: nginx:latest
      volumeMounts:
        - mountPath: /data
          name: host-storage
  volumes:
    - name: host-storage
      hostPath:
        path: /var/opt/my_app
        type: Directory
Apply the label to the relevant node. Assume our node name is dell-04:
$ kubectl label node dell-04 role=sandwiches # Use key/value pair
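To confirm the label stuck and the pod landed where expected, the usual kubectl checks apply (node and pod names match the example above):
$ kubectl get node dell-04 --show-labels
$ kubectl get pod example-pod -o wide   # the NODE column should read dell-04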
While this is simple to implement, this approach is sorely lacking in data resilience. Every time example-pod needs a machine to run on, Kubernetes will only be able to schedule it onto dell-04! This should be taken a step further to protect against inevitable chaos. In a bare-metal cluster, I reach first for a rook-ceph implementation.
You may already be familiar with Ceph, which provides a scalable and distributed storage system designed for high availability and fault tolerance.
Rook makes Ceph cluster management Kubernetes-friendly by automating the deployment, configuration, and scaling of Ceph clusters within k8s, making it easier than ever to integrate and operate cloud-native storage in on-prem environments.
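For context, the block-storage side of a Rook-Ceph deployment is typically expressed as a CephBlockPool plus a StorageClass. The sketch below is adapted from Rook's upstream examples and is only illustrative: the pool name, replica count, and rook-ceph namespace are assumptions, and the CSI secret parameters from Rook's full example are omitted for brevity.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool              # assumed pool name
  namespace: rook-ceph           # assumed operator namespace
spec:
  failureDomain: host            # spread replicas across hosts so one lost node loses no data
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block          # referenced by the PVC below
provisioner: rook-ceph.rbd.csi.ceph.com   # <operator-namespace>.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete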
At the time of the gremlin attack, my home cluster was utilizing rook-ceph provided storage. Below is an example Kubernetes manifest for the rook-ceph PVC (PersistentVolumeClaim). Remember there's no need for a formal PV (PersistentVolume) manifest, as rook-ceph provisions one dynamically.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: rook-ceph-block
And here's our example pod manifest again, but with hostPath provided storage swapped for rook-ceph provided storage. Note that the nodeSelector section is removed entirely; the pod no longer needs to chase its data onto a particular node.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
    - name: example-container
      image: nginx:latest
      volumeMounts:
        - mountPath: /data
          name: ceph-storage
  volumes:
    - name: ceph-storage
      persistentVolumeClaim:
        claimName: ceph-block-pvc
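Once both manifests are applied, the dynamically provisioned volume can be observed without ever having written a PV manifest (resource names match the examples above):
$ kubectl get pvc ceph-block-pvc        # STATUS should read Bound
$ kubectl get pv                        # shows a generated pvc-<uuid> volume using the rook-ceph-block class
$ kubectl get pod example-pod -o wide   # with no nodeSelector, any schedulable node may host the pod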
There are, of course, legitimate uses for nodeSelector and hostPath. nodeSelector can be leveraged to schedule pods onto specific nodes containing GPUs or specialized storage, and hostPath allows Kubernetes workloads to peek into a node's filesystem for configurations, logs, or persistent storage that exists outside Kubernetes volumes. However, these should be considered the exception, not the rule, for providing durable storage to pods.
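As a hypothetical illustration of the GPU case, a nodeSelector can pin a pod to labeled GPU nodes; the label key below is an assumption, since real clusters usually get such labels from node feature discovery or the NVIDIA GPU Operator:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example-pod
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"   # assumed label; substitute whatever your GPU nodes actually carry
  containers:
    - name: example-container
      image: nginx:latest            # placeholder image for illustration
      resources:
        limits:
          nvidia.com/gpu: 1          # extended resource exposed by the NVIDIA device plugin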
So what happened with the gremlin? Did the power plug pulls ruin everything?
No! To my amazement, nearly all the workloads migrated during this live, accidental test of our resilience. The pods found new homes on the cluster's single-board computers as the laptop batteries started running out. The only actual failure recorded was the media server. That's right; the only service impacted was the one the gremlin herself was consuming. The AppleTV stream from the home-hosted Jellyfin in the adjacent room shut down, and she could no longer reminisce about the good old days of being 90-something days younger.
It's time to put on our Site Reliability Engineering hats and review the event! When interviewed, the gremlin shared that she had found the door to the bedroom containing the cluster open. While I usually keep it closed, I had been drawn into the room to handle an unrelated power issue.
As physical access is the first security layer in any system, I've since routed Ethernet under the stairs, giving the cluster better physical security and cooling. The machines were also lifted off the floor into a small rack just big enough for my servers.