NetChaos: Declarative Network Chaos with eBPF and Kubernetes
There’s a particular kind of confidence you only earn by breaking your own system on purpose. You can read all the retry logic and timeout configs you like, but until you’ve actually watched packets vanish and seen whether your service shrugs or topples, you don’t really know. That’s chaos engineering — and most tooling for it at the network layer is clumsier than it should be. NetChaos is my attempt at making network chaos feel native to Kubernetes: declarative to ask for, and fast enough to be invisible.
Chaos you can kubectl apply
The core idea is to treat a network fault like any other Kubernetes object. You don’t SSH into a node and fiddle with tc; you write a small YAML manifest describing the disruption you want, and the cluster makes it so:
apiVersion: nettools.io/v1alpha1
kind: FaultInjection
metadata:
  name: icmp-dropper
spec:
  target:
    podSelector:
      matchLabels: { app: my-app }
  protocol: ICMP
  action: DROP
  match:
    seqMod: 5 # drop every 5th matching packet
kubectl apply introduces the fault; kubectl delete heals it. The disruption lives in version control next to everything else, which means chaos becomes something you can review, diff, and replay in CI rather than a one-off experiment someone ran by hand and forgot.
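To make the shape of that manifest concrete, here is a minimal sketch of Go types that could mirror its spec, plus the sort of validation a controller might run before acting on it. The type names, field layout, and validation rules are assumptions for illustration, not the project's actual API; a real kubebuilder project would also carry metadata types and generated code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// FaultInjectionSpec mirrors the spec block of the manifest above.
// All names here are hypothetical and exist only for this sketch.
type FaultInjectionSpec struct {
	Target   Target `json:"target"`
	Protocol string `json:"protocol"`
	Action   string `json:"action"`
	Match    Match  `json:"match"`
}

type Target struct {
	PodSelector PodSelector `json:"podSelector"`
}

type PodSelector struct {
	MatchLabels map[string]string `json:"matchLabels"`
}

type Match struct {
	SeqMod int `json:"seqMod"`
}

// Validate rejects specs this sketch would not know how to act on.
func (s FaultInjectionSpec) Validate() error {
	if len(s.Target.PodSelector.MatchLabels) == 0 {
		return errors.New("target.podSelector.matchLabels must not be empty")
	}
	if s.Protocol != "ICMP" {
		return fmt.Errorf("unsupported protocol %q", s.Protocol)
	}
	if s.Action != "DROP" {
		return fmt.Errorf("unsupported action %q", s.Action)
	}
	if s.Match.SeqMod < 1 {
		return errors.New("match.seqMod must be >= 1")
	}
	return nil
}

// parseSpec decodes the JSON form of a spec and validates it.
func parseSpec(raw []byte) (FaultInjectionSpec, error) {
	var s FaultInjectionSpec
	if err := json.Unmarshal(raw, &s); err != nil {
		return s, err
	}
	return s, s.Validate()
}

func main() {
	raw := []byte(`{"target":{"podSelector":{"matchLabels":{"app":"my-app"}}},` +
		`"protocol":"ICMP","action":"DROP","match":{"seqMod":5}}`)
	spec, err := parseSpec(raw)
	fmt.Println(spec.Match.SeqMod, err)
}
```

Rejecting a bad spec up front keeps a typo in YAML from ever reaching the kernel, which matters for a tool whose whole job is to break things.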
The control plane: an operator that chases desired state
Underneath, NetChaos is a Kubernetes operator, built with kubebuilder. A Custom Resource Definition teaches the cluster a new noun — a fault rule — and a reconcile loop is responsible for closing the gap between what you asked for and what’s actually running:
func (r *IcmpDropRuleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the desired rule, then make the node's data plane match it:
	//   rule exists  -> ensure the eBPF drop program is loaded & configured
	//   rule deleted -> detach it and let traffic flow again
	return ctrl.Result{}, nil
}
This is the part of Kubernetes I find genuinely elegant. You declare a fact about the world, and a controller works tirelessly to keep it true — the same machinery that schedules pods, repurposed to schedule failure.
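The "close the gap" idea can be sketched without any Kubernetes machinery at all. The example below is a standard-library-only model, not the controller-runtime API: Rule and reconcile are hypothetical names, and a real reconciler handles one object per call rather than diffing a whole node at once. It compares the rules you asked for with the rules currently loaded and returns the actions needed to converge.

```go
package main

import (
	"fmt"
	"sort"
)

// Rule is a hypothetical, stripped-down fault rule: just the seqMod setting.
type Rule struct{ SeqMod int }

// reconcile compares desired rules with what is loaded on the node and
// returns the actions needed to converge, sorted for stable output.
func reconcile(desired, loaded map[string]Rule) []string {
	var actions []string
	for name, want := range desired {
		have, ok := loaded[name]
		switch {
		case !ok:
			actions = append(actions, "load "+name)
		case have != want:
			actions = append(actions, "update "+name)
		}
	}
	for name := range loaded {
		if _, ok := desired[name]; !ok {
			actions = append(actions, "detach "+name)
		}
	}
	sort.Strings(actions)
	return actions
}

func main() {
	desired := map[string]Rule{"icmp-dropper": {SeqMod: 5}}
	loaded := map[string]Rule{"old-rule": {SeqMod: 2}}
	fmt.Println(reconcile(desired, loaded))
}
```

Because the function only looks at current state, running it again after the actions succeed yields nothing to do. That idempotence is what makes a reconcile loop safe to retry.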
The data plane: failure at kernel speed with eBPF
A chaos tool that is slow distorts the very thing it is trying to measure: the fault you injected gets lost in the instrument’s own overhead. So the actual packet-dropping happens in the kernel, with eBPF attached at the earliest possible hook (XDP on ingress, or tc for egress, since XDP only sees received packets). A tiny verified program inspects each packet and decides, in nanoseconds, whether it lives or dies. That foundation came out of earlier XDP experiments of mine: dropping ICMP at the driver level before the kernel network stack ever sees it.
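The per-packet decision is small enough to model in userspace. The sketch below is plain Go, not the eBPF C that would actually run in the kernel, and classify and Verdict are names invented for this example. It applies the same checks an ICMP-dropping XDP program would: bounds first, then EtherType, then the IPv4 protocol field. It assumes untagged Ethernet frames and ignores VLAN headers and IPv6.

```go
package main

import "fmt"

// Verdict names echo XDP_PASS and XDP_DROP; the numeric values are not the kernel's.
type Verdict int

const (
	Pass Verdict = iota
	Drop
)

const (
	ethHeaderLen  = 14 // destination MAC + source MAC + EtherType
	minIPv4Header = 20
	etherTypeIPv4 = 0x0800
	ipProtoICMP   = 1
)

// classify returns Drop for IPv4 ICMP frames and Pass for everything else,
// including frames too short to parse safely.
func classify(frame []byte) Verdict {
	if len(frame) < ethHeaderLen+minIPv4Header {
		return Pass
	}
	etherType := uint16(frame[12])<<8 | uint16(frame[13])
	if etherType != etherTypeIPv4 {
		return Pass
	}
	// The protocol field is byte 9 of the IPv4 header.
	if frame[ethHeaderLen+9] == ipProtoICMP {
		return Drop
	}
	return Pass
}

func main() {
	frame := make([]byte, ethHeaderLen+minIPv4Header)
	frame[12], frame[13] = 0x08, 0x00 // EtherType: IPv4
	frame[ethHeaderLen] = 0x45        // version 4, header length 5 words
	frame[ethHeaderLen+9] = ipProtoICMP
	fmt.Println(classify(frame) == Drop)
}
```

The length check comes first for the same reason it does in real eBPF code: the verifier refuses to load a program that reads packet bytes without proving they are in bounds.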
My favorite detail is the seqMod knob: instead of dropping a random percentage, NetChaos can drop every Nth matching packet. Random loss is realistic but maddening to debug; deterministic loss is reproducible. “Drop every fifth ICMP” gives you a failure you can trigger on demand and assert against in a test — chaos with a repeatable seed.
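The every-Nth rule comes down to a counter and a modulo. Here is a minimal Go sketch of that logic; SeqDropper is a hypothetical name, and in the kernel the counter would more plausibly live in a BPF map than in a struct field. Two further assumptions: counting starts at one, so the fifth packet is the first to go, and a seqMod of 0 disables dropping.

```go
package main

import "fmt"

// SeqDropper drops every Nth matching packet.
// It is not safe for concurrent use; this sketch models a single packet path.
type SeqDropper struct {
	seqMod uint64
	seen   uint64
}

// NewSeqDropper returns a dropper that drops every seqMod-th packet.
func NewSeqDropper(seqMod uint64) *SeqDropper {
	return &SeqDropper{seqMod: seqMod}
}

// ShouldDrop counts one matching packet and reports whether it is the Nth.
func (d *SeqDropper) ShouldDrop() bool {
	if d.seqMod == 0 {
		return false // assumption: 0 means "never drop"
	}
	d.seen++
	return d.seen%d.seqMod == 0
}

func main() {
	d := NewSeqDropper(5)
	for i := 1; i <= 10; i++ {
		if d.ShouldDrop() {
			fmt.Println("dropped packet", i)
		}
	}
}
```

With seqMod set to 5, packets 5 and 10 are dropped on every run, so a test can send ten pings and assert that exactly eight replies come back.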
Where it is, and why I like the shape
NetChaos is an evolving framework: the operator scaffold and the fault CRD are in place, and the work in flight is wiring the reconcile loop to the eBPF data plane on each node. But the architecture is the point, and it’s a clean seam — a declarative control plane in Kubernetes that decides what failure should exist, and a kernel-level data plane in eBPF that delivers it without slowing the host. Most of resilience engineering is really about putting failure where you can see it and govern it. Making that failure a first-class, version-controlled Kubernetes resource felt like the right place to put it.
The code is on GitHub at github.com/arazmj/netchaos.