May 24, 2026 · 5 min read

Building Sentry: A Secure Job Runner with cgroups and mTLS

Every so often you run into a problem that sounds simple right up until you start building it. Mine was this: I wanted to hand a server a command, have it run that command under strict CPU, memory, and disk limits, watch its output live, and kill it on demand — all without trusting whoever asked. That turned into Sentry, a small job runner I wrote in Go. This is a tour of what I learned putting it together.

The shape of it is deliberately small: a gRPC server that owns the jobs, and a command-line client that talks to it. The server can start a job, kill it, report whether it’s still alive, list everything that’s running, and stream a job’s logs back to you in real time. Everything in between — the resource limits, the process supervision, the log fan-out — is where the interesting parts live.

cgroups v2 is just a filesystem

The resource-limiting turned out to be the friendliest part, and that surprised me. There’s no exotic API to learn. Under cgroups v2 a control group is a directory under /sys/fs/cgroup, and you configure it by writing plain strings into files. Putting a job in a box is almost anticlimactic:

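cgroupPath := "/sys/fs/cgroup/sentry-run-" + jobID
if err := os.MkdirAll(cgroupPath, 0755); err != nil {
	return err
}

// pid is the job's PID as a decimal string. Write errors are dropped here for
// brevity; real code should check them, or a rejected value goes unnoticed.
os.WriteFile(cgroupPath+"/cgroup.procs", []byte(pid), 0644)
os.WriteFile(cgroupPath+"/cpu.max",      []byte("50000 100000"), 0644) // 50% of one core
os.WriteFile(cgroupPath+"/memory.max",   []byte("268435456"),    0644) // 256 MiB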

Write the PID into cgroup.procs and the kernel moves the process — and anything it later forks — into the group. cpu.max is a quota/period pair (here, 50 ms of CPU every 100 ms), and memory.max is a hard byte ceiling. Once they’re set, the kernel does the enforcing. There’s no daemon to poll, no accounting loop I had to write myself. That’s the whole appeal.
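
To make the quota/period arithmetic concrete, here is a minimal sketch of the same idea with error handling, not Sentry’s actual code. The names cpuMax and applyLimits and the root parameter are hypothetical; root stands in for /sys/fs/cgroup so the example can run against a temporary directory without root privileges. On a real host, cpu.max and memory.max only appear in the new directory if the cpu and memory controllers are enabled in the parent’s cgroup.subtree_control.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// cpuMax formats a cpu.max value as "quota period", both in microseconds.
// percent is the share of one core, so 50 means half a core.
func cpuMax(percent, periodUs int) string {
	return fmt.Sprintf("%d %d", periodUs*percent/100, periodUs)
}

// applyLimits creates the group directory under root and writes the control
// files, stopping at the first error. The limits are written before the PID
// so they are already in place when the process joins the group.
func applyLimits(root, name string, pid, cpuPercent int, memBytes int64) (string, error) {
	dir := filepath.Join(root, name)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	files := []struct{ name, value string }{
		{"cpu.max", cpuMax(cpuPercent, 100000)},
		{"memory.max", strconv.FormatInt(memBytes, 10)},
		{"cgroup.procs", strconv.Itoa(pid)},
	}
	for _, f := range files {
		if err := os.WriteFile(filepath.Join(dir, f.name), []byte(f.value), 0o644); err != nil {
			return "", fmt.Errorf("write %s: %w", f.name, err)
		}
	}
	return dir, nil
}

func main() {
	// A temp directory stands in for /sys/fs/cgroup so this runs unprivileged.
	root, err := os.MkdirTemp("", "cgroup-sketch")
	if err != nil {
		fmt.Println("mkdtemp:", err)
		return
	}
	defer os.RemoveAll(root)

	dir, err := applyLimits(root, "sentry-run-demo", os.Getpid(), 50, 256<<20)
	if err != nil {
		fmt.Println("apply:", err)
		return
	}
	b, _ := os.ReadFile(filepath.Join(dir, "cpu.max"))
	fmt.Printf("cpu.max = %q\n", b)
}
```

Checking every write matters more than it looks: the kernel rejects malformed values with an error, and ignoring that error makes a limit that was never applied look like one that was.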

The I/O limit that ate an afternoon

Disk I/O was where I lost the most time. Unlike CPU and memory, io.max doesn’t take a path — it takes a block device, addressed by its major and minor number, like 8:0 wbps=1048576. You can’t hardcode that, because it depends on which disk the job is writing to. The fix was to stat() the filesystem and pull the device number out of the result:

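var st syscall.Stat_t
if err := syscall.Stat("/", &st); err != nil {
	return err
}
// On Linux the low 12 bits of the major number sit in bits 8-19 of st_dev.
major := (st.Dev >> 8) & 0xfff
ioLimit := fmt.Sprintf("%d:0 wbps=%s rbps=%s", major, writeBps, readBps)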

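The minor number gets pinned to 0 on purpose. I/O throttling in cgroups applies to the whole block device, not an individual partition, so addressing a partition’s exact minor number just earns you a confusing no-op: the limit looks set but never fires. One caveat: minor 0 is the whole disk only for the first device under a given major (sda is 8:0, but a second disk, sdb, is 8:16), so a multi-disk host needs to look up the partition’s parent device instead of assuming 0. That’s the kind of detail no API surface warns you about; you only find it by watching a 1 MB/s cap do absolutely nothing.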
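
For reference, the device-number decoding can be sketched as below. This is an illustration under assumptions rather than Sentry’s code: devMajor, devMinor, and ioMaxLine are hypothetical helper names, and the bit layout is the one glibc’s major() and minor() macros use on Linux. Printing the decoded minor next to the line that would be written makes it easy to see when the filesystem lives on a partition rather than a whole disk.

```go
package main

import (
	"fmt"
	"syscall"
)

// devMajor extracts the major number from a Linux dev_t, using the same
// bit layout as glibc's major() macro.
func devMajor(dev uint64) uint64 {
	return ((dev >> 8) & 0xfff) | ((dev >> 32) & 0xfffff000)
}

// devMinor extracts the minor number from a Linux dev_t, using the same
// bit layout as glibc's minor() macro.
func devMinor(dev uint64) uint64 {
	return (dev & 0xff) | ((dev >> 12) & 0xffffff00)
}

// ioMaxLine formats one io.max entry with write and read limits in bytes/s.
func ioMaxLine(major, minor, writeBps, readBps uint64) string {
	return fmt.Sprintf("%d:%d wbps=%d rbps=%d", major, minor, writeBps, readBps)
}

func main() {
	var st syscall.Stat_t
	if err := syscall.Stat("/", &st); err != nil {
		fmt.Println("stat:", err)
		return
	}
	dev := uint64(st.Dev)
	fmt.Printf("root filesystem is on device %d:%d\n", devMajor(dev), devMinor(dev))

	// Minor pinned to 0, as in the post; only valid for the first disk
	// under this major number.
	fmt.Println(ioMaxLine(devMajor(dev), 0, 1<<20, 1<<20))
}
```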

Killing the whole tree, not just the process

Before the process even starts, I set Setpgid so the job and all of its children share a single process group. That one flag, combined with sending the kill signal to the group (a negative PID) rather than to the lone process, is what makes “kill the job” actually mean “kill the whole tree” instead of orphaning a pile of grandchildren that keep chewing through your limits. For a bit of filesystem isolation there’s an optional chroot, wired straight through SysProcAttr.Chroot: cheap, not bulletproof, but enough to keep a job from wandering the host’s filesystem.
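
A rough sketch of the process-group trick, using only the standard library. The helper names startInGroup and killGroup are made up for illustration, and the sh command is just a stand-in for a job that forks a child. With Setpgid set and no explicit Pgid, the child becomes the leader of a new group whose ID equals its own PID, so signalling the negated PID reaches every member.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// startInGroup starts the command as the leader of a fresh process group.
// Children it forks inherit that group unless they deliberately leave it.
func startInGroup(name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// killGroup sends SIGKILL to every process in the job's group.
// A negative PID tells kill(2) to address the whole process group.
func killGroup(cmd *exec.Cmd) error {
	return syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
}

func main() {
	// The shell backgrounds one sleep and runs another in the foreground.
	cmd, err := startInGroup("sh", "-c", "sleep 30 & sleep 30")
	if err != nil {
		fmt.Println("start:", err)
		return
	}
	if err := killGroup(cmd); err != nil {
		fmt.Println("kill:", err)
		return
	}
	// Wait reaps the leader; its error reports that it died from a signal.
	fmt.Println("job exited:", cmd.Wait())
}
```

A process that calls setsid or changes its own group can still escape this, which is one reason the cgroup is a useful second line of defence.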

Streaming logs without missing the start

The piece I’m happiest with is log streaming. A job’s stdout and stderr are each read in their own goroutine, in 32 KB chunks. Every chunk does two things: it gets appended to an in-memory history buffer, and it gets pushed to a list of subscriber callbacks.

That little bit of bookkeeping pays off in the client experience. When you run sentry logs ten seconds after a job started, the server replays the buffered history first and then drops you into the live feed. You don’t miss the beginning, and you didn’t have to have been watching from the start. It’s a tiny publish/subscribe system, maybe forty lines, but it’s what makes the CLI feel like tail -f over the network rather than a polling loop.
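
The replay-then-follow behaviour can be sketched in a few dozen lines. This is an illustrative version, not the code from the repository: the LogHub type and its methods are hypothetical names. The key design choice is holding one lock across both the history replay and the subscriber registration, so a chunk published at that moment is neither missed nor delivered twice.

```go
package main

import (
	"fmt"
	"sync"
)

// LogHub keeps every chunk a job has produced and fans new chunks out
// to the currently registered subscriber callbacks.
type LogHub struct {
	mu      sync.Mutex
	history [][]byte
	subs    map[int]func([]byte)
	nextID  int
}

func NewLogHub() *LogHub {
	return &LogHub{subs: make(map[int]func([]byte))}
}

// Publish records a chunk and delivers it to every subscriber.
// The chunk is copied because readers typically reuse their buffer.
func (h *LogHub) Publish(chunk []byte) {
	c := append([]byte(nil), chunk...)
	h.mu.Lock()
	defer h.mu.Unlock()
	h.history = append(h.history, c)
	for _, fn := range h.subs {
		fn(c)
	}
}

// Subscribe replays the buffered history to fn, then registers fn for
// live chunks. It returns a function that removes the subscription.
// Callbacks run under the lock, so they must be quick and must not
// call back into the hub.
func (h *LogHub) Subscribe(fn func([]byte)) (cancel func()) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, c := range h.history {
		fn(c)
	}
	id := h.nextID
	h.nextID++
	h.subs[id] = fn
	return func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		delete(h.subs, id)
	}
}

func main() {
	hub := NewLogHub()
	hub.Publish([]byte("written before anyone was watching\n"))

	cancel := hub.Subscribe(func(b []byte) { fmt.Print(string(b)) })
	defer cancel()

	hub.Publish([]byte("written while subscribed\n"))
}
```

In a streaming server each callback would more likely hand the chunk to a buffered channel drained by the RPC handler, so that one slow client cannot stall the goroutine reading the job’s output.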

The certificate is the identity

All of this sits behind mutual TLS. The server runs with RequireAndVerifyClientCert, so a connection without a certificate signed by my CA never gets far enough to say hello. The nice side effect is that the certificate becomes the identity — there’s no separate user table, no password, no token to leak. Authorization keys off the client certificate’s Common Name, so deciding who’s allowed to start or kill jobs is just a question of which CN is on the cert. It’s TLS 1.3 only, with modern AEAD ciphers, and the handshake doubles as authentication for free.

One last surprise on the way out

Teardown had a final gotcha. You don’t remove a cgroup by deleting its files — you rmdir the directory, and only after the last process has left it. Try it while the job is still alive and you get EBUSY. So cleanup waits for both output streams to hit EOF (which is a reliable sign the process is gone), and only then removes the group’s directory.
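
One way to make that cleanup tolerant of timing is to retry the rmdir for a short while when the kernel answers EBUSY. This is a sketch of that idea under my own assumptions, not what Sentry does: removeCgroup and its attempts and delay parameters are hypothetical, and the demo removes an ordinary temporary directory rather than a real cgroup.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
	"time"
)

// removeCgroup removes a cgroup directory with rmdir(2), retrying while
// the kernel reports EBUSY because processes are still inside the group.
// A directory that is already gone counts as success.
func removeCgroup(path string, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = syscall.Rmdir(path)
		if err == nil || errors.Is(err, syscall.ENOENT) {
			return nil
		}
		if !errors.Is(err, syscall.EBUSY) {
			return err // any other failure will not fix itself by waiting
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("cgroup %s still busy after %d attempts: %w", path, attempts, err)
}

func main() {
	// An empty temp directory stands in for a drained cgroup.
	dir, err := os.MkdirTemp("", "sentry-run-")
	if err != nil {
		fmt.Println("mkdtemp:", err)
		return
	}
	if err := removeCgroup(dir, 5, 10*time.Millisecond); err != nil {
		fmt.Println("cleanup failed:", err)
		return
	}
	fmt.Println("removed", dir)
}
```

The retry is a safety net for cases where the output pipes close slightly before the last process has actually left the group.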

What I left on the table

There’s plenty I didn’t build. The limits are coarse, isolation stops at chroot rather than full Linux namespaces, and nothing survives a server restart. But the lesson that stuck with me is how much “container-like” behavior falls out of a handful of well-placed syscalls and some text files in /sys/fs/cgroup. You don’t need a runtime the size of Docker to put a hard ceiling on a process — you need a directory, a PID, and the patience to track down the right device number.

The code is on GitHub at github.com/arazmj/sentry if you’d like to poke at it.
