A few days ago I began writing a toy version of gVisor — Google’s userspace kernel — as a way of discovering, in the only way one truly discovers such things, how it actually intercepts a syscall. The project lives at github.com/mtclinton/mini-sentry, and its pitch, shorn of ornament, is this: if a sandboxed Linux program calls write(), the kernel is not to handle it. My Go code is to handle it. That is the shape of gVisor’s security story, and it is a shape worth taking seriously — a kernel exploit sitting inside write() is of no consequence whatever if the kernel never runs write() to begin with.
Most of the project was kinder to me than I had expected. PTRACE_SYSEMU catches every syscall before the kernel touches it, my Go handlers fill in a return value, and I resume the guest. A few hundred lines of this and I had a sandbox capable of running static Linux binaries against a fabricated filesystem. I was, for a pleasant interval, under the impression that I was most of the way done.
Then I tried to make signals work, and the next three phases of the project disappeared into them. What follows is an account of what I encountered there.
The Setup: Syscall Interception in One Page
Before signals intrude, the architecture has the clean lines of something that can be drawn without apology. PTRACE_SYSEMU is a ptrace operation that stops the tracee every time it is about to make a syscall, skips the real call, and permits the tracer to write a return value directly into RAX:
    // simplified loop
    for {
        unix.PtraceSyscall(pid, 0) // stands in for the PTRACE_SYSEMU resume request
        wait4(pid, &ws)            // block until the tracee stops at its next syscall
        regs := readRegs(pid)
        ret := sentry.HandleSyscall(regs)
        regs.Rax = uint64(ret) // the guest sees this as the syscall's return value
        writeRegs(pid, regs)
    }

If I want the kernel to handle something (say mmap, which touches page tables that cannot be forged from userspace), I rewind RIP by two bytes (the width of the syscall instruction) and step with PTRACE_SYSCALL instead. The kernel runs it; I read the result.
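The emulate-or-pass-through decision can be sketched as a pure function over a register snapshot. This is a minimal sketch and not mini-sentry's code: the Regs struct, the emulated set, and the step function are hypothetical names, and restoring RAX from orig_rax before re-executing is an assumption about what the rewind needs (at a syscall stop the kernel has already replaced RAX with -ENOSYS and kept the syscall number in orig_rax).

```go
package main

import "fmt"

// Regs is a hypothetical stand-in for the few ptrace registers this sketch needs.
type Regs struct {
	Rax, OrigRax, Rip uint64
}

const syscallInsnLen = 2 // the x86_64 syscall instruction is two bytes (0F 05)

// Linux x86_64 syscall numbers used in the example.
const (
	sysWrite  = 1
	sysMmap   = 9
	sysGetpid = 39
)

// emulated lists the syscalls this sketch answers itself (an illustrative set).
var emulated = map[uint64]bool{sysWrite: true, sysGetpid: true}

// step decides what to do at a syscall stop. For an emulated call it stores
// the handler's result in RAX. Otherwise it rewinds RIP onto the syscall
// instruction and restores the syscall number so the kernel can run it.
// The returned bool reports whether the kernel must execute the call.
func step(r Regs, handle func(nr uint64) int64) (Regs, bool) {
	if emulated[r.OrigRax] {
		r.Rax = uint64(handle(r.OrigRax))
		return r, false
	}
	r.Rip -= syscallInsnLen
	r.Rax = r.OrigRax
	return r, true
}

func main() {
	handler := func(nr uint64) int64 { return 1234 } // pretend handler result
	at := Regs{OrigRax: sysMmap, Rip: 0x401002}
	out, passthrough := step(at, handler)
	fmt.Printf("rip=%#x passthrough=%v\n", out.Rip, passthrough)
}
```

Keeping the decision free of ptrace calls makes it easy to unit-test; the real loop would follow a pass-through result with a PTRACE_SYSCALL resume.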
That is enough to emulate read, write, openat, getdents64, getpid, getuid, and a dozen of their relations. My sandbox could print “hello” and read fabricated files out of an in-memory VFS. This is the point at which, if the project were a novel, the protagonist would go for a walk and fail to notice the weather.
Phase 3a: A Mirror Without Effect
The first thing one learns about signals is that rt_sigaction(2) appears, on the surface, to be nothing more than a table assignment. The kernel keeps an array of sigaction structs, one per signal, and rt_sigaction(SIGUSR1, &act, NULL) writes into slot 10. Emulating this sounds, for a moment, as though it will not detain us long:
    func (s *Sentry) sysRtSigaction(pid int, sc SyscallArgs) uint64 {
        signum := int(sc.Args[0])
        buf := readFromChild(pid, sc.Args[1], sigactionSize)
        s.signals.SetAction(signum, decodeAction(buf))
        return 0
    }

The trouble is that this tracks state and does nothing else. The kernel still owns delivery. When something sends SIGUSR1, the kernel consults its own sigaction table, not mine, and runs whatever was there before I had begun to intercept anything at all. My mirror, for all its diligent bookkeeping, is a read model with no bearing on behaviour.
It was here, too, that I had to understand gVisor’s Task and ThreadGroup split, because I was going to have to reproduce it. Signal masks and pending queues are per-thread; the sigaction table is per-thread-group. gVisor expresses this with a Task struct for the per-thread half and a ThreadGroup for the shared one. I ended up with ThreadState and ThreadGroup doing the same offices. If one has ever wondered why kill(pid, sig) may pick any thread in the group while tkill targets precisely one, this is the reason: the routing predicate consults every per-thread mask in the group.
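The per-thread and per-group halves can be sketched as two small types. The field names below are hypothetical, and the bitmask encoding (bit sig-1 for signal sig) is an assumption borrowed from the kernel's sigset convention rather than from mini-sentry's source.

```go
package main

import "fmt"

// ThreadState holds the per-thread half: blocked mask and pending set.
type ThreadState struct {
	TID     int
	Blocked uint64 // bit (sig-1) set means sig is blocked for this thread
	Pending uint64 // signals queued for this thread specifically
}

// ThreadGroup holds the shared half: one handler table for every thread.
type ThreadGroup struct {
	TGID     int
	Handlers map[int]uint64 // signal number -> handler address
	Threads  map[int]*ThreadState
}

func sigBit(sig int) uint64 { return 1 << uint(sig-1) }

// Block adds sig to one thread's mask, as that thread's rt_sigprocmask would.
func (t *ThreadState) Block(sig int) { t.Blocked |= sigBit(sig) }

// IsBlocked reports whether sig is masked on this thread.
func (t *ThreadState) IsBlocked(sig int) bool { return t.Blocked&sigBit(sig) != 0 }

// AddThread registers a new thread that shares the group's handler table.
func (g *ThreadGroup) AddThread(tid int) *ThreadState {
	ts := &ThreadState{TID: tid}
	g.Threads[tid] = ts
	return ts
}

func main() {
	g := &ThreadGroup{TGID: 100, Handlers: map[int]uint64{}, Threads: map[int]*ThreadState{}}
	a, b := g.AddThread(100), g.AddThread(101)
	g.Handlers[10] = 0x401000 // one sigaction call, visible to both threads
	a.Block(10)               // only thread 100 masks SIGUSR1
	fmt.Println(a.IsBlocked(10), b.IsBlocked(10), g.Handlers[10] != 0)
}
```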
Phase 3a produced a mirror. It did not yet own a single delivery.
Phase 3b: Becoming the Kernel’s Signal Path
To own delivery, the Sentry must construct the signal frame itself. On x86_64 the frame is a particular data structure — struct rt_sigframe, some 1032 bytes in all — and it must be matched to the byte:
    +0x000 ucontext { flags, link, stack, mcontext (RAX..RIP..), sigmask }
    +0x1A8 fpstate  { 512-byte FXSAVE area }
    +0x3A8 siginfo  { si_signo, si_code, si_pid, si_uid, ... }
    +0x408 pretcode pointer to rt_sigreturn trampoline

The contract is as follows. One writes this structure onto the guest’s stack, sets RSP to point at it, sets RIP to the handler, and resumes the guest. The handler executes and returns with ret, which pops pretcode and lands in a small trampoline that issues rt_sigreturn. That syscall is the exact inverse of frame construction: it reads the frame back, restores the pre-handler register state, and returns the interrupted code to what it had been doing.
In mini-sentry, BuildRtSigframe writes the frame via process_vm_writev and sysRtSigreturn reads it back via process_vm_readv. The round-trip is perhaps 600 lines distributed across frame_amd64.go, deliver_amd64.go, and handlers_signals_amd64.go, and the greater part of it is byte-layout bookkeeping. One decides to own delivery; one signs, in the same motion, for matching the kernel’s struct offsets to the byte.
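One piece of that bookkeeping is choosing where on the guest stack the frame goes. The sketch below is an assumption about how such a placement could be computed, following the x86_64 conventions I am aware of (skip the 128-byte red zone, round down to 16 bytes, then subtract 8); frameAddr is a hypothetical helper, not mini-sentry's BuildRtSigframe.

```go
package main

import "fmt"

const (
	redZone   = 128  // x86_64 SysV red zone below RSP that must not be clobbered
	frameSize = 1032 // frame size quoted in this post
)

// frameAddr picks where to write a signal frame below the interrupted RSP.
// It skips the red zone, reserves the frame, then rounds down to 16 bytes
// and subtracts 8 so the handler starts with the alignment it would have
// immediately after a CALL instruction.
func frameAddr(rsp uint64) uint64 {
	sp := rsp - redZone - frameSize
	sp &^= 15
	return sp - 8
}

func main() {
	rsp := uint64(0x7ffd_1234_5678)
	fmt.Printf("interrupted rsp=%#x frame=%#x\n", rsp, frameAddr(rsp))
}
```

A handler installed with an alternate signal stack would need a different starting address; the sketch ignores that case.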
I got both halves of this written, wired up a guest test that did kill(getpid(), SIGUSR1) against a handler that atomically incremented a counter, ran it, and was answered with:
    runtime: newstack sp=0xc00000dbf8 stack=[0xc0000ce000, 0xc0000d6000]

Go’s stack guard, which I had not thought to trouble, was informing me that the stack pointer was outside the goroutine’s stack. This was the first of two bugs that cost me real hours, and the more instructive of the pair.
The Compiler’s Quiet Assistance
The guest test installed its handler like this:
    act := sigactionKernel{
        Handler:  reflect.ValueOf(sigusr1Handler).Pointer(),
        Flags:    saSigInfo | saRestorer,
        Restorer: reflect.ValueOf(restoreRT).Pointer(),
    }

sigusr1Handler was a pure-assembly function: no Go frame, no stack-growth probe, merely LOCK XADDL $1, sigCounter; RET. I had written it in assembly precisely because signal handlers cannot touch Go’s runtime: the kernel’s signal-entry contract is the SysV C ABI, but Go’s ABIInternal expects R14 to hold the current goroutine pointer, and R14 on signal entry is whatever the interrupted code happened to have left there. Any Go-level handler that so much as glances at a runtime primitive comes apart at once.
The trouble — and it was a trouble that hid itself well — was that reflect.ValueOf(fn).Pointer() did not return the address of my assembly. It returned the address of a compiler-generated wrapper. When a .s file defines sigusr1Handler and a .go file declares func sigusr1Handler(), the Go toolchain helpfully synthesises an ABIInternal shim:
    sigusr1Handler:
        PUSHQ BP
        MOVQ SP, BP
        CALL sigusr1Handler.abi0 ; the real asm
        POPQ BP
        RET

That PUSHQ BP and CALL shift RSP by sixteen bytes. For the counter-bumping handler it did not matter (the handler had no opinion on RSP), but the same kind of wrapper had been installed as the sa_restorer. When the handler returned, it popped pretcode into RIP and landed, not in the real restoreRT, but in the restorer’s wrapper:
    W_restore:
        PUSHQ BP              ; RSP -= 8
        MOVQ SP, BP
        CALL restoreRT.abi0   ; RSP -= 8 (pushes return addr)
        ; restoreRT.abi0: MOVQ $15, AX; SYSCALL

By the time the syscall instruction inside the real restoreRT actually fired, RSP was sixteen bytes below where my sysRtSigreturn expected to find the frame. It read 1032 bytes of garbage, decoded a nonsensical register set, and PTRACE_SETREGS loaded nonsense into the tracee. The stack-guard panic was the first visible symptom of values that had been quietly corrupted three steps earlier, which is, one is obliged to concede, the most honest possible description of most systems bugs.
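The sixteen-byte drift is simple enough to model directly. The sketch below is purely illustrative arithmetic; rspAtSigreturn is a hypothetical helper and the addresses are made up.

```go
package main

import "fmt"

// rspAtSigreturn models RSP when the rt_sigreturn syscall fires.
// handlerExit is RSP just after the handler's RET has popped pretcode.
// With the wrapper in the path, PUSHQ BP and CALL each push 8 more bytes.
func rspAtSigreturn(handlerExit uint64, viaWrapper bool) uint64 {
	rsp := handlerExit
	if viaWrapper {
		rsp -= 8 // PUSHQ BP
		rsp -= 8 // CALL pushes its return address
	}
	return rsp
}

func main() {
	const handlerExit = 0x7ffd_0000_1008
	direct := rspAtSigreturn(handlerExit, false)
	wrapped := rspAtSigreturn(handlerExit, true)
	fmt.Printf("direct=%#x wrapped=%#x drift=%d bytes\n", direct, wrapped, direct-wrapped)
}
```

Any frame lookup computed relative to the wrapped value reads from the wrong place by exactly that drift.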
The fix was to refuse the wrapper altogether:
    // Each PC variable holds the raw ABI0 entry address of the asm routine.
    // DATA precedes GLOBL, as the Go assembler documentation requires.
    DATA ·sigusr1HandlerPC(SB)/8, $·sigusr1Handler(SB)
    GLOBL ·sigusr1HandlerPC(SB), RODATA, $8
    DATA ·restoreRTPC(SB)/8, $·restoreRT(SB)
    GLOBL ·restoreRTPC(SB), RODATA, $8

A GLOBL/DATA pair in the .s file exposes the raw ABI0 entry point as a plain uintptr. On the Go side, one deletes the func sigusr1Handler() and func restoreRT() declarations so that the compiler has no occasion to produce wrappers in the first place, and installs the handler using the PC variable. The handler now runs at the address the kernel wrote into the frame, RSP stays exactly where rt_sigreturn expects to find it, and the round-trip completes.
I verified the fix with go tool objdump -s sigusr1Handler ./cmd/guest/guest. The wrapper entry sat a few dozen bytes above the raw handler — one with the PUSHQ BP / CALL prologue, one without. reflect.ValueOf had been returning a pointer, in the most technical sense; it had simply been the wrong one.
The debugging of this cost me about half a day. I do not think I have ever paid such close attention to a disassembler in my life.
The Other Bug: SETREGS Will Not Hear of CS=0
The second bug was briefer but no more obvious. After decoding the signal frame on rt_sigreturn, I needed to write the restored registers back to the tracee:
    var restored unix.PtraceRegs
    readMContext(frame, &restored)
    unix.PtraceSetRegs(pid, &restored) // fails: input/output error

PTRACE_SETREGS returned EIO. The frame’s mcontext stores only general-purpose registers along with RIP, RSP, and EFLAGS; it does not include segment registers. restored.Cs was zero.
The kernel refuses PTRACE_SETREGS with CS=0 on a 64-bit user task. CS=0 is the null selector, which is not a usable code segment, and the kernel treats any attempt to set it as “someone is about to crash userspace, and that is my problem.” The correct remedy was not to rebuild the register set but to merge the restored fields into the live one:
    var regs unix.PtraceRegs
    unix.PtraceGetRegs(pid, &regs) // start from the live register set (error check elided)
    regs.Rax = restored.Rax
    regs.Rip = restored.Rip
    regs.Rsp = restored.Rsp
    // ... GPRs, RIP, RSP, EFLAGS
    // Do NOT touch Cs, Ss, Fs_base, Gs_base, Orig_rax
    unix.PtraceSetRegs(pid, &regs)

Segment registers and orig_rax must come from the pre-sigreturn state. A well-behaved handler has no reason to alter them in any case: in 64-bit user code CS and SS are effectively fixed by the kernel’s flat segments, and orig_rax is ptrace bookkeeping rather than a CPU register.
The educational payoff here is worth putting plainly: the mcontext is not a complete register file. It is “the subset of user-visible state a signal handler might wish to examine or modify.” Everything else the kernel tracks quietly and in its own keeping, and a userspace reimplementation is obliged to remember this whenever it pretends to be the kernel.
Phase 3c: Routing to the Right Thread
The last phase concerned itself with making all of this behave when the guest has more than one thread. The Go runtime creates OS threads by way of clone(CLONE_THREAD), and by default my tracer attached only to the initial one. Every other thread then ran unsupervised, which is to say that any signal routed to those threads bypassed the Sentry altogether — the guest speaking to the kernel behind my back, as it were.
The remedy was PTRACE_O_TRACECLONE. This option instructs the kernel to stop the parent on every clone and issue a PTRACE_EVENT_CLONE bearing the new tid. I attach to that tid and add it to my ThreadGroup. Every thread in the guest is traced; every syscall lands in the Sentry.
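Recognising the clone stop comes down to decoding the raw wait status. The sketch below shows the bit layout as I understand it from the ptrace documentation; cloneEvent is a hypothetical helper, and a real tracer would follow a positive result with PTRACE_GETEVENTMSG to fetch the new tid.

```go
package main

import "fmt"

const (
	sigtrap          = 5 // SIGTRAP on Linux
	ptraceEventClone = 3 // PTRACE_EVENT_CLONE
)

// cloneEvent reports whether a raw wait4 status is a PTRACE_EVENT_CLONE stop.
// A ptrace event stop encodes 0x7f in the low byte (stopped), SIGTRAP in
// bits 8..15, and the event number in bits 16 and up.
func cloneEvent(status uint32) bool {
	stopped := status&0xff == 0x7f
	sig := (status >> 8) & 0xff
	event := status >> 16
	return stopped && sig == sigtrap && event == ptraceEventClone
}

func main() {
	cloneStop := uint32(ptraceEventClone<<16 | sigtrap<<8 | 0x7f)
	plainTrap := uint32(sigtrap<<8 | 0x7f)
	fmt.Println(cloneEvent(cloneStop)) // a clone event stop
	fmt.Println(cloneEvent(plainTrap)) // an ordinary SIGTRAP stop
}
```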
The routing predicate thereafter mirrors gVisor’s canReceiveSignalLocked. For kill(tgid, sig), one walks the threads in deterministic order and selects the first whose per-thread mask does not have sig blocked. For tkill(tid, sig), one routes to that tid directly. For self-sends — kill(getpid(), sig) issued from inside the guest — one resolves to the caller’s host tid via the ptrace-event metadata.
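A routing predicate of this kind might look like the sketch below. The Thread type and both functions are hypothetical, and "deterministic order" is taken here to mean ascending tid, which is an assumption rather than something the project states.

```go
package main

import (
	"fmt"
	"sort"
)

// Thread is a hypothetical per-thread record: a tid and its blocked mask.
type Thread struct {
	TID     int
	Blocked uint64 // bit (sig-1) set means sig is blocked
}

func blocked(t Thread, sig int) bool { return t.Blocked&(1<<uint(sig-1)) != 0 }

// routeGroup picks the target for a group-directed signal such as kill():
// the lowest-numbered tid whose mask leaves sig unblocked. ok is false when
// every thread blocks it, in which case the signal would stay pending.
func routeGroup(threads []Thread, sig int) (tid int, ok bool) {
	sorted := append([]Thread(nil), threads...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].TID < sorted[j].TID })
	for _, t := range sorted {
		if !blocked(t, sig) {
			return t.TID, true
		}
	}
	return 0, false
}

// routeThread handles tkill/tgkill: the named tid is the only candidate.
func routeThread(threads []Thread, tid int) (int, bool) {
	for _, t := range threads {
		if t.TID == tid {
			return tid, true
		}
	}
	return 0, false
}

func main() {
	const sigusr1 = 10
	threads := []Thread{
		{TID: 103},
		{TID: 101, Blocked: 1 << (sigusr1 - 1)},
		{TID: 102},
	}
	fmt.Println(routeGroup(threads, sigusr1))
	fmt.Println(routeThread(threads, 101))
}
```

Copying the slice before sorting keeps the caller's thread list untouched, which matters if the same list is shared with the tracer loop.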
The test that proves the whole arrangement works runs four goroutines; each locks itself to an OS thread and calls tgkill(getpid(), gettid(), SIGUSR1). All four handlers fire, the counter advances from one to five, and the integration test, having been written to be unflattering, passes without complaint.
The arm64 Punchline: Delegation Is Also an Architecture
Everything above is amd64-specific. When I turned to the question of porting to arm64, I discovered that I could, quite simply, decline.
On arm64, the rt_sigreturn trampoline lives in the vDSO, which my Sentry does not parse. Building the frame myself would mean reimplementing a chapter of the kernel’s signal path with no corresponding educational profit. So the arm64 path does something different:
- The Sentry mirrors disposition (as on amd64) and passes rt_sigaction through to the kernel.
- When a signal needs to be delivered, the Sentry simply calls syscall.Kill(guestPid, sig).
- The kernel’s own signal path then runs. It builds the frame, runs the handler, and the vDSO trampoline issues rt_sigreturn, which the kernel handles without ever troubling the Sentry.
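The three steps above can be sketched as a small delegating deliverer. Everything named here (Delegator, NoteSigaction, Deliver, hostKill) is hypothetical; only syscall.Kill is a real call, and it is injected so the example can run without signalling anything.

```go
package main

import (
	"fmt"
	"syscall"
)

// killFunc has the shape of a kill(2) wrapper so a fake can be substituted.
type killFunc func(pid, sig int) error

// hostKill is the real implementation a tracer would pass in.
func hostKill(pid, sig int) error { return syscall.Kill(pid, syscall.Signal(sig)) }

// Delegator keeps a mirror of which signals have handlers but lets the
// host kernel perform the actual delivery.
type Delegator struct {
	guestPid int
	handled  map[int]bool // mirrored disposition, keyed by signal number
	kill     killFunc
}

func NewDelegator(pid int, kill killFunc) *Delegator {
	return &Delegator{guestPid: pid, handled: map[int]bool{}, kill: kill}
}

// NoteSigaction updates the mirror; the real rt_sigaction still reaches the
// kernel, so the kernel's table and the mirror stay in step.
func (d *Delegator) NoteSigaction(sig int, hasHandler bool) {
	d.handled[sig] = hasHandler
}

// Deliver hands the signal to the kernel, which builds the frame itself.
func (d *Delegator) Deliver(sig int) error {
	return d.kill(d.guestPid, sig)
}

func main() {
	// A recording fake stands in for hostKill so nothing is signalled here.
	fake := func(pid, sig int) error {
		fmt.Printf("would call kill(%d, %d)\n", pid, sig)
		return nil
	}
	d := NewDelegator(4242, fake)
	d.NoteSigaction(10, true) // SIGUSR1
	_ = d.Deliver(10)
}
```

In a real tracer one would construct it with NewDelegator(guestPid, hostKill); there is no frame code anywhere in this path, which is the point of delegating.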
This is not a shortcut. It is a different architectural choice, and it reflects — accurately, I think — a trade-off that gVisor itself makes. The reason amd64 does not delegate is that gVisor’s security property depends on not delegating: if the host kernel runs the guest’s handler logic, a kernel exploit inside that path is once again on the table. For an educational project, the calculus is altered — “delegate and ship” is simply a better use of one’s evenings than “reimplement and stall on arm64 hardware access.”
The point I should like to leave here is that both architectures are legitimate, and that a project that does not know which one it is picking at the outset is setting itself up for a very bad month. One owns the thing or one delegates the thing. Not both. Not partially.
What I Actually Learned
Three things seem to me worth writing down.
The first is that signals are the place where a userspace kernel ceases to be a diagram and becomes a machine. One can fake syscalls all day by writing into RAX; signals demand that one match the kernel’s contracts at the byte level — frame layouts, register conventions, vDSO trampolines. Every rough edge in the abstraction announces itself here, and loudly.
The second is that toolchains leak into the ABI, and they do so politely enough that one does not notice. The Go compiler’s ABIInternal wrapper was invisible to me until the moment it could not have been more visible, and it had arranged to be invisible at exactly the worst level of the stack. Anywhere the host toolchain meets a low-level ABI — signal entry, syscall trampolines, FFI — one must assume the compiler is doing something helpful that one must then undo.
The third is the architectural one, and it is the most important of the three. Re-implement and delegate are the two strategies for any userspace reimplementation of a kernel contract, and the project decides which one it wants before it starts. gVisor chose re-implement because its security story requires it. I chose delegate on arm64 because my educational story required only that I understand. The choice is not a detail; it is the first decision, and everything else follows from it.
The source is at github.com/mtclinton/mini-sentry. The signal subsystem begins in signals.go; the amd64-specific frame work is in frame_amd64.go and handlers_signals_amd64.go. ADRs 001–003 under docs/adr/ walk through the Phase 3 design decisions in the order in which they actually happened, including the bugs — so if one is curious about the shape of the commits that produced this account, they are right there on the page.
For the architectural prelude to all of this — the ptrace loop, the seccomp platform, the Gofer — see the earlier mini-sentry post. This one picks up where that one leaves the curtain.
