SysRift
Deterministic syscall-level execution replay engine using Linux ptrace.
Overhead vs Native: ~15%
Syscalls Intercepted: 100%
Replay Fidelity: Faithful
Event Log Size: ~200 bytes/syscall
Problem
Problem statement, constraint shape, and the gap this project explores.
How do you reproduce a non-deterministic program failure without modifying the program?
Built a low-level execution capture and replay system using Linux ptrace to deterministically record and reproduce program behavior at the syscall boundary.
Debugging non-deterministic failures in production systems is hard. Race conditions, timing-dependent crashes, and environment-specific behavior are nearly impossible to reproduce by re-running. SysRift captures the syscall boundary — the interface between user-space and the kernel — and records enough state to replay program execution faithfully without re-executing kernel interactions.
Constraints
Non-negotiable boundaries that shaped the implementation.
Syscall boundary (no source modification)
Must not significantly alter timing behavior
Replay must produce identical observable behavior
User-space memory buffers and return values
Linux x86-64 with ptrace support
Architecture
The primary design surface: flow, subsystem roles, and state boundaries.
SysRift attaches to a target process using ptrace(PTRACE_ATTACH). At each syscall entry and exit, it captures arguments, return values, and relevant memory regions. The record phase produces an event log. The replay phase re-executes the program, intercepting syscalls and substituting recorded return values instead of forwarding to the kernel.
Target process -> ptrace attach -> [syscall entry intercept: record args] -> [syscall exit intercept: record return + buffers] -> Event log -> Replay: inject recorded returns
Tracer
ptrace-based process monitor that intercepts SIGTRAP on syscall entry/exit.
Event Recorder
Captures syscall number, arguments, return values, and memory snapshots.
Memory Capturer
Uses process_vm_readv to snapshot output buffers at syscall exit.
Replay Engine
Re-runs process under ptrace, substituting recorded return values for live kernel calls.
Event Log
Binary-serialized event stream with timestamps and context.
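One common way to build the Replay Engine's substitution step on x86-64 is sketched below. The source does not say how SysRift keeps the kernel from running the call, so the approach here is an assumption: at the syscall-entry stop the tracer rewrites `orig_rax` to an invalid number so the kernel executes nothing, and at the exit stop it writes the recorded value into `rax`. The `replay_getppid` function is a hypothetical demo that applies this to a single raw `getppid` call in a forked child.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Kills and reaps the tracee after a tracer-side error. */
static int abandon_replay(pid_t pid)
{
    int status;
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return -1;
}

/* Traces a child that issues one raw getppid and substitutes `recorded_ret`
 * for the kernel's answer. Returns 0 if the child observed the substituted
 * value, 1 if it did not, and -1 if tracing could not be set up. */
int replay_getppid(long recorded_ret)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(126);
        raise(SIGSTOP);
        long seen = syscall(SYS_getppid);
        _exit(seen == recorded_ret ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
        return -1;
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) == -1)
        return abandon_replay(pid);

    int at_entry = 1;      /* syscall stops alternate between entry and exit */
    int substituting = 0;  /* set while the current syscall is being faked */

    for (;;) {
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) == -1)
            return abandon_replay(pid);
        if (waitpid(pid, &status, 0) == -1)
            return -1;
        if (WIFEXITED(status))
            return WEXITSTATUS(status);
        if (WIFSIGNALED(status))
            return -1;
        if (WSTOPSIG(status) != (SIGTRAP | 0x80))
            continue;  /* this sketch drops non-syscall stops */

        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1)
            return abandon_replay(pid);
        int dirty = 0;
        if (at_entry && (long)regs.orig_rax == SYS_getppid) {
            regs.orig_rax = (unsigned long long)-1;  /* no such syscall */
            substituting = 1;
            dirty = 1;
        } else if (!at_entry && substituting) {
            regs.rax = (unsigned long long)recorded_ret;
            substituting = 0;
            dirty = 1;
        }
        if (dirty && ptrace(PTRACE_SETREGS, pid, NULL, &regs) == -1)
            return abandon_replay(pid);
        at_entry = !at_entry;
    }
}
```

A full replay would look up the next event in the log instead of matching one syscall number, and would also write the recorded output buffers back into the tracee before resuming it.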
Engineering Tradeoffs
Design review notes: what was optimized and what was deliberately left behind.
ptrace vs eBPF
ptrace allows direct process control and memory reading with no kernel module required. eBPF needs a newer kernel and more setup.
Left behind: Lower overhead (eBPF is faster)
Chosen: ptrace
Full memory snapshot vs selective capture
Full memory snapshots are prohibitively large. Capturing only the output buffers of I/O syscalls covers the majority of replay correctness requirements.
Left behind: Complete state reconstruction
Chosen: Selective buffer capture at syscall exit
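Selective capture comes down to two questions at each syscall exit: which argument points at memory the kernel just wrote, and how many bytes are valid. The sketch below answers both for a few byte-count-returning calls and then copies the bytes with process_vm_readv. The table is illustrative and deliberately partial; `struct out_buf`, `output_buffer_for`, and `snapshot_buffer` are hypothetical names, and calls that fill fixed-size structures (such as fstat) would need their own entries.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Which syscall argument holds the output pointer, and how much to copy. */
struct out_buf {
    int arg_index;  /* 0-based position of the buffer pointer argument */
    size_t len;     /* number of valid bytes the kernel wrote */
};

/* Returns 1 and fills `ob` when the syscall wrote user memory worth
 * capturing, 0 otherwise. Only covers calls whose return value is the
 * number of bytes written into the buffer. */
int output_buffer_for(long nr, long ret, struct out_buf *ob)
{
    if (ret <= 0)
        return 0;  /* error or end of file: nothing was written */
    switch (nr) {
    case SYS_read:      /* read(fd, buf, count) */
    case SYS_pread64:   /* pread64(fd, buf, count, offset) */
    case SYS_recvfrom:  /* recvfrom(sockfd, buf, len, ...) */
        ob->arg_index = 1;
        ob->len = (size_t)ret;
        return 1;
    case SYS_getrandom: /* getrandom(buf, buflen, flags) */
        ob->arg_index = 0;
        ob->len = (size_t)ret;
        return 1;
    default:
        return 0;
    }
}

/* Copies `len` bytes from address `remote_addr` in process `pid` into `dst`.
 * Returns the number of bytes copied, or -1 with errno set. */
ssize_t snapshot_buffer(pid_t pid, unsigned long remote_addr,
                        void *dst, size_t len)
{
    struct iovec local = { .iov_base = dst, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

At a syscall-exit stop the tracer would take the pointer from the register that carries argument `arg_index` and pass it as `remote_addr`.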
Binary vs text event log
Binary logs are significantly smaller and faster to write/read. Text logs add I/O overhead that affects the timing fidelity of the recording.
Left behind: Human readability
Chosen: Binary serialized log
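To make the binary option concrete, here is one possible record layout: a fixed-size header followed by the captured buffer bytes. The field names, widths, and native-endian encoding are assumptions for illustration; the source does not document SysRift's actual format, and this layout is not meant to reproduce the ~200 bytes/syscall figure.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-disk header, written in host byte order. */
struct event_header {
    uint64_t timestamp_ns;  /* when the syscall exit was observed */
    int64_t  ret;           /* recorded return value */
    uint64_t args[6];       /* raw syscall arguments */
    uint32_t nr;            /* syscall number */
    uint32_t payload_len;   /* captured buffer bytes that follow the header */
};

/* Appends one event to the log. Returns 0 on success, -1 on a short write. */
int write_event(FILE *f, const struct event_header *h, const void *payload)
{
    if (fwrite(h, sizeof *h, 1, f) != 1)
        return -1;
    if (h->payload_len && fwrite(payload, h->payload_len, 1, f) != 1)
        return -1;
    return 0;
}

/* Reads the next event. Returns 0 on success, -1 at end of log, on a
 * truncated record, or when the payload does not fit in `cap` bytes. */
int read_event(FILE *f, struct event_header *h, void *payload, size_t cap)
{
    if (fread(h, sizeof *h, 1, f) != 1)
        return -1;
    if (h->payload_len > cap)
        return -1;
    if (h->payload_len && fread(payload, h->payload_len, 1, f) != 1)
        return -1;
    return 0;
}
```

Each record costs a single `fwrite` for the header plus one for the payload, with no formatting work on the recording path.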
Failure Modes
Incident-style notes for the ways the design can break.
Signal delivery during syscall interception
FM-01: Signals arriving between syscall entry and exit corrupt the intercept state.
Mitigated with the PTRACE_O_TRACESYSGOOD option and explicit signal masking.
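With PTRACE_O_TRACESYSGOOD set, the kernel reports syscall stops as `SIGTRAP | 0x80`, so the tracer can tell them apart from ordinary signal deliveries and keep its entry/exit toggle in sync. A minimal sketch of that classification follows; `enum stop_kind` and `classify_stop` are hypothetical names, and the signal-masking half of the mitigation is not shown.

```c
#include <signal.h>
#include <sys/wait.h>

enum stop_kind { STOP_SYSCALL, STOP_SIGNAL, STOP_EXITED, STOP_OTHER };

/* Classifies a waitpid status from a tracee that has PTRACE_O_TRACESYSGOOD
 * set. For STOP_SIGNAL, *sig_to_forward holds the signal the tracer should
 * pass back on the next PTRACE_SYSCALL; otherwise it is set to 0. */
enum stop_kind classify_stop(int status, int *sig_to_forward)
{
    *sig_to_forward = 0;
    if (WIFEXITED(status) || WIFSIGNALED(status))
        return STOP_EXITED;
    if (!WIFSTOPPED(status))
        return STOP_OTHER;
    if (status >> 16)
        return STOP_OTHER;  /* PTRACE_EVENT_* stop, not handled here */
    int sig = WSTOPSIG(status);
    if (sig == (SIGTRAP | 0x80))
        return STOP_SYSCALL;  /* only these flip the entry/exit toggle */
    *sig_to_forward = sig;
    return STOP_SIGNAL;
}
```

Only `STOP_SYSCALL` should advance the entry/exit state. A `STOP_SIGNAL` result leaves that state untouched and hands the signal back to the tracee on resume.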
Multi-threaded processes
FM-02: ptrace traces per thread, not per process. Multi-threaded programs require tracing every thread independently.
Current scope is single-threaded processes only.
Exec-based process replacement
FM-03: execve replaces the process image. Replay must handle re-attachment after exec or use PTRACE_O_TRACEEXEC.
Memory layout ASLR mismatch on replay
FM-04: ASLR assigns different base addresses each run. Replay requires ASLR disabled or base address fixups for pointer arguments.
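One way to take the "ASLR disabled" route without changing system-wide settings is to set the ADDR_NO_RANDOMIZE personality flag in the child before it calls exec, since the flag is inherited across execve. The helper below is a sketch under that assumption; `disable_aslr` is a hypothetical name, and some sandboxes (for example, restrictive seccomp profiles) reject the call.

```c
#include <sys/personality.h>

/* Turns off address-space randomization for the calling process and for
 * any program it later execs. Returns 0 on success, -1 on failure. */
int disable_aslr(void)
{
    int current = personality(0xffffffff);  /* query, change nothing */
    if (current == -1)
        return -1;
    if (personality((unsigned long)current | ADDR_NO_RANDOMIZE) == -1)
        return -1;
    return 0;
}
```

In a tracer this would run in the forked child, after PTRACE_TRACEME and before exec, during both record and replay so that pointer arguments in the log stay valid.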
Benchmarks
Environment first, numbers second. Metrics should be inspectable, not ornamental.
C / Linux
Syscall-level capture
C, Linux, ptrace, Process Tracing
Systems Programming
Project-level benchmark notes
~15% slowdown
100% coverage
Faithful for single-threaded
~200 bytes/syscall
Linux x86-64
Lessons Learned
Engineering takeaways from the implementation, including remaining work.
ptrace is powerful but has sharp edges: signal handling and multi-threading require significant additional complexity.
The syscall boundary is sufficient for replay correctness in most single-threaded programs, but misses in-memory concurrency.
Binary event logs are essential: text logs introduce enough I/O latency to alter the timing of the recorded execution.
Future: extend to multi-threaded programs using per-thread ptrace handles and a global ordering constraint on event replay.
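The global ordering constraint could take the following shape: every recorded event carries a sequence number from one shared counter, and during replay a thread may complete its next syscall only when that event's number matches the replay counter. This is a speculative sketch of unimplemented future work; `next_thread_to_resume` and its parameters are hypothetical.

```c
#include <stdint.h>

/* next_seq[i] is the sequence number of the next recorded event for traced
 * thread i. Returns the index of the thread whose event is next in the
 * global order, or -1 if no thread is waiting on `global_next`. */
int next_thread_to_resume(const uint64_t *next_seq, int nthreads,
                          uint64_t global_next)
{
    for (int i = 0; i < nthreads; i++)
        if (next_seq[i] == global_next)
            return i;
    return -1;
}
```

The replay loop would resume only the returned thread, then increment `global_next` and load that thread's following event.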