
Raft Consensus and Failure

A look at why the easy part of consensus is the normal case.

Figure 1: A working loop for technical inquiry: observe the system, reason about its constraints, and build to test the model.

Raft is approachable because it gives consensus a shape that humans can hold in their heads. Leaders replicate logs, followers respond, terms advance, and majorities decide what counts as committed.
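To make "majorities decide what counts as committed" concrete, here is a minimal sketch of how a leader might advance its commit index. The names (`advance_commit_index`, `match_index`, `log_terms`) are illustrative assumptions, not taken from any particular implementation. It assumes 1-based log indices and that `match_index` includes the leader's own last index.

```python
from typing import Dict, List


def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def advance_commit_index(
    commit_index: int,
    current_term: int,
    log_terms: List[int],
    match_index: Dict[str, int],
) -> int:
    """Return the new commit index for a leader (hypothetical helper).

    log_terms[i - 1] is the term of the entry at 1-based index i.
    match_index maps node id -> highest index known to be stored on that
    node, and is assumed to include the leader itself.
    """
    quorum = quorum_size(len(match_index))
    # The quorum-th largest value is the highest index held by a majority.
    replicated = sorted(match_index.values(), reverse=True)[quorum - 1]
    # Raft only commits by counting replicas for entries from the leader's
    # own term; older entries become committed indirectly.
    for n in range(replicated, commit_index, -1):
        if log_terms[n - 1] == current_term:
            return n
    return commit_index
```

With three nodes at indices 5, 4, and 2, the majority-held index is 4, so the leader can commit up to 4 as long as that entry was written in its current term.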

The normal path is only the beginning. The interesting questions start when messages are delayed, nodes restart with stale state, elections overlap, or a leader believes it is still in charge after the rest of the cluster has moved on.
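The stale-leader case is the easiest of these to sketch. In Raft, any node that sees a term higher than its own adopts that term and reverts to follower. The `Node` class and `observe_term` method below are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    current_term: int
    role: str  # "leader", "candidate", or "follower"

    def observe_term(self, seen_term: int) -> bool:
        """Apply the term carried by any incoming request or reply.

        Returns True if this node stepped down as a result.
        """
        if seen_term > self.current_term:
            # The cluster has moved on: adopt the newer term and stop
            # acting as leader or candidate.
            self.current_term = seen_term
            self.role = "follower"
            return True
        return False
```

A real node would also persist the new term before acting on it; that step is omitted here to keep the sketch short.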

Good implementations are careful about persistence (the current term, the vote, and the log must reach stable storage before a node replies), term checks, idempotency, and timeouts. Raft is easier to follow than Paxos, but the operational lesson is the same: distributed state needs a very explicit story about what can be trusted.
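As a rough sketch of how term checks, persistence, and idempotency fit together, consider a follower handling an append request. Everything here is an assumption for illustration: the `Follower` and `Entry` classes are hypothetical, `persist` is a stand-in counter for a durable write, and commit-index handling and timeouts are left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entry:
    term: int
    command: str


@dataclass
class Follower:
    current_term: int = 0
    log: List[Entry] = field(default_factory=list)
    persisted: int = 0  # counts stand-in durable writes

    def persist(self) -> None:
        # Placeholder for writing term and log to stable storage.
        self.persisted += 1

    def append_entries(
        self, term: int, prev_index: int, prev_term: int, entries: List[Entry]
    ) -> Tuple[int, bool]:
        """Handle an append request; indices are 1-based, 0 means 'no entry'."""
        # Term check: refuse requests from a leader with a stale term.
        if term < self.current_term:
            return self.current_term, False
        if term > self.current_term:
            self.current_term = term
            self.persist()
        # Consistency check: our log must contain the entry the leader
        # claims precedes the new ones.
        if prev_index > len(self.log):
            return self.current_term, False
        if prev_index > 0 and self.log[prev_index - 1].term != prev_term:
            return self.current_term, False
        changed = False
        for offset, entry in enumerate(entries):
            idx = prev_index + offset + 1
            if idx <= len(self.log):
                if self.log[idx - 1].term == entry.term:
                    continue  # already stored: a duplicate delivery is a no-op
                del self.log[idx - 1:]  # conflict: drop it and what follows
                changed = True
            self.log.append(entry)
            changed = True
        if changed:
            self.persist()  # durable before we acknowledge
        return self.current_term, True
```

Delivering the same request twice leaves the log unchanged and triggers no extra write, which is the property that makes retries safe.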
