Home/Writings/Distributed Systems

Distributed Systems

How We Taught a Scheduler to Predict the Future

The engineering realities of building adaptive prioritization, predictive runtime modeling, and resilient coordination on top of PostgreSQL and why most simple queues fail under real workloads.

June 15, 2026 16 min readDistributed Systems

Distributed SystemsSchedulingPostgreSQLReliabilityBackend

Figure 1A working loop for technical inquiry: observe the system, reason about its constraints, and build to test the model.

In production systems, queues are rarely pure FIFO. Workloads arrive heterogeneous: some jobs finish in hundreds of milliseconds, others drag on for minutes or more. A single long-running task queued ahead of latency-sensitive work creates priority inversion that cascades into user-visible problems. Static priorities help but ossify quickly. They cannot adapt to shifting patterns in job types, resource contention, or execution contexts.

The deeper issue is that most schedulers treat runtime as an unknown constant rather than a learnable distribution conditioned on history. Without feedback, throughput suffers and operators resort to over-provisioning or manual tuning.

The Problem

Heterogeneous workloads on shared workers create inherent tension. Short, latency-sensitive jobs compete with long-running batch jobs for the same execution slots. Traditional approaches (FIFO, fixed priorities, or simple heuristics) fail predictably:

Priority inversion becomes systemic.
Execution times vary with system load, data characteristics, and sequencing effects.
Polling-based dispatch introduces both latency and unnecessary database load.

Naive dynamic prioritization based on declared estimates collapses under inaccurate or adversarial submissions. The system must observe actual runtimes, learn contextual patterns, and re-rank jobs in near real-time without sacrificing safety or exactly-once semantics.

Constraints

Real-world deployment imposes multiple non-negotiable constraints:

Correctness: Exactly-once semantics under crashes requires careful leasing and recovery. The database remains the source of truth.
Performance: Sub-second dispatch for new jobs. Efficient priority queue operations at scale. Minimal polling.
Reliability: Leader election, automatic failover, stuck job reclamation, and retry logic with backoff. No external coordination services.
Scale and operations: Stateless API replicas, horizontally scalable workers, multi-tenant isolation.
Observability: Metrics for queue depth, prediction accuracy (MAPE), worker health, and leadership status.
Minimal dependencies: Lightweight LSTM implemented in pure NumPy, JWT from standard primitives, PostgreSQL as the sole coordination primitive.

The system must also be practical: easy to run locally with Docker Compose and production-ready on Kubernetes with HPA.

Why Existing Approaches Fail

FIFO appears fair because it processes jobs in arrival order. In homogeneous workloads with similar runtimes, this works well enough. Under heterogeneity, however, it becomes harmful. A long-running ETL job blocks a burst of short HTTP callbacks, amplifying tail latency across the system. The harm compounds because queueing delay for short jobs grows linearly with the size of the preceding long job, violating the intuition that "first come, first served" is equitable.

Static priorities seem like an obvious improvement. Assign high priority to latency-sensitive classes and low to batch work. They quickly become operational debt. Workload mixes evolve: new job types appear, business priorities shift, and yesterday's high-priority category becomes today's background noise. Operators find themselves constantly reconfiguring thresholds, creating brittle mappings that lag behind reality.

User-provided runtime estimates are attractive on paper: let submitters declare expected duration and sort accordingly. In practice, estimates are optimistic or outright fabricated. Developers underestimate complexity; adversarial users game the system for faster processing. Without strong verification, the scheduler inherits garbage inputs and produces garbage ordering.

Manual tuning is the final refuge. When automation fails, on-call engineers become the de facto scheduler: reordering jobs via dashboards or scripts during incidents. This does not scale. Human attention is the scarcest resource, and context switches during outages introduce errors. Feedback loops become necessary: the system must learn from its own execution history rather than relying on external declarations or constant human intervention.

These shortcomings explain why a more adaptive architecture becomes inevitable.

Can Runtime Be Predicted?

Runtime prediction matters because ordering decisions fundamentally depend on future cost. Without it, the scheduler operates blindly. Perfect prediction is impossible: execution time depends on transient factors like current load, data locality, network conditions, and even downstream service behavior. Yet approximate prediction remains valuable. Even rough signals that separate "likely short" from "likely long" jobs dramatically reduce inversion.

Historical behavior contains strong signal. Jobs of the same type executed under similar recent context often exhibit predictable patterns. A lightweight model can capture contextual adjustments without needing to model the entire environment. The system only needs useful estimates, not perfect ones. A modest improvement in ranking accuracy compounds across thousands of decisions, improving overall throughput and latency distributions. The key insight is treating prediction not as an ML novelty but as operational infrastructure that closes the feedback loop between execution and scheduling.

Building the System

The architecture separates concerns across services sharing a PostgreSQL backbone: a FastAPI submission layer, a TypeScript scheduler maintaining a min-heap priority queue, Python workers, and a dedicated predictor service.

Major design decisions reflect deliberate tradeoffs against alternatives:

Event-driven dispatch with LISTEN/NOTIFY: Job submission triggers PostgreSQL notifications. The elected scheduler leader maintains a dedicated listener and rebuilds its heap efficiently, with a fallback polling loop for edge cases. This reduces pickup latency substantially compared to polling. Alternatives like Redis pub/sub or dedicated message brokers add operational overhead and another failure domain; sticking with Postgres minimized new infrastructure while accepting database-centric load.
Leader election via advisory locks: Multiple scheduler instances compete for a PostgreSQL advisory lock, with a registry table and heartbeats for tracking. Failover occurs within seconds without external consensus tools. etcd, Consul, or ZooKeeper offer stronger guarantees but introduce additional services to deploy, monitor, and secure. The Postgres approach prioritizes simplicity and reduced blast radius, trading some robustness for operational ease (mitigated by replication).
Lease-based claiming with FOR UPDATE SKIP LOCKED: Workers compete safely for the highest-priority job. This enables concurrent claiming and natural recovery. Dedicated queues (RabbitMQ, Kafka) or Redis-based solutions could offload coordination but would complicate exactly-once semantics and increase dependency surface area. Postgres primitives keep the system self-contained.
PostgreSQL as coordination plane overall: By leveraging advisory locks, SKIP LOCKED, LISTEN/NOTIFY, and row-level concurrency, sophisticated distributed behaviors emerge from one replicated store. This avoids the complexity of polyglot persistence but concentrates load, forcing careful query and schema design.
Lightweight runtime predictor: The 2-layer LSTM (NumPy-only) observes recent job type history and modulates base runtimes for priority calculation. Training on execution logs closes the loop. Heavy frameworks were avoided for deployment simplicity and inference speed; the narrow scope makes a simple model sufficient.

Kubernetes manifests, CPU-based HPA, Prometheus metrics, and CI/CD pipelines support production operation.

Failure Modes

Production reality quickly surfaces broken assumptions:

Scheduler leader crashes cause brief dispatch gaps during failover, though heartbeats and the notification system limit damage.
Worker deaths mid-execution are handled by lease expiration and requeuing, but repeated failures on specific job types reveal deeper payload or environmental problems.
Prediction drift occurs in highly variable environments. The simple LSTM can lag; monitoring MAPE and actual vs. predicted runtimes becomes essential.
Database contention from advisory locks, SKIP LOCKED, and notifications concentrates load on Postgres; a scaling limit that many mature systems eventually work around with dedicated queues or sharding.
Exactly-once guarantees remain fragile under network partitions. The design assumes job logic is idempotent.
Model poisoning from noisy or malicious historical data lacks robust defenses in the current implementation.

These failure modes were not theoretical. The migration scripts, watchdog logic, and recovery paths reflect iterative hardening against them.

Broader Systems Lessons

This work reinforces several enduring principles in systems engineering.

Feedback loops separate viable schedulers from those that degrade. By tying prioritization to observed execution data through lightweight learning, the system evolves with the workload rather than fighting it. Static designs accumulate error; adaptive ones amortize learning over time.

Local optimization versus global optimization appears constantly. Giving one job higher priority helps that job but can reduce overall throughput if it displaces many shorter ones. Effective scheduling requires balancing these tensions, often through predicted cost rather than declared importance.

Emergent behavior dominates queue dynamics. Overall latency distributions and throughput emerge from thousands of local claiming decisions, lease renewals, and prediction adjustments. Scheduling is less an algorithm problem and more a systems problem of incentives, stability, and unintended interactions.

Prediction as infrastructure is a quiet but powerful pattern. Many modern systems (autoscalers, load balancers, cache eviction) depend on forecasting. Here, the narrow LSTM demonstrates that prediction need not be heavy machinery; when scoped tightly and fed good signals, it becomes a reliable operational capability.

Abstraction leaks are inevitable in scheduling. Predicted runtime is useful but incomplete: true cost depends on current worker load, data locality, and external factors not captured in the model. Priority queues make these leaks visible quickly, forcing engineers to confront the gap between model and reality.

Finally, observability turns adaptation from a liability into strength. Metrics on prediction error, leadership flaps, and queue behavior provide the diagnostics needed when assumptions inevitably break.

Conclusion

Effective scheduling for heterogeneous workloads requires continuous observation, safe concurrency primitives, and mechanisms for dynamic re-ranking under failure. The tension between learning, safety, and operational simplicity does not resolve cleanly; tradeoffs are unavoidable.

Building around these realities reinforced that robust distributed coordination often hides in mature databases, that narrowly scoped lightweight models can deliver outsized value, and that failure modes should be designed into the system from the beginning. In production queues, graceful degradation when any component (prediction, leader, worker, or database) misbehaves is what ultimately determines survival.

Distributed Systems · 11 min read

How We Taught a Scheduler to Predict the Future

The Problem

Constraints

Why Existing Approaches Fail

Can Runtime Be Predicted?

Building the System

Failure Modes

Broader Systems Lessons

Conclusion

Lessons from My First Distributed System

Raft Consensus and Failure

Emergence in Complex Systems

The Problem

Constraints

Why Existing Approaches Fail

Can Runtime Be Predicted?

Building the System

Failure Modes

Broader Systems Lessons

Conclusion

Related writings

Lessons from My First Distributed System

Raft Consensus and Failure

Emergence in Complex Systems