SmartQueue
AI-powered adaptive task scheduler with LSTM runtime prediction and K8s autoscaling.
1→5
Worker Scaling
<100 ms
Scheduling Latency
2
LSTM Layers
Improving
Prediction Accuracy
Problem
Problem statement, constraint shape, and the gap this project explores.
How do you build a task scheduler that improves its priority decisions over time without human intervention?
Built a distributed, multi-tenant task scheduling platform that learns from historical execution patterns using a custom LSTM to predict job runtimes and dynamically adjust priorities.
Static priority schedulers are brittle: they assign priority at submission time and never adapt. In heterogeneous workloads where job runtime varies by 10x, static priorities cause a form of priority inversion: short low-priority jobs wait behind long high-priority jobs even when running them first would cost little. SmartQueue uses a custom LSTM trained on historical execution data to predict runtime and compute dynamic priority scores, so scheduling decisions improve as execution history accumulates.
Constraints
Non-negotiable boundaries that shaped the implementation.
Org-scoped job isolation with RBAC
Safe concurrent queue access under workers
K8s HPA 1→5 worker pods
LSTM from scratch (NumPy only)
No double-processing under parallel workers
Architecture
The primary design surface: flow, subsystem roles, and state boundaries.
SmartQueue has a Next.js frontend, FastAPI scheduling backend, and PostgreSQL state store. Workers poll the queue using FOR UPDATE SKIP LOCKED for safe concurrent dequeue. The LSTM service runs as a FastAPI sidecar, consuming execution history and outputting runtime predictions used to recalculate priority scores.
Job submit
Priority score (LSTM prediction)
PostgreSQL queue
Worker dequeue (SKIP LOCKED)
Execute
Record result
Retrain LSTM
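The dequeue step in the flow above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's code: the `jobs` table, its columns (`id`, `status`, `priority_score`, `locked_at`) and the `claim_next_job` helper are hypothetical names. The function accepts any DB-API cursor connected to PostgreSQL that is already inside a transaction.

```python
from typing import Any, Optional

# Hypothetical schema: jobs(id, status, priority_score, locked_at).
CLAIM_SQL = """
UPDATE jobs
SET status = 'running', locked_at = now()
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'queued'
    ORDER BY priority_score DESC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def claim_next_job(cursor: Any) -> Optional[int]:
    """Claim the highest-priority queued job, or return None if none is free.

    Rows already locked by another worker's transaction are skipped instead
    of waited on, so two workers cannot claim the same job.
    """
    cursor.execute(CLAIM_SQL)
    row = cursor.fetchone()
    return row[0] if row else None
```

A worker would call `claim_next_job` in a loop, commit after the claim, and sleep briefly when it returns `None`.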
Job API
FastAPI endpoints for job submission, status, and cancellation.
Scheduler
PostgreSQL advisory locks for leader election; min-heap for priority queue.
Workers
Stateless pods polling queue via FOR UPDATE SKIP LOCKED.
LSTM Service
2-layer LSTM built in NumPy trained on execution history.
Analytics Dashboard
Next.js frontend showing predicted vs actual runtime and accuracy.
K8s HPA
Horizontal pod autoscaler scaling workers 1→5 based on queue depth.
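The queue-depth scaling rule can be approximated with a small calculation. The sketch below follows the general shape of an average-value target (replicas = ceil(metric / target per replica)) clamped to the 1→5 range; the `target_jobs_per_worker` value of 10 and the function name are assumptions, not figures from the project.

```python
import math

MIN_WORKERS = 1  # lower bound of the 1→5 scaling range
MAX_WORKERS = 5  # upper bound of the 1→5 scaling range


def desired_workers(queue_depth: int, target_jobs_per_worker: int = 10) -> int:
    """Workers needed so each handles about target_jobs_per_worker queued jobs."""
    if target_jobs_per_worker <= 0:
        raise ValueError("target_jobs_per_worker must be positive")
    wanted = math.ceil(queue_depth / target_jobs_per_worker)
    # Clamp to the configured replica bounds.
    return max(MIN_WORKERS, min(MAX_WORKERS, wanted))
```

With the assumed target of 10, a queue depth of 25 asks for 3 workers, and any depth above 50 stays capped at 5.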
Engineering Tradeoffs
Design review notes: what was optimized and what was deliberately left behind.
FOR UPDATE SKIP LOCKED vs application-level locking
Application-level locking requires distributed coordination. With FOR UPDATE SKIP LOCKED, each worker claims a row inside its own transaction and skips rows that other workers have already locked, which prevents double-dequeue without a distributed lock manager.
Left behind: Database-agnostic implementation
Chosen: PostgreSQL FOR UPDATE SKIP LOCKED
Custom LSTM vs PyTorch
The goal was full understanding of LSTM mechanics and gradient flow. Using PyTorch would obscure the learning objective.
Left behind: GPU acceleration, automatic differentiation
Chosen: Hand-built 2-layer LSTM in NumPy
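For readers unfamiliar with the gate mechanics, here is a minimal sketch of one forward step of a standard LSTM cell in NumPy, stacked into two layers. It shows the textbook formulation only; the weight layout (one matrix holding all four gates), the function names, and the shapes are assumptions and may differ from the project's implementation.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step of a standard LSTM cell.

    x: (input_dim,), h_prev and c_prev: (hidden,),
    W: (4 * hidden, input_dim + hidden), b: (4 * hidden,)
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])  # input gate
    f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate cell state
    c = f * c_prev + i * g                 # new cell state
    h = o * np.tanh(c)                     # new hidden state
    return h, c


def two_layer_step(x, state, params):
    """Feed layer 1's hidden state into layer 2 for a single time step."""
    (h1, c1), (h2, c2) = state
    (W1, b1), (W2, b2) = params
    h1, c1 = lstm_step(x, h1, c1, W1, b1)
    h2, c2 = lstm_step(h1, h2, c2, W2, b2)
    return h2, ((h1, c1), (h2, c2))
```

A runtime prediction would then come from a readout (for example a linear layer) applied to the final `h2`; the backward pass, which is where the hand-derived gradients live, is omitted here.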
PostgreSQL advisory locks for scheduler election
PostgreSQL is already in the stack. Adding Redis/etcd for one lock is operational overhead without benefit.
Left behind: Distributed lock manager (Redis, etcd)
Chosen: pg_try_advisory_lock for leader election
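Leader election with an advisory lock can be sketched in a few lines. This is an illustrative sketch: the lock key `4242` and the `try_become_leader` helper are hypothetical, and the `%s` placeholder assumes a psycopg-style DB-API cursor.

```python
from typing import Any

SCHEDULER_LOCK_KEY = 4242  # hypothetical application-chosen bigint key


def try_become_leader(cursor: Any, key: int = SCHEDULER_LOCK_KEY) -> bool:
    """Return True if this session acquired the scheduler lock.

    pg_try_advisory_lock returns immediately: true if the lock was taken,
    false if another session already holds it. The lock is session-level,
    so it is released when the connection closes or the process dies.
    """
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    return bool(cursor.fetchone()[0])
```

Each scheduler instance would call this periodically on a long-lived connection; whichever instance gets `True` acts as leader until its connection drops.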
Failure Modes
Incident-style notes for the ways the design can break.
Priority inversion under high load
FM-01: LSTM predictions lag actual runtime changes.
Mitigated by recalculating scores at dequeue time, not just at submission time.
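Recalculating at dequeue time means the score is a function of the current clock and the latest prediction rather than a stored value. The scoring rule below is purely hypothetical (the source does not give the formula): it rewards waiting time and penalizes long predicted runtimes, and the `Job` fields and coefficients are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    base_priority: float        # priority given at submission
    submitted_at: float         # seconds, same clock as `now`
    predicted_runtime_s: float  # latest LSTM prediction


def dynamic_score(job: Job, now: float,
                  aging_per_s: float = 0.01,
                  runtime_penalty_per_s: float = 0.05) -> float:
    """Hypothetical rule: favor long-waiting jobs and short predicted runtimes."""
    waited = max(0.0, now - job.submitted_at)
    return (job.base_priority
            + aging_per_s * waited
            - runtime_penalty_per_s * job.predicted_runtime_s)


def pick_next(jobs: list[Job], now: float) -> Job:
    """Recompute every score at dequeue time and return the best job."""
    return max(jobs, key=lambda j: dynamic_score(j, now))
```

Because nothing is cached, a job whose prediction was refreshed after submission is ranked by the new prediction the next time a worker dequeues.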
Worker crash mid-job
FM-02: Job locked but not completed.
Mitigated with a job heartbeat timeout: jobs that do not heartbeat within 30 s are re-enqueued.
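The heartbeat check reduces to comparing each running job's last heartbeat against the timeout. The sketch below keeps state in plain Python containers for clarity; in the real system this state would live in PostgreSQL, and the function names are hypothetical.

```python
HEARTBEAT_TIMEOUT_S = 30.0  # the 30 s timeout described above


def find_stale_jobs(last_heartbeat: dict[str, float], now: float,
                    timeout_s: float = HEARTBEAT_TIMEOUT_S) -> list[str]:
    """Return ids of running jobs whose last heartbeat is older than timeout_s."""
    return sorted(job_id for job_id, seen in last_heartbeat.items()
                  if now - seen > timeout_s)


def requeue_stale(last_heartbeat: dict[str, float], queue: list[str],
                  now: float) -> list[str]:
    """Move stale jobs from the running set back onto the queue."""
    stale = find_stale_jobs(last_heartbeat, now)
    for job_id in stale:
        del last_heartbeat[job_id]
        queue.append(job_id)
    return stale
```

Re-enqueueing a job whose worker is merely slow can run it twice, so a design like this generally relies on jobs being idempotent or on the original worker checking ownership before recording a result.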
LSTM training on stale data
FM-03: Long-lived jobs skew runtime distribution.
Mitigated with exponential decay weighting: recent executions are weighted higher.
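Exponential decay weighting can be sketched as a per-sample weight that halves every fixed interval and is then applied to the training loss. The one-day half-life and both function names are assumptions; the source does not state the decay rate.

```python
def decay_weights(ages_s: list[float], half_life_s: float = 86_400.0) -> list[float]:
    """Weight each sample by 0.5 ** (age / half_life), normalized to sum to 1."""
    if not ages_s:
        return []
    raw = [0.5 ** (age / half_life_s) for age in ages_s]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mse(predicted: list[float], actual: list[float],
                 weights: list[float]) -> float:
    """Mean squared error in which recent samples count for more."""
    return sum(w * (p - a) ** 2 for p, a, w in zip(predicted, actual, weights))
```

With the assumed half-life, an execution from one day ago contributes half as much to the loss as one that just finished.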
K8s HPA lag
FM-04: Autoscaler reacts after queue depth spikes.
Mitigated by pre-warming workers based on predicted queue growth from the LSTM.
Benchmarks
Environment first, numbers second. Metrics should be inspectable, not ornamental.
Next.js / FastAPI
K8s HPA 1→5 workers
Next.js, FastAPI, TypeScript, Python
Distributed Systems
Project-level benchmark notes
1→5 pods (HPA)
<100 ms
2 layers (NumPy)
Improving over time
5 max
Lessons Learned
Engineering takeaways from the implementation, including remaining work.
FOR UPDATE SKIP LOCKED is underrated: it eliminates an entire class of distributed coordination problems with one SQL clause.
Building an LSTM from scratch in NumPy is tedious but essential for a deep understanding of gate mechanics and gradient flow.
K8s HPA reacts to metrics, not predictions. Adding predictive scaling on top of the reactive HPA reduces cold-start latency by starting workers before the queue-depth spike arrives.
Future: replace NumPy LSTM with a proper online learning approach that updates weights incrementally rather than full retraining.