Distributed Systems

SmartQueue

AI-powered adaptive task scheduler with LSTM runtime prediction and K8s autoscaling.

Next.js, FastAPI, TypeScript, Python, PostgreSQL, Kubernetes, LSTM


Worker Scaling

1→5 pods

Scheduling Latency

<100 ms

LSTM Layers

2

Prediction Accuracy

Improving

Problem

Problem statement, constraint shape, and the gap this project explores.

Problem Statement

How do you build a task scheduler that improves its priority decisions over time without human intervention?

Challenge

Built a distributed, multi-tenant task scheduling platform that learns from historical execution patterns using a custom LSTM to predict job runtimes and dynamically adjust priorities.

Why Existing Approaches Failed

Static priority schedulers are brittle: they assign priority at submission time and never adapt. In heterogeneous workloads where job runtime varies by 10x, this causes head-of-line blocking, with short low-priority jobs stuck behind long high-priority ones. SmartQueue uses a custom LSTM trained on historical execution data to predict runtime and compute dynamic priority scores, so scheduling decisions improve as more execution history accumulates.
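One way to turn a runtime prediction into a dynamic priority score is sketched below. The source does not give SmartQueue's formula, so the `Job` fields, the `dynamic_priority` function, and its two tuning parameters are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    base_priority: int          # assigned at submission; higher means more important
    predicted_runtime_s: float  # hypothetical field: output of the runtime predictor
    waited_s: float             # seconds spent in the queue so far


def dynamic_priority(job: Job, runtime_scale_s: float = 60.0, aging_per_s: float = 0.001) -> float:
    """Return a score where higher means "run sooner".

    The base priority is discounted by the predicted runtime, and an aging
    term slowly raises the score of jobs that have been waiting.
    Both weights are illustrative defaults, not values from the project.
    """
    runtime_penalty = 1.0 + max(job.predicted_runtime_s, 0.0) / runtime_scale_s
    return job.base_priority / runtime_penalty + aging_per_s * job.waited_s


short_job = Job("thumbnail", base_priority=1, predicted_runtime_s=2.0, waited_s=0.0)
long_job = Job("reindex", base_priority=5, predicted_runtime_s=600.0, waited_s=0.0)
ranked = sorted([long_job, short_job], key=dynamic_priority, reverse=True)
```

With these defaults the two-second job ranks ahead of the ten-minute job despite its lower base priority. The aging term is one common guard against starving long jobs indefinitely.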

Constraints

Non-negotiable boundaries that shaped the implementation.

Multi-tenancy

Org-scoped job isolation with RBAC

Concurrency

Safe concurrent queue access across multiple workers

Scaling

K8s HPA 1→5 worker pods

LSTM from scratch (NumPy only)

Consistency

No double-processing under parallel workers

Architecture

The primary design surface: flow, subsystem roles, and state boundaries.

Architecture Brief

SmartQueue has a Next.js frontend, FastAPI scheduling backend, and PostgreSQL state store. Workers poll the queue using FOR UPDATE SKIP LOCKED for safe concurrent dequeue. The LSTM service runs as a FastAPI sidecar, consuming execution history and outputting runtime predictions used to recalculate priority scores.
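The dequeue step described above could look roughly like the following. This is a sketch, not the project's code: the `jobs` table and the `status`, `priority_score`, `created_at`, `started_at`, and `payload` columns are hypothetical, and `conn` is assumed to be any DB-API connection to PostgreSQL.

```python
# Hypothetical schema: jobs(id, status, priority_score, created_at, started_at, payload).
CLAIM_SQL = """
UPDATE jobs
   SET status = 'running', started_at = now()
 WHERE id = (
         SELECT id
           FROM jobs
          WHERE status = 'queued'
          ORDER BY priority_score DESC, created_at
          LIMIT 1
            FOR UPDATE SKIP LOCKED
       )
RETURNING id, payload
"""


def claim_next_job(conn):
    """Claim one queued job in a single statement; return None if none is available.

    Rows already locked by another worker's in-flight claim are skipped
    instead of waited on, so two workers never claim the same job.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

A worker would call `claim_next_job` in its polling loop and sleep briefly whenever it returns `None`.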

Execution Flow

Job submit

Priority score (LSTM prediction)

PostgreSQL queue

Worker dequeue (SKIP LOCKED)

Execute

Record result

Retrain LSTM

Job API

FastAPI endpoints for job submission, status, and cancellation.

Scheduler

PostgreSQL advisory locks for leader election; min-heap for priority queue.
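The min-heap half of the scheduler can be sketched with Python's standard `heapq` module. The `JobHeap` class and its method names are hypothetical, and the sketch does not cover leader election. Because `heapq` is a min-heap, scores are negated so the highest score pops first.

```python
import heapq
import itertools


class JobHeap:
    """In-memory priority queue where the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: earlier pushes win on equal scores

    def push(self, job_id: str, score: float) -> None:
        # Negate the score because heapq always pops the smallest tuple.
        heapq.heappush(self._heap, (-score, next(self._counter), job_id))

    def pop(self) -> str:
        _, _, job_id = heapq.heappop(self._heap)
        return job_id

    def __len__(self) -> int:
        return len(self._heap)


queue = JobHeap()
queue.push("a", 0.4)
queue.push("b", 0.9)
queue.push("c", 0.9)
order = [queue.pop() for _ in range(len(queue))]
```

The counter keeps ordering stable for equal scores and avoids comparing job identifiers directly.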

Workers

Stateless pods polling queue via FOR UPDATE SKIP LOCKED.

LSTM Service

2-layer LSTM built in NumPy trained on execution history.
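To make the gate mechanics concrete, here is one time step of a single-unit LSTM cell. It is deliberately written in dependency-free Python with scalar weights rather than NumPy, so it is not the project's implementation; the `params` layout and function names are assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def lstm_cell_step(x: float, h_prev: float, c_prev: float, params: dict):
    """Run one time step of a single-unit LSTM cell.

    params maps each gate name ('f', 'i', 'o', 'g') to a (w_x, w_h, b) tuple.
    Returns the new hidden state h and cell state c.
    """
    def pre_activation(gate: str) -> float:
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("f"))    # forget gate: how much old cell state to keep
    i = sigmoid(pre_activation("i"))    # input gate: how much new candidate to admit
    o = sigmoid(pre_activation("o"))    # output gate: how much cell state to expose
    g = math.tanh(pre_activation("g"))  # candidate cell value
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


zero_params = {gate: (0.0, 0.0, 0.0) for gate in "fiog"}
h, c = lstm_cell_step(x=1.0, h_prev=0.0, c_prev=1.0, params=zero_params)
```

A second layer would take the first layer's `h` as its input `x` at each time step. A vectorised version replaces the scalar products with matrix multiplications.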

Analytics Dashboard

Next.js frontend showing predicted vs actual runtime and accuracy.
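The source does not say which accuracy metric the dashboard reports. As one plausible choice, the sketch below computes mean absolute percentage error (MAPE) over predicted and actual runtimes; the function name and the decision to skip non-positive actual runtimes are assumptions.

```python
def mean_absolute_percentage_error(predicted, actual):
    """Average of |predicted - actual| / actual over jobs with a positive actual runtime."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if a > 0]
    if not pairs:
        raise ValueError("need at least one job with a positive actual runtime")
    return sum(abs(p - a) / a for p, a in pairs) / len(pairs)


# Two jobs, each mispredicted by 10% of its actual runtime.
error = mean_absolute_percentage_error(predicted=[9.0, 22.0], actual=[10.0, 20.0])
```

Tracking this value per retraining cycle is one way to make "accuracy improving over time" inspectable.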

K8s HPA

Horizontal Pod Autoscaler scaling worker pods from 1 to 5.

Environment first, numbers second. Metrics should be inspectable, not ornamental.

Test Environment

Runtime

Next.js / FastAPI

Workload

K8s HPA 1→5 workers

Stack

Next.js, FastAPI, TypeScript, Python

Scope

Distributed Systems

Evidence

Project-level benchmark notes

Performance Results

Worker Scaling

1→5 pods (HPA)

Scheduling Latency

<100 ms

LSTM Layers

2 layers (NumPy)

Prediction Accuracy

Improving over time

Concurrent Workers

5 max

Lessons Learned

Engineering takeaways from the implementation, including remaining work.

FOR UPDATE SKIP LOCKED is underrated: it eliminates an entire class of distributed coordination problems with one SQL clause.

Building LSTM from scratch in NumPy is tedious but essential for deep understanding of gate mechanics and gradient flow.

K8s HPA reacts to metrics, not predictions. Adding predictive scaling on top of reactive HPA reduces cold-start latency significantly.
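A predictive layer of this kind could estimate the worker count from the predicted runtimes of queued jobs. The sketch below is an assumption-laden illustration: `desired_workers` and `target_drain_s` are hypothetical, and only the 1 to 5 pod bounds come from the text. How the result would reach the autoscaler is not covered by the source.

```python
import math


def desired_workers(predicted_runtimes_s, target_drain_s=60.0, min_workers=1, max_workers=5):
    """Estimate how many workers are needed to drain the queue within target_drain_s.

    predicted_runtimes_s holds the predicted runtime of each queued job in seconds.
    The result is clamped to the [min_workers, max_workers] range.
    """
    backlog_s = sum(max(r, 0.0) for r in predicted_runtimes_s)
    needed = math.ceil(backlog_s / target_drain_s)
    return max(min_workers, min(max_workers, needed))


idle = desired_workers([])                     # empty queue stays at the minimum
moderate = desired_workers([30.0, 30.0, 30.0]) # 90 s of work against a 60 s target
saturated = desired_workers([600.0] * 3)       # far more work than 5 workers can drain in time
```

Scaling on predicted backlog lets pods start before reactive metrics such as CPU utilisation begin to climb.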

Future: replace NumPy LSTM with a proper online learning approach that updates weights incrementally rather than full retraining.
