Systems Programming

HuffSpace

Parallel file compression engine using Huffman coding with multi-GB throughput.

RustrayonclapserdeDocker

40–60

Compression Ratio

~3.8x

Speedup vs Single-thread

4

Max Input Size Tested

4

Chunk Size

01

Problem

Problem statement, constraint shape, and the gap this project explores.

Problem Statement

How do you compress multi-GB files with Huffman coding without running into single-threaded throughput limits?

Challenge

Engineered compression system for high-throughput environments where memory constraints, processing latency, and data integrity requirements shaped core architectural decisions.

Why Existing Approaches Failed

Standard Huffman compression is inherently sequential — you build a frequency table, construct the tree, then encode. This serializes the entire pipeline. For multi-GB inputs, single-threaded encoding becomes a bottleneck. HuffSpace explores a chunk-parallel approach: split input into independent segments, encode each in parallel, and reconstruct with per-chunk Huffman trees embedded in the output stream.

02

Constraints

Non-negotiable boundaries that shaped the implementation.

Input Size

Multi-GB files

Memory

Bounded — no full-file buffering

CPU

All available cores via Rayon

Correctness

Bit-exact decompression required

Streaming

Sequential output despite parallel encoding

03

Architecture

The primary design surface: flow, subsystem roles, and state boundaries.

Architecture Brief

HuffSpace divides input into fixed-size chunks, processes each chunk independently through frequency analysis → tree construction → encoding, then serializes the output stream with per-chunk metadata headers for decompression.

Execution Flow
01

Input file

02

Chunker

03

[Parallel: Frequency

04

Tree

05

Encode]

06

Stream Assembler

07

Compressed output

01

Chunker

Splits input into fixed-size segments for parallel processing.

02

Frequency Analyzer

Builds per-chunk byte frequency tables.

03

Tree Builder

Constructs Huffman trees from frequency tables using a min-heap.

04

Encoder

Bit-packs symbols according to the Huffman code table.

05

Stream Assembler

Serializes chunk outputs with metadata for sequential decompression.

04

Engineering Tradeoffs

Design review notes: what was optimized and what was deliberately left behind.

EDR-01
Decision

Chunk-parallel vs streaming

Why Chosen

Streaming Huffman requires sequential frequency analysis. Chunk parallelism trades slightly higher memory for significant throughput gains.

Alternative Rejected

Memory-optimal streaming approach

Impact

Rayon-based chunk parallelism

EDR-02
Decision

Per-chunk trees vs global tree

Why Chosen

A global tree requires two passes over the entire file. Per-chunk trees allow single-pass parallel encoding at the cost of 5–10% compression ratio.

Alternative Rejected

Optimal global compression ratio

Impact

Per-chunk Huffman trees

EDR-03
Decision

Fixed chunk size vs adaptive

Why Chosen

Fixed chunks simplify parallelism scheduling and memory allocation. Adaptive chunking adds complexity without significant ratio benefit for binary data.

Alternative Rejected

Content-aware chunking

Impact

Fixed 4MB chunks

05

Failure Modes

Incident-style notes for the ways the design can break.

Bit boundary misalignment on decompression

FM-01
Impact

Chunk boundaries require careful bit offset tracking. Off-by-one errors cause cascading decompression failures.

Mitigation

with explicit boundary markers.

Memory pressure on large files

FM-02
Impact

Holding all chunk outputs in memory before assembly caused OOM on constrained systems.

Mitigation

by streaming assembled chunks directly to disk.

Rayon thread pool starvation

FM-03
Impact

Imbalanced chunk sizes caused some threads to idle.

Mitigation

by equalizing chunk sizes and using dynamic work-stealing.

Corrupted output on panic in worker thread

FM-04
Impact

Rayon propagates panics to the joining thread. Added explicit error handling to ensure partial output is discarded cleanly.

Mitigation

Rayon propagates panics to the joining thread. Added explicit error handling to ensure partial output is discarded cleanly.

06

Benchmarks

Environment first, numbers second. Metrics should be inspectable, not ornamental.

Test Environment
Runtime

Rust / rayon

Workload

40–60% compression ratio

Stack

Rust, rayon, clap, serde

Scope

Systems Programming

Evidence

Project-level benchmark notes

Performance Results
Compression Ratio

40–60%

Speedup vs Single-thread

~3.8xon 8 cores

Max Input Size Tested

4GB

Chunk Size

4MB

Peak Memory

~2x chunk sizeper worker

07

Lessons Learned

Engineering takeaways from the implementation, including remaining work.

01

Parallelizing inherently sequential algorithms requires rethinking the data format

per-chunk metadata is necessary overhead.

02

Rayon's work-stealing scheduler handles uneven workloads well, but only if tasks are appropriately sized.

03

Bit-level encoding bugs are extremely hard to debug without property-based test roundtrip checks.

04

Future: explore ZSTD-style dictionary compression across chunks to recover the ratio loss from per-chunk trees.

Akshat