QuantumLock
Post-quantum cryptographic infrastructure benchmarked under CPU-only constraints.
<50ms
Keygen Latency
27%
Keygen Reduction
1M+
Operations Benchmarked
Kyber-768
Security Level
Problem
Problem statement, constraint shape, and the gap this project explores.
Can post-quantum algorithms meet latency requirements on commodity CPU-only hardware?
Built cryptographic infrastructure balancing security guarantees with operational constraints. Designed for environments where performance, memory usage, and algorithm choice directly impact system viability.
Most post-quantum cryptography research benchmarks on high-end hardware with AVX-512 extensions. The real challenge is deploying PQ algorithms on constrained infrastructure: cloud VMs without AVX-512, edge devices, and environments where GPU acceleration is unavailable. QuantumLock explored whether Kyber-based key encapsulation and its underlying lattice arithmetic could meet a sub-50ms per-operation latency target under these constraints.
Constraints
Non-negotiable boundaries that shaped the implementation.
CPU-only, no AVX-512
<50ms per operation
1M+ operations benchmarked
8-core parallel execution
Bounded working set
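The <50ms per-operation constraint can be checked mechanically against recorded samples. The following is a minimal sketch, assuming latencies are collected as `Duration` values. The names `percentile` and `within_budget` and the nearest-rank percentile method are illustrative assumptions, not taken from the project.

```rust
use std::time::Duration;

/// Per-operation latency budget from the constraints list.
const BUDGET: Duration = Duration::from_millis(50);

/// Nearest-rank percentile of a sample set; `p` is in (0, 100].
/// Returns None for an empty sample set.
fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest rank: the smallest sample with at least p% of samples at or below it.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// True when the chosen percentile stays strictly under the budget.
fn within_budget(samples: &[Duration], p: f64) -> bool {
    percentile(samples, p).map_or(false, |d| d < BUDGET)
}
```

Checking a tail percentile rather than the mean keeps occasional slow operations visible, which matters when the budget is stated per operation.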
Architecture
The primary design surface: flow, subsystem roles, and state boundaries.
QuantumLock implements a Kyber-768 key encapsulation pipeline in Rust using the pqcrypto crate family. The benchmark harness runs keygen, encapsulation, and decapsulation across thread pools of varying sizes to expose contention points.
Keygen
Public/Private keypair
Encapsulate (shared secret + ciphertext)
Decapsulate (recover and verify shared secret)
Benchmark record
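The flow above can be sketched as a harness that is generic over the KEM. This is an illustration under stated assumptions: the `Kem` trait, `BenchRecord`, and `run_pipeline` are hypothetical names, and `ToyKem` is an insecure stand-in that only exercises the harness. A real run would implement the trait over an audited Kyber-768 binding.

```rust
use std::time::{Duration, Instant};

/// Hypothetical KEM interface mirroring keygen -> encapsulate -> decapsulate.
trait Kem {
    type PublicKey;
    type SecretKey;
    type Ciphertext;
    type SharedSecret: PartialEq;

    fn keygen(&mut self) -> (Self::PublicKey, Self::SecretKey);
    fn encapsulate(&mut self, pk: &Self::PublicKey) -> (Self::SharedSecret, Self::Ciphertext);
    fn decapsulate(&self, ct: &Self::Ciphertext, sk: &Self::SecretKey) -> Self::SharedSecret;
}

/// One benchmark record: per-stage wall time plus a correctness flag.
#[derive(Debug)]
struct BenchRecord {
    keygen: Duration,
    encapsulate: Duration,
    decapsulate: Duration,
    secrets_match: bool,
}

/// Runs the three stages once and times each stage separately.
fn run_pipeline<K: Kem>(kem: &mut K) -> BenchRecord {
    let t = Instant::now();
    let (pk, sk) = kem.keygen();
    let keygen = t.elapsed();

    let t = Instant::now();
    let (ss_sender, ct) = kem.encapsulate(&pk);
    let encapsulate = t.elapsed();

    let t = Instant::now();
    let ss_receiver = kem.decapsulate(&ct, &sk);
    let decapsulate = t.elapsed();

    BenchRecord { keygen, encapsulate, decapsulate, secrets_match: ss_sender == ss_receiver }
}

/// INSECURE stand-in used only to exercise the harness; it is not Kyber.
struct ToyKem {
    counter: u64,
}

impl Kem for ToyKem {
    type PublicKey = u64;
    type SecretKey = u64;
    type Ciphertext = u64;
    type SharedSecret = u64;

    fn keygen(&mut self) -> (u64, u64) {
        self.counter += 1;
        (self.counter, self.counter) // "public" and "secret" key are identical
    }

    fn encapsulate(&mut self, pk: &u64) -> (u64, u64) {
        self.counter += 1;
        let secret = self.counter.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        (secret, secret ^ pk) // ciphertext masks the secret with the key
    }

    fn decapsulate(&self, ct: &u64, sk: &u64) -> u64 {
        ct ^ sk
    }
}
```

Timing each stage on its own makes it possible to see which stage dominates, and the `secrets_match` flag catches a broken binding before its numbers are trusted.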
KEM Pipeline
Kyber-768 keygen → encapsulation → decapsulation chain.
Benchmark Harness
Criterion-based microbenchmarks with statistical analysis.
Thread Pool Manager
Rayon-based parallelism with configurable worker counts.
SIMD Dispatcher
Runtime detection of SIMD capabilities with scalar fallback.
Serialization Layer
serde-based key serialization for persistence benchmarks.
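The SIMD Dispatcher role can be illustrated with a runtime feature check that falls back to a scalar path. This is a sketch, not the project's dispatcher: the function names are hypothetical, the AVX2 path relies on compiler auto-vectorization rather than hand-written intrinsics, and coefficient addition modulo Kyber's q = 3329 stands in for the real polynomial arithmetic.

```rust
/// Kyber's coefficient modulus.
const Q: u16 = 3329;

/// Coefficient-wise addition mod Q. Inputs must already be reduced (< Q).
#[inline(always)]
fn add_mod_q(a: &[u16], b: &[u16], out: &mut [u16]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        let s = x + y; // at most 2 * (Q - 1), which fits in u16
        *o = if s >= Q { s - Q } else { s };
    }
}

/// Portable scalar fallback.
fn poly_add_scalar(a: &[u16], b: &[u16], out: &mut [u16]) {
    add_mod_q(a, b, out);
}

/// Same loop compiled with AVX2 enabled so the compiler may vectorize it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn poly_add_avx2(a: &[u16], b: &[u16], out: &mut [u16]) {
    add_mod_q(a, b, out);
}

/// Picks a path at runtime and returns the name of the path used.
fn poly_add(a: &[u16], b: &[u16], out: &mut [u16]) -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the check above confirms AVX2 is available on this CPU.
            unsafe { poly_add_avx2(a, b, out) };
            return "avx2";
        }
    }
    poly_add_scalar(a, b, out);
    "scalar"
}

/// Cross-checks the dispatched path against the scalar reference on fixed input.
fn dispatch_matches_scalar() -> bool {
    let a: Vec<u16> = (0..256u32).map(|i| (i * 13 % 3329) as u16).collect();
    let b: Vec<u16> = (0..256u32).map(|i| (i * 29 % 3329) as u16).collect();
    let (mut fast, mut reference) = (vec![0u16; 256], vec![0u16; 256]);
    poly_add(&a, &b, &mut fast);
    poly_add_scalar(&a, &b, &mut reference);
    fast == reference
}
```

A startup cross-check like `dispatch_matches_scalar` is one way to catch a vectorized path that returns wrong results, though it cannot protect against a feature flag that is reported but traps on use.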
Engineering Tradeoffs
Design review notes: what was optimized and what was deliberately left behind.
SIMD vectorization vs naive parallelism
Naive thread parallelism caused lock contention on shared RNG state. SIMD operates within a single core, so its gains do not depend on thread count.
Gave up: portability across all CPU targets
Chose: SIMD vectorization for polynomial arithmetic
Kyber vs NTRU vs McEliece
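Kyber was selected by NIST for standardization (as ML-KEM in FIPS 203), has mature Rust implementations, and offers the best latency/security tradeoff at NIST security level 3.
Gave up: smaller ciphertexts (Classic McEliece ciphertexts are smaller, at the cost of far larger public keys)
Chose: Kyber-768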
pqcrypto vs custom implementation
Cryptographic correctness is non-negotiable. Audited reference implementations reduce the risk of subtle timing side channels in polynomial arithmetic.
Gave up: full control over inner arithmetic
Chose: pqcrypto crate
Failure Modes
Incident-style notes for the ways the design can break.
Thread contention on RNG
FM-01: Multiple threads sharing a single OsRng caused throughput degradation.
Mitigated by per-thread RNG instances.
Serialization overhead masking crypto latency
FM-02: serde serialization dominated benchmark numbers.
Mitigated by separating crypto benchmarks from serialization benchmarks.
Compiler optimization masking work
FM-03: Without black_box() hints, dead-code elimination skewed Criterion benchmark results.
Mitigated by adding black_box() hints.
SIMD detection false positives
FM-04: cpuid queries on some VMs returned incorrect capability flags.
Mitigated by runtime verification with fallback.
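The per-thread RNG mitigation for FM-01 can be sketched with scoped threads, where each worker owns its generator and no lock sits on the hot path. This is an illustration only: `SplitMix64` is a non-cryptographic stand-in for a properly seeded per-thread CSPRNG, and `run_workers` and its seeding scheme are hypothetical.

```rust
use std::thread;

/// Non-cryptographic SplitMix64 generator, used here only as a stand-in
/// for a real per-thread CSPRNG seeded from the OS.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Runs `ops_per_worker` draws on each of `workers` threads and returns
/// one accumulated value per worker. No generator state is shared.
fn run_workers(workers: usize, ops_per_worker: usize) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|id| {
                s.spawn(move || {
                    // Distinct fixed seed per worker (hypothetical scheme for the demo).
                    let mut rng = SplitMix64(id as u64 + 1);
                    let mut acc = 0u64;
                    for _ in 0..ops_per_worker {
                        acc ^= rng.next_u64();
                    }
                    acc
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

The contrast is with a single generator behind a `Mutex`, where every draw from every thread serializes on one lock. With 8 workers, `run_workers(8, n)` keeps all generator state thread-local.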
Benchmarks
Environment first, numbers second. Metrics should be inspectable, not ornamental.
Rust / ring
<50ms CPU-only latency
Rust, ring, pqcrypto, serde
Cryptography
Project-level benchmark notes
<50ms
27%
1M+ ops
Kyber-768, NIST L3
8 cores
Lessons Learned
Engineering takeaways from the implementation, including remaining work.
RNG contention is the silent killer of parallel cryptographic benchmarks: always use per-thread RNG.
SIMD gains in polynomial arithmetic are more significant than thread-level parallelism for this workload.
Benchmarking crypto correctly requires careful use of black_box() and warm-up iterations to avoid compiler artifacts.
Future: explore hardware acceleration via dedicated crypto coprocessors and measure the gap vs pure-Rust implementations.
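The point about black_box() and warm-up iterations can be illustrated without Criterion. The sketch below uses only the standard library; `time_op` and `workload` are hypothetical names, and the workload is a toy reduction modulo 3329 rather than a real Kyber operation.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Times `op` after discarding `warmup` untimed runs.
/// `black_box` keeps the optimizer from removing the call or its result.
fn time_op<T>(mut op: impl FnMut() -> T, warmup: usize, samples: usize) -> Vec<Duration> {
    // Warm-up: populate caches and let CPU frequency settle before measuring.
    for _ in 0..warmup {
        black_box(op());
    }
    (0..samples)
        .map(|_| {
            let start = Instant::now();
            black_box(op());
            start.elapsed()
        })
        .collect()
}

/// Toy workload: sum of squared coefficients, reduced mod 3329.
/// Coefficients are expected to be below 3329.
fn workload(coeffs: &[u32]) -> u32 {
    coeffs.iter().fold(0, |acc, &c| (acc + c * c) % 3329)
}
```

A call such as `time_op(|| workload(black_box(&data)), 10, 100)` passes the input through `black_box` as well, so the compiler cannot precompute the result from a known constant input.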