VAI (Virtual AI Inference) — What Hardware Engineers See
Why model weights should be treated like ROM, not malloc(). A hardware engineer's perspective on AI inference — sharing what we see, so we can build better systems together.
We're hardware people. RTL, timing closure, DRC, tapeout.
We've spent careers thinking about state machines, memory hierarchies, and what happens when you get the architecture wrong at the gate level.
When we started working with LLMs, we noticed patterns that looked familiar — and some that looked fixable. This is our perspective. Not a critique, but an invitation: here's what we see from the hardware side. Maybe together we can build something better.
The Observation
Every time you switch models in a typical inference setup, the system unloads weights from GPU memory, loads new weights from disk, rebuilds execution state, and warms up again.
This takes 2-10 seconds. Sometimes more.
We kept asking a simple question: why reload something that never changes?
Model weights are immutable. They don't change between inference calls. They don't change between users. They don't change between processes.
In hardware terms, weights are ROM. Configuration data. Firmware.
You don't reprogram ROM on every transaction.
The Reframe
We started mapping AI concepts to hardware concepts — not to say one is right, but because it's the lens we know. And through that lens, the problem looked solvable:
| What ML Calls It | What It Actually Is |
|---|---|
| Model weights | ROM / Firmware |
| KV cache | SRAM / Scratchpad |
| Prompt | Input transaction |
| Token generation | Pipeline execution |
| Model switch | Context select |
Through this lens, a different architecture suggests itself. What if we treated weights the way we treat firmware? Load once, reference forever.
VAI Architecture
VAI is our answer. The core principle is simple: load the weights once into shared, read-only memory and let every client reference them in place. Never copy, never reload.
The implementation uses POSIX shared memory for the weight registry, mmap for zero-copy visibility, a daemon that owns memory and GPU resources, and stateless clients that attach and detach without touching the weights.
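To make that concrete, here is a minimal sketch of the pattern in C, written for this post rather than taken from VAI itself: a daemon creates and owns a POSIX shared-memory registry, and a stateless client maps it read-only. Every name in it (the `/vai_weights` segment, the `vai_registry` layout) is our invention for illustration; a real registry also has to carry the weight blobs, GPU bookkeeping, and reference counts.

```c
/* Sketch of "load once, reference forever" with POSIX shm + mmap.
 * Names and layout are illustrative, not the actual VAI implementation.
 * Build: cc -o wrom_sketch wrom_sketch.c -lrt
 * Run:   ./wrom_sketch daemon   (publishes the registry)
 *        ./wrom_sketch          (client: attaches read-only, never copies)
 */
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME   "/vai_weights"   /* hypothetical registry name */
#define MAX_MODELS 8

typedef struct {
    char   name[64];     /* model identifier                     */
    size_t offset;       /* byte offset of weights in the region */
    size_t size;         /* weight blob size                     */
    int    resident;     /* 1 = loaded and ready                 */
} model_slot;

typedef struct {
    int        daemon_alive;
    int        model_count;
    model_slot models[MAX_MODELS];
    /* weight blobs would follow this header in the same region */
} vai_registry;

static int run_daemon(void) {
    /* Daemon owns the memory: create the segment once, size it, fill it. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(vai_registry)) < 0) { perror("ftruncate"); return 1; }

    vai_registry *reg = mmap(NULL, sizeof *reg, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
    if (reg == MAP_FAILED) { perror("mmap"); return 1; }

    reg->daemon_alive = 1;
    reg->model_count  = 1;
    snprintf(reg->models[0].name, sizeof reg->models[0].name, "qwen2-1.5b-q4");
    reg->models[0].resident = 1;   /* a real daemon would load weights here */

    printf("daemon: registry published at %s\n", SHM_NAME);
    pause();                       /* keep the region alive */
    return 0;
}

static int run_client(void) {
    /* Clients are stateless: attach read-only, look up a model, detach. */
    int fd = shm_open(SHM_NAME, O_RDONLY, 0);
    if (fd < 0) { perror("shm_open"); return 1; }

    const vai_registry *reg = mmap(NULL, sizeof *reg, PROT_READ,
                                   MAP_SHARED, fd, 0);
    if (reg == MAP_FAILED) { perror("mmap"); return 1; }

    for (int i = 0; i < reg->model_count; i++)
        printf("client: sees resident model '%s'\n", reg->models[i].name);

    munmap((void *)reg, sizeof *reg);  /* detach; weights never copied */
    close(fd);
    return 0;
}

int main(int argc, char **argv) {
    if (argc > 1 && strcmp(argv[1], "daemon") == 0) return run_daemon();
    return run_client();
}
```

The useful property is ownership: the daemon is the only writer, so clients can map the region PROT_READ and treat the weights exactly like ROM, attaching and detaching without ever copying them.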
Proof: Multi-Model Switch Test
We ran a test alternating between two models — a fast 1.5B parameter model and a deeper 6.7B coding model. Here's what actually happened:
```
╔═══════════════════════════════════════════════════════════════╗
║                   WROM SHARED MEMORY STATUS                    ║
╠═══════════════════════════════════════════════════════════════╣
║ Daemon PID    : Active                                         ║
║ Daemon Alive  : YES                                            ║
║ Models Loaded : 2                                              ║
╠═══════════════════════════════════════════════════════════════╣
║ 1. qwen2-1.5b-q4                                               ║
║    Loaded: Y  GPU: Y  Refs: 0  Time: 767 ms                    ║
║    Vocab: 151936  Layers: 28  Embd: 1536                       ║
║ 2. deepseek-coder-6.7b-instruct.Q4_K_M                         ║
║    Loaded: Y  GPU: Y  Refs: 0  Time: 525 ms                    ║
║    Vocab: 32256  Layers: 32  Embd: 4096                        ║
╚═══════════════════════════════════════════════════════════════╝
```
Both models loaded and resident in GPU memory. Now we alternate between them with real inference tasks:
```
Running alternating model inference...
 [ 1] qwen2-1.5b-q4        → "Write Verilog AND gate"...   ✓ first=40ms
 [ 2] deepseek-coder-6.7b  → "Explain flip-flop"...        ✓ first=630ms
 [ 3] qwen2-1.5b-q4        → "Write UART module"...        ✓ first=13ms
 [ 4] deepseek-coder-6.7b  → "What is CDC"...              ✓ first=235ms
 [ 5] qwen2-1.5b-q4        → "Write FSM"...                ✓ first=10ms
 [ 6] deepseek-coder-6.7b  → "Write Verilog AND gate"...   ✓ first=325ms
 [ 7] qwen2-1.5b-q4        → "Explain flip-flop"...        ✓ first=13ms
 [ 8] deepseek-coder-6.7b  → "Write UART module"...        ✓ first=225ms
 [ 9] qwen2-1.5b-q4        → "What is CDC"...              ✓ first=10ms
 [10] deepseek-coder-6.7b  → "Write FSM"...                ✓ first=178ms
```
Results Summary
| # | Model | First Token | Total | TPS (tok/s) |
|---|---|---|---|---|
| 1 | qwen2-1.5b-q4 | 40 ms | 239 ms | 160.8 |
| 2 | deepseek-coder-6.7b | 630 ms | 3371 ms | 11.7 |
| 3 | qwen2-1.5b-q4 | 13 ms | 211 ms | 161.6 |
| 4 | deepseek-coder-6.7b | 235 ms | 2750 ms | 12.7 |
| 5 | qwen2-1.5b-q4 | 10 ms | 206 ms | 163.3 |
| 6 | deepseek-coder-6.7b | 325 ms | 2844 ms | 12.7 |
| 7 | qwen2-1.5b-q4 | 13 ms | 210 ms | 162.4 |
| 8 | deepseek-coder-6.7b | 225 ms | 2771 ms | 12.6 |
| 9 | qwen2-1.5b-q4 | 10 ms | 208 ms | 161.6 |
| 10 | deepseek-coder-6.7b | 178 ms | 2689 ms | 12.7 |

Average first token by model: qwen2-1.5b-q4 at 17 ms, deepseek-coder-6.7b at 318 ms.
The latency difference between the two models is pure compute: the 6.7B model is simply larger and takes longer per token. What matters is the switching cost, and there isn't one. After the first call to each model, first-token latency settles into a narrow band (10-13 ms for the 1.5B model, roughly 180-330 ms for the 6.7B) even though every request switches models. No reload, no rebuild, no penalty on the context swap.
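Seen from the client side, a "context select" in this architecture is nothing more than a lookup against the already-mapped registry. The sketch below is again our own illustrative code, reusing the hypothetical `vai_registry` layout from the earlier example rather than VAI's actual API; it shows why there is nothing left to pay on a switch: no file I/O, no allocation, just a pointer to weights that are already resident.

```c
/* Illustrative only: "model switch" as a lookup in the shared registry.
 * Types mirror the hypothetical layout from the earlier sketch.
 */
#include <stddef.h>
#include <string.h>

#define MAX_MODELS 8

typedef struct {
    char   name[64];
    size_t offset;      /* where this model's weights live in the region */
    size_t size;
    int    resident;
} model_slot;

typedef struct {
    int        daemon_alive;
    int        model_count;
    model_slot models[MAX_MODELS];
} vai_registry;

/* Context select: a string compare per resident model, nothing else.
 * No disk I/O, no reallocation, no warm-up, no weight transfer. */
const model_slot *vai_select(const vai_registry *reg, const char *name) {
    for (int i = 0; i < reg->model_count; i++)
        if (reg->models[i].resident && strcmp(reg->models[i].name, name) == 0)
            return &reg->models[i];
    return NULL;  /* not resident: the only case that costs anything */
}
```

Alternating requests between qwen2-1.5b-q4 and deepseek-coder-6.7b, as in the run above, then just alternates which slot the client dereferences.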
What This Means for Silicon
Here's where it gets interesting. The abstractions we've built in VAI map directly to hardware design decisions. This is the perspective we want to share:
- VAI defines a stable contract between weights, context, and compute. It maps directly to SRAM blocks, weight banks, and MAC arrays; you can design hardware against it (a sketch of such a contract follows this list).
- Deterministic behavior: no hidden reload states, clear lifecycle boundaries. Inference becomes verifiable, not probabilistic chaos.
- Model residency means a predictable memory footprint and a clear partition between static and dynamic memory. This is how you floorplan an NPU.
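To show what "a contract you can design hardware against" could look like on paper, here is a hypothetical interface header. It is ours, not a published VAI artifact: it separates the static side from the dynamic side and gives compute nothing but handles to both.

```c
/* Hypothetical contract header, written to illustrate the point;
 * not a published VAI interface. The split mirrors a floorplan:
 *   weight_bank -> static memory (ROM-like, sized at load time)
 *   infer_ctx   -> dynamic memory (KV cache / scratchpad, per request)
 *   compute     -> operates only on handles, never owns either.
 */
#ifndef VAI_CONTRACT_H
#define VAI_CONTRACT_H

#include <stddef.h>
#include <stdint.h>

typedef struct {                 /* static: fixed once the model is loaded     */
    const void *base;            /* read-only weight region                    */
    size_t      bytes;
    uint32_t    layers;
    uint32_t    embd;
} weight_bank;

typedef struct {                 /* dynamic: allocated per request or session  */
    void    *kv_cache;           /* scratchpad, bounded by max context length  */
    size_t   kv_bytes;
    uint32_t n_past;             /* tokens already in the cache                */
} infer_ctx;

/* The whole lifecycle, with no hidden states between these calls. */
int  vai_ctx_open (const weight_bank *w, infer_ctx *c, size_t max_tokens);
int  vai_decode   (const weight_bank *w, infer_ctx *c,
                   const int32_t *tokens, size_t n, float *logits_out);
void vai_ctx_close(infer_ctx *c);

#endif /* VAI_CONTRACT_H */
```

Everything in `weight_bank` can sit in a statically sized weight memory, everything in `infer_ctx` is bounded scratchpad, and the three functions are the only traffic allowed between them; that is a floorplan as much as it is an API.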
The Bigger Picture
We didn't build this for cloud serving. We built it because we needed it for our own work — hardware development assisted by AI.
When you're debugging RTL, you might want a fast model for quick syntax questions, a deeper model for architectural reasoning, a specialized model for verification, and another for documentation. Waiting 5 seconds between each defeats the purpose.
But here's what excites us: this isn't just a software optimization. It's a shared language between hardware and software teams. When we describe inference in terms of memory hierarchy, state management, and resource ownership — we're speaking a language that translates directly to silicon.
NPUs and AI accelerators will need to support multiple models resident simultaneously. They'll need clean interfaces between weight memory and compute. By defining these abstractions in software first, we create a blueprint that hardware and software teams can build toward together.
Closing
VAI came from necessity. We were building AI-assisted tools for hardware development and kept hitting the same wall — the inference stack wasn't designed for how we wanted to work.
So we designed something that was.
This is our perspective from the hardware side. We're not saying it's the only way to see things — but it's what we see. And we believe that when hardware engineers and software engineers share their perspectives openly, we build systems that neither could build alone.
If any of this resonates — whether you're in hardware, software, or somewhere in between — we'd love to hear your perspective too.
That's how we move forward. Together.