Infrastructure

VAI (Virtual AI Inference) — What Hardware Engineers See

Why model weights should be treated like ROM, not malloc(). A hardware engineer's perspective on AI inference — sharing what we see, so we can build better systems together.

R&D Team  |  January 2026  |  Design & Verification

We're hardware people. RTL, timing closure, DRC, tapeout.

We've spent careers thinking about state machines, memory hierarchies, and what happens when you get the architecture wrong at the gate level.

When we started working with LLMs, we noticed patterns that looked familiar — and some that looked fixable. This is our perspective. Not a critique, but an invitation: here's what we see from the hardware side. Maybe together we can build something better.

The Observation

Every time you switch models in a typical inference setup, the system unloads weights from GPU memory, loads new weights from disk, rebuilds execution state, and warms up again.

This takes 2-10 seconds. Sometimes more.

We kept asking a simple question:

The Question
Why are we reloading data that never changes?

Model weights are immutable. They don't change between inference calls. They don't change between users. They don't change between processes.

In hardware terms, weights are ROM. Configuration data. Firmware.

You don't reprogram ROM on every transaction.

The Reframe

We started mapping AI concepts to hardware concepts — not to say one is right, but because it's the lens we know. And through that lens, the problem looked solvable:

What ML Calls It  │ What It Actually Is
Model weights     │ ROM / Firmware
KV cache          │ SRAM / Scratchpad
Prompt            │ Input transaction
Token generation  │ Pipeline execution
Model switch      │ Context select

Through this lens, a different architecture suggests itself. What if we treated weights the way we treat firmware? Load once, reference forever.

VAI Architecture

VAI is our answer. The core principle is simple:

Core Principle
Model weights live longer than processes.
VAI Shared Memory Architecture
[Figure: Clients A through N send inference requests to the VAI daemon, which is the memory owner, GPU owner, and weight registry. Each model (Model A at 1.5B, Model B at 6.7B, through Model N) lives in its own shared memory segment; all models stay resident on the GPU with zero-copy access.]

The implementation uses POSIX shared memory for the weight registry, mmap for zero-copy visibility, a daemon that owns memory and GPU resources, and stateless clients that attach and detach without touching the weights.
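
To make the attach path concrete, here is a minimal sketch of what a stateless client looks like under that design: open an already-published POSIX shared memory segment, mmap it read-only, and go. The segment name /vai_qwen2_1.5b and the vai_seg_header layout are illustrative stand-ins of ours, not the actual VAI interface.

vai_client_attach.c — illustrative sketch

/* Minimal sketch: a stateless client attaching to a weight segment
   already published by the daemon over POSIX shared memory.
   Segment name and header layout are illustrative, not the real API. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical header the daemon writes at the start of each segment. */
typedef struct {
    uint64_t magic;         /* sanity check that the segment is valid */
    uint64_t weight_bytes;  /* size of the weight blob that follows   */
    uint32_t n_layers;
    uint32_t n_embd;
} vai_seg_header;

int main(void) {
    /* Attach read-only: the client never owns or modifies the weights. */
    int fd = shm_open("/vai_qwen2_1.5b", O_RDONLY, 0);
    if (fd < 0) { perror("shm_open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* mmap gives zero-copy visibility into the daemon-owned weights. */
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                              /* mapping survives the close */
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    const vai_seg_header *hdr = (const vai_seg_header *)base;
    printf("attached: %llu weight bytes, %u layers, embd %u\n",
           (unsigned long long)hdr->weight_bytes, hdr->n_layers, hdr->n_embd);

    /* ... run inference against the mapped weights ... */

    munmap(base, (size_t)st.st_size);       /* detach; weights stay put */
    return 0;
}

The ownership split is the whole point: the daemon is the only writer, and a client that exits or crashes simply drops its mapping. The weights never move.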

Proof: Multi-Model Switch Test

We ran a test alternating between two models — a fast 1.5B parameter model and a deeper 6.7B coding model. Here's what actually happened:

vai-shm-test — Shared Memory Status
╔═══════════════════════════════════════════════════════════════╗
         WROM SHARED MEMORY STATUS                             
╠═══════════════════════════════════════════════════════════════╣
 Daemon PID    : Active                                        
 Daemon Alive  : YES                                           
 Models Loaded : 2                                             
╠═══════════════════════════════════════════════════════════════╣
 1. qwen2-1.5b-q4                                            
    Loaded: Y  GPU: Y  Refs: 0    Time: 767 ms              
    Vocab: 151936  Layers: 28   Embd: 1536                   
 2. deepseek-coder-6.7b-instruct.Q4_K_M                      
    Loaded: Y  GPU: Y  Refs: 0    Time: 525 ms              
    Vocab: 32256   Layers: 32   Embd: 4096                   
╚═══════════════════════════════════════════════════════════════╝

Both models loaded and resident in GPU memory. Now we alternate between them with real inference tasks:

vai-shm-test — Alternating Model Inference
Running alternating model inference...

[ 1] qwen2-1.5b-q4                  → "Write Verilog AND gate"...  first=40ms
[ 2] deepseek-coder-6.7b            → "Explain flip-flop"...       first=630ms
[ 3] qwen2-1.5b-q4                  → "Write UART module"...       first=13ms
[ 4] deepseek-coder-6.7b            → "What is CDC"...              first=235ms
[ 5] qwen2-1.5b-q4                  → "Write FSM"...                first=10ms
[ 6] deepseek-coder-6.7b            → "Write Verilog AND gate"...  first=325ms
[ 7] qwen2-1.5b-q4                  → "Explain flip-flop"...       first=13ms
[ 8] deepseek-coder-6.7b            → "Write UART module"...       first=225ms
[ 9] qwen2-1.5b-q4                  → "What is CDC"...              first=10ms
[10] deepseek-coder-6.7b            → "Write FSM"...                first=178ms

Results Summary

vai-shm-test — Results
╔═══════════════════════════════════════════════════════════════════════════╗
                     MULTI-MODEL SWITCH RESULTS                            
╠═══════════════════════════════════════════════════════════════════════════╣
  #  │ Model                           │ First Token │ Total    │ TPS     
╠═════╪═════════════════════════════════╪═════════════╪══════════╪═════════╣
  1  │ qwen2-1.5b-q4                   │      40 ms  │   239 ms │  160.8  
  2  │ deepseek-coder-6.7b             │     630 ms  │  3371 ms │   11.7  
  3  │ qwen2-1.5b-q4                   │      13 ms  │   211 ms │  161.6  
  4  │ deepseek-coder-6.7b             │     235 ms  │  2750 ms │   12.7  
  5  │ qwen2-1.5b-q4                   │      10 ms  │   206 ms │  163.3  
  6  │ deepseek-coder-6.7b             │     325 ms  │  2844 ms │   12.7  
  7  │ qwen2-1.5b-q4                   │      13 ms  │   210 ms │  162.4  
  8  │ deepseek-coder-6.7b             │     225 ms  │  2771 ms │   12.6  
  9  │ qwen2-1.5b-q4                   │      10 ms  │   208 ms │  161.6  
 10  │ deepseek-coder-6.7b             │     178 ms  │  2689 ms │   12.7  
╚═══════════════════════════════════════════════════════════════════════════╝

AVERAGE FIRST TOKEN BY MODEL:
  qwen2-1.5b-q4                 :   17 ms
  deepseek-coder-6.7b           :  318 ms
At a glance:
Avg first token (1.5B model): 17 ms
Avg first token (6.7B model): 318 ms
Model switch overhead: ~0 ms
Throughput (1.5B model): 160+ TPS

The latency difference between models is pure compute — the 6.7B model is simply bigger and takes longer per token. The first call to each model also pays a one-time warm-up (40 ms and 630 ms to first token), but after that, switching back and forth costs nothing: the 1.5B model settles at 10-13 ms to first token, the 6.7B at roughly 180-325 ms, with no reload spike anywhere in the run. Zero overhead on context swap.
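
For reference, the alternating measurement boils down to a loop like the one below. vai_generate_first_token() is a placeholder stub we use so the harness compiles standalone; in the real test it would be the client call into the daemon, and the printed times would be the first-token latencies shown above.

switch_test.c — measurement loop sketch

/* Sketch of the alternating-switch measurement: time to first token is
   recorded per request while flipping between two resident models.     */
#include <stdio.h>
#include <time.h>

/* Placeholder stub: the real test issues a request to the VAI daemon. */
static void vai_generate_first_token(const char *model, const char *prompt) {
    (void)model; (void)prompt;
}

static double ms_since(const struct timespec *t0) {
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0->tv_sec) * 1e3 + (t1.tv_nsec - t0->tv_nsec) / 1e6;
}

int main(void) {
    const char *models[2]  = { "qwen2-1.5b-q4", "deepseek-coder-6.7b" };
    const char *prompts[5] = { "Write Verilog AND gate", "Explain flip-flop",
                               "Write UART module", "What is CDC", "Write FSM" };

    for (int i = 0; i < 10; i++) {
        struct timespec t0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* Alternate models on every request: any reload cost would show
           up here as a spike in first-token latency.                    */
        vai_generate_first_token(models[i % 2], prompts[i % 5]);
        printf("[%2d] %-22s first=%.1f ms\n", i + 1, models[i % 2], ms_since(&t0));
    }
    return 0;
}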

Traditional approach: 2-10 s per model switch (unload + reload every time)
VAI SHM: ~0 ms per switch (context swap only; weights stay resident)

What This Means for Silicon

Here's where it gets interesting. The abstractions we've built in VAI map directly to hardware design decisions. This is the perspective we want to share:

For RTL Engineers

VAI defines a stable contract between weights, context, and compute. This maps directly to SRAM blocks, weight banks, and MAC arrays. You can design hardware against it.
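
As one illustration of what that contract could look like, here is a registry-entry layout of our own invention, with field names chosen to mirror the status dump above rather than the shipped VAI definition. A fixed, versioned record like this is something RTL can be designed against.

vai_registry.h — illustrative contract sketch

/* Illustrative only: one possible fixed layout for a weight-registry
   entry. The static/dynamic split below is the part that maps to
   silicon: weight banks vs. per-context SRAM feeding the MAC arrays.  */
#include <stdint.h>

#define VAI_NAME_MAX 64

typedef struct {
    char     name[VAI_NAME_MAX];  /* e.g. "qwen2-1.5b-q4"                    */
    uint64_t weight_bytes;        /* static footprint: the ROM/firmware part */
    uint64_t kv_bytes_per_ctx;    /* dynamic footprint: SRAM / scratchpad    */
    uint32_t n_vocab;             /* 151936 / 32256 in the status dump       */
    uint32_t n_layers;            /* 28 / 32 in the status dump              */
    uint32_t n_embd;              /* 1536 / 4096 in the status dump          */
    uint32_t gpu_resident;        /* 1 once the weights are uploaded         */
    uint32_t ref_count;           /* attached clients; weights outlive them  */
    uint32_t version;             /* contract version, bumped on any change  */
} vai_registry_entry;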

For DV Engineers

Deterministic behavior. No hidden reload states. Clear lifecycle boundaries. Inference becomes verifiable, not probabilistic chaos.
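
Written down as a state machine, the lifecycle we are describing is small enough to check exhaustively. The enum below is our sketch of it, not production code; the useful property is that "resident" has no edge back to "loading", so a hidden reload becomes a checkable protocol violation.

vai_lifecycle.c — lifecycle sketch for verification

/* Illustrative lifecycle for a model entry as seen from DV: every
   transition is explicit, so a checker can flag any hidden reload.   */
#include <assert.h>
#include <stdbool.h>

typedef enum {
    VAI_UNLOADED,   /* no segment published                     */
    VAI_LOADING,    /* daemon reading weights from disk, once   */
    VAI_RESIDENT,   /* weights in shared memory and on the GPU  */
    VAI_RETIRED     /* explicitly torn down by the daemon       */
} vai_model_state;

/* Legal transitions only; clients never drive these. */
static bool vai_transition_ok(vai_model_state from, vai_model_state to) {
    switch (from) {
    case VAI_UNLOADED: return to == VAI_LOADING;
    case VAI_LOADING:  return to == VAI_RESIDENT || to == VAI_UNLOADED;
    case VAI_RESIDENT: return to == VAI_RETIRED;   /* no silent reload */
    case VAI_RETIRED:  return false;
    }
    return false;
}

int main(void) {
    /* The same rules a DV checker or scoreboard would assert. */
    assert( vai_transition_ok(VAI_UNLOADED, VAI_LOADING));
    assert( vai_transition_ok(VAI_LOADING,  VAI_RESIDENT));
    assert(!vai_transition_ok(VAI_RESIDENT, VAI_LOADING));
    return 0;
}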

For Physical Design

Model residency means predictable memory footprint. Clear partitioning of static vs dynamic memory. This is how you floorplan an NPU.

The Bigger Picture

We didn't build this for cloud serving. We built it because we needed it for our own work — hardware development assisted by AI.

When you're debugging RTL, you might want a fast model for quick syntax questions, a deeper model for architectural reasoning, a specialized model for verification, and another for documentation. Waiting 5 seconds on every switch defeats the purpose.

But here's what excites us: this isn't just a software optimization. It's a shared language between hardware and software teams. When we describe inference in terms of memory hierarchy, state management, and resource ownership — we're speaking a language that translates directly to silicon.

NPUs and AI accelerators will need to support multiple models resident simultaneously. They'll need clean interfaces between weight memory and compute. By defining these abstractions in software first, we create a blueprint that hardware and software teams can build toward together.

Our Belief
The best systems emerge when hardware and software teams share perspective. VAI is our contribution to that conversation.

Closing

VAI came from necessity. We were building AI-assisted tools for hardware development and kept hitting the same wall — the inference stack wasn't designed for how we wanted to work.

So we designed something that was.

This is our perspective from the hardware side. We're not saying it's the only way to see things — but it's what we see. And we believe that when hardware engineers and software engineers share their perspectives openly, we build systems that neither could build alone.

If any of this resonates — whether you're in hardware, software, or somewhere in between — we'd love to hear your perspective too.

That's how we move forward. Together.