IN-HOUSE EDA - PART 3

Parallel Region-Based Routing on OpenROAD

Multithreading exists. We wanted parallelism.
WIOWIZ Technologies • January 2026 • 10 min read

1. The Distinction That Matters

OpenROAD's TritonRoute already supports multithreading. Within a single routing job, it can use multiple CPU threads to accelerate the work. This is a known capability, documented and functional.

But multithreading and parallelism are not the same thing.

Multithreading: One process, multiple threads, shared memory, single problem.

Parallelism: Multiple processes, independent problems, results merged afterward.

Multithreading helps when you have one large problem. Parallelism helps when you can decompose a problem into independent subproblems. The two approaches are complementary, not competing.

Our question was simple: can we add process-level parallelism on top of OpenROAD's existing multithreading, without modifying the router itself?

2. Why This Is Hard

Routing is not embarrassingly parallel. Nets interact. Metal layers are shared. Design rule checks span across regions. A naive split-and-route approach will produce DRC violations at region boundaries, or worse, electrically broken designs.

The academic literature describes region-based parallel routing, but papers tend to assume that regions are "well-chosen" without defining what that means. The gap between theory and working implementation is significant.

Commercial EDA tools have solved this problem internally, but the solutions are proprietary. What's published is either too abstract to implement or too incomplete to reproduce.

The challenge is not the router. TritonRoute works. The challenge is everything around it: how to partition, what to isolate, when to merge, how to handle nets that span regions.

3. Our Approach

We treat OpenROAD as a black box. No modifications to TritonRoute. No changes to OpenROAD's internals. We control only what goes in (DEF files, routing guides) and what comes out (routed DEFs).

The architecture is a wrapper layer that orchestrates multiple independent OpenROAD processes:

[Figure: Parallel region-based routing architecture. Input DEF + guides enter the wrapper layer (partition, schedule, orchestrate), which emits region_1.def … region_n.def. Each region is routed by its own multithreaded OpenROAD process, running in parallel, producing routed_1.def … routed_n.def. A merge + validate stage combines the routes, runs a DRC check, and writes final.def. Legend: our layer, OpenROAD, output.]

Each OpenROAD process can still use multithreading internally. We spawn multiple such processes, each handling a different region. Total compute utilization becomes: (number of parallel processes) × (threads per process).
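
To make the orchestration concrete, here is a minimal sketch of the process-level layer in Python. It assumes one pre-generated DEF, guide file, and routing Tcl script per region; the file names, region count, and command-line flags are illustrative, not our production wrapper. The point is only that a thin driver can fan out N multithreaded OpenROAD processes and collect their exit codes.

# Illustrative process-level orchestration (not the production wrapper).
# Assumes region_<i>.def / region_<i>.guide and a per-region Tcl script
# route_region_<i>.tcl already exist.
import subprocess
from concurrent.futures import ThreadPoolExecutor

REGIONS = range(1, 11)        # 10 regions, as in the PicoRV32 run
PROCS = 5                     # concurrent OpenROAD processes
THREADS_PER_PROC = 4          # intra-process multithreading budget

def route_region(i: int) -> int:
    """Launch one OpenROAD process for region i and wait for it to finish."""
    cmd = [
        "openroad",
        "-threads", str(THREADS_PER_PROC),  # per-process thread count
        "-exit",                            # quit once the script completes
        f"route_region_{i}.tcl",            # loads region_<i>.def + guides,
    ]                                       # runs detailed routing, writes DEF
    with open(f"region_{i}.log", "w") as log:
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode

# The Python threads only babysit subprocesses; the heavy lifting happens in
# the OpenROAD processes, so utilization approaches PROCS * THREADS_PER_PROC cores.
with ThreadPoolExecutor(max_workers=PROCS) as pool:
    exit_codes = list(pool.map(route_region, REGIONS))

failed = [i for i, rc in zip(REGIONS, exit_codes) if rc != 0]
print("all regions routed" if not failed else f"regions failed: {failed}")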

What the wrapper controls

Wrapper Responsibility            | Router Responsibility (Unchanged)
----------------------------------|----------------------------------
Region definition and boundaries  | Detailed routing algorithms
Net assignment to regions         | Track assignment
DEF generation per region         | Via optimization
Guide filtering per region        | DRC checking
Process scheduling                | Access point generation
Result merging                    | Timing-driven decisions
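
One wrapper responsibility from the table above, guide filtering per region, is easy to sketch. The snippet below trims a whole-chip guide file down to the nets assigned to one region, assuming the plain-text guide format OpenROAD's global router emits (a net name, an opening "(", one rectangle per line, a closing ")"). The region-to-net assignment is taken as given; it comes from the partitioner.

# Illustrative guide filtering: keep only the guides for nets assigned to
# one region. Assumes the textual .guide format used by OpenROAD:
#   netname
#   (
#   x0 y0 x1 y1 layer
#   )
def filter_guides(guide_path: str, out_path: str, region_nets: set[str]) -> int:
    kept = 0
    with open(guide_path) as fin, open(out_path, "w") as fout:
        block: list[str] = []   # lines of the current net's guide block
        keep = False
        for line in fin:
            token = line.strip()
            if not token:
                continue                    # guide files normally have no blanks
            if not block:                   # a new net name starts a block
                block = [line]
                keep = token in region_nets
            elif token == ")":              # block ends: flush it if kept
                block.append(line)
                if keep:
                    fout.writelines(block)
                    kept += 1
                block, keep = [], False
            else:                           # "(" or a guide rectangle line
                block.append(line)
    return kept

The DEF side is the harder half: as the iteration history in the next section shows, components, pins, tracks, and special nets reference each other, so per-region DEF generation cannot be a simple spatial crop.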

4. The Partitioning Problem

Early experiments revealed that geometric partitioning — dividing the die area into equal rectangles — does not work. A chip's routing complexity is not uniformly distributed. Dense logic clusters in some areas. I/O and buffers spread sparsely in others.

When we split a test design into four equal quadrants, one region contained far more routing work than the others. That single region became the bottleneck, and parallel execution offered no benefit. In fact, it was slower than sequential routing due to overhead.

Parallel wall time = max(region time)

If one region takes 10× longer than the others, parallelism buys you very little. The problem is not how many regions you have. It's whether they're balanced.

The fix required complexity-aware partitioning. Regions must be balanced by routing difficulty, not by geometric area. How exactly to measure and balance that complexity is where most of the engineering effort went. The specific approach is not detailed here, but the principle is: no region should exceed a solvable threshold.
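
We do not detail the metric here, but the principle is easy to illustrate. The sketch below uses a deliberately naive cost proxy (an estimated "work" number per tile) and a standard greedy longest-processing-time heuristic; it is not our metric or our partitioner, only a demonstration that grouping tiles by work rather than by area is what bounds the parallel wall time.

# Illustration only: greedy balancing of estimated routing work across regions.
# Neither the cost proxy nor the heuristic is the method used in this article;
# the point is that wall time tracks max(region cost), so balance beats area.
import heapq

def balance(tile_costs: dict[str, float], num_regions: int) -> list[list[str]]:
    """Assign tiles to regions, greedily minimizing the heaviest region."""
    heap = [(0.0, i, []) for i in range(num_regions)]    # (load, id, tiles)
    heapq.heapify(heap)
    for tile, cost in sorted(tile_costs.items(), key=lambda kv: -kv[1]):
        load, rid, tiles = heapq.heappop(heap)           # lightest region so far
        tiles.append(tile)
        heapq.heappush(heap, (load + cost, rid, tiles))
    return [tiles for _, _, tiles in sorted(heap, key=lambda r: r[1])]

# Toy example: eight tiles with skewed work, grouped into four regions.
tiles = {"t0": 900, "t1": 850, "t2": 400, "t3": 350,
         "t4": 300, "t5": 250, "t6": 200, "t7": 150}
regions = balance(tiles, 4)
print([sum(tiles[t] for t in r) for r in regions])       # [900, 850, 850, 800]
# Note that a tile heavier than the solvable threshold cannot be fixed by
# assignment alone; the tiling itself has to be fine-grained enough.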

5. What We Encountered

The path to working parallel routing was not straightforward. Each iteration exposed a different constraint in the DEF/routing ecosystem. We share this history because it reflects the real nature of the problem.

v1–v4: Basic region extraction by spatial bounds. Failed.
  Nets reference components outside region boundaries; DEF sections are deeply interconnected.

v5–v6: Added PINS, TRACKS, SPECIALNETS sections. Failed.
  Power/ground nets connect to all cells globally and cannot be partitioned.

v7–v8: Signal-only routing with guide filtering. Failed.
  Missing access points; the router needs complete guide information for each net.

v9: Fixed guide parser, proper net filtering. Partial success.
  4/16 regions routed; small isolated regions work, large regions time out.

v10: Equal-area partitioning with 4 regions. Partial success.
  3/4 regions completed; one region (5,400+ nets) dominated runtime and timed out.

v11: Balanced partitioning. Success.
  All 10 regions routed; 5.57× speedup achieved.

Each failure taught something specific. The DEF format is more interconnected than it appears. Power and ground cannot be regionalized. Guides are mandatory and must be complete. And most importantly: equal geometry does not mean equal complexity.
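
One of those lessons, that power and ground stay global, is simple to express. The sketch below is a generic illustration rather than our net-handling logic: it separates PG nets from partitionable signal nets using standard DEF conventions, where PG routing normally lives in SPECIALNETS and nets carry "+ USE POWER" or "+ USE GROUND" markers.

# Generic illustration of the "PG stays global" lesson (not our implementation).
# Uses standard DEF conventions: PG routing usually sits in SPECIALNETS, and
# PG nets are tagged "+ USE POWER" / "+ USE GROUND".
import re

SECTION_START = re.compile(r"^\s*(SPECIALNETS|NETS)\b")
SECTION_END = re.compile(r"^\s*END\s+(SPECIALNETS|NETS)\b")
NET_START = re.compile(r"^\s*-\s+(\S+)")          # "- netname ..." opens a net
PG_USE = re.compile(r"\+\s*USE\s+(POWER|GROUND)\b")

def split_signal_and_pg(def_path: str) -> tuple[set[str], set[str]]:
    signal, pg = set(), set()
    section, current, is_pg = None, None, False
    with open(def_path) as f:
        for line in f:
            if SECTION_END.match(line):
                section = None
            elif SECTION_START.match(line):
                section = SECTION_START.match(line).group(1)
            elif section:
                started = NET_START.match(line)
                if started:
                    current = started.group(1)
                    is_pg = (section == "SPECIALNETS")
                if current and PG_USE.search(line):
                    is_pg = True
                if current and line.rstrip().endswith(";"):   # net statement ends
                    (pg if is_pg else signal).add(current)
                    current = None
    return signal, pg

# Only the signal set is handed to the partitioner; PG nets are kept whole-chip,
# which is what the v5/v6 failures above made unavoidable.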

6. What We Measured

Test design: PicoRV32, a RISC-V processor core. Small by production standards, but sufficient to validate the approach.

Design characteristics:
  • ~8,500 standard cells
  • ~7,200 routable nets
  • Technology: SkyWater 130 nm (open PDK)

We compared sequential routing (single OpenROAD process, full chip) against parallel routing (multiple OpenROAD processes, partitioned regions, merged result).

Sequential baseline:  643.7 s
Parallel execution:   115.5 s
Speedup:              5.57× (643.7 s / 115.5 s)
Parallel regions:     10

All regions completed successfully. The merged result passed basic validation. The speedup matched theoretical predictions within measurement noise.

7. What This Does Not Prove

A 5.57× speedup on an 8,500-cell design is a data point, not a conclusion. We are careful about what we claim and what remains unproven.

Open questions:
  • Does this scale to 100K cells? 1M cells?
  • How do clock nets and power grids affect partitioning?
  • What happens with timing-critical paths that span regions?
  • How do macro blocks (SRAMs, hard IPs) change the equation?
  • Is the merged result truly DRC-clean under all conditions?

Each of these is a potential failure mode at larger scale. We have not yet tested designs with significant macro content, complex clock trees, or aggressive timing constraints.

8. The Road Ahead

This experiment established that process-level parallelism can work on top of OpenROAD without modifying the router. That was the first question. Many more remain.

Feasibility: Can independent regions be routed in parallel and merged? — Answered
Robustness: Does this work across different design styles and sizes? — In progress
Global nets: How to handle clocks, resets, and high-fanout signals? — Future phase
Timing closure: Can timing constraints be preserved across region boundaries? — Future phase
Production scale: 1M+ instances with real SoC complexity? — Future work

We are at the doorstep, not the finish line. The first result is encouraging, but the harder problems — global net handling, timing closure, macro-dominated layouts — are still ahead.

9. Why We're Sharing This

Open-source EDA has made remarkable progress. OpenROAD, in particular, has lowered the barrier to chip design significantly. But production flows still rely heavily on commercial tools, partly because certain capabilities — like scalable parallel routing — aren't fully realized in open tools.

Our goal is not to replace commercial EDA. It's to understand what's possible with open tools and where the gaps are. If parallel routing can be added as a wrapper layer without modifying OpenROAD itself, that's useful knowledge for the community.

We're sharing the existence of this result, not the implementation details. The partitioning strategy, the net handling logic, the merge algorithm — those are where the engineering value lives, and we're still refining them.

10. Scaling Further: CV32E40P

After validating the approach on PicoRV32, we applied the same flow to a larger design: CV32E40P, the OpenHW Group's production-grade RISC-V core.

CV32E40P characteristics:
  • ~26,000 standard cells
  • ~16,800 routable nets
  • ~3× larger than PicoRV32

The goal here was not speed. It was to validate correctness and robustness at scale.

Design    | Cells  | Nets    | Primary Focus       | Outcome
----------|--------|---------|---------------------|-----------
PicoRV32  | ~8.5k  | ~7.2k   | Parallel speedup    | Successful
CV32E40P  | ~26k   | ~16.8k  | Signoff robustness  | Successful

We intentionally evaluated the flow on two designs with different goals: PicoRV32 to measure routing acceleration, CV32E40P to validate correctness and signoff behavior at scale. This separation avoids over-optimizing for speed alone.

Performance observations

On smaller designs such as PicoRV32, region-based parallel routing demonstrates measurable reduction in wall-clock routing time compared to sequential execution.

On larger designs such as CV32E40P, the primary benefit shifts from raw speed to predictability and completion — avoiding router timeouts commonly observed in monolithic runs on complex netlists.

11. Signoff Validation

The merged layouts were validated through standard physical signoff stages:

  • DRC: Design Rule Check
  • LVS: Layout vs Schematic
  • PEX: Parasitic Extraction
  • IR Drop: Static IR Analysis
  • GDS: Final Layout Export

All checks completed successfully on both designs. Minor DEF formatting issues encountered during parasitic extraction were unrelated to routing quality and were resolved through standard post-processing.

No routing constraints were relaxed, and no post-processing shortcuts were applied to achieve convergence.

12. What's Next

Two designs validated. The foundation is solid. But this is still early.

The next challenges:

  • Designs with significant macro content (SRAMs, hard IPs)
  • Timing-critical paths that span region boundaries
  • Clock tree routing integration
  • 100K+ cell designs

We're not claiming to have solved parallel routing. We're showing that a careful wrapper-based approach can work on real designs, pass signoff, and scale beyond toy examples.

The work continues.

Summary

OpenROAD supports multithreading. We added process-level parallelism on top.

PicoRV32: 5.57× routing speedup. CV32E40P: signoff-clean at 3× scale.

Both designs passed DRC, LVS, PEX, IR drop, and GDS export.

The foundation is in place. Next phases are underway.

#semiconductor #EDA #OpenROAD #physical-design #routing #RISC-V

Part of our ongoing EDA research. Previously: EQWAVE (RTL vs GLS comparison), VHE (GPU-accelerated simulation)