Parallel Region-Based Routing on OpenROAD
1. The Distinction That Matters
OpenROAD's TritonRoute already supports multithreading. Within a single routing job, it can use multiple CPU threads to accelerate the work. This is a known capability, documented and functional.
But multithreading and parallelism are not the same thing.
Multithreading: Multiple threads inside one process, sharing memory, accelerating a single problem.
Parallelism: Multiple processes, independent problems, results merged afterward.
Multithreading helps when you have one large problem. Parallelism helps when you can decompose a problem into independent subproblems. The two approaches are complementary, not competing.
Our question was simple: can we add process-level parallelism on top of OpenROAD's existing multithreading, without modifying the router itself?
2. Why This Is Hard
Routing is not embarrassingly parallel. Nets interact. Metal layers are shared. Design rule checks span region boundaries. A naive split-and-route approach will produce DRC violations at region boundaries, or, worse, electrically broken designs.
The academic literature describes region-based parallel routing, but papers tend to assume that regions are "well-chosen" without defining what that means. The gap between theory and working implementation is significant.
Commercial EDA tools have solved this problem internally, but the solutions are proprietary. What's published is either too abstract to implement or too incomplete to reproduce.
3. Our Approach
We treat OpenROAD as a black box. No modifications to TritonRoute. No changes to OpenROAD's internals. We control only what goes in (DEF files, routing guides) and what comes out (routed DEFs).
The architecture is a wrapper layer that orchestrates multiple independent OpenROAD processes.
Each OpenROAD process can still use multithreading internally. We spawn multiple such processes, each handling a different region. Total compute utilization becomes: (number of parallel processes) × (threads per process).
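As a rough sketch of what this orchestration looks like (the script and file names below are illustrative, not the actual wrapper), the wrapper launches one OpenROAD process per region and waits for all of them to finish:

```python
import subprocess

NUM_REGIONS = 4  # independent OpenROAD processes to launch

# Launch one OpenROAD process per region.  Each route_region_<i>.tcl
# (a hypothetical per-region script produced by the wrapper) would read
# that region's DEF and guides, set its own thread count, run
# detailed_route, and write a routed DEF for later merging.
procs = [
    subprocess.Popen(["openroad", "-exit", f"route_region_{i}.tcl"])
    for i in range(NUM_REGIONS)
]

# Wait for every regional run; stop if any router process failed.
if any(p.wait() != 0 for p in procs):
    raise SystemExit("one or more regional routing runs failed")
```

The per-region scripts themselves are part of the wrapper; the routing commands they invoke are stock OpenROAD.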
What the wrapper controls
| Wrapper Responsibility | Router Responsibility (Unchanged) |
|---|---|
| Region definition and boundaries | Detailed routing algorithms |
| Net assignment to regions | Track assignment |
| DEF generation per region | Via optimization |
| Guide filtering per region | DRC checking |
| Process scheduling | Access point generation |
| Result merging | Timing-driven decisions |
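To make one of these wrapper responsibilities concrete, here is a minimal sketch of guide filtering. It assumes nets have already been assigned to regions (the assignment logic is not shown) and relies on the plain-text guide format that OpenROAD's read_guides consumes: a net name, an opening parenthesis, one rectangle per line, and a closing parenthesis.

```python
def filter_guides(guide_in: str, guide_out: str, region_nets: set[str]) -> None:
    """Copy only the guide entries for nets assigned to one region."""
    with open(guide_in) as src, open(guide_out, "w") as dst:
        keep = False
        block: list[str] = []
        for line in src:
            token = line.strip()
            if token == "(":            # start of a net's guide block
                block.append(line)
            elif token == ")":          # end of block: emit if this net is kept
                block.append(line)
                if keep:
                    dst.writelines(block)
                block, keep = [], False
            elif not block:             # net name line
                block = [line]
                keep = token in region_nets
            else:                       # guide rectangle line
                block.append(line)
```

A call such as `filter_guides("full.guide", "region_0.guide", nets_for_region_0)` (names hypothetical) would produce the per-region guide file handed to one OpenROAD process.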
4. The Partitioning Problem
Early experiments revealed that geometric partitioning — dividing the die area into equal rectangles — does not work. A chip's routing complexity is not uniformly distributed. Dense logic clusters in some areas. I/O and buffers spread sparsely in others.
When we split a test design into four equal quadrants, one region contained far more routing work than the others. That single region became the bottleneck, and parallel execution offered no benefit. In fact, it was slower than sequential routing due to overhead.
If one region takes 10× longer than the others, parallelism provides no speedup. The problem is not how many regions you have. It's whether they're balanced.
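The arithmetic is unforgiving. Treating the sequential run as roughly the sum of the per-region workloads (an approximation, ignoring merge overhead), the achievable speedup is bounded by the slowest region. The numbers below are hypothetical:

```python
# Hypothetical per-region routing times (minutes).
balanced = [100, 100, 100, 100]
imbalanced = [40, 40, 40, 280]   # same total work, one overloaded region

for times in (balanced, imbalanced):
    speedup = sum(times) / max(times)   # the slowest region sets the pace
    print(times, f"-> max speedup ~{speedup:.1f}x")
# balanced   -> ~4.0x
# imbalanced -> ~1.4x
```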
The fix required complexity-aware partitioning. Regions must be balanced by routing difficulty, not by geometric area. How exactly to measure and balance that complexity is where most of the engineering effort went. The specific approach is not detailed here, but the principle is: no region should exceed a solvable threshold.
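For illustration only, and not the partitioner we actually use, here is one generic shape such a scheme can take: recursive bisection along the longer edge, using pin count as a crude complexity proxy and splitting at the median so both halves carry similar work. The threshold value is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold: the point past which a region is no longer
# comfortably solvable by a single router run.
MAX_PINS = 50_000

@dataclass
class Region:
    x0: float
    y0: float
    x1: float
    y1: float
    pins: list[tuple[float, float]]  # crude stand-in for routing complexity

def partition(region: Region) -> list[Region]:
    """Recursively bisect a region until its complexity proxy is below
    the threshold, cutting the longer edge at the median pin coordinate
    so both halves carry roughly equal work."""
    if len(region.pins) <= MAX_PINS:
        return [region]
    horizontal = (region.x1 - region.x0) >= (region.y1 - region.y0)
    axis = 0 if horizontal else 1
    cut = sorted(p[axis] for p in region.pins)[len(region.pins) // 2]
    low = [p for p in region.pins if p[axis] < cut]
    high = [p for p in region.pins if p[axis] >= cut]
    if not low or not high:          # degenerate split: stop recursing
        return [region]
    if horizontal:
        a = Region(region.x0, region.y0, cut, region.y1, low)
        b = Region(cut, region.y0, region.x1, region.y1, high)
    else:
        a = Region(region.x0, region.y0, region.x1, cut, low)
        b = Region(region.x0, cut, region.x1, region.y1, high)
    return partition(a) + partition(b)
```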
5. What We Encountered
The path to working parallel routing was not straightforward. Each iteration exposed a different constraint in the DEF/routing ecosystem. We share this history because it reflects the real nature of the problem.
Each failure taught something specific. The DEF format is more interconnected than it appears. Power and ground cannot be regionalized. Guides are mandatory and must be complete. And most importantly: equal geometry does not mean equal complexity.
6. What We Measured
Test design: PicoRV32, a RISC-V processor core. Small by production standards, but sufficient to validate the approach.
- ~8,500 standard cells
- ~7,200 routable nets
- Technology: SkyWater 130nm (open PDK)
We compared sequential routing (single OpenROAD process, full chip) against parallel routing (multiple OpenROAD processes, partitioned regions, merged result).
All regions completed successfully. The merged result passed basic validation. The measured 5.57× speedup matched theoretical predictions within measurement noise.
7. What This Does Not Prove
A 5.57× speedup on an 8,500-cell design is a data point, not a conclusion. We are careful about what we claim and what remains unproven.
- Does this scale to 100K cells? 1M cells?
- How do clock nets and power grids affect partitioning?
- What happens with timing-critical paths that span regions?
- How do macro blocks (SRAMs, hard IPs) change the equation?
- Is the merged result truly DRC-clean under all conditions?
Each of these is a potential failure mode at larger scale. We have not yet tested designs with significant macro content, complex clock trees, or aggressive timing constraints.
8. The Road Ahead
This experiment established that process-level parallelism can work on top of OpenROAD without modifying the router. That was the first question. Many more remain.
We are at the doorstep, not the finish line. The first result is encouraging, but the harder problems — global net handling, timing closure, macro-dominated layouts — are still ahead.
9. Why We're Sharing This
Open-source EDA has made remarkable progress. OpenROAD, in particular, has lowered the barrier to chip design significantly. But production flows still rely heavily on commercial tools, partly because certain capabilities — like scalable parallel routing — aren't fully realized in open tools.
Our goal is not to replace commercial EDA. It's to understand what's possible with open tools and where the gaps are. If parallel routing can be added as a wrapper layer without modifying OpenROAD itself, that's useful knowledge for the community.
We're sharing the existence of this result, not the implementation details. The partitioning strategy, the net handling logic, the merge algorithm — those are where the engineering value lives, and we're still refining them.
10. Scaling Further: CV32E40P
After validating the approach on PicoRV32, we applied the same flow to a larger design: CV32E40P, the OpenHW Group's production-grade RISC-V core.
- ~26,000 standard cells
- ~16,800 routable nets
- ~3× larger than PicoRV32
The goal here was not speed. It was to validate correctness and robustness at scale.
| Design | Cells | Nets | Primary Focus | Outcome |
|---|---|---|---|---|
| PicoRV32 | ~8.5k | ~7.2k | Parallel speedup | Successful |
| CV32E40P | ~26k | ~16.8k | Signoff robustness | Successful |
We intentionally evaluated the flow on two designs with different goals: PicoRV32 to measure routing acceleration, CV32E40P to validate correctness and signoff behavior at scale. This separation avoids over-optimizing for speed alone.
Performance observations
On smaller designs such as PicoRV32, region-based parallel routing demonstrates measurable reduction in wall-clock routing time compared to sequential execution.
On larger designs such as CV32E40P, the primary benefit shifts from raw speed to predictability and completion — avoiding router timeouts commonly observed in monolithic runs on complex netlists.
11. Signoff Validation
The merged layouts were validated through standard physical signoff stages: DRC, LVS, parasitic extraction (PEX), IR drop analysis, and GDS export.
All checks completed successfully on both designs. Minor DEF formatting issues encountered during parasitic extraction were unrelated to routing quality and were resolved through standard post-processing.
12. What's Next
Two designs validated. The foundation is solid. But this is still early.
The next challenges:
- Designs with significant macro content (SRAMs, hard IPs)
- Timing-critical paths that span region boundaries
- Clock tree routing integration
- 100K+ cell designs
We're not claiming to have solved parallel routing. We're showing that a careful wrapper-based approach can work on real designs, pass signoff, and scale beyond toy examples.
The work continues.
Summary
OpenROAD supports multithreading. We added process-level parallelism on top.
PicoRV32: 5.57× routing speedup. CV32E40P: signoff-clean at 3× scale.
Both designs passed DRC, LVS, PEX, IR drop, and GDS export.
The foundation is in place. Next phases are underway.
Part of our ongoing EDA research. Previously: EQWAVE (RTL vs GLS comparison), VHE (GPU-accelerated simulation)