
NVIDIA Research introduced HORIZON, a hands-free agent framework for hardware design. It treats hardware design as repository-level code evolution. The research team exercises the framework's register-transfer level (RTL) instantiation. A structured Markdown harness becomes a project pack. A self-contained agent loop then evolves an isolated git worktree. It commits a version only when an executable acceptance gate passes.
The research team reports 100% completion across every evaluated RTL benchmark suite. It also states plainly that agentic hardware design is not solved.
What is HORIZON?
Single-turn code generation has a clear limit on executable design tasks. Plausible Verilog is not enough for real hardware. Correctness depends on cycle-level behavior, reset conventions, bit widths, and simulator feedback.
HORIZON hosts each design problem as a version-controlled repository, not a one-shot prompt. The only required input is a structured Markdown harness. That harness carries four components: a goal, domain-knowledge directions, an evaluator specification, and an acceptance predicate.
A bootstrap agent compiles the harness into a project pack. The research team writes this as p = (πagent, Ep, Ap, Γp, Ωp). Those terms cover the agent policy, the executable evaluator, and the acceptance predicate. They also cover the version-control policy and the domain skills.
For RTL, the evaluator Ep may include compilation, simulation, coverage extraction, and assertion or testbench checks. In other domains, that same slot could hold unit tests, theorem provers, profilers, or synthesis tools. Problems are therefore defined over git worktrees, not over a fixed repository type.
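The harness and project pack described above can be pictured as two small records. The sketch below is only an illustration under stated assumptions: the class names, field names, and the `bootstrap` function are hypothetical and are not HORIZON's actual schema, and the evaluator is a stub that returns fixed evidence instead of compiling or simulating anything.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical evidence record: metric name -> value.
Evidence = Dict[str, float]


@dataclass(frozen=True)
class Harness:
    """The four harness components named in the article (field names assumed)."""
    goal: str
    directions: List[str]      # domain-knowledge directions
    evaluator_spec: str        # how candidates are to be evaluated
    acceptance: str            # the acceptance predicate, stated in prose


@dataclass(frozen=True)
class ProjectPack:
    """p = (pi_agent, E_p, A_p, Gamma_p, Omega_p), in the article's order."""
    agent_policy: str                       # pi_agent
    evaluator: Callable[[str], Evidence]    # E_p: worktree path -> evidence
    accept: Callable[[Evidence], bool]      # A_p: evidence -> commit or not
    vcs_policy: str                         # Gamma_p
    skills: List[str]                       # Omega_p


def bootstrap(h: Harness) -> ProjectPack:
    """Stand-in for the bootstrap agent; wires a stub evaluator and gate."""

    def evaluator(worktree: str) -> Evidence:
        # Assumption: a real E_p would compile and simulate the worktree.
        return {"compiled": 1.0, "tests_passed": 0.0}

    def accept(y: Evidence) -> bool:
        return y.get("compiled") == 1.0 and y.get("tests_passed") == 1.0

    return ProjectPack("fixed-backbone", evaluator, accept,
                       "commit-on-pass", list(h.directions))
```

Keeping the evaluator and the acceptance predicate as separate callables mirrors the article's point that the same slot could hold unit tests, theorem provers, or synthesis tools in other domains.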

How the Repository-Level Loop Works
After bootstrap, the loop runs without further human input. Each cycle plans a target, edits the worktree, invokes tools, and runs the evaluator. The acceptance predicate then decides one thing: commit the new version, or log the failure.
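One way to read the cycle described above is as a plain propose, evaluate, gate loop. This is a minimal sketch, not the authors' implementation: `run_loop`, `propose`, `evaluate`, and `accept` are hypothetical names, and candidates are represented as strings rather than real worktree edits.

```python
from typing import Callable, Dict, List, Tuple

Evidence = Dict[str, float]


def run_loop(
    propose: Callable[[int, List[str]], str],
    evaluate: Callable[[str], Evidence],
    accept: Callable[[Evidence], bool],
    max_iters: int,
) -> Tuple[List[str], List[str]]:
    """Run up to max_iters cycles; return (committed, rejected) candidates."""
    committed: List[str] = []
    rejected: List[str] = []
    for it in range(max_iters):
        candidate = propose(it, committed)   # plan a target and edit
        evidence = evaluate(candidate)       # run the evaluator
        if accept(evidence):                 # acceptance predicate decides
            committed.append(candidate)      # accepted checkpoint
        else:
            rejected.append(candidate)       # logged failure
    return committed, rejected
```

For example, with a toy evaluator that scores candidate `v{n}` as `n` and a gate that requires a score of at least 2, four iterations leave `v2` and `v3` committed and `v0` and `v1` in the rejection log.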
Git is the substrate here, not incidental bookkeeping. Diffs expose proposed state changes. Commits define accepted checkpoints. Notes attach evaluator evidence. The log recovers the full trajectory.
The loop leans on native git commands to keep tracing cheap. Staged edits are inspected with git diff --cached. Each accepted attempt becomes a git commit whose notes carry the verdict and reward. Successful commits become positive repair examples. Rejected attempts are logged as negative examples. The repository history is the experience buffer, not a separate datastore.
The research team borrows semi-Markov decision process vocabulary for one narrow purpose. It names the recorded objects, nothing more. A "state" is a versioned snapshot of the repository. An "option" is one episode between two checkpoints. HORIZON does not train or update an RL policy in this work. The agent backbone stays fixed throughout a campaign.
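The git commands involved are standard ones. The sketch below shows how they might be assembled from Python; the helper names and the `verdict=... reward=...` note format are assumptions made for illustration, since the article does not specify how HORIZON formats its notes.

```python
import subprocess
from typing import List


def inspect_staged_cmd() -> List[str]:
    """Command that shows the staged (proposed) state change."""
    return ["git", "diff", "--cached"]


def commit_cmds(message: str, verdict: str, reward: float) -> List[List[str]]:
    """Commands for an accepted attempt: commit, then attach a note to HEAD."""
    note = f"verdict={verdict} reward={reward:.3f}"  # hypothetical note format
    return [
        ["git", "commit", "-m", message],
        ["git", "notes", "add", "-m", note, "HEAD"],
    ]


def trajectory_cmd() -> List[str]:
    """Command that lists commits together with their attached notes."""
    return ["git", "log", "--notes"]


def run_all(cmds: List[List[str]], cwd: str) -> None:
    """Execute the commands inside an existing worktree; raises on failure."""
    for cmd in cmds:
        subprocess.run(cmd, cwd=cwd, check=True)
```

Building the argument lists separately from running them keeps the sketch testable without a repository; `run_all(commit_cmds("fix reset polarity", "pass", 1.0), cwd=worktree)` would then apply them to a real worktree.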
Session reuse keeps cost down. HORIZON holds a persistent model session across iterations. The harness, project pack, and stable sources are served from the provider’s prompt cache. Newly billed tokens are then dominated by the current diff and the latest evaluator output.
Where HORIZON Sits Among Self-Evolving Systems
HORIZON extends a lineage of repository-scale self-evolution. Earlier systems evolved the software that engineers run. HORIZON instead evolves the hardware artifacts that engineers create.
| System | Object evolved | Domain | Evaluation signal |
|---|---|---|---|
| AlphaEvolve (2025) | Algorithmic kernels | Scientific and algorithmic discovery | Automated evaluators |
| SATLUTION (2025) | Full SAT-solver repositories | SAT solving | Distributed correctness and runtime |
| ABCEvo (2026) | ABC logic-synthesis system | EDA software | Correctness and QoR |
| HORIZON (this work) | RTL sources, testbenches, verification artifacts | Hardware design | Compile, simulate, coverage, assertion checks |
All four share one principle. A candidate change is admitted only when executable evidence supports it.
Benchmark Results
The backbone is GPT-5.3, fixed for all experiments. Every result uses single-agent, hands-free mode. Campaigns ran on an AMD EPYC 9334 32-core host with 512 GB of RAM.
The evaluation spans ChipBench, RTLLM-2.0, and Verilog-Eval. It adds nine CVDP code- and verification-generation categories, CID 002 to 016. CVDP contains 783 human-authored problems across 13 task categories (Pinckney et al., 2025).
An iteration is one automated outer step. The agent edits the worktree, runs the evaluator, then commits a pass or logs a rejection. HORIZON reaches a 100% pass rate on every suite. The research team attributes the one residual miss to a ChipBench specification-harness defect, not an agent failure.
The aggregate first-iteration pass rate is 47.8%. Iteration-0 is not a standalone Pass@1 measurement. It is the repository state after the first agent iteration. The agent may defer debugging and repair to later iterations by design.
| Suite / category | Focus | Iter. 0 | Conv. iter. | HORIZON |
|---|---|---|---|---|
| ChipBench | Mixed RTL generation | 20.0 | 5 | 100.0 |
| RTLLM-2.0 | NL spec to RTL | 78.0 | 2 | 100.0 |
| Verilog-Eval-v2 | HDLBits-style Verilog | 86.2 | 2 | 100.0 |
| CVDP CID 002 | RTL code completion | 3.2 | 82 | 100.0 |
| CVDP CID 003 | NL spec to RTL | 19.2 | 24 | 100.0 |
| CVDP CID 004 | RTL code modification | 10.9 | 36 | 100.0 |
| CVDP CID 005 | Spec-to-RTL module reuse | 9.1 | 14 | 100.0 |
| CVDP CID 007 | Linting / QoR improvement | 0.0 | 24 | 100.0 |
| CVDP CID 012 | Test-plan to stimulus generation | 47.8 | 32 | 100.0 |
| CVDP CID 013 | Test-plan to checker generation | 3.8 | 19 | 100.0 |
| CVDP CID 014 | Test-plan to assertion generation | 79.1 | 1 | 100.0 |
| CVDP CID 016 | Debugging and bug fixing | 25.7 | 13 | 100.0 |
Convergence difficulty varies widely across categories. RTLLM-2.0 and Verilog-Eval reach 100% within two iterations. Checker generation (CID 013) starts at just 3.8%. Yet it climbs steadily to 100% by iteration 19, with almost no plateau. Code completion (CID 002) needs 82 iterations. Its long tail is the single largest token cost.
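The spread in the table can be checked with a few lines of code. The snippet below simply re-enters the iteration-0 pass rates and convergence iterations from the table above and ranks the categories; the variable names are arbitrary, and no values beyond the table are assumed.

```python
# (suite or category, iteration-0 pass rate in %, convergence iteration)
ROWS = [
    ("ChipBench", 20.0, 5),
    ("RTLLM-2.0", 78.0, 2),
    ("Verilog-Eval-v2", 86.2, 2),
    ("CVDP CID 002", 3.2, 82),
    ("CVDP CID 003", 19.2, 24),
    ("CVDP CID 004", 10.9, 36),
    ("CVDP CID 005", 9.1, 14),
    ("CVDP CID 007", 0.0, 24),
    ("CVDP CID 012", 47.8, 32),
    ("CVDP CID 013", 3.8, 19),
    ("CVDP CID 014", 79.1, 1),
    ("CVDP CID 016", 25.7, 13),
]

# Category that needs the most iterations, and the one that needs the fewest.
slowest = max(ROWS, key=lambda r: r[2])
fastest = min(ROWS, key=lambda r: r[2])

# Everything that converges within two iterations.
quick = [name for name, _, conv in ROWS if conv <= 2]

# A high iteration-0 rate does not guarantee fast convergence:
# list categories that start above 40% yet need more than 10 iterations.
slow_starters = [name for name, i0, conv in ROWS if i0 > 40.0 and conv > 10]

print(slowest[0], slowest[2])
print(fastest[0], fastest[2])
print(quick)
print(slow_starters)
```

The last filter picks out CID 012, which starts at 47.8% but still needs 32 iterations, a reminder that the iteration-0 rate and the convergence iteration measure different things.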






