rForge
Enterprise software platforms don’t have 12 tools. A mid-sized SaaS company has 50+, and each one comes with a schema the AI agent has to read before it can use it. Current agents load all of them upfront, which means most of the context budget is gone before the conversation has started. The agents make worse choices. You add more instructions to compensate. It doesn’t fix the problem.
I built rForge to try a different approach: instead of loading the full catalogue, the agent asks for what it needs. A search layer returns the 3-5 relevant tools. The agent writes a short program to do the task. A sandboxed environment runs it. The catalogue never enters context.
Code is available on request — code.jyxtn.dev/rforge-repo-portfolio.
Design
Objective: Demonstrate that a code-agent pattern can handle large enterprise tool surfaces without degrading — and measure the difference against the standard approach with an honest, same-surface benchmark.
Design factors: Enterprise deployment drove two non-negotiable constraints. First, managed sandboxes (E2B, Modal) were rejected: enterprise environments carry data residency and self-hosting requirements, and a POC that assumes managed cloud isolation isn’t demonstrating the right thing. Second, LLM-generated code running against live enterprise APIs carries a concrete threat model: token exfiltration, unauthorized calls, and state persisting outside the execution boundary. The isolation had to be genuine, not gestured at.
Technology evaluations and selections:
Go for the harness: goroutine concurrency maps cleanly to concurrent request handling. Type safety across five independently deployable services matters when failure modes include security-relevant bugs.
Temporal over a job queue: human-in-the-loop (HITL) suspension that survives process restarts isn’t achievable with a simple queue. Temporal’s activity model gives durable workflow state without custom persistence logic. This was a requirement, not a nice-to-have.
MOSS semantic retrieval over keyword/BM25 search: keyword matching fails on paraphrases and synonyms, while embedding-based retrieval matches on meaning rather than surface tokens. P50 < 10ms keeps tool discovery off the critical path.
Wasm inside Firecracker over Python in Docker: evaluated on two axes. Isolation: a hardware-level boundary between requests (Firecracker) plus a capability-level boundary within a request (Wasm). Both layers are independently required. A Wasm exploit inside one microVM can’t reach another only because of the hardware boundary, and the hardware boundary alone doesn’t constrain what code can do inside it. Economics: a 12MB image versus 150MB, roughly a 10x warm pool difference at any meaningful scale.
LangGraph comparator: same tool surface, same queries, same observability. A straw man comparison would be worse than no comparison.
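To make the retrieval choice concrete, here is a minimal sketch of top-k tool discovery by cosine similarity. It is not MOSS and does not use its API: the tool names, the three-dimensional vectors, and the `top_k_tools` function are hypothetical stand-ins, and a real system would get its vectors from an embedding model.

```python
import math

# Hypothetical toy catalogue: tool name -> hand-written embedding vector.
# These values are illustrative only, not output from any real model.
TOOL_VECTORS = {
    "create_invoice": [0.9, 0.1, 0.0],
    "refund_payment": [0.8, 0.2, 0.1],
    "list_deployments": [0.0, 0.9, 0.2],
    "rotate_api_key": [0.1, 0.2, 0.9],
    "query_logs": [0.1, 0.8, 0.3],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_tools(query_vector, catalogue, k=3):
    """Return the k tool names whose vectors are closest to the query."""
    ranked = sorted(
        catalogue,
        key=lambda name: cosine(query_vector, catalogue[name]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # A query vector leaning toward the "billing" direction of the toy space.
    print(top_k_tools([1.0, 0.0, 0.0], TOOL_VECTORS, k=2))
```

Only the names returned here would have their schemas loaded into the agent's context; the rest of the catalogue stays in the registry.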
graph TB
subgraph CLIENT["Client"]
U[User / Teams]
end
subgraph HARNESS["Go Harness — cmd/harness :8080"]
direction TB
API[HTTP API]
TW[Temporal Worker]
subgraph SAFETY["Layer 1 — Safety Membrane"]
LG[LLMGuard\ninjection · PII · toxicity]
CQ[Canonical Query LLM\nRBAC · language · intent]
end
subgraph AGENT["Layer 2 — Code Agent"]
DISC[discover_tools\nMOSS semantic retrieval]
GEN[Code Generation LLM\nprogram synthesis]
end
subgraph VALIDATOR["Layer 3 — Security Validator"]
AST[AST Walker\nallowlist enforcement]
end
subgraph WORKFLOW["Layer 5 — Temporal Workflow"]
WF[RequestWorkflow\ndurable execution state]
ACT[Activities\nSafety · Discovery · Codegen · Validate · Execute]
end
end
subgraph SANDBOX["Layer 4 — Sandbox"]
DOCKER[Docker Container\nno-new-privileges · read-only]
SDK[tool_sdk\nstdlib only · urllib]
end
subgraph INFRA["Infrastructure"]
subgraph REGISTRY["Tool Registry :8082"]
REG[Registry Service\nSQLite]
MOSS[MOSS\nsemantic index\nP50 < 10ms]
end
subgraph MCP["Mock MCP Server :8081"]
MCP3[3-Tool Surface]
MCP50[50-Tool Surface\nPlatform · Observability\nSales/GTM · Finance · Identity]
end
LF[Langfuse\nLLM observability · all layers]
TEMP[Temporal Server\ndurable workflow state · HITL]
end
subgraph COMPARATOR["LangGraph Comparator :8085"]
LGC[FastAPI]
LGG[LangGraph Fixed Graph\n3-node DAG]
LGT[Tool Wrappers]
end
U -->|HTTP POST /execute| API
API --> TW
TW --> WF
WF --> ACT
ACT -->|1. scan| LG
LG -->|pass| CQ
CQ -->|canonical intent| ACT
ACT -->|2. semantic lookup| MOSS
MOSS -->|3-5 tool specs| DISC
DISC --> GEN
GEN -->|program| ACT
ACT -->|3. validate| AST
AST -->|approved| ACT
ACT -->|4. inject & run| DOCKER
DOCKER --> SDK
SDK -->|signed token call| MCP3
SDK -->|signed token call| MCP50
MCP3 --> REG
MCP50 --> REG
REG <--> MOSS
ACT -.->|spans| LF
DOCKER -.->|spans| LF
WF <-->|durable state| TEMP
U -->|HTTP POST /compare| LGC
LGC --> LGG
LGG --> LGT
LGT -->|same MCP surface| MCP3
LGC -.->|spans| LF
Implementation
Context window usage in a standard tool-calling loop scales linearly with the number of tool descriptions in context. Hierarchical routing and summarized descriptions reduce the coefficient; they don’t change the relationship. MOSS semantic retrieval removes the dependence on catalogue size: it queries the tool registry with a natural language intent string and returns 3-5 specs, with P50 under 10ms. The code generation LLM sees only those specs and produces a 20-50 line Python program. An AST walker validates the program against an explicit allowlist before execution. The goal is not policy enforcement over a capable program but the absence of any capability outside the approved surface.
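The allowlist check can be sketched with Python's standard `ast` module. This is an illustration of the technique, not the project's validator: the allowlist contents, the `tool_sdk.call` usage in the example, and the specific rules (top-level import names, bare-name calls, dunder attributes) are assumptions chosen to keep the sketch short.

```python
import ast

# Hypothetical allowlists; the real approved surface is not shown in this writeup.
ALLOWED_IMPORTS = {"tool_sdk", "json"}
ALLOWED_CALLS = {"print", "len", "range", "str", "int", "dict", "list"}


def validate(source):
    """Return a list of violations; an empty list means the program is approved."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if node.level or root not in ALLOWED_IMPORTS:
                violations.append(f"from {node.module} import")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Bare-name calls must be allowlisted (this also rejects eval,
            # open, __import__, and any helper the program defines itself).
            if node.func.id not in ALLOWED_CALLS:
                violations.append(f"call {node.func.id}()")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            # Dunder attribute access is a common sandbox-escape route.
            violations.append(f"dunder attribute {node.attr}")
    return violations


if __name__ == "__main__":
    program = "import tool_sdk\nrows = tool_sdk.call('list_invoices', {'limit': 5})\nprint(len(rows))\n"
    print(validate(program))
    print(validate("import socket\nopen('/etc/passwd')\n"))
```

The sketch rejects by default: anything it does not recognise as approved is reported, so a new construct has to be added to the allowlist before it can run.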
The sandbox architecture changed mid-build. Python in a stripped Docker container was the starting point. The execution target is now Wasm inside Firecracker microVMs. The .wasm module arrives via stdin at execution time. Import table verification against an allowlist runs before wasmtime sees the bytes. A .wasm module has no system call instruction; it can reach the host only through functions declared in its import section. Exfiltration via an undeclared socket isn’t blocked at the policy layer; it’s impossible at the instruction level. The microVM handles inter-request isolation; the Wasm capability model constrains what the program can do inside it.
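Import table verification can be done on the raw bytes, since the WebAssembly binary format stores imports in a dedicated section (id 2). The following is a minimal sketch, assuming a core Wasm 1.0 module; the function names and the allowlist entries are hypothetical, and this is not the project's verifier.

```python
WASM_HEADER = b"\x00asm\x01\x00\x00\x00"  # magic bytes + version 1
IMPORT_SECTION_ID = 2


def read_u32(data, pos):
    """Decode an unsigned LEB128 integer; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_name(data, pos):
    """Read a length-prefixed UTF-8 name; return (name, next_pos)."""
    length, pos = read_u32(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length


def skip_limits(data, pos):
    flag = data[pos]
    if flag not in (0, 1):
        raise ValueError("unsupported limits flag")
    _, pos = read_u32(data, pos + 1)      # minimum
    if flag == 1:
        _, pos = read_u32(data, pos)      # maximum
    return pos


def skip_import_desc(data, pos):
    kind = data[pos]
    pos += 1
    if kind == 0:                         # function: type index
        _, pos = read_u32(data, pos)
    elif kind == 1:                       # table: reftype byte + limits
        pos = skip_limits(data, pos + 1)
    elif kind == 2:                       # memory: limits
        pos = skip_limits(data, pos)
    elif kind == 3:                       # global: valtype byte + mutability byte
        pos += 2
    else:
        raise ValueError(f"unsupported import kind {kind}")
    return pos


def list_imports(module_bytes):
    """Return every (module, field) pair declared in the import section."""
    if module_bytes[:8] != WASM_HEADER:
        raise ValueError("not a Wasm 1.0 binary")
    pos, imports = 8, []
    while pos < len(module_bytes):
        section_id = module_bytes[pos]
        size, pos = read_u32(module_bytes, pos + 1)
        end = pos + size
        if end > len(module_bytes):
            raise ValueError("truncated section")
        if section_id == IMPORT_SECTION_ID:
            count, pos = read_u32(module_bytes, pos)
            for _ in range(count):
                module, pos = read_name(module_bytes, pos)
                field, pos = read_name(module_bytes, pos)
                pos = skip_import_desc(module_bytes, pos)
                imports.append((module, field))
            if pos != end:
                raise ValueError("malformed import section")
        pos = end
    return imports


def verify_imports(module_bytes, allowlist):
    """Return imports not on the allowlist; an empty list means approved."""
    return [imp for imp in list_imports(module_bytes) if imp not in allowlist]
```

Any parse error should be treated as a rejection, so a malformed or unfamiliar module fails closed before it reaches the runtime.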
Compiled Rust binary in a scratch container was the other option for the inner layer. Rejected: .wasm is architecture-independent and the import table is statically auditable before execution. One compile step works for both Apple Silicon dev and x86-64 Firecracker production.
15 ADRs. All ACCEPTED FOR POC.
Repo: access granted on request.