rForge

Enterprise software platforms don’t have 12 tools. A mid-sized SaaS company has 50+, and each one comes with a schema the AI agent has to read before it can use it. Current agents load all of them upfront, which means most of the context budget is gone before the conversation has started. The agents make worse choices. You add more instructions to compensate. It doesn’t fix the problem.

I built rForge to try a different approach: instead of loading the full catalogue, the agent asks for what it needs. A search layer returns the 3-5 relevant tools. The agent writes a short program to do the task. A sandboxed environment runs it. The catalogue never enters context.
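To make the context cost concrete, here is a back-of-the-envelope sketch. The per-schema token count and the context budget are assumed round numbers for illustration, not measurements from rForge.

```python
# Hypothetical figures for illustration only; not measurements from rForge.
AVG_SCHEMA_TOKENS = 400   # assumed average size of one tool schema
CONTEXT_BUDGET = 40_000   # assumed context window available to the agent


def schema_tokens(tools_in_context: int, avg: int = AVG_SCHEMA_TOKENS) -> int:
    """Tokens spent on tool schemas before the conversation starts."""
    return tools_in_context * avg


def budget_share(tools_in_context: int, budget: int = CONTEXT_BUDGET) -> float:
    """Fraction of the context budget consumed by tool schemas."""
    return schema_tokens(tools_in_context) / budget


if __name__ == "__main__":
    for label, n in (("full catalogue", 50), ("retrieved subset", 5)):
        print(f"{label}: {schema_tokens(n)} tokens, {budget_share(n):.0%} of budget")
```

Under these assumptions the full 50-tool catalogue costs ten times what a 5-tool retrieved subset does, and the gap widens as the catalogue grows while the subset stays fixed.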

Code is available on request — code.jyxtn.dev/rforge-repo-portfolio.

Design

Objective: Demonstrate that a code-agent pattern can handle large enterprise tool surfaces without degrading — and measure the difference against the standard approach with an honest, same-surface benchmark.

Design factors: Enterprise deployment drove two non-negotiable constraints. First, managed sandboxes (E2B, Modal) were rejected — data residency and self-hosting requirements are real in enterprise environments; a POC that assumes managed cloud isolation isn’t demonstrating the right thing. Second, the threat model for LLM-generated code running against live enterprise APIs is real: token exfiltration, unauthorized calls, state persistence outside the execution boundary. The isolation had to be genuine, not gestured at.

Technology evaluations and selections:

Go for the harness — goroutine concurrency maps cleanly to concurrent request handling. Type safety across five independently deployable services matters when failure modes include security-relevant bugs.

Temporal over a job queue — HITL suspension that survives process restarts isn’t achievable with a simple queue. Temporal’s activity model gives durable workflow state without custom persistence logic. This was a requirement, not a nice-to-have.

MOSS semantic retrieval over keyword/BM25 search — keyword matching fails on paraphrases and synonyms. Embedding-based retrieval understands intent. P50 < 10ms keeps tool discovery off the critical path.

Wasm inside Firecracker over Python in Docker: evaluated on two axes. Isolation: hardware-level inter-request boundary (Firecracker) plus capability-level intra-request boundary (Wasm). Both layers are independently required. A Wasm exploit inside one microVM can’t reach another only because of the hardware boundary, and the hardware boundary alone doesn’t constrain what code can do inside it. Economics: 12MB image vs 150MB, roughly a 12x smaller image and about an order-of-magnitude warm pool difference at any meaningful scale.

LangGraph comparator — same tool surface, same queries, same observability. A straw man comparison would be worse than no comparison.
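The retrieval choice above (semantic over keyword) can be illustrated with a toy example. This is a minimal sketch, not MOSS itself: the tool names, the three-dimensional vectors, and the query vector are all hypothetical stand-ins for what an embedding model would produce.

```python
import math

# Hypothetical 3-dimensional "embeddings". A real system would get these from
# an embedding model. The axes loosely mean (incidents, billing, identity).
TOOL_VECTORS = {
    "list_incidents": (0.9, 0.1, 0.0),
    "create_invoice": (0.0, 0.95, 0.05),
    "reset_password": (0.05, 0.0, 0.9),
    "get_oncall_schedule": (0.8, 0.0, 0.2),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, k=3):
    """Return the k tool names whose vectors are closest to the query."""
    ranked = sorted(
        TOOL_VECTORS,
        key=lambda name: cosine(query_vec, TOOL_VECTORS[name]),
        reverse=True,
    )
    return ranked[:k]


def keyword_hits(query, names):
    """Naive keyword match: a tool hits if it shares a word with the query."""
    words = set(query.lower().split())
    return [name for name in names if words & set(name.split("_"))]


if __name__ == "__main__":
    query = "who got paged last night"
    query_vec = (0.85, 0.0, 0.15)  # assumed embedding of the query above
    print("keyword:", keyword_hits(query, TOOL_VECTORS))
    print("semantic:", top_k(query_vec, k=2))
```

The query shares no words with any tool name, so the keyword match returns nothing, while the vector ranking still surfaces the on-call and incident tools.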

graph TB
    subgraph CLIENT["Client"]
        U[User / Teams]
    end

    subgraph HARNESS["Go Harness — cmd/harness :8080"]
        direction TB
        API[HTTP API]
        TW[Temporal Worker]

        subgraph SAFETY["Layer 1 — Safety Membrane"]
            LG[LLMGuard\ninjection · PII · toxicity]
            CQ[Canonical Query LLM\nRBAC · language · intent]
        end

        subgraph AGENT["Layer 2 — Code Agent"]
            DISC[discover_tools\nMOSS semantic retrieval]
            GEN[Code Generation LLM\nprogram synthesis]
        end

        subgraph VALIDATOR["Layer 3 — Security Validator"]
            AST[AST Walker\nallowlist enforcement]
        end

        subgraph WORKFLOW["Layer 5 — Temporal Workflow"]
            WF[RequestWorkflow\ndurable execution state]
            ACT[Activities\nSafety · Discovery · Codegen · Validate · Execute]
        end
    end

    subgraph SANDBOX["Layer 4 — Sandbox"]
        DOCKER[Docker Container\nno-new-privileges · read-only]
        SDK[tool_sdk\nstdlib only · urllib]
    end

    subgraph INFRA["Infrastructure"]
        subgraph REGISTRY["Tool Registry :8082"]
            REG[Registry Service\nSQLite]
            MOSS[MOSS\nsemantic index\nP50 < 10ms]
        end
        subgraph MCP["Mock MCP Server :8081"]
            MCP3[3-Tool Surface]
            MCP50[50-Tool Surface\nPlatform · Observability\nSales/GTM · Finance · Identity]
        end
        LF[Langfuse\nLLM observability · all layers]
        TEMP[Temporal Server\ndurable workflow state · HITL]
    end

    subgraph COMPARATOR["LangGraph Comparator :8085"]
        LGC[FastAPI]
        LGG[LangGraph Fixed Graph\n3-node DAG]
        LGT[Tool Wrappers]
    end

    U -->|HTTP POST /execute| API
    API --> TW
    TW --> WF
    WF --> ACT
    ACT -->|1. scan| LG
    LG -->|pass| CQ
    CQ -->|canonical intent| ACT
    ACT -->|2. semantic lookup| MOSS
    MOSS -->|3-5 tool specs| DISC
    DISC --> GEN
    GEN -->|program| ACT
    ACT -->|3. validate| AST
    AST -->|approved| ACT
    ACT -->|4. inject & run| DOCKER
    DOCKER --> SDK
    SDK -->|signed token call| MCP3
    SDK -->|signed token call| MCP50
    MCP3 --> REG
    MCP50 --> REG
    REG <--> MOSS
    ACT -.->|spans| LF
    DOCKER -.->|spans| LF
    WF <-->|durable state| TEMP
    U -->|HTTP POST /compare| LGC
    LGC --> LGG
    LGG --> LGT
    LGT -->|same MCP surface| MCP3
    LGC -.->|spans| LF

Implementation

Context window usage in a standard tool-calling loop scales linearly with the number of tool descriptions in context. Hierarchical routing and summarized descriptions reduce the coefficient; they don’t change the relationship. MOSS semantic retrieval breaks it: the agent queries the tool registry with a natural-language intent string and gets back 3-5 specs, with P50 under 10ms. The code generation LLM sees only those specs and produces a 20-50 line Python program. An AST walker validates the program against an explicit allowlist before execution. This is not policy enforcement layered over a capable runtime; the program has no capability outside the approved surface.
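The allowlist idea can be sketched with Python's standard ast module. This is a minimal illustration under stated assumptions, not the rForge validator: the allowlist contents and the tool_sdk.call method name are hypothetical, and a production walker would need to cover far more node types.

```python
import ast

# Hypothetical allowlist: the only modules and bare-name calls a generated
# program may use. Anything not listed is rejected.
ALLOWED_IMPORTS = {"tool_sdk", "json"}
ALLOWED_CALLS = {"print", "len", "range", "str", "int", "sorted"}


def violations(source: str) -> list[str]:
    """Return the reasons a program falls outside the allowlist (empty if none)."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) are rejected outright.
            if node.level or (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"from {node.module} import ...")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Bare-name calls such as eval(...) must be explicitly approved.
            if node.func.id not in ALLOWED_CALLS:
                problems.append(f"call {node.func.id}()")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            # Dunder attribute access is a common sandbox escape route.
            problems.append(f"dunder attribute {node.attr}")
    return problems


if __name__ == "__main__":
    ok = "import tool_sdk\nrows = tool_sdk.call('list_incidents')\nprint(len(rows))\n"
    bad = "import socket\ns = socket.socket()\neval('1')\n"
    print(violations(ok))
    print(violations(bad))
```

The check is default-deny: a construct passes only if it is on the list, so a new escape technique that uses an unlisted import or builtin fails without anyone having to anticipate it. One limitation of this sketch is that it also rejects calls to helper functions the program defines itself.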

The sandbox architecture changed mid-build. Python in a stripped Docker container was the starting point. The execution target is now Wasm inside Firecracker microVMs. The .wasm module arrives via stdin at execution time. Import table verification against an allowlist runs before wasmtime sees the bytes. A .wasm module can reach the host only through functions declared in its import section, so it cannot make a system call that isn’t declared there. Exfiltration via an undeclared socket isn’t blocked at the policy layer; it’s impossible at the instruction level. The microVM handles inter-request isolation; the Wasm capability model constrains what the program can do inside it.
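Import table verification can be sketched by reading the import section straight out of the binary. This is a hand-rolled parser for illustration only, assuming a version-1 core Wasm module; the allowlist entries are hypothetical, and the actual harness may use a proper Wasm parsing library instead.

```python
# Hypothetical allowlist of (module, field) imports a generated program may declare.
ALLOWED_IMPORTS = {
    ("wasi_snapshot_preview1", "fd_write"),
    ("wasi_snapshot_preview1", "proc_exit"),
}


def read_u32(buf: bytes, pos: int):
    """Decode an unsigned LEB128 integer; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_name(buf: bytes, pos: int):
    """Decode a length-prefixed UTF-8 name; return (name, next_pos)."""
    length, pos = read_u32(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def skip_limits(buf: bytes, pos: int) -> int:
    """Skip a limits record: flag byte, min, and max if the flag says so."""
    flag = buf[pos]
    _, pos = read_u32(buf, pos + 1)
    if flag & 0x01:
        _, pos = read_u32(buf, pos)
    return pos


def list_imports(module: bytes):
    """Return every (module, field) pair declared in the import section."""
    if module[:8] != b"\x00asm\x01\x00\x00\x00":
        raise ValueError("not a version-1 wasm module")
    pos, imports = 8, []
    while pos < len(module):
        section_id = module[pos]
        size, pos = read_u32(module, pos + 1)
        end = pos + size
        if section_id == 2:  # import section
            count, pos = read_u32(module, pos)
            for _ in range(count):
                mod, pos = read_name(module, pos)
                field, pos = read_name(module, pos)
                kind = module[pos]
                pos += 1
                if kind == 0x00:      # function: type index
                    _, pos = read_u32(module, pos)
                elif kind == 0x01:    # table: reftype byte, then limits
                    pos = skip_limits(module, pos + 1)
                elif kind == 0x02:    # memory: limits
                    pos = skip_limits(module, pos)
                elif kind == 0x03:    # global: valtype byte, mutability byte
                    pos += 2
                else:
                    raise ValueError(f"unsupported import kind {kind:#x}")
                imports.append((mod, field))
        pos = end  # skip to the next section regardless of its type
    return imports


def disallowed_imports(module: bytes):
    """Imports the module declares that are not on the allowlist."""
    return [imp for imp in list_imports(module) if imp not in ALLOWED_IMPORTS]
```

A module would be handed to the runtime only when disallowed_imports returns an empty list. Because the import section is fixed at compile time, this check is complete without executing any of the module's code.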

A compiled Rust binary in a scratch container was the other option for the inner layer. It was rejected because .wasm is architecture-independent and its import table is statically auditable before execution. One compile step works for both Apple Silicon dev and x86-64 Firecracker production.

15 architecture decision records (ADRs), all with status ACCEPTED FOR POC.
