Self-Improvement Loop

BenchGen is built around one idea: agents should improve themselves. Each cycle through the platform produces better agents, richer synthetic data, and tighter benchmarks — automatically.

  Enterprise Data
        │
        ▼
┌──────────────┐    trajectories    ┌──────────────┐
│   Simulate   │ ─────────────────▶ │    Train     │
│              │                    │              │
│  Digital     │ ◀───────────────── │  RL agents   │
│  twin of     │   trained model    │  learn from  │
│  business    │                    │  feedback    │
└──────────────┘                    └──────────────┘
        │                                  │
        ▼                                  ▼
┌──────────────┐                    ┌──────────────┐
│   Evaluate   │ ◀──── results ──── │   Generate   │
│              │                    │              │
│  Benchmark   │ ──── failures ───▶ │  Unlimited   │
│  against     │                    │  synthetic   │
│  real tasks  │                    │  data        │
└──────────────┘                    └──────────────┘
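
The data flow in the diagram can be sketched as a plain Python loop. This is an illustrative outline only: `CycleState`, `simulate`, `train`, `generate`, `evaluate`, and `run_cycle` are hypothetical stand-ins, not BenchGen's API, and the stub bodies just pass toy values along so the flow between stages is visible.

```python
from dataclasses import dataclass, field


@dataclass
class CycleState:
    """Artifacts carried from one loop iteration to the next (hypothetical)."""
    model_version: int = 0
    failures: list[str] = field(default_factory=list)


def simulate(state: CycleState) -> list[str]:
    # Stand-in: replay previously failed cases plus one fresh task.
    tasks = state.failures + [f"task-{state.model_version}"]
    return [f"trajectory({t})" for t in tasks]


def train(state: CycleState, trajectories: list[str]) -> int:
    # Stand-in: "training" just bumps the model version.
    return state.model_version + 1


def generate(model_version: int, n: int = 3) -> list[str]:
    # Stand-in: the trained model produces n synthetic examples.
    return [f"synthetic-{model_version}-{i}" for i in range(n)]


def evaluate(examples: list[str]) -> list[str]:
    # Stand-in: pretend every example at an even index fails.
    return [e for i, e in enumerate(examples) if i % 2 == 0]


def run_cycle(state: CycleState) -> CycleState:
    trajectories = simulate(state)
    model_version = train(state, trajectories)
    examples = generate(model_version)
    failures = evaluate(examples)
    # Failures are carried forward so the next Simulate run can target them.
    return CycleState(model_version=model_version, failures=failures)


state = CycleState()
for _ in range(2):
    state = run_cycle(state)
print(state.model_version, state.failures)
```

The point of the sketch is the hand-off: each stage consumes the previous stage's output, and the failures returned by the last stage seed the next cycle.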

The Four Stages

Simulate

BenchGen ingests your enterprise data (CRM records, ERP logs, support tickets, data warehouse exports) and builds a digital twin of your business. Agents operate inside this simulation, encountering the same complexity as real workflows but in a safe, controlled environment.

Output: interaction logs and trajectories that capture how agents succeed and fail.
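
To make "trajectory" concrete, here is a minimal sketch of how one might be represented. The `Step` and `Trajectory` classes, their field names, and the sample task are assumptions made for illustration; BenchGen's actual log schema is not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    observation: str  # what the agent saw (e.g. a ticket body)
    action: str       # what the agent did (e.g. a tool call)
    reward: float     # feedback assigned by the simulation


@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    succeeded: bool = False

    def total_reward(self) -> float:
        # Sum of per-step feedback across the whole episode.
        return sum(step.reward for step in self.steps)


# A toy episode: the agent handles a refund request in two steps.
traj = Trajectory(task_id="refund-request-001")
traj.steps.append(Step("Customer asks for a refund", "lookup_order", 0.0))
traj.steps.append(Step("Order found, within policy", "issue_refund", 1.0))
traj.succeeded = True

print(traj.task_id, len(traj.steps), traj.total_reward(), traj.succeeded)
```

Recording both the per-step feedback and the overall outcome keeps successful and failed episodes in the same shape, which is what lets them serve as a training signal later.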

Train

Trajectories from simulation become the training signal. RL agents learn by trial, error, and feedback: behaviors that lead to good outcomes are reinforced, and failures are penalized. Fine-tuning uses LoRA, so iteration is fast and adapters are lightweight.

Output: a fine-tuned model that handles your specific business context better than a generic base model.
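
A quick calculation shows why LoRA adapters are lightweight. LoRA freezes a pretrained weight matrix of shape d × k and learns a low-rank update made of two smaller matrices (d × r and r × k), so the trainable parameter count drops from d·k to r·(d + k). The dimensions below are illustrative assumptions, not BenchGen defaults.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA update (d x r plus r x k)."""
    return r * (d + k)


# Illustrative sizes only: one 4096 x 4096 projection with a rank-8 adapter.
d, k, r = 4096, 4096, 8
full = full_finetune_params(d, k)
lora = lora_params(d, k, r)

print(f"full fine-tune: {full:,} parameters")
print(f"LoRA (r={r}):   {lora:,} parameters")
print(f"ratio:          {lora / full:.4%}")
```

For these sizes the adapter trains 65,536 parameters against 16,777,216 for the full matrix, which is why adapters are cheap to store and swap between iterations.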

Generate

The trained model runs in the simulation at scale, generating unlimited synthetic data and trajectories. This is the factory part: BenchGen produces the labeled examples that would otherwise require expensive human annotation or waiting for real-world events.

Output: a rich synthetic dataset ready to fuel the next training round or export to external tools.
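
As a rough illustration of exporting labeled examples to external tools, the sketch below writes records as JSON Lines, one JSON object per line. The record fields (`task_id`, `input`, `label`) and the `write_jsonl` helper are hypothetical; the export format BenchGen actually uses is not specified here.

```python
import io
import json


def write_jsonl(examples, stream) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for example in examples:
        stream.write(json.dumps(example, sort_keys=True) + "\n")
        count += 1
    return count


# Hypothetical synthetic examples; the schema is an assumption for this sketch.
examples = [
    {"task_id": "ticket-triage-0", "input": "Printer offline", "label": "hardware"},
    {"task_id": "ticket-triage-1", "input": "Invoice total wrong", "label": "billing"},
]

buffer = io.StringIO()  # swap for open("synthetic.jsonl", "w") to write a file
written = write_jsonl(examples, buffer)
lines = buffer.getvalue().splitlines()

print(written, json.loads(lines[0])["label"])
```

JSON Lines is a common choice for this kind of hand-off because large datasets can be appended to and read back one record at a time.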

Evaluate

Generated data and agent interactions are benchmarked against real tasks. Eval surfaces exactly where agents still fail, and those failures become the input for the next Simulate run. Each benchmark narrows the gap between simulation performance and production performance.

Output: structured failure cases that drive the next iteration.
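
One way to picture "structured failure cases" is a summary that reports the pass rate per task category and groups the failed tasks for the next run. The `EvalResult` record and `summarize` function below are assumptions made for this sketch, not BenchGen's eval output format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    task_id: str
    category: str
    passed: bool
    reason: str = ""  # short explanation, filled in for failures


def summarize(results):
    """Return (pass rate per category, failed results grouped by category)."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    failures = defaultdict(list)
    for r in results:
        totals[r.category] += 1
        if r.passed:
            passes[r.category] += 1
        else:
            failures[r.category].append(r)
    pass_rate = {c: passes[c] / totals[c] for c in totals}
    return pass_rate, dict(failures)


# Toy results: one billing task fails, everything else passes.
results = [
    EvalResult("t1", "billing", True),
    EvalResult("t2", "billing", False, "wrong tax rate applied"),
    EvalResult("t3", "support", True),
    EvalResult("t4", "support", True),
]
pass_rate, failures = summarize(results)

print(pass_rate)
print([f.task_id for f in failures["billing"]])
```

Keeping the failure reason next to the task identifier is what makes a failure actionable: the next simulation run can target the specific category and scenario that went wrong.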

Why continuous improvement matters

Most agent deployments degrade over time as business context shifts. BenchGen’s loop means agents adapt:

Grounded in your data — simulation reflects your actual workflows, not generic tasks.
No human bottleneck — synthetic data generation replaces waiting for labeled real-world examples.
Short cycles — a benchmark failure can kick off a new training run in minutes.
Measurable progress — every loop is justified by an eval result, not a guess.

Next steps

Quickstart — run your first simulation and training loop
Agents overview — set up your simulation environment
Eval overview — benchmark your agents
Train overview — fine-tune on generated trajectories
