Know when your agent breaks before your users do
BenchGen reads your Hermes trajectories and tells you exactly what's failing — tool-call accuracy, skill coverage, goal completion rate. Catch regressions. Fix what matters. Fine-tune what's good.
Free forever tier · No credit card · Ships Q3 2026
Join 400+ Hermes developers already on the waitlist
How it works
Your trajectories are already there.
BenchGen reads them.
Hermes records every tool call, reasoning step, and outcome in Hermes Atropos trajectory files. BenchGen scores them. No instrumentation, no API keys, no config.
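As a rough illustration of what scoring a trajectory can mean, here is a minimal Python sketch of two of the metrics named above: tool-call accuracy and goal completion rate. It is not BenchGen's implementation, and it does not parse real trajectory files. The `ToolCall` and `Trajectory` records and their fields are hypothetical stand-ins for whatever a trajectory actually contains.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical record: one tool invocation and whether it succeeded.
    name: str
    succeeded: bool


@dataclass
class Trajectory:
    # Hypothetical record: the tool calls of one run and its final outcome.
    tool_calls: list[ToolCall]
    goal_completed: bool


def tool_call_accuracy(trajectories: list[Trajectory]) -> float:
    """Fraction of all tool calls, across all runs, that succeeded."""
    calls = [call for t in trajectories for call in t.tool_calls]
    if not calls:
        return 0.0
    return sum(call.succeeded for call in calls) / len(calls)


def goal_completion_rate(trajectories: list[Trajectory]) -> float:
    """Fraction of runs that reached their goal."""
    if not trajectories:
        return 0.0
    return sum(t.goal_completed for t in trajectories) / len(trajectories)


# Two made-up runs: one clean, one with a failed tool call and a missed goal.
SAMPLE = [
    Trajectory([ToolCall("search", True), ToolCall("read_file", True)], True),
    Trajectory([ToolCall("search", False), ToolCall("search", True)], False),
]

if __name__ == "__main__":
    print(tool_call_accuracy(SAMPLE))    # 0.75
    print(goal_completion_rate(SAMPLE))  # 0.5
```

Tracking these two numbers per release is one simple way to notice a regression: a drop in either one between versions is a signal to look at the failing runs.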
Why it matters
The problems Hermes users hit after week two
Every one of these was reported in real GitHub issues and Discord threads. BenchGen catches them before they reach your users.
The problem we're solving
88% of AI agent pilots never reach production
The single largest reason: teams can't tell when their agent is working well enough to trust in production. Evaluation and observability are cited by 64% of enterprises as the primary blocker.
How Hermes improves
Skills fix behaviour. Fine-tuning fixes the model.
BenchGen tells you which one you need.
Hermes has two improvement layers. Most developers only use Layer 1 — and wonder why some failures keep coming back.
Layer 1: Skills
- Hermes writes skill files to ~/.hermes/skills/
- Works with any model, instantly
- No GPU required
- Fixes prompt-level failures
Layer 2: Fine-tuning
- Updates model weights via Hermes Atropos trajectories
- Requires an open-weight model and a GPU
- Offline batch process
- Fixes model-level failures permanently
BenchGen
- Scores Layer 1 trajectories for quality
- Identifies Layer 1 vs Layer 2 failures
- Exports only clean trajectories for fine-tuning
- Without BenchGen: bad data makes models worse
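The triage and export steps in the list above can be sketched in a few lines of Python. This is an assumption-heavy illustration, not BenchGen's actual logic: the `ScoredTrajectory` fields, the 0.8 threshold, and the rule "a failure that persists even though a skill already covers the task is treated as model-level" are all hypothetical choices made for the example.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class ScoredTrajectory:
    # All fields are hypothetical, chosen only for this sketch.
    task: str
    score: float          # quality score in [0, 1]
    goal_completed: bool
    skill_applied: bool   # a skill file already covered this task


def triage(t: ScoredTrajectory) -> str:
    """Assumed heuristic: failures that survive a skill point at the model."""
    if t.goal_completed:
        return "ok"
    return "layer2" if t.skill_applied else "layer1"


def export_clean(trajectories: list[ScoredTrajectory],
                 min_score: float = 0.8) -> list[str]:
    """Keep only successful, high-scoring runs, one JSON string per run."""
    return [
        json.dumps(asdict(t))
        for t in trajectories
        if t.goal_completed and t.score >= min_score
    ]


SAMPLE = [
    ScoredTrajectory("summarise", 0.93, True, True),
    ScoredTrajectory("book_flight", 0.41, False, False),
    ScoredTrajectory("parse_invoice", 0.37, False, True),
]

if __name__ == "__main__":
    for t in SAMPLE:
        print(t.task, triage(t))
    for line in export_clean(SAMPLE):
        print(line)
```

The point of filtering before export is the one made in the last bullet: fine-tuning on failed or low-quality runs teaches the model the wrong behaviour, so only runs that pass both checks are kept.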
How it compares
Built for Hermes. Not bolted on.
Generic evaluation tools don't understand Hermes Atropos trajectories, Hermes skill files, or the two-layer improvement model. BenchGen was designed for this stack from day one.
| Feature | BenchGen | LangSmith | Arize Phoenix | DeepEval |
|---|---|---|---|---|
| Native Hermes Agent support | ||||
| Reads Hermes Atropos trajectory files | ||||
| Layer 1 vs Layer 2 failure diagnosis | ||||
| Fine-tuning trajectory export | ||||
| Skill file quality scoring | ||||
| Zero-config CLI install | ||||
| Automated regression alerts | ||||
| Open-weight model support | ||||
| Self-hosted / air-gapped option | ||||
FAQ
Common questions from Hermes developers
Start Evaluating
Your agent is already collecting data.
Start measuring it.
Join 400+ Hermes developers on the waitlist. Free tier ships first — no credit card required.
Free evaluation guide delivered immediately on signup