BenchGen for Hermes Agent — Evaluate, Score, and Fine-Tune Your Agent
Built for Hermes Agent · Early access

Know when your agent breaks before your users do

BenchGen reads your Hermes trajectories and tells you exactly where your agent is failing: tool-call accuracy, skill coverage, goal completion rate. Catch regressions. Fix what matters. Fine-tune what's good.

Free forever tier · No credit card · Ships Q3 2026

Join 400+ Hermes developers already on the waitlist

benchgen scan ~/.hermes — v0.1.0-beta
$ benchgen scan
→ Reading trajectories from ~/.hermes/trajectories/ (847 found)
→ Loading skill library (41 skills)
→ Scoring with BenchGen quality model...
 
━━━━ BENCHGEN QUALITY REPORT ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 
Overall quality score   74 / 100

Tool-call accuracy      83%
Goal completion         71%
Error recovery          48%
Skill coverage          79%
Memory utilisation      91%
 
▸ Top failure patterns (3 found)
CRITICAL odoo domain filter syntax error · 12 occurrences · see examples →
WARNING loop-on-empty-result in fetch-invoices skill · 8 occurrences
WARNING skill "quarterly-report" outdated (18 days) · field mapping changed
 
592 trajectories eligible for fine-tuning export
Run benchgen export --finetune to prepare training data
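
The sample report does not say how the overall score is derived from the five sub-scores. As a rough sanity check, an unweighted mean of the figures shown happens to reproduce it. The sketch below is an assumption for illustration only, not BenchGen's documented formula; the function name `overall_score` and the dictionary keys are hypothetical.

```python
# Sub-scores copied from the sample report (percentages).
SUB_SCORES = {
    "tool_call_accuracy": 83,
    "goal_completion": 71,
    "error_recovery": 48,
    "skill_coverage": 79,
    "memory_utilisation": 91,
}


def overall_score(sub_scores: dict[str, float]) -> int:
    """Unweighted mean of the sub-scores, rounded to the nearest integer.

    Assumption: every sub-score carries equal weight. The real scoring
    model may weight them differently.
    """
    if not sub_scores:
        raise ValueError("at least one sub-score is required")
    return round(sum(sub_scores.values()) / len(sub_scores))


# 372 / 5 = 74.4, which rounds to the 74 shown in the report.
print(overall_score(SUB_SCORES))

# The weakest dimension is the natural place to start fixing.
print(min(SUB_SCORES, key=SUB_SCORES.get))
```

Read this way, error recovery at 48% is what drags the headline number down the most.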

How it works

Your trajectories are already there.
BenchGen reads them.

Hermes records every tool call, reasoning step, and outcome in Hermes Atropos trajectory files. BenchGen scores them. No instrumentation, no API keys, no config.

01 / INSTALL
One command. Zero config.
BenchGen auto-detects your Hermes installation. No API keys. No accounts yet. Works on any machine running Hermes.
pip install benchgen
02 / SCAN
Reads your trajectories.
Scans ~/.hermes/ for Hermes Atropos trajectories, skill files, and memory files. Scores 100+ trajectories in under 3 minutes.
benchgen scan
03 / FIX
See exactly what's failing.
Quality score, failure patterns, specific examples from your trajectories. Know which skills to fix, which to fine-tune, which to discard.
benchgen report --open

Why it matters

The problems Hermes users hit after week two

Every one of these was reported in real GitHub issues and Discord threads. BenchGen catches them before they reach your users.

🔴
You switch models to save money. Quality drops 18 points. You find out from a customer complaint.
BenchGen records a baseline before every model change and alerts you within minutes if quality regresses. That costs $29/month, versus the cost of one bad production incident.
🔄
Your agent keeps making the same tool-call error even though you created a skill to fix it.
This is a Layer 2 problem — the error is in the model weights, not the skill. BenchGen identifies which failures need a skill fix vs. a fine-tune.
📉
Your Hermes agent has been running for 2 months. You don't know if it's getting better or worse.
BenchGen tracks your Quality Score week over week. You get a trend chart and a diff report every Monday morning.
🎓
You want to fine-tune your model but don't know which trajectories are good enough to train on.
BenchGen scores all your trajectories and exports a filtered training set — only the top 60–70%. No failed runs in your training data.
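
The export filter described in the last card can be sketched as follows. This is a minimal illustration under stated assumptions, not BenchGen's implementation: the `ScoredTrajectory` record, its fields, and `select_for_finetuning` are hypothetical names, and the real export may use different criteria. In the sample report, 592 of 847 trajectories (about 70%) are eligible, which matches the upper end of the quoted range.

```python
from dataclasses import dataclass


@dataclass
class ScoredTrajectory:
    trajectory_id: str
    score: float     # hypothetical quality score, 0-100
    succeeded: bool  # whether the run reached its goal


def select_for_finetuning(trajectories, keep_fraction=0.7):
    """Return the highest-scoring successful runs.

    Failed runs are always dropped; the result is capped at
    keep_fraction of the total number of trajectories.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    limit = int(len(trajectories) * keep_fraction)
    successful = [t for t in trajectories if t.succeeded]
    ranked = sorted(successful, key=lambda t: t.score, reverse=True)
    return ranked[:limit]


runs = [
    ScoredTrajectory("a", 92.0, True),
    ScoredTrajectory("b", 35.0, False),
    ScoredTrajectory("c", 78.0, True),
    ScoredTrajectory("d", 61.0, True),
    ScoredTrajectory("e", 88.0, False),
]

selected = select_for_finetuning(runs, keep_fraction=0.5)
print([t.trajectory_id for t in selected])  # ['a', 'c']
```

Run "e" scores well but is still excluded because it failed, which is the point of filtering on outcome as well as score.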

The problem we're solving

88% of AI agent pilots never reach production

The single largest reason: teams can't tell when their agent is working well enough to trust in production. Evaluation and observability are cited by 64% of enterprises as the primary blocker.

88% of agent pilots fail to reach production
37% gap between benchmark scores and real-world performance
64% of enterprises cite eval as their #1 deployment blocker
171% average ROI for agents that do reach production

How Hermes improves

Skills fix behaviour. Fine-tuning fixes the model.
BenchGen tells you which one you need.

Hermes has two improvement layers. Most developers only use Layer 1 — and wonder why some failures keep coming back.

Layer 1 — Skills & Memory
  • Hermes writes skill files to ~/.hermes/skills/
  • Works with any model, instant
  • No GPU required
  • Fixes prompt-level failures
Layer 2 — LoRA Fine-tuning
  • Updates model weights via Hermes Atropos trajectories
  • Requires open-weight model + GPU
  • Offline batch process
  • Fixes model-level failures permanently
BenchGen — sits between layers
  • Scores Layer 1 trajectories for quality
  • Identifies Layer 1 vs Layer 2 failures
  • Exports only clean trajectories for fine-tuning
  • Without BenchGen: bad data makes models worse
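
One way to picture the Layer 1 vs Layer 2 diagnosis is a simple rule: if a failure keeps occurring after a skill was written to fix it, the skill layer has already been tried and the problem likely sits in the model weights. The sketch below encodes only that rule. It is a hypothetical heuristic for illustration; `FailurePattern`, `Layer`, and `triage` are invented names, and the page does not describe how BenchGen actually makes this call.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    SKILL = "Layer 1: skill or memory fix"
    FINE_TUNE = "Layer 2: fine-tune"


@dataclass
class FailurePattern:
    name: str
    has_corrective_skill: bool    # a skill already exists that targets this failure
    occurrences_after_skill: int  # failures still seen after that skill was added


def triage(pattern: FailurePattern) -> Layer:
    """Assumed rule: a failure that persists despite a corrective skill
    is treated as model-level; anything else gets a skill fix first."""
    if pattern.has_corrective_skill and pattern.occurrences_after_skill > 0:
        return Layer.FINE_TUNE
    return Layer.SKILL


# Illustrative inputs only; the flags below are made up for the example.
persistent = FailurePattern("domain filter syntax error", True, 12)
untreated = FailurePattern("loop-on-empty-result", False, 0)

print(triage(persistent).value)  # Layer 2: fine-tune
print(triage(untreated).value)   # Layer 1: skill or memory fix
```

Trying the cheap fix first and escalating only when it demonstrably fails keeps GPU time for the cases that need it.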

How it compares

Built for Hermes. Not bolted on.

Generic evaluation tools don't understand Hermes Atropos trajectories, Hermes skill files, or the two-layer improvement model. BenchGen was designed for this stack from day one.

Feature · BenchGen · LangSmith · Arize Phoenix · DeepEval
Native Hermes Agent support
Reads Hermes Atropos trajectory files
Layer 1 vs Layer 2 failure diagnosis
Fine-tuning trajectory export
Skill file quality scoring
Zero-config CLI install
Automated regression alerts
Open-weight model support
Self-hosted / air-gapped option

FAQ

Common questions from Hermes developers

Start Evaluating

Your agent is already collecting data.
Start measuring it.

Join 400+ Hermes developers on the waitlist. Free tier ships first — no credit card required.

Free evaluation guide delivered immediately on signup