Question 1

Does BenchGen work with any version of Hermes Agent?

Accepted Answer

Yes. BenchGen reads standard Hermes Atropos JSONL trajectory files from ~/.hermes/trajectories/. As long as your Hermes installation is writing trajectories (which all versions do by default), BenchGen will work. No changes to your Hermes config are needed.

Question 2

What does Hermes Atropos have to do with BenchGen?

Accepted Answer

Hermes Atropos is the reinforcement learning framework embedded inside Hermes. Every time your agent completes a task, Hermes Atropos writes a trajectory file recording every step: the observation, the reasoning, the tool call, and the result. BenchGen reads those files and scores them. In short, Hermes Atropos generates the data and BenchGen analyses it.
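
To make the four recorded parts of a step concrete, here is a minimal Python sketch that parses the text of one JSONL trajectory into step records. The key names (observation, reasoning, tool_call, result) and the one-step-per-line layout are assumptions made for illustration; check them against a real file in ~/.hermes/trajectories/ before relying on them.

```python
import json
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Step:
    """One agent step. Field names are hypothetical, not a documented schema."""
    observation: str
    reasoning: str
    tool_call: Any
    result: Any


def parse_trajectory(jsonl_text: str) -> List[Step]:
    """Parse JSONL text (assumed: one JSON object per step) into Step records."""
    steps = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        steps.append(
            Step(
                observation=record.get("observation", ""),
                reasoning=record.get("reasoning", ""),
                tool_call=record.get("tool_call"),
                result=record.get("result"),
            )
        )
    return steps
```

Missing keys fall back to an empty string or None rather than raising, so a partially written trajectory can still be inspected.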

Question 3

How is this different from just reading Hermes logs?

Accepted Answer

Logs tell you what happened. BenchGen tells you why a run failed and which of the two layers needs fixing: the prompt layer or the model weights. Raw trajectory files are JSONL and hard to interpret at scale. BenchGen aggregates across hundreds of trajectories, surfaces recurring failure patterns, scores quality dimensions, and distinguishes prompt-level problems (fix with a skill) from weight-level problems (fix with fine-tuning).
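
The aggregation idea can be illustrated with a short Python sketch that counts how many trajectories each failure appears in. This is not BenchGen's actual implementation; the "error" key on each step and the min_count parameter are hypothetical names chosen for the example.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def recurring_failures(
    trajectories: List[List[Dict[str, Any]]], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return (error signature, trajectory count) pairs, most frequent first.

    Each trajectory is a list of step dicts. A step is treated as failed
    when it carries a non-empty "error" value (an assumed field).
    """
    counts: Counter = Counter()
    for steps in trajectories:
        # A set ensures an error repeated within one trajectory counts once.
        signatures = {step["error"] for step in steps if step.get("error")}
        counts.update(signatures)
    return [(sig, n) for sig, n in counts.most_common() if n >= min_count]
```

Counting per trajectory rather than per step keeps one noisy run from dominating the ranking, which is what makes a pattern "recurring" rather than merely frequent.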

Question 4

Do I need a GPU to run BenchGen?

Accepted Answer

No. BenchGen itself is a lightweight CLI that reads files and calls a scoring model. You only need GPU resources if you want to use the fine-tuning export feature — that runs on DT Cloud H100/H200 infrastructure separately, not on your local machine.

Question 5

Can I use BenchGen if I'm using a closed model like GPT-4 or Claude?

Accepted Answer

Yes for scoring and evaluation — BenchGen can analyse trajectories from any model. However, the fine-tuning export feature (benchgen export --finetune) only applies to open-weight models like Llama 4, Qwen 3, and Hermes 3, since you need model weight access for LoRA training.

Question 6

What's the difference between a skill fix and a fine-tune, and how does BenchGen tell them apart?

Accepted Answer

A skill fix writes a new instruction file to ~/.hermes/skills/ — it changes how the model uses its context. A fine-tune updates the model's weights using trajectory data. BenchGen classifies failures by whether they recur across different prompts and contexts (weight-level, needs fine-tune) or only in specific task patterns (prompt-level, fixable with a skill). If the same tool-call error appears in 40+ trajectories across diverse tasks, it's a weight problem, not a context problem.
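
The rule of thumb above could be sketched in Python roughly as follows. Only the 40-trajectory figure comes from the answer; the min_task_types threshold that stands in for "diverse tasks", the tuple layout, and the function name are assumptions for illustration, and BenchGen's real classifier may work differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def classify_failures(
    failures: Iterable[Tuple[str, str, str]],
    min_trajectories: int = 40,  # from the "40+ trajectories" rule of thumb
    min_task_types: int = 3,     # assumed stand-in for "diverse tasks"
) -> Dict[str, str]:
    """Label each error as "weight-level" or "prompt-level".

    `failures` yields (trajectory_id, task_type, error_signature) tuples.
    """
    trajectories_by_error = defaultdict(set)
    task_types_by_error = defaultdict(set)
    for trajectory_id, task_type, error in failures:
        trajectories_by_error[error].add(trajectory_id)
        task_types_by_error[error].add(task_type)

    labels = {}
    for error, ids in trajectories_by_error.items():
        widespread = (
            len(ids) >= min_trajectories
            and len(task_types_by_error[error]) >= min_task_types
        )
        # Widespread across tasks suggests the weights; otherwise try a skill.
        labels[error] = "weight-level" if widespread else "prompt-level"
    return labels
```

An error seen 45 times but only in one task type stays "prompt-level" under this sketch, since it recurs in a specific task pattern rather than across contexts.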

Question 7

How many trajectories do I need before BenchGen gives useful results?

Accepted Answer

BenchGen starts returning meaningful patterns from around 50 trajectories, and becomes highly reliable above 200. Most Hermes users who have been running their agent for more than a week already have 500–2,000 trajectories in ~/.hermes/trajectories/. Run ls ~/.hermes/trajectories/ | wc -l to check yours.
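
If you prefer a script to the shell one-liner, the following Python sketch counts trajectory files and maps the count onto the rough thresholds given above. The function names and the wording of the returned labels are made up for this example, and the cut-offs are approximate ("around 50", "above 200").

```python
from pathlib import Path


def count_trajectory_files(directory: str = "~/.hermes/trajectories") -> int:
    """Count regular files in the trajectory directory (0 if it is missing)."""
    path = Path(directory).expanduser()
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_file())


def readiness(count: int) -> str:
    """Map a trajectory count to the approximate tiers described above."""
    if count < 50:
        return "too few for meaningful patterns"
    if count <= 200:
        return "meaningful patterns"
    return "highly reliable"


if __name__ == "__main__":
    n = count_trajectory_files()
    print(f"{n} trajectory files: {readiness(n)}")
```

Unlike ls piped to wc -l, this counts only regular files (including hidden ones) and skips subdirectories, so the two numbers can differ slightly.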

Question 8

What happens to my trajectory data — does it leave my machine?

Accepted Answer

In the free CLI tier, all scoring runs locally on your machine. Your trajectory data is never uploaded anywhere. The cloud dashboard (Pro tier) optionally syncs anonymised metrics but never raw trajectory content. The self-hosted enterprise option processes everything within your own infrastructure.

Features compared across BenchGen, LangSmith, Arize Phoenix, and DeepEval:
Native Hermes Agent support
Reads Hermes Atropos trajectory files
Layer 1 vs Layer 2 failure diagnosis
Fine-tuning trajectory export
Skill file quality scoring
Zero-config CLI install
Automated regression alerts
Open-weight model support
Self-hosted / air-gapped option
