
What we did
Benchgen built simulation environments from real enterprise energy datasets (SCADA logs, CMMS records, CRM tickets, OMS workflows, and meteorological data) and ran Turkish LLMs and autonomous AI agents through full trajectory evaluations before any agent was cleared for sovereign deployment on Türkiye-based GPU infrastructure.
Energy infrastructure generates enormous volumes of operational data - telemetry from SCADA grid monitoring systems, IoT sensors across substations, grid events, energy consumption datasets, and maintenance histories. Enerjisa wanted to deploy LLM-based agents to analyze this data and support operational decision-making.
Potential agent use cases included anomaly detection in grid behavior, predictive maintenance triage, identification of energy losses, and automated analysis of infrastructure telemetry. But deploying autonomous LLM agents directly into production energy infrastructure without prior evaluation carries significant operational risk.
Standard LLM benchmarks evaluate model quality on static prompt-response tasks. They cannot show how an LLM agent behaves when it must process SCADA telemetry, query maintenance histories, call tools, and make sequential decisions across a multi-step operational workflow, which is where these agents tend to fail.
The evaluation problem was compounded by real operational complexity: fragmented data across SCADA, CRM, OMS, and CMMS; time-sensitive maintenance and outage workflows; SLA constraints; wind farm scenarios driven by weather and telemetry; and national requirements for data sovereignty on Turkish GPU infrastructure.
Enerjisa used Benchgen to create enterprise simulation environments grounded in real energy datasets. Rather than evaluating LLMs on isolated prompts, Benchgen ran Turkish language models and autonomous agents through full operational workflows - interacting with telemetry streams, querying maintenance histories, processing weather inputs, and completing multi-step tasks against enterprise data.
Every evaluation run captured the agent's complete decision trajectory: each tool call, each retrieval step, each intermediate action, and the final outcome. This let engineers see exactly where an LLM agent succeeded, where it failed, and how it behaved under the messy, multi-source data conditions of real energy operations.
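As an illustration of what capturing a complete decision trajectory can look like, here is a minimal sketch of a trace record. The schema, field names, and example values (`Step`, `Trajectory`, `cmms_history`, `scada_query`, `TX-17`) are hypothetical and chosen for illustration; they are not Benchgen's actual data model.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Step:
    """One recorded action in an agent run (hypothetical schema)."""
    kind: str                      # "tool_call", "retrieval", or "action"
    name: str                      # tool or data source name
    inputs: dict[str, Any]
    output: Any = None
    error: Optional[str] = None    # set when the step failed


@dataclass
class Trajectory:
    """Full decision trajectory for one evaluation run."""
    task_id: str
    steps: list[Step] = field(default_factory=list)
    outcome: Optional[str] = None  # final result, e.g. "success" or "failure"

    def record(self, kind, name, inputs, output=None, error=None):
        """Append one step to the trace in the order it happened."""
        self.steps.append(Step(kind, name, inputs, output, error))

    def first_failure(self):
        """Return (index, step) of the first step that errored, or None."""
        for i, step in enumerate(self.steps):
            if step.error is not None:
                return i, step
        return None


# Example: a maintenance triage run that breaks on its second step.
traj = Trajectory(task_id="maintenance-triage-001")
traj.record("retrieval", "cmms_history", {"asset_id": "TX-17"},
            output=["WO-1", "WO-2"])
traj.record("tool_call", "scada_query", {"asset_id": "TX-17", "window_h": 24},
            error="timeout")
traj.outcome = "failure"
```

Keeping every step, not just the final outcome, is what lets an engineer call something like `traj.first_failure()` and see where in the workflow the run went wrong.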
The resulting benchmark traces were also structured as RL training data (reward signals, preference datasets, and failure-mode records), feeding reinforcement learning pipelines that use PPO, GRPO, and process reward model (PRM) style methods to iteratively improve agent behavior before sovereign deployment.
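To make the step from scored traces to training data concrete, here is a minimal sketch under stated assumptions. It takes several scored runs of the same task, builds chosen/rejected preference pairs, and computes group-normalized advantages of the kind GRPO-style methods use. The `(trajectory_id, score)` layout, the function names, and the scoring scale are assumptions for illustration, not the pipeline described in this case study.

```python
from itertools import combinations
from statistics import mean, pstdev


def preference_pairs(runs, margin=0.0):
    """Build chosen/rejected pairs from scored runs of the same task.

    `runs` is a list of (trajectory_id, score) tuples (assumed layout).
    Pairs whose score gap is within `margin` are skipped as ties.
    """
    pairs = []
    for (id_a, s_a), (id_b, s_b) in combinations(runs, 2):
        if abs(s_a - s_b) <= margin:
            continue  # no clear preference between these two runs
        chosen, rejected = (id_a, id_b) if s_a > s_b else (id_b, id_a)
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


def group_relative_advantages(scores, eps=1e-8):
    """Normalize scores within one task's group of runs (GRPO-style)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Three hypothetical runs of one task, scored 1.0 (pass) or 0.0 (fail).
runs = [("run-a", 1.0), ("run-b", 0.0), ("run-c", 1.0)]
pairs = preference_pairs(runs)
advantages = group_relative_advantages([score for _, score in runs])
```

In this example the two passing runs are each preferred over the failing one, and the failing run receives a negative advantage relative to its group.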
This case shows why trajectory-based agent benchmarking matters beyond generic model evaluation. For Enerjisa-style energy environments, the value isn't a better leaderboard score - it's the ability to answer the questions that actually matter before deployment: Can this Turkish LLM reason correctly over real SCADA data? Can this autonomous agent complete a maintenance triage workflow without hallucinating a step? Where exactly does it break under real operational conditions?
Benchgen turned enterprise energy data into a live evaluation environment where LLM agents could be pressure-tested before touching production. The resulting benchmark traces (full decision trajectories, scored tool calls, failure-mode records) also became RL training data, feeding iterative improvement pipelines intended to make each agent version measurably better than the last.
By running entirely on Türkiye-based sovereign GPU infrastructure with no external data exposure, the program demonstrates that rigorous LLM and agent benchmarking doesn't require sending data to cloud providers - and that national AI strategies can be both technically serious and fully sovereign.