What we did
Benchgen built a digital twin of DT Cloud's cloud infrastructure operations - Kubernetes clusters, networking, storage, and IAM policies - and ran LLM-powered DevOps agents through full trajectory evaluations across environment provisioning, configuration management, and incident response workflows before any agent was deployed to production.
Modern cloud infrastructure is complex and highly dynamic. A typical environment deployment requires multiple coordinated steps: create virtual networking, configure identity and access policies, provision compute resources, deploy Kubernetes clusters, attach storage volumes, configure monitoring and logging, and validate the environment against security policies.
When introducing AI agents into this process, several challenges emerge: ensuring agents select the correct infrastructure actions, preventing configuration errors, validating security policies, and guaranteeing deployment consistency. Infrastructure errors can lead to downtime, security breaches, or failed deployments - the stakes are too high for trial-and-error in production.
Traditional testing methods cannot simulate the full complexity of infrastructure orchestration. DT Cloud needed a system capable of recreating cloud infrastructure operations in a controlled simulation environment, allowing AI agents to be benchmarked across thousands of realistic scenarios before interacting with real systems.
To meet that need, Benchgen created a digital twin of DT Cloud's cloud infrastructure workflows. Within this simulated environment, AI agents interact with infrastructure APIs and automation pipelines as if they were operating real systems - selecting templates, provisioning networks, deploying Kubernetes clusters, configuring storage, and validating deployments end-to-end.
Instead of testing isolated prompts, Benchgen evaluates complete operational trajectories. A typical trajectory: receive a customer environment request → select the appropriate infrastructure template → create virtual network and security groups → deploy Kubernetes cluster → configure storage and monitoring → validate deployment → deliver environment. Each step becomes a benchmarkable decision point measuring action selection, policy compliance, and deployment success.
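As a rough illustration of how a trajectory can be scored at each decision point, the sketch below rates one run on the three dimensions named above. It is not Benchgen's implementation: the step identifiers, the StepResult type, and the score_trajectory function are hypothetical names chosen for this example, and the position-by-position comparison is a simplifying assumption.

```python
from dataclasses import dataclass

# Step identifiers mirror the typical trajectory described above.
# The names themselves are illustrative, not taken from Benchgen.
EXPECTED_STEPS = [
    "receive_request",
    "select_template",
    "create_network",
    "deploy_kubernetes",
    "configure_storage_monitoring",
    "validate_deployment",
    "deliver_environment",
]


@dataclass
class StepResult:
    action: str             # action the agent chose at this decision point
    policy_compliant: bool  # hypothetical flag: did the action pass policy checks?


def score_trajectory(steps, deployed_ok):
    """Score one trajectory on action selection, policy compliance, and deployment success."""
    # Compare the chosen action with the expected one, position by position.
    correct = sum(
        1 for expected, step in zip(EXPECTED_STEPS, steps) if step.action == expected
    )
    compliant = sum(1 for step in steps if step.policy_compliant)
    return {
        "action_accuracy": correct / len(EXPECTED_STEPS),
        "policy_compliance": compliant / len(steps) if steps else 0.0,
        "deployment_success": bool(deployed_ok),
    }


# Example: the agent skips network creation and deploys Kubernetes too early.
example_steps = [StepResult(name, True) for name in EXPECTED_STEPS]
example_steps[2] = StepResult("deploy_kubernetes", False)
example_report = score_trajectory(example_steps, deployed_ok=False)
```

A real benchmark would likely allow more than one valid action sequence per request; the fixed expected list keeps the example short.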
Every simulated deployment generated structured trajectory data - sequences of infrastructure actions, API calls, configuration choices, and final outcomes. These execution traces were reused as training data for reinforcement learning (RL) pipelines, which improved agent policies in three areas: deployment error reduction, configuration optimization, and incident recovery.
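One common way to reuse execution traces for reinforcement learning is to convert each trace into (state, action, reward, next state) transitions. The sketch below shows that conversion under stated assumptions: the Transition type, the trace_to_transitions function, and the reward scheme (a small per-step cost plus a terminal reward for success or failure) are all hypothetical and are not described in the case study.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    state: dict       # environment snapshot before the action
    action: str       # infrastructure action the agent took
    reward: float     # reward assigned to this step
    next_state: dict  # environment snapshot after the action
    done: bool        # True on the final step of the trajectory


def trace_to_transitions(states, actions, success,
                         step_reward=-0.01, success_reward=1.0, failure_reward=-1.0):
    """Convert one execution trace into RL transitions.

    `states` holds one snapshot before each action plus the final snapshot,
    so it must be exactly one element longer than `actions`.
    The reward values are assumptions chosen for illustration.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")
    transitions = []
    last = len(actions) - 1
    for i, action in enumerate(actions):
        done = i == last
        reward = step_reward
        if done:
            # Only the final step carries the outcome of the whole deployment.
            reward = success_reward if success else failure_reward
        transitions.append(Transition(states[i], action, reward, states[i + 1], done))
    return transitions


# Example: a two-step trace that ends in a failed deployment.
example_transitions = trace_to_transitions(
    states=[{"network": False}, {"network": True}, {"network": True, "cluster": False}],
    actions=["create_network", "deploy_kubernetes"],
    success=False,
)
```

Both successful and failed traces yield usable transitions here, since the failure case simply carries a negative terminal reward.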
For DT Cloud, the ability to benchmark infrastructure agents before deployment provides a decisive strategic advantage. Autonomous infrastructure management is only viable if the agents can be proven reliable before they touch production - Benchgen makes that proof possible at scale.
The RL feedback loop transforms simulation into a continuous improvement engine. Every trajectory - whether a successful Kubernetes deployment or a failed IAM policy configuration - becomes a training signal. Agents improve measurably across thousands of scenarios, reducing misconfiguration rates and mean time to recovery (MTTR) with each iteration cycle.
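To make the two metrics concrete, the sketch below computes a misconfiguration rate and a mean time to recovery from simulated results, so that successive iteration cycles can be compared. The record fields (misconfigured, detected_at, recovered_at) and the function names are hypothetical, and treating MTTR as the average of recovery time minus detection time is an assumption about how it is measured.

```python
def misconfiguration_rate(trajectories):
    """Fraction of trajectories that ended with a misconfiguration."""
    if not trajectories:
        return 0.0
    return sum(1 for t in trajectories if t["misconfigured"]) / len(trajectories)


def mean_time_to_recovery(incidents):
    """Average seconds from detection to recovery, over recovered incidents only."""
    durations = [
        i["recovered_at"] - i["detected_at"]
        for i in incidents
        if i.get("recovered_at") is not None
    ]
    if not durations:
        return None  # nothing recovered, so the metric is undefined
    return sum(durations) / len(durations)


# Example data for one iteration cycle (timestamps in seconds, values made up).
example_trajectories = [
    {"misconfigured": False},
    {"misconfigured": True},
    {"misconfigured": False},
    {"misconfigured": False},
]
example_incidents = [
    {"detected_at": 0.0, "recovered_at": 120.0},
    {"detected_at": 30.0, "recovered_at": 90.0},
    {"detected_at": 50.0, "recovered_at": None},  # never recovered, so excluded
]
example_rate = misconfiguration_rate(example_trajectories)
example_mttr = mean_time_to_recovery(example_incidents)
```

Running the same two functions on each cycle's results gives a simple series to track over time.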
By running the entire benchmarking program on sovereign Turkish GPU infrastructure, DT Cloud demonstrates that rigorous AI agent validation and national data sovereignty are fully compatible - setting a blueprint for how cloud providers can responsibly deploy autonomous infrastructure management at enterprise scale.