Data Engineering–Driven World Models for Consequence-Aware Agentic Systems
Brahma Reddy Katam
Technical Lead, Data Engineering and Advanced Computing
Abstract: Large Language Models (LLMs) have enabled a new generation of intelligent agents capable of generating queries, automating workflows, and assisting with data engineering tasks through natural language interaction. Despite these advances, most LLM-based agents remain fundamentally reactive: they operate by predicting text rather than by anticipating the operational consequences of their actions. In production-scale data platforms, actions such as schema changes, table optimizations, or compute scaling directly affect performance, cost, and system reliability. An autonomous agent that cannot forecast these outcomes may introduce failures, inefficiencies, or unsafe decisions, which exposes a critical gap between language intelligence and system intelligence.
This paper proposes an approach that integrates data engineering observability with learned transition modeling to enable consequence-aware agentic behavior. We introduce Data Engineering–Driven World Models, in which agents learn the state-transition behavior of data platforms from historical telemetry, system metrics, and action–outcome logs. Instead of executing changes directly, agents first simulate the resulting system states and evaluate the expected impact, and only then act, which enables safer planning and more reliable automation.
To operationalize this concept, we present the Data System Digital Twin (DSDT) architecture, which combines observability pipelines, structured state encoding, machine learning–based world models, and planning modules with LLM interfaces. The framework continuously captures runtime and cost signals, learns system dynamics, and selects the actions with the best predicted outcomes through simulation-based reasoning. A prototype implementation in a lakehouse environment demonstrates improvements in runtime efficiency, infrastructure cost, and failure prevention compared to rule-based and LLM-only approaches. This work shows that combining world models with strong data engineering foundations provides a practical pathway toward safe, self-optimizing, and autonomous data platforms.
Keywords
Agentic AI, World Models, Data Engineering, Autonomous Systems, Data System Digital Twin, Predictive Modeling, Intelligent Agents, Lakehouse Optimization, Consequence-Aware Planning, Data Platform Automation
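As a rough illustration of the consequence-aware loop summarized in the abstract (learn transitions from action–outcome logs, simulate each candidate action, and act only on an acceptable prediction), the following minimal Python sketch may help. It is not the DSDT implementation. The state fields, the action names, the mean-delta "world model", the failure threshold, and the cost weight are all hypothetical choices made for illustration, and the log entries are invented toy values rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical encoded platform state (fields are illustrative assumptions)."""
    runtime_s: float      # job runtime in seconds
    cost_usd: float       # infrastructure cost per run
    failure_rate: float   # fraction of failed runs, in [0, 1]


class MeanDeltaWorldModel:
    """Toy transition model: predicts the average state change each action
    caused in the logged (before, action, after) transitions."""

    def __init__(self):
        self._sums = defaultdict(lambda: [0.0, 0.0, 0.0])
        self._counts = defaultdict(int)

    def fit(self, log):
        for before, action, after in log:
            sums = self._sums[action]
            sums[0] += after.runtime_s - before.runtime_s
            sums[1] += after.cost_usd - before.cost_usd
            sums[2] += after.failure_rate - before.failure_rate
            self._counts[action] += 1

    def predict(self, state, action):
        n = self._counts.get(action, 0)
        if n == 0:
            return state  # assumption: an unseen action leaves the state unchanged
        d_runtime, d_cost, d_fail = (total / n for total in self._sums[action])
        return State(
            runtime_s=state.runtime_s + d_runtime,
            cost_usd=state.cost_usd + d_cost,
            failure_rate=min(1.0, max(0.0, state.failure_rate + d_fail)),
        )


def plan(model, state, actions, max_failure_rate=0.05, cost_weight=30.0):
    """Simulate every candidate action and return the one with the lowest
    predicted score, skipping actions predicted to exceed the failure limit."""
    best_action, best_score = None, float("inf")
    for action in actions:
        predicted = model.predict(state, action)
        if predicted.failure_rate > max_failure_rate:
            continue  # predicted to be unsafe: never executed
        score = predicted.runtime_s + cost_weight * predicted.cost_usd
        if score < best_score:
            best_action, best_score = action, score
    return best_action


# Invented toy action-outcome log (not data from the paper).
log = [
    (State(600, 4.0, 0.01), "compact_table", State(480, 4.2, 0.01)),
    (State(620, 4.1, 0.02), "compact_table", State(520, 4.3, 0.02)),
    (State(600, 4.0, 0.01), "scale_up", State(400, 9.0, 0.01)),
    (State(600, 4.0, 0.01), "alter_schema", State(590, 4.0, 0.21)),
]

model = MeanDeltaWorldModel()
model.fit(log)

current = State(600, 4.0, 0.01)
candidates = ["no_op", "compact_table", "scale_up", "alter_schema"]
choice = plan(model, current, candidates)
print(choice)
```

With these toy numbers the planner picks `compact_table`: `scale_up` is predicted to be faster but costs more under the assumed cost weight, and `alter_schema` is filtered out because its predicted failure rate exceeds the assumed threshold. A real world model would replace the per-action mean with a learned, state-dependent predictor; the simulate-then-select structure is the point of the sketch.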