LLM-Based Cognitive AIOps Autonomous Cloud Incident Management
Dr. KAVITHA K S
Computer Science & Engineering Dayananda Sagar College of Engineering Bengaluru, India dr.kavitha-cse@dayanandasagar.edu
AYUSH GUPTA
Computer Science & Engineering Dayananda Sagar College of Engineering Bengaluru, India Ayushgupta1312005@gmail.com
Dr. NAGRAJ M LUTIMATH
Computer Science & Engineering Dayananda Sagar Academy of Technology and Management Bengaluru, India
nagarajml-cse@dsatm.edu.in
ADI HEMA AJAY CHARAN
Computer Science & Engineering Dayananda Sagar College of Engineering Bengaluru, India adiajay12367@gmail.com
AKASH SINGH
Computer Science & Engineering Dayananda Sagar College of Engineering Bengaluru, India akashsingh2710670@gmail.com
ANANYA GUPTA
Computer Science & Engineering Dayananda Sagar College of Engineering Bengaluru, India
ananyagupta8303@gmail.com
Abstract
Cloud-based services form the foundation of modern digital infrastructure, supporting everything from enterprise systems to immersive web and AI applications. Yet as organisations scale, incident management becomes sharply harder: distributed microservices, dynamic workloads, intricate dependencies and ever-changing configurations combine to create failure modes that defy traditional manual processes. In this context, relying on human-driven Troubleshooting Guides (TSGs) is increasingly inadequate: the sheer volume of alerts, the interwoven nature of faults and the rapid pace of change make response slow, resolution accuracy inconsistent and operational costs high. Moreover, existing AIOps tools, while capable of anomaly detection or alert correlation, often fall short of providing end-to-end autonomous resolution, contextual reasoning and transparent decision-making. To bridge this gap, we propose DreamOps, a novel AI-driven centralised incident management framework tailored for cloud environments. Instead of treating alert generation and remediation as separate silos, DreamOps integrates intelligent perception, cognitive reasoning and deterministic execution into a cohesive pipeline. At its core, DreamOps transforms unstructured operational knowledge (such as incident logs, TSGs and domain-expert playbooks) into executable workflows via large language model (LLM)-based reasoning agents. The system ingests telemetry from monitoring stacks (for example, metrics, logs and traces), applies machine-learning classifiers and anomaly detectors to classify and prioritise incidents, and then invokes LLM agents to generate context-aware mitigation plans. These plans are then validated, converted into deterministic workflows and executed, while retaining human oversight for critical decisions.
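The perceive, reason, validate and execute stages described above can be illustrated with a minimal Python sketch. It is not the DreamOps implementation. Every name in it (`Incident`, `Step`, `ALLOWED_ACTIONS`, `detect`, `plan_with_stub_agent`, `validate`, `execute`, `run_pipeline`) is hypothetical. A simple z-score check stands in for the machine-learning classifiers and anomaly detectors, and a canned rule stands in for the LLM reasoning agent.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Optional


@dataclass
class Incident:
    service: str
    metric: str
    severity: str  # "low" or "critical" (assumed two-level scale)


@dataclass
class Step:
    action: str
    target: str
    critical: bool = False  # critical steps require human approval


# Hypothetical allow-list: only these actions may ever be executed.
ALLOWED_ACTIONS = {"restart_service", "scale_out", "rollback_deploy"}


def detect(service: str, metric: str, history: List[float],
           latest: float, threshold: float = 3.0) -> Optional[Incident]:
    """Perception stand-in: flag `latest` if it lies more than `threshold`
    population standard deviations from the mean of `history`."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return None
    z = abs(latest - mu) / sigma
    if z < threshold:
        return None
    severity = "critical" if z >= 2 * threshold else "low"
    return Incident(service, metric, severity)


def plan_with_stub_agent(incident: Incident) -> List[Step]:
    """Reasoning stand-in: a real system would prompt an LLM with TSGs and
    incident context; here a canned plan is returned instead."""
    steps = [Step("restart_service", incident.service)]
    if incident.severity == "critical":
        steps.append(Step("rollback_deploy", incident.service, critical=True))
    return steps


def validate(steps: List[Step]) -> List[Step]:
    """Reject any plan containing an action outside the allow-list."""
    bad = [s.action for s in steps if s.action not in ALLOWED_ACTIONS]
    if bad:
        raise ValueError(f"plan contains disallowed actions: {bad}")
    return steps


def execute(steps: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order; critical steps run only if `approve` returns True.
    Execution is simulated by appending to a log."""
    log = []
    for s in steps:
        if s.critical and not approve(s):
            log.append(f"SKIPPED {s.action} on {s.target} (not approved)")
            continue
        log.append(f"RAN {s.action} on {s.target}")
    return log


def run_pipeline(service: str, metric: str, history: List[float],
                 latest: float, approve: Callable[[Step], bool]) -> List[str]:
    incident = detect(service, metric, history, latest)
    if incident is None:
        return []
    return execute(validate(plan_with_stub_agent(incident)), approve)
```

Calling `run_pipeline("checkout", "latency_ms", [100, 102, 98, 101, 99], 120, approve=lambda s: False)` in this sketch runs the non-critical restart and skips the critical rollback because the approver declines it. The allow-list in `validate` is one simple way to keep execution deterministic regardless of what the reasoning stage proposes.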