Research Article April, 2026
ICAISET

CostAgent: Self-Improving Autonomous LLM-Based Orchestration for Cost-Optimal Cloud Data Processing at Scale

Abstract

The explosive growth of data-intensive applications has created an urgent need for cost-effective cloud computing solutions. While preemptible cloud instances such as AWS Spot Instances, Azure Spot VMs, and Google Cloud Spot VMs offer substantial cost savings, their unpredictable availability makes them difficult to use for production workloads. We present CostAgent, an orchestration framework that uses large language models (LLMs) to allocate preemptible instances under a bounded-risk autonomy model. CostAgent combines four core ideas: (1) a compute-bounded autonomy framework with safety bounds under explicit infrastructure constraints, (2) a multi-objective validation layer using clamping-based enforcement across cost, performance, reliability, and security dimensions, (3) an in-context learning architecture that adapts planning using historical execution context without retraining, and (4) a practical implementation with infrastructure-as-code and checkpoint-aware recovery. Using documented experiments on 100 GB-scale workloads together with scenario-driven planning tests, we show that the prototype can generate viable plans with an average decision latency of 12.83 seconds, sustain 14,492 records/sec in the main throughput test, and recover from injected interruption conditions with bounded loss. The paper should be interpreted as an initial systems validation and implementation study rather than a full production-scale field deployment. Its central contribution is a practical architecture for cost-aware orchestration in which LLM reasoning is combined with deterministic safety enforcement under explicit constraints.
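To make the idea of clamping-based enforcement concrete, the following is a minimal sketch of how an LLM-proposed plan might be forced into operator-defined bounds before execution. It is an illustration under stated assumptions, not CostAgent's actual implementation: the `Plan` and `Bounds` types, their field names, and every numeric limit are hypothetical, and the security dimension is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical allocation plan proposed by an LLM planner."""
    spot_fraction: float        # share of workers on preemptible instances
    worker_count: int           # total number of workers
    checkpoint_interval_s: int  # seconds between checkpoints


@dataclass(frozen=True)
class Bounds:
    """Hypothetical operator-defined safety bounds (all values are assumptions)."""
    max_spot_fraction: float = 0.8        # reliability: keep some on-demand capacity
    min_workers: int = 2                  # performance: floor on parallelism
    max_workers: int = 64                 # cost: ceiling on cluster size
    min_checkpoint_interval_s: int = 30   # performance: avoid checkpoint thrashing
    max_checkpoint_interval_s: int = 600  # reliability: bound work lost on interruption


def clamp(value, low, high):
    """Return value limited to the inclusive range [low, high]."""
    return max(low, min(value, high))


def enforce(plan: Plan, bounds: Bounds):
    """Clamp every field of the plan and report which fields were adjusted."""
    safe = Plan(
        spot_fraction=clamp(plan.spot_fraction, 0.0, bounds.max_spot_fraction),
        worker_count=clamp(plan.worker_count, bounds.min_workers, bounds.max_workers),
        checkpoint_interval_s=clamp(
            plan.checkpoint_interval_s,
            bounds.min_checkpoint_interval_s,
            bounds.max_checkpoint_interval_s,
        ),
    )
    fields = ("spot_fraction", "worker_count", "checkpoint_interval_s")
    adjusted = [f for f in fields if getattr(plan, f) != getattr(safe, f)]
    return safe, adjusted


if __name__ == "__main__":
    # An overly aggressive proposal: all-spot, oversized, rarely checkpointed.
    proposed = Plan(spot_fraction=1.0, worker_count=200, checkpoint_interval_s=3600)
    safe, adjusted = enforce(proposed, Bounds())
    print(safe)
    print("adjusted fields:", adjusted)
```

Because the enforcement step is a pure, deterministic function of the proposal and the bounds, an out-of-range suggestion from the LLM is corrected rather than executed, and the list of adjusted fields can be logged for later inspection.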

Keywords

cloud computing, preemptible instances, spot instances, large language models, autonomous systems, cost optimization, distributed systems, safety validation