Towards Generalizable Context-aware Anomaly Detection: A Large-scale Benchmark in Cloud Environments

¹UC San Diego ²Massachusetts Institute of Technology ³Georgia Institute of Technology ⁴Carnegie Mellon University ⁵Tongji University ⁶Tsinghua University ⁷University of Pennsylvania
* Corresponding author
x9zou@ucsd.edu, xuanj@mit.edu
CloudAnoBench poster

Abstract

Anomaly detection in cloud environments remains both critical and challenging. Existing context-level benchmarks typically focus on either metrics or logs and often lack reliable annotation, while most detection methods emphasize point anomalies within a single modality, overlooking contextual signals and limiting real-world applicability. Constructing a benchmark for context anomalies that combines metrics and logs is inherently difficult: reproducing anomalous scenarios on real servers is often infeasible or potentially harmful, while generating synthetic data introduces the additional challenge of maintaining cross-modal consistency. We introduce CloudAnoBench, a large-scale benchmark for context anomalies in cloud environments, comprising 28 anomalous scenarios and 16 deceptive normal scenarios, with 1,252 labeled cases and roughly 200,000 log and metric entries. Compared with prior benchmarks, CloudAnoBench exhibits higher ambiguity and greater difficulty, and both prior machine learning methods and vanilla LLM prompting perform poorly on it. To demonstrate its utility, we further propose CloudAnoAgent, an LLM-based agent that integrates metrics and logs and is enhanced by symbolic verification. The agent achieves substantial improvements in both anomaly detection and scenario identification on CloudAnoBench and generalizes well to existing datasets. Together, CloudAnoBench and CloudAnoAgent lay the groundwork for advancing context-aware anomaly detection in cloud systems.

CloudAnoBench construction overview

CloudAnoBench Construction: The benchmark is constructed by systematically extracting anomaly scenarios from real-world reports and academic literature, then generating multimodal data through a hybrid pipeline. Metric patterns are synthesized via controlled code execution, while log messages are produced with the assistance of large language models and aligned with the metric trends. To ensure realism and reliability, benign noise and deceptive cases are added, and quality control is enforced through automatic checks, symbolic validation, and human review. This process guarantees cross-modal consistency and produces diverse, reproducible anomaly cases.
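
To make the hybrid pipeline concrete, here is a minimal sketch of what one generation step could look like, assuming an injected CPU-utilization spike as the anomaly pattern. Every function name, threshold, and log template below is an illustrative assumption; in the actual pipeline the log text is produced with LLM assistance rather than from a fixed template.

```python
import numpy as np

def synthesize_cpu_spike(length=300, spike_start=120, spike_len=40, seed=0):
    """Synthesize a CPU-utilization series with an injected sustained spike.

    Baseline noise ~N(30, 3); the spike window saturates near 95%.
    All parameters are illustrative, not the benchmark's actual settings.
    """
    rng = np.random.default_rng(seed)
    series = rng.normal(30.0, 3.0, size=length).clip(0, 100)
    series[spike_start:spike_start + spike_len] = (
        rng.normal(95.0, 1.5, size=spike_len).clip(0, 100)
    )
    return series

def emit_aligned_logs(series, threshold=80.0):
    """Stand-in for the LLM-assisted log generator: emit a templated log line
    whenever the metric crosses the threshold, so the two modalities stay
    consistent by construction."""
    logs = []
    for t, v in enumerate(series):
        if v > threshold:
            logs.append(f"t={t:03d} WARN kernel: CPU utilization {v:.1f}% exceeds soft limit")
    return logs

metrics = synthesize_cpu_spike()
logs = emit_aligned_logs(metrics)
print(f"{len(logs)} log lines aligned with the injected spike")
```

Because the log emitter reads the synthesized series directly, cross-modal consistency holds by construction; the benchmark's additional automatic checks, symbolic validation, and human review then guard the LLM-generated text against drift from the metric trend.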

CloudAnoBench benchmark overview

CloudAnoBench Overview: CloudAnoBench is a large-scale benchmark designed for context-aware anomaly detection in cloud environments. It contains 1,252 labeled cases with roughly 200,000 log and metric entries, spanning 28 anomalous scenarios and 16 deceptive normal scenarios. Unlike prior datasets that focus only on point anomalies or a single modality, CloudAnoBench jointly incorporates metrics and logs and introduces deceptive cases where abnormal-looking metrics are clarified by benign log events. This design increases ambiguity and difficulty, forcing models to reason across modalities and handle real-world complexity.
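
For concreteness, here is a hypothetical layout of a single deceptive normal case, where an abnormal-looking disk-I/O burst is explained by benign backup logs. The field names and values are invented for illustration and are not CloudAnoBench's actual schema.

```python
# Hypothetical layout for one labeled case; all fields are illustrative.
deceptive_normal_case = {
    "case_id": "example-0001",
    "scenario": "scheduled_backup",   # e.g., one of the 16 deceptive normal scenarios
    "label": "normal",                # abnormal-looking metrics, benign cause
    "metrics": {
        "disk_io_mbps": [12.1, 11.8, 250.4, 262.9, 255.0, 13.2],  # I/O burst
        "cpu_util_pct": [22.0, 21.5, 45.3, 47.1, 44.8, 23.0],
    },
    "logs": [
        "02:00:00 INFO backup-agent: nightly snapshot started",
        "02:00:04 INFO backup-agent: streaming volume vol-0 to object store",
        "02:06:12 INFO backup-agent: snapshot completed (35.4 GB)",
    ],
}

# A detector that reads only `metrics` would likely flag the I/O burst;
# the logs clarify it as a planned backup, so the case label is "normal".
```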

CloudAnoAgent pipeline overview

CloudAnoAgent Overview: CloudAnoAgent is an LLM-based agent framework enhanced with symbolic verification for multimodal anomaly detection. It adopts a Fast and Slow Detection design: the Metrics Agent provides responsive detection over time-series signals, while the Log Agent interprets unstructured event semantics. Their outputs are integrated by an agent layer and further validated by a Symbolic Verifier, which performs statistical checks on metrics and regex-based validation over logs. This neuro-symbolic design reduces false positives, improves interpretability, and achieves stronger performance in both anomaly detection and scenario identification across benchmarks.
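
Below is a minimal sketch of the symbolic-verification idea, assuming a z-score check over the metric series and regex matching over log lines. The specific thresholds, patterns, and veto rule are assumptions made for illustration, not the paper's actual verifier.

```python
import re
import statistics

# Illustrative pattern sets; a real verifier would use scenario-specific rules.
ANOMALY_PATTERNS = [r"\bOOM\b", r"\bsegfault\b", r"connection refused"]
BENIGN_PATTERNS = [r"backup.*(started|completed)", r"scheduled maintenance"]

def metric_check(series, z_thresh=3.0):
    """Statistical check: indices whose z-score exceeds the threshold."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1e-9
    return [i for i, v in enumerate(series) if abs(v - mu) / sigma > z_thresh]

def log_check(logs):
    """Regex check: (has_anomalous_event, has_benign_explanation)."""
    anomalous = any(re.search(p, line, re.I) for p in ANOMALY_PATTERNS for line in logs)
    benign = any(re.search(p, line, re.I) for p in BENIGN_PATTERNS for line in logs)
    return anomalous, benign

def verify(agent_verdict, series, logs):
    """Veto an 'anomaly' verdict when metrics deviate but the logs offer a
    benign explanation and no anomalous event appears."""
    outliers = metric_check(series)
    anomalous_log, benign_log = log_check(logs)
    if agent_verdict == "anomaly" and outliers and benign_log and not anomalous_log:
        return "normal"   # deceptive normal case: deviation explained by logs
    return agent_verdict

# Example: the I/O burst is a statistical outlier, but the logs explain it,
# so the agents' anomaly verdict is overturned.
series = [12.0] * 50 + [250.0] * 5 + [12.0] * 50
logs = ["02:00:00 INFO backup-agent: nightly snapshot started"]
print(verify("anomaly", series, logs))  # -> normal
```

Framing the verifier as a critic over the agents' verdict, rather than as a standalone detector, is what lets this neuro-symbolic layer cut false positives on deceptive normal cases while leaving genuine anomaly verdicts untouched.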

Experimental results overview

Experimental Results: Experiments on CloudAnoBench demonstrate that CloudAnoAgent substantially outperforms all baseline methods across anomaly detection and scenario identification tasks. Compared with machine learning models and vanilla LLM prompting, CloudAnoAgent achieves the highest average F1-score of 91.3% and the lowest false positive rate of 12.9%, indicating its robustness against deceptive normal cases. The integration of the symbolic verifier further reduces false alarms by 4% and increases F1-score by 2%, confirming its role as a reliable critic that enhances detection stability. For scenario identification, CloudAnoAgent improves accuracy by 13.7% over vanilla LLMs and 20% over ML baselines, though performance decreases as the number of anomaly scenarios increases due to longer context windows and higher semantic overlap. Moreover, CloudAnoAgent exhibits strong generalization ability on other datasets such as HDFS v1, Thunderbird, and BGL, achieving performance comparable to specialized log-based models like LogLLM and LogBERT. Overall, these results highlight CloudAnoAgent's capability to handle multimodal, context-level anomalies and its adaptability to diverse real-world cloud monitoring scenarios.

BibTeX

@misc{zou2025generalizablecontextawareanomalydetection,
  title={Towards Generalizable Context-aware Anomaly Detection: A Large-scale Benchmark in Cloud Environments}, 
  author={Xinkai Zou and Xuan Jiang and Ruikai Huang and Haoze He and Parv Kapoor and Hongrui Wu and Yibo Wang and Jian Sha and Xiongbo Shi and Zixun Huang and Jinhua Zhao},
  year={2025},
  eprint={2508.01844v2},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.01844v2}, 
}