Modern cloud and data center environments rely on large-scale distributed storage systems. Diagnosing configuration errors, software bugs, and performance anomalies in such systems has become a major problem for large web hosting sites.
As part of a larger project that endeavors to design and prototype interactive, guided modelling for such systems, we introduce Semantic-Aware Resource Anomaly Detection (SARAD) and Program-Aware Anomaly Detection (PAAD), two low-overhead, real-time solutions for detecting runtime anomalies in storage systems. Both SARAD and PAAD are based on the key observation that most state-of-the-art storage server architectures are multi-threaded and structured as a set of repeatable modules, which we call stages, and hence provide good opportunities for statistical modelling and anomaly detection.
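To make the notion of a stage concrete, the following is a minimal sketch of the kind of staged, multi-threaded module the observation refers to: a self-contained unit that repeatedly pulls tasks from a queue and processes them on worker threads. All names here are illustrative; none are taken from HBase, HDFS, or Cassandra.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative "stage": a repeatable module that drains a task queue.
// Each dequeued task corresponds to one stage instance, whose resource
// usage can be measured and compared against its many peer instances.
final class Stage implements Runnable {
    private final String name;
    private final BlockingQueue<Runnable> inbox = new LinkedBlockingQueue<>();

    Stage(String name) { this.name = name; }

    void submit(Runnable task) { inbox.add(task); }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Runnable task = inbox.take();          // one stage instance
                long start = System.nanoTime();
                task.run();
                long durationNs = System.nanoTime() - start;
                record(name, durationNs);              // stage-level signal
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private static void record(String stage, long durationNs) {
        // Placeholder: forward the per-instance measurement to a
        // monitoring sink for later statistical analysis.
    }
}
```

Because every instance of a given stage executes the same code, the measurements collected by such a module form a population that lends itself naturally to statistical comparison.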
SARAD and PAAD leverage this observation to collect stage-level resource consumption and log summaries at runtime and to perform statistical analysis across stage instances. Stages that exhibit either i) abnormal resource usage patterns, or ii) rare execution flows or unusually high durations for regular flows at runtime indicate anomalies. Both methods make two key contributions: i) they limit the search space for root causes by pinpointing specific anomalous code stages, and ii) they reduce the compute and storage requirements of monitoring-data and log analysis, while preserving accuracy, through information summarization.
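As a sketch of what cross-instance statistical analysis can look like, the snippet below flags a stage instance whose duration deviates strongly from its peer instances. The z-score test and the 3-sigma threshold are hypothetical stand-ins chosen for illustration; the abstract does not fix the exact statistic, and the same peer-comparison idea would apply to other stage-level signals such as execution-flow frequencies.

```java
import java.util.List;

// Hypothetical cross-instance detector: compare one stage instance's
// duration against the population of its peer instances and flag it
// when it lies more than THRESHOLD standard deviations from the mean.
final class StageAnomalyDetector {
    private static final double THRESHOLD = 3.0;   // illustrative 3-sigma cutoff

    static boolean isAnomalous(double durationMs, List<Double> peerDurationsMs) {
        double mean = peerDurationsMs.stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = peerDurationsMs.stream()
                .mapToDouble(d -> (d - mean) * (d - mean)).average().orElse(0.0);
        double stdDev = Math.sqrt(variance);
        if (stdDev == 0.0) return false;           // no spread, nothing to flag
        return Math.abs(durationMs - mean) / stdDev > THRESHOLD;
    }
}
```

Note that this sketch judges an instance only relative to its peers, mirroring the cross-instance comparison described above; a deployed detector would additionally need windowing and per-workload baselines, which are omitted here.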
We evaluated both methods on three distributed storage systems: HBase, the Hadoop Distributed File System (HDFS), and Cassandra. We show that, with practically zero overhead, both methods uncover a variety of anomalies in real time.