DeepLLR-CUSUM: Sequential Change Detection with Learned Log-Likelihood Ratios for Site Reliability Engineering
Abstract
Sequential change detection in streaming telemetry must raise alerts quickly while respecting strict false-alarm limits: delayed or missed detections undermine reliability and security, whereas frequent false positives overburden operators. The core challenge is to minimize detection delay at a prescribed average run length (ARL). Classical Gaussian CUSUM is optimal only when its distributional assumptions hold, and it struggles with non-Gaussian, dependence-driven shifts that preserve lower-order moments. LSTM-based predictive methods, which monitor forecast errors, incur substantial delays under tight false-alarm control. We propose DeepLLR-CUSUM, which feeds log-likelihood ratio increments estimated by a discriminatively trained multilayer perceptron (MLP) into a CUSUM statistic whose threshold is calibrated by block bootstrap to meet ARL targets. On CESNET hourly data and on synthetic shape and dependence shifts, DeepLLR-CUSUM achieves an expected detection delay (EDD) and restricted mean survival time (RMST) of 1.2–1.3 samples, compared with 1.3–1.5 for Gaussian CUSUM and 28–55 for LSTM CUSUM, while maintaining conservative ARL and full coverage. DeepLLR-CUSUM consistently outperforms LSTM CUSUM and often outperforms Gaussian CUSUM in non-Gaussian settings, improving detection efficiency and robustness under rigorous false-alarm constraints.
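To make the pipeline described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the learned log-likelihood ratio increments are already available as a numeric array (the MLP itself is omitted), and it uses the standard one-sided CUSUM recursion W_t = max(0, W_{t-1} + l_t) with an alarm when W_t reaches a threshold h. The function names, the moving-block resampling scheme, the block size, the censoring horizon, and the candidate threshold grid are all illustrative assumptions.

```python
import numpy as np

def cusum_run_length(increments, threshold):
    """Return the 1-based time of the first alarm, or None if no alarm fires."""
    w = 0.0
    for t, inc in enumerate(increments, start=1):
        w = max(0.0, w + inc)          # one-sided CUSUM recursion
        if w >= threshold:
            return t
    return None

def block_bootstrap(increments, length, block_size, rng):
    """Moving-block bootstrap: concatenate random contiguous blocks, then truncate."""
    n = len(increments)
    n_blocks = -(-length // block_size)  # ceiling division
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    blocks = [increments[s:s + block_size] for s in starts]
    return np.concatenate(blocks)[:length]

def calibrate_threshold(null_increments, target_arl, candidates,
                        block_size=24, n_paths=200, seed=0):
    """Smallest candidate threshold whose bootstrap mean run length meets target_arl.

    Assumption: null_increments come from change-free (pre-change) data.
    Paths with no alarm are censored at the horizon, which biases the ARL
    estimate downward and therefore makes the chosen threshold conservative.
    """
    rng = np.random.default_rng(seed)
    null_increments = np.asarray(null_increments, dtype=float)
    horizon = int(5 * target_arl)        # hypothetical censoring horizon
    paths = [block_bootstrap(null_increments, horizon, block_size, rng)
             for _ in range(n_paths)]
    for h in sorted(candidates):
        run_lengths = [cusum_run_length(p, h) or horizon for p in paths]
        if np.mean(run_lengths) >= target_arl:
            return h
    return None                          # no candidate met the target
```

In use, one would pass the increments produced on change-free training telemetry to `calibrate_threshold`, then run `cusum_run_length` (or an online equivalent of its loop) on live increments with the returned threshold. Resampling contiguous blocks rather than individual samples is intended to preserve short-range dependence in the increments, which matters when the shifts of interest are dependence-driven.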