BEDTIME: A Unified Benchmark for Automatically Describing Time Series
Abstract
Many recent works have proposed general-purpose foundation models for a wide range of time series analysis tasks. However, most models are introduced alongside new datasets, so few head-to-head comparisons exist. These works also often study complex tasks, making it hard to isolate specific model capabilities. To address these gaps, we formalize and evaluate three tasks that test a model's ability to describe time series using language: (1) Recognition, (2) Differentiation, and (3) Generation. We then unify four recent datasets to enable head-to-head model comparisons on each task. In evaluating 13 state-of-the-art language models (LLMs), vision--language models (VLMs), and time series--language models, we find that (1) popular language-only methods largely underperform, indicating a need for time series-specific architectures, (2) VLMs are quite successful, as expected, highlighting the value of vision models for these tasks, and (3) pre-trained multimodal time series--language models outperform LLMs but still leave significant room for improvement. We also find that all approaches exhibit clear fragility across a range of robustness tests. Overall, our benchmark provides a standardized evaluation of a fundamental task, a step toward enabling capable time series reasoning systems.