Generation-Based Multi-Modal Anomaly Detection for Nuclear Fusion Target Polishing
Abstract
Real-time detection of anomalous operating states is crucial for manufacturing systems, but prior work often treats the visual and vibration modalities separately, limiting robustness. We present the first multi-modal anomaly-detection study on a nuclear-fusion target polishing testbed, analyzing uncompressed video alongside vibration waveforms. Framing the task as a generation problem rather than a classification problem, we introduce a reconstruction-based anomaly score that extends the autoencoder paradigm to autoregressive models. We compare two architectures: (i) a spatiotemporal Vision Transformer and (ii) Large Language Models adapted to time-frequency tokens. Both approaches are evaluated on our new Polishing-Fusion-200 benchmark (196 synchronized video-vibration episodes), with ablations on individual modalities and minimal fine-tuning across LLM families. This study introduces an end-to-end pipeline for video-plus-vibration anomaly detection and demonstrates a generation-based scoring strategy that avoids domain-specific heads.
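To make the generation-based scoring idea concrete, the following is a minimal sketch of one plausible instantiation, not the paper's reported implementation: the per-token negative log-likelihood of an autoregressive model plays the role that reconstruction error plays for an autoencoder. All names and tensor shapes here are assumptions.

```python
import torch

def anomaly_score(model, tokens: torch.Tensor) -> float:
    """Generation-based anomaly score (hypothetical sketch).

    Assumes `model` is an autoregressive model whose forward pass
    returns logits of shape (batch, seq_len, vocab) and that `tokens`
    has shape (batch, seq_len). Higher scores mark inputs the model
    predicts poorly, i.e. likely anomalies -- the autoregressive
    analogue of autoencoder reconstruction error.
    """
    with torch.no_grad():
        logits = model(tokens)  # (B, T, V)
    # Predict token t from positions < t: shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = tokens[:, 1:]
    # Negative log-likelihood of each observed next token.
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Mean prediction error over the sequence is the anomaly score.
    return nll.mean().item()
```

Because the score is just the model's own prediction error, the same scoring function applies unchanged to video tokens and to time-frequency vibration tokens, which is what lets the approach avoid domain-specific heads.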