Poster in Workshop: Machine Learning for Autonomous Driving
Temporal Transductive Inference for Few-Shot Video Object Segmentation
Mennatullah Siam · Richard Wildes
Few-shot object segmentation has focused on segmenting static images in the query set. Recently, few-shot video object segmentation (FS-VOS), where the query images to be segmented belong to a video, has been introduced but remains under-explored. We propose a simple but effective temporal transductive inference (TTI) that uses the temporal continuity in videos to improve segmentation given a few-shot support set. We use both global and local cues: global cues encourage a consistent prototype at the sequence level, whereas local cues encourage a consistent foreground/background region proportion within a local temporal window. Our model outperforms its state-of-the-art attention-based counterpart on few-shot YouTube-VIS by 2% in mean intersection over union (mIoU). Finally, we propose a more realistic FS-VOS setup that operates cross-domain, where our method outperforms a transductive inference baseline that uses static images by 1.3% on two different benchmarks. These results demonstrate that our method is a promising direction and open the door towards label-efficient annotation of video datasets with rare classes that occur in robotics settings such as autonomous driving.
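To make the two cues concrete, below is a minimal PyTorch sketch of how the global and local regularizers described in the abstract might look. All names (`tti_losses`, the tensor shapes, the cosine-similarity and moving-average formulations) are assumptions for illustration; the paper's actual TTI objective may be formulated differently.

```python
import torch
import torch.nn.functional as F


def tti_losses(query_feats, support_proto, probs, window=5):
    """Hypothetical sketch of the two TTI regularizers.

    query_feats:   (T, C, H, W) per-frame features of the query video
    support_proto: (C,) foreground prototype from the few-shot support set
    probs:         (T, 1, H, W) predicted foreground probabilities being refined
    window:        odd temporal window size for the local cue
    """
    # Global cue: the masked-average foreground prototype of each frame
    # should agree with a single sequence-level prototype, which in turn
    # should agree with the support-set prototype.
    num = (query_feats * probs).sum(dim=(2, 3))        # (T, C)
    den = probs.sum(dim=(2, 3)).clamp_min(1e-6)        # (T, 1)
    frame_protos = num / den                           # (T, C)
    seq_proto = frame_protos.mean(dim=0)               # (C,)
    global_loss = (
        (1 - F.cosine_similarity(frame_protos, seq_proto[None], dim=1)).mean()
        + (1 - F.cosine_similarity(seq_proto, support_proto, dim=0))
    )

    # Local cue: the foreground region proportion should stay consistent
    # within a local temporal window, here enforced against a moving average.
    props = probs.mean(dim=(1, 2, 3))                  # (T,) fg proportion per frame
    smoothed = F.avg_pool1d(
        props[None, None], kernel_size=window, stride=1, padding=window // 2
    ).squeeze()
    local_loss = F.mse_loss(props, smoothed.detach())

    return global_loss, local_loss
```

In this reading, transductive inference would minimize a weighted sum of the two losses over the unlabeled query video at test time, refining `probs` while the few-shot support set anchors the foreground prototype.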