Poster in Affinity Workshop: Black in AI
Interactive Video Saliency Prediction: The Stacked ConvLSTM Approach
Natnael Argaw Wondimu · Cedric Buche · Ubbo Visser
Keywords: Computer Vision
Research in the cognitive science and neuroscience of attention suggests the use of spatio-temporal features for efficient video saliency prediction, since such features naturally represent data collected across space and time, such as videos. Video saliency prediction aims to find visually salient regions in a stream of images. Many video saliency prediction models have been proposed in the past couple of years, yet the problem remains a considerable challenge, mainly due to the complex nature of the task and the scarcity of representative saliency benchmarks. Given the importance of saliency identification for various computer vision tasks, revising and enhancing the performance of video saliency prediction models is crucial. To this end, we propose a novel interactive video saliency prediction model that employs a stacked ConvLSTM-based architecture together with a novel XY-shift frame differencing custom layer. Specifically, we introduce an encoder-decoder architecture with a prior layer that performs XY-shift frame differencing, a residual layer that fuses spatially processed (VGG-16-based) features with the XY-shift frame-differenced frames, and a stacked ConvLSTM component. Extensive experiments on the largest video saliency dataset, DHF1K, show that our model performs competitively against state-of-the-art models.
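To make the architecture description concrete, the following is a minimal PyTorch sketch of such a pipeline: an XY-shift frame-differencing step, a VGG-16 spatial encoder, residual fusion of the motion cue with the spatial features, a stacked ConvLSTM, and a simple decoder. The exact shift operation, channel widths, stride choices, and the two-layer ConvLSTM stack are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16


class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four gates come from a single convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


def xy_shift_difference(clip, shift=1):
    """Assumed XY-shift frame differencing: each frame minus the previous
    frame shifted by `shift` pixels along both x and y (zero-padded)."""
    prev, curr = clip[:, :-1], clip[:, 1:]
    shifted = F.pad(prev, (shift, 0, shift, 0))[..., : prev.shape[-2], : prev.shape[-1]]
    return curr - shifted                                   # (B, T-1, 3, H, W)


class StackedConvLSTMSaliency(nn.Module):
    def __init__(self, hid_ch=64, num_layers=2):
        super().__init__()
        # Spatial encoder: VGG-16 up to conv4_3 (512 channels, stride 8).
        # In practice, ImageNet-pretrained weights would be loaded here.
        self.encoder = vgg16(weights=None).features[:23]
        # Project the 3-channel motion cue to the same shape as the VGG features.
        self.motion_proj = nn.Conv2d(3, 512, 3, stride=8, padding=1)
        self.cells = nn.ModuleList(
            [ConvLSTMCell(512 if i == 0 else hid_ch, hid_ch) for i in range(num_layers)]
        )
        # Decoder: 1x1 conv to a single saliency channel, upsampled to input size.
        self.decoder = nn.Sequential(
            nn.Conv2d(hid_ch, 1, 1),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Sigmoid(),
        )

    def forward(self, clip):          # clip: (B, T, 3, H, W), H and W divisible by 8
        motion = xy_shift_difference(clip)
        states = None
        for t in range(motion.shape[1]):
            feat = self.encoder(clip[:, t + 1])              # spatial (VGG-16) features
            x = feat + self.motion_proj(motion[:, t])        # residual fusion with motion cue
            if states is None:
                zeros = torch.zeros(x.shape[0], self.cells[0].hid_ch, *x.shape[-2:],
                                    device=x.device)
                states = [(zeros.clone(), zeros.clone()) for _ in self.cells]
            for i, cell in enumerate(self.cells):            # stacked ConvLSTM
                h, c = cell(x, states[i])
                states[i] = (h, c)
                x = h
        return self.decoder(x)        # saliency map for the final frame


# Example: a 5-frame 224x224 clip produces one 224x224 saliency map.
model = StackedConvLSTMSaliency()
saliency = model(torch.randn(1, 5, 3, 224, 224))             # -> (1, 1, 224, 224)
```

In this sketch the recurrent stack carries temporal context across frames while the frame-differencing branch injects an explicit short-term motion signal; how the actual model fuses and decodes these features may differ in detail.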