ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi

Research output: A Conference proceeding or a Chapter in Book; Conference contribution; peer-review

56 Citations (Scopus)

Abstract

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first model to do so. We also explore a variant of the ViNet architecture that feeds audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and produces the same output irrespective of the audio input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models [1] for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations that incorporate audio more effectively. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
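The CC and SIM metrics named in the abstract are commonly defined as the Pearson correlation between two saliency maps and the histogram intersection of the two maps after each is normalised to sum to one. The following is a minimal sketch of those common definitions, assuming saliency maps are given as 2D lists of non-negative numbers. The function names (`flatten`, `cc`, `sim`) are illustrative and are not taken from the ViNet repository, whose evaluation code may differ in detail.

```python
import math


def flatten(saliency_map):
    """Turn a 2D list of numbers into a flat list of floats."""
    return [float(v) for row in saliency_map for v in row]


def cc(pred, gt):
    """Pearson linear correlation coefficient between two saliency maps."""
    p, g = flatten(pred), flatten(gt)
    n = len(p)
    mean_p, mean_g = sum(p) / n, sum(g) / n
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    norm_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    norm_g = math.sqrt(sum((b - mean_g) ** 2 for b in g))
    if norm_p == 0 or norm_g == 0:
        return 0.0  # assumption: treat a constant map as uncorrelated
    return cov / (norm_p * norm_g)


def sim(pred, gt):
    """Histogram intersection of two maps, each normalised to sum to 1."""
    p, g = flatten(pred), flatten(gt)
    sum_p, sum_g = sum(p), sum(g)
    if sum_p == 0 or sum_g == 0:
        return 0.0  # assumption: an all-zero map has no overlap
    return sum(min(a / sum_p, b / sum_g) for a, b in zip(p, g))


if __name__ == "__main__":
    prediction = [[0.1, 0.2], [0.3, 0.4]]
    ground_truth = [[0.4, 0.3], [0.2, 0.1]]
    print("CC:", cc(prediction, ground_truth))
    print("SIM:", sim(prediction, ground_truth))
```

Under these definitions CC ranges from -1 to 1 and SIM from 0 to 1, with higher values indicating closer agreement between the predicted and ground-truth maps.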

Original language: English
Title of host publication: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2021
Editors: Libor Přeučil, Robert Babuška
Place of Publication: United States
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 3520-3527
Number of pages: 8
ISBN (Electronic): 9781665417143
ISBN (Print): 9781665417150
Publication status: Published - 2021
Event: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2021 - Prague, Czech Republic
Duration: 27 Sept 2021 to 1 Oct 2021

Publication series

Name: IEEE International Conference on Intelligent Robots and Systems
ISSN (Print): 2153-0858
ISSN (Electronic): 2153-0866

Conference

Conference: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2021
Abbreviated title: IROS 2021
Country/Territory: Czech Republic
City: Prague
Period: 27/09/21 to 1/10/21
