TY - GEN
T1 - ViNet: Pushing the Limits of Visual Modality for Audio-Visual Saliency Prediction
T2 - 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2021
AU - Jain, Samyak
AU - Yarlagadda, Pradeep
AU - Jyoti, Shreyank
AU - Karthik, Shyamgopal
AU - Subramanian, Ramanathan
AU - Gandhi, Vineet
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset and, to our knowledge, is the first model to do so. We also explore a variation of the ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models [1] for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
UR - http://www.scopus.com/inward/record.url?scp=85124220434&partnerID=8YFLogxK
UR - https://www.iros2021.org/
DO - 10.1109/IROS51168.2021.9635989
M3 - Conference contribution
AN - SCOPUS:85124220434
SN - 9781665417150
T3 - IEEE International Conference on Intelligent Robots and Systems
SP - 3520
EP - 3527
BT - IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2021
A2 - Přeučil, Libor
A2 - Babuška, Robert
PB - Institute of Electrical and Electronics Engineers (IEEE)
CY - United States
Y2 - 27 September 2021 through 1 October 2021
ER -