TY - GEN
T1 - Can computers learn from humans to see better? Inferring scene semantics from viewers' eye movements
AU - Subramanian, Ramanathan
AU - Yanulevskaya, Victoria
AU - Sebe, Nicu
PY - 2011
Y1 - 2011
N2 - This paper describes an attempt to bridge the semantic gap between computer vision and scene understanding by employing eye movements. Even as computer vision algorithms can efficiently detect scene objects, discovering semantic relationships between these objects is equally essential for scene understanding. Humans understand complex scenes by rapidly moving their eyes (saccades) to selectively focus on salient entities (fixations). For 110 social scenes, we compared verbal descriptions provided by observers against eye movements recorded during a free-viewing task. Data analysis confirms (i) a strong correlation between task-explicit linguistic descriptions and task-implicit eye movements, both of which are influenced by underlying scene semantics, and (ii) the ability of eye movements, in the form of fixations and saccades, to indicate salient entities and entity relationships mentioned in scene descriptions. We demonstrate how eye movements are useful for inferring the meaning of social scenes (everyday scenes depicting human activities) and affective scenes (emotion-evoking content such as expressive faces and nudes). While saliency has traditionally been studied through the prism of fixations, we show that saccades are particularly useful for (i) distinguishing mild and high-intensity facial expressions and (ii) discovering interactive actions between scene entities.
AB - This paper describes an attempt to bridge the semantic gap between computer vision and scene understanding by employing eye movements. Even as computer vision algorithms can efficiently detect scene objects, discovering semantic relationships between these objects is equally essential for scene understanding. Humans understand complex scenes by rapidly moving their eyes (saccades) to selectively focus on salient entities (fixations). For 110 social scenes, we compared verbal descriptions provided by observers against eye movements recorded during a free-viewing task. Data analysis confirms (i) a strong correlation between task-explicit linguistic descriptions and task-implicit eye movements, both of which are influenced by underlying scene semantics, and (ii) the ability of eye movements, in the form of fixations and saccades, to indicate salient entities and entity relationships mentioned in scene descriptions. We demonstrate how eye movements are useful for inferring the meaning of social scenes (everyday scenes depicting human activities) and affective scenes (emotion-evoking content such as expressive faces and nudes). While saliency has traditionally been studied through the prism of fixations, we show that saccades are particularly useful for (i) distinguishing mild and high-intensity facial expressions and (ii) discovering interactive actions between scene entities.
KW - Eye movements
KW - Fixations and saccades
KW - Salient entities and interactions
KW - Scene semantics
UR - http://www.scopus.com/inward/record.url?scp=84455173231&partnerID=8YFLogxK
U2 - 10.1145/2072298.2072305
DO - 10.1145/2072298.2072305
M3 - Conference contribution
AN - SCOPUS:84455173231
SN - 9781450306164
T3 - MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops
SP - 33
EP - 42
BT - MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops
A2 - Candan, K. Selçuk
A2 - Panchanathan, Sethuraman
A2 - Prabhakaran, Balakrishnan
A2 - Sundaram, Hari
A2 - Feng, Wu-Chi
A2 - Sebe, Nicu
PB - Association for Computing Machinery (ACM)
CY - United States
T2 - 19th ACM International Conference on Multimedia ACM Multimedia 2011, MM'11
Y2 - 28 November 2011 through 1 December 2011
ER -