Human action recognition from video input has seen much interest over the last decade. In recent years, the trend is clearly towards action recognition in real-world, unconstrained conditions (i.e. not acted) with an ever growing number of action classes. Much of the work so far has used single frames or sequences of frames where each frame was treated individually. This paper investigates the contribution that temporal information can make to human action recognition in the context of a large number of action classes. The key contributions are: (i) We propose a complementary information channel to the Bag-of- Words framework that models the temporal occurrence of the local information in videos. (ii) We investigate the influence of sensible local information whose temporal occurrence is more vital than any local information. The experimental validation on action recognition datasets with the largest number of classes to date shows the effectiveness of the proposed approach.
|Name||2014 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2014|
|Conference||2014 International Conference on Digital Image Computing, Techniques and Applications|
|Abbreviated title||DICTA 2014|
|Period||25/11/14 → 27/11/14|