Hierarchical feature representation for unconstrained video analysis
Data compression, Video analysis, Visual learning
Complex video analysis is a challenging problem due to the long and sophisticated temporal structure of unconstrained videos. This paper introduces the pooled-feature representation (PFR), derived from a double-layer encoding framework (DLE), to address this problem. Considering that a complex video is composed of a sequence of simple frames, the first layer generates temporal sub-volumes from the video and represents each of them individually. The second layer constructs a pool of features by fusing the represented vectors from the first layer; the pool is compressed and then encoded to produce the video-parts vector (VPV). This framework distills the representation and extracts new information hierarchically. Compared with recent video encoding approaches, VPV preserves higher-level information through standard encoding in the higher layer. Furthermore, the encoded vectors from both layers of the DLE are fused, along with a compression stage, to form the PFR; the early and late fusion variants differ in whether compression is applied before or after concatenation of the represented vectors. To validate the proposed framework, we conduct extensive experiments on four complex action datasets: UCF50, HMDB51, URADL, and Olympic. Experimental results demonstrate that PFR with early fusion achieves state-of-the-art performance by capturing the most prominent features with minimal dimensionality.
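The pipeline the abstract describes can be illustrated with a minimal sketch. The specific operators below (mean pooling for the first layer, SVD-based PCA for compression, max pooling for the second-layer encoding, and the toy descriptor sizes) are assumptions for illustration only, not the encoders actually used in the paper:

```python
import numpy as np

def pca_compress(X, k):
    """Project rows of X onto the top-k principal directions (simple SVD PCA)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def double_layer_encode(frame_feats, n_subvols=4, k=3):
    """Layer 1: represent each temporal sub-volume; Layer 2: compress and encode the pool."""
    subvols = np.array_split(frame_feats, n_subvols)        # temporal sub-volumes
    layer1 = np.stack([sv.mean(axis=0) for sv in subvols])  # one vector per sub-volume (the pool)
    compressed = pca_compress(layer1, k)                    # compression stage
    vpv = compressed.max(axis=0)                            # layer-2 encoding -> video-parts vector
    return layer1, vpv

rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 64))  # 120 frames, 64-D descriptors (toy data)
layer1, vpv = double_layer_encode(frames, n_subvols=4, k=3)
# Early fusion (as sketched here): compress the layer-1 pool before concatenating with the VPV.
pfr_early = np.concatenate([pca_compress(layer1, 2).ravel(), vpv])
print(layer1.shape, vpv.shape, pfr_early.shape)
```

The sketch shows only the hierarchy (sub-volume representation, pooling, compression, re-encoding, and fusion); the published method's actual feature extractors and encoders should be taken from the paper itself.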
Mohammadi, Eman; Jonathan Wu, Q. M.; Saif, Mehrdad; and Yang, Yimin. (2019). Hierarchical feature representation for unconstrained video analysis. Neurocomputing, 363, 182-194.