Deep Feature Fusion for Video-Based Action Recognition
Abstract
3-Dimensional Convolutional Neural Networks (3D ConvNets) have recently been adopted for video-based action recognition. Many 3D ConvNets, such as C3D, I3D, and Res3D, have been proposed and achieved great success. Model ensemble techniques have been very successful in achieving better performance than any single model, but they are impractical here, since even a single 3D ConvNets base learner is expensive to train. It therefore remains an open question how to achieve better performance by leveraging multiple 3D ConvNets models. To address this problem, we present a two-stage framework that combines multiple 3D ConvNets models at the feature level. In the first stage, we treat each pretrained 3D ConvNets model as a feature extractor and extract features from raw videos. In the second stage, we fuse the features extracted by the different models into a new video representation and train a classifier on that representation. We explore several widely used fusion methods for deep features learned by different models, in order to obtain more robust action representations from raw videos. We show that our framework outperforms any single 3D ConvNets model by a large margin and achieves performance comparable to the state of the art on two video action recognition benchmarks.
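As a rough illustration of the two-stage framework, the Python sketch below uses concatenation, one common fusion method, with a linear SVM as the second-stage classifier; the extractor callables, feature dimensions, and classifier choice are illustrative assumptions rather than the thesis's exact implementation.

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def extract_features(extractor, videos):
    # Stage one: run a pretrained 3D ConvNets model (e.g., C3D, I3D, or
    # Res3D) over each video. `extractor` is assumed to map one raw video
    # to a fixed-length feature vector.
    return np.stack([extractor(v) for v in videos])

def fuse_by_concatenation(feature_sets):
    # Stage two, fusion step: L2-normalize each model's features so that
    # no single model dominates, then concatenate along the feature axis.
    return np.concatenate([normalize(f) for f in feature_sets], axis=1)

# Usage sketch (model names and shapes are hypothetical):
#   c3d_feats = extract_features(c3d_model, train_videos)  # (N, 4096)
#   i3d_feats = extract_features(i3d_model, train_videos)  # (N, 1024)
#   fused = fuse_by_concatenation([c3d_feats, i3d_feats])  # (N, 5120)
#   clf = LinearSVC().fit(fused, train_labels)             # stage-two classifier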
Citation
Cheng, Cheng (2020). Deep Feature Fusion for Video-Based Action Recognition. Master's thesis, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/191594.