Deep Feature Fusion for Video-Based Action Recognition
Abstract
3-Dimensional Convolutional Neural Networks (3D ConvNets) have recently been adopted for video-based action recognition. Many 3D ConvNets, such as C3D, I3D, and Res3D, have been proposed and achieved great success. Model ensemble techniques have been very successful in achieving better performance than any single model, but they are impractical here, since even a single 3D ConvNets base learner is expensive to train. It therefore remains an open question how to achieve better performance by leveraging multiple 3D ConvNets models. To address this problem, we present a two-stage framework that combines multiple 3D ConvNets models at the feature level. In the first stage, we treat each pretrained 3D ConvNets model as a feature extractor and extract features from raw videos. In the second stage, we fuse the features extracted by the different models into a new video representation and train a classifier on that representation. We explore several widely used fusion methods for deep features learned by different models, in order to obtain more robust action representations from raw videos. We show that our framework outperforms any single 3D ConvNets model by a large margin and achieves performance comparable to the state of the art on two video action recognition benchmarks.
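As a rough illustration of the two-stage framework, the Python sketch below uses concatenation, one common fusion method, with a linear SVM as the second-stage classifier; the extractor callables, feature dimensions, and classifier choice are illustrative assumptions rather than the thesis's exact implementation.

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def extract_features(extractor, videos):
    # Stage one: run a pretrained 3D ConvNets model (e.g., C3D, I3D, or
    # Res3D) over each video. `extractor` is assumed to map one raw video
    # to a fixed-length feature vector.
    return np.stack([extractor(v) for v in videos])

def fuse_by_concatenation(feature_sets):
    # Stage two, fusion step: L2-normalize each model's features so that
    # no single model dominates, then concatenate along the feature axis.
    return np.concatenate([normalize(f) for f in feature_sets], axis=1)

# Usage sketch (model names and shapes are hypothetical):
#   c3d_feats = extract_features(c3d_model, train_videos)  # (N, 4096)
#   i3d_feats = extract_features(i3d_model, train_videos)  # (N, 1024)
#   fused = fuse_by_concatenation([c3d_feats, i3d_feats])  # (N, 5120)
#   clf = LinearSVC().fit(fused, train_labels)             # stage-two classifier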
Citation
Cheng, Cheng (2020). Deep Feature Fusion for Video-Based Action Recognition. Master's thesis, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/191594.