Video Understanding: Data Privacy, Pipeline Simplicity, and Implementation Efficiency
Abstract
Video understanding aims to automatically detect objects of interest in videos and to recognize scenes, actions, content, or attributes. Well-known video tasks include human action recognition, spatio-temporal action localization (a.k.a. action detection), and text spotting. We improve existing performance-driven approaches to video understanding in three aspects: privacy in data sharing, simplicity in model design, and efficiency in hardware deployment.
For privacy-aware data sharing, we formulate a novel adversarial training framework that learns an anonymization transform for input videos, explicitly optimizing the trade-off between target utility-task performance and the associated privacy budget on the anonymized videos. Because few public datasets provide both utility and privacy labels, we construct a new dataset, termed PA-HMDB51, with both target-task labels (action) and selected privacy attributes (skin color, face, gender, nudity, and relationship) annotated on a per-frame basis.
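One way to write such an adversarial utility-privacy trade-off is sketched below; the notation is illustrative and not taken from the dissertation (here $f_A$ is the anonymizer, $f_T$ the target-task model, $f_B$ the privacy-attribute adversary, and $\gamma$ a trade-off weight):

```latex
\min_{f_A}\;\Big[\underbrace{\min_{f_T} L_T\big(f_T(f_A(X)),\,Y_T\big)}_{\text{utility: target-task loss}}
\;-\;\gamma\,\underbrace{\min_{f_B} L_B\big(f_B(f_A(X)),\,Y_B\big)}_{\text{privacy budget: best adversary's loss}}\Big]
```

Under this reading, the anonymizer $f_A$ is trained so that the target task remains solvable on anonymized videos while even the best-trained adversary's predictions of the privacy attributes degrade.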
For efficiency-driven hardware deployment, we propose a multi-stage image processor that accounts for videos' redundancy, continuity, and mixed degradations. The model is pruned and quantized in a hardware-friendly way to further reduce energy consumption. Our energy-efficient video text spotting solution achieves a better trade-off between energy efficiency and recognition performance than all previous methods.
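As a minimal sketch of what hardware-friendly compression can look like, the snippet below combines magnitude pruning with symmetric uniform quantization of a weight tensor. It is a generic illustration, not the dissertation's actual scheme: the function name, the default sparsity, and the bit width are all assumptions, and structured details such as channel-level pruning or per-layer bit widths are omitted.

```python
import numpy as np

def prune_and_quantize(weights, sparsity=0.5, num_bits=8):
    """Magnitude-prune, then uniformly quantize a weight tensor.

    Illustrative sketch only; `sparsity` and `num_bits` are assumed defaults.
    """
    w = np.asarray(weights, dtype=np.float64)
    # 1) Magnitude pruning: zero out the smallest-magnitude fraction of weights.
    threshold = np.quantile(np.abs(w), sparsity)
    pruned = np.where(np.abs(w) < threshold, 0.0, w)
    # 2) Symmetric uniform quantization to signed num_bits integers.
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(pruned))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(pruned / scale), -qmax, qmax)
    q = q.astype(np.int8 if num_bits <= 8 else np.int32)
    return q, scale  # dequantize with q * scale

w = np.array([0.02, -0.9, 0.4, -0.05, 0.7, 0.01])
q, s = prune_and_quantize(w, sparsity=0.5, num_bits=8)
```

Storing only the int8 codes plus one scale per tensor is what makes this layout cheap for integer-arithmetic accelerators: the dequantized value is recovered as `q * s`.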
For simplicity-guided model design, we propose a new Transformer-based paradigm for video action detection that removes the need for specialized components such as pre-trained person/object detectors, region proposal networks (RPNs), and memory banks, while achieving superior performance. Our Transformer-based Video Action Detector (TxVAD) uses two types of Transformers: one captures scene-context information for person localization, and the other captures long-range spatio-temporal context for action classification.
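The two-Transformer division of labor can be sketched as below. This is an illustrative skeleton only, not the actual TxVAD architecture: the class name `TxVADSketch`, the token layout, layer sizes, heads, and the per-token box/classification heads are all assumptions.

```python
import torch
import torch.nn as nn

class TxVADSketch(nn.Module):
    """Illustrative two-Transformer action detector skeleton (not the real TxVAD)."""

    def __init__(self, dim=64, num_classes=10):
        super().__init__()
        def enc():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.scene_tx = enc()       # scene-context Transformer, used for person localization
        self.temporal_tx = enc()    # long-range spatio-temporal Transformer, for action classification
        self.box_head = nn.Linear(dim, 4)           # (x, y, w, h) per token; sketch, no query decoding
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                      # tokens: (B, T*N, dim) video patch tokens
        scene = self.scene_tx(tokens)               # scene context for localization
        boxes = self.box_head(scene)                # one box estimate per token
        ctx = self.temporal_tx(scene)               # spatio-temporal context across frames
        logits = self.cls_head(ctx.mean(dim=1))     # clip-level action logits
        return boxes, logits

tokens = torch.randn(2, 8 * 16, 64)                 # 2 clips, 8 frames x 16 tokens per frame
boxes, logits = TxVADSketch()(tokens)
```

The point of the sketch is the pipeline shape: a single token stream passes through two attention stages, so no external detector, RPN, or memory bank appears anywhere in the forward pass.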
Citation
Wu, Zhenyu (2021). Video Understanding: Data Privacy, Pipeline Simplicity, and Implementation Efficiency. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/195430.