Show simple item record

dc.contributor.advisor          Wang, Zhangyang
dc.creator                      Wu, Zhenyu
dc.date.accessioned             2022-01-27T22:20:22Z
dc.date.available               2023-08-01T06:41:45Z
dc.date.created                 2021-08
dc.date.issued                  2021-09-02
dc.date.submitted               August 2021
dc.identifier.uri               https://hdl.handle.net/1969.1/195430
dc.description.abstract         Video understanding aims to automatically detect objects of interest in videos and recognize scenes, actions, content, or attributes. Well-known video tasks include human action recognition, spatio-temporal action localization (a.k.a. action detection), and text spotting. We improve existing performance-driven approaches to video understanding in three aspects: privacy in data sharing, simplicity in model design, and efficiency in hardware deployment. For privacy-aware data sharing, we formulate a novel adversarial training framework that learns an anonymization transform for input videos, explicitly optimizing the trade-off between target utility task performance and the associated privacy budgets on the anonymized videos. Because few public datasets carry both utility and privacy labels, we construct a new dataset, termed PA-HMDB51, with target task labels (action) and selected privacy attributes (skin color, face, gender, nudity, and relationship) annotated on a per-frame basis. For efficiency-driven hardware deployment, we propose a multi-stage image processor that accounts for the redundancy, continuity, and mixed degradation of videos; the model is pruned and quantized in a hardware-friendly way to further reduce energy consumption. The resulting energy-efficient video text spotting solution achieves a better trade-off between energy efficiency and recognition performance than previous methods. For simplicity-guided model design, we introduce a new Transformer-based paradigm for video action detection that removes the need for specialized components while achieving superior performance. Without pre-trained person/object detectors, region proposal networks (RPNs), or memory banks, our Transformer-based Video Action Detector (TxVAD) uses two types of Transformers to capture scene context and long-range spatio-temporal context for person localization and action classification, respectively.  en
dc.format.mimetype              application/pdf
dc.language.iso                 en
dc.subject                      Video Understanding  en
dc.subject                      Privacy  en
dc.subject                      Efficiency  en
dc.subject                      Simplicity  en
dc.title                        Video Understanding: Data Privacy, Pipeline Simplicity, and Implementation Efficiency  en
dc.type                         Thesis  en
thesis.degree.department        Computer Science and Engineering  en
thesis.degree.discipline        Computer Science  en
thesis.degree.grantor           Texas A&M University  en
thesis.degree.name              Doctor of Philosophy  en
thesis.degree.level             Doctoral  en
dc.contributor.committeeMember  Jiang, Anxiao
dc.contributor.committeeMember  Song, Dezhen
dc.contributor.committeeMember  Shen, Yang
dc.type.material                text  en
dc.date.updated                 2022-01-27T22:20:23Z
local.embargo.terms             2023-08-01
local.etdauthor.orcid           0000-0002-7183-6943
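
The adversarial anonymization framework summarized in the abstract alternates between a privacy "attacker" that tries to recover sensitive attributes from anonymized frames and an anonymizer/utility pair that preserves action-recognition accuracy while suppressing that attacker. The following minimal PyTorch sketch illustrates this min-max objective under stated assumptions; it is not the dissertation's implementation. The module shapes, the entropy-maximization privacy budget, and the weight lam are all assumed for the demo.

    # Minimal sketch (assumptions, not the dissertation's code) of the adversarial
    # anonymization objective: anonymizer f_A + utility model f_T vs. attacker f_B.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    anonymizer = nn.Conv2d(3, 3, kernel_size=3, padding=1)           # f_A: learned frame transform
    utility = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))   # f_T: action classifier (5 classes)
    attacker = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))  # f_B: privacy-attribute predictor

    opt_at = torch.optim.Adam(
        list(anonymizer.parameters()) + list(utility.parameters()), lr=1e-3)
    opt_b = torch.optim.Adam(attacker.parameters(), lr=1e-3)
    lam = 0.5  # utility/privacy trade-off weight (assumed)

    frames = torch.randn(4, 3, 8, 8)       # toy stand-in for video frames
    y_action = torch.randint(0, 5, (4,))   # utility (action) labels
    y_privacy = torch.randint(0, 2, (4,))  # privacy-attribute labels

    for step in range(100):
        # (1) Train the attacker to recover privacy attributes from anonymized frames.
        opt_b.zero_grad()
        loss_b = F.cross_entropy(attacker(anonymizer(frames).detach()), y_privacy)
        loss_b.backward()
        opt_b.step()

        # (2) Train anonymizer + utility model: keep the action task accurate while
        # maximizing the attacker's output entropy (chance-level privacy predictions).
        opt_at.zero_grad()
        anon = anonymizer(frames)
        loss_t = F.cross_entropy(utility(anon), y_action)
        log_p = F.log_softmax(attacker(anon), dim=1)
        entropy = -(log_p.exp() * log_p).sum(dim=1).mean()
        (loss_t - lam * entropy).backward()
        opt_at.step()

The alternating updates mirror standard GAN-style adversarial training; the actual framework may use a different anonymizer architecture and a different privacy-budget objective than the entropy term sketched here.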

