dc.contributor.advisor: Choe, Yoonsuck
dc.creator: Li, Qinbo
dc.date.accessioned: 2023-02-07T16:15:44Z
dc.date.available: 2023-02-07T16:15:44Z
dc.date.created: 2022-05
dc.date.issued: 2022-05-03
dc.date.submitted: May 2022
dc.identifier.uri: https://hdl.handle.net/1969.1/197283
dc.description.abstract: Humans are adept at using multimodal information, such as visual, auditory, and kinesthetic cues, to perceive and interact with the world. Despite the advances in single-modality deep learning over the past decade, relatively little work has focused on multimodal learning, and most existing multimodal deep learning work considers only a small number of modalities. This dissertation investigates three distinct forms of multimodal learning: multiple visual modalities as input, audio-visual input, and visual plus proprioceptive (kinesthetic) input. Specifically, in the first project we synthesize light fields from a single image and its estimated depth; in the second, we perform face recognition on unconstrained videos using audio-visual input; and in the third, we learn to construct and use tools from visual, proprioceptive, and kinesthetic input. In the first task, we synthesize light fields from a single RGB image and its estimated depth. Synthesizing novel views (light fields) from a single image is challenging because the depth information, which is crucial for view synthesis, is lost. We use a pre-trained model to estimate depth and then fuse the depth with the RGB image to generate the light field. Our experiments show that the multimodal input (RGB image and depth) significantly outperforms the single-image input. In the second task, we focus on face recognition for low-quality videos, such as low-resolution online videos and surveillance footage, where recognizing faces from video frames alone is very challenging. We propose to use the audio in the video clip to aid face recognition. To this end, we introduce the Audio-Visual Aggregation Network (AVAN), which aggregates audio and visual features using an attention mechanism. Empirical results show that using both visual and audio information significantly improves face recognition accuracy on unconstrained videos. Finally, in the third task, we use visual, proprioceptive, and kinesthetic inputs to learn to construct and use tools. Tool use in animals indicates a high level of cognitive capability and, aside from humans, is observed only in a small number of higher mammals and avian species; constructing novel tools is an even more challenging task. Learning this task from visual input alone is difficult, so we use visual and proprioceptive (kinesthetic) inputs to accelerate learning. We build a physically simulated environment for the tool-construction task and introduce a hierarchical reinforcement learning approach that learns to construct tools and reach the target without any prior knowledge. The main contribution of this dissertation is the investigation of multiple scenarios in which multimodal processing leads to enhanced performance. We expect the specific methods developed in this work, such as the extraction of hidden modalities (depth), the use of attention, and hierarchical rewards, to help us better understand multimodal processing in deep learning.
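Note: the abstract describes attention-based aggregation of audio and visual features (AVAN) only at a high level. The following is a minimal illustrative sketch of what such attention-weighted feature aggregation might look like in PyTorch; the module name, feature dimensions, and pooling scheme are assumptions made here for illustration and are not the dissertation's actual architecture.

# Hypothetical sketch of attention-based audio-visual feature aggregation,
# assuming per-frame face features and per-segment voice features are already
# extracted by upstream encoders (not shown).
import torch
import torch.nn as nn

class AudioVisualAggregator(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=192, embed_dim=256):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        # One attention score per token, used to weight features before pooling.
        self.attn = nn.Linear(embed_dim, 1)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, num_frames, visual_dim) per-frame face features
        # audio_feats:  (batch, num_segments, audio_dim) per-segment voice features
        v = self.visual_proj(visual_feats)                # (B, Tv, E)
        a = self.audio_proj(audio_feats)                  # (B, Ta, E)
        tokens = torch.cat([v, a], dim=1)                 # (B, Tv+Ta, E)
        weights = torch.softmax(self.attn(tokens).squeeze(-1), dim=1)  # (B, Tv+Ta)
        # Attention-weighted sum yields a single identity embedding per clip.
        return torch.einsum('bt,bte->be', weights, tokens)

# Usage: aggregate a clip's frame and audio features into one identity vector.
agg = AudioVisualAggregator()
clip_embedding = agg(torch.randn(2, 16, 512), torch.randn(2, 4, 192))
print(clip_embedding.shape)  # torch.Size([2, 256])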
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Multimodal learning
dc.subject: View synthesis
dc.subject: Video face recognition
dc.subject: Reinforcement learning
dc.title: Exploiting Multimodal Information in Deep Learning
dc.type: Thesis
thesis.degree.department: Computer Science and Engineering
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Texas A&M University
thesis.degree.name: Doctor of Philosophy
thesis.degree.level: Doctoral
dc.contributor.committeeMember: Hu, Xia
dc.contributor.committeeMember: Sharon, Guni
dc.contributor.committeeMember: Park, Hangue
dc.type.material: text
dc.date.updated: 2023-02-07T16:15:46Z
local.etdauthor.orcid: 0000-0001-6110-5330

