A Hierarchical Approach to Video Prediction
Abstract
The guiding design principle behind humans building machines has been the repeated execution of a particular task in a precise and efficient manner. While we have systems that can solve tasks ranging from the relatively mundane like crunching numbers to highly complex tasks like recognizing objects and harvesting wheat, they are incredibly specific in that they perform the tasks that they are trained for and not good at generalization. For instance, a robotic arm that can weld two parts of a car together may be rendered completely useless in a situation that requires welding components in a computer motherboard. Developing agents that are endowed with human-like abilities to generalize in diverse scenarios is a core research topic in artificial intelligence.
\In order for an agent to develop general-purpose skills in a completely self-supervised manner, learning rich representations of the world that it is embodied in as well as using these representations to adapt and learn more about the environment are useful. The prediction and anticipation of future events is a key component of such intelligent decision-making systems. Prediction serves as a useful means to learn useful concepts about the world even from a raw stream of sensory observations, such as images from a camera. If the agent can learn to predict raw sensory observations directly, it does not need to assume availability of low-dimensional state information or an extrinsic reward signal. This is beneficial in learning skills in real-world environments, where external reward feedback is extremely sparse or non-existent, and the agent has only indirect access to the state of the world through its senses. Images are high-dimensional and rich sources of information, underlying the potential of video prediction to extract meaningful representations of the underlying patterns in video data.
Video prediction refers to the problem of generating pixels of future frames given context information in the form of past frames of a video. When combined with planning algorithms, the agent is able to take actions towards a desired goal in an unsupervised manner by using data as its own supervision. Motivated by this objective of learning generalizable behavior in the real world, we introduce the Hierarchical Variational Autoencoder (HVAE), a model that leverages a hierarchy of latent sequences to solve the task of video prediction.
Citation
Balasubramanian, Suprith (2022). A Hierarchical Approach to Video Prediction. Master's thesis, Texas A&M University. Available electronically from https : / /hdl .handle .net /1969 .1 /197330.