Constrained Reinforcement Learning for Wireless Networks
Abstract
We first consider dynamic priority resource allocation for media streaming applications at a wireless edge (access) network. By formulating the policy design question as a constrained Markov decision process (CMDP) and decomposing it into single-client problems via a Lagrangian relaxation, we show that the optimal policy takes a threshold form in the video buffer length. This structure enables us to design an efficient constrained reinforcement learning (CRL) algorithm to learn it. Specifically, we exploit the structure of the problem to show that a three-time-scale stochastic-approximation-based policy gradient algorithm can learn the globally optimal policy. We demonstrate that the CRL approach can increase client Quality of Experience (QoE) by over 30%.
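To make the approach concrete, below is a minimal, self-contained sketch of a Lagrangian-relaxed policy-gradient update for a smoothed threshold policy in the buffer length. The buffer dynamics, cost budget, and step sizes are hypothetical stand-ins, and the multiplier is simply adapted on a slower time scale than the policy; it is not the dissertation's exact three-time-scale scheme.

```python
# Illustrative sketch (not the dissertation's actual algorithm): a single-client
# threshold policy over video buffer length, trained by a Lagrangian-relaxed
# policy-gradient update, with the Lagrange multiplier updated on a slower
# time scale. All names (BUFFER_MAX, dynamics, step sizes) are hypothetical.
import numpy as np

BUFFER_MAX = 20                      # hypothetical maximum buffer length (segments)
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def threshold_policy(buffer_len, theta, temp=1.0):
    """Probability of requesting high-priority service: a smooth threshold
    in the buffer length, so that a policy gradient exists."""
    return sigmoid((theta - buffer_len) / temp)

def simulate_step(buffer_len, act_high):
    """Toy buffer dynamics: high-priority service refills faster but costs more."""
    arrival = 2 if act_high else 1
    playback = 1
    next_buf = int(np.clip(buffer_len + arrival - playback, 0, BUFFER_MAX))
    reward = 1.0 if next_buf > 0 else -5.0      # QoE proxy: penalize stalls
    cost = 1.0 if act_high else 0.0             # resource-usage cost
    return next_buf, reward, cost

theta, lam = BUFFER_MAX / 2.0, 0.0   # threshold parameter and Lagrange multiplier
alpha, beta = 0.05, 0.005            # fast (policy) and slow (multiplier) step sizes
COST_BUDGET = 0.6                    # hypothetical average-cost constraint

buf = BUFFER_MAX // 2
for t in range(20000):
    p_high = threshold_policy(buf, theta)
    act_high = rng.random() < p_high
    # likelihood-ratio (REINFORCE-style) gradient of log pi w.r.t. theta
    grad_log_pi = (1.0 - p_high) if act_high else -p_high
    buf, r, c = simulate_step(buf, act_high)
    # fast time scale: ascend the Lagrangian r - lam * c
    theta += alpha * (r - lam * c) * grad_log_pi
    # slow time scale: move the multiplier toward constraint satisfaction
    lam = max(0.0, lam + beta * (c - COST_BUDGET))

print(f"learned threshold ~ {theta:.2f}, multiplier ~ {lam:.2f}")
```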
We then study the complementary notions of sample complexity and regret guarantees for CRL. Our first goal is to characterize the number of samples needed to ensure a desired level of accuracy for a CRL algorithm. We develop learning algorithms for this setting and characterize their sample complexity.
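As a reference point, one common (epsilon, delta)-PAC way to formalize the sample complexity of a CRL algorithm is given below; the dissertation's precise accuracy and feasibility notions may differ.

```latex
% Illustrative (\varepsilon,\delta)-PAC formalization of CRL sample complexity.
\[
N(\varepsilon,\delta) \;=\; \min\Bigl\{\, n :
  \Pr\bigl[\, V^{\pi_n}(\mu) \ge V^{\pi^*}(\mu) - \varepsilon
  \ \text{ and }\ C^{\pi_n}(\mu) \le \bar{C} + \varepsilon \,\bigr] \ge 1-\delta \,\Bigr\},
\]
% where $\pi_n$ is the policy returned after $n$ samples, $V^{\pi}$ and $C^{\pi}$ are the
% objective and constraint values from initial distribution $\mu$, $\bar{C}$ is the
% constraint budget, and $\pi^*$ is the best feasible policy.
```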
Our next goal is to develop a framework for safe exploration that solves CRL problems with an unknown model without constraint violation (with high probability). We show that our algorithm DOPE achieves an objective regret of Õ(|S|√K) with no constraint violation. Furthermore, our empirical studies show that DOPE achieves a dramatic performance improvement over existing approaches.
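For concreteness, the two guarantees above can be stated in a standard episodic form as follows; the notation is illustrative, and the exact definitions follow the dissertation.

```latex
% Objective regret over K episodes and the per-episode safety requirement.
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Bigl( V^{\pi^*}(\mu) - V^{\pi_k}(\mu) \Bigr)
  \;=\; \tilde{O}\bigl(|S|\sqrt{K}\bigr),
\qquad
C^{\pi_k}(\mu) \le \bar{C} \ \text{ for every episode } k \ \text{(w.h.p.)},
\]
% where $\pi_k$ is the policy played in episode $k$, $V^{\pi}$ and $C^{\pi}$ are the expected
% objective and constraint returns from initial distribution $\mu$, and $\bar{C}$ is the budget.
```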
Finally, we consider an application of learning methods to a traditional cache placement problem. Our goal is to take an online learning viewpoint of the cache placement problem and characterize the "regret," defined as the finite-time difference between the hits achieved by a candidate caching algorithm and those of a genie-aided scheme that places the most popular items in the cache. We first consider the Full Observation regime, wherein all requests are seen by the cache. We show that the Least Frequently Used (LFU) algorithm achieves order-optimal regret, and that this performance is matched by an efficient counting-based design that we call LFU-Lite (a simulation of the full-observation setting is sketched below). We then consider the Partial Observation regime, wherein only requests for items currently cached are seen by the cache, making it an online learning problem related to the multi-armed bandit problem. We show how approaching this "caching bandit" with traditional approaches yields either high complexity or high regret, whereas a simple algorithm design that exploits the structure of the request distribution ensures order-optimal regret.
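The sketch below illustrates the full-observation setting: an LFU cache that counts every request, compared in hits against a genie that statically caches the most popular items. The catalog size, cache size, and Zipf popularity profile are hypothetical choices for the example, not parameters from the dissertation.

```python
# Illustrative sketch: LFU under full observation, with regret measured in hits
# against a genie that caches the most popular items. All sizes and the Zipf
# popularity profile below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
N_ITEMS, CACHE_SIZE, HORIZON = 100, 10, 20000

# Zipf-like popularity, unknown to the caching algorithm
popularity = 1.0 / np.arange(1, N_ITEMS + 1)
popularity /= popularity.sum()

counts = np.zeros(N_ITEMS)                                 # LFU request counters
genie_cache = set(np.argsort(-popularity)[:CACHE_SIZE])    # most popular items

lfu_hits = genie_hits = 0
for t in range(HORIZON):
    item = rng.choice(N_ITEMS, p=popularity)
    # LFU cache: the CACHE_SIZE items with the highest observed counts so far
    lfu_cache = set(np.argsort(-counts)[:CACHE_SIZE])
    lfu_hits += item in lfu_cache
    genie_hits += item in genie_cache
    counts[item] += 1                  # full observation: every request is counted

print(f"regret (genie hits - LFU hits) after {HORIZON} requests: {genie_hits - lfu_hits}")
```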
Overall, we design new algorithms well suited to network problems by drawing on insights from RL algorithms for generic CMDPs.
Subject
Reinforcement Learning
Multi-Armed Bandits
Wireless Networks
Safe exploration
Constrained Reinforcement Learning
Citation
Bura, Archana (2023). Constrained Reinforcement Learning for Wireless Networks. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/200122.