dc.description.abstract | Bandit learning has been widely applied to handle the exploration-exploitation dilemma in sequential decision problems, and a large number of bandit algorithms have been proposed to solve this dilemma. While many of these algorithms have been proven to be order-optimal with respect to regret (the difference between the best expected reward and the reward actually achieved), two fundamental challenges remain.
First, the “efficiency” of the best-performing bandit algorithms is often unsatisfactory, where efficiency is measured jointly in terms of reward-maximization performance and computational complexity. For instance, Information Directed Sampling (IDS), variance-based IDS (VIDS), and Kullback-Leibler Upper Confidence Bounds (KL-UCB) have often been reported to achieve outstanding regret performance. Unfortunately, they suffer from high computational complexity even after approximation, and their complexity scales poorly as the number of arms increases. Second, most existing bandit algorithms assume that the sequential decision-making process continues indefinitely; in practice, however, users may renege and stop playing. They also assume that the underlying reward distribution is homoscedastic. Both of these assumptions are often violated in real-world applications, where participants may disengage from future interactions if they do not have a rewarding experience, and where the variances of the underlying distributions differ across contexts. To address these challenges, we propose a family of novel bandit algorithms.
To address the efficiency issue, we propose Biased Maximum Likelihood Estimation (BMLE) - a family of novel bandit algorithms that apply generally to both parametric and non-parametric reward distributions, often admit a closed-form solution with low computational complexity, enjoy a quantifiable regret bound, and demonstrate satisfactory empirical performance. To enable bandit algorithms to handle reneging risk and reward heteroscedasticity, we propose the Heteroscedastic Reneging Upper Confidence Bound policy (HR-UCB) - a novel UCB-type algorithm that achieves outstanding and quantifiable performance in the presence of reneging risk and heteroscedasticity. | en |