dc.contributor.advisor: Kumar, P. R.
dc.creator: Liu, Xi
dc.date.accessioned: 2021-01-04T16:59:32Z
dc.date.available: 2021-01-04T16:59:32Z
dc.date.created: 2020-05
dc.date.issued: 2020-04-23
dc.date.submitted: May 2020
dc.identifier.uri: https://hdl.handle.net/1969.1/191750
dc.description.abstract: Bandit learning has been widely applied to handle the exploration-exploitation dilemma in sequential decision problems, and a large number of bandit algorithms have been proposed to resolve this dilemma. While many of these algorithms have been proved to be order-optimal with respect to regret, i.e., the difference between the best achievable expected reward and the expected reward actually obtained, two fundamental challenges remain. First, the "efficiency" of the best-performing bandit algorithms is often unsatisfactory, where efficiency is measured jointly in terms of reward-maximization performance and computational complexity. For instance, Information Directed Sampling (IDS), variance-based IDS (VIDS), and the Kullback-Leibler Upper Confidence Bound (KL-UCB) algorithm have often been reported to achieve outstanding regret performance. Unfortunately, they suffer from high computational complexity even after approximation, and their complexity scales poorly as the number of arms increases. Second, most existing bandit algorithms assume that the sequential decision-making process continues forever without an end; in practice, however, users may renege and stop playing. They also assume that the underlying reward distribution is homoscedastic. Both assumptions are often violated in real-world applications, where participants may disengage from future interactions if they do not have a rewarding experience, and where the variance of the underlying reward distribution differs across contexts. To address these challenges, we propose a family of novel bandit algorithms. To address the efficiency issue, we propose Biased Maximum Likelihood Estimation (BMLE), a family of novel bandit algorithms that apply generally to both parametric and non-parametric reward distributions, often admit a closed-form solution with low computational complexity, have a quantifiable regret bound, and demonstrate satisfactory empirical performance. To enable bandit algorithms to handle reneging risk and reward heteroscedasticity, we propose the Heteroscedastic Reneging Upper Confidence Bound policy (HR-UCB), a novel UCB-type algorithm that achieves outstanding and quantifiable performance in the presence of reneging risk and heteroscedasticity.
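
To make the regret notion used in the abstract concrete, the following is a minimal, self-contained sketch of a generic UCB1 policy on Bernoulli arms that tracks cumulative (pseudo-)regret. It is included purely as an illustration of the exploration-exploitation trade-off; it is not the BMLE or HR-UCB policy developed in the thesis, and the arm means, horizon, and random seed are arbitrary assumptions made for this sketch.

# Generic UCB1 simulation on Bernoulli arms (illustration only; not the
# thesis's BMLE or HR-UCB policies). Arm means, horizon, and seed are
# arbitrary assumptions.
import math
import random

random.seed(0)
arm_means = [0.3, 0.5, 0.7]     # assumed Bernoulli success probabilities
horizon = 5000

n_arms = len(arm_means)
counts = [0] * n_arms           # number of pulls per arm
sums = [0.0] * n_arms           # cumulative observed reward per arm
best_mean = max(arm_means)
cumulative_regret = 0.0

for t in range(1, horizon + 1):
    if t <= n_arms:
        arm = t - 1             # pull each arm once to initialize estimates
    else:
        # UCB1 index: empirical mean plus an exploration bonus that shrinks
        # as an arm is pulled more often
        ucb = [sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
               for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: ucb[a])

    reward = 1.0 if random.random() < arm_means[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
    # Pseudo-regret: gap between the best arm's mean and the chosen arm's mean
    cumulative_regret += best_mean - arm_means[arm]

print("pulls per arm:", counts)
print("cumulative (pseudo-)regret:", round(cumulative_regret, 2))

Running this sketch shows pulls concentrating on the arm with the highest empirical mean as its exploration bonus shrinks, which is the behavior whose cost the regret measure quantifies.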
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Bandit Learning
dc.subject: Reinforcement Learning
dc.subject: Machine Learning
dc.subject: Recommender Systems
dc.subject: Lifetime Maximization
dc.title: Optimality, Scalability, and Reneging in Bandit Learning
dc.type: Thesis
thesis.degree.department: Electrical and Computer Engineering
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Texas A&M University
thesis.degree.name: Doctor of Philosophy
thesis.degree.level: Doctoral
dc.contributor.committeeMember: Hou, I-Hong
dc.contributor.committeeMember: Qian, Xiaoning
dc.contributor.committeeMember: Wang, Zhangyang
dc.type.material: text
dc.date.updated: 2021-01-04T16:59:32Z
local.etdauthor.orcid: 0000-0001-8177-2720