dc.description.abstract | Bandit learning has been widely applied to handle the exploration-exploitation dilemma in sequential decision problems, and a large number of bandit algorithms have been proposed to solve this dilemma. While many of these algorithms have been proven to be order-optimal with respect to regret (the difference between the best expected reward and the reward actually achieved), two fundamental challenges remain.
First, the “efficiency” of the best-performing bandit algorithms is often unsatisfactory, where efficiency is measured jointly in terms of reward-maximization performance and computational complexity. For instance, Information Directed Sampling (IDS), variance-based IDS (VIDS), and Kullback-Leibler Upper Confidence Bounds (KL-UCB) have often been reported to achieve outstanding regret performance. Unfortunately, they suffer from high computational complexity even after approximation, and their complexity scales poorly as the number of arms increases. Second, most existing bandit algorithms assume that the sequential decision-making process continues indefinitely; in practice, however, users may renege and stop playing. They also assume that the underlying reward distribution is homoscedastic. Both of these assumptions are often violated in real-world applications, where participants may disengage from future interactions if they do not have a rewarding experience, and where the variances of the underlying distributions differ across contexts. To address these challenges, we propose a family of novel bandit algorithms.
To address the efficiency issue, we propose Biased Maximum Likelihood Estimation (BMLE) - a family of novel bandit algorithms that apply generally to both parametric and non-parametric reward distributions, often admit a closed-form solution with low computational complexity, enjoy a quantifiable regret bound, and demonstrate satisfactory empirical performance. To enable bandit algorithms to handle reneging risk and reward heteroscedasticity, we propose the Heteroscedastic Reneging Upper Confidence Bound policy (HR-UCB) - a novel UCB-type algorithm that achieves outstanding and quantifiable performance in the presence of reneging risk and heteroscedasticity. | en |