Title: Nonparametric Estimation and Model Combination in a Bandit Problem with Covariates
Author: Qian, Wei
Type: Thesis or Dissertation
Issued: 2014-07
Available: 2015-11-06
URI: https://hdl.handle.net/11299/175307
Description: University of Minnesota Ph.D. dissertation, July 2014. Major: Statistics. Advisor: Yuhong Yang. 1 computer file (PDF); vii, 73 pages.
Language: en

Abstract: The multi-armed bandit problem is an important optimization game that requires an exploration-exploitation tradeoff to achieve optimal total reward. Motivated by industrial applications such as online advertising and clinical research, we consider a setting in which the rewards of the bandit machines are associated with covariates. Under a flexible problem setup, we focus on a sequential randomized allocation strategy, in which plug-in regression methods for estimating the mean reward functions play an important role in the algorithm's performance. In the first part of the dissertation, we study a kernel-estimation-based randomized allocation strategy and establish its asymptotic strong consistency together with a finite-time regret analysis. Although many nonparametric and parametric estimation methods from supervised learning can be applied to the randomized allocation strategy in a convenient plug-in fashion, guidance on how to choose among these estimation methods is generally unavailable. In the second part of the dissertation, we study a model-combining allocation strategy for adaptive performance and establish its asymptotic strong consistency. Simulations and a real-data evaluation illustrate the performance of the proposed combining strategy. In the existing literature on the nonparametric bandit problem with covariates, the smoothness parameters of a Hölder condition on the reward functions are generally assumed to be known; moreover, the finite-time regret bound obtained in the first part of the dissertation remains minimax sub-optimal. In the third part of the dissertation, we address these two issues by proposing a multi-stage randomized allocation strategy with arm elimination. In particular, when the smoothness parameter is unknown, we equip the algorithm with a smoothness parameter selector based on Lepski's method and show that the minimax regret rate is achieved up to a logarithmic factor.
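
To make the plug-in idea in the abstract concrete, the following minimal Python sketch pairs an epsilon-greedy-style randomized allocation rule for a two-armed covariate bandit with Nadaraya-Watson kernel estimates of the mean reward functions. It is an illustration only, not the dissertation's actual algorithm: the reward functions, Gaussian kernel, bandwidth, and exploration probability are all hypothetical choices.

# Sketch: randomized allocation with plug-in kernel regression estimates
# (illustrative assumptions throughout; not the dissertation's algorithm).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean reward functions of a covariate x in [0, 1].
true_means = [lambda x: 0.5 + 0.3 * np.sin(2 * np.pi * x),
              lambda x: 0.6 - 0.2 * x]

def kernel_estimate(x, xs, ys, bandwidth=0.1):
    """Nadaraya-Watson estimate of the mean reward at covariate x."""
    if len(xs) == 0:
        return 0.0
    w = np.exp(-0.5 * ((x - np.asarray(xs)) / bandwidth) ** 2)
    if w.sum() == 0.0:
        return 0.0
    return float(w @ np.asarray(ys) / w.sum())

def run(T=2000, epsilon=0.05):
    history = [([], []), ([], [])]   # (covariates, rewards) observed per arm
    total_reward = 0.0
    for t in range(T):
        x = rng.uniform()            # covariate revealed before each pull
        estimates = [kernel_estimate(x, *history[a]) for a in range(2)]
        if rng.uniform() < epsilon:  # forced randomized exploration
            arm = int(rng.integers(2))
        else:                        # exploit the plug-in estimates
            arm = int(np.argmax(estimates))
        reward = true_means[arm](x) + 0.1 * rng.normal()
        history[arm][0].append(x)
        history[arm][1].append(reward)
        total_reward += reward
    return total_reward

print(run())

The same skeleton accepts any regression estimator in place of kernel_estimate, which is the plug-in flexibility that motivates the model-combining strategy studied in the second part of the dissertation.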