Contextual Bandits with Delayed Feedback Using Randomized Allocation

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA
BY
Sakshi Arya

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy

Advisor: Dr. Yuhong Yang

May, 2020

© Sakshi Arya 2020
ALL RIGHTS RESERVED

Acknowledgements

I am very grateful to my advisor Professor Yuhong Yang for his unwavering support and invaluable guidance in my research and in various aspects of graduate school. He has always been very accommodating, encouraging and patient with me, for which I cannot thank him enough. His advice and insights have helped sustain my enthusiasm for becoming a better researcher and teacher.

I would like to thank Professor Galin Jones for being the chair of my thesis committee and for his consistent guidance and support. I would also like to thank my committee members, Professors Xiaoou Li and Bjorn Berg, for their consistent support and feedback on research and its applicability, and Professor Lan Wang for serving on my preliminary thesis committee. I am indebted to all the exceptional faculty who taught me fundamental statistics and prepared me for research. I would like to acknowledge Wei Qian for his help and support with my research and coding.

I am grateful to the School of Statistics and College of Liberal Arts for being extremely generous in supporting my research over summers and for academic travel. A big thanks to the staff, Taryn and Taylor, for the administrative help and support.

These six years would not have been as joyful without the continued support and encouragement of my fellow graduate students and friends. I cannot thank Dootika and Haema enough, as this would not have been possible without their support and friendship. The list is long but special thanks to Adam, Aaron, Christina, Dan, James, Kaibo, Karl Oskar, Sanhita, Sarah and Ziyue.
Thanks to my amazing mathematician friends, Anuj, Rohit, Saumya and Arunima, for being there with me on this journey. A special thanks to my friends in India, who were just a phone call away all this time.

Lastly, I am forever grateful to my parents for their unconditional love and support. Whatever I am today is because of them and their happiness means the world to me.

Abstract

Contextual bandit problems are important for sequential learning in various practical settings that require balancing the exploration-exploitation trade-off to maximize total rewards. Motivated by applications in health care, we consider a multi-armed bandit setting with covariates and allow for delay in observing the rewards (treatment outcomes), as would most likely be the case in a medical setting. We focus on developing randomized allocation strategies that incorporate delayed rewards using nonparametric regression methods for estimating the mean reward functions. Although there has been substantial work on handling delays in standard multi-armed bandit problems, the field of contextual bandits with delayed feedback, especially with nonparametric estimation tools, remains largely unexplored. In the first part of the dissertation, we study a simple randomized allocation strategy incorporating delayed feedback, and establish strong consistency. Our setup is widely applicable as we allow for delays to be random and unbounded under mild assumptions, an important setting that is usually not considered in previous works. We study how different hyperparameters controlling the amount of exploration and exploitation in a randomized allocation strategy should be updated based on the extent of delays and underlying complexities of the problem, in order to enhance the overall performance of the strategy. We provide theoretical guarantees for the proposed methodology by establishing asymptotic strong consistency and finite-time regret bounds.
We also conduct simulations and real data evaluations to illustrate the performance of the proposed strategies. In addition, we consider the problem of integrating expert opinion into a randomized allocation strategy for contextual bandits. This is also motivated by applications in health care, where a doctor's opinion is crucial in the treatment decision making process. Therefore, although contextual bandit algorithms are proven to work both theoretically and empirically in many practical settings, it is crucial to incorporate the doctor's judgment to build an adaptive bandit strategy. We propose a randomized allocation strategy incorporating the doctor's interventions and show that it is strongly consistent.

Contents

Acknowledgements
Abstract
List of Tables
List of Figures
1 Introduction
    1.1 The standard multi-armed bandit problem
        1.1.1 Exploration and exploitation dilemma
    1.2 Types of bandit problems
        1.2.1 Stochastic bandit problem
        1.2.2 Adversarial bandit problem
    1.3 Algorithms for standard MAB
        1.3.1 $\epsilon$-greedy policy
        1.3.2 UCB algorithm
        1.3.3 Exponential weighting
        1.3.4 Thompson sampling
    1.4 Contextual bandits
        1.4.1 Parametric framework
        1.4.2 Nonparametric framework
    1.5 Multi-armed bandits with delayed feedback
    1.6 Delayed feedback bandit problem
        1.6.1 Bayesian setting
        1.6.2 Stochastic setting
        1.6.3 Nonstochastic setting
    1.7 Delayed anonymous feedback bandit problem
        1.7.1 Stochastic setting
        1.7.2 Nonstochastic setting
    1.8 Contextual bandits and health care
        1.8.1 Contextual bandits for adaptive clinical trials
        1.8.2 Contextual bandits for mobile health
    1.9 Our contribution
2 Randomized allocation strategy for delayed nonparametric bandits
    2.1 Problem setup
    2.2 The proposed strategy
        2.2.1 Consistency of the proposed strategy
    2.3 The Histogram method
        2.3.1 Allocation with histogram estimates
        2.3.2 Number of observations in a small cube
        2.3.3 Effects of reward delay distributions
    2.4 Simulation study
        2.4.1 Simulation setup
    2.5 Proofs
        2.5.1 Proof of consistency of the proposed strategy
        2.5.2 A probability bound for the histogram method
    2.6 Supplementary simulation results
3 To update or not to update?
    3.1 Problem setup
    3.2 The proposed strategies
        3.2.1 Algorithm
    3.3 Consistency of the proposed strategy
        3.3.1 The histogram method
        3.3.2 Allocation with histogram estimates
        3.3.3 Kernel regression
    3.4 Comparison of strategies, $\eta_1$ and $\eta_2$
    3.5 Simulations
        3.5.1 The simulation process and results
    3.6 Other proofs
    3.7 Supplementary simulation results
4 Finite-time analysis for randomized allocation strategies
    4.1 Finite-time regret analysis
        4.1.1 Nadaraya-Watson regression
    4.2 Delays dependent on covariates
    4.3 Real data analysis
        4.3.1 Discussion on finite-time results
    4.4 Proofs
        4.4.1 Proofs of Lemmas
        4.4.2 Proofs of Theorems
        4.4.3 Proof of Theorem 4.1.13
        4.4.4 Proof outline for the case when delays depend on covariates
    4.5 Supplementary real-data results
5 Doctor's intervention in randomized allocation strategy
    5.1 Problem setup
        5.1.1 Regret and consistency
    5.2 Proposed allocation strategy
        5.2.1 Regression procedures
    5.3 Consistency of the proposed strategy
        5.3.1 Layout of the proof
        5.3.2 A preliminary result
        5.3.3 Scenario 1: doctor performs worse than the algorithm
        5.3.4 Scenario 2: doctor performs at par with the algorithm
        5.3.5 Consistency for the scenario 1: doctor performs worse
        5.3.6 Consistency for scenario 2: doctor performs better
    5.4 Proofs for Theorems 5.3.8 and 5.3.12
6 Conclusion
References
Appendix A. Appendix
    A.1 Concentration inequalities
    A.2 Notations

List of Tables

1.1 Cumulative regret ($R_n(\delta)$) upper bounds for different multi-armed bandit policies in different settings.
1.2 Regret bounds for multi-armed bandits with stochastic delayed rewards
1.3 Cumulative regret upper bounds for multi-armed bandits with nonstochastic delayed feedback

List of Figures

2.1 Per-round regret for the proposed strategy for different delay scenarios. The grid of plots represents four different combinations of choices for $\{\pi_n\}$ and $\{h_n\}$. For a given row, $\pi_n$ remains fixed and $h_n$ varies, and vice versa for columns.
2.2 Per-round regret averaged over 60 replications for the proposed strategy in section 2.2 for different delay situations. $\pi_n = \log^{-2} n$ and $h_n$ decays faster as we move from left to right.
2.3 Per-round regret averaged over 60 replications for the proposed strategy in section 2.2 for different delay situations. The grid of plots represents four different combinations of $\{h_n\}$ and $\{\pi_n\}$. For a given row, $\pi_n$ remains fixed and $h_n$ varies, and vice versa for columns.
3.1 Strategy $\eta_1$ has lower cumulative average regret in setups 1 and 2 (first two rows) and strategy $\eta_2$ has lower cumulative average regret in setups 3 and 4 (third and fourth rows).
3.2 Each row represents a setup, with the first column depicting a one-dimensional function used to generate the mean reward functions. The second and third columns depict the average regret over time for delay 3 and delay 4 respectively. Here, $\{h_n\} = \log^{-1} n$, $\{\pi_n\} = \log^{-2} n$.
3.3 Strategy $\eta_1$ has lower cumulative average regret in setups 1 and 2 (first two rows) and strategy $\eta_2$ has lower cumulative average regret in setups 3 and 4 (third and fourth rows). Here, $\{h_n\} = n^{-1/4}$, $\{\pi_n\} = n^{-1/4}$.
3.4 Strategy $\eta_1$ has lower cumulative average regret in setups 1 and 2 (first two rows) and strategy $\eta_2$ has lower cumulative average regret in setups 3 and 4 (third and fourth rows). Here, $\{h_n\} = n^{-1/4}$, $\{\pi_n\} = \log^{-1} n$.
4.1 Boxplots of normalized CTRs for the three methods being compared. Each column represents a particular delay scenario.
4.2 Boxplots with 200 replications show similar patterns as Figure 4.1.
4.3 Boxplots of normalized CTRs for the three methods for 1000 rounds of initialization.
5.1 Flow chart of the allocation strategy
Chapter 1
Introduction

Imagine a treatment allocation problem in which patients arrive sequentially at times $t = 1, 2, \ldots$, to be treated for a particular fatal disease. There are $\ell$ competing treatments available for the disease. The objective is to maximize the expected total patient lifetime. Treatment allocation is sequential in nature, i.e., the previously used treatments and patient lifetimes can be used to guide the treatment decision for the current patient. Although we might not have complete survival information on the patients treated so far, we can use information such as how long patients survived after treatment and which of them have survived to the current time, together with patient characteristics like disease history, age, genetic information and gender, to guide treatment decisions at a given time. This is an example of a contextual multi-armed bandit problem with delayed responses: delayed because we might not have survival information for all the previous patients when treating the current patient.

Bandit problems have been studied extensively in various fields including statistics, computer science, mathematics, economics, finance, operations research and medicine. We summarize the developments in the field in this chapter, but given the size of the literature, the summary should by no means be considered exhaustive. To present the developments in chronological order, we first introduce the general concept of multi-armed bandits in section 1.1, then summarize the literature on contextual bandits in section 1.4, and finally discuss delayed rewards in both context-free and contextual settings in section 1.5.

1.1 The standard multi-armed bandit problem

Multi-armed bandit (MAB) problems are sequential allocation problems aimed at utilizing "past and present information" to achieve certain goals in a sequential manner.
These problems were first introduced in the landmark paper by Robbins (1952). Metaphorically, the idea comes from imagining a slot machine with multiple arms, where the goal is to optimize the total reward by strategically deciding the order in which arms are pulled. At each time step the player has to decide whether to continue pulling the current arm or to try a different arm. Each arm sequentially generates random rewards from a probability distribution specific to that arm, with unknown parameters. The objective of the player is to maximize the sum of rewards generated through a sequence of arm plays or, equivalently, to minimize the cumulative regret (the shortfall of the reward of the algorithm compared to the optimal). Balancing the exploration-exploitation trade-off plays an integral role in achieving this objective.

1.1.1 Exploration and exploitation dilemma

The problem is prototypical of a general class of adaptive control problems in which there is a fundamental dilemma between "exploration"/"information" (such as the need to learn from all populations about their parameter values) and "exploitation"/"control" (such as the objective of sampling only from the best population). The main challenge of the bandit problem is that when we pull an arm, the rewards of the other arms are not observed. It is therefore necessary to try all arms (explore) in order to form better estimates. In an exploration step, the goal is to form unbiased samples by randomly pulling all arms, which improves the accuracy of the reward predictions for the arms. Because exploration does not focus on the best arm, this step may lead to large immediate regret but can potentially reduce regret in future exploitation steps. During exploitation, the algorithm suggests the best arm learned from the samples formed during exploration, and that arm is pulled. The intention behind exploitation is to maximize the immediate reward (or minimize the immediate regret).
Therefore, there is an intrinsic trade-off between exploiting the current knowledge to focus on the arm that seems to yield the highest rewards and exploring the other arms to identify with better precision which arm is actually the best. The decision maker thus needs a strategy that balances exploration and exploitation well.

Definition 1.1.1. A policy or allocation strategy $\delta$ is an algorithm that chooses the next arm to play based on the past sequence of arms played and the corresponding rewards obtained.

Different policies have been formulated based on different statistical approaches and assumptions on the data generating processes. Next, we discuss some of the types of problems considered in the literature.

1.2 Types of bandit problems

We discuss two fundamental formalizations of the bandit problem, distinguished by the assumed nature of the reward process: stochastic and adversarial. Different allocation strategies have been developed for the two types. We briefly discuss these strategies, their regret bounds and other developments in this section.

1.2.1 Stochastic bandit problem

The standard setting of the bandit problem is defined as follows: let $Y_{i,j}$ denote sequential rewards for $1 \le i \le \ell$ and $j \ge 1$, where $i$ indexes the arms and $j$ indexes the time points. Successive plays of arm $i$ yield rewards $Y_{i,1}, Y_{i,2}, \ldots$ which are independent and identically distributed according to an unknown probability distribution $F_i$ with unknown expectation $\mu_i$. Independence also holds across arms, i.e., $Y_{i,t}$ and $Y_{k,s}$ are independent for each $1 \le i < k \le \ell$ and each $s, t \ge 1$. The goal is to find an efficient allocation strategy and to determine whether a strategy is 'good'. This standard bandit problem, in which the rewards of a given arm are assumed to be an i.i.d. sequence of random variables, falls under the category of stochastic bandit problems.
The standard approach is to compare the performance of the algorithm to the best performance one could possibly achieve for a given problem if the mean rewards were known. The best mean reward is denoted by $\mu^* := \max_{1 \le i \le \ell} \mu_i$. At each time step $j = 1, 2, \ldots, n$, the decision maker selects an arm $I_j \in \{1, 2, \ldots, \ell\}$ and receives reward $Y_{I_j,j}$. Then the regret of $\delta$ after $n$ arm plays is defined by
\[ R_n(\delta) = \mu^* n - E\Big(\sum_{j=1}^{n} \mu_{I_j}\Big), \]
where $\mu_{I_j}$ is the mean reward of the arm $I_j$ pulled by strategy $\delta$ at time $j$. The analysis of this stochastic bandit problem was pioneered in the seminal paper of Lai and Robbins (1985), who introduced the technique of upper confidence bounds for the asymptotic analysis of regret. Another popular heuristic is the $\epsilon$-greedy strategy, which is commonly used and will be discussed in section 1.3.

Another potential goal in a multi-armed bandit problem is to identify the best arm among the competing arms. The regret in this case is defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. Details can be found in Audibert and Bubeck (2010a).

1.2.2 Adversarial bandit problem

Another type of bandit problem is the adversarial or non-stochastic MAB problem. Here, rather than a well-behaved stochastic process controlling the rewards, an adversary has full control over the rewards. No statistical assumptions are made on the nature of the process generating the rewards of the arms. This formulation has its roots in game theory, and it was recognized as an instance of the multi-armed bandit problem by Auer et al. (1995). They describe the problem as gambling in a rigged casino in which the owner sets the gains/rewards for the various arms of a slot machine. The owner may observe the way a gambler plays in order to design even more evil sequences of rewards.
To formalize the bandit problem as a game between a player choosing actions/arms and an adversary choosing the rewards associated with each action, we assume that all rewards belong to the unit interval $[0, 1]$. The game is played in a sequence of trials $t = 1, 2, \ldots, n$ and we assume $\ell$ possible actions/arms.

1. The adversary selects a vector $Y(t) \in [0, 1]^{\ell}$ of current rewards, where the $i$th component is the reward associated with arm $i$ at time $t$.
2. Without knowledge of the rewards chosen by the adversary, the player chooses an arm $I_t \in \{1, 2, \ldots, \ell\}$ and receives the corresponding reward $Y_{I_t}$.
3. Two possible scenarios:
   • Full information game: The player observes the entire vector $Y(t)$ of current rewards.
   • Partial information game: The player observes only the reward $Y_{I_t}$ for the chosen action $I_t$.

The adversary is called oblivious if the rewards are independent of the actions/arms chosen by the player. Otherwise, the adversary is called non-oblivious. Auer et al. (1995) provide an algorithm/allocation rule for the adversarial bandit problem with a non-oblivious adversary and show that in a sequence of $n$ plays, the expected per-round reward of the algorithm approaches that of the best arm at a rate of $O(n^{-1/3})$. The authors claim that it is impossible to obtain an upper bound of $O(\log n)$, which is considered optimal in stochastic bandit problems. Some standard algorithms for adversarial MABs are Hedge for the full information game and Exp3 (Exponential-weight algorithm for Exploration and Exploitation) for the partial information game. For a detailed background, the reader is referred to Cesa-Bianchi and Lugosi (2006) and references therein. Next, we review some of the commonly used algorithms for a standard multi-armed bandit problem.
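The three-step game of section 1.2.2 can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the adversary is oblivious (the whole reward sequence is fixed in advance), and the names `play_adversarial_game`, `best_fixed_arm_reward` and the `choose_arm` callback are hypothetical, introduced here and not taken from Auer et al. (1995).

```python
def play_adversarial_game(reward_vectors, choose_arm, full_information=False):
    """Run the player-versus-adversary protocol for an oblivious adversary.

    reward_vectors: list of reward vectors Y(t), entries in [0, 1], fixed
        before the game starts (this is the oblivious-adversary assumption).
    choose_arm: hypothetical callback (history, n_arms) -> arm index.
    full_information: if True the player sees all of Y(t) after each trial,
        otherwise only the reward of the arm it chose.
    """
    n_arms = len(reward_vectors[0])
    history = []          # feedback revealed to the player so far
    total_reward = 0.0
    for rewards in reward_vectors:
        # Step 2: the arm is chosen without looking at the current rewards.
        arm = choose_arm(history, n_arms)
        total_reward += rewards[arm]
        # Step 3: reveal either the whole vector or only the chosen entry.
        if full_information:
            history.append((arm, list(rewards)))
        else:
            history.append((arm, rewards[arm]))
    return total_reward, history


def best_fixed_arm_reward(reward_vectors):
    """Total reward of the best single arm in hindsight (the benchmark)."""
    n_arms = len(reward_vectors[0])
    return max(sum(r[i] for r in reward_vectors) for i in range(n_arms))


# A player that always pulls arm 0, facing three pre-set reward vectors.
rewards = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
total, history = play_adversarial_game(rewards, lambda hist, k: 0)
shortfall = best_fixed_arm_reward(rewards) - total
```

The only difference between the two scenarios is what is appended to `history`, and that is the information the allocation rule is allowed to use in later trials.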
1.3 Algorithms for standard MAB

1.3.1 $\epsilon$-greedy policy

As previously described, in a multi-armed bandit setting a player is faced with the challenge of choosing between competing arms/options. An amateur bandit player could be tempted to choose the arm with the best estimated reward every single time. This strategy is called the greedy strategy. However, the problem with playing greedily is that one can fail to find the best action with positive probability, resulting in high regret. A simple strategy to balance the exploration-exploitation trade-off is to fix $\epsilon > 0$, go with the greedy choice with probability $1 - (\ell - 1)\epsilon$ and choose each of the other actions with probability $\epsilon$; this is termed the $\epsilon$-greedy strategy. Even better, one could choose a non-increasing sequence $\epsilon_j$ such that, at time $j$,

• with probability $1 - (\ell - 1)\epsilon_j$, play the arm with the highest empirical mean,
• with probability $\epsilon_j$ each, play any one of the arms other than the arm with the highest empirical mean.

Theoretical guarantees for this heuristic/policy are provided by Auer et al. (2002a), with the following results: let $\Delta_i = \mu^* - \mu_i$ and $\Delta = \min_{i: \Delta_i > 0} \Delta_i$, and consider $\epsilon_j = \min\big(\frac{6\ell}{\Delta^2 j}, 1\big)$. When $j \ge \frac{6\ell}{\Delta^2}$, the probability of choosing a suboptimal arm $i$ is bounded by $\frac{C}{\Delta^2 j}$ for some constant $C > 0$. As a consequence one gets the logarithmic bound $E[T_n(i)] \le \frac{C}{\Delta^2} \log n$, where $T_n(i)$ is the number of times arm $i$ is played in $n$ plays. This results in
\[ R_n \le \sum_{i: \Delta_i > 0} \frac{C \Delta_i}{\Delta^2} \log n. \]
A drawback of this strategy is that it requires knowledge of $\Delta$ and does not distinguish between sub-optimal arms. Note that this becomes a purely greedy strategy when $\epsilon_j = 0$.

1.3.2 UCB algorithm

Lai and Robbins (1985) introduced the UCB technique, and since then there has been a myriad of developments and improvements to the policy, leading to its widespread use. This policy is explained by the principle of optimism in the face of uncertainty.
This means that despite our lack of knowledge about which actions are best, we construct an optimistic guess of how good the expected reward of each arm is, and pick the arm with the highest guess. Specifically, for each arm $i$ we construct an upper confidence bound such that, with high probability, the true expected reward $\mu_i$ of the arm is less than the bound. One general way to do that is to use a concentration inequality such as the Chernoff-Hoeffding inequality, which states that for the average $\bar{Y}_i$ of $n$ independent rewards in $[0, 1]$ with mean $\mu_i$ and any $a > 0$, $P(\bar{Y}_i - \mu_i > a) \le e^{-2na^2}$, with the same bound for $P(\bar{Y}_i - \mu_i < -a)$. Let $Y_{i,j}$ be the reward variables for a single arm $i$ in the rounds in which we have chosen $i$, so that $\bar{Y}_i$ is the empirical average reward for action $i$. Let $T_n(i)$ be the number of times arm $i$ is played in $n$ plays. Using $a = a(i, n) = \sqrt{2 \log(n) / T_{n-1}(i)}$, with $T_{n-1}(i)$ in the role of the sample size, we get $P(\bar{Y}_i > \mu_i + a) \le n^{-4}$, which converges to zero very quickly as the number of rounds $n$ played grows. The UCB policy says that, at time $n$, one should play the arm $i$ which maximizes
\[ \bar{Y}_{i,n-1} + \sqrt{\frac{2 \log n}{T_{n-1}(i)}}, \quad \text{where } \bar{Y}_{i,n-1} = \frac{1}{T_{n-1}(i)} \sum_{j=1}^{T_{n-1}(i)} Y_{i,j}. \]
It can be shown that the regret satisfies $R_n \le \sum_{i \ne i^*} \min\big(\frac{10}{\Delta_i} \log n, \, n \Delta_i\big)$, as in Auer et al. (2002a). Several versions of UCB algorithms can be designed by using different concentration inequalities, like the Bernstein inequality, and other concepts such as KL divergence. Such developments can be found in Audibert et al. (2009); Maurer and Pontil (2009); Audibert and Bubeck (2010b); Lattimore and Szepesvári (2018) and references therein.

1.3.3 Exponential weighting

Exponential weighting schemes are broadly used for balancing exploration and exploitation in reinforcement learning. In the stochastic setting, one popular algorithm is the Boltzmann exploration (Softmax) strategy.
This is a simple version of exponential weighting which picks each action with probability proportional to the exponential of its scaled average reward, i.e., actions with higher average rewards are picked with higher probability. At time $n$, let $p_i(n)$ be the probability of pulling arm $i$, $1 \le i \le \ell$. Pull arm $i$ at time $n$ with probability
\[ p_i(n) = \frac{\exp(\hat{\mu}_i(n)/\tau)}{\sum_{k=1}^{\ell} \exp(\hat{\mu}_k(n)/\tau)}, \]
where $\hat{\mu}_i(n)$ is the average reward of arm $i$ and $\tau$ is a tuning parameter. More details on this heuristic and regret bounds can be found in Cesa-Bianchi and Fischer (1998); Sutton and Barto (2018).

A more commonly used exponential weighting approach, perhaps more so in the nonstochastic setting, is the "exponential-weight algorithm for exploration and exploitation" (Exp3). It works by maintaining a weight for each of the actions, using these weights to decide randomly which action to take next, and increasing (decreasing) the relevant weights when a payoff is good (bad).

1. Given $\gamma \in (0, 1]$, initialize the weights $w_i(1) = 1$ for $i = 1, \ldots, \ell$.
2. In each round $n$:
   (a) Set $p_i(n) = (1 - \gamma) \frac{w_i(n)}{\sum_{k=1}^{\ell} w_k(n)} + \frac{\gamma}{\ell}$ for each $i$.
   (b) Pull the next arm $I_n$ according to the distribution $p_i(n)$.
   (c) Observe reward $y_{I_n}(n)$ and define the estimated reward to be $\hat{y}_{I_n}(n) = y_{I_n}(n)/p_{I_n}(n)$.
   (d) Set $w_{I_n}(n+1) = w_{I_n}(n) \exp(\gamma \hat{y}_{I_n}(n)/\ell)$ and $w_i(n+1) = w_i(n)$ for $i \ne I_n$.

Regret guarantees and theoretical details can be found in Auer et al. (2002b).

1.3.4 Thompson sampling

Thompson sampling is a Bayesian approach to the bandit problem and dates back to Thompson (1933), who proposed a simple strategy for the case of Bernoulli bandits. Let there be $\ell$ arms, where each arm $i$, when played, produces a reward of 1 with probability $\mu_i$ (the mean reward) and a reward of zero with probability $1 - \mu_i$. Take the priors over each $\mu_i$ to be beta-distributed with parameters $\alpha_i$ and $\beta_i$.
At any time $n$, because of the conjugacy property, each action's posterior distribution is also a beta distribution with parameters that can be updated according to the rule
\[ (\alpha_i, \beta_i) \leftarrow \begin{cases} (\alpha_i, \beta_i) & \text{if } I_n \ne i, \\ (\alpha_i, \beta_i) + (Y_{n,i}, 1 - Y_{n,i}) & \text{if } I_n = i. \end{cases} \]
Let $\pi_{i,n}$ be the posterior distribution for $\mu_i$ at the $n$th round, and let $\hat{\mu}_{i,n} \sim \pi_{i,n}$. The chosen arm is then given by $I_n \in \arg\max_{i=1,\ldots,\ell} \hat{\mu}_{i,n}$. Recently there has been a lot of interest in this simple policy, and we now have a fairly good understanding of its theoretical properties (Agrawal and Goyal (2012); Kaufmann et al. (2012)) and empirical performance (Chapelle and Li (2011)). Next, we discuss a more practical setting which broadens the applicability of the bandit framework to a wide range of problems, namely contextual bandits or multi-armed bandits with covariates.

1.4 Contextual bandits

In the literature covered in the previous sections, no auxiliary information beyond the observed rewards was considered when selecting which arm to play. However, in many practical situations some additional information in the form of covariates can be utilized for allocation purposes, and in such cases the reward distributions may depend on these covariates. For example, in sequential treatment allocation, before deciding which treatment to assign to a patient, we can observe covariates such as age, gender, genetic information or severity of disease. Taking these covariates into account can help in devising a personalized treatment allocation mechanism. Such a framework is called Contextual Bandits or the Multi-Armed Bandit problem with Covariates (MABC). The term contextual bandits was coined by Langford and Zhang (2008) and is the most commonly used terminology these days. In this dissertation, we use both these names interchangeably.
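Returning briefly to the Beta-Bernoulli Thompson sampling rule of section 1.3.4, the posterior update and the arg max step can be sketched in a few lines of Python. This is a minimal simulation sketch, not the procedure of any of the cited papers: the uniform Beta(1, 1) priors, the function name `thompson_sampling_bernoulli` and the simulated true means are assumptions made for the example.

```python
import random


def thompson_sampling_bernoulli(true_means, n_rounds, seed=0):
    """Simulate Thompson sampling on Bernoulli arms.

    true_means: success probabilities mu_i, used only to simulate rewards.
    Returns the final Beta parameters and the number of pulls per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    alpha = [1.0] * n_arms   # assumption: Beta(1, 1) prior for every arm
    beta = [1.0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one sample from each arm's current Beta posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        # Play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate update: only the played arm's parameters change.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return alpha, beta, pulls


alpha, beta, pulls = thompson_sampling_bernoulli([0.1, 0.9], n_rounds=2000)
```

Each round adds exactly one unit to either `alpha[arm]` or `beta[arm]`, which mirrors the two-case update rule, and arms that were never played keep their prior parameters.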
More formally, in the slot machine analogy, the game player is given a $d$-dimensional covariate $x \in \mathbb{R}^d$ before deciding which arm to pull, and the expected reward of an arm given covariate $x$ takes a functional form $f(x)$. One way to model the reward for the $i$th arm at the $j$th time point is then to adopt a regression model,
\[ Y_{i,j} = f_i(X_j) + \epsilon_{i,j}. \]
In most settings, the $\epsilon_{i,j}$ are considered to be i.i.d. mean-zero random variables. Based on the assumptions made on $f$, one could take either a parametric or a nonparametric approach to estimation. There is a vast amount of literature on contextual bandits (see Bubeck and Cesa-Bianchi (2012) for a complete bibliography) and we try to review some relevant work in the following sections.

1.4.1 Parametric framework

The most extensively studied parametric regression is the linear regression model. Here we assume that $f_i(x) = \beta_i' x$, such that
\[ Y_{i,j} = \beta_i' X_j + \epsilon_{i,j}, \]
where $X_j, \beta_i \in \mathbb{R}^d$ and the $\epsilon_{i,j}$ are i.i.d. mean-zero random variables. Then the expected regret of a learning algorithm $\delta$ at time $N$ is defined as
\[ R_N(\delta) = \sum_{n=1}^{N} E\big[Y_{i^*(X_n),n} - Y_{I_n,n}\big], \]
where $i^*(x) = \arg\max_i \beta_i' x$ and $I_n$ is the arm chosen by the algorithm at time $n$.

This problem was first addressed by Woodroofe (1979), who introduced the one-armed bandit problem with covariates. In this setup, we let $(X_j, Y_{0,j}, Y_{1,j})$, $j \ge 1$, denote a sequence of random vectors, where $X_j$ is a covariate at time step $j$ and $Y_{i,j}$ is a reward from arm $i = 0$ or $i = 1$ that is obtained at time $j$. Suppose the $(X_j, Y_{0,j}, Y_{1,j})$ are i.i.d. copies of $(X, Y_0, Y_1)$. Woodroofe (1979) considered the problem in a Bayesian setting under the assumption that $Y_1 = X - \theta + \epsilon$, where $\epsilon$ is a zero-mean random variable with known distribution, independent of $X$. For a given prior distribution of $\theta$, he provided a description of the optimal Bayesian policy.
These results were later extended by Sarkar (1991) to scenarios where the reward distribution belongs to a one-parameter exponential family. The problem was revisited by Goldenshluger et al. (2009), where the authors study the minimax complexity of the one-armed bandit problem with covariates and establish policies that achieve non-asymptotic lower bounds on the minimax regret (the goal being to minimize the maximal regret). Another related work is Goldenshluger and Zeevi (2013), where a linear response bandit problem with a finite number of arms is considered in a minimax setting and optimal rates for the proposed algorithm are established. They propose an algorithm under certain assumptions, such as normally distributed $\epsilon_{i,j}$ and a “margin” condition, and establish a regret bound of order $O(d^3 \log N)$ for it. Recently, Bastani and Bayati (2015) extended the algorithm of Goldenshluger and Zeevi (2013) to the high-dimensional case where the vectors $\beta_i$ are sparse. Another notable work is Bastani et al. (2017), who show that, under mild conditions, the greedy algorithm is rate optimal in cumulative regret for a two-armed linear bandit model. For situations where the assumptions are hard to verify, they propose what they call the ‘Greedy-First’ algorithm, which follows a greedy policy in the beginning and only performs exploration when the data indicates the need for it. Furthermore, Filippi et al. (2010) developed bandit policies for a generalized linear model framework.

There has been a significant amount of work using heuristics like UCB. One variant, first studied by Auer et al. (2007) under the name “linear reinforcement learning” and later in the context of web advertisement by Li et al. (2010) and Chu et al. (2011), allows the set of available arms to change from time step to time step while keeping the same finite cardinality in each step. Li et al. (2010) gave the LinUCB algorithm (Algorithm 1), which is based on the linearity assumption $E(Y_i \mid X = x) = \beta_i' x$.
It adds a confidence term to each arm’s current estimate, and the arm which maximizes this sum is chosen.

Algorithm 1 LinUCB (Li et al., 2010)
Inputs: $\alpha > 0$
$A_i = I_d$, $b_i = 0_{d \times 1}$ for all $i$
for $n = 1$ to $N$ do
    Observe covariate $x_n$.
    $\hat{\beta}_i = A_i^{-1} b_i$ for all $i$.
    $p_{n,i} = \hat{\beta}_i' x_n + \alpha \sqrt{x_n' A_i^{-1} x_n}$ for all $i$. (Upper confidence bound)
    Take action $I_n = \arg\max_i p_{n,i}$ and observe reward $Y_{n,I_n}$.
    For $i = I_n$, update $A_i = A_i + x_n x_n'$, $b_i = b_i + Y_{n,I_n} x_n$.
end for

A slightly different setting was studied by Dani et al. (2008), Abbasi-Yadkori et al. (2011), and Rusmevichientong and Tsitsiklis (2010), who consider linear contextual bandits for the case when the set of available actions does not change between time steps but can be an almost arbitrary, even infinite, bounded subset of a finite-dimensional vector space.

Agrawal and Goyal (2013b) and Russo and Van Roy (2014) studied linear contextual bandits in the Bayesian setting using Thompson sampling and established regret guarantees. Going beyond the world of linearity, Agarwal et al. (2012) consider a setting where the mean reward functions are assumed to lie in a general class of finitely many members. Agrawal and Goyal (2013a) analyze Algorithm 2, which places a Gaussian prior with mean zero and covariance matrix $\sigma^2 I_{d \times d}$ on the coefficients. They assume that $Y_i \mid x \sim N(\beta_i' x, \sigma^2)$ and, as a result, obtain a Gaussian posterior by conjugacy. The algorithm then draws a sample from the posterior distribution for each arm and chooses the action with the highest sampled mean reward. For $\epsilon \in (0, 1)$, they prove a regret bound of $O(d\sqrt{N^{1+\epsilon}}\,\log N \log(1/\delta))$ with probability $1 - \delta$. A big advantage of Thompson Sampling is that it does not require the covariates to be i.i.d., an assumption that often fails in practical settings. Soare et al. (2014) study the problem of best-arm identification in linear bandits. The corresponding regret rates for much of the work discussed above are tabulated in Table 1.1.
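As a rough illustration of Algorithm 1, the following is a minimal Python sketch of LinUCB using only the standard library. The class name `LinUCB`, the helper functions, and the choice to store $A_i^{-1}$ directly and refresh it with a Sherman-Morrison rank-one update (instead of inverting $A_i$ every round) are assumptions made for this sketch, not part of Li et al. (2010).

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


class LinUCB:
    """Sketch of Algorithm 1 with one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        # A_i = I_d is kept through its inverse (also the identity); b_i = 0.
        self.A_inv = [
            [[1.0 if r == c else 0.0 for c in range(dim)] for r in range(dim)]
            for _ in range(n_arms)
        ]
        self.b = [[0.0] * dim for _ in range(n_arms)]

    def scores(self, x):
        """Upper confidence bounds p_{n,i} for covariate x, one per arm."""
        result = []
        for A_inv, b in zip(self.A_inv, self.b):
            beta_hat = matvec(A_inv, b)                 # beta_hat_i = A_i^{-1} b_i
            width = math.sqrt(dot(x, matvec(A_inv, x)))  # sqrt(x' A_i^{-1} x)
            result.append(dot(beta_hat, x) + self.alpha * width)
        return result

    def select(self, x):
        """Arm with the largest upper confidence bound (first one on ties)."""
        s = self.scores(x)
        return max(range(len(s)), key=s.__getitem__)

    def update(self, arm, x, reward):
        """A_i <- A_i + x x', applied to A_i^{-1} via Sherman-Morrison; b_i <- b_i + reward * x."""
        A_inv = self.A_inv[arm]
        Ax = matvec(A_inv, x)  # A^{-1} x (A^{-1} is symmetric)
        denom = 1.0 + dot(x, Ax)
        for r in range(len(x)):
            for c in range(len(x)):
                A_inv[r][c] -= Ax[r] * Ax[c] / denom
        self.b[arm] = [bk + reward * xk for bk, xk in zip(self.b[arm], x)]
```

In use, one would call `select(x)` on each incoming covariate vector and pass the observed reward back through `update`; only the chosen arm's statistics change, as in the last step of Algorithm 1.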
Algorithm 2 Thompson Sampling (Agrawal and Goyal, 2013b)
Input: $\sigma^2$ (variance parameter used in the prior)
$A_i = I_{d \times d}$, $b_i = 0_{d \times 1}$ for all $i$
for $n = 1$ to $N$ do
    Compute $\hat{\beta}_i = A_i^{-1} b_i$ for all arms $i$.
    Sample $\tilde{\beta}_i$ from $N(\hat{\beta}_i, \sigma^2 A_i^{-1})$ for all $i$. (Sample from the posterior)
    Take action $I_n = \arg\max_i \tilde{\beta}_i' x_n$ and observe reward $Y_{n,I_n}$.
    For $i = I_n$, update $A_i = A_i + x_n x_n'$, $b_i = b_i + Y_{n,I_n} x_n$.
end for

1.4.2 Nonparametric framework

The first work to venture outside the realm of parametric modeling assumptions was by Yang and Zhu (2002). They show that, using nonparametric estimation techniques like histogram and K-nearest neighbor methods, the function estimation is strongly consistent. As a result, the cumulative reward of their randomized allocation rule (similar to an $\epsilon$-greedy policy) is asymptotically equivalent to the optimal cumulative reward. However, strong consistency does not address how quickly the total reward of the allocation strategy approaches the optimal one. The allocation rule used in this work is essentially a variant of the $\epsilon$-greedy policy and will be discussed in Chapter 2, as it lays the ground for our proposed work. This notion of reward strong consistency, as in Yang and Zhu (2002), was later established by May et al. (2012) for their Bayesian sampling method.

The question of establishing a more refined notion of optimality than that of Yang and Zhu (2002) for nonparametric MABC was addressed in Rigollet and Zeevi (2010), who proposed UCB-type policies. They derive near-optimal bounds on the regret in the case of a two-armed bandit problem under only two assumptions on the underlying functional form that governs the arm rewards. The two conditions are:

1. Smoothness condition: The mean reward functions satisfy the Hölder smoothness condition with parameters $(\kappa, \rho)$ if $f_i$ satisfies
$$|f_i(x) - f_i(x')| \leq \rho \|x - x'\|^{\kappa} \quad \forall x, x' \in \mathcal{X},\ i = 1, \ldots, \ell,$$
for some $\kappa \in (0, 1]$ and $\rho > 0$.

2. Margin condition: Given $x \in \mathcal{X}$, define $f^{\sharp}(x)$ to be
$$f^{\sharp}(x) = \begin{cases} \max_{1 \leq i \leq \ell} \{f_i(x) : f_i(x) < f^*(x)\} & \text{if } \min_{1 \leq i \leq \ell} f_i(x) < f^*(x), \\ f^*(x) & \text{otherwise.} \end{cases}$$
$f$ is said to satisfy the margin condition if there exist $\alpha \in (0, d/\kappa]$, $t_0 \in (0, 1)$ and $c_0 > 0$ such that
$$P_X\left(0 < f^*(X) - f^{\sharp}(X) \leq t\right) \leq c_0 t^{\alpha} \quad \text{for all } t \in [0, t_0].$$

The margin condition encodes the “separation” between the functions that describe the arms’ responses and was originally studied by Goldenshluger et al. (2009) in the one-armed bandit problem. Rigollet and Zeevi (2010) introduced the idea of binning covariates, an intuitive concept that can be described using the example of clinical trials: patients are segmented into groups with similar characteristics, and the treatment is then allocated based on the responses over that group. This two-armed setup was extended by Perchet and Rigollet (2013) to the $\ell$-armed bandit problem with covariates, where $\ell$ may be large. They use successive arm elimination algorithms to establish a regret upper bound, $O(N^{(\kappa + d - \alpha\kappa)/(2\kappa + d)})$, of the same order as the minimax lower bound of the two-armed MABC problem in Rigollet and Zeevi (2010). Subsequently, Qian and Yang (2016a) use “chaining” arguments to show uniform strong consistency in estimation using kernel methods, along with providing a finite-time regret analysis. Given the Hölder smoothness parameter $\kappa$ and total time horizon $N$, the expected cumulative regret upper bound for the proposed policy is $O(N^{(2\kappa + d)/(3\kappa + d)})$, which is slightly sub-optimal and might reflect a theoretical limitation of the proposed algorithm. Another contribution of this paper is a fully data-driven model combining technique that helps choose the best estimation method for each arm within the randomized allocation strategy for MABC.
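To make the binning idea and the $\epsilon$-greedy style of randomized allocation discussed above more concrete, here is a minimal Python sketch. It is not the rule of Yang and Zhu (2002) or of Rigollet and Zeevi (2010); it assumes a one-dimensional covariate in $[0, 1]$, equal-width bins, a fixed exploration probability, and histogram (bin-average) estimates of each $f_i$. The class name `BinnedEpsilonGreedy` and all parameter names are hypothetical.

```python
import random


class BinnedEpsilonGreedy:
    """Histogram estimates of each arm's mean reward plus epsilon-greedy allocation."""

    def __init__(self, n_arms, n_bins, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.n_bins = n_bins
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Per-arm, per-bin running sums and counts of observed rewards.
        self.sums = [[0.0] * n_bins for _ in range(n_arms)]
        self.counts = [[0] * n_bins for _ in range(n_arms)]

    def bin_of(self, x):
        """Index of the equal-width bin of [0, 1] containing x (x = 1 goes in the last bin)."""
        return min(int(x * self.n_bins), self.n_bins - 1)

    def estimate(self, arm, x):
        """Bin-average estimate of f_arm(x); 0.0 if the bin has no data yet."""
        k = self.bin_of(x)
        n = self.counts[arm][k]
        return self.sums[arm][k] / n if n else 0.0

    def select(self, x):
        """Explore uniformly with probability epsilon, otherwise play the estimated best arm."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        estimates = [self.estimate(a, x) for a in range(self.n_arms)]
        return max(range(self.n_arms), key=estimates.__getitem__)

    def update(self, arm, x, reward):
        """Record a reward for the played arm in the bin of its covariate."""
        k = self.bin_of(x)
        self.sums[arm][k] += reward
        self.counts[arm][k] += 1
```

In practice the exploration probability would typically shrink over time and the number of bins would grow with the sample size; both are held fixed here to keep the sketch short.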
Motivated by the observation that using a randomized allocation strategy alone may give a sub-optimal rate for the cumulative regret, Qian and Yang (2016b) propose an algorithm (RAEE) which embeds the arm-elimination technique of Perchet and Rigollet (2013) into the randomized allocation strategy. They show that near-minimax regret upper bounds can be achieved without prior knowledge of the smoothness parameter. In particular, they use Lepski’s method to adaptively estimate the smoothness parameter under a “self-similarity” condition.

Note that, in the smoothness condition, $\kappa \leq 1$ corresponds to non-differentiable functions, a weaker form of Lipschitz continuity ($\kappa = 1$). Also, $\kappa = \infty$ corresponds to infinitely-extrapolable functions such as linear and other parametric functions. Hu et al. (2019) develop a novel algorithm that bridges the gap between the infinitely smooth linear response bandit and the non-smooth, non-differentiable response bandit. They characterize the smoothness of the mean reward functions in terms of the highest order of continuous derivatives. They propose an algorithm for every level of smoothness $1 \leq \kappa < \infty$ and prove that it achieves the minimax optimal regret rate up to polylogarithmic factors, $\tilde{O}(N^{(\kappa + d - \alpha\kappa)/(2\kappa + d)})$. Other works on nonparametric bandits include Fontaine et al. (2019), who incorporate regularization in contextual bandits, and Wanigasekara and Yu (2019), who deal with nonparametric bandits in an unknown metric space.

A slightly different approach is taken by Langford and Zhang (2008), who impose neither linearity nor any smoothness assumption on the mean reward function. Instead, they fix a class of policies, $H$, and aim to minimize the expected regret relative to the class $H$. They do not require knowledge of the time horizon $N$, as they run exploration-exploitation steps in epochs (batches) with a sample-dependent exploitation step such that the resulting regret bound is no more than three times the regret for known $N$.
The regret bound of Epoch-Greedy with a finite class $H$ is $O(N^{2/3}(\log \lvert H \rvert)^{1/3})$. The authors also note that a similar regret bound could be shown for an infinite class $H$ with finite VC dimension. An advantage of this methodology is that it does not make assumptions like the margin condition on the underlying reward generating functions. Additionally, the authors show that, upon imposing some assumptions (such as a gap between the best and second best bandit) on the Epoch-Greedy algorithm, one could achieve regret of the form $O(\log N)$.

A different but related setting to the MABC problem considers the arm space (with possibly infinitely many arms) instead of the covariate space. For reference, see Auer et al. (2007) and Kleinberg et al. (2008). Plug-in type policies, explained for the full information case in Audibert et al. (2007), have gained popularity in the context of continuum-armed bandits (with uncountably many arms). See Slivkins (2014) for reference. These bandit problems consider a joint covariate and arm space. Some contextual bandit policies, such as the ones proposed in Beygelzimer et al. (2011) and Langford and Zhang (2007), allow for adversarially chosen covariates and establish regret bounds in this more general setting.

Setting | Regret upper bound | Reference
Standard stochastic MAB | $O(\log n)^*$ |
Adversarial MAB | $\tilde{O}(\sqrt{\ell n})^*$ |
Linear contextual bandits | $O(d^3 \log n)$ | Goldenshluger and Zeevi (2013)
Linear contextual bandits | $O(d \log n \sqrt{n \log(n/\delta)})$ | Dani et al. (2008)
Linear contextual bandits | $O(d \log n \sqrt{n} + \sqrt{dn \log(n/\delta)})$ | Abbasi-Yadkori et al. (2011)
Linear contextual bandits | $O(d\sqrt{n^{1+\epsilon}}\,\log n \log(1/\delta))$ | Agrawal and Goyal (2013a)
Contextual bandit | $O(n^{2/3}(\log \lvert H \rvert)^{1/3})$ | Langford and Zhang (2007)
Nonparametric bandits | $O(n^{(\kappa + d - \alpha\kappa)/(2\kappa + d)})^*$ | Perchet and Rigollet (2013)
Nonparametric bandits | $O(n^{(2\kappa + d)/(3\kappa + d)})$ | Qian and Yang (2016a)
Nonparametric bandits | $\tilde{O}(n^{(\kappa + d - \alpha\kappa)/(2\kappa + d)})$ | Qian and Yang (2016b)
Smooth contextual bandits | $\tilde{O}(n^{(\kappa + d - \alpha\kappa)/(2\kappa + d)})$ | Hu et al. (2019)

Table 1.1: Cumulative regret ($R_n(\delta)$) upper bounds for different multi-armed bandit policies in different settings. $d$ denotes the covariate dimension, $\delta \in (0, 1)$ is such that the regret bound holds with probability $1 - \delta$, $\epsilon$ is a fixed number in $(0, 1)$, $\lvert H \rvert$ is the size of the policy set in Epoch-Greedy, $\kappa$ is the smoothness parameter and $\alpha$ is the parameter from the margin condition. $^*$ denotes rates that have been shown to be minimax optimal; $\tilde{O}$ signifies some missing log or polylog terms.

Next, we discuss a pertinent issue that arises in most practical situations where multi-armed bandits find application, namely delay in observing the rewards.

1.5 Multi-armed bandits with delayed feedback

Delays in observing rewards in bandits manifest themselves in different ways in various practical settings. Many of these have been identified and addressed in the literature, and we review them in this section. Online advertising is one of the main application areas of multi-armed bandit algorithms. Typically, a user is regularly shown advertisements on social media or other websites by a bandit algorithm. When a user clicks on an advertisement or buys a product, the bandit algorithm updates and takes this information into account in future recommendations. However, the click or the user feedback is usually not instantaneous and might be received some time after the algorithm presented the advertisement. Another common example of delayed rewards arises in health care applications, such as allocating treatments for a particular disease. In this case, one would most likely not have observed the results of previous patients before making current treatment decisions. Therefore, not all information is available for making real-time decisions, and delays are expected to play a crucial role in determining the performance of such a sequential treatment allocation scheme.

In some cases, the algorithm (player) receives the delayed feedback in the form of arm-reward pairs, so that the player knows both the reward observed and which arm generated it.
This is called the delayed feedback bandit problem. However, in some online situations, it is not possible to distinguish which advertisement corresponds to the observed delayed reward; that is, one has no information on which advertisement was actually responsible for attracting the user’s interest. This seemingly harder problem is known as the delayed anonymous feedback bandit problem. The complexity of these problems can then vary depending on the assumptions made on the delays, ranging from fixed known delays to random unbounded delays. Another dimension of complexity is contextual bandits with delayed feedback, which is where our research contribution lies.

1.6 Delayed feedback bandit problem

As mentioned in section 1.5, the delayed feedback bandit problem arises when one observes rewards at a delayed time along with the knowledge of the arm each reward corresponds to. For example, this is the scenario one would expect in an adaptive treatment allocation setting, where one keeps track of which outcome corresponds to which treatment.

1.6.1 Bayesian setting

One of the most extensively studied ways of tackling a bandit problem is through the use of Bayesian methods. Anderson (1964) and Suzuki (1966) highlighted the importance of considering delays in the choice of optimal sequential decision procedures and characterized Bayes solutions of sequential decision problems under delayed feedback. Motivated by ethical and practical issues in the design of sequential clinical trials, a bandit process with delayed responses was studied by Eick (1988b,a), where the existence of optimal Gittins indices is shown under some conditions. The responses considered were patients’ survival times after the treatment, which might be censored upon their observation, with the objective of maximizing the total discounted expected survival times.
More recently, Chapelle and Li (2011), along with providing deeper insights through an empirical comparison between Thompson Sampling and the UCB algorithm, conducted an empirical study to illustrate the robustness of Thompson sampling in the case of constant delayed feedback.

1.6.2 Stochastic setting

A substantial amount of the work in this direction assumes fixed and bounded delays. In the last section of their paper, Dudik et al. (2011) considered a constant known delay, which resulted in an additional additive penalty in the regret for contextual bandits with finite decision sets. A more systematic study of online learning problems with delayed feedback was conducted by Joulani et al. (2013), who developed meta-algorithms which, in a black-box fashion, turn algorithms developed for the non-delayed case into ones that can handle delays in the feedback loop. They propose Algorithm 3 (QPM-D), which uses any bandit algorithm for the non-delayed case (which they call BASE) to tackle delays. They create buffers (queues) $Q[i]$ for each arm $i$. While rewards for the chosen arms are available in the queues, the BASE algorithm runs and makes further predictions. When no feedback is available for a given arm (that is, $Q[I]$ is empty for some arm $I$), the algorithm keeps choosing the same arm until a reward is observed for that arm. Their results show that the price of delay is a multiplicative increase in the regret in adversarial problems and an additive increase in stochastic problems. Their results only hold for finite side information sets and not for a fully contextual setting. Following this notable work, Mandel et al. (2015) devised a method that guarantees good black-box algorithms when leveraging a prior dataset and incorporates a heuristic to help improve empirical performance while retaining the strong theoretical guarantees of Joulani et al. (2013). Another approach to handling delays in bandits was taken by Desautels et al. (2014).
They analyzed the case of Gaussian Process Bandits, developed algorithms for parallelizing exploration-exploitation trade-offs, and provided regret bounds for them.

Another similar setting in online learning is motivated by delayed conversions in advertising and product recommendations on e-commerce websites (Chapelle (2014)). Conversion is a generic term for a user’s buying decision. It occurs when the reward that is immediately obtained (for example, a click) is a proxy for an actual outcome (for example, a corresponding sale) which might take hours or days to happen. For this setting, Vernade et al. (2017) consider potentially infinite stochastic delays, resulting in some feedback being censored (not observable), which happens in online settings due to limited memory. The strategy proposed in Joulani et al. (2013) does not work in this setting because their algorithm acts like a queuing mechanism in which the number of draws of an arm, as well as the cumulated sum of the subsequent rewards, are only updated when the observation arrives at the learner. However, in this setting the reward associated with a click is 1, so the cumulated sum in Joulani et al. (2013) would just be 1 and would not allow arms to be compared. Vernade et al. (2017) develop UCB-based strategies; their analysis assumes prior knowledge of the delay distribution and does not handle the contextual case. Subsequently, Vernade et al. (2018) extend this work (delayed conversions with censoring) to the contextual case. They assume a linear model for the rewards as a function of the covariates. They develop a delayed version of LinUCB, named DeLinUCB, which can handle ambiguous delayed feedback. Another major improvement over Vernade et al. (2017) is that they do not require any prior knowledge of the delay distribution or the conversion probability.
They assume bounded scalar rewards, bounded coefficients of any action and bounded noise (and mostly consider Bernoulli arms).

Although not directly related to bandits, a relevant line of work devises methods for conversion rate prediction, such as Chapelle (2014) and Yoshikawa and Imai (2018). Both use two models: one for the time delay between click and conversion, and a classification model for predicting conversion. Both use survival analysis tools for the time delays, as censoring is involved, with the latter extending the generalized linear model of the former to a nonparametric model. Recently, Zhou et al. (2019) designed a delay-adaptive algorithm for generalized linear contextual bandits using UCB-style exploration. To our knowledge, no previous work has addressed adapting to delayed feedback in nonparametric contextual bandits, and that is the contribution of this dissertation. Also, there has been no work on using randomized policies for dealing with delayed contextual bandits, and our work seems to be the first to address that as well. The regret rates for each of the works listed above are tabulated in Table 1.2.

Algorithm 3 Queued Partial Monitoring with Delays (QPM-D) (Joulani et al., 2013)
Create an empty buffer $Q[i]$ for each arm $i \in \{1, \ldots, \ell\}$.
Let $I$ be the first arm predicted by the BASE algorithm.
for each time instant $n = 1, 2, \ldots, N$ do
    Prediction step:
    while $Q[I]$ is not empty do
        Update BASE with a reward from $Q[I]$.
        BASE predicts the next arm $I$.
    end while
    $Q[I]$ is empty (no available reward for $I$): predict $I_n = I$ at time instant $n$ to get a reward.
    Update:
    for each $(s, Y_s) \in Z_n$ do ($Z_n$: set of arms and rewards observed at time $n$)
        Add the reward $Y_s$ to the buffer $Q[I_s]$.
    end for
end for

Stochastic rewards, with and without covariates

Setting | Regret bound | Reference
Fixed delays | $O(\sqrt{\ell \log N}\,(\tau_{const} + \sqrt{N}))$ | Dudik et al. (2011)
Variable delays | $R_N \leq R'_N + O(\log N + E[\tau] + \sqrt{E[\tau] \log N})$ | Joulani et al. (2013); Mandel et al. (2015)
Variable delays | $R_N \leq C_1 R'_N + C_2 \tau_{\max} \log \tau_{\max}$ | Desautels et al. (2014)
Covariates, linear | $O(\sqrt{f_{N,\delta}} + m\,\tau_m p_c \sqrt{8dN \log(1 + N/\lambda)})$ | Vernade et al. (2018)
Covariates, generalized linear | $\tilde{O}((\sqrt{\mu_D d} + \sqrt{\sigma_G d} + d)\sqrt{N})$ | Zhou et al. (2019)
Covariates, nonparametric | | This dissertation
Anonymous delayed feedback | $+\,O(\log N + E[\tau] + \sqrt{E[\tau] \log N})$ | Pike-Burke et al. (2017)

Table 1.2: Regret bounds for multi-armed bandits with stochastic delayed rewards. Here, $R'_N$ is the cumulative regret rate for the non-delayed BASE policy, $\tau_{const}$ is the constant delay and $\tau$ is the random delay. In Vernade et al. (2018), $\{m, \tau_m, p_c\}$ are conversion parameters, $f_{N,\delta} = 2(1 + 1/\log t)\log(1/\delta) + cd \log(d \log t)$, and $d$ is the covariate dimension. In Zhou et al. (2019), $d$ is the covariate dimension, $\mu_D$ is the mean of the i.i.d. delays, and $\sigma_G$ is a parameter characterizing the tail bound of the delays.

Nonstochastic delays, no covariates

Delays | Setting | Regret upper bound | Reference
Fixed delays | Non-anonymous feedback | $O(\sqrt{(\ell + d) N \log \ell})^*$ | Cesa-Bianchi et al. (2016)
Fixed delays | Composite anonymous feedback | $O(\sqrt{d \ell N \log \ell})^*$ | Cesa-Bianchi et al. (2018)
Variable delays | Bounded | $O(\sqrt{(\ell N + D) \log \ell})$ | Sommer Thune et al. (2019)
Variable delays | Unbounded, unknown $N$, $D$ | $O(\sqrt{\log \ell\,(\ell^2 N + D)})$ | Bistritz et al. (2019)
Variable delays | Unbounded, observed at action time | $O(\min_\beta (\lvert S_\beta \rvert + \beta \log \ell + (\ell N + D_\beta)/\beta))$ | Sommer Thune et al. (2019)
Variable delays | Unbounded, unknown $N$, $d$ | $O(\sqrt{\ell N} + \min_S (\lvert S \rvert + \sqrt{D_{\bar{S}} \log \ell}))$ | Zimmert and Seldin (2019)

Table 1.3: Cumulative regret upper bounds for multi-armed bandits with nonstochastic delayed feedback. $\ell$ denotes the number of arms, $d$ is the fixed delay, $N$ is the time horizon, $\lvert S_\beta \rvert$ is the number of observations with delay exceeding $\beta$, and $D_\beta$ is the total delay of observations with delay below $\beta$. $S$ is the set of rounds excluded from delay counting, $\bar{S} = [N] \setminus S$ is the set of counted rounds, and $D_{\bar{S}}$ is the total delay over the counted rounds. $^*$ refers to results that have been proved to be minimax.
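The queueing mechanism of QPM-D (Algorithm 3) can be sketched in Python as follows. This is an illustrative sketch only: the `UCB1` class stands in for an arbitrary BASE algorithm, and the callbacks `pull(arm, n)` (reward of playing `arm` at time `n`) and `delay(arm, n)` (a non-negative integer delay for that reward) are hypothetical names introduced here to simulate the environment.

```python
import math
from collections import deque


class UCB1:
    """A simple non-delayed BASE algorithm (assumed here for illustration)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def predict(self):
        # Play each arm once before using the UCB index.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        index = [
            self.sums[a] / self.counts[a] + math.sqrt(2 * math.log(total) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return max(range(len(index)), key=index.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def qpm_d(base, n_arms, horizon, pull, delay):
    """Run the QPM-D wrapper around `base` and return the sequence of arms played."""
    queues = [deque() for _ in range(n_arms)]  # one FIFO buffer Q[i] per arm
    pending = {}                               # arrival time -> [(arm, reward), ...]
    arm = base.predict()                       # first arm predicted by BASE
    played = []
    for n in range(1, horizon + 1):
        # Prediction step: while BASE's current proposal has a buffered reward,
        # feed it to BASE and ask for the next proposal.
        while queues[arm]:
            base.update(arm, queues[arm].popleft())
            arm = base.predict()
        # Q[arm] is empty: actually play this arm at time n.
        played.append(arm)
        reward = pull(arm, n)
        pending.setdefault(n + delay(arm, n), []).append((arm, reward))
        # Update step: rewards arriving at time n are added to their arm's buffer.
        for a, r in pending.pop(n, []):
            queues[a].append(r)
    return played
```

For example, with two arms, rewards of 0 for arm 0 and 1 for arm 1, and a constant delay of two rounds, the wrapper keeps playing arm 0 until its first reward arrives and BASE can move on to arm 1.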
1.6.3 Nonstochastic setting

Delayed rewards also occur in a nonstochastic setting, where the rewards are obtained in a more deterministic fashion. For example, Cesa-Bianchi et al. (2016) give an example of multiple ad servers, which form a communication network through which they can share user information and use real-time bidding to sell their inventory. Each server learns how to set the auction parameters (e.g., reserve price) sequentially using a bandit algorithm, in order to maximize the network’s overall revenue, and shares feedback information with other advertisers in order to speed up learning. Delay comes in because the rate at which information is exchanged through the communication network is slower than the typical rate at which ads are served. This causes each learner to acquire feedback information from other servers with a delay that depends on the network’s structure. In this communication network, messages that take more than $d$ hops to arrive are dropped, and $d$ is called the delay parameter. Two scenarios can now occur: 1) the learning agents could decide to cooperate to solve the same nonstochastic bandit problem, or 2) not cooperate and ignore the information received from other agents. The authors propose a version of the Exp3 algorithm and prove that, with $\ell$ actions and $M$ agents, the average per-agent regret after $N$ rounds is at most of order $\sqrt{(d + 1 + \frac{\ell}{M}\alpha_{\leq d})(N \log \ell)}$, where $\alpha_{\leq d}$ is the independence number of the $d$-th power of the communication network $G$ (i.e., the graph $G$ augmented with all edges between any pair of nodes at shortest-path distance less than or equal to $d$). The authors provide results on regret bounds depending on the nature of the underlying graph structure $G$.

More recently, Li et al. (2019) and Sommer Thune et al. (2019) have developed algorithms with regret guarantees for the case when the delays are variable (unrestricted). In Sommer Thune et al. (2019), prior knowledge of delays is no longer required, but they assume that delays are available at action time (when an arm is pulled) for the doubling technique to work. This assumption is justified in that work, as it is satisfied in the above-mentioned setting of interaction between ad servers. However, Zimmert and Seldin (2019) relax the assumption of delays being available at action time. Their results require no advance knowledge of the delays or the time horizon $N$, and no doubling technique. The tightness of the regret bound achieved by Zimmert and Seldin (2019) remains an open problem. More recently, Bistritz et al. (2019) considered the delayed Exp3 algorithm and proposed a novel doubling trick for online learning with delays to deal with the case where the total delay and time horizon are unknown. To our knowledge, the problem of delay in nonstochastic bandits with covariates has not been addressed thus far. The regret bounds for the literature reviewed here are tabulated in Table 1.3.

1.7 Delayed anonymous feedback bandit problem

As mentioned in section 1.5, the delayed anonymous feedback problem arises when rewards are observed at a delayed time but without knowledge of the corresponding arms. These situations often arise in online learning settings. We discuss work done in this area in the stochastic and adversarial realms in section 1.7.1 and section 1.7.2, respectively.

1.7.1 Stochastic setting

This problem was formulated by Pike-Burke et al. (2017, 2018), motivated by applications in online advertising. In this problem, along with assuming delayed feedback, it is also assumed that the player does not observe the outcome of a specific action. Instead, at each time step $t$, a player selects an action $I_t$ and then receives a reward $Y_t$, which could be a cumulative/aggregated reward from any of the past $t$ plays of the bandit process.
Although this seems to be a harder problem due to the anonymity, the authors devise a strategy and show that one can achieve regret of similar order to the simpler delayed feedback problem of Joulani et al. (2013). The key idea is that playing an arm consecutively for a long period of time helps obtain an accurate estimate of the mean reward of that arm. To our knowledge, extending this to contextual bandits remains an unsolved problem. More recently, Cella and Cesa-Bianchi (2019) proposed a slightly different problem motivated by recommendation problems in music streaming platforms. They propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. They introduce a class of ranking policies, and propose an algorithm that achieves a regret of $\tilde{O}(\ell N)$ with respect to the best ranking policy.

1.7.2 Nonstochastic setting

In Cesa-Bianchi et al. (2018), the authors study delayed anonymous feedback in a nonstochastic setting, where rewards (or losses) are generated by some unspecified deterministic mechanism. In addition, they consider a more general setup by assuming that the loss for choosing an action at time $t$ is adversarially spread over at most $d$ (a known, fixed delay) consecutive time steps in the future, $t, t + 1, \ldots, t + d - 1$. They call this the composite anonymous setting, as it can accommodate scenarios where actions have a lasting effect which combines additively over time. These scenarios can occur in various online settings; for example, an impression results in an immediate click-through, later followed by a conversion, or a user interacts multiple times with the recommended item. For this setting, the authors provide an upper bound and a matching lower bound (up to log factors), showing that, in the nonstochastic case, anonymous feedback is strictly harder than non-anonymous feedback.
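The composite anonymous feedback model described above can be illustrated with a short Python sketch. The function name and the input layout (`loss_components[t][s]` is the part of the loss of the action played at round `t` that is charged `s` rounds later, with rounds indexed from 0) are assumptions made for illustration and are not taken from Cesa-Bianchi et al. (2018).

```python
def composite_anonymous_feedback(loss_components):
    """Aggregate per-action loss components into what the learner observes.

    loss_components[t][s] is the share of the loss of the action played at
    round t that is charged at round t + s, for 0 <= s < d. The learner only
    sees the per-round totals, with no attribution to past actions.
    """
    horizon = len(loss_components)
    observed = [0.0] * horizon
    for t, parts in enumerate(loss_components):
        for s, part in enumerate(parts):
            if t + s < horizon:  # components falling beyond the horizon are never observed
                observed[t + s] += part
    return observed
```

With $d = 2$ and components `[[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]`, the learner observes 0.5 in the first round and 1.5 in the second, where the 1.5 mixes losses caused by two different actions.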
Extending this to the contextual case and to unknown delays remains an unsolved problem.

1.8 Contextual bandits and health care

In recent times, the majority of work in contextual bandits has been driven by personal recommendation problems arising on the web, such as providing advertisement recommendations based on user and webpage features, or tailoring news article recommendations based on user interests. However, multi-armed bandits have from the very beginning been motivated by their potential applications in health care. Since the work proposed in this dissertation is motivated by its potential applications to health care, we discuss some of the work done in two directions in the broad field of health care in the following sections.

1.8.1 Contextual bandits for adaptive clinical trials

Traditionally, clinical trials have followed a non-adaptive design with known randomization of patients to treatments throughout the trial. These designs have been well studied for a long time and remain prevalent today because of good properties such as maintaining a low Type I error and controlling bias. However, these trials are often very long and expensive, and could lead to poor patient outcomes and inconclusive results. Therefore, adaptive designs are being studied and encouraged, even by regulatory bodies like the FDA in the U.S. There are several types of adaptive designs; the most commonly used is the response-adaptive design (Chow and Chang (2012); Murphy (2005)). Such designs are usually Bayesian in nature, intending to learn from the accumulating patient responses by making procedural changes (like changing randomization probabilities) while the trial is still ongoing.

Contextual bandits are extremely relevant in the context of adaptive clinical trials. The covariate information $x_i$ for the $i$th patient can be used to “personalize” the treatment selection for the patient, as in biomarker-guided therapies in personalized medicine.
For example, the BATTLE trial in Kim et al. (2011) demonstrated that personalizing the chemotherapy regimen led to increased success rates in cancer patients. In Lai and Liao (2012), the authors develop an asymptotic theory for efficient outcome-adaptive randomization schemes and optimal stopping rules. They extend the classic MAB theory to develop asymptotic lower bounds for the expected sample sizes for the treatment arms and the control arm, using generalized likelihood ratio procedures to obtain these bounds. Some of these ideas are used in Bartroff et al. (2013) and Wason and Jaki (2012). A LASSO bandit algorithm for parametric bandits is used in Bastani and Bayati (2015) on the medical decision making problem of Warfarin dosing, and it is shown to improve overall treatment benefits. Szorenyi et al. (2015) have developed an innovative qualitative MAB approach in which the rewards are not assumed to be numerical. They use a quantile-based online learning approach for regret minimization and finite-time analysis. Another notable work, chiefly motivated by clinical trials, is on batched bandit problems (Perchet et al. (2016)), where the allocation policy must split trials into a small number of batches. They show that optimal regret bounds can be attained even when the number of batches is much smaller than $\log T$, where $T$ is the total number of time steps. There is a long list of published work discussing sequential approaches (including multi-armed bandits) and their role in randomized and adaptive clinical trials; some examples include Anscombe (1963); Armitage et al. (1975); Wei and Durham (1978); Lai et al. (1985); Wason and Jaki (2012); Sverdlov (2015); Villar et al. (2015); Ahuja and Birge (2016). Although multi-armed bandits have long been used as the motivating example for adaptive clinical trials, to our knowledge they still have limited application in real clinical trials.
1.8.2 Contextual bandits for mobile health

Mobile health is a term used for the practice of medicine and public health supported by mobile devices such as mobile phones, tablet computers, and wearable devices such as smart watches, for health services, information, and data collection. As contextual bandits provide a natural framework for personalized decision making over time, they are expected to be found useful in personalizing mobile health interventions to a specific person in a particular context. Recently, a concept called Just-in-time adaptive intervention or JITAI (Nahum-Shani et al. (2017)) was formulated to unify a number of decision making problems that arise in mobile health. JITAIs are increasingly being used to support health behavior changes in domains such as physical inactivity, alcohol use, mental illness, smoking and obesity. Contextual bandits seem to be a promising tool to help personalize JITAIs, where the arms chosen are the interventions provided to the users. Widespread use of technologies such as smart phones, tablets etc. and their portable nature enables individuals to access and receive interventions anytime and anywhere. Moreover, mobile-phone sensing (e.g., GPS), computing sensors in wearable devices, and digital footprints make it possible to monitor individuals continuously and hence know when and why to intervene. For example, MyBehavior (Rabbi et al. (2015)) is a lifestyle mobile intervention (a smart phone application) that uses multi-armed bandits. It uses sensor data to suggest a frequent behavior (e.g., walking) when the person is in a particular location and life context (e.g., on the way home after work) or running (which might happen less frequently). As the person repeats these behaviors, the online algorithm updates and the recommendations are prompted more frequently in that setting.
Although JITAIs are now being used extensively, they are still in an early stage of development and face some unique challenges like attrition (people abandoning their mobile health resources) and the rapid, unexpected nature of the problem. Tewari and Murphy (2017) review the existing contextual bandits literature in the light of mobile health applications and discuss specific technical and statistical challenges in this direction. Some of the statistical challenges mentioned are: good initialization of the learning algorithm, assessing usefulness of covariates, robustness to failure of assumptions, dealing with variables that are expensive to acquire or are missing, and finding interpretable policies.

1.9 Our contribution

In section 1.4.2, we discussed contextual bandits from a nonparametric estimation point of view and in section 1.6.2, we discussed the crucial component of incorporating delays in the stochastic bandit framework. However, to our knowledge, the combination of these two concepts has not been considered in the literature thus far; that is, contextual bandits with delayed rewards have not been considered using nonparametric methods of estimation. Also, not much work has been done to handle unrestricted delays in a fully contextual bandit problem. For example, most of the work is either restricted to constant delays (Dudik et al. (2011)) or finite decision sets for covariates (Joulani et al. (2013)). In the work presented in this dissertation, we try to develop contextual bandit strategies with unrestricted delays, using a nonparametric estimation approach. We present the cumulative regret analysis for these proposed strategies, and illustrate their performance both theoretically and empirically. The motivation for considering this framework comes from applications in sequential treatment allocation for health care as discussed in section 1.8.
In our setting, we consider the delay in observing rewards to be a random variable and allow the delays to be unbounded under some mild assumptions. In the first part of our work, we extend the proposed randomized allocation strategy of Yang and Zhu (2002) by incorporating delayed feedback in the framework. The strategy proposed is an annealed $\epsilon$-greedy type of strategy for contextual bandits with delays. We prove that the proposed strategy is strongly consistent, which shows that the cumulative reward of the proposed strategy is asymptotically equivalent to the optimal cumulative reward. For this result, the only assumption we make on the nature of the underlying mean reward functions is that they are continuous. Therefore, this general setup along with the mild assumptions on delays fits in a sequential treatment allocation framework for a variety of settings as discussed in chapter 2.

Applying nonparametric methods of estimation usually requires making good choices for hyperparameters, such as the binwidth in the histogram method, bandwidth in kernel regression (like the Nadaraya-Watson estimator), and $k$ in the $k$-nearest neighbors algorithm. Since sequential procedures like contextual bandits require time dependent updates of these hyperparameters for estimation at each time point, the user needs to determine an appropriate choice of the binwidth sequence. Another hyperparameter sequence that needs to be chosen is the exploration probability sequence, used in the annealed $\epsilon$-greedy strategy to balance the exploration-exploitation trade-off. These have been well-studied in the no-delay setting and choices that would guarantee optimal rates have been suggested in Qian and Yang (2016b). Once proper choices of these user determined sequences have been made, one faces the question of how to update both these sequences in the presence of delayed feedback.
There are two possible choices: 1) update the sequences only after observing a new reward, or 2) keep updating at all time points irrespective of having observed a reward. Based on these two choices, we consider two strategies that differ in how the exploration probability sequence is updated. However, for both strategies, the binwidth sequence is only updated when a new reward is observed, for reasons described in chapter 3. Then, through simulations we show that both the proposed strategies are advantageous in different scenarios. This helps us understand that black-box procedures for incorporating delays might not always be advisable and it is important to opt for strategies that take into account several factors like the complexity of the problem and expected magnitude of delays.

In the last part of this work on delayed feedback in contextual bandits, we provide finite-time regret bounds for our proposed strategies. We provide upper bounds for the cumulative regret for both the strategies discussed in the previous paragraph, which helps further our understanding of how these strategies compare in finite time. In addition, we try to relax the assumption of independence of delays with covariates, and provide finite-time analysis for this setting under some additional assumptions. Finally, we illustrate these results by applying them to the Yahoo! Front Page Module User Click dataset, a benchmark dataset extensively used for contextual bandit problems. We compare our results with the DeLinUCB algorithm of Vernade et al. (2018) and discuss future directions.

In the last chapter, we switch focus to a different and pressing problem of improving applicability of contextual bandits in health care. We consider a setup where a doctor can intervene in the automated decision making process of multi-armed bandit algorithms.
With contextual bandits, the decision maker uses a computer algorithm that can balance the tendency to apply treatments that have done well in the past with the option to try other treatments that might be more beneficial in the future. However, based on their experience, doctors may consider certain patient cases to be special and would want to allot a different treatment than the one proposed by the algorithm. Therefore, we develop a consistent treatment allocation strategy that holistically integrates the adaptive learning by the bandits algorithm and expert interventions.

The dissertation is organized as follows: In chapter 2 we propose a randomized algorithm which incorporates delayed feedback and show that it is strongly consistent. In chapter 3, we modify the strategy proposed in chapter 2 by understanding how the hyper-parameter sequences should be updated in the presence of delays, in order to balance the exploration-exploitation dilemma in a better way. Then in chapter 4, we provide finite-time regret upper bounds for the two strategies proposed in chapter 3 and illustrate their applicability on a real dataset. Finally in chapter 5, we change the focus from delayed feedback and propose a randomized allocation strategy that incorporates expert (doctor) intervention in the automated bandit strategy. Appendix A contains useful inequalities and probabilistic tools that are often used in the proofs. A table for common notations used in this dissertation is given in section A.2.

Chapter 2

Randomized allocation strategy for delayed nonparametric bandits

In this chapter, we propose a contextual bandit algorithm accounting for delayed rewards with sequential treatment decision making as the motivation. We use nonparametric estimation to estimate the functional relationship between the rewards and the covariates. We show that the proposed algorithm is strongly consistent in that the cumulative rewards almost surely converge to the optimal cumulative rewards.
2.1 Problem setup

The general contextual bandit setup of the problem is as follows. Assume that there are $\ell \ge 2$ arms available for allocation. Each arm allocation results in a reward which is obtained at some random time after the arm allocation. For each time $j \ge 1$, a treatment $I_j$ is allotted based on the data observed previously and the covariate $X_j$. We assume that the covariates are $d$-dimensional continuous random variables and take values in the hypercube $[0,1]^d$. Since the rewards can be obtained at some delayed time, we denote by $\{t_j \in \mathbb{R}^+, j \ge 1\}$ the observation times for the rewards for arms $\{I_j, j \ge 1\}$, respectively. Let $Y_{i,j}$ be the reward obtained at time $t_j \ge j$ for arm $i = I_j$. The mean reward with covariate $X_j$ for the $i$th arm is denoted as $f_i(X_j)$, $1 \le i \le \ell$. The observed reward with covariate $X_j$ by pulling the $i$th arm is modeled as

$$Y_{i,j} = f_i(X_j) + \epsilon_{i,j},$$

where $\epsilon_{i,j}$ denotes independent random error with $E(\epsilon_{i,j}) = 0$ and $\mathrm{Var}(\epsilon_{i,j}) < \infty$ for all $1 \le i \le \ell$ and $j \in \mathbb{N}$. The functions $f_i$ are assumed to be unknown and not of any given parametric form.

The rewards are observed at delayed times $t_j$; the delay in the reward for arm $I_j$ pulled at the $j$th time is given by a random variable $d_j := t_j - j$. Assume that these delays are mutually independent, independent of the covariates, and could be drawn from different distributions. That is, let $\{d_j, j \ge 1\}$ be a sequence of independent random variables with probability density functions $\{g_j, j \ge 1\}$ and cumulative distribution functions $\{G_j, j \ge 1\}$, respectively.

Let $\{X_j, j \ge 1\}$ be a sequence of covariates independently generated according to an unknown underlying probability distribution $P_X$, from a population supported in $[0,1]^d$. Let $\delta$ be a sequential allocation rule, which for each time $j$ chooses an arm $I_j$ based on the previous observations and $X_j$. The total mean reward up to time $n$ is $\sum_{j=1}^{n} f_{I_j}(X_j)$.
To evaluate the performance of the allocation strategy, let $i^*(x) = \arg\max_{1 \le i \le \ell} f_i(x)$ and $f^*(x) = f_{i^*(x)}(x)$. Without the knowledge of the random errors, the ideal performance occurs when the selected arms $I_1, \ldots, I_n$ match the optimal arms $i^*(X_1), \ldots, i^*(X_n)$, yielding the optimal total reward $\sum_{j=1}^{n} f^*(X_j)$. The ratio of these two quantities is the quantity of interest,

$$R_n(\delta) = \frac{\sum_{j=1}^{n} f_{I_j}(X_j)}{\sum_{j=1}^{n} f^*(X_j)}. \tag{2.1}$$

It can be seen that $R_n$ is a random variable no bigger than 1.

Definition 2.1.1. An allocation rule $\delta$ is said to be strongly consistent if $R_n(\delta) \to 1$ with probability 1, as $n \to \infty$.

In section 2.2, we propose an allocation rule which takes into account reward delays. Then in sections 2.2.1 and 2.3.1, respectively, we discuss the consistency of the proposed allocation rule under some assumptions and validate those assumptions when the histogram method is used to estimate the regression functions.

2.2 The proposed strategy

Let $Z^{n,i}$ denote the set of observations for arm $i$ whose rewards have been obtained up to time $n$, that is, $Z^{n,i} := \{(X_j, Y_{i,j}) : 1 \le t_j \le n \text{ and } I_j = i\}$. Let $\hat f_{i,n}$ denote the regression estimator of $f_i$ based on the data $Z^{n,i}$. Let $\{\pi_j, j \ge 1\}$ be a sequence of positive numbers in $[0,1]$ decreasing to zero.

Step 1. Initialize. Allocate each arm once; w.l.o.g., we can have $I_1 = 1, I_2 = 2, \ldots, I_\ell = \ell$. Since the rewards are not immediately obtained for each of these $\ell$ arms, we continue these forced allocations until we have at least one reward observed for each arm. Suppose that happens at time $m_0$.

Step 2. Estimate the individual functions $f_i$. For $n = m_0$, based on $Z^{n,i}$, estimate $f_i$ by $\hat f_{i,n}$ for $1 \le i \le \ell$ using the chosen regression procedure.

Step 3. Estimate the best arm. For $X_{n+1}$, let $\hat i_{n+1}(X_{n+1}) = \arg\max_{1 \le i \le \ell} \hat f_{i,n}(X_{n+1})$.

Step 4. Select and pull. Randomly select an arm with probability $1 - (\ell - 1)\pi_{n+1}$ for $i = \hat i_{n+1}$ and with probability $\pi_{n+1}$ for each of the other arms, $i \ne \hat i_{n+1}$. Let $I_{n+1}$ denote this selected arm.
Step 5. Update the estimates.

Step 5a. If a reward is obtained at the $(n+1)$th time (could be one or more rewards corresponding to one or more arms $I_j$, $1 \le j \le n+1$), update the function estimates of $f_i$ for the respective arm (or arms) for which the reward (or rewards) are obtained at the $(n+1)$th time.

Step 5b. If no reward is obtained at the $(n+1)$th time, use the previous function estimators, i.e. $\hat f_{i,n+1} = \hat f_{i,n}$ for all $i \in \{1, \ldots, \ell\}$.

Step 6. Repeat. Repeat steps 3-5 when the next covariate $X_{n+2}$ surfaces and so on.

The choice of $\pi_n$ in the randomization step 4 is crucial in determining how much exploration and exploitation is done at any phase of the trial. To emphasize the role of $\pi_n$, we may use $\delta_\pi$ to denote the allocation rule. In order to select the best arm as time progresses, $\pi_n$ needs to decrease to zero, but the rate of decrease will play a key role in determining how well the allocations work. For example, if in our set-up we have large delays for some arms, then it might be beneficial to decrease $\pi_n$ at a slower rate so that there is enough exploration and the accuracy of our estimates is not affected in the long run. We use a user-determined choice of $\pi_n$ in this work, that is, the sequence $\pi_n$ does not adapt to the data.

2.2.1 Consistency of the proposed strategy

Let $A_n := \{j : t_j \le n\}$ denote the time points for which rewards were obtained by time $n$. If $A_n$ is known, then the total number of observed rewards until time $n$, denoted by $\tau_n$, is also known. Recall that it is possible to observe multiple rewards at the same time point. Given $A_n$, let $\{s_k, k = 1, \ldots, \tau_n\}$ be the reordered sequence of these observed reward timings, $\{t_k, k \in A_n\}$, arranged in non-decreasing order.

Assumption 2.2.1. The regression procedure is strongly consistent in $L_\infty$ norm for all individual mean functions $f_i$ under the proposed allocation scheme. That is, $\|\hat f_{i,n} - f_i\|_\infty \overset{a.s.}{\to} 0$ as $n \to \infty$ for each $1 \le i \le \ell$.
As described in the allocation strategy in section 2.2, $\hat f_{i,n}$ is the estimator based on all previously observed rewards. That is, after initialization, the mean reward function estimators are only updated at the time points $\{s_k, k = 1, \ldots, \tau_n\}$, where $\tau_n$ is the number of rewards observed by time $n$. Therefore, this condition is equivalent to saying $\|\hat f_{i,s_n} - f_i\|_\infty \overset{a.s.}{\to} 0$ as $n \to \infty$.

Assumption 2.2.2. Mean functions satisfy $f_i(x) \ge 0$, $A = \sup_{1 \le i \le \ell} \sup_{x \in [0,1]^d} (f^*(x) - f_i(x)) < \infty$ and $E(f^*(X_1)) > 0$.

Theorem 2.2.3. Under Assumptions 2.2.1 and 2.2.2, the allocation rule $\delta_\pi$ is strongly consistent as $n \to \infty$.

Proof. Note that consistency holds only when the sequence $\{\pi_n, n \ge 1\}$ is chosen such that $\pi_n \to 0$ as $n \to \infty$. The proof is very similar to the proof in Yang and Zhu (2002). The details can be found in section 2.5.1.

Note that Assumption 2.2.1, although seemingly natural, is a strong assumption and it requires additional work to verify this assumption for a particular regression setting. We verify this assumption for the histogram method in section 2.3.1. On the other hand, Assumption 2.2.2 does not involve the estimation procedure and does not require any verification.

2.3 The Histogram method

Partition $[0,1]^d$ into $M = (1/h)^d$ hyper-cubes with side width $h$, assuming $h$ is chosen such that $1/h$ is an integer. For some $x \in [0,1]^d$, let $J(x)$ denote the set of time points for which the corresponding design points observed until time $n$ fall in the same cube as $x$, say $B(x)$, and for which the corresponding rewards are observed by time $n$. Let $N(x)$ denote the size of $J(x)$. That is, let $J(x) = \{j : X_j \in B(x), t_j \le n\}$ and $N(x) = \sum_{j=1}^{n} I\{X_j \in B(x), t_j \le n\}$. Furthermore, let $\bar J_i(x)$ be the subset of $J(x)$ corresponding to arm $i$ and let $\bar N_i(x)$ be the number of such time points, that is, $\bar J_i(x) = \{j \in J(x) : I_j = i\}$ and $\bar N_i(x) = \sum_{j=1}^{n} I\{I_j = i, X_j \in B(x), t_j \le n\}$. Then the histogram estimate for $f_i(x)$ is defined as

$$\hat f_{i,n}(x) = \frac{1}{\bar N_i(x)} \sum_{j \in \bar J_i(x)} Y_j.$$
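As an illustration, the histogram estimate with delayed rewards can be computed as in the following minimal Python sketch. The function and argument names (`cube_index`, `histogram_estimate`, `t_obs`) are hypothetical and not part of this work, and returning NaN when no observed reward for the arm falls in the cube of $x$ is an assumption of the sketch, since the estimate is undefined in that case.

```python
import math

def cube_index(x, h):
    """Index of the cube of side h containing x (assumes 1/h is an integer)."""
    m = round(1.0 / h)
    # A coordinate equal to 1.0 is placed in the last cube.
    return tuple(min(int(c * m), m - 1) for c in x)

def histogram_estimate(x, arm, n, h, X, Y, arms, t_obs):
    """Average of the rewards of `arm` observed by time n in the cube of x.

    X[j], Y[j], arms[j], t_obs[j] hold the covariate, reward, pulled arm and
    observation time t_j of one round; rewards with t_j > n are ignored.
    Returns NaN if the cube holds no observed reward (sketch assumption).
    """
    target = cube_index(x, h)
    vals = [Y[j] for j in range(len(Y))
            if arms[j] == arm and t_obs[j] <= n
            and cube_index(X[j], h) == target]
    return sum(vals) / len(vals) if vals else math.nan

# Toy data: four pulls of arm 1; the second reward only arrives at time 5.
X = [(0.1, 0.1), (0.2, 0.3), (0.8, 0.9), (0.4, 0.4)]
Y = [1.0, 3.0, 5.0, 7.0]
arms = [1, 1, 1, 1]
t_obs = [1, 5, 3, 4]
print(histogram_estimate((0.25, 0.25), 1, 4, 0.5, X, Y, arms, t_obs))
print(histogram_estimate((0.25, 0.25), 1, 5, 0.5, X, Y, arms, t_obs))
```

In the toy data the estimate at time 4 uses only the two rewards already observed in the cube, and the delayed reward enters the average once its observation time has passed.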
For the estimator to behave well, a proper choice of the bandwidth, $h = h_n$, is necessary. Although one could choose different widths $h_{i,n}$ for estimating different $f_i$'s, for simplicity the same bandwidth $h_n$ is used in the following sections. For notational convenience, when the analysis is focused on a single arm, $i$ is dropped from the subscript of $\hat f$, $\bar N$ and $\bar J$. Other nonparametric methods like nearest-neighbors, kernel methods, spline fitting and wavelets can also be considered for estimation. Assumption 2.2.1 could be verified for these methods using the same broad approach as illustrated in the following sections for the histogram method, along with some method specific mathematical tools and assumptions.

2.3.1 Allocation with histogram estimates

Here, we show that the histogram estimation method along with the allocation scheme described in section 2.2 leads to strong consistency under some reasonable conditions on random errors, design distribution, mean reward functions and delays. As already discussed in section 2.2.1, we only need to verify that Assumption 2.2.1 holds for histogram method estimators. Along with Assumption 2.2.2, we make the following assumptions.

Assumption 2.3.1. The design distribution $P_X$ is dominated by the Lebesgue measure with a density $p(x)$ uniformly bounded above and away from 0 on $[0,1]^d$; that is, $p(x)$ satisfies $\underline{c} \le p(x) \le \bar{c}$ for some positive constants $\underline{c} < \bar{c}$.

Assumption 2.3.2. The errors satisfy a moment condition: there exist positive constants $v$ and $c$ such that, for all $m \ge 2$, the Bernstein condition is satisfied, that is, $E|\epsilon_{ij}|^m \le \frac{m!}{2} v^2 c^{m-2}$.

Assumption 2.3.3. The delays, $\{d_j, j \ge 1\}$, are independent of each other, of the choice of arms and also of the covariates.

Assumption 2.3.4. Let the partial sums of delay distributions satisfy $\sum_{j=1}^{n} G_j(n-j) = \Omega(n^{\alpha} \log^{\beta} n)$ (see footnote 1) for some $\alpha > 0$, $\beta \in \mathbb{R}$ or for $\alpha = 0$ and $\beta > 1$.
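To get a feel for Assumption 2.3.4, note that $\sum_{j=1}^{n} G_j(n-j) = \sum_{j=1}^{n} P(t_j \le n)$ is the expected number of the first $n$ rewards that have been observed by time $n$. The short Python sketch below evaluates this partial sum for identically distributed geometric delays. The parametrization $P(d = k) = p(1-p)^k$ on $\{0, 1, 2, \ldots\}$ and the helper names are assumptions made for illustration only.

```python
def geometric_cdf(k, p):
    """P(d <= k) for a delay d on {0, 1, 2, ...} with P(d = k) = p (1 - p)^k."""
    return 0.0 if k < 0 else 1.0 - (1.0 - p) ** (k + 1)

def partial_sum(n, cdf):
    """sum_{j=1}^{n} G(n - j), the expected number of rewards seen by time n."""
    return sum(cdf(n - j) for j in range(1, n + 1))

p = 0.3  # hypothetical success probability used for illustration
for n in (10, 100, 1000):
    s = partial_sum(n, lambda k: geometric_cdf(k, p))
    # Print n, the partial sum, and the fraction of rewards observed by time n.
    print(n, round(s, 3), round(s / n, 4))
```

Under this parametrization the sum has the closed form $n - (1-p)(1-(1-p)^n)/p$, which grows linearly in $n$, so the assumption would hold with $\alpha = 1$ and $\beta = 0$.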
Footnote 1: $f(n) = \Omega(g(n))$ if, for some positive constant $c$, $f(n) \ge c\,g(n)$ when $n$ is large enough.

Note that the choice $n^{\alpha} \log^{\beta} n$ could be generalized to a sub-linear function $q(n)$ with a growth rate faster than $\log n$.

2.3.2 Number of observations in a small cube

From Assumption 2.3.1 and Assumption 2.3.3, we have that for a fixed cube $B$ with side width $h_n$ at time $n$,

$$P(X_j \in B, t_j \le n) = P(X_j \in B)\,P(t_j \le n) \ge \underline{c}\, h_n^d\, G_j(n-j).$$

Let $N$ be the number of observations that fall in $B$ and are observed by time $n$, that is, $N = \sum_{j=1}^{n} I\{X_j \in B, t_j \le n\}$. It is easily seen that $N$ is a random variable with expectation $\beta \ge \sum_{j=1}^{n} \underline{c}\, h_n^d\, G_j(n-j)$. From the extended Bernstein inequality (A.2), we have

$$P\left( N \le \frac{\underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{2} \right) \le \exp\left( -\frac{3\, \underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{28} \right). \tag{2.2}$$

Lemma 2.3.5. Let $\epsilon > 0$ be given. Suppose that $h$ is small enough such that $w(h; f) < \epsilon$. Then the histogram estimator $\hat f_n$ satisfies

$$P_{A_n, X^n}\left( \|\hat f_n - f\|_\infty \ge \epsilon \right) \le M \exp\left( -\frac{3 \pi_n \min_{1 \le b \le M} N_b}{28} \right) + 2M \exp\left( -\frac{\min_{1 \le b \le M} N_b\, \pi_n^2\, (\epsilon - w(h; f))^2}{8\left( v^2 + c\,(\pi_n/2)(\epsilon - w(h; f)) \right)} \right),$$

where the probability $P_{A_n, X^n}$ denotes conditional probability given the design points $X^n = (X_1, X_2, \ldots, X_n)$ and $A_n = \{j : t_j \le n\}$. Here, $N_b$ is the number of design points for which the rewards have been observed by time $n$ such that they fall in the $b$th small cube of the partition of the unit cube at time $n$.

Proof. The proof of Lemma 2.3.5 is included in section 2.5.

Theorem 2.3.6. Suppose Assumptions 2.2.2 and 2.3.1-2.3.4 are satisfied. If for some $\alpha > 0$ and $\beta \in \mathbb{R}$, or $\alpha = 0$ and $\beta > 1$, $h_n$ and $\pi_n$ are chosen to satisfy

$$n^{\alpha} (\log n)^{\beta - 1} h_n^d \pi_n^2 \to \infty, \tag{2.3}$$

then the allocation rule $\delta_\pi$ is strongly consistent.

Proof of Theorem 2.3.6. The histogram technique partitions the unit cube into $M = (1/h)^d$ small cubes. For each small cube $B_b$, $1 \le b \le M$, in the partition of the unit cube, let $N_b$ denote the number of time points for which the corresponding design points fall in the cube $B_b$ and the corresponding arm rewards are observed by time $n$. In other words, $N_b = \sum_{j=1}^{n} I\{X_j \in B_b, t_j \le n\}$.
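Using inequality (2.2) we have

$$P\left( N_b \le \frac{\underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{2} \right) \le \exp\left( -\frac{3\, \underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{28} \right)$$

$$\Rightarrow\; P\left( \min_{1 \le b \le M} N_b \le \frac{\underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{2} \right) \le M \exp\left( -\frac{3\, \underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{28} \right). \tag{2.4}$$

Let $W_1, \ldots, W_n$ be Bernoulli random variables indicating whether the $i$th arm is selected ($W_j = 1$) for time point $j$, or not ($W_j = 0$). Note that, conditional on the previous observations and $X_j$, the probability of $W_j = 1$ is almost surely bounded below by $\pi_j \ge \pi_n$ for $1 \le j \le n$. Let $w(h_n; f_i)$ be the modulus of continuity as in Definition A.0.2. Note that, under the continuity assumption on $f_i$, we have $w(h_n; f_i) \to 0$ as $h_n \to 0$. Thus, for any $\epsilon > 0$, when $h_n$ is small enough, $\epsilon - w(h_n; f_i) \ge \epsilon/2$. Consider

$$P\left( \|\hat f_{i,n} - f_i\|_\infty > \epsilon \right) = P\left( \|\hat f_{i,n} - f_i\|_\infty > \epsilon,\; \min_{1 \le b \le M} N_b \ge \frac{\underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{2} \right) + P\left( \|\hat f_{i,n} - f_i\|_\infty > \epsilon,\; \min_{1 \le b \le M} N_b < \frac{\underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{2} \right)$$

$$\le E\, P_{A_n, X^n}\left( \|\hat f_{i,n} - f_i\|_\infty > \epsilon,\; \min_{1 \le b \le M} N_b \ge \frac{\underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{2} \right) + P\left( \min_{1 \le b \le M} N_b < \frac{\underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{2} \right),$$

where $P_{A_n, X^n}$ denotes conditional probability given the design points until time $n$, $X^n = \{X_1, X_2, \ldots, X_n\}$, and the event $A_n := \{j : t_j \le n\}$. From Lemma 2.3.5, we have that given the design points and the time points for which rewards were observed, for any $\epsilon > 0$, when $h$ is small enough,

$$P_{A_n, X^n}\left( \|\hat f_n - f\|_\infty \ge \epsilon \right) \le M \exp\left( -\frac{3 \pi_n \min_{1 \le b \le M} N_b}{28} \right) + 2M \exp\left( -\frac{\min_{1 \le b \le M} N_b\, \pi_n^2\, (\epsilon - w(h_n; f))^2}{8\left( v^2 + c\,(\pi_n/2)(\epsilon - w(h_n; f)) \right)} \right).$$

Using the above inequality and (2.4), we have

$$P\left( \|\hat f_{i,n} - f_i\|_\infty > \epsilon \right) \le 2M \exp\left( -\frac{\underline{c}\, h_n^d \left( \sum_{j=1}^{n} G_j(n-j) \right) \pi_n^2\, (\epsilon - w(h_n; f_i))^2}{16\left( v^2 + c\,(\pi_n/2)(\epsilon - w(h_n; f_i)) \right)} \right) + M \exp\left( -\frac{3\, \underline{c}\, h_n^d\, \pi_n \sum_{j=1}^{n} G_j(n-j)}{56} \right) + M \exp\left( -\frac{3\, \underline{c}\, h_n^d \sum_{j=1}^{n} G_j(n-j)}{28} \right).$$

It can be shown that the above upper bound is summable in $n$ under the condition

$$\frac{h_n^d\, \pi_n^2 \sum_{j=1}^{n} G_j(n-j)}{\log n} \to \infty. \tag{2.5}$$

It is easy to see that this follows from Assumption 2.3.4 and (2.3). Since $\epsilon$ is arbitrary, by the Borel-Cantelli lemma, we have that $\|\hat f_{i,n} - f_i\|_\infty \to 0$ almost surely. This is true for all arms $1 \le i \le \ell$. Hence, this completes the proof of Theorem 2.3.6.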
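The allocation rule of section 2.2 combined with histogram estimates can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `cube`, `estimate` and `run` are hypothetical, arms are indexed from 0, $1/h_n$ is rounded to the nearest integer, an arm with no observed reward in the cube of the current covariate falls back to the average of all its observed rewards, the exploration step always fires when $(\ell - 1)\pi_n \ge 1$, and the three mean functions are one reading of those used in the simulation study of this chapter. Only rewards with observation time $t_j$ before the current round enter the estimates, which mirrors steps 5a and 5b.

```python
import math
import random

def cube(x, h):
    """Index of the histogram cube of side h containing x (1/h is rounded)."""
    m = max(1, round(1.0 / h))
    return tuple(min(int(c * m), m - 1) for c in x)

def estimate(x, arm, now, h, log):
    """Histogram estimate of f_arm(x) from rewards observed by time `now`."""
    target = cube(x, h)
    seen = [(xj, y) for (xj, a, y, t) in log if a == arm and t <= now]
    local = [y for (xj, y) in seen if cube(xj, h) == target]
    vals = local or [y for (_, y) in seen]  # empty cube: arm-wide mean (assumption)
    return sum(vals) / len(vals)

def run(n_rounds, fs, pi, h, delay, noise_sd=0.5, seed=0):
    """Average per-round regret of the randomized allocation rule."""
    rng = random.Random(seed)
    n_arms, log, regret = len(fs), [], 0.0
    for n in range(1, n_rounds + 1):
        x = (rng.random(), rng.random())
        observed = {a for (_, a, _, t) in log if t <= n - 1}
        if len(observed) < n_arms:
            arm = (n - 1) % n_arms  # Step 1: forced allocation
        else:
            # Steps 2-3: estimate each f_i and pick the apparent best arm.
            est = [estimate(x, i, n - 1, h(n), log) for i in range(n_arms)]
            best = max(range(n_arms), key=lambda i: est[i])
            arm = best
            # Step 4: each other arm is chosen with probability pi(n).
            if rng.random() < (n_arms - 1) * pi(n):
                arm = rng.choice([i for i in range(n_arms) if i != best])
        y = fs[arm](x) + rng.gauss(0.0, noise_sd)
        log.append((x, arm, y, n + delay(rng)))  # visible at t_j = j + d_j
        regret += max(f(x) for f in fs) - fs[arm](x)
    return regret / n_rounds

fs = [
    lambda x: 0.7 * (x[0] + x[1]),
    lambda x: 0.5 * x[0] ** 0.75 + math.sin(x[1]),
    lambda x: 2 * x[0] / (0.5 + (1.5 + x[1]) ** 1.5),
]
geometric = lambda rng, p=0.3: int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
no_delay = lambda rng: 0
pi = lambda n: n ** -0.25
h = lambda n: n ** (-1.0 / 6.0)
for name, d in (("no delay", no_delay), ("geometric", geometric)):
    print(name, round(run(300, fs, pi, h, d), 4))
```

The sketch recomputes every estimate from the full log at each round, which keeps the code close to the description of the strategy but is quadratic in the number of rounds; an incremental update of the cube averages would be the natural choice for longer runs.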
2.3.3 Effects of reward delay distributions

As one would expect, the amount of delay in observing the rewards will have a considerable effect on the speed of sequential learning. In terms of treatment allocation, if there are substantial delays in observing patient responses for a particular treatment, the learning for that treatment will slow down and as a result the efficiency of the allocation strategy will decrease. Therefore, Assumption 2.3.4 imposes some restrictions on the delay distributions to ensure that at least a small proportion of rewards will be obtained in finite time. It is of interest to see how the delay distribution affects the rate at which $\pi_n$ and $h_n$ are allowed to decrease. This relationship can be understood by examining condition (2.3) for Theorem 2.3.6.

Note that Assumption 2.3.4 and (2.3) in Theorem 2.3.6 can be generalized to include any function $q(x)$ with a growth rate faster than logarithmic. We assume $\sum_{j=1}^{n} G_j(n-j) = \Omega(q(n))$, where $q(n)$ satisfies $q(n)/\log(n) \to \infty$ as $n \to \infty$. Then it is easy to see that $h_n$ and $\pi_n$ can be chosen such that

$$\frac{h_n^d\, \pi_n^2\, q(n)}{\log(n)} \to \infty \text{ as } n \to \infty, \tag{2.6}$$

which implies that condition (2.5) holds. A possible advantage of this is that we allow a wide range of possible delay distributions with mild restrictions on the delays. Below, we consider some cases of the delay distributions and see how they affect exploration ($\pi_n$) and bandwidth ($h_n$) of the histogram estimator as time progresses.

1. In condition (2.3), $q(n) = n^{\alpha} \log^{\beta} n$ for $\alpha > 0$ and $\beta \in \mathbb{R}$ or $\alpha = 0$ and $\beta > 1$. Let us first consider the case when $\alpha = 0$ and $\beta > 1$; we have $q(n) = \log^{\beta} n$ for $\beta > 1$ and we want $\sum_{j=1}^{n} G_j(n-j) = \Omega(\log^{\beta} n)$. Consider $\pi_n = (\log n)^{-(\beta-1)/(2+d)}$ for $n > m_0$ and $\beta > 1$; then for (2.5) to hold we need the bandwidth $h_n$ also to be of order $\Omega((\log n)^{-(\beta-1)/(2+d)})$. For example, $h_n = (\log n)^{-(\beta-1)/(\beta(2+d))}$ would guarantee consistency.
Notice that with these $\pi_n$ and $h_n$, one would spend a lot of time in exploration and the bandwidth would also decay very slowly, which would affect the accuracy of the reward function estimates until $n$ is sufficiently large. Notice that the restriction that the partial sum of probability distributions for the delays be at least of the order $\log^{\beta} n$ gives the possibility of modeling cases with extremely large delays. An example is clinical studies in which the outcome of interest is survival time and we want to administer treatments for a disease such that the survival time is maximized. With the unprecedented advances in drug development, the life expectancy of patients is more likely to increase, hence the survival time for a patient given any treatment would be large. Therefore, the assumption that partial sums of probability distributions for the delays until time $n$ need only be at least $\log^{\beta} n$ seems to be quite reasonable when the expected waiting times (in this case survival times) are long. Examples are diseases like diabetes and hypertension, which have a long survival time since they cannot be cured but can be controlled with medications. These diseases also have fairly high prevalence, so a large sample size to be able to get close to optimality would not be a problem. For such diseases, assuming that one would only observe the responses (survival times) of a small fraction of patients in finite time seems reasonable.

2. For the case when $\alpha > 0$ and $\beta \in \mathbb{R}$, we have that $\sum_{j=1}^{n} G_j(n-j) = \Omega(n^{\alpha} \log^{\beta} n)$. Consider $\pi_n = n^{-\alpha/(2+d)}$ for $n > m_0$; then for the condition (2.5) to hold we need $h_n$ to also be of order $\Omega(n^{-\alpha/(2+d)})$. For example, $h_n = n^{-\alpha/(2(2+d))}$ results in $h_n^d \pi_n^2 n^{\alpha} \log^{\beta-1}(n) = n^{\alpha d/(2(2+d))} \log^{\beta-1}(n) \to \infty$ as $n \to \infty$, irrespective of the value of $\beta$. Here the lower bound on the partial sums of probability distributions for the delays can grow faster than in the previous case, depending on the values of $\alpha$ and $\beta$.
This restriction of order $n^{\alpha} \log^{\beta} n$ can model cases with moderately large delays. From a clinical point of view, one could model diseases in which treatments show their effect in a short to moderate duration of time, for example diseases like diarrhea, common cold, headache, and nutritional deficiencies. Here the response of interest would be improvement in the condition of a patient as a result of a treatment. For such diseases, one can expect to see the treatment effects on patients in a short period of time. Hence, the delay in observing treatment results will not be too long. If the response considered was survival (survived or not), then stroke could also fall in this category because of high mortality.

Note that Assumption 2.3.4 only restricts the proportion of rewards expected to be observed in the long run. Therefore, it is possible for strong consistency to be achieved even when there is infinite delay in observing the rewards of some arms (non-observance of some rewards).

2.4 Simulation study

We conduct a simulation study to compare the effect of different delay scenarios on the per-round average regret of our proposed strategy. The per-round regret is given by

$$r_n(\delta) = \frac{1}{n} \sum_{j=1}^{n} \left( f^*(X_j) - f_{I_j}(X_j) \right).$$

Note that if $\frac{1}{n} \sum_{j=1}^{n} f^*(X_j)$ is eventually bounded above and away from 0 with probability 1, then $R_n(\delta) \to 1$ a.s. is equivalent to $r_n(\delta) \to 0$ a.s.

2.4.1 Simulation setup

Consider the number of arms, $\ell = 3$, and the covariate space to be two-dimensional, $d = 2$. Let $X_n = (X_{n1}, X_{n2})$ where $X_{ni} \overset{i.i.d.}{\sim} \mathrm{Unif}(0,1)$. We assume that the errors $\epsilon_n \overset{ind}{\sim} 0.5\,N(0,1)$. The first 30 rounds were used for initialization. The following true mean reward functions are used:

$$f_1(x) = 0.7(x_1 + x_2), \quad f_2(x) = 0.5 x_1^{0.75} + \sin(x_2), \quad f_3(x) = \frac{2 x_1}{0.5 + (1.5 + x_2)^{1.5}}.$$

We consider the following delay scenarios and run simulations until $N = 10000$.
1) No delay;
2) Delay 1: Geometric delay with probability of success (observing the reward) $p = 0.3$;
3) Delay 2: Every 5th reward is not observed by time $N$ and other rewards are obtained with a geometric ($p = 0.3$) delay;
4) Delay 3: Each case has probability 0.7 to delay and the delay is half-normal with scale parameter $\sigma = 1500$;
5) Delay 4: In this case we increase the number of non-observed rewards. Divide the data into four equal consecutive parts (quarters), such that, in part 1, we only observe every 10th (with Geom(0.3) delay) observation by time $N$ and do not observe the remaining; in part 2, we only observe every 15th observation; in part 3, only observe every 20th observation; in part 4, only observe every 25th observation.

In Figure 2.1, we plot the per-round regret vs time by delay type for four combinations of $\pi_n$ and $h_n$. As one would expect (see Figure 2.1), the severity of delay has a clear effect on the regret, and for delay scenarios where a large number of rewards are not observed in finite time, the regret is comparatively higher. Note that most delay scenarios for which a substantial number of rewards can be obtained in finite time tend to converge in quite similar patterns.

[Figure 2.1 panels: ($\pi_n = n^{-1/4}$, $h_n = n^{-1/6}$); ($\pi_n = n^{-1/4}$, $h_n = \log(n)^{-1}$); ($\pi_n = n^{-1/6}$, $h_n = n^{-1/6}$); ($\pi_n = n^{-1/6}$, $h_n = \log(n)^{-1}$). Horizontal axis: Time index (0 to 10000). Vertical axis: Average regret (0.00 to 0.15). Curves: No delay, Delay 1, Delay 2, Delay 3, Delay 4.]

Figure 2.1: Per-round regret for the proposed strategy for different delay scenarios. The grid of plots represents 4 different combinations of choices for $\{\pi_n\}$ and $\{h_n\}$.
For a given row, $\pi_n$ remains fixed and $h_n$ varies, and vice versa for columns.

Choice of $\{\pi_n\}$ and $\{h_n\}$: According to Theorem 2, if $\pi_n$ and $h_n$ are chosen such that condition (2.3) is met, consistency of the allocation rule follows. Therefore, for the case $d = 2$, which is the case of the simulation setting, we have to choose sequences that decay more slowly than ($\pi_n = n^{-1/2}$, $h_n = n^{-1/2}$), even in the case of no delays. Keeping this in mind, we chose two sequences for $\pi_n$ ($n^{-1/4}$, $n^{-1/6}$) and two sequences for $h_n$ ($\log^{-1} n$, $n^{-1/6}$). Note that, in Figure 2.1, for a given row, $\pi_n$ remains fixed while $h_n$ varies, and vice versa for columns. It can be seen that the regret gets worse when $h_n$ decays too fast (in our range of $n$, as $N = 10000$), especially for the scenario (Delay 4) with an increasing number of non-observed rewards, possibly because of a violation of condition (2.3). Also notice that a slowly decaying $\pi_n$ has higher regret (last row). This could be because of a large randomization error that leads to a high exploration price. In general, there is a large pool of choices for $h_n$ and $\pi_n$ that satisfy condition (2.3), as can be seen from Figure 2.1.

The simulation study is replicated 60 times and the averaged per-round regret is plotted in section 2.6, revealing very similar trends and results. The best choice of the hyperparameter sequences amongst those considered seems to be $h_n = n^{-1/6}$ and $\pi_n = \log^{-2} n$, as seen in Figure 2.2. It is also observed in Figure 2.2 that using a fast decaying $\{\pi_n\}$ results in better performance (low cumulative regret) for relatively slow to moderate delay scenarios, that is, for delay scenarios 0-3. However, a thorough understanding of the finite-time regret rates and further research would be needed to evaluate optimal choices of $\{\pi_n\}$ and $\{h_n\}$ for a given scenario.

[Figure 2.2: three panels of average regret (y-axis, 0.00 to 0.15) against time index (x-axis, 0 to 10000), one curve per delay scenario (No delay, Delay 1, Delay 2, Delay 3, Delay 4). Panels: ($\pi_n = \log(n)^{-2}$, $h_n = n^{-1/6}$); ($\pi_n = \log(n)^{-2}$, $h_n = n^{-1/4}$); ($\pi_n = \log(n)^{-2}$, $h_n = \log(n)^{-1}$).]

Figure 2.2: Per-round regret averaged over 60 replications for the proposed strategy in section 2.2 for different delay situations. $\pi_n = \log^{-2} n$ and $h_n$ decays faster as we move from left to right.

2.5 Proofs

2.5.1 Proof of consistency of the proposed strategy

Proof of Theorem 1. Since the ratio $R_n(\delta_\pi)$ is always upper bounded by 1, we only need to work on the lower bound direction. Note that,
\[
R_n(\delta_\pi) = \frac{\sum_{j=1}^n f_{\hat{i}_j}(X_j)}{\sum_{j=1}^n f^*(X_j)} + \frac{\sum_{j=1}^n \big(f_{I_j}(X_j) - f_{\hat{i}_j}(X_j)\big)}{\sum_{j=1}^n f^*(X_j)}
\ge \frac{\sum_{j=1}^n f_{\hat{i}_j}(X_j)}{\sum_{j=1}^n f^*(X_j)} - \frac{\frac{1}{n}\sum_{j=1}^n A\, I\{I_j \neq \hat{i}_j\}}{\frac{1}{n}\sum_{j=1}^n f^*(X_j)},
\]
where the inequality follows from Assumption 2.2.2. Let $U_j = I\{I_j \neq \hat{i}_j\}$. Since $(1/n)\sum_{j=1}^n f^*(X_j)$ converges a.s. to $E f^*(X) > 0$, the second term on the right hand side of the above inequality converges to zero almost surely if $(1/n)\sum_{j=1}^n U_j \xrightarrow{a.s.} 0$. Note that for $j \ge m_0 + 1$, the $U_j$'s are independent Bernoulli random variables with success probability $(\ell - 1)\pi_j$. Since
\[
\sum_{j=m_0+1}^{\infty} \mathrm{Var}\left(\frac{U_j}{j}\right) = \sum_{j=m_0+1}^{\infty} \frac{(\ell-1)\pi_j\big(1 - (\ell-1)\pi_j\big)}{j^2} < \infty,
\]
we have that $\sum_{j=m_0+1}^{\infty} \big(U_j - (\ell-1)\pi_j\big)/j$ converges almost surely. It then follows by Kronecker's lemma that
\[
\frac{1}{n}\sum_{j=1}^n \big(U_j - (\ell-1)\pi_j\big) \xrightarrow{a.s.} 0.
\]
We know that $\pi_j \to 0$ as $j \to \infty$ (the speed depending on the delay times), and therefore $(1/n)\sum_{j=1}^n (\ell-1)\pi_j \to 0$. Hence, $(1/n)\sum_{j=1}^n U_j \to 0$ a.s. To show that $R_n(\delta_\pi) \xrightarrow{a.s.} 1$, it remains to show that
\[
\frac{\sum_{j=1}^n f_{\hat{i}_j}(X_j)}{\sum_{j=1}^n f^*(X_j)} \xrightarrow{a.s.} 1, \quad \text{or equivalently,} \quad \frac{\sum_{j=1}^n \big(f_{\hat{i}_j}(X_j) - f^*(X_j)\big)}{\sum_{j=1}^n f^*(X_j)} \xrightarrow{a.s.} 0.
\]
Recall from section 2.2.1 that, given the observed reward timings $\{t_j : t_j \le n, 1 \le j \le n\}$, we let $\{s_k : k = 1, \ldots, \tau_n\}$ be the reordered sequence of the observed reward timings, arranged in increasing order. Then for any $j$ with $m_0 + 1 \le j \le n$, there exists an $s_{k_j}$, $k_j \in \{1, \ldots, \tau_n\}$, such that $s_{k_j} \le j < s_{k_j+1}$. Also note that as $j \to \infty$, we have $k_j \to \infty$. By the definition of $\hat{i}_j$, for $j \ge m_0 + 1$, $\hat{f}_{\hat{i}_j, s_{k_j}}(X_j) \ge \hat{f}_{i^*(X_j), s_{k_j}}(X_j)$ and thus,
\[
\begin{aligned}
f_{\hat{i}_j}(X_j) - f^*(X_j) &= f_{\hat{i}_j}(X_j) - \hat{f}_{\hat{i}_j, s_{k_j}}(X_j) + \hat{f}_{\hat{i}_j, s_{k_j}}(X_j) - \hat{f}_{i^*(X_j), s_{k_j}}(X_j) + \hat{f}_{i^*(X_j), s_{k_j}}(X_j) - f^*(X_j) \\
&\ge f_{\hat{i}_j}(X_j) - \hat{f}_{\hat{i}_j, s_{k_j}}(X_j) + \hat{f}_{i^*(X_j), s_{k_j}}(X_j) - f_{i^*(X_j)}(X_j) \\
&\ge -2 \sup_{1 \le i \le \ell} \|\hat{f}_{i, s_{k_j}} - f_i\|_\infty.
\end{aligned}
\]
For $1 \le j \le m_0$, we have $f_{\hat{i}_j}(X_j) - f^*(X_j) \ge -A$. Based on Assumption 2.2.1, $\|\hat{f}_{i, s_{k_j}} - f_i\|_\infty \xrightarrow{a.s.} 0$ as $j \to \infty$ for each $i$, and thus $\sup_{1 \le i \le \ell} \|\hat{f}_{i, s_{k_j}} - f_i\|_\infty \xrightarrow{a.s.} 0$. Then it follows that, for $n > m_0$,
\[
\frac{\sum_{j=1}^n \big(f_{\hat{i}_j}(X_j) - f^*(X_j)\big)}{\sum_{j=1}^n f^*(X_j)} \ge \frac{-A m_0/n - (2/n)\sum_{j=m_0+1}^n \sup_{1 \le i \le \ell} \|\hat{f}_{i, s_{k_j}} - f_i\|_\infty}{(1/n)\sum_{j=1}^n f^*(X_j)}.
\]
The right hand side converges to 0 almost surely and hence the conclusion follows.

2.5.2 A probability bound for the histogram method

Consider the regression model with $i$ dropped for notational convenience,
\[
Y_j = f(x_j) + \epsilon_j,
\]
where the $\epsilon_j$'s are independent errors satisfying the moment condition in Assumption 2.3.2 of section 2.3.1. Let $W_1, \ldots, W_n$ be Bernoulli random variables that decide whether arm $i$ is observed or not, that is, $W_j = I\{I_j = i\}$. Assume that, for each $1 \le j \le n$, $W_j$ is independent of $\{\epsilon_k : k \ge j\}$. Let $\hat{f}_n$ be the histogram estimator of $f$. Let $A_n$ denote the set of indices (time points) for which the rewards were observed by time $n$, that is, $A_n := \{j : t_j \le n\}$, and let $X^n = \{X_1, X_2, \ldots, X_n\}$ denote the design points until time $n$.

Proof of Lemma 2.3.5. Note that the inequality of Lemma 2.3.5 trivially holds if $\min_{1 \le b \le M} N_b = 0$. Therefore, let us assume that $\min_{1 \le b \le M} N_b > 0$. Let $N(x)$ denote the number of time points for which the corresponding design points $x_j$ fall in the same cube as $x$ and for which the corresponding rewards are observed by time $n$.
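Let $J(x)$ denote the set of indices $1 \le j \le n$ of such design points. Let $\bar{J}(x)$ be the subset of $J(x)$ where arm $i$ is chosen (i.e. where $W_j = 1$) and let $\bar{N}(x)$ be the number of such design points (note that $i$ is dropped for notational convenience). For arm $i$, we consider the histogram estimator
\[
\hat{f}_n(x) = \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} Y_j = f(x) + \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \big(f(x_j) - f(x)\big) + \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \epsilon_j
\]
\[
\Rightarrow \quad |\hat{f}_n(x) - f(x)| \le w(h; f) + \left| \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \epsilon_j \right|,
\]
where $w(h; f)$ is the modulus of continuity. For any $\epsilon > w(h; f)$, with the given design points and the time points for which rewards have been observed by time $n$,
\[
P_{A_n, X^n}\big(\|\hat{f}_n - f\|_\infty \ge \epsilon\big) \le P_{A_n, X^n}\left( \sup_x \left| \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \epsilon_j \right| \ge \epsilon - w(h; f) \right).
\]
Note that, within the same small cube $B$, $N(x)$ and $\bar{N}(x)$, and likewise $J(x)$ and $\bar{J}(x)$, do not depend on the particular $x \in B$. Let $x_0$ be a fixed point in $B$. Then consider,
\[
\begin{aligned}
& P_{A_n, X^n}\left( \sup_{x \in B} \left| \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \epsilon_j \right| \ge \epsilon - w(h; f) \right) \\
&= P_{A_n, X^n}\left( \left| \sum_{j \in \bar{J}(x_0)} \epsilon_j \right| \ge \bar{N}(x_0)\big(\epsilon - w(h; f)\big) \right) \\
&= P_{A_n, X^n}\left( \left| \sum_{j \in J(x_0)} W_j \epsilon_j \right| \ge N(x_0) \frac{\bar{N}(x_0)}{N(x_0)} \big(\epsilon - w(h; f)\big) \right) \\
&= P_{A_n, X^n}\left( \left| \sum_{j \in J(x_0)} W_j \epsilon_j \right| \ge N(x_0) \frac{\bar{N}(x_0)}{N(x_0)} \big(\epsilon - w(h; f)\big),\ \frac{\bar{N}(x_0)}{N(x_0)} > \frac{\pi_n}{2} \right) \\
&\quad + P_{A_n, X^n}\left( \left| \sum_{j \in J(x_0)} W_j \epsilon_j \right| \ge N(x_0) \frac{\bar{N}(x_0)}{N(x_0)} \big(\epsilon - w(h; f)\big),\ \frac{\bar{N}(x_0)}{N(x_0)} \le \frac{\pi_n}{2} \right) \\
&\le P_{A_n, X^n}\left( \left| \sum_{j \in J(x_0)} W_j \epsilon_j \right| \ge N(x_0) \frac{\pi_n}{2} \big(\epsilon - w(h; f)\big) \right) + P_{A_n, X^n}\left( \frac{\bar{N}(x_0)}{N(x_0)} \le \frac{\pi_n}{2} \right) \\
&\le 2 \exp\left( - \frac{N(x_0)\, \pi_n^2 \big(\epsilon - w(h; f)\big)^2}{8\big(v^2 + c(\pi_n/2)(\epsilon - w(h; f))\big)} \right) + \exp\left( - \frac{3 N(x_0)\, \pi_n}{28} \right),
\end{aligned}
\]
where the last inequality follows from inequalities (A.4) and (A.2) in Chapter A, respectively. For applying (A.4), we used the fact that $W_j$ is independent of the $\epsilon_k$'s for all $k \ge j$, since $W_j$ depends only on the previous observations and $X_j$. With $N_b$ denoting the number of design points in the $b$th small cube whose rewards are observed by time $n$, a union bound over the $M$ small cubes gives
\[
P_{A_n, X^n}\big(\|\hat{f}_n - f\|_\infty \ge \epsilon\big) \le M \exp\left( - \frac{3 (\min_{1 \le b \le M} N_b)\, \pi_n}{28} \right) + 2M \exp\left( - \frac{(\min_{1 \le b \le M} N_b)\, \pi_n^2 \big(\epsilon - w(h; f)\big)^2}{8\big(v^2 + c(\pi_n/2)(\epsilon - w(h; f))\big)} \right).
\]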
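To get a feel for how the bound of Lemma 2.3.5 behaves, its right-hand side can be evaluated numerically. The following is a minimal Python sketch and not part of the original analysis: the function name `histogram_tail_bound`, its argument names, and the numerical values in the demo loop are hypothetical choices made only for illustration.

```python
import math


def histogram_tail_bound(n_min, pi_n, eps, w, v, c, num_cubes):
    """Right-hand side of the bound in Lemma 2.3.5 (illustrative helper).

    n_min     : min over cubes of N_b (observed design points per cube)
    pi_n      : exploration probability
    eps       : tolerance epsilon, must exceed w
    w         : modulus of continuity w(h; f)
    v, c      : constants of the moment (Bernstein-type) condition
    num_cubes : M, the number of small cubes
    """
    if eps <= w:
        raise ValueError("the bound requires eps > w(h; f)")
    gap = eps - w
    # Term controlling the event that too few pulls of the arm fall in a cube.
    selection_term = num_cubes * math.exp(-3.0 * n_min * pi_n / 28.0)
    # Term controlling the averaged noise inside a cube.
    noise_term = 2.0 * num_cubes * math.exp(
        -n_min * pi_n ** 2 * gap ** 2 / (8.0 * (v ** 2 + c * (pi_n / 2.0) * gap))
    )
    return selection_term + noise_term


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the decay in n_min.
    for n_min in (100, 1000, 10000, 100000):
        value = histogram_tail_bound(
            n_min, pi_n=0.1, eps=0.5, w=0.1, v=1.0, c=1.0, num_cubes=16
        )
        print(n_min, value)
```

Under these assumed values the bound is vacuous (larger than 1) until the least populated cube holds many observed rewards, which is in line with the observation that heavy delays, by keeping $\min_b N_b$ small, slow down learning.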
2.6 Supplementary simulation results

We conducted a simulation study in section 2.4 to illustrate the choices of $\pi_n$ and $h_n$ under different delay scenarios. The results were presented in Figure 2.1, where the y-axis of the graphs represented the per-round regret and the x-axis represented time. Those results came from a single run of the algorithm proposed in section 2.2. In this section, we run the proposed algorithm for 60 replications to get a more precise understanding of the performance. We note that the trends observed are very similar to what was noted in section 2.4, except that the curves look much smoother. It can be seen that the average regret gets worse when $h_n$ decays too fast, especially for the scenario (Delay 4) with an increasing number of non-observed rewards, possibly because of a violation of condition (2.3). Also notice that a slowly decaying $\pi_n$ has higher regret (last row). This could be because of a large randomization error that leads to a high exploration price.

[Figure 2.3: four panels of average regret (y-axis, 0.00 to 0.15) against time index (x-axis, 0 to 10000), one curve per delay scenario (No delay, Delay 1, Delay 2, Delay 3, Delay 4). Panels: ($\pi_n = n^{-1/4}$, $h_n = n^{-1/6}$); ($\pi_n = n^{-1/4}$, $h_n = \log(n)^{-1}$); ($\pi_n = n^{-1/6}$, $h_n = n^{-1/6}$); ($\pi_n = n^{-1/6}$, $h_n = \log(n)^{-1}$).]

Figure 2.3: Per-round regret averaged over 60 replications for the proposed strategy in section 2.2 for different delay situations. The grid of plots represents four different combinations of $\{h_n\}$ and $\{\pi_n\}$. For a given row, $\pi_n$ remains fixed and $h_n$ varies, and vice versa for columns.

Chapter 3

To update or not to update?
In this chapter, through slight modifications of the randomized allocation strategy that was proposed in chapter 2 for contextual bandits with delays, we illustrate that one can achieve lower cumulative regret by adjusting the exploration-exploitation balance when faced with large delays. We employ a nonparametric regression framework to model the mean reward functions and allow for unbounded delays. In the presence of delayed feedback, one may choose between using the original exploration sequence, which updates at every time point, or updating the sequence only when a new reward is observed, leading to two competing strategies. In this chapter, we prove that while both strategies lead to strongly consistent allocation, the property holds for a wider scope of situations for the strategy that updates the exploration probability only when a new reward is observed. However, we also illustrate that both strategies have their own advantages and disadvantages depending on the severity of the delay and the underlying reward generating mechanisms.

3.1 Problem setup

The setup of the problem is the same as the one in section 2.1 in chapter 2. Recall that there are $\ell > 1$ competing arms. The covariates are assumed to be $d$-dimensional random variables generated according to an unknown underlying probability distribution $P_X$, from a population supported in $[0, 1]^d$. We adopt a regression perspective to model the relationship between covariates and rewards,
\[
Y_{i,j} = f_i(X_j) + \epsilon_j,
\]
where the $\epsilon_j$'s are independent errors with $E(\epsilon_j) = 0$ and $\mathrm{Var}(\epsilon_j) < \infty$ for all $1 \le i \le \ell$ and $j \ge 1$. Now, the problem can be viewed as one of estimating the mean reward functions $f_i(x)$ for $i \in \{1, \ldots, \ell\}$ and allocating arms based on the estimators $\hat{f}_i$. We follow a nonparametric approach to estimating these functions. In our setup, the rewards can be obtained at some delayed time, which we denote by $\{t_j \in \mathbb{R}^+, j \ge 1\}$. The delay in the reward for pulling arm $I_j$ is given by the random variable $d_j := t_j - j$.
We assume that $\{d_j : d_j \ge 0, j \ge 1\}$ is a sequence of independent random variables. Let the number of rewards obtained by time $n$ be denoted by $\tau_n = \sum_{j=1}^n I(t_j \le n)$, which is also a random variable. We devise a sequential allocation strategy $\delta$, incorporating delayed rewards, such that it chooses arms sequentially based on previous observations and present covariates. As a measure of performance of the strategy, we consider the following ratio,
\[
R_n(\delta) = \frac{\sum_{j=1}^n f_{I_j}(X_j)}{\sum_{j=1}^n f^*(X_j)}, \tag{3.1}
\]
where $f^*(x) = \max_{1 \le i \le \ell} f_i(x)$ is the theoretical best mean reward at $x$, and $i^*(x)$ is the corresponding arm. We then establish strong consistency of $\delta$ in section 3.3, that is, we show that $R_n(\delta) \to 1$ with probability 1 as $n \to \infty$. Our goal is to compare the two allocation strategies proposed in section 3.2 and use simulation results to illustrate how both can be advantageous in different situations.

3.2 The proposed strategies

Define $Z^{n,i}$ to be the set of observations for arm $i$ whose rewards have been obtained up to time $n-1$, that is, $Z^{n,i} := \{(X_j, Y_{i,j}) : 1 \le t_j \le n-1 \text{ and } I_j = i\}$. Let $\hat{f}_{i,n}$ denote the regression estimator of $f_i$ based on the data $Z^{n,i}$. Let $\{\pi_j, j \ge 1\}$ be a sequence of positive numbers in $[0, 1]$ decreasing to zero, such that $(\ell - 1)\pi_j < 1$ for all $j \ge 1$. We propose two strategies, $\eta_1$ and $\eta_2$, that share the same algorithmic structure and differ only subtly in the arm selection step.

3.2.1 Algorithm

Step 1. Initialize. Allocate each arm once, $I_1 = 1, I_2 = 2, \ldots, I_\ell = \ell$. Since the rewards are not immediately obtained for each of these $\ell$ arms, we continue these forced allocations until we have at least one reward observed for each arm. Suppose that this happens at time $m_0$.

Step 2. Estimate the individual functions $f_i$. For $n = m_0 + 1$, based on $Z^{n,i}$, estimate $f_i$ by $\hat{f}_{i,n}$ for $1 \le i \le \ell$ using the chosen regression procedure.

Step 3. Estimate the best arm. For $X_n$, let $\hat{i}_n(X_n) = \arg\max_{1 \le i \le \ell} \hat{f}_{i,n}(X_n)$.

Step 4. Select and pull.
Recall that $\tau_n = \sum_{j=1}^n I(t_j \le n)$ is the number of rewards observed by time $n$.

(a) Strategy $\eta_1$:
\[
I_n = \begin{cases} \hat{i}_n, & \text{with probability } 1 - (\ell - 1)\pi_n, \\ i, & \text{with probability } \pi_n, \ i \neq \hat{i}_n, \ 1 \le i \le \ell. \end{cases}
\]

(b) Strategy $\eta_2$:
\[
I_n = \begin{cases} \hat{i}_n, & \text{with probability } 1 - (\ell - 1)\pi_{\tau_n}, \\ i, & \text{with probability } \pi_{\tau_n}, \ i \neq \hat{i}_n, \ 1 \le i \le \ell. \end{cases}
\]

Step 5. Update the estimates.

Step 5a. If a reward is obtained at the $n$th time (this could be one or more rewards corresponding to one or more arms $I_j$, $1 \le j \le n$), update the function estimates of $f_i$ for the respective arm (or arms) for which the reward (or rewards) are obtained at the $n$th time.

Step 5b. If no reward is obtained at the $n$th time, use the previous function estimators, i.e. $\hat{f}_{i,n+1} = \hat{f}_{i,n}$ for all $i \in \{1, \ldots, \ell\}$.

Step 6. Repeat. Repeat steps 3-5 when the next covariate $X_{n+1}$ surfaces, and so on.

In the algorithm above, Step 1 initializes the allocations by pulling each arm in turn until we observe at least one reward for each arm. Step 2 estimates the mean reward function for each arm. This could be done using several regression methods; we use kernel regression and the histogram method in this work. Steps 3 and 4 enforce an $\epsilon$-greedy type of randomization scheme, which prefers the best performing arm so far with some probability and explores with the remaining probability. The preference is determined by a user-determined sequence of exploration probabilities $\{\pi_n, n \ge 1\}$. For strategy $\eta_2$, the exploration probability is updated only when a new reward is observed, that is, it equals $\pi_{\tau_n}$, while for strategy $\eta_1$ it is updated at every time point irrespective of whether a reward is observed, that is, it equals $\pi_n$. Hence, the two strategies differ in the extent of exploration and exploitation that is allowed over time. Finally, in Step 5, the mean reward function estimators are updated if new rewards are observed, and they remain the same if no new rewards are observed. Note that the user-determined sequence underlying both $\pi_n$ in $\eta_1$ and $\pi_{\tau_n}$ in $\eta_2$ is the same.
Therefore, for notational convenience, we use $\{\cdot\}$ to denote a user-determined sequence, such as $\{\pi_n\}$, when we only want to refer to the original sequence selected by the user, without distinguishing between when it gets updated.

3.3 Consistency of the proposed strategy

Let $A_n := \{j : t_j \le n\}$ denote the set of time points for which rewards were observed by time $n$. We make the following assumptions.

Assumption 3.3.1. The regression procedure is strongly consistent in $L_\infty$ norm for all individual mean functions $f_i$ under the proposed allocation scheme. That is, $\|\hat{f}_{i,n} - f_i\|_\infty \xrightarrow{a.s.} 0$ as $n \to \infty$ for each $1 \le i \le \ell$, where $\hat{f}_{i,n}$ is the estimator based on all previously observed rewards.

Due to the presence of delays, the mean reward function estimators are only updated at the time points where a new reward is observed. Therefore, we need this strong consistency in estimation to hold when the mean reward functions are only being updated at the observed data points, which seems to be a somewhat stronger condition. Next, we make a mild assumption on the mean reward functions.

Assumption 3.3.2. The mean reward functions are continuous and $f_i(x) \ge 0$, such that
\[
A = \sup_{1 \le i \le \ell} \sup_{x \in [0,1]^d} \big(f^*(x) - f_i(x)\big) < \infty \quad \text{and} \quad E(f^*(X_1)) > 0.
\]

Theorem 3.3.3. Under Assumptions 3.3.1 and 3.3.2, the allocation rules $\eta_1$ and $\eta_2$ are strongly consistent as $n \to \infty$.

Proof. Note that consistency holds only when the sequence $\{\pi_n, n \ge 1\}$ is chosen such that $\pi_n \to 0$ as $n \to \infty$. The proof is very similar to the proof in section 2.5.1 in chapter 2, so we skip it here.

Note that Assumption 3.3.1, although seemingly natural, is a strong assumption, and it requires additional work to verify it for a particular regression setting. We verify this assumption for the histogram method in section 3.3.2 and for the kernel method in section 3.3.3. On the other hand, Assumption 3.3.2 is a mild assumption on the mean reward functions and does not require any verification.
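As a concrete reading of the selection step (Step 4) of the algorithm in section 3.2.1, the only difference between $\eta_1$ and $\eta_2$ is the index at which the exploration sequence is evaluated. The following is a minimal Python sketch under stated assumptions: the names `exploration_index` and `select_arm`, the string labels `"eta1"` and `"eta2"`, and the representation of $\{\pi_n\}$ as a callable `pi` are hypothetical and are not taken from the original work.

```python
import random


def exploration_index(strategy, n, tau_n):
    """Index at which the user-chosen sequence {pi_k} is evaluated."""
    if strategy == "eta1":
        return n        # updated at every time point
    if strategy == "eta2":
        return tau_n    # updated only when a new reward has been observed
    raise ValueError("strategy must be 'eta1' or 'eta2'")


def select_arm(estimates, pi, strategy, n, tau_n, rng=random):
    """Randomized selection among arms 0..l-1 (illustrative only).

    estimates : current values f_hat_i(X_n), one per arm
    pi        : callable k -> exploration probability pi_k
    rng       : any object exposing random() -> float in [0, 1)
    """
    num_arms = len(estimates)
    best = max(range(num_arms), key=lambda i: estimates[i])
    p = pi(exploration_index(strategy, n, tau_n))
    explore_mass = (num_arms - 1) * p
    if explore_mass >= 1.0:
        raise ValueError("need (l - 1) * pi_k < 1")
    u = rng.random()
    if u >= explore_mass:
        # Happens with probability 1 - (l - 1) * p: exploit.
        return best
    # Otherwise each non-greedy arm is chosen with probability p.
    others = [i for i in range(num_arms) if i != best]
    k = min(int(u / p), num_arms - 2)
    return others[k]


if __name__ == "__main__":
    pi = lambda k: k ** -0.25           # hypothetical choice of {pi_n}
    f_hat = [0.2, 0.7, 0.4]             # hypothetical estimates at X_n
    print(select_arm(f_hat, pi, "eta1", n=500, tau_n=40))
    print(select_arm(f_hat, pi, "eta2", n=500, tau_n=40))
```

With heavy delays, `tau_n` is much smaller than `n`, so `"eta2"` evaluates the sequence at an earlier index and therefore explores with a larger probability than `"eta1"` at the same time point.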
3.3.1 The histogram method

In this section, we consider the histogram method for the setting with delayed rewards. We assume that the binwidth $h$ is chosen such that $1/h$ is an integer. At time $n$, partition $[0, 1]^d$ into $M = (1/h_{\tau_n})^d$ hypercubes with side width $h_{\tau_n}$, where $\tau_n$ is the number of observed rewards by time $n$. For $x \in [0, 1]^d$ falling in a hypercube $B(x)$, let $\bar{J}_i(x) = \{j : X_j \in B(x), t_j \le n, I_j = i\}$ and let $\bar{N}_i(x)$ be the size of $\bar{J}_i(x)$. Then the histogram estimate for $f_i(x)$ is defined as
\[
\hat{f}_{i,n}(x) = \frac{1}{\bar{N}_i(x)} \sum_{j \in \bar{J}_i(x)} Y_j. \tag{3.2}
\]
For the estimator to behave well, a proper choice of the binwidth sequence $\{h_n\}$ is necessary. Note that we only update $h_n$ to $h_{n+1}$ when a new reward is observed, hence we denote it by $h_{\tau_n}$. For notational convenience, when the analysis is focused on a single arm, $i$ is dropped from the subscript of $\hat{f}$, $\bar{N}$ and $\bar{J}$. Next, we prove consistency for the strategies proposed in section 3.2.1 using the histogram method.

3.3.2 Allocation with histogram estimates

As already discussed in section 3.3, we only need to verify that Assumption 3.3.1 holds for the histogram method. Along with Assumptions 3.3.1 and 3.3.2, we make the following assumptions.

Assumption 3.3.4. The design distribution $P_X$ is dominated by the Lebesgue measure with a density $p(x)$ uniformly bounded above and away from 0 on $[0, 1]^d$; that is, $p(x)$ satisfies $\underline{c} \le p(x) \le \overline{c}$ for some positive constants $\underline{c} < \overline{c}$.

This assumption is needed to make sure that all regions in the covariate space are observed with positive probability, in order to ensure good estimation in all regions.

Assumption 3.3.5. The errors satisfy a moment condition: there exist positive constants $v$ and $c$ such that, for all integers $m \ge 2$, the extended Bernstein condition (Birgé et al. (1998); Qian and Yang (2016a)) is satisfied, that is,
\[
E|\epsilon_{ij}|^m \le \frac{m!}{2} v^2 c^{m-2}.
\]
This condition on the errors holds in many settings; for example, normally distributed errors and bounded errors meet this requirement, which makes it useful in a wide range of applications. The next two assumptions concern the nature of the delays in observing rewards. They ensure that delays are not confounded by other factors and that a minimum number of rewards is observed over time, so that learning is effective.

Assumption 3.3.6. The delays $\{d_j, j \ge 1\}$ are independent of each other, of the choice of arms and of the covariates.

Assumption 3.3.7. Let the partial sums of delay distributions satisfy $E(\tau_n) = \Omega(q(n))$, where $q(n)$ is a sequence that acts as a lower bound on the expected number of observed rewards by time $n$, and $q(n) \to \infty$ as $n \to \infty$. (Here $f(n) = \Omega(g(n))$ means that, for some positive constant $c$, $f(n) \ge c\, g(n)$ when $n$ is large enough.)

Lemma 3.3.8. Let $\epsilon > 0$ be given. Suppose that $h$ is small enough such that $w(h; f) < \epsilon$. Then the histogram estimator $\hat{f}_n$ satisfies
\[
P^{\eta_1}_{A_n, X^n}\big(\|\hat{f}_n - f\|_\infty \ge \epsilon\big) \le M \exp\left( - \frac{3 \pi_n \min_{1 \le b \le M} N_b}{28} \right) + 2M \exp\left( - \frac{\min_{1 \le b \le M} N_b\, \pi_n^2 \big(\epsilon - w(h_{\tau_n}; f)\big)^2}{8\big(v^2 + c(\pi_n/2)(\epsilon - w(h_{\tau_n}; f))\big)} \right), \tag{3.3}
\]
\[
P^{\eta_2}_{A_n, X^n}\big(\|\hat{f}_n - f\|_\infty \ge \epsilon\big) \le M \exp\left( - \frac{3 \pi_{\tau_n} \min_{1 \le b \le M} N_b}{28} \right) + 2M \exp\left( - \frac{\min_{1 \le b \le M} N_b\, \pi_{\tau_n}^2 \big(\epsilon - w(h_{\tau_n}; f)\big)^2}{8\big(v^2 + c(\pi_{\tau_n}/2)(\epsilon - w(h_{\tau_n}; f))\big)} \right), \tag{3.4}
\]
where $P_{A_n, X^n}$ denotes conditional probability given the design points $X^n = (X_1, \ldots, X_n)$ and $A_n = \{j : t_j \le n\}$. Here, $N_b$ is the number of design points for which the rewards have been observed by time $n$ and which fall in the $b$th small cube of the partition of the unit cube at time $n$.

Proof. The proof of Lemma 3.3.8 is similar to the proof of Lemma 2.3.5 in section 2.5.2 in chapter 2. For strategy $\eta_1$, the same argument goes through with $h_n$ replaced by $h_{\tau_n}$. For strategy $\eta_2$, $\pi_n$ is replaced by $\pi_{\tau_n}$ and $h_n$ is replaced by $h_{\tau_n}$. This is because the result is a conditional probability result, and given $A_n$ and $X^n$, $\tau_n$ is a known quantity.

Theorem 3.3.9.
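Suppose Assumptions 3.3.2-3.3.7 are satisfied.

a) If $\{h_n\}$ and $\{\pi_n\}$ are chosen to satisfy
\[
\frac{h_{q(n)}^{2}\, \pi_n^{2}\, q(n)}{\log n} \to \infty \quad \text{as } n \to \infty, \tag{3.5}
\]
then the histogram estimator in (3.2) is strongly consistent in the $L_\infty$ norm for strategy $\eta_1$, hence $\eta_1$ is strongly consistent.

b) If $\{h_n\}$ and $\{\pi_n\}$ are chosen to satisfy
\[
\frac{h_{q(n)}^{2}\, \pi_{q(n)}^{2}\, q(n)}{\log n} \to \infty \quad \text{as } n \to \infty, \tag{3.6}
\]
then the histogram estimator in (3.2) is strongly consistent in the $L_\infty$ norm for strategy $\eta_2$, hence $\eta_2$ is strongly consistent.

Proof. The proofs for a) and b) are quite similar, so we prove b) here and then discuss a). Given $A_n$, the indices corresponding to when rewards were obtained, we know that at time $n$ the histogram method partitions the unit cube into $M = (1/h_{\tau_n})^d$ small cubes. For each small cube $B_b$, $1 \le b \le M$, in the partition, let $N_b = \sum_{j=1}^n I(X_j \in B_b, t_j \le n)$. Note that given $A_n$, for $j \in A_n$, $P_{A_n}(X_j \in B_b, t_j \le n) = P(X_j \in B_b) \ge \underline{c}\, h_{\tau_n}^d$, where the equality uses Assumption 3.3.6 and the inequality uses Assumption 3.3.4. Thus, using inequality (A.2) and a union bound over the $M$ cubes, we have
\[
P_{A_n}\left( N_b \le \frac{\underline{c}\, h_{\tau_n}^d \tau_n}{2} \right) \le \exp\left( - \frac{3 \underline{c}\, h_{\tau_n}^d \tau_n}{28} \right)
\quad \Rightarrow \quad
P_{A_n}\left( \min_{1 \le b \le M} N_b \le \frac{\underline{c}\, h_{\tau_n}^d \tau_n}{2} \right) \le M \exp\left( - \frac{3 \underline{c}\, h_{\tau_n}^d \tau_n}{28} \right). \tag{3.7}
\]
Recall that $\tau_n = \sum_{j=1}^n I\{t_j \le n\}$. First, we show that $\tau_n \xrightarrow{a.s.} \infty$ as $n \to \infty$ for both strategies $\eta_1$ and $\eta_2$. By Assumption 3.3.7, there exists a positive constant $a_1 > 0$ such that, for large enough $n$, $E(\tau_n) \ge a_1 q(n)$. Then using inequality (A.2) in Lemma A.1.6 we get
\[
P\left( \tau_n \le \frac{a_1 q(n)}{2} \right) \le P\left( \tau_n \le \frac{E(\tau_n)}{2} \right) \le \exp\left( - \frac{3 E(\tau_n)}{28} \right) \le \exp\left( - \frac{3 a_1 q(n)}{28} \right).
\]
The upper bound is summable in $n$ under the conditions (3.5) and (3.6). By the Borel-Cantelli lemma, the event $\{\tau_n \le a_1 q(n)/2\}$ occurs only finitely often almost surely, so that $\tau_n > a_1 q(n)/2$ for all sufficiently large $n$, and therefore $\tau_n \xrightarrow{a.s.} \infty$. By construction, this implies that $h_{\tau_n} \xrightarrow{a.s.} 0$ and $\pi_{\tau_n} \xrightarrow{a.s.} 0$ as $n \to \infty$. Let $w(h_{\tau_n}; f_i)$ be the modulus of continuity as in Definition A.0.2. Then continuity of $f_i$ leads to the conclusion that $w(h_{\tau_n}; f_i) \xrightarrow{a.s.} 0$ as $n \to \infty$. Thus, for any $\epsilon > 0$ and for large enough $n$, when $h_{\tau_n}$ is small enough, $\epsilon - w(h_{\tau_n}; f_i) \ge \epsilon/2$ almost surely.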
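Consider,
\[
\begin{aligned}
P_{A_n}\big( \|\hat{f}_{i,n} - f_i\|_\infty \ge \epsilon \big)
&= P_{A_n}\left( \|\hat{f}_{i,n} - f_i\|_\infty \ge \epsilon,\ \min_{1 \le b \le M} N_b > \frac{\underline{c}\, h_{\tau_n}^d \tau_n}{2} \right)
+ P_{A_n}\left( \|\hat{f}_{i,n} - f_i\|_\infty \ge \epsilon,\ \min_{1 \le b \le M} N_b \le \frac{\underline{c}\, h_{\tau_n}^d \tau_n}{2} \right) \\
&\le E_{X^n} P_{A_n, X^n}\left( \|\hat{f}_{i,n} - f_i\|_\infty \ge \epsilon,\ \min_{1 \le b \le M} N_b > \frac{\underline{c}\, h_{\tau_n}^d \tau_n}{2} \right)
+ P_{A_n}\left( \min_{1 \le b \le M} N_b \le \frac{\underline{c}\, h_{\tau_n}^d \tau_n}{2} \right),
\end{aligned}
\]
where $E_{X^n}$ denotes expectation with respect to $X^n$, which appears by applying the law of iterated expectations. From (3.4) in Lemma 3.3.8 and (3.7), we get that
\[
P_{A_n}\big( \|\hat{f}_{i,n} - f_i\|_\infty \ge \epsilon \big) \le M \exp\left( - \frac{3 \underline{c}\, \pi_{\tau_n} h_{\tau_n}^d \tau_n}{56} \right)
+ 2M \exp\left( - \frac{\underline{c}\, h_{\tau_n}^d \pi_{\tau_n}^2 \tau_n \big(\epsilon - w(L h_{\tau_n}; f_i)\big)^2}{8\big(v^2 + c(\pi_{\tau_n}/2)\epsilon\big)} \right)
+ M \exp\left( - \frac{3 \underline{c}\, h_{\tau_n}^d \tau_n}{28} \right). \tag{3.8}
\]
Now consider,
\[
\begin{aligned}
P\big(\|\hat{f}_{i,n} - f_i\|_\infty > \epsilon\big)
&\le P\left( \|\hat{f}_{i,n} - f_i\|_\infty \ge \epsilon,\ \tau_n > \frac{E(\tau_n)}{2} \right) + P\left( \tau_n \le \frac{E(\tau_n)}{2} \right) \\
&\le E_{A_n} P_{A_n}\left( \|\hat{f}_{i,n} - f_i\|_\infty \ge \epsilon,\ \tau_n > \frac{E(\tau_n)}{2} \right) + P\left( \tau_n \le \frac{E(\tau_n)}{2} \right).
\end{aligned}
\]
Let $n_e = \lfloor E(\tau_n)/2 \rfloor$. Then, by using condition (3.6) and (3.8), we have that, for large enough $n$,
\[
\begin{aligned}
P\big(\|\hat{f}_{i,n} - f_i\|_\infty > \epsilon\big)
&\le M \exp\left( - \frac{3 \underline{c}\, \pi_{n_e} h_{n_e}^d n_e}{56} \right)
+ 2M \exp\left( - \frac{\underline{c}\, h_{n_e}^d \pi_{n_e}^2 n_e \big(\epsilon - w(L h_{n_e}; f_i)\big)^2}{8\big(v^2 + c(\pi_{n_e}/2)\epsilon\big)} \right)
+ M \exp\left( - \frac{3 \underline{c}\, h_{n_e}^d n_e}{28} \right) + \exp\left( - \frac{3 n_e}{14} \right) \\
&\le M \exp\left( - \frac{3 \tilde{c}\, \pi_{q(n)} h_{q(n)}^d q(n)}{112} \right)
+ 2M \exp\left( - \frac{\tilde{c}\, h_{q(n)}^d \pi_{q(n)}^2 q(n) \big(\epsilon - w(L h_{q(n)}; f_i)\big)^2}{16\big(v^2 + c(\pi_{q(n)}/2)\epsilon\big)} \right)
+ M \exp\left( - \frac{3 \tilde{c}\, h_{q(n)}^d q(n)}{56} \right) + \exp\left( - \frac{3 a_1 q(n)}{28} \right),
\end{aligned} \tag{3.9}
\]
where $\tilde{c}$ is a new constant that incorporates functions of $a_1$ and $\underline{c}$. It can be seen that the above upper bound is summable in $n$ under the condition
\[
\frac{h_{q(n)}^d\, \pi_{q(n)}^2\, q(n)}{\log n} \to \infty. \tag{3.10}
\]
Since $\epsilon$ is arbitrary, by the Borel-Cantelli lemma, we have that $\|\hat{f}_{i,n} - f_i\|_\infty \to 0$ almost surely. This is true for all arms $1 \le i \le \ell$. Note that the result a) can be obtained similarly by using (3.3) from Lemma 3.3.8 to obtain a result similar to (3.8) but with $\pi_n$ instead of $\pi_{\tau_n}$.

3.3.3 Kernel regression

We can obtain analogous results for strong consistency of strategies $\eta_1$ and $\eta_2$ using the Nadaraya-Watson estimator. Consider a nonnegative kernel function $K(u) : \mathbb{R}^d \to \mathbb{R}$ that satisfies the following Lipschitz and boundedness conditions.

Assumption 3.3.10. For some constant $0 < \lambda < \infty$, $|K(u) - K(u')| \le \lambda \|u - u'\|_\infty$ for all $u, u' \in \mathbb{R}^d$.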
Assumption 3.3.11. There exist constants $L_1 \le L$, $c_3 > 0$ and $c_4 \ge 1$ such that $K(u) = 0$ for $\|u\|_\infty > L$, $K(u) \ge c_3$ for $\|u\|_\infty \le L_1$, and $K(u) \le c_4$ for all $u \in \mathbb{R}^d$.

Recall that $\tau_n = \sum_{j=1}^n I(t_j \le n)$ is the number of observed rewards by time $n$. Define $J_{i,n+1} = \{j : I_j = i, t_j \le n, 1 \le j \le n\}$, that is, the set of time points corresponding to pulls of arm $i$ whose rewards have been observed by time $n$. Let $M_{i,n+1}$ denote the size of $J_{i,n+1}$. Let $h_{\tau_n}$ denote the bandwidth, where $h_{\tau_n} \to 0$ almost surely as $n \to \infty$. For each arm $i$, the Nadaraya-Watson estimator of $f_i(x)$ is defined as
\[
\hat{f}_{i,n+1}(x) = \frac{\sum_{j \in J_{i,n+1}} Y_{i,j}\, K\left( \frac{x - X_j}{h_{\tau_n}} \right)}{\sum_{j \in J_{i,n+1}} K\left( \frac{x - X_j}{h_{\tau_n}} \right)}. \tag{3.11}
\]

Theorem 3.3.12. Suppose Assumptions 3.3.2-3.3.11 are satisfied.

1. If $\{h_n\}$ and $\{\pi_n\}$ are chosen to satisfy
\[
\frac{q(n)\, h_{q(n)}^{2d}\, \pi_n^{4}}{\log n} \to \infty,
\]
then the Nadaraya-Watson estimator defined in (3.11) is strongly consistent in $L_\infty$ norm for strategy $\eta_1$.

2. If $\{h_n\}$ and $\{\pi_n\}$ are chosen to satisfy
\[
\frac{q(n)\, h_{q(n)}^{2d}\, \pi_{q(n)}^{4}}{\log n} \to \infty, \tag{3.12}
\]
then the Nadaraya-Watson estimator defined in (3.11) is strongly consistent in $L_\infty$ norm for strategy $\eta_2$.

Proof. The proof of this theorem can be found in section 3.6.

3.4 Comparison of strategies, $\eta_1$ and $\eta_2$

Note that in chapter 2 we obtained a consistency result for the histogram method analogous to Theorem 3.3.9: for $q(n)$ as in Assumption 3.3.7, if $h_n$ and $\pi_n$ are chosen to satisfy (2.6), which is
\[
\frac{h_n^d\, \pi_n^2\, q(n)}{\log n} \to \infty \quad \text{as } n \to \infty, \tag{3.13}
\]
then the proposed allocation rule is strongly consistent. Now if we compare (3.5), (3.6) and (3.13), we see that (3.13) $\Rightarrow$ (3.6) $\Rightarrow$ (3.5), but not vice versa. Therefore (3.5) seems to give more options for the choice of the user-determined sequences $\{h_n\}$ and $\{\pi_n\}$ to achieve consistency, while there may be a trade-off in the regret rate, as we will see in the simulations. The rate of decrease of the average cumulative regret can be slow for some choices of $\{\pi_n\}$ and $\{h_n\}$ that satisfy both (3.13) and (3.5).
Note that a similar relationship is observed in Theorem 3.3.12 when using kernel regression.

To understand which choices of hyperparameter sequences help minimize the cumulative regret, let us consider the regret for strategy $\eta$,
\[
\begin{aligned}
R_N(\eta) &= \sum_{j=m_0+1}^{N} \big(f^*(X_j) - f_{I_j}(X_j)\big) \\
&= \sum_{j=m_0+1}^{N} \big(f_{i^*_j}(X_j) - \hat{f}_{i^*_j}(X_j) + \hat{f}_{i^*_j}(X_j) - \hat{f}_{I_j}(X_j) + \hat{f}_{I_j}(X_j) - f_{I_j}(X_j)\big) \\
&\le \sum_{j=m_0+1}^{N} \big(f_{i^*_j}(X_j) - \hat{f}_{i^*_j}(X_j) + \hat{f}_{\hat{i}_j}(X_j) - \hat{f}_{I_j}(X_j) + \hat{f}_{I_j}(X_j) - f_{I_j}(X_j)\big) \\
&\le \sum_{j=m_0+1}^{N} \Big( 2 \sup_{1 \le i \le \ell} |f_i(X_j) - \hat{f}_i(X_j)| + A\, I\{I_j \neq \hat{i}_j\} \Big).
\end{aligned}
\]
Thus the cumulative regret can be roughly decomposed into estimation error and randomization error. For the no-delay setting, both of these error components are studied in a finite-time setting in Qian and Yang (2016b), where it is shown that $\{h_n\}$ and $\{\pi_n\}$ can be chosen so as to achieve an optimal (minimax) rate of convergence for the regret. In their work, the choices of $\{h_n\}$ and $\{\pi_n\}$ also depend on the smoothness parameter of the underlying mean reward functions. Thus, in situations where the underlying mean reward functions are simple and smoother, $\{h_n\}$ and $\{\pi_n\}$ are chosen to be fast decaying to achieve optimal rates of convergence in no-delay situations. Similarly, for scenarios where the underlying mean reward functions are more complex, $\{h_n\}$ and $\{\pi_n\}$ are chosen to be relatively slow decaying in order to guarantee optimal rates under no-delay scenarios.

The question that arises in the presence of delayed rewards is: how should the sequences $\{h_n\}$ and $\{\pi_n\}$ be updated so as to minimize the resulting cumulative regret? That is, should one update $\pi_n$ to $\pi_{n+1}$ (and $h_n$ to $h_{n+1}$) at every time point irrespective of observing a reward, or only upon observing a new reward? To answer this question, let us try to understand the impact of the delay and the reward generating mechanisms on the two components of the cumulative regret.
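One way to make this question concrete is to tally, for a given pattern of reward arrival times, the expected number of exploration rounds $\sum_j (\ell - 1)\pi_{(\cdot)}$ under the two update rules: indexing the sequence by the time $n$, or by the number of rewards observed so far, $\tau_n$. The following is a minimal Python sketch under stated assumptions: the function name `expected_exploration_counts`, the list-of-arrival-times input, and the example delay pattern are hypothetical, and the sketch covers only the randomization component of the regret, not the estimation error.

```python
def expected_exploration_counts(arrival_times, horizon, m0, num_arms, pi):
    """Expected number of exploration rounds on m0+1..horizon (illustrative).

    arrival_times : t_j for each pull j (float('inf') if never observed)
    pi            : callable k -> exploration probability pi_k
    Returns (count_time_indexed, count_reward_indexed).
    """
    arrivals = sorted(t for t in arrival_times if t <= horizon)
    time_indexed = 0.0
    reward_indexed = 0.0
    ptr = 0
    for n in range(m0 + 1, horizon + 1):
        # tau_n: number of rewards with t_j <= n.
        while ptr < len(arrivals) and arrivals[ptr] <= n:
            ptr += 1
        tau_n = max(ptr, 1)  # guard only; after initialization tau_n >= 1
        time_indexed += (num_arms - 1) * pi(n)
        reward_indexed += (num_arms - 1) * pi(tau_n)
    return time_indexed, reward_indexed


if __name__ == "__main__":
    horizon, m0 = 10000, 30
    pi = lambda k: k ** -0.25  # hypothetical choice of {pi_n}
    # Hypothetical delay pattern: only every 4th reward is ever observed.
    arrivals = [j if j % 4 == 0 else float("inf") for j in range(1, horizon + 1)]
    by_time, by_reward = expected_exploration_counts(arrivals, horizon, m0, 2, pi)
    print(by_time / (horizon - m0), by_reward / (horizon - m0))
```

When every reward arrives immediately the two tallies coincide; when many rewards are missing, the reward-indexed rule explores more, which lowers the risk of under-exploring but raises the randomization error.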
Different nonparametric methods could be used for estimation purposes, and estimation accuracy largely depends on the complexity of the underlying mean reward functions and the amount of data available for estimation. The binwidth of methods like the histogram and kernel regression is usually a function of the number of data points available for estimation at a given time. Therefore, in the presence of delayed rewards, $h_{\tau_n}$ ($\tau_n$ being the number of observed rewards until $n$) seems to be the sensible choice for the binwidth. Choosing $h_n$ may lead to inefficient estimation due to the unavailability of data points in some small neighborhood of $[0, 1]^d$. Therefore, employing a binwidth sequence that guarantees optimal rates of convergence in the no-delay setting, and which updates only when a new reward is obtained, seems to be the right choice from an estimation point of view. Hence, we only consider the policies ($\eta_1$ and $\eta_2$) that employ $h_{\tau_n}$ as the chosen binwidth sequence. It is important to note that, from an asymptotic point of view, based on our theoretical results (Theorem 3.3.9), estimation will improve with time, but this discussion is from a finite-time perspective.

In terms of randomization error, delayed rewards affect it directly through the randomization scheme. This is tied to the exploration-exploitation dilemma, which is in turn controlled by the exploration probability sequence $\{\pi_n\}$. In the following illustrations, we try to convey why balancing exploration and exploitation requires updating the sequence $\{\pi_n\}$ carefully in the presence of delayed rewards, and why the right way to do so can vary across situations.

Illustration 1. Suppose that the underlying mean reward functions are not too complex and are well-separated. In this setting, it will be easy to get good functional estimates over time, even with the small sample of information available due to the presence of large delays.
Since the no-delay case is well-studied, for such a setting we could choose an exploration probability sequence $\{\pi_n\}$ that gives the optimal rate of convergence according to Qian and Yang (2016b). Now, with the delays, we need to decide whether we want to use $\pi_n$ or $\pi_{\tau_n}$ as the exploration probability sequence. In this setting, it would perhaps be advantageous to opt for $\pi_n$, which updates at every time step irrespective of whether a reward is obtained or not. This is because, in settings where the underlying functions are somewhat easier to learn, the major contribution to the regret would come from the randomization error. In order to illustrate that, let $\mathrm{Rand}_j(\eta_1)$ and $\mathrm{Rand}_j(\eta_2)$ denote the indicator $I(I_j \neq \hat{i}_j)$ for $\eta_1$ and $\eta_2$, respectively. Let $\sigma_t = \min\{\bar{n} : \sum_{j=m_0+1}^{\bar{n}} I(t_j \le N) \ge t\}$, that is, $\sigma_t$ is the time index where the $t$th reward is observed. Then we have that
\[
E_{A_N}\left( \sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2) \right) = \sum_{j=m_0+1}^{N} P_{\eta_2, A_N}(I_j \neq \hat{i}_j) = \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t, \tag{3.14}
\]
where $E_{A_N}$ denotes conditional expectation given $A_N$, the set of indices whose rewards were observed by time $N$. Here, $\tau_N = \sum_{j=m_0+1}^{N} I(t_j \le N)$ is the number of rewards observed between time $m_0$ and $N$. However, for strategy $\eta_1$, since the exploration probability $\{\pi_j\}$ does not depend on the delays, we have that
\[
E\left( \sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_1) \right) = \sum_{j=m_0+1}^{N} P_{\eta_1}(I_j \neq \hat{i}_j) = \sum_{j=1}^{N - m_0 - 1} (\ell - 1)\pi_j. \tag{3.15}
\]
For brevity, let us denote $\bar{N} = N - m_0 - 1$, and we start the counting process at $m_0 + 1$. Now, given $\tau_N$, the minimum value of the right hand side of (3.14) is attained when all the rewards from $m_0 + 1$ until $\tau_N$ are observed instantaneously and after that no reward is observed until we hit the horizon $\bar{N}$. Likewise, an approximate maximum value of the right hand side of (3.14) is attained when the rewards for the $(m_0 + 1)$th through $(\bar{N} - \tau_N)$th arms are not observed until time $\bar{N} - \tau_N$, and all the $\tau_N$ rewards are observed from time $\bar{N} - \tau_N + 1$ to $\bar{N}$, respectively.
Therefore,
\[
\min_{A_N} E_{A_N}\left( \sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2) \right) = (\ell - 1)\left[ \sum_{t=1}^{\tau_N - 1} \pi_t + (\bar{N} - \tau_N)\pi_{\tau_N} \right],
\]
\[
\max_{A_N} E_{A_N}\left( \sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2) \right) = (\ell - 1)\left[ (\bar{N} - \tau_N)\pi_1 + \sum_{t=2}^{\tau_N} \pi_t \right].
\]
For the sake of illustration, assume that we observe a fraction of the $\bar{N}$ rewards by time $N$, that is, $\tau_N = \alpha \bar{N}$ for some $\alpha \in (0, 1)$. Then we have that
\[
\min E\left( \sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2) \right) = (\ell - 1)\left[ \sum_{t=1}^{\tau_N - 1} \pi_t + (1 - \alpha)\bar{N}\pi_{\tau_N} \right], \tag{3.16}
\]
\[
\max E\left( \sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2) \right) = (\ell - 1)\left[ (1 - \alpha)\bar{N}\pi_1 + \sum_{t=2}^{\tau_N} \pi_t \right]. \tag{3.17}
\]
Notice that the terms $(1 - \alpha)\bar{N}\pi_{\tau_N}$ and $(1 - \alpha)\bar{N}\pi_1$ on the right hand sides of (3.16) and (3.17) can be fairly large and grow as $N$ increases for all reasonably fast choices of $\{\pi_n\}$, such as $n^{-1/4}$ and $\log^{-1} n$. From (3.15), (3.16) and (3.17), we also get that
\[
\sum_{t=\tau_N+1}^{\bar{N}} (\ell - 1)(\pi_1 - \pi_t) \ge E\left( \sum_{j=m_0+1}^{N} \big(\mathrm{Rand}_j(\eta_2) - \mathrm{Rand}_j(\eta_1)\big) \right) \ge \sum_{t=\tau_N+1}^{\bar{N}} (\ell - 1)(\pi_{\tau_N} - \pi_t),
\]
where it can be seen that $\sum_{t=\tau_N+1}^{\bar{N}} (\ell - 1)(\pi_{\tau_N} - \pi_t) > 0$ for any $N$ and $\sum_{t=\tau_N+1}^{\bar{N}} (\ell - 1)(\pi_1 - \pi_t) \to \infty$ as $N \to \infty$. Therefore, we see that using strategy $\eta_1$, which updates $\pi_n$ at every time step irrespective of having observed a reward or not, gives a lower randomization error on average as compared to strategy $\eta_2$. For example, if we choose $\pi_n = n^{-1/4}$, $\alpha = 0.25$ (one-fourth of the rewards observed), $m_0 = 30$ (initialization phase) and time horizon $N = 10000$, then the average randomization error difference approximately satisfies
\[
0.23(\ell - 1) \ge \frac{E\left( \sum_{j=m_0+1}^{N} \big(\mathrm{Rand}_j(\eta_2) - \mathrm{Rand}_j(\eta_1)\big) \right)}{N - (m_0 + 1)} \ge 0.02(\ell - 1), \quad \text{for } N = 10000,\ m_0 = 30.
\]
Therefore, in situations where the underlying mean reward functions are not too complex, the randomization error can be quite large and can potentially dominate the estimation error. Thus, using strategy $\eta_1$ could reduce the cumulative regret substantially as compared to strategy $\eta_2$ in such situations.

Illustration 2. On the other hand, there are situations in which it may be better to use strategy $\eta_2$ with $\pi_{\tau_n}$ (updating only when a new reward is observed) as the exploration probability sequence.
One example is a scenario where the best arm frequently alternates over regions of the covariate space and it is hard to identify a clear winner with the limited information available under large delays. Another such situation is when an arm is inferior over the majority of the covariate space but superior, with a substantial reward gain, in a very small area of the domain; under large delays these under-represented regions may remain unexplored. As described, let us assume that the underlying mean reward functions are somewhat complex. In such settings, we would need substantial exploration even in later stages of the trial, especially in the presence of large delays. Here, in the hope of reducing the randomization error, we could employ strategy $\eta_1$ and use an exploration probability sequence $\pi_n$ that meets the conditions in Qian and Yang (2016b) ensuring optimal convergence rates in no-delay situations. However, this could be disadvantageous in such complex settings, because using $\eta_1$ may lead to insufficient exploration of the inferior arms. Consider the event that a seemingly inferior arm is chosen at time $t$, that is, $\{I_t \neq \hat{i}_t\}$. To ensure enough exploration, we need this event to occur with a positive probability that is not too small, especially in complex settings such as those discussed above. From Yang and Zhu (2002) and Qian and Yang (2016a) for no-delay settings, we know that it is necessary to have $\sum_{t=1}^{\infty} \pi_t = \infty$ for the algorithm to perform optimally both asymptotically and in finite time. We also know that $\tau_N \to \infty$ as $N \to \infty$. Therefore, using both these facts, for strategy $\eta_2$ the sum of the probabilities of the events $\{I_t \neq \hat{i}_t\}$, $t \ge 1$, over the time points where rewards were observed goes to $\infty$,

\[
\sum_{t=1}^{\tau_N} P_{\eta_2}(I_t \neq \hat{i}_t) = \sum_{t=1}^{\tau_N} (\ell - 1)\pi_t \xrightarrow{a.s.} \infty, \quad \text{as } N \to \infty,
\]

whereas, for $\eta_1$, this sum could actually remain bounded in large-delay situations. Let $\sigma_t = \min\{\bar{n} : \sum_{j=m_0+1}^{\bar{n}} I(t_j \le N) \ge t\}$.
Let us assume that the observed rewards are equally spaced, that is, $\sigma_t = tN/\tau_N$, assuming w.l.o.g. that $N/\tau_N$ is an integer. Then, we have,

\[
\sum_{t=1}^{\tau_N} P_{\eta_1}(I_t \neq \hat{i}_t) = \sum_{t=1}^{\tau_N} (\ell - 1)\pi_{\sigma_t} = \sum_{t=1}^{\tau_N} (\ell - 1)\pi_{tN/\tau_N}.
\]

Now, it can be shown that this series is summable for a variety of choices of $\{\pi_n\}$. For example, let us assume that $\{\pi_n\} = n^{-1/2}$; then for strategy $\eta_1$,

\[
\sum_{t=1}^{\tau_N} P_{\eta_1}(I_t \neq \hat{i}_t) = \sum_{t=1}^{\tau_N} (\ell - 1)\pi_{tN/\tau_N} = (\ell - 1)\sum_{t=1}^{\tau_N} \Big(\frac{tN}{\tau_N}\Big)^{-1/2} = (\ell - 1)\Big(\frac{N}{\tau_N}\Big)^{-1/2} \sum_{t=1}^{\tau_N} t^{-1/2} = O\Big(\frac{\tau_N}{\sqrt{N}}\Big). \tag{3.18}
\]

If the number of observed rewards is small, say $\tau_N = O(\sqrt{N})$, then the series in (3.18) is summable. Therefore, by the Borel-Cantelli Lemma, the events $\{I_t \neq \hat{i}_t\}$ can occur only finitely many times. This will lead to insufficient exploration and could lead to large regret in areas that remain unexplored, especially in the more complex settings. Therefore, if we employ strategy $\eta_1$ in such settings with large delays, we may end up over-exploiting certain arms and, as a result, obtain an insufficient number of rewards for a seemingly inferior arm that may yield higher rewards in some unexplored regions in the future. This would adversely affect the performance and lead to high cumulative regret. Therefore, in scenarios like this, it would be advantageous to use strategy $\eta_2$. In the next section, we demonstrate these ideas using four different simulation setups and illustrate the performance of strategies $\eta_1$ and $\eta_2$ in each of them. These insights also suggest that studying adaptive strategies for updating these parameters more locally could be a promising area to explore.

3.5 Simulations

We conduct a simulation study to compare the per-round average regret for strategies $\eta_1$ and $\eta_2$ under different delayed-reward scenarios. The per-round regret for strategy $\eta$ is given by,

\[
r_n(\eta) = \frac{1}{n} \sum_{j=1}^{n} \big(f^*(X_j) - f_{I_j}(X_j)\big).
\]

Note that, if $\frac{1}{n}\sum_{j=1}^{n} f^*(X_j)$ is eventually bounded above and away from 0 with probability 1, then $R_n(\eta) \to 1$ a.s.
is equivalent to $r_n(\eta) \to 0$ a.s.

The data have been generated from the following mean reward functions. We assume $d = 2$, $\ell = 2$ (or 3) and $x \in [0, 1]^2$, and the simulations run until time $N = 10000$ with the first 20-30 rounds used for initialization. For each of the setups, we define one-dimensional functions $g_1$ and $g_2$, and then for $x_1, x_2 \in [0, 1]$, we define $f_1(x_1, x_2) = g_1(x_1)\, x_2$ and $f_2(x_1, x_2) = g_2(x_1)\, x_2$.

Setup 1: In this setup, we consider two well-separated sinusoidal functions, where one is an upward-shifted version of the other.

\[
g_1(x) = -2\sin(20\pi x) + 3, \qquad g_2(x) = -2\sin(20\pi x) + 2; \qquad x \in [0, 1].
\]

Setup 2: Consider three piecewise-linear functions that are still well-separated but over different regions in the covariate space. Then, $f_1(x_1, x_2) = x_2\, g_1(x_1)$, $f_2(x_1, x_2) = x_2\, g_2(x_1)$, $f_3(x_1, x_2) = x_2\, g_3(x_1)$.

\[
g_1(x) = \begin{cases} 1 & 0 \le x < 0.5 \\ -10x + 6 & 0.5 \le x < 0.6 \\ 0 & x \ge 0.6, \end{cases}
\qquad
g_2(x) = \begin{cases} 0 & 0 \le x < 0.5 \\ 10x - 5 & 0.5 \le x < 0.6 \\ 1 & x \ge 0.6, \end{cases}
\]
\[
g_3(x) = \begin{cases} 0 & 0 \le x < 0.3 \\ 20x - 6 & 0.3 \le x < 0.4 \\ 2 & 0.4 \le x < 0.6 \\ -20x + 14 & 0.6 \le x < 0.7 \\ 0 & x \ge 0.7. \end{cases}
\]

Setup 3: Consider two sinusoidal functions such that the best arm alternates rapidly as the functions oscillate.

\[
g_1(x) = 2\cos(5\pi x) + 2, \qquad g_2(x) = -2\sin(5\pi x) + 2, \qquad \text{for } x \in [0, 1].
\]

Setup 4: Consider a setup where one arm dominates over the majority of the covariate space, except for a small area where it incurs a considerably high regret.

\[
g_1(x) = 1 \ \text{ for all } x \in [0, 1]; \qquad
g_2(x) = \begin{cases} 0 & 0 \le x < 0.5 \\ 100000x - 50000 & 0.5 \le x < 0.502 \\ 200 & 0.502 \le x < 0.503 \\ -100000x + 50500 & 0.503 \le x < 0.505 \\ 0 & 0.505 \le x \le 1. \end{cases}
\]

Note that, in our setup, $d = 2$ and $f_1(x_1, x_2) = g_1(x_1)\, x_2$ and $f_2(x_1, x_2) = g_2(x_1)\, x_2$. The functions above are constructed so that different data-generating scenarios can be considered when comparing $\eta_1$ and $\eta_2$ for delayed-reward settings, keeping the discussion in section 3.4 in mind. These one-dimensional functions $g_i$, for arm $i$, have been plotted in Figure 3.1.
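The four setups above can be written down directly. The sketch below is illustrative only: the names `SETUPS` and `mean_reward` and the dictionary layout are assumptions of this example, while the formulas are those given in the setups.

```python
import math


def _setup2_g1(x):
    if x < 0.5:
        return 1.0
    return -10.0 * x + 6.0 if x < 0.6 else 0.0


def _setup2_g2(x):
    if x < 0.5:
        return 0.0
    return 10.0 * x - 5.0 if x < 0.6 else 1.0


def _setup2_g3(x):
    if x < 0.3 or x >= 0.7:
        return 0.0
    if x < 0.4:
        return 20.0 * x - 6.0
    return 2.0 if x < 0.6 else -20.0 * x + 14.0


def _setup4_g2(x):
    if x < 0.5 or x >= 0.505:
        return 0.0
    if x < 0.502:
        return 100000.0 * x - 50000.0
    return 200.0 if x < 0.503 else -100000.0 * x + 50500.0


# One-dimensional functions g_i for each setup, listed in arm order.
SETUPS = {
    1: [lambda x: -2.0 * math.sin(20.0 * math.pi * x) + 3.0,
        lambda x: -2.0 * math.sin(20.0 * math.pi * x) + 2.0],
    2: [_setup2_g1, _setup2_g2, _setup2_g3],
    3: [lambda x: 2.0 * math.cos(5.0 * math.pi * x) + 2.0,
        lambda x: -2.0 * math.sin(5.0 * math.pi * x) + 2.0],
    4: [lambda x: 1.0, _setup4_g2],
}


def mean_reward(setup, arm, x1, x2):
    """f_arm(x1, x2) = g_arm(x1) * x2, with arms numbered from 1."""
    return SETUPS[setup][arm - 1](x1) * x2
```

For instance, `mean_reward(4, 2, 0.5025, 1.0)` evaluates the narrow spike of arm 2 in Setup 4, which is the region where arm 1 incurs a large regret.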
3.5.1 The simulation process and results

We simulate the data from the above-mentioned true mean reward functions as follows:

\[
Y_{i,j} = f_i(X_j) + 0.5\,\epsilon_j, \qquad i \in \{1, 2, 3\},\ j \in \mathbb{N},
\]

where $\epsilon_j \overset{i.i.d.}{\sim} N(0, 1)$. We use the Nadaraya-Watson estimator with a Gaussian kernel to estimate the mean reward functions. The algorithm in section 3.2.1 is run with strategies $\eta_1$ and $\eta_2$. We consider the following choices of hyper-parameter sequences, but in our discussion we only illustrate a few combinations for the sake of brevity:

\[
\pi_n \in \{n^{-1/4},\ \log^{-1} n,\ \log^{-2} n\} \quad \text{and} \quad h_n \in \{n^{-1/4},\ n^{-1/6},\ \log^{-1} n\}, \qquad n \ge 1.
\]

The algorithm is run for 60 replications (time horizon $N = 10000$), both for strategies $\eta_1$ and $\eta_2$. Then the regret is averaged for each round (time point) over the replications, to give a more accurate estimate of the total regret accumulated up to a given time point. Since we incorporate delays in this work, we artificially create scenarios governing when a reward will be observed. We consider the following delay scenarios in increasing order of severity of delays.

No delay: Every reward is observed instantaneously.

Delay 1: Geometric delay with probability of success (observing the reward) $p = 0.3$.

Delay 2: Every 5th reward is not observed by time $N$ and the other rewards are obtained with a geometric ($p = 0.3$) delay.

Delay 3: Each case has probability 0.7 of being delayed, and the delay is half-normal with scale parameter $\sigma = 1500$.

Delay 4: In this case we increase the number of non-observed rewards. Divide the data into four equal consecutive parts (quarters), such that, in part 1, we only observe every 10th observation (with Geom(0.3) delay) by time $N$ and do not observe the remaining ones; in part 2, we only observe every 15th observation; in part 3, we only observe every 20th observation; in part 4, we only observe every 25th observation.
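The delay scenarios listed above can be mimicked with a short generator of observation times. This is a minimal sketch under stated assumptions, not the simulation code used for the reported results: the function names are hypothetical, arrivals are taken to be at integer times $s_j = j$, the geometric delay counts failures before the first success (support 0, 1, 2, ...), non-delayed cases in Delay 3 are observed instantly, and parts 2 to 4 of Delay 4 reuse the Geom(0.3) delay of part 1.

```python
import math
import random


def geometric_delay(rng, p=0.3):
    """Number of failed trials before the first success (assumed convention)."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return int(math.log(u) / math.log(1.0 - p))


def observation_times(N, scenario, seed=0):
    """Return t_j = j + d_j for j = 1..N; math.inf means never observed."""
    rng = random.Random(seed)
    times = []
    for j in range(1, N + 1):
        if scenario == "none":
            d = 0
        elif scenario == "delay1":
            d = geometric_delay(rng)
        elif scenario == "delay2":
            d = math.inf if j % 5 == 0 else geometric_delay(rng)
        elif scenario == "delay3":
            # Delayed with probability 0.7 by |N(0, 1500^2)|, else instant.
            d = abs(rng.gauss(0.0, 1500.0)) if rng.random() < 0.7 else 0
        elif scenario == "delay4":
            quarter = min(3, (j - 1) * 4 // N)
            step = (10, 15, 20, 25)[quarter]
            d = geometric_delay(rng) if j % step == 0 else math.inf
        else:
            raise ValueError("unknown scenario: " + scenario)
        times.append(j + d)
    return times


def tau(times, n):
    """tau_n: number of rewards among the first n pulls observed by time n."""
    return sum(1 for t in times[:n] if t <= n)
```

Comparing `tau(observation_times(N, s), N)` across scenarios gives a quick view of how few rewards are available under Delay 3 and Delay 4 relative to the horizon.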
In our simulations, we noted that the difference in the cumulative regret was most discernible in the more extreme delay situations, that is, delay 3 and delay 4. Therefore, we only illustrate the results on those two delay scenarios. The plots in Figure 3.1 can be used to compare performance of strategy η1 and η2, where recall that η1 is when (pin, hτn) are used as hyper-parameters, and η2 is when (piτn , hτn) are used as hyper-parameters. On the y-axis is the average regret plotted against time on the x-axis. The rows in the figure correspond to the simulation setups and columns 2 and 3 correspond to delay 3 and delay 4, respectively. For illustration, we only show the plots corresponding to one choice of hyper-parameter sequences, {hn} = log−1 n and {pin} = log−1 n, however results from other combinations show similar trends and are included in section 3.7. Note that in setup 1 and 2, η1 performs better than η2 in terms of reducing the overall average regret. Both these setups consist of underlying mean reward functions that are well-separated and clear winners in terms of reward gain in substantial portions of the covariate space. Therefore, achieving good function estimation should not be a problem. Thus in these examples, controlling for the randomization error is crucial, which is better achieved by using pin instead of piτn , as illustrated in section 3.4. On the contrary, in setup 3 and 4, we notice that strategy η2 performs better than η1 in terms of lower average regret. This can be attributed to the fact that the underlying data generating functions in these setups are more complicated in a more localized way, thus requiring more exploration for better estimation. Therefore, using piτn instead of pin helps reduce the risk of over-exploitation, especially in the more localized high regret incurring zones. 
Another interesting observation is that for setups 1 and 2, the average regret curves for strategies $\eta_1$ and $\eta_2$ are closer with delay 3 and much more separated with delay 4, whereas in setups 3 and 4 the opposite trend is seen: the difference in the average regret curves for $\eta_1$ and $\eta_2$ is more pronounced with delay 3 than with delay 4. A possible reason for this could be that the mean reward functions for setups 1 and 2 are easy to estimate, even with as few observations as with delay 4, so fast and continuous exploitation helps reduce the regret. However, the underlying mean reward functions in setups 3 and 4 are harder to learn and perhaps, with so few observations as in delay 4, it is hard to get good estimates even with more exploration using $\pi_{\tau_n}$.

[Figure 3.1 panels: one row per setup (Setups 1 to 4). Column 1: mean reward generating functions (1D), $f(x)$ against $x$, for $g_1, g_2$ (and $g_3$ in Setup 2). Columns 2 and 3: average regret against time index for $\eta_1$ and $\eta_2$ under Delay 3 and Delay 4, with $\pi_n = (\log n)^{-1}$, $h_n = (\log n)^{-1}$.]

Figure 3.1: Strategy $\eta_1$ has lower cumulative average regret in setups 1 and 2 (first two rows) and strategy $\eta_2$ has lower cumulative average regret in setups 3 and 4 (third and fourth rows).

3.6 Other proofs

Recall that $J_{i,n+1} = \{j : I_j = i,\ t_j \le n,\ 1 \le j \le n\}$ and $M_{i,n+1}$ is the size of $J_{i,n+1}$, $A_n = \{j : t_j \le n\}$ and $\tau_n = \sum_{j=1}^{n} I(t_j \le n)$.

Lemma 3.6.1. Under the setting of the kernel estimation in section 3.3.3, let $A \subset [0, 1]^d$ be a hypercube with side-width $h$. For a given arm $i$, if Assumptions 3.3.5, 3.3.6, 3.3.10 and 3.3.11 are satisfied, then for any $\epsilon > 0$,

\[
P_{A_n, X^n}\Bigg(\sup_{x \in A} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) > \frac{\tau_n \epsilon}{1 - 1/\sqrt{2}}\Bigg)
\le \exp\Big(-\frac{\tau_n \epsilon^2}{4 c_4^2 v^2}\Big) + \exp\Big(-\frac{\tau_n \epsilon}{4 c_4 c}\Big) + \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k \tau_n \epsilon^2}{\lambda^2 v^2}\Big) + \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} \tau_n \epsilon}{2 \lambda c}\Big),
\]

where $P_{A_n, X^n}$ denotes conditional probability given $A_n = \{j : t_j \le n\}$ and $X^n = \{X_1, \ldots, X_n\}$.

Proof. The proof of this lemma follows exactly from the analogous lemma without delays in Qian and Yang (2016a). The results follow because we condition on $A_n$, and given $A_n$, $\tau_n$ is a known quantity which plays the role of $n$ in the non-delayed situation of Qian and Yang (2016a).

Next, we provide a proof for Theorem 3.3.12.

Proof of Theorem 3.3.12. Here, we prove the result for strategy $\eta_2$ and discuss how the proof for strategy $\eta_1$ follows similarly.
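For each $x \in [0, 1]^d$,

\begin{align*}
|\hat{f}_{i,n+1}(x) - f_i(x)|
&= \left| \frac{\sum_{j \in J_{i,n+1}} Y_{i,j} K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} - f_i(x) \right| \\
&= \left| \frac{\sum_{j \in J_{i,n+1}} (f_i(X_j) + \epsilon_j) K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} - f_i(x) \right| \\
&= \left| \frac{\sum_{j \in J_{i,n+1}} (f_i(X_j) - f_i(x)) K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} + \frac{\sum_{j \in J_{i,n+1}} \epsilon_j K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} \right| \\
&\le \sup_{x, y : \|x - y\|_\infty \le L h_{\tau_n}} |f_i(x) - f_i(y)| + \left| \frac{\frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} \right|,
\end{align*}

where the last inequality follows from the bounded support assumption on the kernel function $K(\cdot)$. We know from the proof of Theorem 3.3.9 that $\tau_n \to \infty$ almost surely as $n \to \infty$. Thus, by uniform continuity of the function $f_i$,

\[
\lim_{n \to \infty} \sup_{x, y : \|x - y\|_\infty \le L h_{\tau_n}} |f_i(x) - f_i(y)| = 0,
\]

almost surely. Therefore, we only need,

\[
\sup_{x \in [0,1]^d} \left| \frac{\frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} \right| \xrightarrow{a.s.} 0 \quad \text{as } n \to \infty. \tag{3.19}
\]

We first show that,

\[
\inf_{x \in [0,1]^d} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) > \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}, \tag{3.20}
\]

almost surely for large enough $n$. Indeed, for each $n \ge m_0 + 1$, given $\tau_n$, we can partition the unit cube $[0, 1]^d$ into $\tilde{B}$ bins with bin width $L_1 h_{\tau_n}$ such that $\tilde{B} \le 1/(L_1 h_{\tau_n})^d$. We denote these bins by $\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_{\tilde{B}}$. Let $\sigma_t = \inf\{\tilde{n} : \sum_{j=1}^{\tilde{n}} I(t_j \le n) \ge t\}$. Given an arm $i$ and $1 \le k \le \tilde{B}$, for every $x \in \tilde{A}_k$, given $\tau_n$ we have that,

\begin{align*}
\sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big)
&= \sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i)\, K\Big(\frac{x - X_{\sigma_t}}{h_{\tau_n}}\Big) \\
&\ge \sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i,\ X_{\sigma_t} \in \tilde{A}_k)\, K\Big(\frac{x - X_{\sigma_t}}{h_{\tau_n}}\Big) \\
&\ge c_3 \sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i,\ X_{\sigma_t} \in \tilde{A}_k),
\end{align*}

where the last inequality follows from Assumption 3.3.11 (boundedness of kernels). Therefore,

\begin{align*}
& P_{A_n, X^n}\Bigg(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Bigg) \\
&\quad \le P_{A_n, X^n}\Bigg(\inf_{x \in \tilde{A}_k} \frac{1}{\tau_n h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Bigg) \\
&\quad \le P_{A_n, X^n}\Bigg(\frac{c_3}{\tau_n h_{\tau_n}^d} \sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i,\ X_{\sigma_t} \in \tilde{A}_k) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Bigg) \\
&\quad \le P_{A_n, X^n}\Bigg(\sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i,\ X_{\sigma_t} \in \tilde{A}_k) \le \frac{\underline{c}\, \tau_n (L_1 h_{\tau_n})^d \pi_{\tau_n}}{2}\Bigg).
\end{align*}

Note that $P_{A_n, X^n}(I_{\sigma_t} = i,\ X_{\sigma_t} \in \tilde{A}_k) \ge \underline{c} (L_1 h_{\tau_n})^d \pi_{\tau_n}$, for all $1 \le t \le n$.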
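Then,

\[
P_{A_n, X^n}\Bigg(\sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i,\ X_{\sigma_t} \in \tilde{A}_k) \le \frac{\underline{c}\, \tau_n (L_1 h_{\tau_n})^d \pi_{\tau_n}}{2}\Bigg) \le \exp\Big(-\frac{3 \underline{c}\, \tau_n (L_1 h_{\tau_n})^d \pi_{\tau_n}}{28}\Big).
\]

Therefore, we get that,

\[
P_{A_n, X^n}\Bigg(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Bigg) \le \exp\Big(-\frac{3 \underline{c}\, \tau_n (L_1 h_{\tau_n})^d \pi_{\tau_n}}{28}\Big). \tag{3.21}
\]

Now consider,

\begin{align*}
& P\Bigg(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Bigg) \\
&\quad = P\Bigg(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2},\ \tau_n > \frac{E(\tau_n)}{2}\Bigg) \\
&\qquad + P\Bigg(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2},\ \tau_n \le \frac{E(\tau_n)}{2}\Bigg) \\
&\quad \le E\, P_{A_n, X^n}\Bigg(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2},\ \tau_n > \frac{E(\tau_n)}{2}\Bigg) + P\Big(\tau_n \le \frac{E(\tau_n)}{2}\Big) \\
&\quad \le \exp\Big(-\frac{3 \underline{c} (L_1 h_{\tau_n})^d \pi_{\tau_n} E(\tau_n)}{56}\Big) + \exp\Big(-\frac{3 E(\tau_n)}{28}\Big),
\end{align*}

where the last inequality follows from (3.21) and Bernstein's inequality (A.2). Hence,

\begin{align*}
& P\Bigg(\inf_{x \in [0,1]^d} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Bigg) \\
&\quad \le \sum_{k=1}^{\tilde{B}} P\Bigg(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Bigg) \\
&\quad \le \tilde{B} \Bigg(\exp\Big(-\frac{3 \underline{c} (L_1 h_{\tau_n})^d \pi_{\tau_n} E(\tau_n)}{56}\Big) + \exp\Big(-\frac{3 E(\tau_n)}{28}\Big)\Bigg) \\
&\quad \le \tilde{B} \Bigg(\exp\Big(-\frac{3 \tilde{c} (L_1 h_{q(n)})^d \pi_{q(n)} q(n)}{56}\Big) + \exp\Big(-\frac{3 a_1 q(n)}{28}\Big)\Bigg),
\end{align*}

where the last inequality follows from Assumption 3.3.7 and the condition (3.12). Here, $\tilde{c}$ and $a_1$ are constants due to the use of Assumption 3.3.7, which says that $E(\tau_n) \ge a_1 q(n)$ for some constant $a_1 > 0$. Also, the same condition $\frac{q(n)\, h_{q(n)}^{2d}\, \pi_{q(n)}^4}{\log n} \to \infty$ ensures that the right-hand side above is summable, and by the Borel-Cantelli lemma (Lemma A.0.1), we have (3.20). In order to prove (3.19), we now need to show that,

\[
\sup_{x \in [0,1]^d} \Bigg| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| = o(\pi_{\tau_n}), \quad \text{almost surely.} \tag{3.22}
\]

For each $n > m_0 + 1$, we can partition the unit cube $[0, 1]^d$ into $B$ bins with bin length $h_{\tau_n}$ such that $B \le 1/h_{\tau_n}^d$. We denote these bins by $A_1, A_2, \ldots, A_B$.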
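Then, given $\epsilon > 0$, consider,

\begin{align*}
& P_{A_n, X^n}\Bigg(\sup_{x \in [0,1]^d} \Bigg| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \epsilon \pi_{\tau_n}\Bigg) \\
&\quad \le B \max_{1 \le k \le B} P_{A_n, X^n}\Bigg(\sup_{x \in A_k} \Bigg| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \epsilon \pi_{\tau_n}\Bigg) \\
&\quad \le B \max_{1 \le k \le B} P_{A_n, X^n}\Bigg(\sup_{x \in A_k} \Bigg| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \epsilon \pi_{\tau_n},\ \frac{M_{i,n+1}}{\tau_n} > \frac{\pi_{\tau_n}}{2}\Bigg) + B\, P_{A_n, X^n}\Big(\frac{M_{i,n+1}}{\tau_n} \le \frac{\pi_{\tau_n}}{2}\Big) \\
&\quad \le B \max_{1 \le k \le B} P_{A_n, X^n}\Bigg(\sup_{x \in A_k} \Bigg| \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \frac{\tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{2}\Bigg) + B\, P_{A_n, X^n}\Big(\frac{M_{i,n+1}}{\tau_n} \le \frac{\pi_{\tau_n}}{2}\Big) \\
&\quad \le B \max_{1 \le k \le B} P_{A_n, X^n}\Bigg(\sup_{x \in A_k} \Bigg| \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \frac{\tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{2}\Bigg) + B \exp\Big(-\frac{3 \tau_n \pi_{\tau_n}}{28}\Big), \tag{3.23}
\end{align*}

where the last inequality follows from (A.2). Note that using Lemma 3.6.1,

\begin{align*}
& P_{A_n, X^n}\Bigg(\sup_{x \in A_k} \Bigg| \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \frac{\tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{2}\Bigg) \\
&\quad \le 2 \exp\Big(-\frac{(\sqrt{2} - 1)^2 \tau_n \pi_{\tau_n}^4 h_{\tau_n}^{2d} \epsilon^2}{32 c_4^2 v^2}\Big) + 2 \exp\Big(-\frac{(\sqrt{2} - 1) \tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{8 \sqrt{2} c_4 c}\Big) \\
&\qquad + 2 \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k (\sqrt{2} - 1)^2 \tau_n \pi_{\tau_n}^4 h_{\tau_n}^{2d} \epsilon^2}{8 \lambda^2 v^2}\Big) + 2 \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} (\sqrt{2} - 1) \tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{4 \sqrt{2} \lambda c}\Big). \tag{3.24}
\end{align*}

Using (3.23) and (3.24), we get that,

\begin{align*}
& P_{A_n, X^n}\Bigg(\sup_{x \in [0,1]^d} \Bigg| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \epsilon \pi_{\tau_n}\Bigg) \\
&\quad \le 2B \exp\Big(-\frac{(\sqrt{2} - 1)^2 \tau_n \pi_{\tau_n}^4 h_{\tau_n}^{2d} \epsilon^2}{32 c_4^2 v^2}\Big) + 2B \exp\Big(-\frac{(\sqrt{2} - 1) \tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{8 \sqrt{2} c_4 c}\Big) \\
&\qquad + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k (\sqrt{2} - 1)^2 \tau_n \pi_{\tau_n}^4 h_{\tau_n}^{2d} \epsilon^2}{8 \lambda^2 v^2}\Big) + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} (\sqrt{2} - 1) \tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{4 \sqrt{2} \lambda c}\Big) \\
&\qquad + B \exp\Big(-\frac{3 \tau_n \pi_{\tau_n}}{28}\Big).
\end{align*}

Now consider,

\begin{align*}
& P\Bigg(\sup_{x \in [0,1]^d} \Bigg| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \epsilon \pi_{\tau_n}\Bigg) \\
&\quad \le E\, P_{A_n, X^n}\Bigg(\sup_{x \in [0,1]^d} \Bigg| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \epsilon \pi_{\tau_n},\ \tau_n > \frac{E(\tau_n)}{2}\Bigg) + P\Big(\tau_n \le \frac{E(\tau_n)}{2}\Big).
\end{align*}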
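Let $n_e = \lfloor E(\tau_n)/2 \rfloor$; then, using condition (3.12),

\begin{align*}
& P\Bigg(\sup_{x \in [0,1]^d} \Bigg| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Bigg| > \epsilon \pi_{\tau_n}\Bigg) \\
&\quad \le 2B \exp\Big(-\frac{(\sqrt{2} - 1)^2 n_e \pi_{n_e}^4 h_{n_e}^{2d} \epsilon^2}{32 c_4^2 v^2}\Big) + 2B \exp\Big(-\frac{(\sqrt{2} - 1) n_e \pi_{n_e}^2 h_{n_e}^d \epsilon}{8 \sqrt{2} c_4 c}\Big) \\
&\qquad + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k (\sqrt{2} - 1)^2 n_e \pi_{n_e}^4 h_{n_e}^{2d} \epsilon^2}{8 \lambda^2 v^2}\Big) + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} (\sqrt{2} - 1) n_e \pi_{n_e}^2 h_{n_e}^d \epsilon}{4 \sqrt{2} \lambda c}\Big) \\
&\qquad + B \exp\Big(-\frac{3 n_e \pi_{n_e}}{28}\Big) + \exp\Big(-\frac{3 E(\tau_n)}{28}\Big) \\
&\quad \le 2B \exp\Big(-\frac{(\sqrt{2} - 1)^2 \tilde{a}_1 q(n) \pi_{q(n)}^4 h_{q(n)}^{2d} \epsilon^2}{64 c_4^2 v^2}\Big) + 2B \exp\Big(-\frac{(\sqrt{2} - 1) \tilde{a}_2 q(n) \pi_{q(n)}^2 h_{q(n)}^d \epsilon}{16 \sqrt{2} c_4 c}\Big) \\
&\qquad + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k (\sqrt{2} - 1)^2 \tilde{a}_1 q(n) \pi_{q(n)}^4 h_{q(n)}^{2d} \epsilon^2}{16 \lambda^2 v^2}\Big) + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} (\sqrt{2} - 1) \tilde{a}_2 q(n) \pi_{q(n)}^2 h_{q(n)}^d \epsilon}{8 \sqrt{2} \lambda c}\Big) \\
&\qquad + B \exp\Big(-\frac{3 \tilde{a}_3 q(n) \pi_{q(n)}}{56}\Big) + \exp\Big(-\frac{3 a_1 q(n)}{28}\Big),
\end{align*}

where $\tilde{a}_1$ is a constant that occurs due to Assumption 6 and the choice of hyperparameter sequence when applied to the constant $a_1$, where $a_1$ is a positive constant such that $E(\tau_n) \ge a_1 q(n)$ for large enough $n$. Using condition (3.12), $\frac{q(n)\, \pi_{q(n)}^4\, h_{q(n)}^{2d}}{\log n} \to \infty$, it is easy to see that the right-hand side above is summable. Then, by the Borel-Cantelli Lemma, we can conclude (3.22), thus proving the theorem. Note that, following the same lines of proof, we could prove strong consistency for $\eta_1$ by just replacing $\pi_{\tau_n}$ with $\pi_n$.

3.7 Supplementary simulation results

In this section, we plot the average regret curves for both strategies $\eta_1$ and $\eta_2$ for different hyper-parameter choices. In Figure 3.2, we choose $\{h_n\} = \log^{-1} n$ and $\{\pi_n\} = \log^{-2} n$. We still notice the same trend, where $\eta_1$ performs better than strategy $\eta_2$ in setup 1 and setup 2, while $\eta_2$ performs better in setup 3 and setup 4. Notice that, for setups 1 and 2, in the case of delay scenario 3, the difference in the average regret is not as noticeable as it is in delay 4. This could be attributed to the fast-decaying $\{\pi_n\} = \log^{-2} n$: whether the sequence is updated at every time point or only at observed-reward time points, there is a sharp increase in the amount of exploitation with the amount of data available in the delay 3 scenario, unlike the delay 4 scenario.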
We also notice that, in setup 3 with delay 4, the average regret does not seem to decay by our time horizon and might need a larger horizon to show some decay, which could be because the exploration probability is decaying too fast for both algorithms to learn efficiently. Figure 3.3 and Figure 3.4 correspond to the choices $(h_n, \pi_n) = (n^{-1/4}, n^{-1/4})$ and $(n^{-1/4}, \log^{-1} n)$, respectively. We see very similar trends as discussed in section 3.5 and for Figure 3.2.

[Figure 3.2 panels: one row per setup (Setups 1 to 4). Column 1: mean reward generating functions (1D), $f(x)$ against $x$. Columns 2 and 3: average regret against time index for $\eta_1$ and $\eta_2$ under delay 3 and delay 4, with $\pi_n = (\log n)^{-2}$, $h_n = (\log n)^{-1}$.]

Figure 3.2: Each row represents a setup, with the first column depicting the one-dimensional functions used to generate the mean reward functions. The second and the third columns depict the average regret over time for delay 3 and delay 4, respectively. Here, $\{h_n\} = \log^{-1} n$, $\{\pi_n\} = \log^{-2} n$.

[Figure 3.3 panels: same layout as Figure 3.2, with $\pi_n = n^{-1/4}$, $h_n = n^{-1/4}$.]

Figure 3.3: Strategy $\eta_1$ has lower cumulative average regret in setups 1 and 2 (first two rows) and strategy $\eta_2$ has lower cumulative average regret in setups 3 and 4 (third and fourth rows). Here, $\{h_n\} = n^{-1/4}$, $\{\pi_n\} = n^{-1/4}$.

[Figure 3.4 panels: same layout as Figure 3.2, with $\pi_n = (\log n)^{-1}$, $h_n = n^{-1/4}$.]

Figure 3.4: Strategy $\eta_1$ has lower cumulative average regret in setups 1 and 2 (first two rows) and strategy $\eta_2$ has lower cumulative average regret in setups 3 and 4 (third and fourth rows). Here, $\{h_n\} = n^{-1/4}$, $\{\pi_n\} = \log^{-1} n$.

Chapter 4

Finite-time analysis for randomized allocation strategies

In this chapter we conduct a finite-time regret analysis for the strategies $\eta_1$ and $\eta_2$ as proposed in algorithm 3.2 in chapter 3. Note that, in terms of notation and to enhance readability, we will use $i$ for arms, $j, n$ for time indices and $N$ for the total time horizon. To recall the problem setup, we assume that there are $\ell \ge 2$ arms available for allocation. Each arm allocation results in a reward which is obtained at some random time after the arm allocation. For each patient $j \ge 1$, visiting at known times $s_j \in \mathbb{R}^+$, a treatment $I_j$ is allotted based on the data observed previously and the covariate $X_j$. We assume that the covariates are $d$-dimensional continuous random variables and take values in the hypercube $[0, 1]^d$. Since the rewards may be obtained at some delayed time, we denote by $\{t_j \in \mathbb{R}^+, j \ge 1\}$ the observation times for the rewards for arms $\{I_j, j \ge 1\}$, respectively. Let $Y_{i,j}$ be the reward obtained at time $t_j \ge s_j$ for arm $i = I_j$. The mean reward with covariate $X_j$ for the $i$th arm is denoted as $f_i(X_j)$, $1 \le i \le \ell$. The observed reward with covariate $X_j$ by pulling the $I_j$th arm is modeled as,

\[
Y_{I_j, j} = f_{I_j}(X_j) + \epsilon_j,
\]

where $\epsilon_j$ denotes a random independent error with $E(\epsilon_j) = 0$ and $\mathrm{Var}(\epsilon_j) < \infty$ for all $j \in \mathbb{N}$. The functions $f_i$, $1 \le i \le \ell$, are assumed to be unknown and not of any given parametric form. Let $\{X_j, j \ge 1\}$ be a sequence of covariates independently generated according to an unknown underlying probability distribution $P_X$, from a population supported in $[0, 1]^d$. Let $\{\pi_n, n \ge 1\}$ be a sequence of probabilities decreasing to 0 as $n \to \infty$.
Let $\eta$ be a sequential allocation rule, which for each time $j$ chooses an arm $I_j$ based on the previous observations and $X_j$. The total mean reward up to time $n$ is $\sum_{j=1}^{n} f_{I_j}(X_j)$. The rewards are observed at delayed times $t_j$; the delay in the reward for arm $I_j$ pulled at the $j$th time is given by a random variable $d_j := t_j - s_j$. Assume that these delays are independent of both the covariates and the arms. That is, let $\{d_j, j \ge 1\}$ be a sequence of random variables with probability distributions $d_j \sim G_j$ for all $j \in \mathbb{N}$. To evaluate the performance of the allocation strategy, let $i^*(x) = \arg\max_{1 \le i \le \ell} f_i(x)$ and $f^*(x) = f_{i^*(x)}(x)$. Without the knowledge of the random errors, the ideal performance occurs when the selected arms $I_1, \ldots, I_n$ match the optimal arms $i^*(X_1), \ldots, i^*(X_n)$, yielding the optimal total reward $\sum_{j=1}^{n} f^*(X_j)$. Thus we measure the performance of the allocation rule $\eta$ by the cumulative regret,

\[
R_n(\eta) = \sum_{j=1}^{n} \big(f^*(X_j) - f_{I_j}(X_j)\big).
\]

This is the quantity we use for finite-time analysis of our proposed strategies. We also define the per-round or average regret $r_n(\eta)$ by

\[
r_n(\eta) = \frac{R_n(\eta)}{n} = \frac{1}{n} \sum_{j=1}^{n} \big(f^*(X_j) - f_{I_j}(X_j)\big).
\]

A strategy $\eta$ is strongly consistent if $r_n(\eta) = o_p(1)$, and the finite-time analysis provides an upper bound on the rate of this decay.

4.1 Finite-time regret analysis

We start by making some assumptions on the errors, the underlying functions, the kernel function used in the definition of the Nadaraya-Watson estimator (4.1) and the delays, which will be used in the subsequent results.

Assumption 4.1.1. The errors satisfy a (conditional) moment condition: there exist positive constants $v$ and $c$ such that for all integers $k \ge 2$ and $n \ge 1$,

\[
E(|\epsilon_n|^k \mid X_n) \le \frac{k!}{2} v^2 c^{k-2},
\]

almost surely.

This assumption imposes moment conditions on the error distributions known as the refined Bernstein condition (as in Birgé et al. (1998); Qian and Yang (2016a)).
Assumption 4.1.1 is met for a wide range of distributions; for example, normally distributed and bounded errors satisfy this assumption, making it viable in a wide range of applications. Next, we consider two natural assumptions on the mean reward functions and the covariate density, respectively. Although we restrict the covariate space to $[0, 1]^d$, any compact subset of $\mathbb{R}^d$ would suffice.

Assumption 4.1.2. The functions $f_i$ are continuous on $[0, 1]^d$ with,

\[
A := \sup_{1 \le i \le \ell}\ \sup_{x \in [0,1]^d} \big(f^*(x) - f_i(x)\big) < \infty.
\]

Assumption 4.1.3. The design distribution $P_X$ is dominated by the Lebesgue measure with a continuous density $p(x)$ uniformly bounded above and away from 0 on $[0, 1]^d$; that is, $p(x)$ satisfies $\underline{c} \le p(x) \le \bar{c}$ for $0 < \underline{c} \le \bar{c}$.

For kernel regression, we consider a multivariate nonnegative kernel function $K(u) : \mathbb{R}^d \to \mathbb{R}$ that satisfies Lipschitz, boundedness and bounded support conditions.

Assumption 4.1.4. For some constant $0 < \lambda < \infty$, $|K(u) - K(u')| \le \lambda \|u - u'\|_\infty$, for all $u, u' \in \mathbb{R}^d$.

Assumption 4.1.5. There exist constants $L_1 \le L$, $c_3 > 0$ and $c_4 \ge 1$ such that $K(u) = 0$ for $\|u\|_\infty > L$, $K(u) \ge c_3$ for $\|u\|_\infty \le L_1$ and $K(u) \le c_4$ for all $u \in \mathbb{R}^d$.

The next assumption is an independence assumption on the delays. We try to relax this assumption in Section 4.2.

Assumption 4.1.6. The delays $\{d_j, j \ge 1\}$ are independent of each other, of the choice of arms and of the covariates.

The next assumption mildly restricts the expected number of delayed rewards, such that we expect to observe an increasing number of rewards as time progresses. The assumption is not restrictive, as it allows the delays to be unbounded as long as a minimum number of rewards are observed in finite time. This assumption would naturally hold for many scenarios with delayed rewards where some informed learning is plausible.

Assumption 4.1.7.
Let the partial sums of the delay distributions satisfy $\sum_{j=1}^{n} G_j(n - s_j) = \Omega(q(n))$, where $q(n)$ could be a function as in Assumption 2.3.4 in Chapter 2, such that $q(n) \to \infty$ as $n \to \infty$.

Next, we also provide a mild assumption on an upper bound for the expected number of observed rewards by the time horizon $N$.

Assumption 4.1.8. We assume that, for a given $\delta > 0$ and time horizon $N$,

\[
E(\tau_N) < N - \sqrt{\frac{N}{2} \log\Big(\frac{1}{\delta}\Big)}.
\]

4.1.1 Nadaraya-Watson regression

We focus on Nadaraya-Watson regression and study its finite-time performance under the proposed allocation strategies $\eta_1$ and $\eta_2$. Let $\tau_n = \sum_{j=1}^{n} I\{t_j \le n\}$ be the random running index of the number of rewards observed by time $n$; we choose $h_{\tau_n}$ for the bandwidth sequence. Recall that we use $\{\pi_n\}$ and $\{h_n\}$ to denote the user-defined sequences for the exploration probability and the bandwidth. We remove the $\{\cdot\}$ to distinguish between strategies $\eta_1$ and $\eta_2$, and use $\pi_{\tau_n}$ and $\pi_n$ to denote the sequence updating only at time points corresponding to observed rewards and the sequence updating at every time point, respectively.

For arm $1 \le i \le \ell$, at each time point $n$, define $J_{i,n} = \{j : I_j = i,\ 1 \le t_j \le n - 1,\ 1 \le j \le n - 1\}$ to be the indices corresponding to the rewards which were observed by time $n - 1$. Let $A_N = \{(s_j, t_j) : t_j \le N,\ j \ge 1\}$ denote the time points for which rewards were obtained by time $N$ and $X^n = \{X_1, X_2, \ldots, X_n\}$ denote the set of design points until time $n$. Recall that the Nadaraya-Watson estimator of $f_i(x)$ is,

\[
\hat{f}_{i,n+1}(x) = \frac{\sum_{j \in J_{i,n+1}} Y_{i,j} K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)}. \tag{4.1}
\]

Given $x \in [0, 1]^d$, $1 \le i \le \ell$ and $n \ge m_0 + 1$, define $Q_{n+1}(x) = \{j : 1 \le t_j \le n,\ \|x - X_j\|_\infty \le L h_{\tau_n}\}$ and $Q_{i,n+1}(x) = \{j : 1 \le t_j \le n,\ I_j = i,\ \|x - X_j\|_\infty \le L h_{\tau_n}\}$. Let $M_{n+1}(x)$ and $M_{i,n+1}(x)$ be the sizes of $Q_{n+1}(x)$ and $Q_{i,n+1}(x)$, respectively. In the next section, we provide finite-time analysis for both strategies $\eta_1$ and $\eta_2$, and compare the two to see if the results support our findings in chapter 3.
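The estimator (4.1) with delayed rewards can be sketched in a few lines. This is an illustrative sketch only: the function names, the 0-based data layout, the default bandwidth sequence $h_m = m^{-1/4}$ and the use of the uniform kernel $I(\|u\|_\infty \le L)$ with $L = 1$ are assumptions of this example, not a statement of how the reported results were computed.

```python
def box_kernel(u, L=1.0):
    """Uniform kernel I(||u||_inf <= L): bounded, with bounded support."""
    return 1.0 if max(abs(v) for v in u) <= L else 0.0


def tau_n(obs_times, n):
    """Number of rewards among the first n pulls that are observed by time n."""
    return sum(1 for t in obs_times[:n] if t <= n)


def nw_estimate(x, arm, n, X, arms, rewards, obs_times,
                h_seq=lambda m: m ** -0.25, kernel=box_kernel):
    """Nadaraya-Watson estimate of f_arm(x) using rewards observed by time n.

    Index j (0-based) stands for time j + 1. Only pulls with arms[j] == arm
    and obs_times[j] <= n enter the sums, mirroring the index set J_{arm,n+1}.
    Returns None when no usable observation carries positive kernel weight.
    """
    m = tau_n(obs_times, n)
    if m == 0:
        return None
    h = h_seq(m)  # bandwidth indexed by the number of observed rewards
    num = den = 0.0
    for j in range(n):
        if arms[j] == arm and obs_times[j] <= n:
            w = kernel([(a - b) / h for a, b in zip(x, X[j])])
            num += rewards[j] * w
            den += w
    return num / den if den > 0 else None
```

Returning `None` when the denominator vanishes is a design choice of the sketch; it simply flags the situation in which the denominator of (4.1) carries no information.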
To prevent the denominator of the Nadaraya-Watson estimator in (4.1) from being extremely small, we will replace the kernel $K(\cdot)$ in (4.1) with a uniform kernel $I(\|u\|_\infty \le L)$. In particular, we use the uniform kernel when the complement of the event $B_{i,n}$, defined as
\[ B^c_{i,n} := \left\{ \frac{1}{M_{i,n+1}(x)} \sum_{j\in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right) < c_5 \right\}, \tag{4.2} \]
occurs almost surely for some small positive constant $0 < c_5 < 1$. This usage is seen in Lemma 4.1.10 and Lemma 4.1.11.

Given $0 < \delta < 1$ and the total time horizon $N$, for strategy $\eta_2$ and a positive constant $\tilde a_1$, we define a special time point $n'_\delta$ by
\[ n'_\delta = \min\left\{ n > m_0 : \exp\left( -\frac{3\underline{c}\,\tilde a_1 (2Lh_{q(n)})^d \pi_{q(n)} q(n)}{112} \right) \le \frac{\delta}{4\ell N} \right\}. \tag{4.3} \]
Under condition (3.6), $h^d_{q(n)}\pi_{q(n)}q(n)/\log(n) \to \infty$ as $n \to \infty$, we have $n'_\delta/N \to 0$ as $N \to \infty$. Similarly, given $0 < \delta < 1$ and time horizon $N$, for strategy $\eta_1$ and some positive constant $\tilde{\tilde a}_1$, we define a special time point $n''_\delta$ by
\[ n''_\delta = \min\left\{ n > m_0 : \exp\left( -\frac{3\underline{c}\,\tilde{\tilde a}_1 (2Lh_{q(n)})^d \pi_{n} q(n)}{112} \right) \le \frac{\delta}{4\ell N} \right\}. \tag{4.4} \]
Under condition (3.5), $h^d_{q(n)}\pi_{n}q(n)/\log(n) \to \infty$ as $n \to \infty$, we have $n''_\delta/N \to 0$ as $N \to \infty$. Therefore, for a large enough time horizon $N$, we have $N > n''_\delta$.

Lemma 4.1.9. Under Assumption 4.1.6 and Assumption 4.1.7, $\tau_n \to \infty$ almost surely as $n \to \infty$.

Proof. Recall that $\tau_n = \sum_{j=1}^{n} I\{t_j \le n\}$. Then
\[ E(\tau_n) = E\left(\sum_{j=1}^{n} I\{t_j \le n\}\right) = \sum_{j=1}^{n} P(t_j \le n) = \sum_{j=1}^{n} G_j(n - s_j). \]
By Assumption 4.1.7, there exists a positive constant $a_1$ such that, for large enough $n$, $\sum_{j=1}^{n} G_j(n - s_j) \ge a_1 q(n)$. Then, using inequality (A.2), we get
\[ P\left(\tau_n \le \frac{a_1 q(n)}{2}\right) \le P\left(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\right) \le \exp\left(-\frac{3\sum_{j=1}^{n} G_j(n - s_j)}{28}\right) \le \exp\left(-\frac{3 a_1 q(n)}{28}\right). \]
It is easy to see that the upper bound is summable in $n$ under conditions (3.5) and (3.6) from Chapter 3. By the Borel-Cantelli lemma, the event $\{\tau_n \le a_1 q(n)/2\}$ occurs only finitely often almost surely, so $\tau_n > a_1 q(n)/2$ for all sufficiently large $n$; since $q(n) \to \infty$, it follows that $\tau_n \to \infty$ almost surely. Note that, by construction, this implies that $h_{\tau_n} \to 0$ and $\pi_{\tau_n} \to 0$ almost surely as $n \to \infty$.
As an immediate consequence of this, along with the continuity of $f$, we get that $w(h_{\tau_n}; f) \to 0$ almost surely as $n \to \infty$.

Lemma 4.1.10 (For strategy $\eta_2$). Suppose Assumptions 4.1.1, 4.1.2, 4.1.5 and 4.1.6 are satisfied and $\{\pi_n\}$ is a decreasing sequence. Given $x \in [0,1]^d$, $1 \le i \le \ell$ and $n \ge m_0 + 1$, for every $\epsilon > w(Lh_{\tau_n}; f_i)$ a.s., we have for strategy $\eta_2$,
\[ P^{\eta_2}_{X^n, A_N}\left(|\hat f_{i,n+1}(x) - f_i(x)| \ge \epsilon\right) \le \exp\left(-\frac{3 M_{n+1}(x)\pi_{\tau_n}}{28}\right) + 4N \exp\left(-\frac{c_5^2 M_{n+1}(x)\pi_{\tau_n}(\epsilon - w(Lh_{\tau_n}; f_i))^2}{4c_4^2 v^2 + 4c_4 c(\epsilon - w(Lh_{\tau_n}; f_i))}\right), \tag{4.5} \]
where $P^{\eta_2}_{X^n, A_N}(\cdot)$ denotes the conditional probability for strategy $\eta_2$ given the design points $X^n = \{X_1, X_2, \ldots, X_n\}$, $A_N = \{(s_j, t_j);\ t_j \le N,\ j \ge 1\}$ and $\tau_n = \sum_{j=1}^{n} I\{t_j \le n\}$, which is a known quantity given $A_N$.

Similarly, one could derive the analogous result for strategy $\eta_1$. The proofs for the two results are very similar and only one of them is presented in Section 4.4.1.

Lemma 4.1.11 (For strategy $\eta_1$). Suppose Assumptions 4.1.1, 4.1.2, 4.1.5 and 4.1.6 are satisfied and $\{\pi_n\}$ is a decreasing sequence. Given $x \in [0,1]^d$, $1 \le i \le \ell$ and $n \ge m_0 + 1$, for every $\epsilon > w(Lh_{\tau_n}; f_i)$ a.s., we have for strategy $\eta_1$,
\[ P^{\eta_1}_{X^n, A_N}\left(|\hat f_{i,n+1}(x) - f_i(x)| \ge \epsilon\right) \le \exp\left(-\frac{3 M_{n+1}(x)\pi_{n}}{28}\right) + 4N \exp\left(-\frac{c_5^2 M_{n+1}(x)\pi_{n}(\epsilon - w(Lh_{\tau_n}; f_i))^2}{4c_4^2 v^2 + 4c_4 c(\epsilon - w(Lh_{\tau_n}; f_i))}\right), \tag{4.6} \]
where $P^{\eta_1}_{X^n, A_N}(\cdot)$ denotes the conditional probability for strategy $\eta_1$ given the design points $X^n = \{X_1, X_2, \ldots, X_n\}$, $A_N = \{(s_j, t_j);\ t_j \le N,\ j \ge 1\}$ and $\tau_n = \sum_{j=1}^{n} I\{t_j \le n\}$, which is a known quantity given $A_N$.

Lemma 4.1.10 and Lemma 4.1.11 differ only in the hyperparameter choice of $\pi_{\tau_n}$ versus $\pi_n$; everything else remains the same. The reason is that both are conditional probability results and, given $A_N$, $\tau_n$ is an observed quantity. Next, we provide the theorems giving finite-time bounds on the cumulative regret for strategies $\eta_2$ and $\eta_1$, respectively.

Theorem 4.1.12. Suppose Assumptions 4.1.1-4.1.8 are satisfied and $\{\pi_n\}$ is a decreasing sequence.
Assume $N > n'_\delta$, with the kernel estimator as defined in (4.1) and the kernel chosen as described in (4.2). Then for $0 < \delta \le 1/4$, we have that, with probability at least $1 - \frac{32\delta}{9}$, the cumulative regret for $\eta_2$ satisfies
\[ R_N(\eta_2) < A n'_\delta + \sum_{n=n'_\delta+1}^{N} \left( 2\max_{1\le i\le \ell} w(Lh_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{(2L)^d h^d_{q(n)}\pi_{q(n)} q(n)}} \right) + A\sum_{t=1}^{N^*(\delta)} M_\delta(\ell-1)\pi_t + \max\left\{ A\sqrt{\frac{M_\delta E(\tau_N)}{2}\log\left(\frac{2}{\delta}\right)},\ A\sqrt{\frac{N}{2}\log\left(\frac{2}{\delta}\right)} \right\}, \]
where $N^*(\delta) = E(\tau_N) + \sqrt{\frac{N}{2}\log\left(\frac{1}{\delta}\right)}$, $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12\ell N^2/\delta) / (c_5^2\, \underline{c}\, (2L)^d)}$, and $M_\delta$ is a number chosen such that
\[ \left(1 - \frac{a_1 q(M_\delta/2)}{M_\delta/2}\right)^{M_\delta/2} = \delta, \]
where $q(\cdot)$ comes from Assumption 4.1.7.

Proof. The proof is in Section 4.4.2.

The right hand side of the inequality in Theorem 4.1.12 consists of several terms that are intuitively meaningful. The first term, $A n'_\delta$, comes from the initial rough exploration. The second term has two components: $\max_{1\le i\le \ell} w(Lh_{q(n)}; f_i)$, which is associated with the estimation bias, and $C_{N,\delta}/\sqrt{h^d_{q(n)}\pi_{q(n)} q(n)}$, which can be associated with the estimation standard error and depends on the delay. That is, if the delays are expected to be large, then $q(n)$ will be small, as a result of which the estimation standard error will be large. The next term, $\sum_{t=1}^{N^*(\delta)} M_\delta(\ell-1)\pi_t$, is the randomization error, where $M_\delta$ is a probabilistic upper bound on the difference between consecutive reward observations. This may be quite large in large-delay situations, leading to a large randomization error. Finally, the last term reflects the fluctuation of the randomization scheme, and it also depends on the extent of delays in observing the rewards.

Theorem 4.1.13. Suppose Assumptions 4.1.1-4.1.7 are satisfied and $\{\pi_n\}$ is a decreasing sequence. Assume $N > n''_\delta$, with the kernel estimator as defined in (4.1) and the kernel chosen as described in (4.2).
Then with probability larger than $1 - 2\delta$, the cumulative regret for strategy $\eta_1$ satisfies
\[ R_N(\eta_1) < A n''_\delta + \sum_{n=n''_\delta+1}^{N} \left( 2\max_{1\le i\le \ell} w(Lh_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{h^d_{q(n)}\pi_{n} q(n)}} + A(\ell-1)\pi_n \right) + A\sqrt{\frac{N}{2}\log\left(\frac{1}{\delta}\right)}, \]
where $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12\ell N^2/\delta) / (c_5^2\, \underline{c}\, (2L)^d)}$.

The proof of Theorem 4.1.13 can be found in Section 4.4.3. The right hand side of the inequality in Theorem 4.1.13 also consists of several terms that are intuitively meaningful. The first term, $A n''_\delta$, comes from the initial rough exploration. The second term has three components. The first, $\max_{1\le i\le \ell} w(Lh_{q(n)}; f_i)$, is associated with the estimation bias. The second, $C_{N,\delta}/\sqrt{h^d_{q(n)}\pi_{n} q(n)}$, can be associated with the estimation standard error, which depends on the delay: if the delays are expected to be large, then $q(n)$ is small, as a result of which the estimation standard error is large. The third, $A(\ell-1)\pi_n$, is the randomization error. It is not affected by the delay because, under the proposed allocation strategy, allocations are made at each time point. The last term, $A\sqrt{(N/2)\log(1/\delta)}$, reflects the fluctuation of the randomization scheme.

As both upper bounds in Theorem 4.1.12 and Theorem 4.1.13 consist of components that reflect the bias-variance trade-off and the exploration-exploitation trade-off, we can compare the bounds to get some idea of the underlying nature of the two strategies, $\eta_2$ and $\eta_1$ respectively. We notice that there is a trade-off in the bounds of the two strategies. While the upper bound for the estimation bias is the same for both, the bound on the estimation standard error component for the former ($\eta_2$) is smaller than for the latter ($\eta_1$), because $\pi_{q(n)} \ge \pi_n$ in the presence of delays. However, the randomization error bound for strategy $\eta_2$ (Theorem 4.1.12) could be large compared to the randomization error bound for strategy $\eta_1$ (Theorem 4.1.13), depending upon the extent of delay and the corresponding value of $M_\delta$.
If $M_\delta$ is not too large, the last term, corresponding to the fluctuation of the randomization scheme, could be about the same in both bounds ($\approx A\sqrt{(N/2)\log(1/\delta)}$). The extent to which one component (estimation error or randomization error) overpowers the other is also determined by the underlying nature of the reward generating functions and the severity of delays, as discussed in Chapter 3. It is important to note that the bounds presented are not tight, so it is hard to precisely quantify the difference in the cumulative regret of the two strategies.

4.2 Delays dependent on covariates

It can often be the case that the extent of delay in observing rewards depends on the covariates. For example, patient characteristics could play an integral role in determining when the treatment outcome for that patient is observed: treatments may take longer to show their outcomes in older patients than in younger patients. In this section, we consider scenarios where the delays depend on the covariates but are independent of the arms (treatments). That is, we assume that
\[ d_j \mid (X_j = x) \sim G_x; \quad G_x \in \mathcal{G}. \]
In other words, $P(d_j \le n \mid X_j = x) = G_x(n)$, where we assume that
\[ \mathcal{G} := \{G_x : e_n \le G_x(n)\ \ \forall\, x \in [0,1]^d\}, \tag{4.7} \]
where $e_n \in (0,1)$ is a non-decreasing sequence such that $e_n \to 1$ as $n \to \infty$. Intuitively, $e_n$ is a sequence that gives a uniform lower bound on the cdf's of the delays for all patients with covariates in the given domain $[0,1]^d$. Hence, if $d_j = t_j - s_j$,
\[ P(t_j \le n \mid X_j = x) = G_x(n - s_j) \ge e_{n - s_j}; \quad G_x \in \mathcal{G},\ j \in \mathbb{N}. \]
Next, we make assumptions on the growth of the sequence $e_n$ to ensure that we expect to observe rewards at a minimum rate that guarantees a finite-time bound on the cumulative regret.

Assumption 4.2.1. The delays $\{d_j, j \ge 1\}$ depend on the covariates but are independent of the choice of arms.

Assumption 4.2.2.
Let $\sum_{j=1}^{n} e^2_{n - s_j} = \Omega(q(n))$, where $q(n) \to \infty$ as $n \to \infty$ and $q(n)$ gives a lower bound on the uniform growth rate of the number of observed rewards by time $n$.

Under the conditions for consistency (3.5) and (3.6), one can carry out an analysis similar to that of the finite-time results of Theorems 4.1.12 and 4.1.13. The proof techniques differ in certain parts because the delays depend on the covariates; those details are discussed in Section 4.4.4. However, a similar structure of the proof can be maintained, and consequently the following results can be established, with $q(n)$ as in Assumption 4.2.2.

Theorem 4.2.3. Suppose Assumptions 4.1.1-4.1.5, 4.2.1 and 4.2.2 are satisfied and $\{\pi_n\}$ is a decreasing sequence of probabilities. Assume $N > n'_\delta$ (with $q(n)$ as in Assumption 4.2.2), with the kernel estimator as defined in (4.1) and the kernel chosen as described in (4.2). Then with probability larger than $1 - 2\delta$, the cumulative regret for $\eta_2$ satisfies
\[ R_N(\eta_2) < A n'_\delta + \sum_{n=n'_\delta+1}^{N} \left( 2\max_{1\le i\le \ell} w(Lh_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{h^d_{q(n)}\pi_{q(n)} q(n)}} \right) + A\sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell-1)\pi_t + \max\left\{ A\sqrt{\frac{E(\tau_N \mid \mathcal{F}_N) M_\delta}{2}\log\left(\frac{2}{\delta}\right)},\ A\sqrt{\frac{N}{2}\log\left(\frac{2}{\delta}\right)} \right\}, \]
where $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12\ell N^2/\delta) / (c_5^2\, \underline{c}\, (2L)^d)}$ and $M_\delta$ is a number chosen such that
\[ \left(1 - \frac{\sqrt{q(M_\delta/2)}}{M_\delta/2}\right)^{M_\delta/2} = \delta, \]
where $q(\cdot)$ is as in Assumption 4.2.2. Also note that $\mathcal{F}_N$ is the $\sigma$-field generated by $(Z^N, X^N, I^N)$.

Theorem 4.2.4. Suppose Assumptions 4.1.1-4.1.5, 4.2.1 and 4.2.2 are satisfied and $\{\pi_n\}$ is a decreasing sequence of probabilities. Assume $N > n''_\delta$ (with $q(n)$ as in Assumption 4.2.2), with the kernel estimator as defined in (4.1) and the kernel chosen as described in (4.2). Then with probability larger than $1 - 2\delta$, the cumulative regret for strategy $\eta_1$ satisfies
\[ R_N(\eta_1) < A n''_\delta + \sum_{n=n''_\delta+1}^{N} \left( 2\max_{1\le i\le \ell} w(Lh_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{h^d_{q(n)}\pi_{n} q(n)}} + A(\ell-1)\pi_n \right) + A\sqrt{\frac{N}{2}\log\left(\frac{1}{\delta}\right)}, \]
where $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12\ell N^2/\delta) / (c_5^2\, \underline{c}\, \tilde{\tilde a}_1 (2L)^d)}$.

As with Theorems 4.1.12 and 4.1.13, we gain intuitive insights by comparing the bounds of Theorems 4.2.3 and 4.2.4.
However, it is hard to precisely quantify the difference in the growth rate of the cumulative regret for the two strategies, as the bounds presented are not tight. It may be that, when the delays depend on the covariates, the bounds on the cumulative regret are more conservative than the ones in Theorems 4.1.12 and 4.1.13, because we essentially use a uniform lower bound over the whole covariate space instead of taking into account the region-wise contribution to the regret. There is scope to tighten the bounds by relaxing Assumption 4.2.2, which would require developing new technical tools for a more involved analysis.

4.3 Real data analysis

In this section, we use a benchmark dataset for multi-armed bandit problems, the Yahoo! Front Page Today Module User Click Log dataset, to evaluate the proposed allocation strategy by artificially introducing delays in observing the rewards in the data. The complete data consists of about 46 million Yahoo! front page interactions collected during the first ten days of May 2009. Each event (page visit interaction) has the following information: 1) five variables on the visitor's information; 2) a pool of about 10-14 editor-picked news articles; 3) one article displayed to the visitor; 4) the visitor's response (1 = click or 0 = no click) to the selected article. The five variables reflect the visitors' article preferences and hence act as covariates in our setup. While the pool of articles is dynamic in the original dataset, we only consider a fixed number of arms in our setup. Therefore, we consider only one day's data (May 1, 2009). We also choose four articles (article ids 109511-109514) and keep only the events where the four articles are included in the article pool and one of the four articles is displayed to the visitor. As a result, we obtain a reduced dataset consisting of 403,456 interaction events.
In order to speed up the computation, we only work with a randomly chosen subset of this data consisting of 30,000 interaction events. We also use the first three principal components of the covariates in order to tackle the curse of dimensionality.

We apply the unbiased offline evaluation method proposed by Li et al. (2010), which allows a contextual bandit algorithm to be evaluated on real datasets. If the arm proposed by the bandit algorithm matches the arm actually displayed in the data, the event is kept as a "valid" event; if the proposed arm does not match the displayed arm, the event is ignored. This process is run sequentially over all the events to generate the final "valid" dataset, which is then used to evaluate the contextual bandit algorithm through the click-through rate (CTR, the proportion of times a click is made). This works because the displayed arm is selected uniformly at random from the pool, so the final "valid" dataset is like a random sample from the underlying population. A random subsample (as in our case) of this "valid" dataset also works by the same logic.

In addition, we induce a delay mechanism in observing the rewards by forcing the time at which the response of a visitor is observed to be delayed. We consider the following delay scenarios, in increasing order of severity:

No delay: Every reward is observed instantaneously.
Delay 1: Geometric delay with probability of success (observing the reward) p = 0.3.
Delay 2: Every 5th reward is not observed by time N = 30000 and the other rewards are obtained with a geometric (p = 0.3) delay.
Delay 3: Each case has probability 0.7 of being delayed, and the delay is half-normal with scale parameter σ = 1500.
Delay 4: In this case we increase the number of non-observed rewards.
Divide the data into four equal consecutive parts (quarters) such that, in part 1, we only observe every 10th observation (with a Geom(0.3) delay) by time N and do not observe the rest; in part 2, we only observe every 15th observation; in part 3, every 20th observation; and in part 4, every 25th observation.

With the subsample of N = 30,000 events, the induced delays and the unbiased offline evaluation method, we evaluate the performance of the following algorithms.

• Random: An arm is selected uniformly at random.
• η1: The randomized allocation strategy $(\pi_n, h_{\tau_n})$ as proposed in Section 3.2 of Chapter 3.
• η2: The randomized allocation strategy $(\pi_{\tau_n}, h_{\tau_n})$ as proposed in Section 3.2 of Chapter 3.
• DeLinUCB: The algorithm proposed by Vernade et al. (2018), a delayed version of LinUCB that can handle random delayed feedback. It assumes a linear model for the rewards as a function of the covariates.

The choices of bandwidth considered are $\{h_n\} = n^{-1/4},\ n^{-1/6},\ \log^{-1} n$, and the choices of exploration probability considered are $\{\pi_n\} = n^{-1/4},\ \log^{-1} n,\ \log^{-2} n$. We only illustrate the results for $\{\pi_n\} = \log^{-2} n$ with $\{h_n\} = n^{-1/6},\ \log^{-1} n$ in Figure 4.1. Each of the algorithms listed above is run 100 times over the reduced dataset of size 30,000 with the offline evaluation method. The first 150 events are used for initialization. The resulting CTRs are divided by the mean CTR of the random algorithm, which gives the normalized CTRs. The boxplots of these normalized CTRs for delay scenarios Delay 2 to Delay 4 are shown in Figure 4.1. Note that our proposed strategies, η1 and η2, work considerably better than DeLinUCB for Delay 2, which is a moderate delay situation where every 5th reward is not observed by time N = 30000 and the other rewards are obtained with a geometric (p = 0.3) delay. We also see a somewhat better performance of our strategies, η1 and η2, for Delay 3, which is a more severe delay setting.
The performance in Delay 4, the most severe delay scenario, shows similar trends with slightly lower normalized CTR values for all three methods; however, these results seem to be highly variable (Figure 4.3), depending largely on the sample size and the number of initialization rounds. Running these algorithms for an even longer period of time (a larger subsample of the data) for Delay 4 might give more insight, as more data becomes available after going through the offline evaluation method, and might help reduce the variability of the results. Other results (see Section 4.5) show similar trends, and it is seen that a fast decaying $\{\pi_n\}$ leads to better results for strategies η1 and η2.

[Figure 4.1 panels: boxplots of normalized CTR (vertical axis from 0.7 to 1.4) for η1, η2 and DeLinUCB. First row: Delay 2, Delay 3 and Delay 4 with $\pi_n = \log(n)^{-2}$, $h_n = n^{-1/4}$. Second row: Delay 2, Delay 3 and Delay 4 with $\pi_n = \log(n)^{-2}$, $h_n = \log(n)^{-1}$.]

Figure 4.1: Boxplots of normalized CTRs for the three methods being compared. Each column represents a particular delay scenario.

4.3.1 Discussion on finite-time results

In this section, we reflect upon the real data analysis conducted in Section 4.3 and discuss some caveats and areas of improvement.

For computational feasibility, we considered a random subsample of size 30000 and ran the algorithms for 100 independent replications. It can be argued that 100 independent replications might be insufficient for reaching conclusions, as there might be significant simulation error.
In order to study that, we increased the number of replications to 200 for each algorithm on the same data subsample and made boxplots as in Figure 4.1. We saw very similar patterns (two of them displayed in Figure 4.2), suggesting that it is perhaps reasonable to conclude that strategies η1 and η2 perform better than DeLinUCB for the chosen data subsample.

[Figure 4.2 panels: boxplots of normalized CTR (vertical axis from 0.7 to 1.4) for η1, η2 and DeLinUCB, for Delay 3 and Delay 4 with $\pi_n = \log(n)^{-2}$, $h_n = \log(n)^{-1}$.]

Figure 4.2: Boxplots with 200 replications show similar patterns as Figure 4.1.

Since we only considered a random subsample of the full data, another question that arises is how the algorithms compare on the entire dataset. To explore this while keeping in mind the computational challenge of dealing with a large dataset in R, we partitioned our dataset into 30 disjoint subsets of size ≈ 13000 each. We then conducted paired t-tests to compare the mean normalized CTR of DeLinUCB with that of η1 and of η2. The results were not statistically significant for most combinations of hyperparameter sequence choices for strategies η1 and η2. We think that this could be because of the small sample size of the valid data produced by the offline evaluation method of Li et al. (2010) for around 13000 events. This led us to consider a larger random subsample of 50000 events from the entire dataset. Interestingly, we did not find a significant difference in the performance of the three algorithms for this subsample, perhaps reflecting a need for more adaptive algorithms. Another possible improvement is to use a more efficient dimension reduction technique for the covariates in order to tackle the curse of dimensionality for nonparametric regression methods. Since DeLinUCB assumes a linear model, it is possible that a different parametric model would fit the data better.
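The evaluation procedure of Section 4.3 (replay over logged events, with artificially delayed rewards) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used for the reported results: the function names are hypothetical, time is indexed by the number of valid events so far, a reward with delay `d` generated at time `now` is treated as available for decisions made after time `now + d`, and NumPy's geometric distribution (support starting at 1) stands in for the geometric delay, whose exact parametrization is not specified above.

```python
import numpy as np

def geometric_delays(rng, size, p=0.3):
    """Delay 1 style delays. Assumption: support {1, 2, ...} as in NumPy."""
    return rng.geometric(p, size=size)

def half_normal_delays(rng, size, prob=0.7, sigma=1500.0):
    """Delay 3 style delays: delayed with probability `prob`, else no delay."""
    delayed = rng.random(size) < prob
    magnitude = np.ceil(np.abs(rng.normal(0.0, sigma, size=size))).astype(int)
    return np.where(delayed, magnitude, 0)

def replay_ctr(events, policy, delays):
    """Offline replay evaluation in the spirit of Li et al. (2010).

    events: iterable of (context, displayed_arm, click).
    policy(context, seen) -> arm, where `seen` holds the (context, arm, click)
    triples whose rewards have already been observed.
    delays: one delay per logged event.
    """
    valid = []  # entries are (context, arm, click, observation_time)
    for (context, shown, click), d in zip(events, delays):
        now = len(valid) + 1
        # Only rewards observed strictly before the current time are visible.
        seen = [(c, a, y) for (c, a, y, t) in valid if t <= now - 1]
        if policy(context, seen) == shown:
            # The proposed arm matches the logged arm: keep as a valid event.
            valid.append((context, shown, click, now + int(d)))
    if not valid:
        return float("nan")
    return sum(y for (_, _, y, _) in valid) / len(valid)
```

Dividing the CTR returned for a strategy by the mean CTR of the uniformly random policy would give a normalized CTR of the kind shown in the boxplots.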
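4.4 Proofs

4.4.1 Proofs of Lemmas

Proof of Lemma 4.1.10. Recall that $Q_{n+1}(x) = \{j : 1 \le t_j \le n,\ \|x - X_j\|_\infty \le Lh_{\tau_n}\}$ and $Q_{i,n+1}(x) = \{j : j \in Q_{n+1}(x),\ I_j = i\}$. Let $M_{n+1}(x)$ and $M_{i,n+1}(x)$ be the sizes of $Q_{n+1}(x)$ and $Q_{i,n+1}(x)$, respectively. If $M_{n+1}(x) = 0$, (4.5) holds trivially. Therefore, without loss of generality, we can assume $M_{n+1}(x) > 0$. Consider the event
\[ B_{i,n} = \left\{ \frac{1}{M_{i,n+1}(x)} \sum_{j\in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right) \ge c_5 \right\}. \]
Note that
\begin{align*}
P_{X^n,A_N}\left(|\hat f_{i,n+1}(x) - f_i(x)| \ge \epsilon\right)
&= P_{X^n,A_N}\left(|\hat f_{i,n+1}(x) - f_i(x)| \ge \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} \le \frac{\pi_{\tau_n}}{2}\right) + P_{X^n,A_N}\left(|\hat f_{i,n+1}(x) - f_i(x)| \ge \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2}\right) \\
&\le P_{X^n,A_N}\left(\frac{M_{i,n+1}(x)}{M_{n+1}(x)} \le \frac{\pi_{\tau_n}}{2}\right) + P_{X^n,A_N}\left(|\hat f_{i,n+1}(x) - f_i(x)| \ge \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2}\right) \tag{4.8} \\
&\overset{a}{\le} \exp\left(-\frac{3M_{n+1}(x)\pi_{\tau_n}}{28}\right) + P_{X^n,A_N}\left(|\hat f_{i,n+1}(x) - f_i(x)| \ge \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2},\ B_{i,n}\right) \\
&\qquad + P_{X^n,A_N}\left(|\hat f_{i,n+1}(x) - f_i(x)| \ge \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2},\ B^c_{i,n}\right) \\
&=: \exp\left(-\frac{3M_{n+1}(x)\pi_{\tau_n}}{28}\right) + A_1 + A_2, \tag{4.9}
\end{align*}
where the first term in the inequality in step $a$ comes from the extended Bernstein inequality (A.2). By Assumption 4.1.5 and Definition A.0.2 of the modulus of continuity, we have
\begin{align*}
|\hat f_{i,n+1}(x) - f_i(x)|
&= \left| \frac{\sum_{j\in J_{i,n+1}} Y_{i,j} K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j\in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} - f_i(x) \right|
= \left| \frac{\sum_{j\in J_{i,n+1}} (f_i(X_j) + \epsilon_j) K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j\in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} - f_i(x) \right| \\
&= \left| \frac{\sum_{j\in J_{i,n+1}} (f_i(X_j) - f_i(x)) K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j\in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} + \frac{\sum_{j\in J_{i,n+1}} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j\in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} \right| \\
&\le \sup_{x,y:\ \|x - y\|_\infty \le Lh_{\tau_n}} |f_i(x) - f_i(y)| + \left| \frac{\sum_{j\in J_{i,n+1}} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j\in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} \right| \\
&= w(Lh_{\tau_n}; f_i) + \left| \frac{\sum_{j\in J_{i,n+1}} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j\in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} \right|. \tag{4.10}
\end{align*}
Under $B_{i,n}$,
\[ |\hat f_{i,n+1}(x) - f_i(x)| \le w(Lh_{\tau_n}; f_i) + \frac{1}{c_5 M_{i,n+1}(x)} \left| \sum_{j\in Q_{i,n+1}(x)} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right) \right|. \]
Using this, we will construct an upper bound for $A_1$. Define $\sigma_t = \inf\{\tilde n : \sum_{j=1}^{\tilde n} I\{I_j = i,\ t_j \le n \text{ and } \|x - X_j\|_\infty \le Lh_{\tau_n}\} \ge t\}$, $t \ge 1$.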
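Then, for large enough $n$, by Lemma 4.1.9, $\epsilon > w(Lh_{\tau_n}; f_i)$ a.s., and we have
\begin{align*}
A_1 &\le P_{X^n,A_N}\left( \left| \sum_{j\in Q_{i,n+1}(x)} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right) \right| \ge c_5 M_{i,n+1}(x)(\epsilon - w(Lh_{\tau_n}; f_i)),\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2} \right) \tag{4.11} \\
&\le \sum_{\bar n = 0}^{n} P_{X^n,A_N}\left( \left| \sum_{t=1}^{\bar n} \epsilon_{\sigma_t} K\left(\frac{x - X_{\sigma_t}}{h_{\tau_n}}\right) \right| \ge c_5 \bar n(\epsilon - w(Lh_{\tau_n}; f_i)),\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2},\ M_{i,n+1}(x) = \bar n \right) \\
&\le \sum_{\bar n = \lceil M_{n+1}(x)\pi_{\tau_n}/2 \rceil}^{n} P_{X^n,A_N}\left( \left| \sum_{t=1}^{\bar n} \epsilon_{\sigma_t} K\left(\frac{x - X_{\sigma_t}}{h_{\tau_n}}\right) \right| \ge c_5 \bar n(\epsilon - w(Lh_{\tau_n}; f_i)) \right) \\
&\le \sum_{\bar n = \lceil M_{n+1}(x)\pi_{\tau_n}/2 \rceil}^{n} 2\exp\left( -\frac{\bar n c_5^2 (\epsilon - w(Lh_{\tau_n}; f_i))^2}{2c_4^2 v^2 + 2c_4 c(\epsilon - w(Lh_{\tau_n}; f_i))} \right) \\
&\le 2N \exp\left( -\frac{c_5^2 M_{n+1}(x)\pi_{\tau_n}(\epsilon - w(Lh_{\tau_n}; f_i))^2}{4c_4^2 v^2 + 4c_4 c(\epsilon - w(Lh_{\tau_n}; f_i))} \right), \tag{4.12}
\end{align*}
where the last inequality follows from Lemma A.1.7 and the upper boundedness of the kernel function.

Now, to find the bound for $A_2$: under $B^c_{i,n}$ we run into technical problems since the denominator of the Nadaraya-Watson estimator can be extremely small, hence we replace the kernel $K(\cdot)$ in (4.1) with the uniform kernel $I(\|u\|_\infty \le L)$. That is, for the case when
\[ B^c_{i,n} := \left\{ \sum_{j\in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right) < c_5 \sum_{j\in J_{i,n+1}} I(\|x - X_j\|_\infty \le Lh_{\tau_n}) \right\}, \tag{4.13} \]
for some small positive constant $0 < c_5 < 1$, we use the uniform kernel. Therefore, using (4.10), (4.13) and Lemma A.1.7, we get
\begin{align*}
A_2 &\le P_{X^n,A_N}\left( \left| \sum_{j\in J_{i,n+1}} \epsilon_j I(\|x - X_j\|_\infty \le Lh_{\tau_n}) \right| \ge M_{i,n+1}(x)(\epsilon - w(Lh_{\tau_n}; f_i)),\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2} \right) \tag{4.14} \\
&\le \sum_{\bar n = 0}^{n} P_{X^n,A_N}\left( \left| \sum_{t=1}^{\bar n} \epsilon_{\sigma_t} I(\|x - X_{\sigma_t}\|_\infty \le Lh_{\tau_n}) \right| \ge \bar n(\epsilon - w(Lh_{\tau_n}; f_i)),\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2},\ M_{i,n+1}(x) = \bar n \right) \\
&\le \sum_{\bar n = \lceil M_{n+1}(x)\pi_{\tau_n}/2 \rceil}^{n} P_{X^n,A_N}\left( \left| \sum_{t=1}^{\bar n} \epsilon_{\sigma_t} I(\|x - X_{\sigma_t}\|_\infty \le Lh_{\tau_n}) \right| \ge \bar n(\epsilon - w(Lh_{\tau_n}; f_i)) \right) \\
&\le \sum_{\bar n = \lceil M_{n+1}(x)\pi_{\tau_n}/2 \rceil}^{n} 2\exp\left( -\frac{\bar n(\epsilon - w(Lh_{\tau_n}; f_i))^2}{2v^2 + 2c(\epsilon - w(Lh_{\tau_n}; f_i))} \right) \\
&\le 2N \exp\left( -\frac{M_{n+1}(x)\pi_{\tau_n}(\epsilon - w(Lh_{\tau_n}; f_i))^2}{4v^2 + 4c(\epsilon - w(Lh_{\tau_n}; f_i))} \right). \tag{4.15}
\end{align*}
Therefore, using the fact that $0 < c_5 \le 1 \le c_4$, together with (4.12) and (4.15) in (4.9), we get
\[ P_{X^n,A_N}\left(|\hat f_{i,n+1}(x) - f_i(x)| \ge \epsilon\right) \le \exp\left(-\frac{3M_{n+1}(x)\pi_{\tau_n}}{28}\right) + 4N\exp\left( -\frac{c_5^2 M_{n+1}(x)\pi_{\tau_n}(\epsilon - w(Lh_{\tau_n}; f_i))^2}{4c_4^2 v^2 + 4c_4 c(\epsilon - w(Lh_{\tau_n}; f_i))} \right). \tag{4.16} \]
The proof of Lemma 4.1.11 follows the same steps with $\pi_{\tau_n}$ replaced by $\pi_n$.

Next, we prove a lemma that is used in the proof of Theorem 4.1.12.

Lemma 4.4.1. An $\epsilon$ that satisfies
\[ 4N\exp\left( -\frac{c_5^2\, \underline{c}\, \tilde a_1 (2Lh_{q(n)})^d \pi_{q(n)} q(n)(\epsilon - w(Lh_{q(n)}; f_i))^2}{16c_4^2 v^2 + 16c_4 c(\epsilon - w(Lh_{q(n)}; f_i))} \right) \le \frac{\delta}{4\ell N} \tag{4.17} \]
is given by
\[ \tilde\epsilon_{i,n} = w(Lh_{q(n)}; f_i) + \sqrt{\frac{64c_4^2 v^2 \log(16\ell N^2/\delta)}{c_5^2\, \underline{c}\, \tilde a_1 (2L)^d h^d_{q(n)}\pi_{q(n)} q(n)}}. \]

Proof of Lemma 4.4.1. Let $Z := \epsilon - w(Lh_{q(n)}; f_i)$; then (4.17) becomes
\[ \frac{c_5^2\, \underline{c}\, \tilde a_1 (2Lh_{q(n)})^d \pi_{q(n)} q(n) Z^2}{16c_4^2 v^2 + 16c_4 c Z} \ge \log\left(\frac{16\ell N^2}{\delta}\right). \]
Let $A_1 = c_5^2\, \underline{c}\, \tilde a_1 (2L)^d$, $A_2 = 16c_4^2 v^2$ and $A_3 = 16c_4 c$. Then the inequality reads
\[ A_1 q(n) h^d_{q(n)}\pi_{q(n)} Z^2 - A_3 \log\left(\frac{16\ell N^2}{\delta}\right) Z - A_2 \log\left(\frac{16\ell N^2}{\delta}\right) \ge 0. \tag{4.18} \]
The left hand side is a quadratic polynomial in $Z$. Solving for $Z$,
\[ A_1 q(n) h^d_{q(n)}\pi_{q(n)} Z^2 - A_3 \log\left(\frac{16\ell N^2}{\delta}\right) Z - A_2 \log\left(\frac{16\ell N^2}{\delta}\right) = 0 \]
\[ \Rightarrow\ Z = \frac{1}{2}\left( \frac{A_3 \log(16\ell N^2/\delta)}{A_1 q(n) h^d_{q(n)}\pi_{q(n)}} \pm \sqrt{ \frac{A_3^2 \log^2(16\ell N^2/\delta)}{(A_1 q(n) h^d_{q(n)}\pi_{q(n)})^2} + \frac{4A_2 \log(16\ell N^2/\delta)}{A_1 q(n) h^d_{q(n)}\pi_{q(n)}} } \right). \]
This gives two real roots for the quadratic equation. Therefore, if we want some value of $Z$ such that (4.18) holds, we can use a point that is larger than the roots $-b \pm \sqrt{b^2 + d^2}$, and we know that $d \ge -b \pm \sqrt{b^2 + d^2}$. Therefore, a potential candidate could be
\[ Z = \sqrt{\frac{4A_2 \log(16\ell N^2/\delta)}{A_1 q(n) h^d_{q(n)}\pi_{q(n)}}} = \sqrt{\frac{64c_4^2 v^2 \log(16\ell N^2/\delta)}{c_5^2\, \underline{c}\, \tilde a_1 (2L)^d h^d_{q(n)}\pi_{q(n)} q(n)}}, \]
which means that we want
\[ \tilde\epsilon_{i,n} = w(Lh_{q(n)}; f_i) + \sqrt{\frac{64c_4^2 v^2 \log(16\ell N^2/\delta)}{c_5^2\, \underline{c}\, \tilde a_1 (2L)^d h^d_{q(n)}\pi_{q(n)} q(n)}}. \]

A similar lemma, with $\pi_{q(n)}$ replaced by $\pi_n$, can be derived and is used in the proof of Theorem 4.1.13.

Lemma 4.4.2. An $\epsilon$ that satisfies
\[ 4N\exp\left( -\frac{c_5^2\, \underline{c}\, \tilde{\tilde a}_1 (2Lh_{q(n)})^d \pi_{n} q(n)(\epsilon - w(Lh_{q(n)}; f_i))^2}{16c_4^2 v^2 + 16c_4 c(\epsilon - w(Lh_{q(n)}; f_i))} \right) \le \frac{\delta}{4\ell N} \tag{4.19} \]
is given by
\[ \tilde\epsilon'_{i,n} = w(Lh_{q(n)}; f_i) + \sqrt{\frac{64c_4^2 v^2 \log(16\ell N^2/\delta)}{c_5^2\, \underline{c}\, \tilde{\tilde a}_1 (2L)^d h^d_{q(n)}\pi_{n} q(n)}}. \]

4.4.2 Proofs of Theorems

Proof of Theorem 4.1.12.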
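By the definition of $\hat i_j$, $\hat f_{i^*(X_j),j}(X_j) \le \hat f_{\hat i_j,j}(X_j)$, so the regret accumulated after the initial forced sampling period is
\begin{align*}
\sum_{j=m_0+1}^{N} (f^*(X_j) - f_{I_j}(X_j))
&= \sum_{j=m_0+1}^{N} \left( f_{i^*(X_j)}(X_j) - \hat f_{i^*(X_j),j}(X_j) + \hat f_{i^*(X_j),j}(X_j) - f_{\hat i_j}(X_j) + f_{\hat i_j}(X_j) - f_{I_j}(X_j) \right) \\
&\le \sum_{j=m_0+1}^{N} \left( f_{i^*(X_j)}(X_j) - \hat f_{i^*(X_j),j}(X_j) + \hat f_{\hat i_j,j}(X_j) - f_{\hat i_j}(X_j) + f_{\hat i_j}(X_j) - f_{I_j}(X_j) \right) \\
&\le \sum_{j=m_0+1}^{N} \left( 2\sup_{1\le i\le \ell} |\hat f_{i,j}(X_j) - f_i(X_j)| + A\, I\{I_j \ne \hat i_j\} \right). \tag{4.20}
\end{align*}
Here the first term corresponds to the regret incurred due to estimation error and the second term corresponds to the randomization error.

We first find an upper bound for the estimation error. Note that Lemma 4.1.10 gives a probability inequality for the estimation error conditional on $A_N$ and $X^n$. Therefore, in order to get an unconditional probability bound on the estimation error, we first remove the conditioning on $X^n$ and then remove the conditioning on $A_N$ in (4.5). Given arm $i$, for a large enough $n$ satisfying $n \ge m_0 + 1$ and $\epsilon > w(Lh_{\tau_n}; f_i)$ a.s., consider
\begin{align*}
P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon\right)
&= P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ M_{n+1}(X_{n+1}) \le \frac{\underline{c}(2Lh_{\tau_n})^d \tau_n}{2}\right) \\
&\qquad + P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ M_{n+1}(X_{n+1}) > \frac{\underline{c}(2Lh_{\tau_n})^d \tau_n}{2}\right) \tag{4.21} \\
&\le P_{A_N}\left(M_{n+1}(X_{n+1}) \le \frac{\underline{c}(2Lh_{\tau_n})^d \tau_n}{2}\right) + P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ M_{n+1}(X_{n+1}) > \frac{\underline{c}(2Lh_{\tau_n})^d \tau_n}{2}\right) \\
&\le \exp\left(-\frac{3\underline{c}(2Lh_{\tau_n})^d \tau_n}{28}\right) + \exp\left(-\frac{3\underline{c}(2Lh_{\tau_n})^d \tau_n \pi_{\tau_n}}{56}\right) \\
&\qquad + 4N\exp\left( -\frac{c_5^2\, \underline{c}\,(2Lh_{\tau_n})^d \tau_n \pi_{\tau_n}(\epsilon - w(Lh_{\tau_n}; f_i))^2}{8c_4^2 v^2 + 8c_4 c(\epsilon - w(Lh_{\tau_n}; f_i))} \right), \tag{4.22}
\end{align*}
where the last inequality follows from Lemma 4.1.10 and (A.2), and the fact that $E(M_{n+1}(X_{n+1}) \mid A_N) \ge \underline{c}(2Lh_{\tau_n})^d \tau_n$. Now, we want to remove the conditioning on $A_N$. Recall that $d_j \overset{\text{ind}}{\sim} G_j$, for $j \ge 1$.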
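Therefore, for the known visiting times $\{s_j, j \ge 1\}$, $P(t_j \le n) = P(d_j + s_j \le n) = P(d_j \le n - s_j) = G_j(n - s_j)$; hence
\begin{align*}
P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon\right)
&= P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ \tau_n \le \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\right) + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ \tau_n > \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\right) \\
&\le P\left(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\right) + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ \tau_n > \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\right) \\
&\le P\left(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\right) + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ \tau_n > \frac{a_1 q(n)}{2}\right) \\
&= P\left(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\right) + E\left[ P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ \tau_n > \frac{a_1 q(n)}{2}\right) \right], \tag{4.23}
\end{align*}
for large enough $n$, where $a_1$ is a positive constant arising from Assumption 4.1.7. The second term in the last equality of (4.23) is due to the law of iterated expectation. Let $q_1(n) = q(n)/2$. For $\tau_n > a_1 q_1(n)$, since we have the condition that $h^d_{q(n)}\pi_{q(n)} q(n)/\log n \to \infty$, for large enough $n$ we can assume that $h^d_{\tau_n}\tau_n \ge \tilde a_1 h^d_{q_1(n)} q_1(n)$ and $h^d_{\tau_n}\pi_{\tau_n}\tau_n \ge \tilde a_1 h^d_{q_1(n)}\pi_{q_1(n)} q_1(n)$, where $\tilde a_1$ is a constant that is a function of the constant $a_1$ and depends on the user-determined choice of the sequences $\{\pi_n\}$ and $\{h_n\}$. For large enough $n$, $\epsilon - w(Lh_{q(n)}; f_i) > 0$, and using (4.45) and (A.2) in (4.23) we have
\begin{align*}
P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon\right)
&\le \exp\left(-\frac{3a_1 q_1(n)}{14}\right) + \exp\left(-\frac{3\underline{c}\,\tilde a_1 (2Lh_{q_1(n)})^d q_1(n)}{28}\right) + \exp\left(-\frac{3\underline{c}\,\tilde a_1 (2Lh_{q_1(n)})^d q_1(n)\pi_{q_1(n)}}{56}\right) \\
&\qquad + 4N\exp\left( -\frac{c_5^2\, \underline{c}\, \tilde a_1 (2Lh_{q_1(n)})^d q_1(n)\pi_{q_1(n)}(\epsilon - w(Lh_{q_1(n)}; f_i))^2}{8c_4^2 v^2 + 8c_4 c(\epsilon - w(Lh_{q_1(n)}; f_i))} \right) \\
&\le \exp\left(-\frac{3a_1 q(n)}{28}\right) + \exp\left(-\frac{3\underline{c}\,\tilde a_1 (2Lh_{q(n)})^d q(n)}{56}\right) + \exp\left(-\frac{3\underline{c}\,\tilde a_1 (2Lh_{q(n)})^d q(n)\pi_{q(n)}}{112}\right) \\
&\qquad + 4N\exp\left( -\frac{c_5^2\, \underline{c}\, \tilde a_1 (2Lh_{q(n)})^d q(n)\pi_{q(n)}(\epsilon - w(Lh_{q(n)}; f_i))^2}{16c_4^2 v^2 + 16c_4 c(\epsilon - w(Lh_{q(n)}; f_i))} \right). \tag{4.24}
\end{align*}
Given $0 < \delta < 1$, we want to bound the right hand side above by $\delta$. To do that for the first three terms, given the total time horizon $N$, we define a special time point $n'_\delta$ by
\[ n'_\delta = \min\left\{ n > m_0 : \exp\left( -\frac{3\underline{c}\,\tilde a_1 (2Lh_{q(n)})^d \pi_{q(n)} q(n)}{112} \right) \le \frac{\delta}{4\ell N} \right\}. \tag{4.25} \]
Under the condition that $h^d_{q(n)}\pi_{q(n)} q(n)/\log(n) \to \infty$ as $n \to \infty$, we have $n'_\delta/N \to 0$ as $N \to \infty$. Therefore, if the total time horizon is long enough, we have $N > n'_\delta$.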
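For a given problem, the time point $n'_\delta$ in (4.25) can be located numerically by scanning $n$ upward until the exponent exceeds $\log(4\ell N/\delta)$. The sketch below is only an illustration: the function names are hypothetical, and the default values used for $\underline{c}$, $\tilde a_1$ and $L$ are placeholders, since these constants are not known in practice.

```python
import math

def exponent(n, d, q, h, pi, c_low=1.0, a_tilde=1.0, L=1.0):
    """Exponent 3 c a (2 L h_{q(n)})^d pi_{q(n)} q(n) / 112 appearing in (4.25)."""
    qn = q(n)
    return 3 * c_low * a_tilde * (2 * L * h(qn)) ** d * pi(qn) * qn / 112

def n_prime_delta(N, delta, ell, d, m0, q, h, pi,
                  c_low=1.0, a_tilde=1.0, L=1.0, n_max=10**7):
    """Smallest n > m0 with exp(-exponent(n)) <= delta / (4 ell N), if any.

    q, h and pi are callables for q(n), h_m and pi_m. The constants c_low,
    a_tilde and L are placeholder assumptions for this sketch.
    """
    # exp(-E) <= delta / (4 ell N) is equivalent to E >= log(4 ell N / delta).
    threshold = math.log(4 * ell * N / delta)
    for n in range(m0 + 1, n_max + 1):
        if exponent(n, d, q, h, pi, c_low, a_tilde, L) >= threshold:
            return n
    return None  # not reached within n_max
```

For instance, with `q = lambda n: n`, `h = lambda m: m ** (-1 / 6)` and `pi = lambda m: m ** (-1 / 4)`, the returned value can be compared with $N$ to check whether the requirement $N > n'_\delta$ is plausible for the horizon at hand. Slower growth of `q`, corresponding to heavier delays, pushes the returned value upward.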
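For the fourth term in the right hand side of (4.24), we want to choose an $\epsilon$ such that
\[ 4N\exp\left( -\frac{c_5^2\, \underline{c}\, \tilde a_1 (2Lh_{q(n)})^d \pi_{q(n)} q(n)(\epsilon - w(Lh_{q(n)}; f_i))^2}{16c_4^2 v^2 + 16c_4 c(\epsilon - w(Lh_{q(n)}; f_i))} \right) \le \frac{\delta}{4\ell N}. \]
One such value of $\epsilon$, as shown in Lemma 4.4.1, is given by
\[ \tilde\epsilon_{i,n} = w(Lh_{q(n)}; f_i) + \sqrt{\frac{64c_4^2 v^2 \log(16\ell N^2/\delta)}{c_5^2\, \underline{c}\, \tilde a_1 (2L)^d h^d_{q(n)}\pi_{q(n)} q(n)}}. \tag{4.26} \]
By (4.24), (4.25) and (4.26), for $n \ge n'_\delta$, we have
\[ P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \tilde\epsilon_{i,n}\right) \le \frac{\delta}{4\ell N} + \frac{\delta}{4\ell N} + \frac{\delta}{4\ell N} + \frac{\delta}{4\ell N} = \frac{\delta}{\ell N}, \]
which implies that
\[ P\left( \sum_{n=n'_\delta+1}^{N} 2\sup_{1\le i\le \ell} |\hat f_{i,n}(X_n) - f_i(X_n)| \ge \sum_{n=n'_\delta+1}^{N} 2\max_{1\le i\le \ell} \tilde\epsilon_{i,n-1} \right) \le \delta. \tag{4.27} \]

Now we want to get a bound for the randomization error. Let $\sigma_t = \min\{\bar n : \sum_{j=n'_\delta+1}^{\bar n} I(t_j \le N) \ge t\}$, for $t \in \mathbb{Z}$. For the cases in which the rewards are observed by time $N$, that is, for all instances $\sigma_t$, $t \in \mathbb{Z}$, we update only when a new reward is observed, that is, at every $\sigma_t$, $t \ge 1$. Between the time points corresponding to two consecutive reward observations, $\{\pi_t\}$ keeps the value it had at the previous observed case. In other words, the value $(\ell-1)\pi_t$ of the exploration probability is used $\sigma_{t+1} - \sigma_t$ times for each $t$; hence
\[ \sum_{n=n'_\delta+1}^{N} P(I_n \ne \hat i_n) = \sum_{n=n'_\delta+1}^{N} (\ell-1)\pi_{\tau_n} = \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell-1)\pi_t, \]
where, without loss of generality, we assume that $\sigma_{\tau_N+1} = N$.

Given $\epsilon > 0$ and the set of observed indices by time $N$, $A_N$, we have by Bernstein's inequality that
\[ P_{A_N,X^N}\left( A\left[ \sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell-1)\pi_t \right] \ge \epsilon \right) \le \exp\left( -\frac{\epsilon^2}{2A^2\left( \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell-1)\pi_t[1 - (\ell-1)\pi_t] + \epsilon/3 \right)} \right). \tag{4.28} \]
Next, for some positive constant $M > 0$, we study the event $B_t := \{\sigma_{t+1} - \sigma_t > M\}$ for $t \ge 1$. Note that the event $B_t$ is contained in the event that the first $M/2$ cases in $[\sigma_t, \sigma_{t+1}]$ are delayed by more than $M/2$, that is,
\[ \{\sigma_{t+1} - \sigma_t > M\} \subset \left\{ d_{\sigma_t+1} > \frac{M}{2}, \ldots, d_{\sigma_t+M/2} > \frac{M}{2} \right\}. \]
Therefore, using this fact and the independence of the delays, we have
\begin{align*}
P(\sigma_{t+1} - \sigma_t > M) &\le P\left( d_{\sigma_t+1} > \frac{M}{2}, \ldots, d_{\sigma_t+M/2} > \frac{M}{2} \right) \le \prod_{s=1}^{M/2} P\left( d_{\sigma_t+s} > \frac{M}{2} \right) = \prod_{s=1}^{M/2} \left( 1 - G_{d_{\sigma_t+s}}\left(\frac{M}{2}\right) \right) \tag{4.29} \\
&\le \left( \frac{M/2 - \sum_{s=1}^{M/2} G_{d_{\sigma_t+s}}(M/2)}{M/2} \right)^{M/2} \le \left( 1 - \frac{a_1 q(M/2)}{M/2} \right)^{M/2}, \quad \text{for all } t = 1, \ldots, \tau_N, \tag{4.30}
\end{align*}
where the second to last inequality comes from the AM-GM inequality and the last inequality follows from Assumption 4.1.7 and the fact that $q(M/2) \le M/2$ for all $M$, by construction. We see that the above upper bound decays at an exponential rate as $M$ grows. As the right hand side above is free of $t$ (by independence of the delays), we have
\[ P\left( \max_t (\sigma_{t+1} - \sigma_t) \ge M \right) \le \left( 1 - \frac{a_1 q(M/2)}{M/2} \right)^{M/2}. \]
We can choose $M$ such that, for a given $\delta$,
\[ \left( 1 - \frac{a_1 q(M/2)}{M/2} \right)^{M/2} = \delta. \tag{4.31} \]
Since $q$ is known a priori and $a_1$ is a positive constant, we can solve for $M$ in the above equation. Since $M$ then depends on $\delta$, we denote it by $M_\delta$. Depending on what $q$ is for a given problem, we will always be able to find a corresponding $M_\delta$. Also note that, using Hoeffding's inequality (A.1.1), we have
\[ P\left( \tau_N \ge E(\tau_N) + \frac{\epsilon}{A} \right) \le \exp\left( -\frac{2\epsilon^2}{A^2 N} \right). \tag{4.32} \]
We can choose $\epsilon_1(N,\delta) = \sqrt{(N/2)\log(1/\delta)}$ such that this probability is at most $\delta$, that is,
\[ P\left( \tau_N \ge E(\tau_N) + \epsilon_1(N,\delta) \right) \le \delta. \]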
(4.33) 103 Now consider, P A  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥   = P  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥  A ,max t (σt+1 − σt) ≥Mδ  + P  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥  A ,max t (σt+1 − σt) < Mδ  ≤ P ( max t (σt+1 − σt) ≥Mδ ) + P  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥  A , max t (σt+1 − σt) < Mδ, τN ≥ E(τN ) +  A ) + P  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥  A , max t (σt+1 − σt) < Mδ, τN < E(τN ) +  A ) ≤ P ( max t (σt+1 − σt) ≥Mδ ) + P ( τN ≥ E(τN ) +  A ) + P  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥  A ,max t (σt+1 − σt) < Mδ, τN < E(τN ) +  A ) ≤ δ + exp ( − 2 2 A2N ) + E [ PAN ,XN  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥  A , max t (σt+1 − σt) < Mδ, τN < E(τN ) +  A )] , (4.34) where the first term follows from the (4.30) and the definition of Mδ, the second term from (4.32) and last inequality follows from law of iterated expectation. 104 Then using (4.28) we have that, PAN ,XN A  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥ , max t (σt+1 − σt) < Mδ, τN < E(τN ) +  A ) ≤  exp ( −  2 2A2Mδ(E(τN ) + )/4 + /3 ) , if maxt(σt+1 − σt) < Mδ, τN < E(τN ) + /A; 0, otherwise. Using this in (4.34), we get, EPAN ,XN A  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥ , max t (σt+1 − σt) ≤Mδ, τN < E(τN ) + /A ) (4.35) ≤ exp ( −  2 2A2Mδ(E(τN ) + )/4 + /3 ) . (4.36) Therefore, combining (4.34) and (4.36), we get that with probability at least 1-δ, P A  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥   ≤ δ + exp ( − 2 2 A2N ) + exp ( −  2 2A2Mδ(E(τN ) + )/4 + /3 ) . In order to bound the right hand side by 2δ, let, N,δ = max { A √ Mδ E(τN ) 2 log( 2 δ ), A √ (N/2) log (2/δ) } . 
For this chosen , we have that, P A  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥ N,δ  ≤ 2δ ⇒ P A N∑ n=n′δ+1 I(In 6= iˆn) ≥ A τN∑ t=1 (σt+1 − σt)(`− 1)pit + N,δ  ≤ 2δ. (4.37) 105 Note that, P A  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥ N,δ  ≥ P A  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥ N,δ, τN ≤ E(τN ) + 1(N, δ),max t (σt+1 − σt) ≤Mδ ) = P ( A ( N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit ) ≥ N,δ ∣∣∣τN ≤ E(τN ) + 1(N, δ), max t (σt+1 − σt) ≤Mδ ) P ( τN ≤ E(τN ) + 1(N, δ) ) P ( max t (σt+1 − σt) ≤Mδ ) ≥ P A  N∑ n=n′δ+1 I(In 6= iˆn)− E(τN )+1(N,δ)∑ t=1 Mδ(`− 1)pit  ≥ N,δ  (1− δ)2, (4.38) where the last inequality follows from (4.33) and (4.31). From Assumption 4.1.8, we also have that E(τN ) + 1(N, δ) < N , so the above statement is meaningful. Now, from (4.37) and (4.38), we get, P A  N∑ n=n′δ+1 I(In 6= iˆn)− E(τN )+1(N,δ)∑ t=1 Mδ(`− 1)pit  ≥ N,δ  (1− δ)2 ≤ P A  N∑ n=n′δ+1 I(In 6= iˆn)− τN∑ t=1 (σt+1 − σt)(`− 1)pit  ≥ N,δ  ≤ 2δ ⇒ P A  N∑ n=n′δ+1 I(In 6= iˆn)− E(τN )+1(N,δ)∑ t=1 Mδ(`− 1)pit  ≥ N,δ  ≤ 2δ (1− δ)2 (4.39) From (4.27) and (4.39), we get that with probability at least 1 − 2δ/(1 − δ)2, the cumulative regret for strategy η2 satisfies, RN (η2) < An ′ δ + N∑ n=n′δ+1 2 ( max 1≤i≤` w(Lhq(n); fi) + √ 64c24v 2 log(12`N2/δ) c25c(2L) dhdq(n)piq(n)q(n) ) +A N∗(δ)∑ t=1 Mδ(`− 1)pit + max { A √ Mδ E(τN ) 2 log (2 δ ) , A √(N 2 ) log (2 δ )} , 106 for N∗(δ) = E(τN ) + 1(N, δ). Let δ < 1/4 and we get the desired result. 4.4.3 Proof of Theorem 4.1.13 Proof of Theorem 4.1.13. Similar to Theorem 4.1.12, we will first find a lower bound for the estimation error. In order to do so, in (4.1.11), we first remove conditioning on Xn and then remove the conditioning on AN . 
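Given arm $i$, $n \ge m_0 + 1$ and $\epsilon > w(Lh_{\tau_n}; f_i)$ a.s., consider,
\[
\begin{aligned}
P_{A_N}\big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon\big)
&\le P_{A_N}\Big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ M_{n+1}(X_{n+1}) \le \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\Big) \\
&\quad + P_{A_N}\Big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ M_{n+1}(X_{n+1}) > \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\Big) \\
&\le P_{A_N}\Big(M_{n+1}(X_{n+1}) \le \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\Big) \\
&\quad + P_{A_N}\Big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ M_{n+1}(X_{n+1}) > \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\Big) \\
&\le \exp\Big(-\frac{3c(2Lh_{\tau_n})^d \tau_n}{28}\Big) + \exp\Big(-\frac{3c(2Lh_{\tau_n})^d \tau_n \pi_n}{56}\Big) \\
&\quad + 4N \exp\Big(-\frac{c_5^2 c (2Lh_{\tau_n})^d \tau_n \pi_n (\epsilon - w(Lh_{\tau_n}; f_i))^2}{8 c_4^2 v^2 + 8 c_4 c (\epsilon - w(Lh_{\tau_n}; f_i))}\Big),
\end{aligned}
\]
where the last inequality follows from Lemma 4.1.11 and (A.2).

Now, we want to remove the conditioning on $A_N$. Recall that $d_j \overset{ind}{\sim} G_j$, for $j \ge 1$. Therefore, for the known visiting times $\{s_j, j \ge 1\}$,
\[ P(t_j \le n) = P(d_j + s_j \le n) = P(d_j \le n - s_j) = G_j(n - s_j), \]
and hence,
\[
\begin{aligned}
P\big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon\big)
&= P\Big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ \tau_n \le \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\Big) \\
&\quad + P\Big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ \tau_n > \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\Big) \\
&\le P\Big(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\Big) + P\Big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ \tau_n > \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\Big) \\
&\le P\Big(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2}\Big) + E P_{A_N}\Big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon,\ \tau_n > \frac{a_1 q(n)}{2}\Big),
\end{aligned}
\]
where $\sum_{j=1}^{n} G_j(n - s_j) = \Omega(q(n))$ from Assumption 4.1.7, that is, for large enough $n$ we have $\sum_{j=1}^{n} G_j(n - s_j) \ge a_1 q(n)$ for some positive constant $a_1$.

Let $q_1(n) = a_1 q(n)/2$. For $\tau_n > q_1(n)$, since we have the condition that $h^d_{q(n)} \pi_n q(n) / \log n \to \infty$, for large enough $n$ we can assume that $h^d_{\tau_n} \tau_n \ge \tilde{\tilde a}_1 h^d_{q_1(n)} q_1(n)$ and $h^d_{\tau_n} \pi_n \tau_n \ge \tilde{\tilde a}_1 h^d_{q_1(n)} \pi_n q_1(n)$, where $\tilde{\tilde a}_1$ is a positive constant depending on $a_1$ and the choice of the hyperparameter sequences $\{h_n\}$ and $\{\pi_n\}$. For large enough $n$, we have that $\epsilon - w(Lh_{q(n)}; f_i) > 0$, and therefore,
\[
\begin{aligned}
P\big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \epsilon\big)
&\le \exp\Big(-\frac{3 q_1(n)}{14}\Big) + \exp\Big(-\frac{3c(2Lh_{q_1(n)})^d q_1(n)}{28}\Big) + \exp\Big(-\frac{3c(2Lh_{q_1(n)})^d q_1(n) \pi_n}{56}\Big) \\
&\quad + 4N \exp\Big(-\frac{c_5^2 c (2Lh_{q_1(n)})^d q_1(n) \pi_n (\epsilon - w(Lh_{q_1(n)}; f_i))^2}{8 c_4^2 v^2 + 8 c_4 c (\epsilon - w(Lh_{q_1(n)}; f_i))}\Big) \\
&\le \exp\Big(-\frac{3 a_1 q(n)}{28}\Big) + \exp\Big(-\frac{3 c \tilde{\tilde a}_1 (2Lh_{q(n)})^d q(n)}{56}\Big) + \exp\Big(-\frac{3 c \tilde{\tilde a}_1 (2Lh_{q(n)})^d q(n) \pi_n}{112}\Big) \\
&\quad + 4N \exp\Big(-\frac{c_5^2 c \tilde{\tilde a}_1 (2Lh_{q(n)})^d q(n) \pi_n (\epsilon - w(Lh_{q(n)}; f_i))^2}{16 c_4^2 v^2 + 16 c_4 c (\epsilon - w(Lh_{q(n)}; f_i))}\Big).
\end{aligned}
\tag{4.40}
\]
Given $0 < \delta < 1$, we want to bound the right hand side above by $\delta$. To do that for the first three terms, given the total time horizon $N$, we define a special time point $n''_\delta$ by
\[ n''_\delta = \min\Big\{ n > m_0 : \exp\Big(-\frac{3 c \tilde{\tilde a}_1 (2Lh_{q(n)})^d \pi_n q(n)}{112}\Big) \le \frac{\delta}{4 \ell N} \Big\}. \tag{4.41} \]
Under the condition that $h^d_{q(n)} \pi_n q(n) / \log(n) \to \infty$ as $n \to \infty$, we have $n''_\delta / N \to 0$ as $N \to \infty$. Therefore, if the total time horizon is long enough, we have $N > n''_\delta$.

For the fourth term on the right hand side of (4.40), we want to choose an $\epsilon$ such that,
\[ 4N \exp\Big(-\frac{c_5^2 c \tilde{\tilde a}_1 (2Lh_{q(n)})^d \pi_n q(n) (\epsilon - w(Lh_{q(n)}; f_i))^2}{16 c_4^2 v^2 + 16 c_4 c (\epsilon - w(Lh_{q(n)}; f_i))}\Big) \le \frac{\delta}{4 \ell N}. \]
One such value for $\epsilon$ is given by,
\[ \tilde\epsilon'_{i,n} = w(Lh_{q(n)}; f_i) + \sqrt{\frac{64 c_4^2 v^2 \log(16 \ell N^2 / \delta)}{c_5^2 c \tilde{\tilde a}_1 (2L)^d h^d_{q(n)} \pi_n q(n)}}. \tag{4.42} \]
By (4.40), (4.41) and (4.42), for $n \ge n''_\delta$, we have that,
\[ P\big(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \tilde\epsilon'_{i,n}\big) \le \frac{\delta}{4 \ell N} + \frac{\delta}{4 \ell N} + \frac{\delta}{4 \ell N} + \frac{\delta}{4 \ell N} = \frac{\delta}{\ell N}, \]
which implies that,
\[ P\Big(\sum_{n = n''_\delta + 1}^{N} 2 \sup_{1 \le i \le \ell} |\hat f_{i,n}(X_n) - f_i(X_n)| \ge \sum_{n = n''_\delta + 1}^{N} 2 \max_{1 \le i \le \ell} \tilde\epsilon'_{i,n-1}\Big) \le \delta. \tag{4.43} \]
Now we want to get a bound for the regret due to the randomization error. Given $\epsilon > 0$, since $P(I_n \ne \hat i_n) = (\ell - 1)\pi_n$, we have by Hoeffding's inequality that,
\[ P\Big(A\Big(\sum_{n = n''_\delta + 1}^{N} I(I_n \ne \hat i_n) - \sum_{n = n''_\delta + 1}^{N} (\ell - 1)\pi_n\Big) \ge \epsilon\Big) \le \exp\Big(-\frac{2\epsilon^2}{N A^2}\Big). \]
Taking $\epsilon = A\sqrt{(N/2)\log(1/\delta)}$, we get,
\[ P\Big(A \sum_{n = n''_\delta + 1}^{N} I(I_n \ne \hat i_n) \ge A \sum_{n = n''_\delta + 1}^{N} (\ell - 1)\pi_n + A\sqrt{\frac{N}{2}\log\Big(\frac{1}{\delta}\Big)}\Big) \le \delta. \tag{4.44} \]
Therefore, from (4.43) and (4.44), we get that with probability at least $1 - 2\delta$, the cumulative regret satisfies,
\[ R_N(\eta_1) < A n''_\delta + \sum_{n = n''_\delta + 1}^{N} \Big( 2 \max_{1 \le i \le \ell} w(Lh_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{h^d_{q(n)} \pi_n q(n)}} + A(\ell - 1)\pi_n \Big) + A\sqrt{\frac{N}{2}\log\Big(\frac{1}{\delta}\Big)}, \]
where $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12 \ell N^2 / \delta) / (c_5^2 c \tilde{\tilde a}_1 (2L)^d)}$.

4.4.4 Proof outline for the case when delays depend on covariates

Notice that Lemmas 4.5 and 4.6 remain exactly the same, as those results are conditional probability results, given the time points when rewards were observed and the covariate positions for which arms were assigned.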
In this section, we will discuss the steps in the proof for Theorem 4.1.12 where the dependence of delays on covariates plays a role. Recall, Mn+1(x) = ∑n j=1 I{||Xj − x||∞ ≤ Lhτn , tj ≤ n} and σt = min{n¯ : ∑n¯ j=n′δ+1 I(tj ≤ N) ≥ t}. Then we have, EAN (Mn+1(x)) = E( n∑ j=1 I{||Xj − x||∞ ≤ Lhτn , tj ≤ n} | An) = n∑ j=1 I{tj ≤ n}P (||Xj − x||∞ ≤ Lhτn | tj ≤ n) ≥ n∑ j=1 I{tj ≤ n}P (||Xj − x|| ≤ Lhτn , tj ≤ n) = n∑ j=1 I{tj ≤ n}P (tj ≤ n | ||Xj − x||∞ ≤ Lhτn)P (||Xj − x||∞ ≤ Lhτn) ≥ τn∑ t=1 en−sσt c(2Lhτn) d, where the last inequality follows by (4.7), which assumes a uniform bound on the cdf’s of delay distributions across the covariate space. Then, in the proof of theorem 4.1.12, the step (4.21) is where dependence of delays on covariates plays a role. Consider, PAN (|fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ ) ≤ PAN ( |fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ ,Mn+1(Xn+1) ≤ c(2Lhτn) d ∑τn t=1 en−sσt 2 ) + PAN ( |fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ ,Mn+1(Xn+1) > c(2Lhτn) d ∑τn t=1 en−sσt 2 ) ≤ PAN ( Mn+1(Xn+1) ≤ c(2Lhτn) d ∑τn t=1 en−sσt 2 ) + PAN ( |fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ ,Mn+1(Xn+1) > c(2Lhτn) d ∑τn t=1 en−sσt 2 ) 110 ≤ exp ( − 3c(2Lhτn) d ∑τn t=1 en−sσt 28 ) + exp ( − 3c(2Lhτn) dpiτn ∑τn t=1 en−sσt 56 ) + 4N exp ( − c25c(2Lhτn) dpiτn ∑τn t=1 en−sσt (− w(Lhτn ; fi))2 8c24v 2 + 8c4c(− w(Lhτn ; fi)) ) . (4.45) Now, in order to provide an upper bound for the estimation error, we want to remove the condition on AN from the above probability bound (4.45). Recall that τn = ∑n j=1 I(tj ≤ n). Consider, P (tj ≤ n) = ∫ [0,1]d P (tj ≤ n|Xj = x)p(x)dx = ∫ [0,1]d Gx(tj ≤ n)p(x)dx ≥ ∫ [0,1]d en−sjp(x)dx = en−sj , where, the second to last inequality follows from (4.7). Therefore, we get that, E( τn∑ t=1 en−sσt ) = E  n∑ j=1 I{tj ≤ n}en−sj  = n∑ j=1 en−sjP (tj ≤ n) ≥ n∑ j=1 e2n−sj . Therefore, by the extended Bernstein’s inequality (A.2), we have that, P ( τn∑ t=1 en−sσt ≤ ∑n j=1 e 2 n−sj 2 ) = P  n∑ j=1 I(tj ≤ n)en−sj ≤ ∑n j=1 e 2 n−sj 2  ≤ exp ( −3 ∑n j=1 e 2 n−sj 28 ) . 
111 Now consider, P (|fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ ) = P ( |fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ , τn∑ t=1 en−sσt ≤ ∑n j=1 e 2 n−sj 2 ) + P ( |fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ , τn∑ t=1 en−sσt > ∑n j=1 e 2 n−sj 2 ) ≤ P ( τn∑ t=1 en−sσt ≤ ∑n j=1 e 2 n−sj 2 ) + P ( |fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ , τn∑ t=1 en−sσt > ∑n j=1 e 2 n−sj 2 ) ≤ P ( τn∑ t=1 en−sσt ≤ ∑n j=1 e 2 n−sj 2 ) + P ( |fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ , τn∑ t=1 en−sσt > a1q(n) 2 ) ≤ P ( τn∑ t=1 en−sσt ≤ ∑n j=1 e 2 n−sj 2 ) + EPAN ( |fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ , τn∑ t=1 en−sσt > a1q(n) 2 ) ≤ exp ( −3 ∑n j=1 e 2 n−sj 28 ) + EPAN ( |fˆi,n+1(Xn+1)− fi(Xn+1)| ≥ , τn∑ t=1 en−sσt > a1q(n) 2 ) , (4.46) for large enough n, where a1 is a positive constant arises from Assumption 4.2.2. Also, notice that ∑τn t=1 en−sσt ≤ τn, hence, τn∑ t=1 en−sσt > a1q(n) 2 ⇒ τn > a1q(n) 2 . Let q1(n) = q(n)/2, we get, for τn > a1q1(n), since we have the condition that hdq(n)piq(n)q(n)/ log n→∞, for large enough n, we can assume that hdτnτn ≥ a˜1hdq1(n)q1(n) and hdτnpiτnτn ≥ a˜1hdq1(n)piq1(n)q1(n), where a˜1 is a constant that is function of constant 112 a1, which depends on the user determined choice of sequences {pin} and {hn}. Then for large enough n, − w(Lhq(n); fi) > 0, and we have using (4.45) and (A.2) in (4.46), ≤ exp ( −3a1q1(n) 14 ) + exp ( −3ca˜1(2Lhq1(n)) dq1(n) 28 ) + exp ( −3ca˜1(2Lhq1(n)) dq1(n)piq1(n) 56 ) + 4N exp ( −c 2 5ca˜1(2Lhq1(n)) dq1(n)piq1(n)(− w(Lhq1(n); fi))2 8c24v 2 + 8c4c(− w(Lhq1(n); fi)) ) ≤ exp ( −3a1q(n) 28 ) + exp ( −3ca˜1(2Lhq(n)) dq(n) 56 ) + exp ( −3ca˜1(2Lhq(n)) dq(n)piq(n) 112 ) + 4N exp ( −c 2 5ca˜1(2Lhq(n)) dq(n)piq(n)(− w(Lhq(n); fi))2 16c24v 2 + 16c4c(− w(Lhq(n); fi)) ) . (4.47) Note, that the above inequality we get is the same as (4.24). A similar analysis can be done for strategy η2 by replacing piτn with pin. Next, the steps of the proof that follow remain the same and will lead to exactly same results as in the proofs for both strategies η1 and η2. 
However, it is important to recognize that although the inequalities (4.24) and (4.47) look alike (as will the final regret upper bounds), the underlying meaning of $q(n)$ is different in the two setups, and the bounds could certainly look very different in the two settings. The bound for the randomization error will also look similar to (4.37) for strategy $\eta_1$, and the essential difference in the proof lies in the following steps. The first step that is different is (4.30). Consider,
\[
\begin{aligned}
P(\sigma_{t+1} - \sigma_t > M)
&\le P\Big(d_{\sigma_t + 1} > \frac{M}{2}, \dots, d_{\sigma_t + M/2} > \frac{M}{2}\Big) \\
&\le \prod_{s=1}^{M/2} P\Big(d_{\sigma_t + s} > \frac{M}{2}\Big) \\
&\le \prod_{s=1}^{M/2} \int_{[0,1]^d} P\Big(d_{\sigma_t + s} > \frac{M}{2} \,\Big|\, X_{\sigma_t + s} = x\Big) p(x)\,dx \\
&= \prod_{s=1}^{M/2} \Big(1 - \int_{[0,1]^d} G_x\Big(\frac{M}{2}\Big) p(x)\,dx\Big) \\
&\le \prod_{s=1}^{M/2} \big(1 - e_{(M/2) - s}\big) \\
&\le \Big(\frac{M/2 - \sum_{s=1}^{M/2} e_{(M/2) - s}}{M/2}\Big)^{M/2} \\
&\le \Big(1 - \frac{\sqrt{a_1 q(M/2)}}{M/2}\Big)^{M/2}.
\end{aligned}
\tag{4.48}
\]
Let $F_n$ be the $\sigma$-field generated by $(Z^n, X_n, I_n)$. Since the total number of observed rewards $\tau_N$ will depend on the covariates, we replace $E(\tau_N)$ in the proof of the randomization error bound for the independent case with $E(\tau_N \mid F_N)$, and then we can apply the Azuma-Hoeffding inequality as stated in Lemma A.1.2.
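Consider,
\[
\begin{aligned}
&P\Big(A\Big(\sum_{n = n'_\delta + 1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t\Big) \ge \epsilon\Big) \\
&= P\Big(\sum_{n = n'_\delta + 1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t \ge \frac{\epsilon}{A},\ \max_t(\sigma_{t+1} - \sigma_t) \ge M_\delta\Big) \\
&\quad + P\Big(\sum_{n = n'_\delta + 1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t \ge \frac{\epsilon}{A},\ \max_t(\sigma_{t+1} - \sigma_t) < M_\delta\Big) \\
&\le P\Big(\max_t(\sigma_{t+1} - \sigma_t) \ge M_\delta\Big) \\
&\quad + P\Big(\sum_{n = n'_\delta + 1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t \ge \frac{\epsilon}{A},\ \max_t(\sigma_{t+1} - \sigma_t) < M_\delta,\ \tau_N \ge E(\tau_N \mid F_N) + \frac{\epsilon}{A}\Big) \\
&\quad + P\Big(\sum_{n = n'_\delta + 1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t \ge \frac{\epsilon}{A},\ \max_t(\sigma_{t+1} - \sigma_t) < M_\delta,\ \tau_N < E(\tau_N \mid F_N) + \frac{\epsilon}{A}\Big) \\
&\le P\Big(\max_t(\sigma_{t+1} - \sigma_t) \ge M_\delta\Big) + P\Big(\tau_N \ge E(\tau_N \mid F_N) + \frac{\epsilon}{A}\Big) \\
&\quad + P\Big(\sum_{n = n'_\delta + 1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t \ge \frac{\epsilon}{A},\ \max_t(\sigma_{t+1} - \sigma_t) < M_\delta,\ \tau_N < E(\tau_N \mid F_N) + \frac{\epsilon}{A}\Big) \\
&\le \delta + \exp\Big(-\frac{2\epsilon^2}{A^2 N}\Big) \\
&\quad + E\Big[P_{A_N}\Big(\sum_{n = n'_\delta + 1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t \ge \frac{\epsilon}{A},\ \max_t(\sigma_{t+1} - \sigma_t) < M_\delta,\ \tau_N < E(\tau_N \mid F_N) + \frac{\epsilon}{A}\Big)\Big],
\end{aligned}
\tag{4.49}
\]
where the second inequality follows from the Azuma-Hoeffding inequality. Then using Lemma A.1.5 (Bernstein's inequality for martingales) we have that,
\[
\begin{aligned}
&P_{A_N}\Big(A\Big(\sum_{n = n'_\delta + 1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t\Big) \ge \epsilon,\ \max_t(\sigma_{t+1} - \sigma_t) < M_\delta,\ \tau_N < E(\tau_N \mid F_N) + \frac{\epsilon}{A}\Big) \\
&\le \begin{cases}
\exp\Big(-\dfrac{\epsilon^2}{2A^2 M_\delta (E(\tau_N \mid F_N) + \epsilon)/4 + \epsilon/3}\Big), & \text{if } \max_t(\sigma_{t+1} - \sigma_t) < M_\delta,\ \tau_N < E(\tau_N \mid F_N) + \epsilon/A; \\
0, & \text{otherwise.}
\end{cases}
\end{aligned}
\]
Then, the remaining proof follows as the proof of Theorem 4.1.12. We get the same results as (4.37) and (4.44) for both strategies $\eta_1$ and $\eta_2$, except for the fact that $q(n)$ has a different meaning and interpretation, based on our definition in (4.7) and Assumption 4.1.7. Combining the results for the estimation error bounds and the randomization error bounds, we get the bounds for the cumulative regret for strategies $\eta_1$ and $\eta_2$ as in Theorems 4.2.3 and 4.2.4, respectively.

4.5 Supplementary real-data results

In section 4.3, we conducted a real data analysis with 150 steps of initialization. In Figure 4.3, we change the number of initialization steps to 1000.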
Our proposed strategies $\eta_1$ and $\eta_2$ still perform better than DeLinUCB for delay settings 2 and 3, but for delay setting 4 the results are now comparable. This could be due to sufficient learning during initialization combined with less observed data because of severe delays, which suggests a need to run on a bigger data sample.

[Figure 4.3: Boxplots of normalized CTRs for the three methods ($\eta_1$, $\eta_2$, DeLinUCB) for 1000 rounds of initialization. Panels: delay settings 2, 3 and 4 with $\pi_n = \log(n)^{-2}$, $h_n = n^{-1/6}$; delay settings 2, 3 and 4 with $\pi_n = \log(n)^{-2}$, $h_n = \log(n)^{-1}$.]

Chapter 5

Doctor's intervention in randomized allocation strategy

This chapter, somewhat distinct from the previous chapters, aims to extend the work of Yang and Zhu (2002) with the intent of improving medical practice. Suppose there are two competing treatments A and B for a disease that have been considered comparable in their effect, but the doctor wants to assess their performance on patients while taking patient characteristics (covariates) into account. We could use an MABC algorithm to help the doctor. Every time a patient visits, the doctor assigns a treatment and the effect (reward) of that treatment is measured. After a couple of visits, once there is some data to assess the performance of both treatments, the MABC algorithm recommends the treatment that should be given to the next patient based on his/her covariates.
Since the optimality of these MABC strategies has already been established, we know that the treatment decisions made over time will eventually benefit the patients. However, real life is more complicated, and some factors may have been ignored when these treatment decisions were made. In such situations, the doctor might want to give a patient a different treatment than the one recommended by the algorithm. This disagreement could result from hard-to-quantify information or from the doctor's judgment based on their experience. Therefore, in this work we propose to integrate the cases where such a disagreement arises into an adaptive MABC algorithm. The goal is to show that the proposed integrated allocation strategy is consistent, that is, in the long run the cumulative reward of the algorithm is equivalent to the best possible cumulative reward. In section 5.1 we lay out the problem setup, in section 5.2 we describe the allocation strategy, and in section 5.3 we outline the proof of consistency for the proposed allocation strategy.

5.1 Problem setup

Assume that there are $\ell \ge 2$ arms available for playing. After pulling an arm, a random reward is generated. Each time, before deciding which arm to pull, a $d$-dimensional covariate $x \in \mathbb{R}^d$ is observed. This contains information about the patient such as their age, gender, genetic factors, etc. For simplicity, let us assume that there is only one doctor in this study. We assume, without loss of generality, that the covariates are continuous variables taking values in the hypercube $[0,1]^d$. The mean reward of the $i$th arm at covariate $x$ is denoted by $f_i(x)$, $1 \le i \le \ell$. Ideally, if the $f_i$'s were known, then given the observed covariate $x$ one would pull the arm with the largest mean reward at $x$; that is, one would choose the arm $i^*(x)$ that attains $f^*(x) = \max_{1 \le i \le \ell} f_i(x)$.
The actual reward of pulling the $i$th arm at covariate $x$ is modeled as $Y_{i,j} = f_i(x) + \epsilon_j$, where the $\epsilon_j$ denote independent random errors with mean 0 and finite variance. Let,

• $X_1, \dots, X_n, \dots$ be a sequence of covariates independently generated from a population supported in $[0,1]^d$.
• $P_X$ denote the underlying probability distribution, which is also assumed to be unknown.
• $Y_{i,j}$ denote the reward of pulling the $i$th arm when the covariate $X_j$ is presented.
• $I_j$, $j \ge 1$, be the chosen arm at time $j \in \mathbb{N}$.
• $\gamma_j$ be the indicator of whether or not the doctor thinks that the $j$th patient is a special case.
• $T_j$ be the indicator of whether or not the system allowed the doctor to make their decision when they declared the $j$th case as special.

At time $n \ge 1$, let $Z^{n,i}$ denote the set of observations $\{(X_j, \gamma_j, T_j, Y_{I_j,j}) : 1 \le j \le n,\ I_j = i\}$ for which the $i$th arm is pulled. The total mean reward up to time $n$ is $\sum_{j=1}^{n} f_{I_j}(X_j)$. The goal is to maximize the total reward after a number of plays.

The performance of the doctor is evaluated in batches. The size of each batch is determined by the number of cases considered special by the doctor ($d_k$ for the $k$th batch, which is pre-determined). Suppose we divide the trials until time step $n$ into batches of size $N_{d_k}$, $k = 1, \dots, M_n$, such that $N_{d_k}$ is the time step at which the $d_k$th case was considered special in the $k$th batch. As a result, we have $\sum_{k=1}^{M_n} N_{d_k} \le n$; since we are interested in asymptotic results ($n \to \infty$), w.l.o.g. we assume that $\sum_{k=1}^{M_n} N_{d_k} = n$ for simplicity. It is important to note that $\{d_k; k = 1, \dots, M_n\}$ are known quantities specified by the decision maker. In order to evaluate the performance in a batch, the proportion of times the doctor is allowed to go with their instinct, for the cases they think are special, is determined adaptively.
Since, for making a comparison, we are only interested in the cases where the doctor felt the patient's case was special and wanted to use a different treatment than the one proposed by the algorithm, we only look at the cases where this disagreement arises. We call this the subsequence indexed by $\{j_v; v = 1, \dots, d_k\}$ for the $k$th batch. Let $m_k$ be the proportion of times the doctor is allowed to follow their instinct in the $k$th batch. Then in the next batch (the $(k+1)$th batch), the proportion $m_{k+1}$ is determined based on the following criterion,
\[ B_{k+1} := \frac{\sum_{v=1}^{d_k} Y_{I_{j_v}, j_v} I\{T_{j_v} = 1\}}{m_k d_k} - \frac{\sum_{v=1}^{d_k} Y_{I_{j_v}, j_v} I\{T_{j_v} = 0\}}{(1 - m_k) d_k} > \beta_{k+1}, \tag{5.1} \]
with $\beta_{k+1} \le 0$ a threshold that is determined adaptively by the decision maker. Depending on whether the above condition is met, we update the value of $m_k$ for the next batch. In particular, $m_{k+1}$ remains the same as $m_k$ with probability $p_{k+1}$, and is reduced to a value $m_{k+1} \le m_k$ with probability $1 - p_{k+1}$, where,
\[ p_{k+1} = \Pr\Big(\frac{\sum_{v=1}^{d_k} Y_{I_{j_v}, j_v} I\{T_{j_v} = 1\}}{m_k d_k} - \frac{\sum_{v=1}^{d_k} Y_{I_{j_v}, j_v} I\{T_{j_v} = 0\}}{(1 - m_k) d_k} > \beta_{k+1} \,\Big|\, m_k\Big). \tag{5.2} \]

5.1.1 Regret and consistency

Let $\delta$ be a proposed sequential allocation rule and let $I_1, I_2, \dots$ be the arms chosen at times $j = 1, 2, \dots$. With the allocation rule, given the previous observations and $X_j$, the mean reward at the given $X_j$ is $f_{I_j}(X_j)$ for $j \ge 1$. The total of this mean reward up to time $n$ is $\sum_{j=1}^{n} f_{I_j}(X_j)$. Without knowing the random errors, the ideal performance occurs when the choices $I_1, \dots, I_n$ match $i^*(X_1), \dots, i^*(X_n)$, yielding the optimal total reward $\sum_{j=1}^{n} f^*(X_j)$. The quantity of interest here is the regret of our allocation scheme $\delta$, which is given by,
\[ R_n(\delta) = \frac{\sum_{j=1}^{n} f_{I_j}(X_j)}{\sum_{j=1}^{n} f^*(X_j)}. \]
Clearly, $R_n$ is a random variable no bigger than 1. It measures the performance of the allocation rule relative to the ideal one with the optimal arm known for each $x$.

Definition 5.1.1. An allocation rule $\delta$ is said to be strongly consistent if $R_n(\delta) \to 1$ with probability 1.
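The two quantities defined above, the batch statistic on the left-hand side of (5.1) and the ratio $R_n(\delta)$, can be computed as in the following minimal sketch. The function names and the toy numbers are hypothetical and serve only to illustrate the definitions; the sketch assumes $0 < m_k < 1$ so that both denominators in (5.1) are positive.

```python
def batch_statistic(rewards, allowed, m_k):
    """Left-hand side B_{k+1} of (5.1) for one batch of disagreement cases.

    rewards[v] is the observed reward of the v-th disagreement case and
    allowed[v] is T (1 if the doctor decided, 0 if the system overruled).
    Requires 0 < m_k < 1.
    """
    d_k = len(rewards)
    doctor_total = sum(y for y, t in zip(rewards, allowed) if t == 1)
    system_total = sum(y for y, t in zip(rewards, allowed) if t == 0)
    return doctor_total / (m_k * d_k) - system_total / ((1 - m_k) * d_k)

def regret_ratio(mean_rewards, chosen):
    """R_n = sum_j f_{I_j}(X_j) / sum_j f*(X_j).

    mean_rewards[j][i] holds f_i(X_j); chosen[j] is the pulled arm I_j.
    """
    obtained = sum(row[i] for row, i in zip(mean_rewards, chosen))
    optimal = sum(max(row) for row in mean_rewards)
    return obtained / optimal

# Toy batch: four disagreement cases, the doctor decided in half of them.
b_next = batch_statistic([1.0, 0.0, 1.0, 1.0], [1, 0, 1, 0], m_k=0.5)

# Toy run with two arms and three time steps.
ratio = regret_ratio([[0.2, 0.8], [0.6, 0.4], [0.5, 0.9]], chosen=[1, 0, 0])
print(b_next, ratio)
```

A decision maker would compare `b_next` with the threshold $\beta_{k+1}$ to decide whether the proportion for the next batch is kept or reduced; how the reduced value is chosen is not specified by this sketch.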
Remark. If $\frac{1}{n}\sum_{j=1}^{n} f^*(X_j)$ is eventually bounded above and away from 0 with probability 1, then $R_n(\delta) \overset{a.s.}{\to} 1$ is equivalent to $\frac{1}{n}\sum_{j=1}^{n} (f_{I_j}(X_j) - f^*(X_j)) \overset{a.s.}{\to} 0$.

We use the strategy developed in Yang and Zhu (2002), extend it to incorporate the doctor's interventions, and present an adaptive allocation strategy in section 5.2. Then we consider two scenarios: (1) when the doctor performs poorly as compared to the algorithm, and (2) when the doctor performs better than or at par with the algorithm. We show strong consistency for each of these scenarios.

5.2 Proposed allocation strategy

There are three main ingredients in our approach to selecting an arm: (1) nonparametric estimation of the functions $f_i$, (2) a proper allocation rule to control the exploitation-exploration trade-off, and (3) incorporating the doctor's interventions as part of the allocation rule. Let $\{\pi_j, j \ge 1\}$ be a sequence of positive numbers decreasing to 0 and let $m_1$ be the proportion of special cases where the doctor is allowed to make their decision in the first batch.

Step 1 Initialize. Each patient is allotted to a treatment based on what the doctor chooses to give, say $I_1 = i_1, I_2 = i_2, \dots, I_{t_0} = i_{t_0}$, where $i_k \in \{1, \dots, \ell\}$ for each $k$ in $1, \dots, t_0$. This allocation is done until the doctor has made their decision $t_0$ times; then, from the $(t_0 + 1)$th time onwards, allot the arms that have never been allotted so far (if any). Suppose that all arms have been allotted at least once in $m_0$ steps.

Step 2 Estimate the individual functions $f_i$. For $n = m_0$, based on the current data $Z^{n,i}$, estimate $\hat f_{i,n}$ for $1 \le i \le \ell$ using the chosen regression procedure.

Step 3 Estimate the best arm. For the next covariate $X_{n+1}$, let $\hat i_{n+1}$ be the maximizer of $\hat f_{i,n}(X_{n+1})$ over $1 \le i \le \ell$. Now, $\hat i_{n+1}$ is the algorithm's recommendation at the $(n+1)$th time step.

Step 4 At time step $n + 1$, let
\[ \gamma_{n+1}(X_{n+1}) = \begin{cases} 1 & \text{if the doctor disagrees with the recommendation,} \\ 0 & \text{if the doctor agrees with the recommendation.} \end{cases} \]
Let $i'_{n+1}(X_{n+1})$ (with $i'_{n+1} \ne \hat i_{n+1}$) be the arm chosen by the doctor when $\gamma_{n+1}(X_{n+1}) = 1$.

Step 4a If the doctor agrees with the recommendation (i.e. if $\gamma_{n+1}(X_{n+1}) = 0$), then the allocation is based on the $\epsilon$-greedy heuristic. That is, randomly select an arm, with probability $1 - (\ell - 1)\pi_{n+1}$ for $i = \hat i_{n+1}$ and with probability $\pi_{n+1}$ for each of the remaining arms.

Step 4b If the doctor disagrees with the recommendation (i.e. if $\gamma_{n+1}(X_{n+1}) = 1$), he/she is allowed to make their decision in a proportion $m_1$ of the times there is a disagreement, and for the remaining proportion $1 - m_1$ of the times their decision is overruled by the system recommendation. Let,
\[ T_{n+1} = \begin{cases} 1 & \text{if the system lets the doctor decide, i.e. } i'_{n+1} \text{ is chosen,} \\ 0 & \text{if the system overrides the doctor's decision.} \end{cases} \]
If $T_{n+1} = 0$, randomly select an arm based on the $\epsilon$-greedy heuristic. If $T_{n+1} = 1$, then we select $i = i'_{n+1}$ with probability 1.

Step 4c Let $I_{n+1}$ denote the selected arm. Pull the arm $I_{n+1}$ to receive the reward.

Step 5 Update the estimates based on the available information only for the cases when the doctor agreed with the system's decision. After the new observation $(X_{n+1}, \gamma_{n+1}, T_{n+1}, I_{n+1}, Y_{I_{n+1}, n+1})$, update the function estimate of $f_i$ for $i = I_{n+1}$.

Step 6 Repeat steps 2-5 when the next covariate $X_{n+2}$ surfaces, and so on, until we have had a total of $d_1$ disagreements, out of which the doctor has made $d_1 m_1$ decisions of their own. This will be the first batch, with size $N_{d_1}$.

Step 7 In order to compare the performance of the doctor's decisions with that of the recommendations, we denote the sequence of times where the doctor disagrees by $\{t_v : v = 1, \dots, d_1\}$. Then take the $d_1 m_1$ cases where the doctor is allowed to make his/her decision and the $d_1(1 - m_1)$ cases where their disagreement is overruled. We compare the cumulative reward for these two subsequences as follows.

1. Doctor performs worse than the algorithm: For the first batch, we will have that for $\beta_1 \le 0$,
\[ \frac{\sum_{v=1}^{d_1} Y_{I_{t_v}, t_v} I\{T_{t_v} = 1\}}{m_1 d_1} - \frac{\sum_{v=1}^{d_1} Y_{I_{t_v}, t_v} I\{T_{t_v} = 0\}}{(1 - m_1) d_1} < \beta_1. \tag{3} \]
If this is the case, we would want to decrease $m_1$ for the next time points ($m_2 < m_1$) and force the doctor to go with the system's recommendation more often. $\beta_1$ is replaced by $\beta_2$ from a sequence of non-positive numbers $\beta_k$ converging to zero as $k \to \infty$.

2. Doctor performs better than or at par with the algorithm: For the sequence $\beta_k$, the criterion for the first batch in this case is,
\[ \frac{\sum_{v=1}^{d_1} Y_{I_{t_v}, t_v} I\{T_{t_v} = 1\}}{m_1 d_1} - \frac{\sum_{v=1}^{d_1} Y_{I_{t_v}, t_v} I\{T_{t_v} = 0\}}{(1 - m_1) d_1} > \beta_1. \tag{2} \]
In this case we would want to give the doctor more chances to make his/her decisions, so we can let $m_1$ be the same for the next batch. $\beta_1$ is replaced by $\beta_2$ in the next batch analysis, and the rate of increase of $\beta_k$ will be discussed later.

Step 8 Repeat all the steps for the second batch of size $N_{d_2}$, which has a total of $d_2$ disagreements in which the doctor makes $d_2 m_2$ decisions. When we are at the $n$th time step, we have repeated the same analysis for $k = 2, 3, \dots, M_n$ groups with sizes $N_{d_k}$, $k = 1, \dots, M_n$, such that $\sum_{k=1}^{M_n} N_{d_k} = n$.

This allocation strategy is outlined in the flow chart in Figure 5.1.

[Figure 5.1: Flow chart of the allocation strategy. Nodes: Patient arrives; Estimate the best arm; Doctor agrees; Doctor disagrees; System overrides; Doctor's choice; Evaluate doctor's performance; Update the parameters.]

5.2.1 Regression procedures

Various regression procedures can be used to estimate the individual mean functions $f_i$. In Yang and Zhu (2002), two procedures are discussed: the histogram method and the nearest neighbor method. Strong consistency of their proposed allocation rule was proved for both regression procedures; however, finite-time regret bounds were not established. In another work, Qian and Yang (2016a) established strong consistency and a finite-time regret analysis for their proposed allocation strategy using kernel estimation methods. We will assume the finite-time regret results from Qian and Yang (2016a) in proving the consistency of our proposed allocation strategy of section 5.2.
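Steps 4a to 4c of the allocation strategy can be sketched as a single function. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name `allocate` is hypothetical, each non-recommended arm is drawn with probability $\pi_{n+1}$ so that the selection probabilities sum to one, and the rule "the doctor decides in a proportion $m$ of the disagreements" is modeled as an independent coin flip with success probability $m$.

```python
import random

def allocate(best_arm, doctor_arm, n_arms, pi, m, rng):
    """One allocation step; returns (selected_arm, gamma, T).

    best_arm:   the algorithm's recommendation (arms are 0, ..., n_arms - 1).
    doctor_arm: the doctor's choice; equal to best_arm when the doctor agrees.
    pi:         exploration probability for each non-recommended arm.
    m:          probability that the doctor's choice is allowed on a
                disagreement (assumption: an independent coin flip).
    T is reported as 0 whenever the doctor agrees with the recommendation.
    """
    gamma = int(doctor_arm != best_arm)
    if gamma == 1 and rng.random() < m:
        # Step 4b with T = 1: the doctor's arm is selected with probability 1.
        return doctor_arm, gamma, 1
    # Step 4a, or Step 4b with T = 0: randomized (epsilon-greedy) selection.
    if rng.random() < 1 - (n_arms - 1) * pi:
        return best_arm, gamma, 0
    others = [arm for arm in range(n_arms) if arm != best_arm]
    return rng.choice(others), gamma, 0

rng = random.Random(0)
# The doctor disagrees (prefers arm 2 over the recommended arm 0)
# and is allowed to decide half of the time.
print(allocate(best_arm=0, doctor_arm=2, n_arms=3, pi=0.1, m=0.5, rng=rng))
```

Passing a seeded `random.Random` instance keeps the sketch reproducible; a deterministic quota that enforces the proportion exactly within a batch would be an equally plausible reading of Step 4b.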
The regression methods are briefly discussed below.

1. Kernel method: Let $J_{i,n} = \{j : I_j = i, 1 \le j \le n\}$ be the set of past time points at which arm $i$ is pulled. Consider a multivariate nonnegative kernel function $K(u) : \mathbb{R}^d \to \mathbb{R}$ that satisfies Lipschitz, boundedness and bounded support conditions. Let $h_n$ denote the bandwidth, where $h_n \to 0$ as $n \to \infty$. The Nadaraya-Watson estimator of $f_i(x)$ is
\[ \hat f_{i,n+1}(x) = \frac{\sum_{j \in J_{i,n+1}} Y_{i,j} K\big(\frac{x - X_j}{h_n}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_n}\big)}. \]

2. Histogram method: Partition $[0,1]^d$ into $M = (1/h)^d$ (hyper-)cubes with side width $h$. For each $x$, let $J(x) = \{j : 1 \le j \le n,\ X_j \text{ and } x \text{ belong to the same cube}\}$. Let $N(x)$ denote the size of $J(x)$. Then
\[ \hat f(x) = \frac{1}{N(x)} \sum_{j \in J(x)} Y_j. \]

3. Nearest neighbor method: Let the distance on $[0,1]^d$ be the Euclidean distance. For a chosen integer $N = N_n$ and $x \in [0,1]^d$, let $J(x; N) = \{j : 1 \le j \le n \text{ and } X_j \text{ is among the } N \text{ closest points to } x\}$. Then let,
\[ \hat f(x) = \frac{1}{N} \sum_{j \in J(x; N)} Y_j. \]

We assume that there exist constants $\rho$ and $\kappa \le 1$ such that for each reward function $f_i$, the modulus of continuity as defined in section A.0.2 satisfies $w(h; f_i) \le \rho h^{\kappa}$. This will be used in the results in section 5.3.3, where we assume the rate of convergence as in Qian and Yang (2016a).

5.3 Consistency of the proposed strategy

We will show consistency of the strategy proposed in section 5.2 for two scenarios: (1) when the doctor performs worse than the algorithm, and (2) when the doctor performs better than or at par with the algorithm.

5.3.1 Layout of the proof

Recall that $m_k$ is the proportion of special cases in which the doctor is allowed to make their decision. Making use of the inequality in section 5.3.2, we show the following for the two cases.

• Doctor performs worse than the algorithm: $m_k \overset{a.s.}{\to} 0$ as $k \to \infty$ (section 5.3.3).
• Doctor performs better than the algorithm: $m_k$ is only reduced for a finite number of batches with probability 1.

These results are then used to prove consistency for both cases in sections 5.3.5 and 5.3.6, respectively.
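The three regression procedures of section 5.2.1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the tent kernel $K(u) = \max(0, 1 - \|u\|_\infty)$ is just one hypothetical kernel that is nonnegative, Lipschitz, bounded and of bounded support, and returning 0.0 when no observation falls in the window or cube is a convention chosen for the sketch, not something specified in the text.

```python
import math

def tent_kernel(u):
    """Hypothetical kernel: K(u) = max(0, 1 - sup-norm of u)."""
    return max(0.0, 1.0 - max(abs(c) for c in u))

def nadaraya_watson(x, xs, ys, h, kernel=tent_kernel):
    """Kernel-weighted average of the rewards ys observed at covariates xs."""
    weights = [kernel([(a - b) / h for a, b in zip(x, xj)]) for xj in xs]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # convention for an empty window (assumption)
    return sum(w * y for w, y in zip(weights, ys)) / total

def histogram_estimate(x, xs, ys, h):
    """Average of the rewards whose covariates share the cube of x."""
    bins = int(round(1 / h))
    cell = lambda p: tuple(min(int(c / h), bins - 1) for c in p)
    same = [y for xj, y in zip(xs, ys) if cell(xj) == cell(x)]
    return sum(same) / len(same) if same else 0.0

def nearest_neighbor(x, xs, ys, n_neighbors):
    """Average of the rewards at the n_neighbors closest covariates."""
    order = sorted(range(len(xs)), key=lambda j: math.dist(x, xs[j]))
    nearest = order[:n_neighbors]
    return sum(ys[j] for j in nearest) / len(nearest)

# Toy one-dimensional data for a single arm.
xs = [[0.1], [0.2], [0.8], [0.9]]
ys = [1.0, 3.0, 10.0, 20.0]
print(nadaraya_watson([0.15], xs, ys, h=0.2))
print(histogram_estimate([0.15], xs, ys, h=0.5))
print(nearest_neighbor([0.85], xs, ys, n_neighbors=2))
```

In the allocation strategy these estimators would be applied arm by arm, using only the observations in which that arm was pulled.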
We start by proving a useful inequality that is used in the following sections.

5.3.2 A preliminary result

Since the value of the proportion $m_{k+1}$ for the $(k+1)$th batch depends on whether or not (5.1) is satisfied, it is important to have a sense of how (5.2) grows, and an upper bound for it will guide the further analysis of the problem. First we make some assumptions.

Assumption 5.3.1. The functions $f_i$ are nonnegative and continuous on $[0,1]^d$, and $E[f^*(X_1)] > 0$.

Assumption 5.3.2. The design distribution $P_X$ is dominated by the Lebesgue measure, with density $p(x)$ uniformly bounded above and away from 0 on $[0,1]^d$; that is, $p(x)$ satisfies $\underline{c} \le p(x) \le \bar{c}$ for some positive constants $\underline{c} < \bar{c}$.

Assumption 5.3.3. The errors satisfy a moment condition: there exist positive constants $v$ and $c$ such that, for all $m \ge 2$,
\[ E|\epsilon_{ij}|^m \le \frac{m!}{2} v^2 c^{m-2}. \tag{5.3} \]

Lemma 5.3.4. Suppose Assumptions 5.3.1-5.3.3 are met. Then for the $k$th batch, the cumulative error terms for the cases when the doctor makes decisions versus when the algorithm overrules the doctor's decisions satisfy the following inequality,
\[
P\Big(\frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v}, t_v} I\{T_{t_v} = 1\}}{m_k d_k} - \frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v}, t_v} I\{T_{t_v} = 0\}}{(1 - m_k) d_k} > \beta'_k\Big)
\le \exp\Big(-\frac{d_k \big(\frac{m_k \beta'_k}{2}\big)^2}{2\big(v^2 + c\,\frac{m_k \beta'_k}{2}\big)}\Big) + \exp\Big(-\frac{d_k \big(\frac{(1 - m_k)\beta'_k}{2}\big)^2}{2\big(v^2 + c\,\frac{(1 - m_k)\beta'_k}{2}\big)}\Big). \tag{5.4}
\]

Proof. Consider,
\[
\begin{aligned}
&P\Big(\frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v}, t_v} I\{T_{t_v} = 1\}}{m_k d_k} - \frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v}, t_v} I\{T_{t_v} = 0\}}{(1 - m_k) d_k} > \beta'_k\Big) \\
&= P\Big(\frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v}, t_v} I\{T_{t_v} = 1\}}{m_k d_k} + \frac{\sum_{v=1}^{d_k} (-\epsilon_{I_{t_v}, t_v}) I\{T_{t_v} = 0\}}{(1 - m_k) d_k} > \beta'_k\Big) \\
&\le P\Big(\frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v}, t_v} I\{T_{t_v} = 1\}}{m_k d_k} \ge \frac{\beta'_k}{2}\Big) + P\Big(\frac{\sum_{v=1}^{d_k} (-\epsilon_{I_{t_v}, t_v}) I\{T_{t_v} = 0\}}{(1 - m_k) d_k} \ge \frac{\beta'_k}{2}\Big) \\
&= P\Big(\sum_{v=1}^{d_k} \epsilon_{I_{t_v}, t_v} I\{T_{t_v} = 1\} \ge d_k \frac{m_k \beta'_k}{2}\Big) + P\Big(\sum_{v=1}^{d_k} (-\epsilon_{I_{t_v}, t_v}) I\{T_{t_v} = 0\} \ge d_k \frac{(1 - m_k)\beta'_k}{2}\Big).
\end{aligned}
\]
Bounding each of the two summands using Assumption 5.3.3 and (A.3), we get,
\[
\le \exp\Big(-\frac{d_k \big(\frac{m_k \beta'_k}{2}\big)^2}{2\big(v^2 + c\,\frac{m_k \beta'_k}{2}\big)}\Big) + \exp\Big(-\frac{d_k \big(\frac{(1 - m_k)\beta'_k}{2}\big)^2}{2\big(v^2 + c\,\frac{(1 - m_k)\beta'_k}{2}\big)}\Big).
\]
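To get a feel for the size of the right-hand side of (5.4), the following minimal sketch evaluates it numerically. The function name `lemma_bound` and the parameter values are hypothetical, and the sketch assumes a positive threshold $\beta'_k$ and $0 < m_k < 1$.

```python
import math

def lemma_bound(d_k, m_k, beta, v, c):
    """Right-hand side of (5.4): sum of two Bernstein-type exponential terms.

    d_k: number of special cases in the batch; m_k: proportion of cases the
    doctor decides; beta: positive threshold; v, c: moment-condition constants.
    """
    def term(share):
        t = share * beta / 2
        return math.exp(-d_k * t ** 2 / (2 * (v ** 2 + c * t)))
    return term(m_k) + term(1 - m_k)

# With d_k = 200, m_k = 0.5, beta = 1 and v = c = 1, each exponent equals -5.
bound = lemma_bound(d_k=200, m_k=0.5, beta=1.0, v=1.0, c=1.0)
print(bound)
```

For fixed $m_k$ and threshold the bound decays exponentially in the batch size $d_k$, which is the property the later sections rely on when the number of special cases per batch grows.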
5.3.3 Scenario 1: doctor performs worse than the algorithm

Let us consider the case where the doctor's instincts do not produce results comparable to the algorithm's choices of treatments. In this case, we want to show that the proportion of times the doctor gets a chance to make their decision decreases with time. That is, we need to show that $m_k \to 0$ almost surely as $k \to \infty$. This means that there are infinitely many batches in which the proportion of times the doctor is allowed to make their special decisions is reduced. Let $A$ denote the set
\[
A := \{\text{sample paths } \omega \text{ such that } m_k \text{ gets reduced only for a finite number of batches}\},
\]
that is, $A$ represents all the sample paths for which the proportion of times the doctor makes their own treatment decisions gets reduced only a finite number of times, assuming that the trial is run over an indefinite time period. Also, define the set $A_i$ denoting the event that no reduction happened in $m_i$ from the $i$th batch onwards. In other words, the doctor did not perform poorly after the $i$th batch:
\[
A_i := \{\omega : m_{i-2}(\omega) > m_{i-1}(\omega) \text{ and } m_k(\omega) = m_{i-1}(\omega)\ \forall k \ge i\},
\]
where $m_{i-1}(\omega)$ is the proportion of special cases that the doctor is allowed to treat at his/her discretion in the $(i-1)$th batch for sample path $\omega$. We make the following assumptions.

Assumption 5.3.5. We assume that the doctor performs poorly: $(f^*(X_j) - f_{I_j}(X_j))\, I\{T_j = 1\} \ge a$ for some $a > 0$.

Assumption 5.3.6. Let $R^1_{N_k}(\delta)$ be the regret for those cases when the algorithm's choices of arms are being played. If the algorithm's choice is made $N_k$ times, then for this regret let us assume $R^1_k(\delta) = O\big(N_k^{\,1 - \frac{1}{3 + d/\kappa}}\big)$ a.s. (as in Qian and Yang (2016a)), where $d$ is the dimension of the covariates and $\kappa$ is the Hölder smoothness parameter.

Assumption 5.3.7. Assume that $\left| \frac{\sum_{j=1}^{n} f^*(X_j)}{n} - E f^*(X_1) \right| = O\!\left(\frac{\log n}{n}\right)$ a.s.

Theorem 5.3.8. If Assumptions 5.3.5-5.3.7 are satisfied, then $P(A) = 0$, which implies that $m_k \to 0$ almost surely as $k \to \infty$.

The proof of Theorem 5.3.8 can be found in section 5.4.
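The behaviour described in Scenario 1 can be illustrated with a toy simulation. This is a rough sketch under loudly stated assumptions, not the thesis procedure: the batch statistic mirrors the event $B_k$ used in the proof in section 5.4, but the halving rule for $m_k$, the Gaussian rewards, the cutoff sequence $\beta_k = -a/(4(k+1))$ and all parameter names are hypothetical choices made only for illustration.

```python
import random


def simulate_m_path(n_batches=50, d=200, m0=0.5, mu_alg=1.0, a=0.5,
                    sigma=0.5, shrink=0.5, seed=0):
    """Toy simulation of the doctor's proportion m_k when the doctor is worse.

    Assumptions (not from the thesis): rewards are Gaussian, the doctor's
    mean reward is mu_alg - a, each special case goes to the doctor with
    probability m_k, and m_k is multiplied by `shrink` whenever the batch
    comparison fails.
    """
    rng = random.Random(seed)
    m = m0
    path = [m]
    for k in range(1, n_batches + 1):
        # Hypothetical cutoffs: beta_1 = -a/8 > -a/4 and beta_k increases to 0.
        beta_k = -a / (4.0 * (k + 1))
        doctor_sum = 0.0
        algo_sum = 0.0
        for _ in range(d):
            if rng.random() < m:      # T = 1: the doctor decides
                doctor_sum += rng.gauss(mu_alg - a, sigma)
            else:                     # T = 0: the algorithm decides
                algo_sum += rng.gauss(mu_alg, sigma)
        # Normalised difference of cumulative rewards, as in the event B_k.
        stat = doctor_sum / (m * d) - algo_sum / ((1.0 - m) * d)
        if stat <= beta_k:
            m *= shrink               # assumed reduction rule
        path.append(m)
    return path


if __name__ == "__main__":
    path = simulate_m_path()
    print("m after 10 batches:", path[10])
    print("m after 50 batches:", path[-1])
```

In this sketch the doctor's mean reward is lower by `a` in every case, so the comparison fails in most batches and the simulated $m_k$ decays toward zero, which is the qualitative behaviour that Theorem 5.3.8 asserts.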
5.3.4 Scenario 2: doctor performs at par with the algorithm

Unlike in the previous case, here we want to show that there is a non-zero probability for the event that the proportion of times (special cases) the doctor is allowed to make their decision is decreased only finitely many times. For that we want to show that $P(A_i) > 0$ for some $i$; without loss of generality we show that $P(A_2) > 0$. For this case we make the following assumptions.

Assumption 5.3.9. For all batches $k \in \{1, 2, \ldots, M_n\}$,
\[
\frac{\sum_{v=1}^{d_k} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_k) d_k} - \frac{\sum_{v=1}^{d_k} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_k d_k} \le 0.
\]

Assumption 5.3.10. Let $d_1$ be the number of special cases in batch 1. The number of special cases per batch $d_k$ and the cutoff $\beta_k$ are such that
\[
\frac{d_{k-1} \beta_k^2}{\log k} \to \infty \quad \text{as } k \to \infty.
\]

Assumption 5.3.11. The number of times the doctor agrees with the algorithm is non-zero for all the batches.

Theorem 5.3.12. Given that Assumptions 5.3.9-5.3.11 hold, we have that $P(A_2) > 0$, i.e., with positive probability, no reduction happens in the number of chances the doctor gets to treat special cases from the second batch onwards.

The proof of Theorem 5.3.12 can be found in section 5.4.

5.3.5 Consistency for scenario 1: doctor performs worse

Here, we will use the result of section 5.3.3 to prove consistency of our proposed allocation scheme. Along with the conditions specified in section 5.3.3 and Assumptions 5.3.1, 5.3.2 and 5.3.3, we need the following assumptions.

Assumption 5.3.13. The regression procedure is strongly consistent in $L_\infty$ norm for all individual mean functions $f_i$ under the proposed allocation scheme, that is, $\|\hat f_{i,n} - f_i\|_\infty \to 0$ a.s. for each $1 \le i \le l$ as $n \to \infty$.

Assumption 5.3.14. The mean functions satisfy $f_i(x) \ge 0$, $E(f^*(X_1)) > 0$ and
\[
A = \sup_{1 \le i \le l}\ \sup_{x \in [0,1]^d} \big(f^*(x) - f_i(x)\big) < \infty.
\]

Lemma 5.3.15. Let $U_k = \sum_{v=1}^{d_k} I\{T_{t_v} = 1\}$, $k = 1, \ldots, M_n$, denote the number of special cases in which the doctor made a treatment decision in batch $k$. Then
\[
\frac{\sum_{k=1}^{M_n} U_k}{n} \to 0 \quad \text{almost surely.}
\]

Proof.
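We have already shown in section 5.3.3 that $m_k \to 0$ almost surely. In other words, given $\epsilon > 0$,
\[
P(\exists\, M^* > 0 \text{ such that } m_k < \epsilon\ \ \forall k > M^*) = 1.
\]
Assuming that $k > M^*$ corresponds to $n > M_1^*$, we consider
\[
\frac{1}{n} \sum_{k=1}^{M_n} U_k = \frac{1}{n} \sum_{k=1}^{M_n} m_k d_k = \frac{1}{n} \sum_{k=1}^{M^*} m_k d_k + \frac{1}{n} \sum_{k=M^*+1}^{M_n} m_k d_k.
\]
Since the first summand has only finitely many terms, for the given $\epsilon > 0$ there exists $M_2^* > 0$ such that $\frac{1}{n}\sum_{k=1}^{M^*} m_k d_k < \epsilon$ for all $n > M_2^*$. Then for $n \ge \max\{M_1^*, M_2^*\}$,
\[
\frac{1}{n} \sum_{k=1}^{M_n} U_k \le \epsilon + \frac{\epsilon}{n} \sum_{k=M^*+1}^{M_n} d_k \le 2\epsilon
\]
with probability 1, because $\sum_{k=M^*+1}^{M_n} d_k \le n$. Therefore we have shown that $\frac{1}{n} \sum_{k=1}^{M_n} U_k \to 0$ almost surely.

Theorem 5.3.16. Given Assumptions 5.3.5-5.3.7 and Assumptions 5.3.13-5.3.14, we have that $R_n(\delta) \to 1$ almost surely.

Proof of Theorem 5.3.16. The regret of our allocation scheme $\delta$ is given by
\[
\begin{aligned}
R_n(\delta) &= \frac{\sum_{j=1}^{n} f_{I_j,j}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&= \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&= \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{j=1}^{n} (f_{I_j,j} - f_{i^*_j,j}) I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{j=1}^{n} f_{i^*_j,j} I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&\ge \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} - A\, \frac{\sum_{j=1}^{n} I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{j=1}^{n} f_{i^*_j,j} I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&\ge \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} - A\, \frac{\sum_{j=1}^{n} I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}.
\end{aligned}
\]
We can rewrite the right-hand side of the above inequality as
\[
\frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} - A\, \frac{\sum_{k=1}^{M_n} \sum_{t=N_{d_{k-1}}+1}^{N_{d_k}} I\{\gamma_t=1, T_t=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}. \tag{5.5}
\]
Then, given that the doctor declared $d_k$ special cases in the $k$th batch, we can extract a subsequence $\{t_v : v = 1, \ldots, d_k\}$ for each batch marking the special cases, so that (5.5) equals
\[
\frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} - A\, \frac{\sum_{k=1}^{M_n} \sum_{v=1}^{d_k} I\{T_{t_v}=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}. \tag{5.6}
\]
Let $U_k = \sum_{v=1}^{d_k} I\{T_{t_v}=1\}$. Then (5.6) equals
\[
\frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} - A\, \frac{\sum_{k=1}^{M_n} U_k}{\sum_{j=1}^{n} f_{i^*_j,j}}. \tag{5.7}
\]
Note that $U_k$ is a random variable denoting the number of times the doctor is allowed to make their decision out of the total special cases ($d_k$) considered in the $k$th batch. From Lemma 5.3.15, the second term in (5.7) converges to 0 almost surely. We want to show that the first term in (5.7) converges to 1 almost surely, that is,
\[
\frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \to 1 \quad \text{a.s. as } n \to \infty.
\]
Notice that the term above corresponds to the times when the algorithm's decision was made using the $\epsilon$-greedy heuristic. Let $\{j_v,\ v = 1, 2, \ldots, r_n\}$ be the subsequence where the algorithm's decision is chosen, i.e., those observations for which the doctor agreed with the algorithm's recommendation. Then we can write the above ratio as
\[
\frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} = \frac{\sum_{v=1}^{r_n} f_{I_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}}.
\]
We want to show that $\frac{\sum_{v=1}^{r_n} f_{I_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}} \to 1$ almost surely. We can rewrite this as
\[
\begin{aligned}
\frac{\sum_{v=1}^{r_n} f_{I_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}}
&= \frac{\sum_{v=1}^{r_n} f_{\hat i_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{v=1}^{r_n} (f_{I_{j_v},j_v} - f_{\hat i_{j_v},j_v})}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&\ge \frac{\sum_{v=1}^{r_n} f_{\hat i_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}} - \frac{r_n}{n} \cdot \frac{\frac{1}{r_n} \sum_{v=1}^{r_n} A\, I\{I_{j_v} \ne \hat i_{j_v}\}}{\frac{1}{n} \sum_{j=1}^{n} f_{i^*_j,j}} \\
&= \frac{\sum_{v=1}^{r_n} (f_{\hat i_{j_v},j_v} - f_{i^*_{j_v},j_v})}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{v=1}^{r_n} f_{i^*_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}} - \frac{r_n}{n} \cdot \frac{\frac{1}{r_n} \sum_{v=1}^{r_n} A\, I\{I_{j_v} \ne \hat i_{j_v}\}}{\frac{1}{n} \sum_{j=1}^{n} f_{i^*_j,j}} \\
&\ge \frac{\sum_{v=1}^{r_n} (f_{\hat i_{j_v},j_v} - f_{i^*_{j_v},j_v})}{\sum_{j=1}^{n} f_{i^*_j,j}} - \frac{r_n}{n} \cdot \frac{\frac{1}{r_n} \sum_{v=1}^{r_n} A\, I\{I_{j_v} \ne \hat i_{j_v}\}}{\frac{1}{n} \sum_{j=1}^{n} f_{i^*_j,j}}.
\end{aligned} \tag{5.8}
\]
Let $\tilde U_v = I\{I_{j_v} \ne \hat i_{j_v}\}$. Then the $\tilde U_v$'s are independent Bernoulli random variables after $j_v = m_0$ with success probability $(l-1)\pi_{j_v}$. Since
\[
\sum_{v=m_0+1}^{\infty} \mathrm{Var}\left( \frac{\tilde U_v}{v} \right) = \sum_{v=m_0+1}^{\infty} \frac{(l-1)\pi_{j_v}\big(1 - (l-1)\pi_{j_v}\big)}{v^2} < \infty,
\]
by Kolmogorov's two-series theorem we have that
\[
\sum_{v=m_0+1}^{\infty} \frac{\tilde U_v - (l-1)\pi_{j_v}}{v}
\]
converges a.s. It then follows by Kronecker's lemma that
\[
\frac{1}{r_n} \sum_{v=1}^{r_n} \big(\tilde U_v - (l-1)\pi_{j_v}\big) \to 0 \quad \text{a.s.}
\]
Since $\pi_{j_v} \to 0$ as $k \to \infty$, we have $\frac{1}{r_n} \sum_{v=1}^{r_n} (l-1)\pi_{j_v} \to 0$ and thus $\frac{1}{r_n} \sum_{v=1}^{r_n} \tilde U_v \to 0$ a.s. We also have that $\frac{r_n}{n} \in [0,1]$ a.s.; thus we have shown that the second term in (5.8) vanishes.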
Now, we need to show that
\[
\frac{\sum_{v=1}^{r_n} (f_{\hat i_{j_v},j_v} - f_{i^*_{j_v},j_v})}{\sum_{j=1}^{n} f_{i^*_j,j}} \to 0 \quad \text{a.s.}
\]
Note that we restrict ourselves to the cases where the doctor agreed with the system recommendation. When estimating the mean functions we do not use the information from all the special cases claimed by the doctor; each arm's reward function is estimated using only the cases where the doctor agreed with the algorithm. Let $\tilde j_{v-1}$ refer to the time step preceding $j_v$ in the subsequence $\{j_v : v = 1, \ldots, r_n\}$. Consider
\[
\begin{aligned}
f_{\hat i_{j_v}}(X_{j_v}) - f_{i^*_{j_v}}(X_{j_v}) &= f_{\hat i_{j_v}}(X_{j_v}) - \hat f_{\hat i_{j_v}, \tilde j_{v-1}}(X_{j_v}) + \hat f_{\hat i_{j_v}, \tilde j_{v-1}}(X_{j_v}) - \hat f_{i^*(X_{j_v}), \tilde j_{v-1}}(X_{j_v}) \\
&\quad + \hat f_{i^*(X_{j_v}), \tilde j_{v-1}}(X_{j_v}) - f_{i^*(X_{j_v})}(X_{j_v}).
\end{aligned}
\]
By the definition of $\hat i_{j_v}$, for $j_v > m_0 + 1$ we have $\hat f_{\hat i(X_{j_v}), \tilde j_{v-1}}(X_{j_v}) \ge \hat f_{i^*(X_{j_v}), \tilde j_{v-1}}(X_{j_v})$, so that
\[
\begin{aligned}
f_{\hat i_{j_v}}(X_{j_v}) - f_{i^*_{j_v}}(X_{j_v}) &\ge f_{\hat i_{j_v}}(X_{j_v}) - \hat f_{\hat i_{j_v}, \tilde j_{v-1}}(X_{j_v}) + \hat f_{i^*(X_{j_v}), \tilde j_{v-1}}(X_{j_v}) - f_{i^*(X_{j_v})}(X_{j_v}) \\
&\ge -2 \sup_{1 \le i \le l} \|\hat f_{i, \tilde j_{v-1}} - f_i\|_\infty.
\end{aligned}
\]
For $1 \le j_v \le m_0$, we have $f_{\hat i_{j_v}}(X_{j_v}) - f^*(X_{j_v}) \ge -A$. By Assumption 5.3.13, $\|\hat f_{i,j-1} - f_i\|_\infty \to 0$ a.s. as $j \to \infty$ for each $i$, and thus $\sup_{1 \le i \le l} \|\hat f_{i,j-1} - f_i\|_\infty \to 0$ a.s. Let $v = c_{m_0}$ be the first index for which $j_v > m_0$; then it follows for $n > m_0$ that
\[
\frac{\sum_{v=1}^{r_n} \big(f_{\hat i_{j_v}}(X_{j_v}) - f_{i^*_{j_v}}(X_{j_v})\big)}{\sum_{j=1}^{n} f^*(X_j)} \ge \frac{-A m_0 / n - (2/n) \sum_{v=c_{m_0}}^{r_n} \sup_{1 \le i \le l} \|\hat f_{i, \tilde j_{v-1}} - f_i\|_\infty}{(1/n) \sum_{j=1}^{n} f^*(X_j)}.
\]
The right-hand side converges to 0 almost surely and hence the conclusion follows.

5.3.6 Consistency for scenario 2: doctor performs better

The case in which the doctor performs better than the algorithm is advantageous, since the doctor's decisions beat the algorithm's. As consistency has already been established for the decision rule in Yang and Zhu (2002), we know that the choice of treatments made by our algorithm would converge to the optimal in the long run. Hence, the fact that the doctor's choices work even better is an added advantage, and convergence to the optimal will probably be faster in this case.

Theorem 5.3.17. Given Assumptions 5.3.9-5.3.11 and Assumptions 5.3.13-5.3.14, we have that $R_n(\delta) \to 1$ almost surely.

Proof of Theorem 5.3.17.
Consider the regret of our allocation scheme,
\[
R_n(\delta) = \frac{\sum_{j=1}^{n} f_{I_j,j}}{\sum_{j=1}^{n} f_{i^*_j,j}} = \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}.
\]
By Assumption 5.3.9,
\[
\frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=1, T_j=1\}}{n} \ge \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=1, T_j=0\}}{n},
\]
and we have already shown in section 5.3.4 that there is a positive probability that the number of times the doctor is given a chance to treat the special patients is non-decreasing. Also, by Assumption 5.3.11, the number of cases in which the doctor agrees with the algorithm is non-zero for each batch. We therefore have
\[
R_n(\delta) \ge \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} + 2\, \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \ge \frac{\sum_{j=1}^{n} f_{I_j,j} I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}.
\]
Now, we can use exactly the same proof as in the previous case to show that $R_n(\delta) \to 1$ almost surely as $n \to \infty$. This is because we still have $r_n \to \infty$ as $n \to \infty$, since the number of times the algorithm's choice is made is also not decreasing as $n \to \infty$.

5.4 Proofs for Theorems 5.3.8 and 5.3.12

Proof of Theorem 5.3.8. Recall that $A$ is the set of all sample paths for which $m_k$ gets reduced only a finite number of times, and that the set $A_i$ denotes the event that no reduction happened in $m_i$ from the $i$th batch onwards. It can be seen that $A = \cup_{i=2}^{\infty} A_i$. Also, notice that the $A_i$'s are disjoint for all $i = 2, \ldots$; then
\[
P(A) = P(\cup_{i=2}^{\infty} A_i) = \sum_{i=2}^{\infty} P(A_i).
\]
Let $\mathcal{M}$ denote the set of possible values of $m_{i-1}$ for the $(i-1)$th batch (or the set of all possible paths until the $(i-1)$th batch). Then
\[
\begin{aligned}
P(A_i) = \sum_{\mathcal{M}} &P(A_i \mid \text{the proportion of times the doctor treats in batch } i-1 = m_{i-1}) \\
&\times P(\text{the proportion of times the doctor treats in batch } i-1 = m_{i-1}).
\end{aligned} \tag{5.9}
\]
It is important to note here that for the event $A_i$ to happen we need condition (5.1) to hold ($B_k$ occurs) for all $k \ge i$.
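Also note that, since $m_k = m_{i-1}$ for all $k \ge i$, the events $B_k$, $k \ge i$, are independent of each other; hence,
\[
\begin{aligned}
&P(A_i \mid \text{the proportion of times the doctor gets a chance to treat in batch } i-1 = m_{i-1}) \\
&= \prod_{k=i}^{\infty} P(B_k \mid \text{the proportion of times the doctor gets a chance to treat in batch } k-1 = m_{i-1}) \\
&= \prod_{k=i}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} Y_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} Y_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k \right) \\
&= \prod_{k=i}^{\infty} P\Bigg( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k \\
&\qquad\qquad + \Bigg\{ \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \Bigg\} \Bigg).
\end{aligned}
\]
Adding and subtracting the corresponding sums of $f^*$, this is
\[
\begin{aligned}
\le \prod_{k=i}^{\infty} P\Bigg( &\frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k \\
&+ \Bigg\{ \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} \\
&\quad + \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \\
&\quad + \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \Bigg\} \Bigg)
\end{aligned} \tag{5.10}
\]
\[
\begin{aligned}
\le \prod_{k=i}^{\infty} P\Bigg( &\frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k \\
&+ \Bigg\{ \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} \\
&\quad + \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \\
&\quad + \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} - E f^*(X_1) + E f^*(X_1) - \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \Bigg\} \Bigg).
\end{aligned} \tag{5.11}
\]
Next, we use Assumption 5.3.6 of section 5.3.3 for the first pair of summands inside the braces on the right-hand side of (5.10). Let $N_k = (1-m_{i-1}) d_{k-1}$; then $R^1_k(\delta) = \sum_{v=1}^{d_{k-1}} \big(f^*(X_{j_v}) - f_{I_{j_v}}(X_{j_v})\big) I\{T_{j_v}=0\}$, and by Assumption 5.3.6 there exists a constant $C^*$ such that
\[
\frac{R^1_k(\delta)}{(1-m_{i-1}) d_{k-1}} \le C^*\, \frac{\big((1-m_{i-1}) d_{k-1}\big)^{1 - \frac{1}{3+d/\kappa}}}{(1-m_{i-1}) d_{k-1}} = C_1^*\, \frac{1}{(d_{k-1})^{\frac{1}{3+d/\kappa}}}
\]
almost surely. Then for $0 < \epsilon < \frac{a}{8}$ there exists $M_1^* > 0$ such that, for $k > M_1^*$,
\[
\left| C_1^*\, \frac{1}{(d_{k-1})^{\frac{1}{3+d/\kappa}}} \right| < \epsilon.
\]
Assumption 5.3.5 can be used for the second pair of summands and Assumption 5.3.7 for the third pair of summands on the right-hand side of (5.10). In (5.11), we have added and subtracted $E f^*(X_1)$ in the last pair of summands in (5.10).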
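Then each of those quantities will be of the order $O\!\left( \frac{\log((1-m_{i-1}) d_{k-1})}{(1-m_{i-1}) d_{k-1}} \right)$ a.s. and $O\!\left( \frac{\log(m_{i-1} d_{k-1})}{m_{i-1} d_{k-1}} \right)$ a.s., respectively. Hence there exists $M_2^* > 0$ such that for $k \ge M_2^*$, $\left| \frac{\log((1-m_{i-1}) d_{k-1})}{(1-m_{i-1}) d_{k-1}} \right| < \frac{\epsilon}{2}$, and there exists $M_3^* > 0$ such that $\left| \frac{\log(m_{i-1} d_{k-1})}{m_{i-1} d_{k-1}} \right| < \frac{\epsilon}{2}$ for $k \ge M_3^*$. Let $M^* = \max\{M_1^*, M_2^*, M_3^*\}$; then almost surely, for $k \ge M^*$,
\[
\begin{aligned}
&\le \prod_{k=M^*}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k - \epsilon + a - \frac{\epsilon}{2} - \frac{\epsilon}{2} \right) \\
&\le \prod_{k=M^*}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k - \frac{a}{8} + a - \frac{a}{8} \right).
\end{aligned}
\]
Let $\beta_1 > -\frac{a}{4}$ and $\beta_k \uparrow 0$; then
\[
\le \prod_{k=M^*}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \frac{a}{2} \right).
\]
Using the inequality (5.4),
\[
\begin{aligned}
&\le \prod_{k=M^*}^{\infty} \left[ \exp\left( - \frac{d_{k-1} m_{i-1}^2 (a/2)^2}{2(v^2 + m_{i-1} a c / 2)} \right) + \exp\left( - \frac{d_{k-1} (1-m_{i-1})^2 (a/2)^2}{2(v^2 + (1-m_{i-1}) a c / 2)} \right) \right] \\
&= \prod_{k=M^*}^{\infty} \left[ \exp\left( - \frac{d_{k-1} m_{i-1}^2 a^2}{8(v^2 + m_{i-1} a c / 2)} \right) + \exp\left( - \frac{d_{k-1} (1-m_{i-1})^2 a^2}{8(v^2 + (1-m_{i-1}) a c / 2)} \right) \right].
\end{aligned}
\]
Let $\tilde m_{i-1} = \min\{m_{i-1}, 1 - m_{i-1}\}$; then
\[
\begin{aligned}
&\le \prod_{k=M^*}^{\infty} \left[ \exp\left( - \frac{d_{k-1} \tilde m_{i-1}^2 a^2}{8(v^2 + (1-\tilde m_{i-1}) a c / 2)} \right) + \exp\left( - \frac{d_{k-1} \tilde m_{i-1}^2 a^2}{8(v^2 + (1-\tilde m_{i-1}) a c / 2)} \right) \right] \\
&\le \prod_{k=M^*}^{\infty} 2 \exp\left( - \frac{d_{k-1} \tilde m_{i-1}^2 a^2}{8(v^2 + (1-\tilde m_{i-1}) a c / 2)} \right) \\
&= \lim_{t \to \infty} 2^{\,t - M^*} \exp\left( - \frac{\tilde m_{i-1}^2 a^2}{8(v^2 + (1-\tilde m_{i-1}) a c / 2)} \sum_{k=M^*}^{t} d_{k-1} \right).
\end{aligned}
\]
If the number of cases that the doctor regards as special is non-decreasing with time, then the exponential decay happens faster than the growth of $2^t$, and we have that
\[
P(A_i \mid \text{the proportion of times the doctor gets to treat in batch } i-1 = m_{i-1}) = 0 \quad \text{a.s.}
\]
Also, since $\mathcal{M}$ has only finitely many elements (there can only be a finite set of paths up to batch $i-1$, depending on how the user decides to decrease $m_k$, $k \le i-1$), the sum in (5.9) is 0 almost surely, that is, $P(A_i) = 0$ a.s. Therefore, we have shown that $P(A_i) = 0$ for all $i = 2, \ldots$. Hence, $P(A) = P(\cup_{i=2}^{\infty} A_i) = \sum_{i=2}^{\infty} P(A_i) = 0$. Thus, we have shown that the probability that the doctor's proportion of decisions is reduced only a finite number of times is zero.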
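This in turn implies that $m_k \to 0$ almost surely.

Proof of Theorem 5.3.12. Recall that $A_2$ is the event that the proportion of times the doctor is allowed to make their decision does not decrease after the second batch. Consider
\[
\begin{aligned}
&P(A_2 \mid \text{the proportion of times the doctor treats in batch } 1 = m_1) \\
&= \prod_{k=2}^{\infty} P(B_k \mid \text{the proportion of times the doctor treats in batch } 1 = m_1) \\
&= \prod_{k=2}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} Y_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} Y_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} > \beta_k \right) \\
&= \prod_{k=2}^{\infty} P\Bigg( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} > \beta_k + \Bigg\{ \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_1 d_{k-1}} \Bigg\} \Bigg).
\end{aligned}
\]
From Assumption 5.3.9,
\[
\frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_1 d_{k-1}} \le 0,
\]
hence
\[
\begin{aligned}
&\ge \prod_{k=2}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} > \beta_k \right) \\
&= \prod_{k=2}^{\infty} \left[ 1 - P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} < \beta_k \right) \right] \\
&= \prod_{k=2}^{\infty} \left[ 1 - P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}} > -\beta_k \right) \right].
\end{aligned}
\]
From the inequality obtained in (5.4),
\[
\begin{aligned}
&\ge \prod_{k=2}^{\infty} \left[ 1 - \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c(1-m_1)\beta_k/2)} \right) - \exp\left( - \frac{d_{k-1} m_1^2 \beta_k^2}{8(v^2 - m_1 \beta_k c/2)} \right) \right] \\
&= \exp\left[ \sum_{k=2}^{\infty} \log\left\{ 1 - \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c(1-m_1)\beta_k/2)} \right) - \exp\left( - \frac{d_{k-1} m_1^2 \beta_k^2}{8(v^2 - m_1 \beta_k c/2)} \right) \right\} \right].
\end{aligned} \tag{5.12}
\]
Let us denote
\[
z_k = \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c(1-m_1)\beta_k/2)} \right) + \exp\left( - \frac{d_{k-1} m_1^2 \beta_k^2}{8(v^2 - m_1 \beta_k c/2)} \right).
\]
Then (5.12) equals
\[
\exp\left[ \sum_{k=2}^{\infty} \log\{1 - z_k\} \right]. \tag{5.13}
\]
We know that for $y > 0$,
\[
\frac{y-1}{y} \le \log y \le y - 1.
\]
If $y = 1 - x$ for $0 < x < \frac{1}{2}$, then
\[
-2x \le \log(1-x) \le -x.
\]
Using this condition, and assuming that $z_k$ is small enough (smaller than $1/2$), we get that (5.13) is
\[
\begin{aligned}
&\ge \exp\left[ - \sum_{k=2}^{\infty} 2 z_k \right] \\
&= \exp\left[ -2 \sum_{k=2}^{\infty} \left\{ \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c(1-m_1)\beta_k/2)} \right) + \exp\left( - \frac{d_{k-1} m_1^2 \beta_k^2}{8(v^2 - m_1 \beta_k c/2)} \right) \right\} \right] \\
&\ge \exp\left[ -2 \sum_{k=2}^{\infty} \left\{ \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c\beta_1/2)} \right) + \exp\left( - \frac{d_{k-1} m_1^2 \beta_k^2}{8(v^2 - c\beta_1/2)} \right) \right\} \right].
\end{aligned}
\]
From Assumption 5.3.10, we have that $\frac{d_{k-1} \beta_k^2}{\log k} \to \infty$.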
This implies that the series in the exponent on the right-hand side of the inequality above converges, since for any constant $C > 0$ we eventually have $\exp(-C d_{k-1} \beta_k^2) \le k^{-2}$. Therefore,
\[
P(A_2 \mid \text{the proportion of times the doctor gets a chance to treat in the 1st batch} = m_1) > 0.
\]
Therefore we have that $P(A_2) \ne 0$, hence $P(A) = \sum_{i=2}^{\infty} P(A_i) > 0$. Hence, there is a positive probability that the doctor's chances to treat the patient in special situations get reduced only finitely many times.

Chapter 6

Conclusion

In this dissertation, we consider a contextual bandit problem with delayed feedback and propose randomized allocation strategies for the problem, with sequential treatment allocation as the motivation. We take a nonparametric approach in modeling the relationship of the rewards with the covariates. We compare strategies which differ in how the underlying exploration probability sequence is updated in the presence of delayed feedback, to see which ones perform better under different delay scenarios and underlying complexities of the problem. We study these strategies from both an asymptotic and a finite-time perspective, and draw comparisons under various simulated and real-data settings.

One of our major contributions is to consider random and unbounded delays in a nonparametric modeling framework for contextual bandits, as most other works on delayed contextual bandits are parametric in nature with fixed or bounded delays. Our results are promising as they could address a broader range of problems, especially situations where it is not possible to explicitly lay out a parametric model. Also, it is important to note that the assumptions we make on delays are mild and could potentially hold in many practical settings.

Another contribution of the work is that we try to relax the assumption of delays being independent of the choice of covariates. Relaxing this assumption is crucial for applications in the medical domain but has not been well studied. Our finite-time bounds for this case, although conservative, can form a starting point for further developments in this direction, which could potentially require the development of new mathematical tools that deal with the underlying dependence structure.

In another leg of work, we consider incorporating doctor/expert advice into automated bandit strategies. Allowing for these expert interventions would make these algorithms implementable in real-life scenarios, as expert advice is certainly crucial in decision-making processes. We propose a randomized allocation strategy which allows for the doctor's interventions and show that it is strongly consistent, that is, in the long run the cumulative reward for the proposed strategy approaches the cumulative reward of the theoretically best scenario.

In a nutshell, our research reveals that randomized allocation strategies for contextual bandits are useful and promising tools that could be used for sequential decision making in many applications. There is still a lot of statistical work that needs to be done, especially in terms of drawing statistical inference and establishing robustness of these methodologies. A rigorous understanding and the development of more robust strategies could be of immense help to health care providers in using the enormous amount of patient data available for making informed treatment decisions. There is a lot of scope for future developments in this promising field of research; some of the most immediate directions include the following.

• Developing minimax optimal finite-time results for the proposed strategies. This will help in a much deeper understanding of how these randomized strategies theoretically compare to other already existing strategies.

• Devising methodology for estimating delays when modeling for delayed rewards in a bandit setting.
This can be specifically important in scenarios where one might not have any prior understanding on the expected delay in observing the rewards. This could also be useful for studying the more complicated setting when delays depend on covariates and arm choices.

• Incorporating delays in the work with doctor's intervention in the proposed randomized allocation strategies, and studying their finite time properties. This would make these strategies more applicable and would require controlling the selection bias in the process.

• Developing statistical inference tools for contextual bandit strategies. While there has been some theoretical development in statistical inference on standard multi-armed bandit strategies, to our knowledge, technical tools required to build statistical inference theory in contextual bandits are yet to be developed. The presence of covariates in a sequential setup poses technical challenges that remain unexplored. Statistical inference could be highly beneficial in sequential decision making. For example, confidence intervals for our reward function estimates and robust procedures that remain valid even when some assumptions are violated can make these procedures more reliable.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.

Agarwal, A., Dudík, M., Kale, S., Langford, J., and Schapire, R. (2012). Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pages 19–26.

Agrawal, S. and Goyal, N. (2012). Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory (COLT).

Agrawal, S. and Goyal, N. (2013a). Further optimal regret bounds for thompson sampling. In Artificial Intelligence and Statistics, pages 99–107.

Agrawal, S. and Goyal, N. (2013b). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135.

Ahuja, V. and Birge, J. R. (2016). Response-adaptive designs for clinical trials: Simultaneous learning from multiple patients. European Journal of Operational Research, 248(2):619–633.

Anderson, T. (1964). Sequential analysis with delayed observations. Journal of the American Statistical Association, 59(308):1006–1015.

Anscombe, F. (1963). Sequential medical trials. Journal of the American Statistical Association, 58(302):365–383.

Armitage, P. et al. (1975). Sequential medical trials. 2nd edition.

Audibert, J.-Y. and Bubeck, S. (2010a). Best arm identification in multi-armed bandits.

Audibert, J.-Y. and Bubeck, S. (2010b). Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836.

Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902.

Audibert, J.-Y., Tsybakov, A. B., et al. (2007). Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002a). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.

Auer, P., Ortner, R., and Szepesvári, C. (2007). Improved rates for the stochastic continuum-armed bandit problem. In International Conference on Computational Learning Theory, pages 454–468. Springer.

Bartroff, J., Lai, T. L., and Shih, M.-C. (2013). Adaptive design of confirmatory trials. In Sequential Experimentation in Clinical Trials, pages 187–223. Springer.

Bastani, H. and Bayati, M. (2015). Online decision-making with high-dimensional covariates.

Bastani, H., Bayati, M., and Khosravi, K. (2017). Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26.

Birgé, L., Massart, P., et al. (1998). Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375.

Bistritz, I., Zhou, Z., Chen, X., Bambos, N., and Blanchet, J. (2019). Online exp3 learning in adversarial bandits with delayed feedback. In Advances in Neural Information Processing Systems, pages 11345–11354.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122.

Cella, L. and Cesa-Bianchi, N. (2019). Stochastic bandits with delay-dependent payoffs. arXiv preprint arXiv:1910.02757.

Cesa-Bianchi, N. and Fischer, P. (1998). Finite-time regret bounds for the multiarmed bandit problem. In ICML, volume 1998, pages 100–108. Citeseer.

Cesa-Bianchi, N., Gentile, C., and Mansour, Y. (2018). Nonstochastic bandits with composite anonymous feedback. In Conference On Learning Theory, pages 750–773.

Cesa-Bianchi, N., Gentile, C., Mansour, Y., and Minora, A. (2016). Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 49(1):613–650.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.

Chapelle, O. (2014). Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1097–1105. ACM.

Chapelle, O. and Li, L. (2011). An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257.

Chow, S.-C. and Chang, M. (2012). Adaptive design methods in clinical trials. CRC press Boca Raton, FL.

Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214.

Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. COLT, pages 355–366.

Desautels, T., Krause, A., and Burdick, J. W. (2014). Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923.

Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. (2011). Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. AUAI Press.

Eick, S. G. (1988a). Gittins procedures for bandits with delayed responses. Journal of the Royal Statistical Society: Series B (Methodological), 50(1):125–132.

Eick, S. G. (1988b). The two-armed bandit with delayed responses. The Annals of Statistics, pages 254–264.

Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594.

Fontaine, X., Berthet, Q., and Perchet, V. (2019). Regularized contextual bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2144–2153.

Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability, pages 100–118.

Goldenshluger, A. and Zeevi, A. (2013). A linear response bandit problem. Stochastic Systems, 3(1):230–261.

Goldenshluger, A., Zeevi, A., et al. (2009). Woodroofe's one-armed bandit problem revisited. The Annals of Applied Probability, 19(4):1603–1633.

Hoeffding, W. (1994). Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer.

Hu, Y., Kallus, N., and Mao, X. (2019). Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. arXiv preprint arXiv:1909.02553.

Joulani, P., Gyorgy, A., and Szepesvári, C. (2013). Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453–1461.

Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer.

Kim, E. S., Herbst, R. S., Wistuba, I. I., Lee, J. J., Blumenschein, G. R., Tsao, A., Stewart, D. J., Hicks, M. E., Erasmus, J., Gupta, S., et al. (2011). The battle trial: personalizing therapy for lung cancer. Cancer discovery, 1(1):44–53.

Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690. ACM.

Lai, T., Levin, B., Robbins, H., and Siegmund, D. (1985). Sequential medical trials. In Herbert Robbins Selected Papers, pages 247–250. Springer.

Lai, T. L. and Liao, O. Y.-W. (2012). Efficient adaptive randomization and stopping rules in multi-arm clinical trials for testing a new treatment. Sequential analysis, 31(4):441–457.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.

Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817–824. Citeseer.

Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824.

Lattimore, T. and Szepesvári, C. (2018). Bandit algorithms. Cambridge University Press.

Li, B., Chen, T., and Giannakis, G. B. (2019). Bandit online learning with unknown delays. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 993–1002.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM.

Mandel, T., Liu, Y.-E., Brunskill, E., and Popović, Z. (2015). The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Maurer, A. and Pontil, M. (2009). Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740.

May, B. C., Korda, N., Lee, A., and Leslie, D. S. (2012). Optimistic bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13(Jun):2069–2106.

McDiarmid, C. (1998). Concentration. In Probabilistic methods for algorithmic discrete mathematics, pages 195–248. Springer.

Murphy, S. A. (2005). An experimental design for the development of adaptive treatment strategies. Statistics in medicine, 24(10):1455–1481.

Nahum-Shani, I., Smith, S. N., Spring, B. J., Collins, L. M., Witkiewitz, K., Tewari, A., and Murphy, S. A. (2017). Just-in-time adaptive interventions (jitais) in mobile health: key components and design principles for ongoing health behavior support. Annals of Behavioral Medicine, 52(6):446–462.

Perchet, V. and Rigollet, P. (2013). The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721.

Perchet, V., Rigollet, P., Chassang, S., Snowberg, E., et al. (2016). Batched bandit problems. The Annals of Statistics, 44(2):660–681.

Pike-Burke, C., Agrawal, S., Szepesvari, C., and Grünewälder, S. (2017). Bandits with delayed anonymous feedback. stat, 1050:20.

Pike-Burke, C., Agrawal, S., Szepesvári, C., and Grunewalder, S. (2018). Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning.

Qian, W. and Yang, Y. (2016a). Kernel estimation and model combination in a bandit problem with covariates. Journal of Machine Learning Research, (1):5181–5217.

Qian, W. and Yang, Y. (2016b). Randomized allocation with arm elimination in a bandit problem with covariates. Electronic Journal of Statistics, 10(1):242–270.

Rabbi, M., Pfammatter, A., Zhang, M., Spring, B., and Choudhury, T. (2015). Automated personalized feedback for physical activity and dietary behavior change with mobile phones: a randomized controlled trial on adults. JMIR mHealth and uHealth, 3(2):e42.

Rigollet, P. and Zeevi, A. (2010). Nonparametric bandits with covariates. COLT 2010, page 54.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.

Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.

Russo, D. and Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243.

Sarkar, J. (1991). One-armed bandit problems with covariates. The Annals of Statistics, 19(4):1978–2002.

Slivkins, A. (2014). Contextual bandits with similarity information. Journal of Machine Learning Research, 15(1):2533–2568.

Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836.

Sommer Thune, T., Cesa-Bianchi, N., and Seldin, Y. (2019). Nonstochastic multiarmed bandits with unrestricted delays. arXiv preprint arXiv:1906.00670.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

Suzuki, Y. (1966). On sequential decision problems with delayed observations. Annals of the Institute of Statistical Mathematics, 18(1):229–267.

Sverdlov, O. (2015). Modern adaptive randomized clinical trials: statistical and practical aspects, volume 81. CRC Press.

Szorenyi, B., Busa-Fekete, R., Weng, P., and Hüllermeier, E. (2015). Qualitative multi-armed bandits: A quantile-based approach. In 32nd International Conference on Machine Learning, pages 1660–1668.

Tewari, A. and Murphy, S. A. (2017). From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Vernade, C., Cappé, O., and Perchet, V. (2017). Stochastic bandit models for delayed conversions. In Conference on Uncertainty in Artificial Intelligence.

Vernade, C., Carpentier, A., Zappella, G., Ermis, B., and Brueckner, M. (2018). Contextual bandits under delayed feedback. arXiv preprint arXiv:1807.02089.

Villar, S. S., Wason, J., and Bowden, J. (2015). Response-adaptive randomization for multi-arm clinical trials using the forward looking gittins index rule. Biometrics, 71(4):969–978.

Wanigasekara, N. and Yu, C. (2019). Nonparametric contextual bandits in metric spaces with unknown metric. In Advances in Neural Information Processing Systems, pages 14657–14667.

Wason, J. and Jaki, T. (2012). Optimal design of multi-arm multi-stage trials. Statistics in medicine, 31(30):4269–4279.

Wei, L. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association, 73(364):840–843.

Woodroofe, M. (1979). A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806.

Yang, Y. and Zhu, D. (2002).
Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. Annals of Statistics, pages 100–121.

Yoshikawa, Y. and Imai, Y. (2018). A nonparametric delayed feedback model for conversion rate prediction. arXiv preprint arXiv:1802.00255.

Zhou, Z., Xu, R., and Blanchet, J. (2019). Learning in generalized linear contextual bandits with stochastic delays. In Advances in Neural Information Processing Systems, pages 5198–5209.

Zimmert, J. and Seldin, Y. (2019). An optimal algorithm for adversarial bandits with arbitrary delays. arXiv preprint arXiv:1910.06054.

Appendix A

Appendix

In this appendix, we list the statistical concepts and well-known technical tools that are used regularly in the dissertation. We begin with the Borel-Cantelli Lemma.

Lemma A.0.1 (Borel-Cantelli). Let $(A_1, A_2, \ldots)$ be a sequence of events in a common probability space $(\Omega, \mathcal{F}, P)$ and set $A = \limsup_{n \to \infty} A_n$. If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(A) = 0$.

This result is useful in assessing almost sure convergence and is often used in the analysis presented in the chapters of this dissertation. Next, we define the modulus of continuity, which quantifies the maximum difference in the values of a function over pairs of points whose coordinates differ by at most $h$.

Definition A.0.2. Let $x_1, x_2 \in [0, 1]^d$. Then $w(h; f)$ denotes a modulus of continuity defined by
\[
w(h; f) = \sup\{ |f(x_1) - f(x_2)| : |x_{1k} - x_{2k}| \le h \text{ for all } 1 \le k \le d \}.
\]
It can be seen that if $f$ is continuous then $w(h; f) \to 0$ as $h \to 0$.

Next, we review some standard concentration inequalities, which are used in the chapters of this dissertation.

A.1 Concentration inequalities

Lemma A.1.1 (Hoeffding’s Inequality). Let $X_1, X_2, \ldots, X_n$ be independent real-valued random variables such that for each $i = 1, \ldots, n$ there exist some $a_i \le b_i$ such that $P[a_i \le X_i \le b_i] = 1$.
Then for every $\epsilon > 0$,
\[
P\left[ \sum_{i=1}^{n} X_i - E\sum_{i=1}^{n} X_i > \epsilon \right] \le \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).
\]
More such inequalities with their proofs can be found in Hoeffding (1994). The martingale version of Hoeffding’s inequality is known as the Azuma-Hoeffding inequality.

Lemma A.1.2 (Azuma-Hoeffding Inequality). Suppose $\mathcal{F}_j$, $j = 1, 2, \ldots$, is an increasing filtration of $\sigma$-fields. For each $j \ge 1$, let $X_j$ be $\mathcal{F}_j$-measurable such that $X_j \ge 0$ almost surely and $a_j \le X_j \le b_j$. Then for all $\epsilon > 0$,
\[
P\left[ \sum_{j=1}^{n} X_j - \sum_{j=1}^{n} E(X_j \mid \mathcal{F}_{j-1}) > \epsilon \right] \le \exp\left( -\frac{2\epsilon^2}{\sum_{j=1}^{n} (b_j - a_j)^2} \right).
\]
The reader is referred to McDiarmid (1998) for more details and a proof of the inequality.

Lemma A.1.3 (Bernstein’s Inequality). Let $X_1, \ldots, X_n$ be independent real-valued random variables with zero mean, and assume that $X_i \le 1$ with probability 1 for each $i$. Let $V_j = \mathrm{Var}(X_j)$ and $\sigma^2 = \frac{1}{n}\sum_{j=1}^{n} V_j$. For any $\epsilon > 0$,
\[
P\left[ \frac{1}{n} \sum_{i=1}^{n} X_i > \epsilon \right] \le \exp\left( -\frac{n\epsilon^2}{2\sigma^2 + 2\epsilon/3} \right). \tag{A.1}
\]
Proofs of these inequalities can be found in Cesa-Bianchi and Lugosi (2006).

Corollary A.1.4. Suppose $\tilde{W}_1, \tilde{W}_2, \ldots, \tilde{W}_n$ are independent Bernoulli random variables, where $\tilde{W}_j$ has success probability $\beta_j$. By Bernstein’s inequality in (A.1),
\[
P\left[ \sum_{j=1}^{n} \tilde{W}_j \le \Big( \sum_{j=1}^{n} \beta_j \Big) / 2 \right] \le \exp\left( -\frac{3 \sum_{j=1}^{n} \beta_j}{28} \right).
\]
The proof follows by substituting $\epsilon = (\sum_{j=1}^{n} \beta_j)/(2n)$ and $X_j = \beta_j - \tilde{W}_j$ in (A.1) and using $\mathrm{Var}(X_j) = \beta_j(1 - \beta_j) \le \beta_j$. Note that the same inequality holds for any Bernoulli-type random variables $W_j$ that take the values $a_j \le 1$ and $0$, for all $j \ge 1$.

Bernstein’s inequality has also been extended to the case of martingales.

Lemma A.1.5 (Bernstein’s Inequality for Martingales). Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\mathcal{F}_j$, $j = 1, 2, \ldots$, be an increasing filtration of sub-$\sigma$-fields of $\mathcal{F}$. Let $X_1, X_2, \ldots$ be random variables on $(\Omega, \mathcal{F}, P)$ such that $X_j$ is $\mathcal{F}_j$-measurable. Assume $|X_j| \le K$ with probability 1, for all $j \ge 1$.
Let $V_j = \mathrm{Var}(X_j \mid \mathcal{F}_{j-1})$ denote the conditional variances. Then for all positive real numbers $\epsilon$ and $v$,
\[
P\left[ \sum_{j=1}^{n} \big( X_j - E(X_j \mid \mathcal{F}_{j-1}) \big) > \epsilon,\ \sum_{j=1}^{n} V_j \le v \right] \le \exp\left( -\frac{\epsilon^2}{2(v + K\epsilon/3)} \right).
\]
The proof of this inequality can be found in Freedman (1975).

Corollary A.1.6 (Extended Bernstein Inequality). Suppose $\{\mathcal{F}_j, j = 1, 2, \ldots\}$ is an increasing filtration of $\sigma$-fields. For each $j \ge 1$, let $W_j$ be an $\mathcal{F}_j$-measurable Bernoulli random variable whose conditional success probability satisfies $P(W_j = 1 \mid \mathcal{F}_{j-1}) \ge \beta_j$ for some $\beta_j \in [0, 1]$. Then given $n \ge 1$,
\[
P\left[ \sum_{j=1}^{n} W_j \le \Big( \sum_{j=1}^{n} \beta_j \Big) / 2 \right] \le \exp\left( -\frac{3 \sum_{j=1}^{n} \beta_j}{28} \right). \tag{A.2}
\]
The proof can be found in Qian and Yang (2016a).

Lemma A.1.7. Suppose $\{\mathcal{F}_j, j = 1, 2, \ldots\}$ is an increasing filtration of $\sigma$-fields. For each $j \ge 1$, let $\epsilon_j$ be an $\mathcal{F}_{j+1}$-measurable random variable that satisfies $E(\epsilon_j \mid \mathcal{F}_j) = 0$, and let $W_j$ be an $\mathcal{F}_j$-measurable random variable that is bounded in absolute value by a constant $C > 0$ almost surely. If there exist positive constants $v$ and $c$ such that for all $k \ge 2$ and $j \ge 1$, $E(|\epsilon_j|^k \mid \mathcal{F}_j) \le k!\, v^2 c^{k-2}/2$, then for every $\epsilon > 0$ and every integer $n \ge 1$,
\[
P\left[ \sum_{j=1}^{n} W_j \epsilon_j \ge n\epsilon \right] \le \exp\left( -\frac{n\epsilon^2}{2C^2(v^2 + c\epsilon/C)} \right). \tag{A.3}
\]
Proof of Lemma A.1.7. Lemma A.1.7 is the same as Lemma 1 in Qian and Yang (2016a), and its proof can be found there.

A simplified version of Lemma A.1.7 can be stated as follows.

Corollary A.1.8. Let $\epsilon_1, \epsilon_2, \ldots$ be independent random variables satisfying the refined Bernstein condition, that is, there exist positive constants $v$ and $c$ such that for all $k \ge 2$ and $j \ge 1$, $E|\epsilon_j|^k \le k!\, v^2 c^{k-2}/2$. Let $I_1, I_2, \ldots$ be Bernoulli random variables such that $I_j$ is independent of $\{\epsilon_l : l \ge j\}$ for all $j \ge 1$. For any $\epsilon > 0$,
\[
P\left[ \sum_{j=1}^{n} I_j \epsilon_j \ge n\epsilon \right] \le \exp\left( -\frac{n\epsilon^2}{2(v^2 + c\epsilon)} \right). \tag{A.4}
\]
The proof of this result can be found in Yang and Zhu (2002).
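As a rough numerical illustration of Corollary A.1.4, the following Python sketch compares a Monte Carlo estimate of the lower-tail probability with the bound $\exp(-3\sum_j \beta_j / 28)$. The success probabilities, the number of trials, the seed and the function names are assumptions made only for this illustration; they are not taken from the dissertation.

```python
import math
import random


def half_sum_bound(betas):
    """Right-hand side of Corollary A.1.4: exp(-3 * sum(beta_j) / 28)."""
    return math.exp(-3.0 * sum(betas) / 28.0)


def empirical_lower_tail(betas, n_trials=20000, seed=0):
    """Monte Carlo estimate of P(sum_j W_j <= sum_j beta_j / 2)
    for independent W_j ~ Bernoulli(beta_j)."""
    rng = random.Random(seed)
    threshold = sum(betas) / 2.0
    hits = 0
    for _ in range(n_trials):
        # One realization of the sum of independent Bernoulli variables.
        total = sum(1 for b in betas if rng.random() < b)
        if total <= threshold:
            hits += 1
    return hits / n_trials


# Hypothetical success probabilities, chosen only for illustration.
betas = [0.3] * 40
bound = half_sum_bound(betas)
estimate = empirical_lower_tail(betas)
print("Monte Carlo estimate:", estimate)
print("Bound from Corollary A.1.4:", bound)
```

Under these assumed values the estimate should fall well below the bound, which is expected since the bound is not designed to be tight for any particular choice of the $\beta_j$.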
A.2 Notations

$\ell$                                          number of arms
$n$                                             generic end time point, arms being pulled until time $n$
$i^*$                                           arm corresponding to the maximum mean reward
$\hat{i}_j$                                     most promising arm so far based on the estimation procedure
$I_j$                                           arm chosen at the $j$th time point
$\eta, \eta_1, \eta_2$                          allocation strategy
$\theta$                                        unknown parameter used for parametric methods
$\mu_i$                                         mean reward for arm $i$
$\mu^*$                                         optimal mean reward ($\max_{1 \le i \le \ell} \mu_i$)
$\Delta_i$                                      $\mu^* - \mu_i$
$F_i$                                           unknown reward distribution of arm $i$
$Y_{i,j}$                                       reward for arm $i$ at the $j$th time point
$R_n(\eta)$                                     cumulative reward for allocation strategy $\eta$
$r_n(\eta)$                                     per-round regret for strategy $\eta$
$T_n(i)$                                        number of observations from arm $i$ up to time $n$
$X, x$                                          covariates: random, realized
$f_i(x)$                                        mean reward function for arm $i$ at covariate $x$
$f^*(x)$                                        optimal reward function at covariate $x$
$\epsilon_j$                                    error in the regression model for the $j$th case
$d_j$                                           delay in observing the reward for the $j$th case
$t_j$                                           time of observing the $j$th reward
$G_j$                                           cumulative distribution function for $d_j$
$q(n)$                                          lower bound for the expected number of observed rewards
$A_N$                                           set of observed indices up to time $N$
$X_n$                                           set of covariates observed until time $n$
$Z_n$                                           collection of past and present information used for estimation
$\tau_n$                                        number of observed rewards by time $n$
$h_n$                                           binwidth for the chosen nonparametric procedure
$\pi_n$                                         exploration probability sequence
$J_{i,n+1}$                                     set of observed indices by time $n$ corresponding to arm $i$
$Q_{n+1}(x), Q_{i,n+1}(x)$                      indices corresponding to rewards observed in a small cube containing $x$, pertaining to arm $i$
$M_{n+1}(x), M_{i,n+1}(x)$                      size of $Q_{n+1}(x)$, $Q_{i,n+1}(x)$
$m_0$                                           initialization count
$A, a_1, \tilde{a}_1, c, \bar{c}, L, c_5, v$    constants
$\mathcal{F}_n$                                 sigma-field
$\sigma_t$                                      time when the $t$th reward is observed
$M_\delta$                                      probabilistic bound on the maximum difference between consecutive observed rewards
$e_n$                                           uniform lower bound for the cumulative distribution function of delays over the covariate space
$\gamma_j$                                      indicator of whether the doctor disagrees with the algorithm
$T_j$                                           indicator of whether the doctor is allowed to make their decision for the $j$th special case
a.s.                                            almost surely
$M_n$                                           number of batches until time $n$
$d_k$                                           number of special cases in the $k$th batch
$m_k$                                           proportion of the $d_k$ special cases in which the doctor is allowed to make their decision
$\beta_k$                                       threshold for the $k$th batch
$\omega$                                        sample path for arms