Contextual Bandits with Delayed Feedback Using
Randomized Allocation
A THESIS
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Sakshi Arya
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
Doctor of Philosophy
Advisor: Dr. Yuhong Yang
May, 2020
© Sakshi Arya 2020
ALL RIGHTS RESERVED
Acknowledgements
I am very grateful to my advisor Professor Yuhong Yang for his unwavering support
and invaluable guidance in my research and in various aspects of graduate school. He
has always been very accommodating, encouraging and patient with me, for which I could
not thank him enough. His advice and insights have helped sustain my enthusiasm for
becoming a better researcher and teacher.
I would like to thank Professor Galin Jones for being the chair of my thesis committee
and for his consistent guidance and support. I would also like to thank my committee
members, Professors Xiaoou Li and Bjorn Berg for their consistent support and feedback
on research and its applicability; and thanks to Professor Lan Wang for serving on my
preliminary thesis committee. I am indebted to all the exceptional faculty who taught
me fundamental statistics and prepared me for research. I would like to acknowledge
Wei Qian for his help and support with my research and coding. I am grateful to
the School of Statistics and College of Liberal Arts for being extremely generous in
supporting my research over summers and for academic travel. A big thanks to the
staff, Taryn and Taylor for the administrative help and support.
These six years would not have been as joyful without the continued support and
encouragement of my fellow graduate students and friends. I cannot thank Dootika
and Haema enough, as this would not have been possible without their support and
friendship. The list is long but special thanks to Adam, Aaron, Christina, Dan, James,
Kaibo, Karl Oskar, Sanhita, Sarah and Ziyue. Thanks to my amazing mathematician
friends; Anuj, Rohit, Saumya and Arunima for being there with me on this journey. A
special thanks to my friends in India, who were just a phone call away all this time.
Lastly, I am forever grateful to my parents for their unconditional love and support.
Whatever I am today is because of them and their happiness means the world to me.
Abstract
Contextual bandit problems are important for sequential learning in various prac-
tical settings that require balancing the exploration-exploitation trade-off to maximize
total rewards. Motivated by applications in health care, we consider a multi-armed
bandit setting with covariates and allow for delay in observing the rewards (treatment
outcomes) as would most likely be the case in a medical setting. We focus on developing
randomized allocation strategies that incorporate delayed rewards using nonparametric
regression methods for estimating the mean reward functions. Although there has been
substantial work on handling delays in standard multi-armed bandit problems, the field
of contextual bandits with delayed feedback, especially with nonparametric estimation
tools, remains largely unexplored. In the first part of the dissertation, we study a sim-
ple randomized allocation strategy incorporating delayed feedback, and establish strong
consistency. Our setup is widely applicable as we allow for delays to be random and
unbounded with mild assumptions, an important setting that is usually not considered
in previous works.
We study how different hyperparameters controlling the amount of exploration and
exploitation in a randomized allocation strategy should be updated based on the extent
of delays and underlying complexities of the problem, in order to enhance the overall
performance of the strategy. We provide theoretical guarantees of the proposed method-
ology by establishing asymptotic strong consistency and finite-time regret bounds. We
also conduct simulations and real data evaluations to illustrate the performance of the
proposed strategies.
In addition, we consider the problem of integrating expert opinion into a randomized
allocation strategy for contextual bandits. This is also motivated by applications in
health care, where a doctor’s opinion is crucial in the treatment decision making process.
Therefore, although contextual bandit algorithms are proven to work both theoretically
and empirically in many practical settings, it is crucial to incorporate the doctor’s judgment
to build an adaptive bandit strategy. We propose a randomized allocation strategy
incorporating the doctor’s interventions and show that it is strongly consistent.
Contents
Acknowledgements i
Abstract ii
List of Tables vii
List of Figures viii
1 Introduction 1
1.1 The standard multi-armed bandit problem . . . . . . . . . . . . . . . . . 2
1.1.1 Exploration and exploitation dilemma . . . . . . . . . . . . . . . 2
1.2 Types of bandit problems . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Stochastic bandit problem . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Adversarial bandit problem . . . . . . . . . . . . . . . . . . . . . 4
1.3 Algorithms for standard MAB . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 ε-greedy policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 UCB algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Exponential weighting . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.4 Thompson sampling . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Contextual bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Parametric framework . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.2 Nonparametric framework . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Multi-armed bandits with delayed feedback . . . . . . . . . . . . . . . . 16
1.6 Delayed feedback bandit problem . . . . . . . . . . . . . . . . . . . . . . 17
1.6.1 Bayesian setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6.2 Stochastic setting . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6.3 Nonstochastic setting . . . . . . . . . . . . . . . . . . . . . . . . 22
1.7 Delayed anonymous feedback bandit problem . . . . . . . . . . . . . . . 23
1.7.1 Stochastic setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.7.2 Nonstochastic setting . . . . . . . . . . . . . . . . . . . . . . . . 24
1.8 Contextual bandits and health care . . . . . . . . . . . . . . . . . . . . . 24
1.8.1 Contextual bandits for adaptive clinical trials . . . . . . . . . . . 24
1.8.2 Contextual bandits for mobile health . . . . . . . . . . . . . . . . 26
1.9 Our contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2 Randomized allocation strategy for delayed nonparametric bandits 30
2.1 Problem setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2 The proposed strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.1 Consistency of the proposed strategy . . . . . . . . . . . . . . . . 33
2.3 The Histogram method . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 Allocation with histogram estimates . . . . . . . . . . . . . . . . 34
2.3.2 Number of observations in a small cube . . . . . . . . . . . . . . 35
2.3.3 Effects of reward delay distributions . . . . . . . . . . . . . . . . 38
2.4 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.1 Simulation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.1 Proof of consistency of the proposed strategy . . . . . . . . . . . 43
2.5.2 A probability bound for the histogram method . . . . . . . . . . 44
2.6 Supplementary simulation results . . . . . . . . . . . . . . . . . . . . . . 46
3 To update or not to update? 48
3.1 Problem setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2 The proposed strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Consistency of the proposed strategy . . . . . . . . . . . . . . . . . . . . 51
3.3.1 The histogram method . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.2 Allocation with histogram estimates . . . . . . . . . . . . . . . . 52
3.3.3 Kernel regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 Comparison of strategies, η1 and η2 . . . . . . . . . . . . . . . . . . . . . 58
3.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.5.1 The simulation process and results . . . . . . . . . . . . . . . . . 65
3.6 Other proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7 Supplementary simulation results . . . . . . . . . . . . . . . . . . . . . . 74
4 Finite-time analysis for randomized allocation strategies 78
4.1 Finite-time regret analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.1 Nadaraya-Watson regression . . . . . . . . . . . . . . . . . . . . . 81
4.2 Delays dependent on covariates . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3.1 Discussion on finite-time results . . . . . . . . . . . . . . . . . . . 91
4.4 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.1 Proofs of Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.2 Proofs of Theorems . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.4.3 Proof of Theorem 4.1.13 . . . . . . . . . . . . . . . . . . . . . . . 106
4.4.4 Proof outline for the case when delays depend on covariates . . . 109
4.5 Supplementary real-data results . . . . . . . . . . . . . . . . . . . . . . . 114
5 Doctor’s intervention in randomized allocation strategy 116
5.1 Problem setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.1.1 Regret and consistency . . . . . . . . . . . . . . . . . . . . . . . 119
5.2 Proposed allocation strategy . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2.1 Regression procedures . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 Consistency of the proposed strategy . . . . . . . . . . . . . . . . . . . . 124
5.3.1 Layout of the proof . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.2 A preliminary result . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.3 Scenario 1: doctor performs worse than the algorithm . . . . . . 126
5.3.4 Scenario 2: doctor performs at par with the algorithm . . . . . . 127
5.3.5 Consistency for the scenario 1: doctor performs worse . . . . . . 127
5.3.6 Consistency for scenario 2: doctor performs better . . . . . . . . 131
5.4 Proofs for Theorems 5.3.8 and 5.3.12 . . . . . . . . . . . . . . . . . . . . 132
6 Conclusion 138
References 141
Appendix A. Appendix 150
A.1 Concentration inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 150
A.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
List of Tables
1.1 Cumulative regret (Rn(δ)) upper bounds for different multi-armed bandit
policies in different settings. . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Regret bounds for multi-armed bandits with stochastic delayed rewards 20
1.3 Cumulative regret upper bounds for multi-armed bandits with nonstochas-
tic delayed feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
List of Figures
2.1 Per-round regret for the proposed strategy for different delay scenarios.
The grid of plots represents four different combinations of choices for $\{\pi_n\}$
and $\{h_n\}$. For a given row, $\pi_n$ remains fixed and $h_n$ varies, and vice versa
for columns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.2 Per-round regret averaged over 60 replications for the proposed strategy
in section 2.2 for different delay situations. $\pi_n = \log^{-2} n$ and $h_n$ decays
faster as we move from left to right. . . . . . . . . . . . . . . . . . . . . 42
2.3 Per-round regret averaged over 60 replications for the proposed strategy
in section 2.2 for different delay situations. The grid of plots represents
four different combinations of $\{h_n\}$ and $\{\pi_n\}$. For a given row, $\pi_n$ remains
fixed and $h_n$ varies, and vice versa for columns. . . . . . . . . . . . . . . 47
3.1 Strategy $\eta_1$ has lower cumulative average regret in setups 1 and 2 (first
two rows) and strategy $\eta_2$ has lower cumulative average regret in setups 3
and 4 (third and fourth rows). . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Each row represents a setup, with the first column depicting a one-dimensional
function used to generate the mean reward functions. The second and
third columns depict the average regret over time for delay 3 and delay
4 respectively. Here, $\{h_n\} = \log^{-1} n$, $\{\pi_n\} = \log^{-2} n$. . . . . . . . . . . . 75
3.3 Strategy $\eta_1$ has lower cumulative average regret in setups 1 and 2 (first
two rows) and strategy $\eta_2$ has lower cumulative average regret in setups 3
and 4 (third and fourth rows). Here, $\{h_n\} = n^{-1/4}$, $\{\pi_n\} = n^{-1/4}$. . . . . 76
3.4 Strategy $\eta_1$ has lower cumulative average regret in setups 1 and 2 (first
two rows) and strategy $\eta_2$ has lower cumulative average regret in setups 3
and 4 (third and fourth rows). Here, $\{h_n\} = n^{-1/4}$, $\{\pi_n\} = \log^{-1} n$. . . . 77
4.1 Boxplots of normalized CTRs for the three methods being compared.
Each column represents a particular delay scenario. . . . . . . . . . . . . 91
4.2 Boxplots with 200 replications show similar patterns as Figure 4.1. . . . 92
4.3 Boxplots of normalized CTRs for the three methods for 1000 rounds of
initialization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1 Flow chart of the allocation strategy . . . . . . . . . . . . . . . . . . . . 122
Chapter 1
Introduction
Imagine a treatment allocation problem in which patients arrive sequentially at times
$t = 1, 2, \ldots$, to be treated for a particular fatal disease. There are $\ell$ competing
treatments available for the disease. The objective is to maximize the expected total patient
lifetime. Treatment allocation is sequential in nature, i.e., the previously used treat-
ments and patient lifetimes could be used to guide the treatment decision for the current
patient. Although we might not have complete survival information on patients treated
so far, we can use information such as how long patients survived after treatment and
which have survived to the current time, and information on patient characteristics like
disease history, age, genetic information and gender, to guide treatment decisions at a
given time. This is an example of a contextual multi-armed bandit problem with delayed
responses, delayed because we might not have survival information for all the previous
patients when treating the current patient.
Bandit problems have been studied extensively in various fields including statistics,
computer science, mathematics, economics, finance, operations research and medicine.
We summarize the developments in the field in this chapter, but given the enormity
of the literature, this summary should by no means be considered exhaustive. To
explain the developments in the field in a chronological fashion, we first introduce the
general concept of multi-armed bandits in section 1.1, then summarize the literature
on contextual bandits in section 1.4, and finally discuss delayed rewards
in both context-free and contextual settings in section 1.5.
1.1 The standard multi-armed bandit problem
Multi-armed bandit (MAB) problems are sequential allocation problems aimed at
using “past and present information” to achieve certain goals in a sequential
manner. These problems were first introduced in the landmark paper by Robbins (1952).
Metaphorically, the idea comes from imagining a slot machine with multiple arms where
the goal is to optimize the total reward by strategically deciding the order of arms pulled.
The player at each time step has to decide whether to continue pulling the current arm
or try a different arm. Each arm sequentially generates random rewards based on a
probability distribution specific to that arm with unknown parameters. The objective
of the player is to maximize the sum of rewards generated through a sequence of arm
plays or equivalently, to minimize the cumulative regret (the shortfall of the reward of
the algorithm compared to the optimal). In order to achieve this objective, balancing
the exploration-exploitation trade-off plays an integral role.
1.1.1 Exploration and exploitation dilemma
The problem is prototypical of a general class of adaptive control problems in which
there is a fundamental dilemma between “exploration”/“information” (such as the need
to learn from all populations about their parameter values) and “exploitation”/“control”
(such as the objective of sampling only from the best population). The main challenge
of the bandit problem is that when we pull an arm, rewards of other arms are not
observed. Therefore, it is necessary to try all arms (explore) in order to form better
estimates. In an exploration step, the goal is to form unbiased samples by randomly
pulling all arms to improve the accuracy of the reward estimates for each arm. Because exploration does
not focus on the best arm, this step may lead to large immediate regret but can poten-
tially reduce regret for the future exploitation steps. During exploitation, the algorithm
suggests the best arm learned from the samples formed during exploration, and the arm
is pulled. The intention behind exploitation is to maximize immediate reward (or
minimize immediate regret). Therefore, there is an intrinsic trade-off between exploiting
the current knowledge to focus on the arm that seems to yield the highest rewards and
exploring the other arms to identify with better precision which arm is actually the
best. So the decision maker needs to come up with a strategy that does a good job in
balancing exploration and exploitation.
Definition 1.1.1. A policy or allocation strategy δ is an algorithm for choosing
the next arm to play based on the sequence of arms played so far and the rewards
obtained from them.
Different policies have been formulated based on different statistical approaches and
assumptions on data generating processes. Next, we discuss some of the types of problems
considered in the literature.
1.2 Types of bandit problems
We will discuss two fundamental formalizations of the bandit problem depending on the
assumed nature of the reward process: stochastic and adversarial. Different allocation
strategies have been developed for the two types. We will briefly discuss these strategies,
their regret bounds and other developments in this section.
1.2.1 Stochastic bandit problem
The standard setting of the bandit problem is defined as follows: let $Y_{i,j}$ denote
sequential rewards for $1 \le i \le \ell$ and $j \ge 1$, where $i$ indexes the arms and $j$
indexes the time points. Successive plays of arm $i$ yield rewards $Y_{i,1}, Y_{i,2}, \ldots$, which
are independent and identically distributed according to an unknown probability
distribution $F_i$ with unknown expectation $\mu_i$. Independence also holds across arms; i.e.,
$Y_{i,t}$ and $Y_{k,s}$ are independent for each $1 \le i < k \le \ell$ and each $s, t \ge 1$. The goal is to
find an efficient allocation strategy, which requires a way to determine whether a strategy
is ‘good’. This standard bandit problem, in which the rewards of a given arm are assumed
to be an i.i.d. sequence of random variables, falls under the category of stochastic bandit
problems. The standard approach is to compare the algorithm’s performance to the best
performance one could possibly achieve for a given problem if the mean rewards were known.
The best mean reward is denoted by $\mu^* := \max_{1 \le i \le \ell} \mu_i$. At each time step
$j = 1, 2, \ldots, n$, the decision maker selects an arm $I_j \in \{1, 2, \ldots, \ell\}$ and receives
reward $Y_{I_j, j}$. Then the regret of $\delta$ after $n$ arm plays is defined by
\[ R_n(\delta) = \mu^* n - E\Big( \sum_{j=1}^{n} \mu_{I_j} \Big), \]
where $\mu_{I_j}$ is the mean reward of the arm $I_j$ pulled by strategy $\delta$ at time $j$.
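To make the regret definition concrete, the following minimal Python sketch evaluates it for one fixed sequence of pulled arms, in which case the expectation is trivial. The function name `regret_of_pulls` and the numerical mean rewards are illustrative assumptions and do not come from the thesis.

```python
def regret_of_pulls(means, pulls):
    """Regret mu* n - sum_j mu_{I_j} for a fixed (non-random) sequence of pulls.

    means: list of mean rewards mu_i, one per arm (hypothetical values).
    pulls: list of 0-based arm indices I_1, ..., I_n.
    """
    best_mean = max(means)                        # mu* = max_i mu_i
    collected = sum(means[arm] for arm in pulls)  # sum_j mu_{I_j}
    return best_mean * len(pulls) - collected


# Three arms, the third being optimal; two of the four pulls are suboptimal.
example_regret = regret_of_pulls([0.2, 0.5, 0.9], [2, 2, 0, 1])
```

A strategy that always pulls the optimal arm has zero regret under this definition, while every suboptimal pull of arm $i$ adds $\mu^* - \mu_i$.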
The analysis of this stochastic bandit problem was pioneered in the seminal paper of
Lai and Robbins (1985), who introduced the technique of upper confidence bounds for
the asymptotic analysis of regret. Another popular heuristic is the $\epsilon$-greedy strategy,
which is commonly used and will be discussed in section 1.3. Another potential goal
of solving a multi-armed bandit problem could be to identify the best arm out of the
competing arms. The regret in this case is defined by the gap between the mean reward
of the optimal arm and the mean reward of the ultimately chosen arm. Details can be
found in Audibert and Bubeck (2010a).
1.2.2 Adversarial bandit problem
Another type of bandit problem is the adversarial, or non-stochastic, MAB problem.
In this problem, rather than a well behaved stochastic process controlling the rewards,
an adversary has full control over the rewards. No statistical assumptions are made on
the nature of the process generating the rewards of the arms. This formulation takes its
roots in game theory. This alternative formulation as an instance of the multi-armed bandit
problem was recognized by Auer et al. (1995). They describe the problem as gambling
in a rigged casino in which the owner sets the gains/rewards for the various arms in a
slot machine. The owner may observe the way a gambler plays in order to design even
more evil sequences of rewards. To formalize the bandit problem as a game between
a player choosing actions/arms and an adversary choosing the rewards associated with
each action, we assume that all rewards belong to the unit interval $[0, 1]$. The game is
played in a sequence of trials $t = 1, 2, \ldots, n$ and we assume $\ell$ possible actions/arms.
1. The adversary selects a vector $Y(t) \in [0, 1]^{\ell}$ of current rewards, where the $i$th
component is the reward associated with arm $i$ at time $t$.
2. Without knowledge of the rewards chosen by the adversary, the player chooses
an action/arm by picking an arm $I_t \in \{1, 2, \ldots, \ell\}$ and receives the corresponding
reward $Y_{I_t}(t)$.
3. Two possible scenarios:
• Full information game: The player observes the entire vector $Y(t)$ of current
rewards.
• Partial information game: The player observes only the reward $Y_{I_t}(t)$ for the
chosen action $I_t$.
The adversary is called oblivious if the rewards are independent of the actions/arms
chosen by the player. Otherwise, the adversary is called non-oblivious. The authors in
Auer et al. (1995) provide an algorithm/allocation rule for the adversarial bandit
problem with a non-oblivious adversary and show that, in a sequence of $n$ plays, the expected
per-round reward of the algorithm approaches that of the best arm at a rate of $O(n^{-1/3})$. The
authors claim that it is impossible to obtain an upper bound of $O(\log n)$, which is
considered to be optimal in stochastic bandit problems. Some standard algorithms for
adversarial MABs are Hedge for full information game and Exp3 (Exponential-weight
algorithm for Exploration and Exploitation) for partial information games. For a
detailed background, the reader is referred to Cesa-Bianchi and Lugosi (2006) and references
therein. Next, we review some of the commonly used algorithms for a standard multi-
armed bandit problem.
1.3 Algorithms for standard MAB
1.3.1 ε-greedy policy
As we have previously described, in a multi-armed bandit setting, a player is faced with
the challenge of choosing between competing arms/options. An amateur bandit player
could be tempted to choose the arm with the best estimated reward every single time.
This strategy is called the greedy strategy. However, the problem with playing greedily
is that one can fail to find the best action with positive probability, resulting in high
regret. A simple strategy to balance the exploration-exploitation trade-off is to fix $\epsilon > 0$,
go with the greedy choice with probability $1 - (\ell - 1)\epsilon$ and choose each of the other actions
with probability $\epsilon$; this is termed the $\epsilon$-greedy strategy. Even better, one could choose a
non-increasing sequence $\epsilon_j$ such that, at time $j$,
• with probability $1 - (\ell - 1)\epsilon_j$, play the arm with the highest empirical mean,
• with probability $\epsilon_j$ each, play one of the other $\ell - 1$ arms.
Theoretical guarantees for this heuristic/policy are provided by Auer et al. (2002a),
where the results are as follows: let $\Delta_i = \mu^* - \mu_i$ and $\Delta = \min_{i: \Delta_i > 0} \Delta_i$, and consider
$\epsilon_j = \min\big( \frac{6\ell}{\Delta^2 j}, 1 \big)$. When $j \ge \frac{6\ell}{\Delta^2}$, the probability of choosing a suboptimal arm $i$ is
bounded by $\frac{C}{\Delta^2 j}$ for some constant $C > 0$. As a consequence one gets the logarithmic
bound $E[T_n(i)] \le \frac{C}{\Delta^2} \log n$, where $T_n(i)$ is the number of times arm $i$ is played in $n$
plays. This results in
\[ R_n \le \sum_{i: \Delta_i > 0} \frac{C \Delta_i}{\Delta^2} \log n. \]
A drawback of this strategy is that it requires knowledge of $\Delta$ and it does not distinguish
between sub-optimal arms. Note that this becomes a purely greedy strategy when $\epsilon_j = 0$.
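The decaying $\epsilon_j$-greedy rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the exact policy analyzed by Auer et al. (2002a): rewards are simulated as Bernoulli variables, each arm is played once at the start, and the schedule $\epsilon_j = \min(c/j, 1/\ell)$ with a hypothetical constant `c` replaces the $\Delta$-dependent schedule so that $1 - (\ell - 1)\epsilon_j$ is always a valid probability. The function name and parameters are illustrative.

```python
import random


def epsilon_greedy(means, n_rounds, c=2.0, seed=0):
    """Simulate decaying epsilon-greedy on Bernoulli arms; return the pulled arms."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # T(i): number of pulls of each arm so far
    sums = [0.0] * n_arms   # total reward collected from each arm
    pulls = []
    for j in range(1, n_rounds + 1):
        if j <= n_arms:
            arm = j - 1     # initialization: play every arm once
        else:
            # Assumed schedule; capped at 1/l so the greedy probability is >= 0.
            eps = min(c / j, 1.0 / n_arms)
            greedy = max(range(n_arms), key=lambda i: sums[i] / counts[i])
            if rng.random() < (n_arms - 1) * eps:
                # Each non-greedy arm is chosen with probability eps.
                others = [i for i in range(n_arms) if i != greedy]
                arm = rng.choice(others)
            else:
                arm = greedy
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
    return pulls
```

Setting `c = 0` removes all exploration after the initial round, which recovers the purely greedy strategy mentioned in the text.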
1.3.2 UCB algorithm
Lai and Robbins (1985) introduced the UCB technique, and since then there has been
a myriad of developments and improvements to the policy, leading to its widespread
use. This policy is explained by the principle of optimism in the face of uncertainty.
This means that despite our lack of knowledge about which actions are best, we construct
an optimistic guess as to how good the expected reward of each arm is, and pick the
arm with the highest guess. Specifically, for each arm $i$ we define an upper confidence
bound on its expected reward: we want to know with high probability that the true
expected reward of the arm, $\mu_i$, is less than our prescribed upper bound. One general
way to do that is to use a concentration inequality such as the Chernoff-Hoeffding
inequality, which states that, for the average $\bar{Y}_i$ of $n$ i.i.d. rewards in $[0, 1]$ with mean
$\mu_i$ and any $a > 0$,
\[ P(\bar{Y}_i - \mu_i > a) \le e^{-2na^2}, \]
and, by symmetry, the same bound holds for $P(\bar{Y}_i - \mu_i < -a)$.
Let $Y_{i,j}$ be the reward variables for a single arm $i$ in the rounds for which we have
chosen $i$. Then $\bar{Y}_i$ is just the empirical average reward for action $i$. Let $T_n(i)$ be the
number of times arm $i$ is played in $n$ plays. Using $a = a(i, n) = \sqrt{2 \log(n) / T_{n-1}(i)}$, with
$T_{n-1}(i)$ in the role of the sample size, we get
that $P(\bar{Y}_i > \mu_i + a) \le n^{-4}$, which converges to zero very quickly as the number of rounds
$n$ played grows. The UCB policy says that, at time $n$, play the arm $i$ which maximizes
\[ \bar{Y}_{i,n-1} + \sqrt{\frac{2 \log n}{T_{n-1}(i)}}, \quad \text{where} \quad \bar{Y}_{i,n-1} = \frac{1}{T_{n-1}(i)} \sum_{j=1}^{T_{n-1}(i)} Y_{i,j}. \]
It can be shown that the regret satisfies
\[ R_n \le \sum_{i \ne i^*} \min\Big( \frac{10}{\Delta_i} \log n, \; n \Delta_i \Big), \]
as in Auer et al. (2002a). Several versions of UCB
algorithms can be designed by using different concentration inequalities, like the Bernstein
inequality, and other concepts such as KL divergence. Such developments can be found
in Audibert et al. (2009); Maurer and Pontil (2009); Audibert and Bubeck (2010b);
Lattimore and Szepesvári (2018) and references therein.
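The index rule described above can be sketched in a few lines of Python. This is a minimal simulation under assumptions that are not part of the thesis: rewards are Bernoulli, each arm is played once before the index is used (so that $T_{n-1}(i) > 0$), and the function name `ucb1` and its parameters are illustrative.

```python
import math
import random


def ucb1(means, n_rounds, seed=0):
    """Simulate the UCB index policy on Bernoulli arms; return the pulled arms."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # T_{n-1}(i): pulls of arm i before round n
    sums = [0.0] * n_arms   # total reward from arm i before round n
    pulls = []
    for n in range(1, n_rounds + 1):
        if n <= n_arms:
            arm = n - 1     # initialization: play every arm once
        else:
            def index(i):
                # Empirical mean plus the confidence width sqrt(2 log n / T_{n-1}(i)).
                return sums[i] / counts[i] + math.sqrt(2.0 * math.log(n) / counts[i])
            arm = max(range(n_arms), key=index)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
    return pulls
```

The confidence width shrinks for frequently played arms and grows slowly with $\log n$ for neglected ones, so a suboptimal arm is revisited only about $\log n$ times, in line with the logarithmic regret bound.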
1.3.3 Exponential weighting
Exponential weighting schemes are broadly used for balancing exploration and exploita-
tion in reinforcement learning. In the stochastic setting, one popular algorithm is the
Boltzmann exploration (Softmax) strategy. This is a simple version of exponential
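weighting which picks each action with probability proportional to the exponential of its
average reward, i.e., actions with higher average rewards are picked with higher probability.
At time $n$, let $p_i(n)$ be the probability of pulling arm $i$, $1 \le i \le \ell$, and let $\hat{\mu}_i(n)$
be the average reward observed so far from arm $i$. Pull arm $i$ at time $n$ with probability
\[ p_i(n) = \frac{\exp(\hat{\mu}_i(n)/\tau)}{\sum_{k=1}^{\ell} \exp(\hat{\mu}_k(n)/\tau)}, \]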
where τ is a tuning parameter. More details on this heuristic and regret bounds can be
found in Cesa-Bianchi and Fischer (1998); Sutton and Barto (2018).
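As a small illustration of the Boltzmann probabilities, the Python sketch below computes $p_i(n)$ from a list of average rewards. The function name and the example values are assumptions made for illustration; subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math


def boltzmann_probabilities(mean_estimates, tau):
    """Softmax arm probabilities exp(mu_i / tau) / sum_k exp(mu_k / tau), tau > 0."""
    top = max(mean_estimates)
    # Shifting by the maximum avoids overflow and cancels in the ratio.
    weights = [math.exp((mu - top) / tau) for mu in mean_estimates]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical average rewards for three arms; a smaller tau concentrates
# the probability mass on the arm with the highest average reward.
probs = boltzmann_probabilities([0.1, 0.5, 0.9], tau=0.2)
```

A large $\tau$ makes the probabilities nearly uniform (more exploration), while $\tau$ close to zero approaches the greedy choice.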
A more commonly used exponential weighting approach, perhaps more so in the
nonstochastic setting, is called “exponential-weight algorithm for exploration and ex-
ploitation” (Exp3). It works by maintaining a list of weights for each of the actions, using
these weights to decide randomly which action to take next, and increasing (decreasing)
the relevant weights when a payoff is good (bad).
1. Given $\gamma \in [0, 1]$, initialize the weights $w_i(1) = 1$ for $i = 1, \ldots, \ell$.
2. In each round $n$:
(a) Set $p_i(n) = (1 - \gamma) \frac{w_i(n)}{\sum_{k=1}^{\ell} w_k(n)} + \frac{\gamma}{\ell}$ for each $i$.
(b) Pull the next arm $I_n$ according to the distribution $(p_1(n), \ldots, p_\ell(n))$.
(c) Observe reward $y_{I_n}(n)$ and define the estimated reward to be $\hat{y}_{I_n}(n) =
y_{I_n}(n) / p_{I_n}(n)$.
(d) Set $w_{I_n}(n+1) = w_{I_n}(n) \exp(\gamma \hat{y}_{I_n}(n) / \ell)$ and $w_i(n+1) = w_i(n)$ for $i \ne I_n$.
Regret guarantees and theoretical details can be found in Auer et al. (2002b).
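The Exp3 steps listed above translate almost line by line into Python. The sketch below is a minimal illustration under assumptions: rewards lie in $[0, 1]$ and are supplied by a hypothetical callback `reward_fn(n, arm)`, the default $\gamma = 0.1$ is arbitrary, and the weights are rescaled by their maximum after each round, which does not change the probabilities in step (a) but avoids floating-point overflow.

```python
import math
import random


def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Run Exp3; reward_fn(n, arm) must return a reward in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms                      # step 1: w_i(1) = 1
    pulls = []
    for n in range(1, n_rounds + 1):
        total = sum(weights)
        # Step (a): mix the weight-based distribution with uniform exploration.
        probs = [(1.0 - gamma) * w / total + gamma / n_arms for w in weights]
        # Step (b): draw the arm I_n from the distribution p(n).
        arm = rng.choices(range(n_arms), weights=probs)[0]
        # Step (c): importance-weighted estimate of the observed reward.
        estimate = reward_fn(n, arm) / probs[arm]
        # Step (d): only the weight of the pulled arm is updated.
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescaling (an implementation choice) keeps the weights bounded.
        top = max(weights)
        weights = [w / top for w in weights]
        pulls.append(arm)
    return pulls, weights
```

Dividing the observed reward by $p_{I_n}(n)$ makes the estimate unbiased for the reward of every arm, which is what allows the algorithm to work with partial information.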
1.3.4 Thompson sampling
Thompson sampling is a Bayesian approach to the bandit problem and dates back to
Thompson (1933), who proposed a simple strategy for the case of Bernoulli bandits.
Let there be $\ell$ arms, where each arm $i$, when played, produces a reward of 1 with probability
$\mu_i$ (the mean reward) and a reward of zero with probability $1 - \mu_i$. Take the priors over
each $\mu_i$ to be beta-distributed with parameters $\alpha_i$ and $\beta_i$. At any time $n$, because of
the conjugacy property, each action’s posterior distribution is also a beta distribution
with parameters that can be updated according to the rule
\[ (\alpha_i, \beta_i) \leftarrow \begin{cases} (\alpha_i, \beta_i) & \text{if } I_n \ne i, \\ (\alpha_i, \beta_i) + (Y_{n,i}, 1 - Y_{n,i}) & \text{if } I_n = i. \end{cases} \]
Let $\pi_{i,n}$ be the posterior distribution for $\mu_i$ at the $n$th round, and let $\hat{\mu}_{i,n} \sim \pi_{i,n}$. The
chosen arm is then given by $I_n \in \arg\max_{i = 1, \ldots, \ell} \hat{\mu}_{i,n}$. Recently there has been a lot
of interest in this simple policy, and we now have a fairly good understanding of its
theoretical properties (Agrawal and Goyal (2012); Kaufmann et al. (2012)) and empirical
performances (Chapelle and Li (2011)).
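The Beta-Bernoulli update and sampling step can be sketched as follows. This is a minimal Python simulation under assumptions not taken from the thesis: every arm starts from a uniform Beta(1, 1) prior, rewards are simulated from the supplied means, and the function name is illustrative.

```python
import random


def thompson_bernoulli(means, n_rounds, seed=0):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors (assumed)."""
    rng = random.Random(seed)
    n_arms = len(means)
    alpha = [1.0] * n_arms   # prior/posterior "success" parameters
    beta = [1.0] * n_arms    # prior/posterior "failure" parameters
    pulls = []
    for _ in range(n_rounds):
        # Draw one sample from each arm's current posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        # Play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < means[arm] else 0
        # Conjugate update: only the played arm's parameters change.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls.append(arm)
    return pulls, alpha, beta
```

Exploration here comes from posterior uncertainty: an arm with few observations has a wide posterior and is therefore occasionally sampled above the current leader.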
Next, we discuss a more practical setting which broadens the applicability of the bandit
framework to a wide range of problems, namely contextual bandits or multi-armed
bandits with covariates.
1.4 Contextual bandits
In the literature covered in previous sections, no auxiliary information beyond the ob-
served rewards was considered when selecting which arm to play. However, in many
practical situations some additional information in the form of covariates can be uti-
lized for allocation purposes, and in such cases the reward distributions may depend on
this covariate. For example, in sequential treatment allocation, before deciding which
treatment to assign to a patient, we can observe covariates such as age, gender, genetic
information or severity of disease. Taking these covariates into account can help in
devising a personalized treatment allocation mechanism. Such a framework is called
contextual bandits or the multi-armed bandit problem with covariates (MABC). The
term contextual bandits was coined by Langford and Zhang (2008) and is the most
commonly used terminology these days. In this dissertation, we use both these names
interchangeably. More formally, in the slot machine analogy, the game player is given
a $d$-dimensional covariate $x \in \mathbb{R}^d$ before deciding which arm to pull, and the expected
reward of arm $i$ given covariate $x$ takes a functional form $f_i(x)$. Then, one way to
model the reward for the $i$th arm at the $j$th time point is by adopting regression procedures,
so we get the model
\[ Y_{i,j} = f_i(X_j) + \epsilon_{i,j}. \]
In most settings, the $\epsilon_{i,j}$’s are considered to be i.i.d. mean-zero random variables.
Based on the assumptions made on $f_i$, one could take either a parametric or a nonparametric
approach for the purpose of estimation. There is a vast amount of literature on
contextual bandits (see Bubeck and Cesa-Bianchi (2012) for a complete bibliography) and
we try to review some relevant work in the following sections.
1.4.1 Parametric framework
The most extensively studied parametric regression model is the linear regression model. In
this, we assume that $f_i(x) = \beta_i' x$, such that
\[ Y_{i,j} = \beta_i' X_j + \epsilon_{i,j}, \]
where $X_j, \beta_i \in \mathbb{R}^d$ and the $\epsilon_{i,j}$ are i.i.d. mean-zero random variables. Then the expected
regret of a learning algorithm $\delta$ at time $N$ would be defined as
\[ R_N(\delta) = \sum_{n=1}^{N} E\big[ Y_{i^*(X_n), n} - Y_{I_n, n} \big], \]
where $i^*(x) = \arg\max_i \beta_i' x$ and $I_n$ is the arm chosen by the algorithm at time $n$.
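To illustrate this regret in the linear model, the Python sketch below evaluates it for given contexts and a given sequence of chosen arms. Because the errors have mean zero, each term of the sum reduces to $\beta_{i^*(X_n)}' X_n - \beta_{I_n}' X_n$ once the context and the chosen arm are fixed. The function name and the numerical values are hypothetical and serve only as an example.

```python
def linear_bandit_regret(betas, contexts, chosen):
    """Regret of a fixed sequence of arm choices under Y_ij = beta_i' X_j + eps_ij.

    betas: list of coefficient vectors beta_i, one per arm (hypothetical values).
    contexts: list of covariate vectors X_1, ..., X_N.
    chosen: list of 0-based arm indices I_1, ..., I_N.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    total = 0.0
    for x, arm in zip(contexts, chosen):
        expected = [dot(beta, x) for beta in betas]  # beta_i' x for every arm
        total += max(expected) - expected[arm]       # gap to the best arm i*(x)
    return total


# Two arms in two dimensions: arm 0 is best for the first context and
# arm 1 for the second, so always choosing arm 1 costs 1.0 in round one.
example = linear_bandit_regret(
    betas=[[1.0, 0.0], [0.0, 1.0]],
    contexts=[[2.0, 1.0], [0.0, 3.0]],
    chosen=[1, 1],
)
```

Unlike the context-free case, the best arm changes with the covariate, so a policy that always plays a single arm can accumulate regret even if that arm is best on average.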
This problem was first addressed by Woodroofe (1979) who introduced the one-
armed bandit problem with covariates. In this setup, we let $(X_j, Y_{0,j}, Y_{1,j})$, $j \ge 1$,
denote a sequence of random vectors, where $X_j$ is a covariate at time step $j$ and $Y_{i,j}$ is the
reward from arm $i = 0$ or $i = 1$ that is obtained at time $j$. Suppose the $(X_j, Y_{0,j}, Y_{1,j})$
are i.i.d. copies of $(X, Y_0, Y_1)$. Woodroofe (1979) considered the problem in a Bayesian
setting under the assumption that $Y_1 = X - \theta + \epsilon$, where $\epsilon$ is a zero-mean random
variable with known distribution, independent of $X$. For a given prior distribution of
θ, he provided a description of the optimal Bayesian policy. These results were later
extended by Sarkar (1991) for the scenarios where the reward distribution belongs to a
one-parameter exponential family.
The problem was revisited by Goldenshluger et al. (2009), where the authors study
the minimax complexity of the one-armed bandit problem with covariates and establish
policies that achieve non-asymptotic lower bounds on the minimax regret (where the goal is to
minimize the maximal regret). Another related work is Goldenshluger and Zeevi (2013),
where a linear response bandit problem with a finite number of arms is considered in a
minimax setting and optimal rates for the proposed algorithm are established. They
propose an algorithm under certain assumptions, such as the $\epsilon_{i,j}$ being normally distributed and
a “margin” condition, and establish a regret bound of order $O(d^3 \log N)$ for it.
Recently, Bastani and Bayati (2015) have extended the algorithm of Goldenshluger and
Zeevi (2013) to the high-dimensional case where the vectors $\beta_i$ are sparse. In another
notable work, Bastani et al. (2017) show that, under mild conditions,
the greedy algorithm is rate optimal in cumulative regret for a two-armed linear bandit
model. For situations where the assumptions are hard to verify, they propose what they
call the ‘Greedy-First’ algorithm, which follows a greedy policy in the beginning and only
performs exploration when the data indicates the need for it. Furthermore, Filippi et al.
(2010) developed bandit policies for a generalized linear model framework.
There has been a significant amount of work using heuristics like UCB. A variant in which
the set of available arms changes from time step to time step, but has the same finite
cardinality in each step, was first studied by Auer et al. (2007) under the name “linear
reinforcement learning”, and later in the context of web advertisement by Li et al. (2010)
and Chu et al. (2011). Li et al. (2010) gave the LinUCB algorithm (Algorithm 1), which
is based on the linearity assumption $E(Y_i \mid X = x) = \beta_i' x$. It adds a confidence term to
each arm's current estimate, and the arm that maximizes this sum is chosen.
Algorithm 1 LinUCB (Li et al., 2010)
Inputs: α > 0
$A_i = I_d$, $b_i = 0_{d \times 1}$ for all i
for n = 1 to N do
    Observe covariate $x_n$.
    $\hat{\beta}_i = A_i^{-1} b_i$ for all i.
    $p_{n,i} = \hat{\beta}_i' x_n + \alpha \sqrt{x_n' A_i^{-1} x_n}$ for all i.    ▷ Upper confidence bound
    Take action $I_n = \arg\max_i p_{n,i}$ and observe reward $Y_{I_n,n}$.
    For $i = I_n$, update $A_i = A_i + x_n x_n'$, $b_i = b_i + Y_{I_n,n} x_n$.
end for
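The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation of Li et al. (2010): it stores $A_i^{-1}$ directly and updates it with the Sherman-Morrison identity instead of inverting $A_i$ each round (an implementation choice made here), and the simulated environment (uniform covariates, Gaussian noise, the names `LinUCBArm` and `run_linucb`) is assumed purely for demonstration.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in M]


class LinUCBArm:
    """Ridge statistics for one arm; A^{-1} is stored directly (A starts at I_d)."""

    def __init__(self, d):
        self.A_inv = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
        self.b = [0.0] * d

    def ucb(self, x, alpha):
        beta_hat = mat_vec(self.A_inv, self.b)                  # beta_hat = A^{-1} b
        quad = max(0.0, dot(x, mat_vec(self.A_inv, x)))         # x' A^{-1} x
        return dot(beta_hat, x) + alpha * math.sqrt(quad)       # estimate + confidence term

    def update(self, x, y):
        # Sherman-Morrison: (A + x x')^{-1} = A^{-1} - (A^{-1}x)(A^{-1}x)' / (1 + x'A^{-1}x)
        Ax = mat_vec(self.A_inv, x)
        denom = 1.0 + dot(x, Ax)
        for r in range(len(x)):
            for c in range(len(x)):
                self.A_inv[r][c] -= Ax[r] * Ax[c] / denom
            self.b[r] += y * x[r]                               # b = b + y x


def run_linucb(betas, n_rounds, alpha=1.0, noise_sd=0.1, seed=0):
    """Simulate LinUCB on a hypothetical linear environment; return (regret, arms)."""
    rng = random.Random(seed)
    d = len(betas[0])
    arms = [LinUCBArm(d) for _ in betas]
    regret = 0.0
    for _ in range(n_rounds):
        x = [rng.random() for _ in range(d)]                    # observe covariate x_n
        scores = [arm.ucb(x, alpha) for arm in arms]
        chosen = max(range(len(arms)), key=lambda i: scores[i])
        y = dot(betas[chosen], x) + rng.gauss(0.0, noise_sd)    # observe reward
        arms[chosen].update(x, y)
        means = [dot(beta, x) for beta in betas]
        regret += max(means) - means[chosen]
    return regret, arms


regret, arms = run_linucb(betas=[[1.0, 0.0], [0.0, 1.0]], n_rounds=2000)
print(f"cumulative regret after 2000 rounds: {regret:.2f}")
```

Only the chosen arm's statistics are updated in each round, which mirrors the last line of the loop in Algorithm 1.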
A slightly different setting was studied by Dani et al. (2008), Abbasi-Yadkori et al.
(2011), Rusmevichientong and Tsitsiklis (2010), where they study linear contextual
bandits for the case when the set of available actions does not change between time
steps but the set can be an almost arbitrary, even infinite, bounded subset of a finite-
dimensional vector space.
Agrawal and Goyal (2013b) and Russo and Van Roy (2014) studied linear contextual
bandits in the Bayesian setting using Thompson sampling and established regret
guarantees. Going beyond the world of linearity, Agarwal et al. (2012) consider a setting
where the mean reward functions are assumed to lie in a general class of finitely many
members. Agrawal and Goyal (2013a) analyze Algorithm 2, which chooses a Gaussian
prior with mean zero and covariance matrix $\sigma^2 I_{d \times d}$. They assume that $Y_i \mid x \sim N(\beta_i' x, \sigma^2)$
and, as a result, obtain a Gaussian posterior by conjugacy. The algorithm then draws a
sample $\tilde{\beta}_i$ from the posterior distribution of each arm and chooses the action with the
highest sampled reward $\tilde{\beta}_i' x_n$. For $\epsilon \in (0, 1)$, they prove a regret bound of
$O(d\sqrt{N^{1+\epsilon}}(\log N \log(1/\delta)))$ with
probability 1 − δ. A big advantage of using Thompson Sampling is that it does not
require the covariates to be i.i.d., an assumption that often fails in practical settings.
Soare et al. (2014) study the problem of best-arm identification in linear bandits. The
corresponding regret rates for much of the work discussed above are tabulated in Table 1.1.
Algorithm 2 Thompson Sampling (Agrawal and Goyal, 2013b)
Input: $\sigma^2$ (variance parameter used in the prior)
$A_i = I_{d \times d}$, $b_i = 0_{d \times 1}$ for all i
for n = 1 to N do
    Compute $\hat{\beta}_i = A_i^{-1} b_i$ for all arms i.
    Sample $\tilde{\beta}_i$ from $N(\hat{\beta}_i, \sigma^2 A_i^{-1})$ for all i.    ▷ Sample from the posterior
    Take action $I_n = \arg\max_i \tilde{\beta}_i' x_n$ and observe reward $Y_{I_n,n}$.
    For $i = I_n$, update $A_i = A_i + x_n x_n'$, $b_i = b_i + Y_{I_n,n} x_n$.
end for
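A plain-Python sketch of Algorithm 2 is given below. It is illustrative only: the posterior draw uses a hand-written Cholesky factor of $\sigma^2 A_i^{-1}$, $A_i^{-1}$ is maintained with the Sherman-Morrison identity (both implementation choices made here), and the simulated environment and the names `cholesky`, `sample_posterior` and `run_thompson` are assumptions for demonstration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


def cholesky(M):
    """Lower-triangular L with L L' = M, for a symmetric positive definite M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(L[r][k] * L[c][k] for k in range(c))
            if r == c:
                L[r][c] = math.sqrt(M[r][r] - s)
            else:
                L[r][c] = (M[r][c] - s) / L[c][c]
    return L


def sample_posterior(beta_hat, A_inv, sigma, rng):
    """Draw beta_tilde ~ N(beta_hat, sigma^2 A^{-1}) as beta_hat + L z, z standard normal."""
    d = len(beta_hat)
    cov = [[sigma ** 2 * A_inv[r][c] for c in range(d)] for r in range(d)]
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [beta_hat[r] + dot(L[r], z) for r in range(d)]


def run_thompson(betas, n_rounds, sigma=0.2, seed=0):
    """Simulate linear Thompson sampling on a hypothetical environment; return regret."""
    rng = random.Random(seed)
    d, n_arms = len(betas[0]), len(betas)
    A_inv = [[[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
             for _ in range(n_arms)]
    b = [[0.0] * d for _ in range(n_arms)]
    regret = 0.0
    for _ in range(n_rounds):
        x = [rng.random() for _ in range(d)]
        scores = []
        for i in range(n_arms):
            beta_hat = mat_vec(A_inv[i], b[i])
            beta_tilde = sample_posterior(beta_hat, A_inv[i], sigma, rng)
            scores.append(dot(beta_tilde, x))          # sampled reward, not posterior mean
        chosen = max(range(n_arms), key=lambda i: scores[i])
        y = dot(betas[chosen], x) + rng.gauss(0.0, sigma)
        Ax = mat_vec(A_inv[chosen], x)                 # Sherman-Morrison update of A^{-1}
        denom = 1.0 + dot(x, Ax)
        for r in range(d):
            for c in range(d):
                A_inv[chosen][r][c] -= Ax[r] * Ax[c] / denom
            b[chosen][r] += y * x[r]
        means = [dot(beta, x) for beta in betas]
        regret += max(means) - means[chosen]
    return regret


print(f"cumulative regret: {run_thompson([[1.0, 0.0], [0.0, 1.0]], 2000):.2f}")
```

Exploration here comes entirely from the randomness of the posterior draw, in contrast to the explicit confidence term used by LinUCB.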
1.4.2 Nonparametric framework
The first work to venture outside the realm of parametric modeling assumptions was by
Yang and Zhu (2002). They show that using nonparametric estimation techniques like
histogram and K-nearest neighbor methods, the function estimation is strongly consistent.
As a result, the cumulative reward of their randomized allocation rule (similar
to an ε-greedy policy) is asymptotically equivalent to the optimal cumulative reward.
However, the property of strong consistency does not address the issue of how quickly
the total reward based on the allocation strategy approaches the optimal one. The
allocation rule used in this work is essentially a variant of the ε-greedy policy and will
be discussed in chapter 2 as it will lay the ground for our proposed work. This notion
of reward strong consistency as in Yang and Zhu (2002) was then established by May
et al. (2012) for their Bayesian sampling method. The question of establishing a more
refined notion than Yang and Zhu (2002) of optimality for nonparametric MABC was
addressed in Rigollet and Zeevi (2010), where they proposed UCB-type policies. They
derive near-optimal bounds on the regret in the case of a two-armed bandit problem
under only two assumptions on the underlying functional form that governs the arm
rewards. The two conditions are,
1. Smoothness condition: We say that the mean reward functions satisfy the Hölder smoothness
condition with parameters (κ, ρ) if each $f_i$ satisfies
$$|f_i(x) - f_i(x')| \le \rho \|x - x'\|^{\kappa} \quad \forall x, x' \in \mathcal{X},\ i = 1, \ldots, \ell,$$
for some κ ∈ (0, 1] and ρ > 0.
2. Margin condition: Given $x \in \mathcal{X}$, define $f^\sharp(x)$ to be
$$f^\sharp(x) = \begin{cases} \max_{1 \le i \le \ell}\{f_i(x) : f_i(x) < f^*(x)\} & \text{if } \min_{1 \le i \le \ell} f_i(x) < f^*(x), \\ f^*(x) & \text{otherwise.} \end{cases}$$
f is said to satisfy the margin condition if there exist $\alpha \in (0, d/\kappa]$, $t_0 \in (0, 1)$ and
$c_0 > 0$ such that
$$P_X\left(0 < f^*(X) - f^\sharp(X) \le t\right) \le c_0 t^{\alpha} \quad \text{for all } t \in [0, t_0].$$
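As a quick illustration of the two conditions, consider a hypothetical example that is not taken from the cited papers: d = 1, X uniform on [0, 1], and two arms with $f_1(x) = x$ and $f_2(x) = 1 - x$, where $f^*$ is taken to be the pointwise maximum over arms (an assumption about the notation).

```latex
|f_i(x) - f_i(x')| = |x - x'|
  \quad\Rightarrow\quad \text{the smoothness condition holds with } \kappa = 1,\ \rho = 1,
\\[4pt]
f^*(x) - f^\sharp(x) = \max(x, 1 - x) - \min(x, 1 - x) = |2x - 1| \quad (x \neq 1/2),
\\[4pt]
P_X\left(0 < f^*(X) - f^\sharp(X) \le t\right)
  = P\left(0 < |2X - 1| \le t\right) = t \quad \text{for } t \in [0, 1].
```

Under these assumptions the margin condition holds with $c_0 = 1$ and $\alpha = 1$ for any $t_0 \in (0, 1)$, and $\alpha = 1 = d/\kappa$ lies in the permitted range.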
The margin condition encodes the “separation” between the functions that describe
the arms’ responses and was originally studied by Goldenshluger et al. (2009) in the
one armed bandit problem. Rigollet and Zeevi (2010) introduced the idea of binning
covariates which is an intuitive concept and could be described using the example of
clinical trials: patients are segmented into groups with similar characteristics and then
the treatment is allocated based on the responses over that group. This setup of the
two-armed bandit problem was extended by Perchet and Rigollet (2013) to the ℓ-armed
bandit problem with covariates when ℓ may be large. They use successive arm elimination
algorithms to establish a regret upper bound, $O(N^{(\kappa+d-\alpha\kappa)/(2\kappa+d)})$, of the same order as
the minimax lower bound of a two-armed MABC problem in Rigollet and Zeevi (2010).
Subsequently, Qian and Yang (2016a) use “chaining” arguments to show uniform strong
consistency in estimation using kernel methods, along with providing a finite-time regret
analysis. Given the Hölder smoothness parameter κ and total time horizon N, the
expected cumulative regret upper bound for the proposed policy is $O(N^{(2\kappa+d)/(3\kappa+d)})$, which is
slightly sub-optimal and might reflect a theoretical limitation of the proposed algorithm.
Another contribution of this paper is that it introduces a fully data-driven model-combining
technique to help choose the best estimation method for each arm, integrated
in the randomized allocation strategy for MABC. Motivated by the observation that using
a randomized allocation strategy alone may give a sub-optimal rate for the cumulative
regret, the authors in Qian and Yang (2016b) propose an algorithm (RAEE) which embeds
the arm-elimination technique of Perchet and Rigollet (2013) into the randomized
allocation strategy. They show that near minimax regret upper bounds can be achieved
without prior knowledge of the smoothness parameter. In particular, they use Lepski's
method to adaptively estimate the smoothness parameter under a “self-similarity”
condition.
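To make the gap between the two rates concrete, the exponents quoted above can be evaluated for specific parameter values. This is a small arithmetic sketch; the function names are hypothetical, and it compares only the exponents as stated here, without accounting for the different assumptions under which each bound is proved.

```python
def arm_elimination_exponent(kappa, d, alpha):
    """Exponent (kappa + d - alpha*kappa) / (2*kappa + d) of the arm-elimination rate."""
    return (kappa + d - alpha * kappa) / (2 * kappa + d)


def randomized_allocation_exponent(kappa, d):
    """Exponent (2*kappa + d) / (3*kappa + d) of the randomized allocation rate."""
    return (2 * kappa + d) / (3 * kappa + d)


# Lipschitz rewards (kappa = 1) with a one-dimensional covariate (d = 1).
for alpha in (0.01, 0.5, 1.0):
    e1 = arm_elimination_exponent(1.0, 1, alpha)
    e2 = randomized_allocation_exponent(1.0, 1)
    print(f"alpha={alpha}: N^{e1:.3f} versus N^{e2:.3f}")
```

For κ = 1, d = 1 and α = 1, the exponents are 1/3 and 3/4 respectively, and the first exponent stays below the second even as α approaches zero.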
Note that, in the smoothness condition, κ ≤ 1 corresponds to non-differentiable
functions, a weaker form of Lipschitz continuity (κ = 1). Also, κ = ∞ corresponds
to infinitely-extrapolable functions such as linear and other parametric functions. Hu
et al. (2019) develop a novel algorithm that bridges the gap between the infinitely-smooth
linear response bandit and the non-smooth, non-differentiable response bandit. They
characterize the smoothness of the mean reward functions in terms of the highest order
of continuous derivatives. They propose an algorithm for every level of smoothness
1 ≤ κ < ∞ and prove that it achieves the minimax optimal regret rate up to polylogs,
$\tilde{O}(N^{(\kappa+d-\alpha\kappa)/(2\kappa+d)})$. Other works on nonparametric bandits include Fontaine et al. (2019)
who incorporate regularization in contextual bandits and Wanigasekara and Yu (2019)
who deal with nonparametric bandits in an unknown metric space.
A slightly different approach is taken by Langford and Zhang (2008), who
impose neither linearity nor any smoothness assumption on the mean reward function.
Instead, they fix a class of policies, H, and then aim to minimize the expected
regret relative to the class H. They do not require knowledge of the time horizon N,
as they run exploration-exploitation steps in epochs (batches) with a sample-dependent
exploitation step, such that the resulting regret bound is no more than three times
the regret for known N. The regret bound of Epoch-Greedy with a finite class H is
$O(N^{2/3}(\log|H|)^{1/3})$. However, the authors note that for an infinite class H with finite
VC dimension, a similar regret bound could be shown. An advantage of using
this methodology is that it does not make assumptions like the margin condition on
the underlying reward generating functions. Additionally, the authors show that upon
imposing some assumptions (such as gap between best and second best bandit) on the
Epoch-Greedy algorithm, one could achieve the regret of the form O(logN).
A different but related setting to MABC problem considers the arm space (with
possibly infinitely many arms) instead of the covariate space. For reference, see Auer
et al. (2007), Kleinberg et al. (2008). Plug-in type policies, explained for the full information
case in Audibert et al. (2007), have gained popularity in the context of continuum-armed
bandits (with uncountably many arms). See Slivkins (2014) for reference. These bandit
problems consider joint covariate and arm space. Some contextual bandit policies, such
as the ones proposed in Beygelzimer et al. (2011); Langford and Zhang (2007), allow for
adversarially chosen covariates and establish regret bounds in this more general setting.
Setting                      Regret upper bound                                          Reference
Standard stochastic MAB      $O(\log n)$ *
Adversarial MAB              $\tilde{O}(\sqrt{\ell n})$ *
Linear contextual bandits    $O(d^3 \log n)$                                             Goldenshluger and Zeevi (2013)
                             $O(d \log n \sqrt{n \log(n/\delta)})$                       Dani et al. (2008)
                             $O(d \log n \sqrt{n} + \sqrt{dn \log(n/\delta)})$           Abbasi-Yadkori et al. (2011)
                             $O(d \sqrt{n^{1+\epsilon}} (\log n \log(1/\delta)))$        Agrawal and Goyal (2013a)
Contextual bandit            $O(n^{2/3} (\log|H|)^{1/3})$                                Langford and Zhang (2007)
Nonparametric bandits        $O(n^{(\kappa+d-\alpha\kappa)/(2\kappa+d)})$ *              Perchet and Rigollet (2013)
                             $O(n^{(2\kappa+d)/(3\kappa+d)})$                            Qian and Yang (2016a)
                             $\tilde{O}(n^{(\kappa+d-\alpha\kappa)/(2\kappa+d)})$        Qian and Yang (2016b)
Smooth contextual bandits    $\tilde{O}(n^{(\kappa+d-\alpha\kappa)/(2\kappa+d)})$        Hu et al. (2019)

Table 1.1: Cumulative regret ($R_n(\delta)$) upper bounds for different multi-armed bandit
policies in different settings.
d denotes the covariate dimension; δ ∈ (0, 1) is such that the regret bound holds with probability 1 − δ;
ε is a fixed number in (0, 1); |H| is the size of the policy set in Epoch-Greedy; κ is the smoothness
parameter; and α is the parameter from the margin condition. * denotes rates that have been shown to
be minimax optimal; $\tilde{O}$ signifies some missing log or polylog terms.
Next, we discuss a pertinent issue that arises in most practical situations where
multi-armed bandits find application, namely, delay in observing the rewards.
1.5 Multi-armed bandits with delayed feedback
Delays in observing rewards in bandits manifest themselves in different ways in various practical
settings. Many of these have been identified and addressed in the literature. We
try to review them in this section. Online advertising is one of the main application
areas of multi-armed bandit algorithms. Typically, a user is regularly shown advertise-
ments on social media or other websites by a bandit algorithm. When a user clicks
on an advertisement or buys a product, the bandit algorithm updates and takes this
information into account in future recommendations. However, usually the click or the
user feedback is not instantaneous but might be received some time after the algorithm
presented the advertisement. Another common example where delayed rewards are no-
ticed is in applications in health care, like allocating treatments for a particular disease.
In this case, one would most likely not have observed results of previous patients before
making current treatment decisions for a specific disease. Therefore, not all information
is available to make real-time decisions and hence delays are expected to play a crucial
role in determining the performance of such a sequential treatment allocation scheme.
In some cases, the algorithm (player) receives the delayed feedback in the shape
of arm-reward pairs, in which the player knows both the reward observed and which
arm generated it. This is called the delayed feedback bandit problem. However, in some
online situations, it is not possible to distinguish which advertisement corresponds to
the observed delayed reward, that is, one does not have any information on which ad-
vertisement was actually responsible for attracting user’s interest. This, seemingly a
harder problem, is known as delayed anonymous feedback bandit problem. The complex-
ity of these problems could then vary depending on the assumptions made on delays,
ranging from fixed known delays to random unbounded delays. Another dimension of
complexity would be the contextual bandits with delayed feedback case, which is where
our research contribution lies.
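The difference between the two feedback models can be made concrete with a small simulation. The sketch below is illustrative only: the function `simulate_feedback` and the toy arms, rewards and delays are hypothetical, and delays are taken to be nonnegative integers so that a reward generated at time t with delay τ is observed at time t + τ.

```python
from collections import defaultdict


def simulate_feedback(arms_played, rewards, delays):
    """Return what the learner observes at each time under the two feedback models."""
    labelled = defaultdict(list)    # arrival time -> list of (arm, reward) pairs
    anonymous = defaultdict(float)  # arrival time -> aggregated reward, arm identity lost
    for t, (arm, y, tau) in enumerate(zip(arms_played, rewards, delays), start=1):
        arrival = t + tau
        labelled[arrival].append((arm, y))  # delayed feedback: arm-reward pairs
        anonymous[arrival] += y             # delayed anonymous feedback: only the sum
    return dict(labelled), dict(anonymous)


labelled, anonymous = simulate_feedback(
    arms_played=[0, 1, 0], rewards=[1.0, 0.5, 2.0], delays=[1, 0, 2]
)
print(labelled)   # {2: [(0, 1.0), (1, 0.5)], 5: [(0, 2.0)]}
print(anonymous)  # {2: 1.5, 5: 2.0}
```

At time 2 the anonymous learner sees only the total 1.5 and cannot tell how much of it came from arm 0 or arm 1, which is what makes that variant seemingly harder.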
1.6 Delayed feedback bandit problem
As mentioned in section 1.5, the delayed feedback bandit problem arises when one observes
rewards at a delayed time along with the knowledge of the arm it corresponds to. For
example, this is a scenario one would expect to see in an adaptive treatment allocation
setting, where you would keep track of which outcome corresponds to which treatment.
1.6.1 Bayesian setting
One of the most extensively studied ways of tackling a bandit problem is through the
use of Bayesian methods. Anderson (1964) and Suzuki (1966) highlighted the impor-
tance of considering delays for the choice of optimal sequential decision procedures and
characterized Bayes solutions of sequential decision problems under delayed feedback.
Motivated by ethical and practical issues in the designs of sequential clinical trials, a ban-
dit process with delayed responses was studied by Eick (1988b,a), where the existence of
optimal Gittins indices is shown under some conditions. The responses considered were
patients’ survival times after the treatment, which might be censored upon their obser-
vation, with the objective of maximizing the total discounted expected survival times.
More recently, Chapelle and Li (2011) along with providing deeper insights through an
empirical comparison between Thompson Sampling and UCB algorithm, conducted an
empirical study to illustrate robustness of Thompson sampling in the case of constant
delayed feedback.
1.6.2 Stochastic setting
A substantial amount of work that has been done in this direction assumes fixed and
bounded delays. In the last section of their paper, Dudik et al. (2011) considered a
constant known delay which resulted in an additional additive penalty in the regret for
contextual bandits with finite decision sets. A more systematic study of online learning
problems with delayed feedback was conducted by Joulani et al. (2013), who developed
meta-algorithms that, in a black-box fashion, convert algorithms developed for the
non-delayed case into ones that can handle delays in the feedback loop. They propose
Algorithm 3 (QPM-D), which uses any bandit algorithm for the non-delayed case
(which they call BASE) to tackle delays. They create buffers (queues) Q[i] for each
arm i. While rewards for the arms chosen are available in the queues, the BASE algorithm
runs and makes further predictions. When no feedback is available for a given
arm (that is, Q[I] is empty for some arm I), the algorithm keeps choosing the same
arm until a reward is observed for that arm. Their results show that the price of delay
leads to a multiplicative increase in the regret in adversarial problems and an additive
increase in stochastic problems. Their results only hold for finite side information sets
and not for a fully contextual setting. Following this notable work, Mandel et al. (2015)
devised a method that guarantees good black-box algorithms when leveraging a prior
dataset and incorporating a heuristic to help improve empirical performance while re-
taining the strong theoretical guarantees of Joulani et al. (2013). Another approach
to handling delays in bandits was by Desautels et al. (2014). They analyzed the case
of Gaussian Process Bandits, developed algorithms for parallelizing
exploration-exploitation trade-offs, and provided regret bounds for them.
Another similar setting in online learning is motivated by delayed conversions in
advertising and product recommendations on e-commerce websites (Chapelle (2014)).
Conversion is a generic term used to refer to a user's buying decision. It occurs when the
reward that is immediately obtained (for example, click) is a proxy for an actual outcome
(for example, a corresponding sale) which might take hours or days to happen. For this
setting, Vernade et al. (2017) consider potentially infinite stochastic delays, resulting in
some feedback being censored (not observable), which happens in online settings due
to limited memory. The strategy proposed in Joulani et al. (2013) does not work in
this setting because their algorithm acts like a queuing mechanism where the number
of draws of an arm as well as the cumulated sum of the subsequent rewards are only
updated when the observation reaches the learner. However, in this setting (Joulani
et al. (2013)), the associated reward corresponding to a click is 1, so the cumulated sum
would just be 1 and would not allow arms to be compared. Vernade et al. (2017) develop
UCB-based strategies; their analysis assumes prior knowledge of the delay distribution
and does not handle the contextual case. Subsequently, Vernade et al. (2018) extend
this work (delayed conversions with censoring) to the contextual case. They impose
a linearity assumption when modeling the rewards as a function of the covariates. The
algorithm they develop is a delayed version of LinUCB, named DeLinUCB, which
can handle ambiguous delayed feedback. Another major improvement over Vernade
et al. (2017) is that they do not require any prior knowledge of the delay distribution
or the conversion probability. They assume bounded scalar rewards,
bounded coefficients of any action and bounded noise (mostly considering Bernoulli arms).
Although not directly related to bandits, a relevant line of work is to devise methods
for conversion rate prediction, such as Chapelle (2014) and Yoshikawa and Imai (2018).
Both use two models: one for the time delay between click and conversion, and
another, a classification model, for predicting conversion. Both use survival analysis tools
for the time delays as there is censoring involved, with the latter extending the generalized
linear model of the former to a nonparametric model.
Recently, Zhou et al. (2019) design a delay-adaptive algorithm for generalized linear
contextual bandits using UCB-style exploration. To our knowledge, no previous work
has addressed adapting for delayed feedback in nonparametric contextual bandits, and
that is the contribution of this dissertation work. Also, there has been no work on using
randomized policies for dealing with delayed contextual bandits, and our work seems to
be the first one addressing that as well. The regret rates for each of the works listed
above are tabulated in Table 1.2.
Algorithm 3 Queued Partial Monitoring with Delays (QPM-D) (Joulani et al., 2013)
Create an empty buffer Q[i] for each arm i ∈ {1, . . . , ℓ}.
Let I be the first arm predicted by the BASE algorithm.
for each time instant n = 1, 2, . . . , N do
    Prediction step:
    while Q[I] is not empty do
        Update BASE with a reward from Q[I].
        Let I be the next arm predicted by BASE.
    end while
    Q[I] is empty (no available reward for I), so predict $I_n = I$ at time instant n to get
    a reward.
    Update step:
    for each $(s, Y_s) \in Z_n$ do    ▷ $Z_n$: set of arms and rewards observed at time n
        Add the reward $Y_s$ to the buffer $Q[I_s]$.
    end for
end for
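The queueing logic of Algorithm 3 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of Joulani et al. (2013): the BASE policy is taken to be a simple UCB1-style index rule, rewards are simulated as Bernoulli draws, and delays come from a user-supplied function `draw_delay`; all of these names and choices are hypothetical.

```python
import math
import random
from collections import deque


class UCB1:
    """A simple non-delayed BASE policy (an illustrative choice of BASE)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def predict(self):
        for i, c in enumerate(self.counts):
            if c == 0:
                return i                      # try every arm once first
        t = sum(self.counts)
        return max(range(len(self.counts)),
                   key=lambda i: self.sums[i] / self.counts[i]
                   + math.sqrt(2.0 * math.log(t) / self.counts[i]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def qpm_d(means, draw_delay, n_rounds, seed=0):
    """Run the queue-based wrapper around BASE; return the number of pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(means)
    base = UCB1(n_arms)
    queues = [deque() for _ in range(n_arms)]  # buffer Q[i] for each arm
    arrivals = {}                              # arrival time -> list of (arm, reward)
    pulls = [0] * n_arms
    current = base.predict()                   # first arm predicted by BASE
    for n in range(1, n_rounds + 1):
        # Prediction step: feed BASE buffered rewards until it asks for an arm
        # whose queue is empty.
        while queues[current]:
            base.update(current, queues[current].popleft())
            current = base.predict()
        pulls[current] += 1                    # play I_n = current
        reward = 1.0 if rng.random() < means[current] else 0.0
        arrivals.setdefault(n + draw_delay(rng), []).append((current, reward))
        # Update step: rewards observed at time n go to the matching buffers.
        for arm, y in arrivals.pop(n, []):
            queues[arm].append(y)
    return pulls


pulls = qpm_d([0.3, 0.7], draw_delay=lambda rng: rng.randint(0, 5), n_rounds=3000)
print(pulls)
```

BASE only ever sees a reward for the arm it last predicted, so from its point of view the problem looks non-delayed; the cost of the delay shows up as the extra pulls made while a queue is empty.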
Stochastic rewards

Delays                       Setting                           Regret bound                                                                                   Reference
Fixed delays                 Covariates                        $O(\sqrt{\ell \log N}(\tau_{const} + \sqrt{N}))$                                               Dudik et al. (2011)
Variable delays              No covariates                     $R_N \le R'_N + O(\log N + E[\tau] + \sqrt{E[\tau] \log N})$                                   Joulani et al. (2013); Mandel et al. (2015)
                             No covariates                     $R_N \le C_1 R'_N + C_2 \tau_{max} \log \tau_{max}$                                            Desautels et al. (2014)
                             Covariates, linear                $O\left(\frac{\sqrt{f_{N,\delta}} + m}{\tau_m p_c} \sqrt{8dN \log(1 + N/\lambda)}\right)$      Vernade et al. (2018)
                             Covariates, generalized linear    $\tilde{O}((\sqrt{\mu_D d} + \sqrt{\sigma_G d} + d)\sqrt{N})$                                  Zhou et al. (2019)
                             Covariates, nonparametric         This dissertation
Anonymous delayed feedback   No covariates                     $+\, O(\log N + E[\tau] + \sqrt{E[\tau] \log N})$                                              Pike-Burke et al. (2017)

Table 1.2: Regret bounds for multi-armed bandits with stochastic delayed rewards
Here, $R'_N$ is the cumulative regret rate for the non-delayed BASE policy, $\tau_{const}$ is the constant delay and τ
is the random delay.
In Vernade et al. (2018), $\{m, \tau_m, p_c\}$ are conversion parameters,
$f_{N,\delta} = 2(1 + 1/\log t) \log(1/\delta) + c d \log(d \log t)$, and d is the covariate dimension.
In Zhou et al. (2019), d is the covariate dimension, $\mu_D$ is the mean for i.i.d. delays, and $\sigma_G$ is a parameter
characterizing the tail bound of the delays.
Nonstochastic delays, no covariates

Delays             Feedback or assumption                 Regret bound                                                                 Reference
Fixed delays       Non-anonymous feedback                 $O(\sqrt{(\ell + d) N \log \ell})$ *                                         Cesa-Bianchi et al. (2016)
                   Composite anonymous feedback           $O(\sqrt{d \ell N \log \ell})$ *                                             Cesa-Bianchi et al. (2018)
Variable delays    Bounded                                $O(\sqrt{(\ell N + D) \log \ell})$                                           Sommer Thune et al. (2019)
                   Unbounded, unknown N, D                $O(\sqrt{\log \ell\, (\ell^2 N + D)})$                                       Bistritz et al. (2019)
                   Unbounded, observed at action time     $O(\min_\beta |S_\beta| + \beta \log \ell + \frac{\ell N + D_\beta}{\beta})$   Sommer Thune et al. (2019)
                   Unbounded, unknown N, d                $O(\sqrt{\ell N} + \min_S(|S| + \sqrt{D_{\bar{S}} \log \ell}))$              Zimmert and Seldin (2019)

Table 1.3: Cumulative regret upper bounds for multi-armed bandits with nonstochastic
delayed feedback
ℓ denotes the number of arms, d is the fixed delay, N is the time horizon, $|S_\beta|$ is the number of
observations with delay exceeding β, and $D_\beta$ is the total delay of observations with delay below β. S
is the set of rounds excluded from delay counting; $\bar{S} = [N] \setminus S$ and $D_{\bar{S}}$ refer to the counted rounds.
* refers to results that have been proved to be minimax.
1.6.3 Nonstochastic setting
Delayed rewards also occur in a nonstochastic setting where the rewards are being
obtained in a more deterministic fashion. For example, Cesa-Bianchi et al. (2016) give
an example of multiple ad servers, which form a communication network through which
they can share user information and use real-time bidding to sell their inventory. Each
server learns how to set the auction parameters (e.g., reserve price) using a bandit
algorithm sequentially, in order to maximize the network’s overall revenue, and shares
feedback information with other advertisers in order to speed up learning. Delay comes
in because the rate at which information is exchanged through the communication
network is slower than the typical rate at which ads are served. This causes each
learner to acquire feedback information from other servers with a delay that depends
on the network’s structure. In this communication network, messages that take more
than d hops to arrive are dropped, and d is called the delay parameter. Now two
scenarios can occur, 1) the learning agents could decide to cooperate to solve the same
nonstochastic bandit problem or, 2) not cooperate and ignore the information received
from other agents. The authors propose a version of the Exp3 algorithm and prove that with
ℓ actions and M agents, the average per-agent regret after N rounds is at most of order
$\sqrt{(d + 1 + \frac{\ell}{M}\alpha_{\le d})(N \log \ell)}$, where $\alpha_{\le d}$ is the independence number of the d-th power
of the communication network G (i.e., the graph G augmented with all edges between any
pair of nodes at shortest-path distance less than or equal to d). The authors provide
results on regret bounds depending on the nature of the underlying graph structure G.
More recently, Li et al. (2019) and Sommer Thune et al. (2019) have developed algorithms
with regret guarantees for the case when the delays are variable (unrestricted). In
Sommer Thune et al. (2019), prior knowledge of delays is no longer required but they
assume that delays are available at action time (when an arm is pulled) for the doubling
technique to work. This assumption of delays being available at action time is justified
in the work as it is satisfied in the above-mentioned setting of interaction between ad
servers. However, Zimmert and Seldin (2019) relax this assumption of delays being
available at action time. Their results require no advance knowledge of delays and the
time horizon N , with no requirement of a doubling technique. The tightness of the
regret bound achieved by Zimmert and Seldin (2019) still remains an open problem.
More recently, Bistritz et al. (2019) considered the delayed Exp3 algorithm and proposed
a novel doubling trick for online learning with delays to deal with the case where the
total delay and time horizon are unknown. To our knowledge, the problem of delay in nonstochastic
bandits with covariates has not been addressed thus far. The regret
bounds for the literature reviewed here are tabulated in Table 1.3.
1.7 Delayed anonymous feedback bandit problem
As mentioned in section 1.5, the delayed anonymous feedback problem arises when rewards
are observed at a delayed time but without the knowledge of the corresponding arms.
These situations often arise in online learning settings. We discuss work done in this area
in both stochastic and adversarial realms in section 1.7.1 and section 1.7.2 respectively.
1.7.1 Stochastic setting
This problem was formulated by Pike-Burke et al. (2017, 2018), motivated by application
in online advertising. In this problem, along with assuming delayed feedback, it is also
assumed that the player does not observe the outcome of a specific action. Instead,
at each time step, t, a player selects an action It and then receives reward Yt which
could be a cumulative/aggregated reward from any of the past t plays of the bandit
process. Although this seems to be a harder problem due to this anonymity, the authors
devise a strategy and show that one can achieve regret of similar order to the simpler
delayed feedback problem of Joulani et al. (2013). The key idea used is that playing
an arm consecutively for a long period of time helps obtain an accurate estimate of the
mean reward of that arm. To our knowledge, extending this to contextual bandits
remains an unsolved problem. More recently, Cella and Cesa-Bianchi (2019) proposed
a slightly different problem motivated by recommendation problems in music streaming
platforms. They propose a nonstationary stochastic bandit model in which the expected
reward of an arm depends on the number of rounds that have passed since the arm was
last pulled. They introduce a class of ranking policies, and propose an algorithm that
achieves a regret of $\tilde{O}(\ell N)$ with respect to the best ranking policy.
1.7.2 Nonstochastic setting
In Cesa-Bianchi et al. (2018), the authors study the delayed anonymous feedback in
a nonstochastic setting, where rewards (or losses) are generated by some unspecified
deterministic mechanism. In addition, they consider a more general setup by assuming
that the loss for choosing an action at time t is adversarially spread over at most d
(known, fixed delay) consecutive time steps in the future, t, t + 1, . . . , t + d − 1. They
call this setting composite anonymous setting as it can accommodate scenarios where
actions have a lasting effect which combines additively over time. These scenarios
can occur in various online settings, for example, an impression results in immediate
click-through, later followed by a conversion; or a user interacts multiple times with
the recommended item. For this setting, the authors provide an upper bound and
a matching lower bound (up to log factors), showing that in the nonstochastic case,
anonymous feedback is strictly harder than non-anonymous feedback. Extending this
to the contextual case and unknown delays still remains an unsolved problem.
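The composite anonymous mechanism described above can be illustrated with a short sketch. The function `composite_observations` and the toy loss components are hypothetical; the only structure taken from the text is that the loss of the action played at time t is split into at most d parts incurred at times t, t + 1, . . . , t + d − 1, and that the learner sees only the per-round totals.

```python
def composite_observations(loss_components):
    """Aggregate per-action loss components into the losses observed at each time.

    loss_components[t] lists the d parts of the loss for the action chosen at
    (0-indexed) time t; part s is incurred at time t + s.
    """
    horizon = len(loss_components)
    d = len(loss_components[0])
    observed = [0.0] * (horizon + d - 1)
    for t, parts in enumerate(loss_components):
        for s, part in enumerate(parts):
            observed[t + s] += part  # parts from different actions add up
    return observed


# Three rounds, each action's loss spread over d = 3 consecutive steps.
components = [[0.2, 0.1, 0.0], [0.0, 0.5, 0.5], [0.3, 0.0, 0.0]]
print(composite_observations(components))
```

In this toy run the loss observed at the third step combines parts from the second and third actions, so the learner cannot attribute it to either one, although the total observed loss still equals the total loss incurred.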
1.8 Contextual bandits and health care
In recent times, the majority of work in contextual bandits has been driven by personal
recommendation problems arising on the web, such as providing advertisement recommendations
based on user and webpage features, or tailoring news article recommendations
based on user interests. However, multi-armed bandits from
the very beginning have been motivated by their potential applications in health care.
Since the work proposed in this dissertation is motivated by its potential applications
to health care, we discuss some of the work done in two directions in the broad field of
health care in the following sections.
1.8.1 Contextual bandits for adaptive clinical trials
Traditionally, clinical trials have followed a non-adaptive design with known randomization
of patients to treatments throughout the trial. These designs have been well studied
for a long time and remain largely prevalent, due to good properties
like maintaining low Type I error and controlling bias. However, these trials are often
very long and expensive, and could lead to poor patient outcomes and inconclusive results.
Therefore, adaptive designs are being studied and encouraged even by regulatory bodies
like the FDA in the U.S. There are several types of adaptive designs; the most commonly used
is the response-adaptive design (Chow and Chang (2012); Murphy (2005)). Such designs
are usually Bayesian in nature, intending to learn from the accumulating patient
responses by making procedural changes (like changing randomization probabilities)
while the trial is still ongoing.
Contextual bandits are extremely relevant in the context of adaptive clinical tri-
als. The covariate information xi for the ith patient can be used to “personalize” the
treatment selection for the patient as in biomarker-guided therapies in personalized
medicine. For example, the BATTLE trial in Kim et al. (2011) demonstrated that per-
sonalizing the chemotherapy regimen led to increased success rates in cancer patients.
In Lai and Liao (2012), the authors develop an asymptotic theory for efficient outcome-
adaptive randomization schemes and optimal stopping rules. They extend the classic
MAB theory to developing asymptotic lower bounds for the expected sample sizes for
the treatment arms and the control arm, using generalized likelihood ratio procedures
to obtain these bounds. Some of these ideas are used in Bartroff et al. (2013), Wason
and Jaki (2012). A LASSO bandit algorithm for parametric bandits is used in Bastani
and Bayati (2015) on a medical decision making problem of Warfarin dosing and it is
shown that it improves overall treatment benefits. Szorenyi et al. (2015) have developed
an innovative qualitative MAB approach in which the rewards are not assumed to be
numerical. They use a quantile-based online learning approach for regret minimization
and finite time analysis. Another notable line of work, chiefly motivated by clinical trials, is
on batched bandit problems (Perchet et al. (2016)), where the allocation policy must split
trials into a small number of batches. They show that optimal regret bounds can be
attained even when the number of batches is much smaller than log T, where T is the
total number of time steps. There is a long list of published work discussing sequential
approaches (including multi-armed bandits) and their role in randomized and adaptive
clinical trials; some examples include Anscombe (1963); Armitage et al. (1975); Wei and
Durham (1978); Lai et al. (1985); Wason and Jaki (2012); Sverdlov (2015); Villar et al.
(2015); Ahuja and Birge (2016). Although multi-armed bandits have long been used as
the motivating example for adaptive clinical trials, to our knowledge they still have limited
application in real clinical trials.
1.8.2 Contextual bandits for mobile health
Mobile health is a term used for the practice of medicine and public health supported
by mobile devices such as mobile phones, tablet computers, and wearable devices such
as smart watches, for health services, information, and data collection. As contextual
bandits provide a natural framework for personalized decision making over time, they
are expected to be found useful in personalizing mobile health interventions to a spe-
cific person in a particular context. Recently, a concept called Just-in-time adaptive
intervention or JITAI (Nahum-Shani et al. (2017)) was formulated to unify a number
of decision making problems that arise in mobile health. JITAIs are increasingly being
used to support health behavior changes in domains such as physical inactivity, alcohol
use, mental illness, smoking and obesity. Contextual bandits seem to be a promising
tool to help personalize JITAIs, where the arms chosen are the interventions provided
to the users. Widespread use of technologies such as smart phones, tablets etc. and
their portable nature enables individuals to access and receive interventions anytime and
anywhere. Moreover, mobile-phone sensing (e.g., GPS), computing sensors in wearable
devices, and digital footprints make it possible to monitor individuals continuously and
hence know when and why to intervene. For example, MyBehavior (Rabbi et al. (2015))
is a lifestyle mobile intervention (a smart phone application) that uses multi-armed ban-
dits. It uses sensor data to suggest a frequent behavior (e.g., walking) when the person
is in a particular location and life context (e.g., on the way home after work) or run-
ning (which might happen less frequently). As the person repeats these behaviors, the
online algorithm updates and the recommendations are prompted more frequently in
that setting. Although JITAIs are now being used extensively, they are still in an early
stage of development and face some unique challenges like attrition (people abandoning
their mobile health resources) and the rapid, unexpected nature of the problem. Tewari
and Murphy (2017) review the existing contextual bandits literature in the light of mo-
bile health applications and discuss specific technical and statistical challenges in this
direction. Some of the statistical challenges mentioned are: good initialization of the
learning algorithm, assessing usefulness of covariates, robustness to failure of assump-
tions, dealing with variables that are expensive to acquire or are missing, and finding
interpretable policies.
1.9 Our contribution
In section 1.4.2, we discussed contextual bandits from a nonparametric estimation point
of view and in section 1.6.2, we discussed the crucial component of incorporating delays
in the stochastic bandit framework. However to our knowledge, the combination of
these two concepts has not been considered in the literature thus far, that is, contextual
bandits with delayed rewards have not been considered using nonparametric methods
of estimation. Also, not much work has been done to handle unrestricted delays in
a fully contextual bandit problem. For example, most of the work is either restricted
to constant delays (Dudik et al. (2011)) or finite decision sets for covariates (Joulani
et al. (2013)). In the work presented in this dissertation, we try to develop contextual
bandit strategies with unrestricted delays, using a nonparametric estimation approach.
We present the cumulative regret analysis for these proposed strategies, and illustrate
their performance both theoretically and empirically. The motivation for considering
this framework comes from applications in sequential treatment allocation for health
care as discussed in section 1.8.
In our setting, we consider the delay in observing rewards to be a random variable
and allow for them to be unbounded with some mild assumptions. In the first part
of our work, we extend the proposed randomized allocation strategy of Yang and Zhu
(2002) by incorporating delayed feedback in the framework. The strategy proposed is
an annealed -greedy type of strategy for contextual bandits with delays. We prove that
the proposed strategy is strongly consistent, which shows that the cumulative reward
of the proposed strategy is asymptotically equivalent to the optimal cumulative reward.
For this result, the only assumption we make on the nature of the underlying mean
reward functions is that they are continuous. Therefore, this general setup along with
the mild assumptions on delays fits in a sequential treatment allocation framework for
a variety of settings as discussed in chapter 2.
Applying nonparametric methods of estimation usually requires making good choices
for hyperparameters, such as the binwidth in the histogram method, bandwidth in
kernel regression (like the Nadaraya-Watson estimator), and k in the k-nearest neighbors
algorithm. Since sequential procedures like contextual bandits require time dependent
updates of these hyperparameters for estimation at each time point, the user needs to
determine an appropriate choice of the binwidth sequence. Another hyperparameter
sequence that needs to be chosen is the exploration probability sequence, used in the
annealed -greedy strategy to balance the exploration-exploitation trade-off. These have
been well-studied in the no-delay setting and choices that would guarantee optimal rates
have been suggested in Qian and Yang (2016b). Once proper choices of these user
determined sequences have been made, one faces the question of how to update both
these sequences in the presence of delayed feedback. There are two possible choices:
1) update the sequences only after observing a new reward, or, 2) keep updating at
all time points irrespective of having observed a reward. Based on these two choices,
we consider two strategies that differ in how the exploration probability sequence is
updated. However, for both strategies, the binwidth sequence is only updated when a
new reward is observed for reasons described in chapter 3. Then, through simulations
we show that both the proposed strategies are advantageous in different scenarios. This
helps us understand that black-box procedures for incorporating delays might not always
be advisable and it is important to opt for strategies that take into account several
factors like the complexity of the problem and expected magnitude of delays.
In the last part of this work on delayed feedback in contextual bandits, we provide
finite-time regret bounds for our proposed strategies. We provide upper bounds for
the cumulative regret for both the strategies as discussed in the previous paragraph,
which helps further our understanding of how these strategies compare in finite time. In
addition, we try to relax the assumption of independence of delays with covariates, and
provide finite-time analysis for this setting under some additional assumptions. Finally,
we illustrate these results by applying them to the Yahoo! Front Page Module User
Click dataset, a benchmark dataset extensively used for contextual bandit problems.
We compare our results with the DeLinUCB algorithm of Vernade et al. (2018) and
discuss future directions.
Finally, in the last chapter, we switch focus to a different and pressing problem:
improving applicability of contextual bandits in health care. We consider a setup where
a doctor can intervene in the automated decision making process of multi-armed bandit
algorithms. With contextual bandits, the decision maker uses a computer algorithm
that can balance the tendency to apply treatments that have done well in the past
with the option to try other treatments that might be more beneficial in the future.
However, based on their experience, doctors may consider certain patient cases to be
special and would want to allot a different treatment than the one proposed by the algo-
rithm. Therefore, we develop a consistent treatment allocation strategy that holistically
integrates the adaptive learning by the bandits algorithm and expert interventions.
The dissertation is organized as follows: In chapter 2 we propose a randomized
algorithm which incorporates delayed feedback and show that it is strongly consistent.
In chapter 3, we modify the strategy proposed in chapter 2 by understanding how the
hyper-parameter sequences should be updated in the presence of delays, in order to
balance the exploration-exploitation dilemma in a better way. Then in chapter 4, we
provide finite-time regret upper bounds for the two strategies proposed in chapter 3 and
illustrate their applicability on a real dataset. Finally in chapter 5, we change the focus
from delayed feedback and propose a randomized allocation strategy that incorporates
expert (doctor) intervention in the automated bandit strategy. Appendix A contains
useful inequalities and probabilistic tools that are often used in the proofs. A table for
common notations used in this dissertation is given in section A.2.
Chapter 2
Randomized allocation strategy for delayed nonparametric bandits
In this chapter, we propose a contextual bandit algorithm accounting for delayed rewards
with sequential treatment decision making as the motivation. We use nonparametric
estimation to estimate the functional relationship between the rewards and the covari-
ates. We show that the proposed algorithm is strongly consistent in that the cumulative
rewards almost surely converge to the optimal cumulative rewards.
2.1 Problem setup
The general contextual bandit setup of the problem is as follows. Assume that there
are $\ell \geq 2$ arms available for allocation. Each arm allocation results in a reward which is
obtained at some random time after the arm allocation. For each time $j \geq 1$, a treatment
$I_j$ is allotted based on the data observed previously and the covariate $X_j$. We assume
that the covariates are $d$-dimensional continuous random variables and take values in
the hypercube $[0, 1]^d$. Since the rewards can be obtained at some delayed time, we
denote by $\{t_j \in \mathbb{R}^+, j \geq 1\}$ the observation times for the rewards for arms $\{I_j, j \geq 1\}$,
respectively. Let $Y_{i,j}$ be the reward obtained at time $t_j \geq j$ for arm $i = I_j$. The mean
reward with covariate $X_j$ for the $i$th arm is denoted as $f_i(X_j)$, $1 \leq i \leq \ell$. The observed
reward with covariate $X_j$ by pulling the $i$th arm is modeled as $Y_{i,j} = f_i(X_j) + \varepsilon_{i,j}$,
where $\varepsilon_{i,j}$ denotes independent random error with $E(\varepsilon_{i,j}) = 0$ and
$\mathrm{Var}(\varepsilon_{i,j}) < \infty$ for all $1 \leq i \leq \ell$ and $j \in \mathbb{N}$. The functions $f_i$ are assumed
to be unknown and not of any given parametric form.
The rewards are observed at delayed times tj ; the delay in the reward for arm Ij
pulled at the jth time is given by a random variable dj := tj − j. Assume that these
delays are mutually independent, independent of the covariates, and could be drawn
from different distributions. That is, let {dj , j ≥ 1} be a sequence of independent
random variables with probability density functions {gj , j ≥ 1} and the cumulative
distribution functions {Gj , j ≥ 1}, respectively.
Let $\{X_j, j \geq 1\}$ be a sequence of covariates independently generated according to
an unknown underlying probability distribution $P_X$, from a population supported in
$[0, 1]^d$. Let $\delta$ be a sequential allocation rule, which for each time $j$ chooses an arm
$I_j$ based on the previous observations and $X_j$. The total mean reward up to time $n$
is $\sum_{j=1}^{n} f_{I_j}(X_j)$. To evaluate the performance of the allocation strategy, let
$i^*(x) = \arg\max_{1 \leq i \leq \ell} f_i(x)$ and $f^*(x) = f_{i^*(x)}(x)$. Without the knowledge of the
random errors, the ideal performance occurs when the choices of arms selected $I_1, \ldots, I_n$
match the optimal arms $i^*(X_1), \ldots, i^*(X_n)$, yielding the optimal total reward
$\sum_{j=1}^{n} f^*(X_j)$. The ratio of these two quantities is the quantity of interest,
$$R_n(\delta) = \frac{\sum_{j=1}^{n} f_{I_j}(X_j)}{\sum_{j=1}^{n} f^*(X_j)}. \tag{2.1}$$
It can be seen that $R_n$ is a random variable no bigger than 1.
Definition 2.1.1. An allocation rule δ is said to be strongly consistent if Rn(δ) → 1
with probability 1, as n→∞.
In section 2.2, we propose an allocation rule which takes into account reward delays.
Then in sections 2.2.1 and 2.3.1, we discuss the consistency of the proposed allocation
rule under some assumptions and then validate those assumptions when the histogram
method is used to estimate the regression functions respectively.
2.2 The proposed strategy
Let $Z^{n,i}$ denote the set of observations for arm $i$ whose rewards have been obtained
up to time $n$, that is, $Z^{n,i} := \{(X_j, Y_{i,j}) : 1 \leq t_j \leq n \text{ and } I_j = i\}$. Let $\hat{f}_{i,n}$ denote
the regression estimator of $f_i$ based on the data $Z^{n,i}$. Let $\{\pi_j, j \geq 1\}$ be a sequence of
positive numbers in $[0, 1]$ decreasing to zero.
Step 1. Initialize. Allocate each arm once; w.l.o.g., we can have $I_1 = 1, I_2 = 2, \ldots, I_\ell = \ell$.
Since the rewards are not immediately obtained for each of these $\ell$ arms, we
continue these forced allocations until we have at least one reward observed for
each arm. Suppose that happens at time $m_0$.
Step 2. Estimate the individual functions $f_i$. For $n = m_0$, based on $Z^{n,i}$, estimate $f_i$
by $\hat{f}_{i,n}$ for $1 \leq i \leq \ell$ using the chosen regression procedure.
Step 3. Estimate the best arm. For $X_{n+1}$, let $\hat{i}_{n+1}(X_{n+1}) = \arg\max_{1 \leq i \leq \ell} \hat{f}_{i,n}(X_{n+1})$.
Step 4. Select and pull. Randomly select an arm with probability $1 - (\ell - 1)\pi_{n+1}$ for
$i = \hat{i}_{n+1}$ and with probability $\pi_{n+1}$ for each of the other arms, $i \neq \hat{i}_{n+1}$. Let $I_{n+1}$ denote
this selected arm.
Step 5. Update the estimates.
Step 5a. If a reward is obtained at the $(n+1)$th time (could be one or more rewards
corresponding to one or more arms $I_j$, $1 \leq j \leq n+1$), update the function
estimates of $f_i$ for the respective arm (or arms) for which the reward (or
rewards) are obtained at the $(n+1)$th time.
Step 5b. If no reward is obtained at the $(n+1)$th time, use the previous function
estimators, i.e. $\hat{f}_{i,n+1} = \hat{f}_{i,n}$ for all $i \in \{1, \ldots, \ell\}$.
Step 6. Repeat. Repeat steps 3-5 when the next covariate $X_{n+2}$ surfaces and so on.
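The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this dissertation: all names (`run_allocation`, `draw_x`, `reward_fn`, `delay_fn`, `estimate`, `pi_seq`) are hypothetical, the regression procedure is passed in as a black box, the forced allocations of Step 1 are assumed to cycle through the arms, and the exploration probability is clipped at $1/\ell$ so that the selection probabilities of Step 4 are always valid.

```python
import numpy as np

def run_allocation(n_rounds, n_arms, draw_x, reward_fn, delay_fn, estimate, pi_seq, rng):
    """Sketch of Steps 1-6 with delayed rewards (arms are indexed from 0).

    draw_x(rng) -> covariate vector; reward_fn(arm, x, rng) -> reward;
    delay_fn(j, rng) -> delay d_j >= 0 (np.inf means never observed);
    estimate(X, Y, x) -> estimate of f_i(x) from one arm's observed data;
    pi_seq(n) -> exploration probability pi_n.
    """
    xs, arms, ys, t_obs = [], [], [], []
    observed = [[] for _ in range(n_arms)]  # per arm: indices whose reward has arrived
    pending = []                            # indices still waiting for their reward
    for n in range(1, n_rounds + 1):
        x = draw_x(rng)
        if any(len(obs) == 0 for obs in observed):
            # Step 1: forced allocation until every arm has one observed reward
            arm = (n - 1) % n_arms
        else:
            # Steps 2-3: estimate each f_i from rewards observed so far, pick the best
            est = []
            for i in range(n_arms):
                X_i = np.array([xs[j] for j in observed[i]])
                Y_i = np.array([ys[j] for j in observed[i]])
                est.append(estimate(X_i, Y_i, x))
            best = int(np.argmax(est))
            # Step 4: randomize; clipping at 1/n_arms is an assumption of this sketch
            p = min(pi_seq(n), 1.0 / n_arms)
            probs = np.full(n_arms, p)
            probs[best] = 1.0 - (n_arms - 1) * p
            arm = int(rng.choice(n_arms, p=probs))
        xs.append(x)
        arms.append(arm)
        ys.append(reward_fn(arm, x, rng))   # generated now, revealed at time t_j
        t_obs.append(n + delay_fn(n, rng))  # t_j = j + d_j
        pending.append(n - 1)
        # Step 5: rewards with t_j <= n become available for the next round
        still_pending = []
        for j in pending:
            if t_obs[j] <= n:
                observed[arms[j]].append(j)
            else:
                still_pending.append(j)
        pending = still_pending
    return np.array(arms), np.array(xs)
```

Any estimator with the signature `estimate(X, Y, x)` can be plugged in, for example a histogram estimator with a bin width that depends on the number of observed rewards.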
The choice of $\pi_n$ in the randomization step 4 is crucial in determining how much
exploration and exploitation is done at any phase of the trial. To emphasize the role of
$\pi_n$, we may use $\delta_\pi$ to denote the allocation rule. In order to select the best arm as time
progresses, $\pi_n$ needs to decrease to zero but the rate of decrease will play a key role in
determining how well the allocations work. For example, if in our set-up we have large
delays for some arms then it might be beneficial to decrease $\pi_n$ at a slower rate so that
there is enough exploration and the accuracy of our estimates is not affected in the long
run. We use a user-determined choice of $\pi_n$ in this work, that is, the sequence $\pi_n$ does
not adapt to the data.
2.2.1 Consistency of the proposed strategy
Let An := {j : tj ≤ n}, denote the time points for which rewards were obtained by time
n. If An is known, then the total number of observed rewards until time n, denoted by
τn, is also known. Recall that it is possible to observe multiple rewards at the same time
point. Given An, let {sk, k = 1, . . . , τn} be the reordered sequence of these observed
reward timings, {tk, k ∈ An}, arranged in a non-decreasing order.
Assumption 2.2.1. The regression procedure is strongly consistent in $L_\infty$ norm for
all individual mean functions $f_i$ under the proposed allocation scheme. That is,
$\|\hat{f}_{i,n} - f_i\|_\infty \overset{a.s.}{\to} 0$ as $n \to \infty$ for each $1 \leq i \leq \ell$.
As described in the allocation strategy in section 2.2, $\hat{f}_{i,n}$ is the estimator based on
all previously observed rewards. That is, after initialization, the mean reward function
estimators are only updated at the time points $\{s_k, k = 1, \ldots, \tau_n\}$ where $\tau_n$ is the number
of rewards observed by time $n$. Therefore, this condition is equivalent to saying
$\|\hat{f}_{i,s_n} - f_i\|_\infty \overset{a.s.}{\to} 0$ as $n \to \infty$.
Assumption 2.2.2. Mean functions satisfy $f_i(x) \geq 0$,
$A = \sup_{1 \leq i \leq \ell} \sup_{x \in [0,1]^d} (f^*(x) - f_i(x)) < \infty$ and $E(f^*(X_1)) > 0$.
Theorem 2.2.3. Under Assumptions 2.2.1 and 2.2.2, the allocation rule $\delta_\pi$ is strongly
consistent as $n \to \infty$.
Proof. Note that consistency holds only when the sequence $\{\pi_n, n \geq 1\}$ is chosen such
that $\pi_n \to 0$ as $n \to \infty$. The proof is very similar to the proof in Yang and Zhu (2002).
The details can be found in section 2.5.1.
Note that Assumption 2.2.1, seemingly natural, is a strong assumption and it re-
quires additional work to verify this assumption for a particular regression setting. We
verify this assumption for the histogram method in section 2.3.1. On the other hand,
Assumption 2.2.2 does not involve the estimation procedure and does not require any
verification.
2.3 The Histogram method
Partition $[0, 1]^d$ into $M = (1/h)^d$ hyper-cubes with side width $h$, assuming $h$ is chosen
such that $1/h$ is an integer. For some $x \in [0, 1]^d$, let $J(x)$ denote the set of time
points for which the corresponding design points observed until time $n$ fall in the same
cube as $x$, say $B(x)$, and for which the corresponding rewards are observed by time
$n$. Let $N(x)$ denote the size of $J(x)$. That is, let $J(x) = \{j : X_j \in B(x), t_j \leq n\}$
and $N(x) = \sum_{j=1}^{n} I\{X_j \in B(x), t_j \leq n\}$. Furthermore, let $\bar{J}_i(x)$ be the subset of
$J(x)$ corresponding to arm $i$ and let $\bar{N}_i(x)$ be the number of such time points, that is,
$\bar{J}_i(x) = \{j \in J(x) : I_j = i\}$ and $\bar{N}_i(x) = \sum_{j=1}^{n} I\{I_j = i, X_j \in B(x), t_j \leq n\}$. Then the
histogram estimate for $f_i(x)$ is defined as
$$\hat{f}_{i,n}(x) = \frac{1}{\bar{N}_i(x)} \sum_{j \in \bar{J}_i(x)} Y_{i,j}.$$
For the estimator to behave well, a proper choice of the bandwidth, $h = h_n$, is necessary.
Although one could choose different widths $h_{i,n}$ for estimating different $f_i$'s, for simplicity,
the same bandwidth $h_n$ is used in the following sections. For notational convenience,
when the analysis is focused on a single arm, $i$ is dropped from the subscript of $\hat{f}$, $\bar{N}$
and $\bar{J}$.
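The histogram estimate defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used for the simulations: the function name `histogram_estimate` and the argument `empty_value` are hypothetical, the inputs are assumed to contain only the covariates and rewards of one arm whose rewards have already been observed, and the value returned when no observed point falls in the cube of $x$ is a choice made here, since the formula is undefined in that case.

```python
import numpy as np

def histogram_estimate(X_obs, Y_obs, x, h, empty_value=0.0):
    """Average of the observed rewards of one arm whose covariates fall in B(x).

    X_obs: (k, d) covariates with observed rewards for this arm
    Y_obs: (k,) corresponding rewards
    x:     (d,) query point in [0, 1]^d
    h:     side width of the cubes, with 1/h an integer
    """
    Y_obs = np.asarray(Y_obs, dtype=float)
    if Y_obs.size == 0:
        return empty_value
    x = np.atleast_1d(np.asarray(x, dtype=float))
    X_obs = np.asarray(X_obs, dtype=float).reshape(Y_obs.size, x.size)
    m = int(round(1.0 / h))  # number of cells per coordinate

    def cube(z):
        # integer cell index per coordinate; points on the upper boundary go to the last cell
        return np.minimum(np.floor(z * m).astype(int), m - 1)

    in_cube = np.all(cube(X_obs) == cube(x), axis=1)  # membership in J-bar_i(x)
    if not in_cube.any():
        return empty_value                            # N-bar_i(x) = 0
    return float(Y_obs[in_cube].mean())
```

For example, with `h = 0.5` in two dimensions the unit square is split into four cells, and the estimate at `x` is the mean reward of the observed points in the same cell as `x`.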
Other nonparametric methods like nearest-neighbors, kernel method, spline fitting
and wavelets can also be considered for estimation. Assumption 2.2.1 could be verified
for these methods using the same broad approach as illustrated in the following sections
for the histogram method, along with some method specific mathematical tools and
assumptions.
2.3.1 Allocation with histogram estimates
Here, we show that the histogram estimation method along with the allocation scheme
described in section 2.2, leads to strong consistency under some reasonable conditions on
random errors, design distribution, mean reward functions and delays. As already dis-
cussed in section 2.2.1, we only need to verify that Assumption 2.2.1 holds for histogram
method estimators. Along with Assumption 2.2.2, we make the following assumptions.
Assumption 2.3.1. The design distribution $P_X$ is dominated by the Lebesgue measure
with a density $p(x)$ uniformly bounded above and away from 0 on $[0, 1]^d$; that is, $p(x)$
satisfies $\underline{c} \leq p(x) \leq \bar{c}$ for some positive constants $\underline{c} < \bar{c}$.
Assumption 2.3.2. The errors satisfy a moment condition: there exist positive
constants $v$ and $c$ such that, for all $m \geq 2$, the Bernstein condition is satisfied, that is,
$E|\varepsilon_{ij}|^m \leq \frac{m!}{2} v^2 c^{m-2}$.
Assumption 2.3.3. The delays, {dj , j ≥ 1}, are independent of each other, the choice
of arms and also of the covariates.
Assumption 2.3.4. Let the partial sums of delay distributions satisfy
$\sum_{j=1}^{n} G_j(n - j) = \Omega(n^{\alpha} \log^{\beta} n)$¹ for some $\alpha > 0$, $\beta \in \mathbb{R}$ or for $\alpha = 0$ and $\beta > 1$.
Note that the choice $n^{\alpha} \log^{\beta} n$ could be generalized to a sub-linear function $q(n)$
with a growth rate faster than $\log n$.
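To get a feel for Assumption 2.3.4, the partial sum $\sum_{j=1}^{n} G_j(n-j)$ can be evaluated numerically for a given delay distribution. The sketch below is only illustrative: the function names are hypothetical, all delays are assumed to share one distribution function $G$, and the geometric delay is assumed to take values in $\{0, 1, 2, \ldots\}$ so that $G(k) = 1 - (1-p)^{k+1}$.

```python
import numpy as np

def partial_sum_of_delay_cdfs(n, cdf):
    """Evaluate sum_{j=1}^{n} G(n - j) when every delay has the same CDF G."""
    k = np.arange(n)  # k = n - j takes the values 0, ..., n-1 as j runs over n, ..., 1
    return float(np.sum(cdf(k)))

def geometric_delay_cdf(p):
    """CDF of a geometric delay on {0, 1, 2, ...} (an assumed parametrization)."""
    return lambda k: 1.0 - (1.0 - p) ** (np.asarray(k) + 1)

def no_delay_cdf(k):
    """Every reward is observed immediately, so G(k) = 1 for all k >= 0."""
    return np.ones(np.shape(k))
```

Under these assumptions the geometric case gives $n - (1-p)\{1 - (1-p)^n\}/p$, which grows linearly in $n$, so the assumption would hold with $\alpha = 1$ and $\beta = 0$.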
2.3.2 Number of observations in a small cube
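From Assumption 2.3.1 and Assumption 2.3.3, we have that for a fixed cube $B$ with
side width $h_n$ at time $n$, $P(X_j \in B, t_j \leq n) = P(X_j \in B) P(t_j \leq n) \geq \underline{c} h_n^d G_j(n - j)$,
where $\underline{c}$ is the lower bound on the design density in Assumption 2.3.1.
Let $N$ be the number of observations that fall in $B$ and are observed by time $n$, that is,
$N = \sum_{j=1}^{n} I\{X_j \in B, t_j \leq n\}$. It is easily seen that $N$ is a random variable with expectation
$\beta \geq \sum_{j=1}^{n} \underline{c} h_n^d G_j(n - j)$. From the extended Bernstein inequality (A.2), we have
$$P\left(N \leq \frac{\underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{2}\right) \leq \exp\left(-\frac{3 \underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{28}\right). \tag{2.2}$$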
Lemma 2.3.5. Let $\epsilon > 0$ be given. Suppose that $h$ is small enough such that $w(h; f) < \epsilon$.
Then the histogram estimator $\hat{f}_n$ satisfies
$$P_{A_n, X^n}\left(\|\hat{f}_n - f\|_\infty \geq \epsilon\right) \leq M \exp\left(-\frac{3 \pi_n \min_{1 \leq b \leq M} N_b}{28}\right)
+ 2M \exp\left(-\frac{\min_{1 \leq b \leq M} N_b \, \pi_n^2 (\epsilon - w(h; f))^2}{8\left(v^2 + c(\pi_n/2)(\epsilon - w(h; f))\right)}\right),$$
where the probability $P_{A_n, X^n}$ denotes conditional probability given design points
$X^n = (X_1, X_2, \ldots, X_n)$ and $A_n = \{j : t_j \leq n\}$. Here, $N_b$ is the number of design points for
which the rewards have been observed by time $n$ such that they fall in the $b$th small cube
of the partition of the unit cube at time $n$.
¹ $f(n) = \Omega(g(n))$ if for some positive constant $c$, $f(n) \geq c\,g(n)$ when $n$ is large enough.
Proof. The proof of Lemma 2.3.5 is included in section 2.5.
Theorem 2.3.6. Suppose Assumptions 2.2.2, 2.3.1-2.3.4 are satisfied. If for some
$\alpha > 0$ and $\beta \in \mathbb{R}$, or $\alpha = 0$ and $\beta > 1$, $h_n$ and $\pi_n$ are chosen to satisfy
$$n^{\alpha} (\log n)^{\beta - 1} h_n^d \pi_n^2 \to \infty, \tag{2.3}$$
then the allocation rule $\delta_\pi$ is strongly consistent.
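Proof of Theorem 2.3.6. The histogram technique partitions the unit cube into
$M = (1/h)^d$ small cubes. For each small cube $B_b$, $1 \leq b \leq M$, in the partition of the unit
cube, let $N_b$ denote the number of time points for which the corresponding design
points fall in the cube $B_b$ and the corresponding arm rewards are observed by time $n$. In
other words, $N_b = \sum_{j=1}^{n} I\{X_j \in B_b, t_j \leq n\}$. Using inequality (2.2), with $\underline{c}$ the lower
bound on the design density in Assumption 2.3.1, we have
$$P\left(N_b \leq \frac{\underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{2}\right) \leq \exp\left(-\frac{3 \underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{28}\right)$$
$$\Rightarrow \; P\left(\min_{1 \leq b \leq M} N_b \leq \frac{\underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{2}\right) \leq M \exp\left(-\frac{3 \underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{28}\right). \tag{2.4}$$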
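Let $W_1, \ldots, W_n$ be Bernoulli random variables indicating whether the $i$th arm is selected
($W_j = 1$) for time point $j$, or not ($W_j = 0$). Note that, conditional on the previous
observations and $X_j$, the probability of $W_j = 1$ is almost surely bounded below by
$\pi_j \geq \pi_n$ for $1 \leq j \leq n$. Let $w(h_n; f_i)$ be the modulus of continuity as in Definition
A.0.2. Note that, under the continuity assumption of $f_i$, we have $w(h_n; f_i) \to 0$ as
$h_n \to 0$. Thus, for any $\epsilon > 0$, when $h_n$ is small enough, $\epsilon - w(h_n; f_i) \geq \epsilon/2$. Consider,
with $\underline{c}$ the lower bound on the design density in Assumption 2.3.1,
$$\begin{aligned}
P\left(\|\hat{f}_{i,n} - f_i\|_\infty > \epsilon\right)
&= P\left(\|\hat{f}_{i,n} - f_i\|_\infty > \epsilon, \; \min_{1 \leq b \leq M} N_b \geq \frac{\underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{2}\right) \\
&\quad + P\left(\|\hat{f}_{i,n} - f_i\|_\infty > \epsilon, \; \min_{1 \leq b \leq M} N_b < \frac{\underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{2}\right) \\
&\leq E\, P_{A_n, X^n}\left(\|\hat{f}_{i,n} - f_i\|_\infty > \epsilon, \; \min_{1 \leq b \leq M} N_b \geq \frac{\underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{2}\right) \\
&\quad + P\left(\min_{1 \leq b \leq M} N_b < \frac{\underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{2}\right),
\end{aligned}$$
where $P_{A_n, X^n}$ denotes conditional probability given the design points until time $n$,
$X^n = \{X_1, X_2, \ldots, X_n\}$, and the event $A_n := \{j : t_j \leq n\}$.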
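From Lemma 2.3.5, we have that given the design points and the time points for
which rewards were observed, for any $\epsilon > 0$, when $h$ is small enough,
$$P_{A_n, X^n}\left(\|\hat{f}_n - f\|_\infty \geq \epsilon\right) \leq M \exp\left(-\frac{3 \pi_n \min_{1 \leq b \leq M} N_b}{28}\right)
+ 2M \exp\left(-\frac{\min_{1 \leq b \leq M} N_b \, \pi_n^2 (\epsilon - w(h_n; f))^2}{8\left(v^2 + c(\pi_n/2)(\epsilon - w(h_n; f))\right)}\right).$$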
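Using the above inequality and (2.4), we have, with $\underline{c}$ the lower bound on the design
density in Assumption 2.3.1 and $c$ the constant in Assumption 2.3.2,
$$\begin{aligned}
P\left(\|\hat{f}_{i,n} - f_i\|_\infty > \epsilon\right)
&\leq 2M \exp\left(-\frac{\underline{c} h_n^d \left(\sum_{j=1}^{n} G_j(n - j)\right) \pi_n^2 (\epsilon - w(h_n; f_i))^2}{16\left(v^2 + c(\pi_n/2)(\epsilon - w(h_n; f_i))\right)}\right) \\
&\quad + M \exp\left(-\frac{3 \underline{c} h_n^d \pi_n \sum_{j=1}^{n} G_j(n - j)}{56}\right)
+ M \exp\left(-\frac{3 \underline{c} h_n^d \sum_{j=1}^{n} G_j(n - j)}{28}\right).
\end{aligned}$$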
It can be shown that the above upper bound is summable in $n$ under the condition
$$\frac{h_n^d \pi_n^2 \sum_{j=1}^{n} G_j(n - j)}{\log n} \to \infty. \tag{2.5}$$
It is easy to see that this follows from Assumption 2.3.4 and (2.3).
Since $\epsilon$ is arbitrary, by the Borel-Cantelli lemma, we have that $\|\hat{f}_{i,n} - f_i\|_\infty \to 0$ almost surely.
This is true for all arms $1 \leq i \leq \ell$. Hence, this completes the proof of Theorem 2.3.6.
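The step from Assumption 2.3.4 and (2.3) to condition (2.5) used in the proof above can be written out in one line. In the display below, $c_0$ is a name introduced here for the positive constant in the definition of $\Omega(\cdot)$; it does not appear in the original text.

```latex
\sum_{j=1}^{n} G_j(n-j) \;\ge\; c_0\, n^{\alpha} \log^{\beta} n
\quad \text{for all large } n
\;\;\Longrightarrow\;\;
\frac{h_n^d\, \pi_n^2 \sum_{j=1}^{n} G_j(n-j)}{\log n}
\;\ge\; c_0\, n^{\alpha} (\log n)^{\beta-1} h_n^d\, \pi_n^2
\;\longrightarrow\; \infty ,
```

where the divergence on the right is exactly condition (2.3).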
2.3.3 Effects of reward delay distributions
As one would expect, the amount of delay in observing the rewards will have a consid-
erable effect on the speed of sequential learning. In terms of treatment allocation, if
there are substantial delays in observing patient responses for a particular treatment,
the learning for that treatment will slow down and as a result the efficiency of the al-
location strategy will decrease. Therefore, Assumption 2.3.4 imposes some restrictions
on the delay distributions to ensure that at least a small proportion of rewards will be
obtained in finite time. It is of interest to see how the delay distribution affects the rate
at which $\pi_n$ and $h_n$ are allowed to decrease. This relationship can be understood by
examining condition (2.3) for Theorem 2.3.6.
Note that Assumption 2.3.4 and (2.3) in Theorem 2.3.6 can be generalized to include
any function $q(n)$ that grows faster than logarithmically. We
assume $\sum_{j=1}^{n} G_j(n - j) = \Omega(q(n))$ where $q(n)$ satisfies $q(n)/\log(n) \to \infty$ as $n \to \infty$.
Then it is easy to see that $h_n$ and $\pi_n$ can be chosen such that
$$\frac{h_n^d \pi_n^2 q(n)}{\log(n)} \to \infty \text{ as } n \to \infty, \tag{2.6}$$
which implies condition (2.5) holds. A possible advantage of this is that we allow a
wide range of possible delay distributions with mild restrictions on the delays. Below,
we consider some cases of the delay distributions and see how they affect exploration
($\pi_n$) and bandwidth ($h_n$) of the histogram estimator as time progresses.
1. In condition (2.3), $q(n) = n^{\alpha} \log^{\beta} n$ for $\alpha > 0$ and $\beta \in \mathbb{R}$, or $\alpha = 0$ and $\beta > 1$. Let
us first consider the case $\alpha = 0$ and $\beta > 1$, so that $q(n) = \log^{\beta} n$
and we want $\sum_{j=1}^{n} G_j(n - j) = \Omega(\log^{\beta} n)$. Consider $\pi_n = (\log n)^{-(\beta-1)/(2+d)}$ for
$n > m_0$ and $\beta > 1$; then for (2.5) to hold we need the bandwidth $h_n$ also to
be of order $\Omega((\log n)^{-(\beta-1)/(2+d)})$. For example, $h_n = (\log n)^{-(\beta-1)/(\beta(2+d))}$ would
guarantee consistency. Notice that with these $\pi_n$ and $h_n$, one would spend a lot of
time in exploration and the bandwidth would also decay very slowly, which would
affect the accuracy of the reward function estimates until $n$ is sufficiently large.
Notice that requiring the partial sum of probability distributions for the delays
to be only at least of the order $\log^{\beta} n$ gives the possibility of modeling cases with
extremely large delays. An example is a clinical study in which the outcome of
interest is survival time and we want to administer treatments for a disease such
that the survival time is maximized. With the unprecedented advances in drug
development, the life expectancy of patients is more likely to increase, hence the
survival time for a patient given any treatment would be large. Therefore, the
assumption that partial sums of probability distributions for the delays until time
$n$ need only be at least $\log^{\beta} n$ seems to be quite reasonable when the expected
waiting times (in this case survival times) are long. Examples include diseases like
diabetes and hypertension, which have a long survival time since they cannot be
cured but can be controlled with medications. These diseases also have fairly high
prevalence, so a large sample size to be able to get close to optimality would not be
a problem. For such diseases, assuming that one would only observe the responses
(survival times) of a small fraction of patients in finite time seems reasonable.
2. For the case when $\alpha > 0$ and $\beta \in \mathbb{R}$, we have that $\sum_{j=1}^{n} G_j(n - j) = \Omega(n^{\alpha} \log^{\beta} n)$.
Consider $\pi_n = n^{-\alpha/(2+d)}$ for $n > m_0$; then for the condition (2.5) to hold we
need $h_n$ to also be of order $\Omega(n^{-\alpha/(2+d)})$. For example, $h_n = n^{-\alpha/(2(2+d))}$ results
in $h_n^d \pi_n^2 n^{\alpha} \log^{\beta-1}(n) = n^{\alpha d/(2(2+d))} \log^{\beta-1}(n) \to \infty$ as $n \to \infty$, irrespective of the
value of $\beta$. Here the lower bound on the partial sums of probability distributions
for the delays can grow faster than in the previous case, depending on the values of
$\alpha$ and $\beta$.
This restriction of order $n^{\alpha} \log^{\beta} n$ can model cases with moderately large delays.
From a clinical point of view, one could model diseases in which treatments show
their effect in a short to moderate duration of time, for example diseases like
diarrhea, common cold, headache, and nutritional deficiencies. Here the response
of interest would be improvement in the condition of a patient as a result of
a treatment. For such diseases, one can expect to see the treatment effects on
patients in a short period of time. Hence, the delay in observing treatment results
will not be too long. If the response considered was survival (survived or not),
then stroke could also fall in this category because of high mortality.
Note that Assumption 2.3.4 only restricts the proportion of rewards expected
to be observed in the long run. Therefore, it is possible for strong consistency
to be achieved even when there is infinite delay in observing the rewards of some
arms (non-observance of some rewards).
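The exponent arithmetic behind the two example choices in the list above can be written out explicitly. The display below is a worked check of condition (2.6) with $q(n) = n^{\alpha}\log^{\beta} n$ under the stated choices of $\pi_n$ and $h_n$; it adds no new result.

```latex
% Case 1: alpha = 0, beta > 1,
%   pi_n = (log n)^{-(beta-1)/(2+d)},  h_n = (log n)^{-(beta-1)/(beta(2+d))}
\frac{h_n^d \pi_n^2 \log^{\beta} n}{\log n}
  = (\log n)^{(\beta-1)\left[1 - \frac{d}{\beta(2+d)} - \frac{2}{2+d}\right]}
  = (\log n)^{\frac{(\beta-1)^2 d}{\beta(2+d)}} \;\longrightarrow\; \infty .

% Case 2: alpha > 0, beta real,
%   pi_n = n^{-alpha/(2+d)},  h_n = n^{-alpha/(2(2+d))}
\frac{h_n^d \pi_n^2\, n^{\alpha} \log^{\beta} n}{\log n}
  = n^{\alpha\left[1 - \frac{d}{2(2+d)} - \frac{2}{2+d}\right]} \log^{\beta-1} n
  = n^{\frac{\alpha d}{2(2+d)}} \log^{\beta-1} n \;\longrightarrow\; \infty .
```

In Case 1, taking $h_n$ of exactly the same order as $\pi_n$ would give the exponent $(\beta-1)\left[1 - \frac{d}{2+d} - \frac{2}{2+d}\right] = 0$, so the ratio would stay bounded; this is why the example uses a bandwidth that decays strictly more slowly.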
2.4 Simulation study
We conduct a simulation study to compare the effect of different delay scenarios on the
per-round average regret of our proposed strategy. The per-round regret is given by,
$$r_n(\delta) = \frac{1}{n} \sum_{j=1}^{n} \left(f^*(X_j) - f_{I_j}(X_j)\right).$$
Note that if $\frac{1}{n} \sum_{j=1}^{n} f^*(X_j)$ is eventually bounded above and away from 0 with
probability 1, then $R_n(\delta) \to 1$ a.s. is equivalent to $r_n(\delta) \to 0$ a.s.
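The relation between the per-round regret and the ratio in (2.1) can be checked numerically. The helper below is a small sketch with a hypothetical name, `regret_summaries`; it takes the sequences $f^*(X_j)$ and $f_{I_j}(X_j)$ as arrays and returns both running quantities, which satisfy $r_n = (1 - R_n) \cdot \frac{1}{n}\sum_{j=1}^{n} f^*(X_j)$.

```python
import numpy as np

def regret_summaries(f_star, f_chosen):
    """Running per-round regret r_n and reward ratio R_n for n = 1, ..., len(f_star).

    f_star:   optimal mean rewards f*(X_j)
    f_chosen: mean rewards of the arms actually pulled, f_{I_j}(X_j)
    """
    f_star = np.asarray(f_star, dtype=float)
    f_chosen = np.asarray(f_chosen, dtype=float)
    n = np.arange(1, f_star.size + 1)
    r = np.cumsum(f_star - f_chosen) / n            # per-round regret
    R = np.cumsum(f_chosen) / np.cumsum(f_star)     # ratio in (2.1)
    return r, R
```

Because the running mean of $f^*(X_j)$ multiplies $1 - R_n$, the two convergence statements are equivalent whenever that mean stays bounded above and away from zero.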
2.4.1 Simulation setup
Consider number of arms, $\ell = 3$, and the covariate space to be two-dimensional, $d = 2$.
Let $X_n = (X_{n1}, X_{n2})$ where $X_{ni} \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(0, 1)$. We assume that the errors
$\varepsilon_n \overset{\text{ind}}{\sim} 0.5\,N(0, 1)$. The first 30 rounds were used for initialization. The following true mean
reward functions are used,
$$f_1(x) = 0.7(x_1 + x_2), \quad f_2(x) = 0.5 x_1^{0.75} + \sin(x_2), \quad f_3(x) = \frac{2 x_1}{0.5 + (1.5 + x_2)^{1.5}}.$$
We consider the following delay scenarios and run simulations until $N = 10000$.
1) No delay.
2) Delay 1: Geometric delay with probability of success (observing the reward) $p = 0.3$.
3) Delay 2: Every 5th reward is not observed by time $N$ and other rewards are obtained
with a geometric ($p = 0.3$) delay.
4) Delay 3: Each case has probability 0.7 to delay and the delay is half-normal with scale
parameter $\sigma = 1500$.
5) Delay 4: In this case we increase the number of non-observed rewards. Divide the data
into four equal consecutive parts (quarters), such that, in part 1, we only observe every
10th observation (with Geom(0.3) delay) by time $N$ and do not observe the remaining ones;
in part 2, we only observe every 15th observation; in part 3, only every 20th
observation; in part 4, only every 25th observation.
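The simulation ingredients described above can be sketched as follows. This is an illustrative reading of the setup, not the code used to produce the reported results, and it rests on several assumptions that the text does not pin down: the names `mean_rewards` and `draw_delays` are hypothetical; geometric delays count the number of trials (values 1, 2, ...); "every $k$th" means positions $k, 2k, \ldots$ counted from the start of the data or of each quarter; the half-normal delay is the absolute value of a $N(0, \sigma^2)$ draw; and a reward that is never observed by time $N$ is encoded as an infinite delay.

```python
import numpy as np

def mean_rewards(x):
    """True mean rewards (f1, f2, f3) at covariate(s) x with last axis of size 2."""
    x = np.asarray(x, dtype=float)
    x1, x2 = x[..., 0], x[..., 1]
    f1 = 0.7 * (x1 + x2)
    f2 = 0.5 * x1 ** 0.75 + np.sin(x2)
    f3 = 2.0 * x1 / (0.5 + (1.5 + x2) ** 1.5)
    return np.stack([f1, f2, f3], axis=-1)

def draw_delays(scenario, N, rng):
    """Delays d_1, ..., d_N for one scenario; np.inf marks 'not observed by time N'."""
    geom = rng.geometric(0.3, size=N).astype(float)  # Geom(0.3), values 1, 2, ...
    if scenario == "none":
        return np.zeros(N)
    if scenario == "delay1":
        return geom
    if scenario == "delay2":
        d = geom.copy()
        d[4::5] = np.inf                             # every 5th reward is never observed
        return d
    if scenario == "delay3":
        delayed = rng.random(N) < 0.7                # each case delays with probability 0.7
        half_normal = np.abs(rng.normal(0.0, 1500.0, size=N))
        return np.where(delayed, half_normal, 0.0)
    if scenario == "delay4":
        d = np.full(N, np.inf)
        q = N // 4
        for part, step in enumerate([10, 15, 20, 25]):
            idx = np.arange(part * q, (part + 1) * q)
            keep = idx[step - 1::step]               # every step-th case within the quarter
            d[keep] = geom[keep]
        return d
    raise ValueError(f"unknown scenario: {scenario}")
```

Covariates and errors for this setup would be drawn as `rng.uniform(size=(N, 2))` and `0.5 * rng.normal(size=N)`, and the observation time of round $j$ is $t_j = j + d_j$.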
In Figure 2.1, we plot the per-round regret vs time by delay type for four combina-
tions of pin and hn. As one would expect (see Figure 2.1), the severity of delay has a
clear effect on the regret, and for delay scenarios where a large number of rewards are
not observed in finite time, the regret is comparatively higher. Note that most delay
scenarios for which a substantial number of rewards can be obtained in finite time, tend
to converge in quite similar patterns.
[Figure 2.1 appears here: four panels plotting average regret (vertical axis, 0.00 to 0.15)
against time index (horizontal axis, 0 to 10000), each with curves for No delay, Delay 1,
Delay 2, Delay 3 and Delay 4. Panel titles: $\pi_n = n^{-1/4}$, $h_n = n^{-1/6}$;
$\pi_n = n^{-1/4}$, $h_n = \log(n)^{-1}$; $\pi_n = n^{-1/6}$, $h_n = n^{-1/6}$; $\pi_n = n^{-1/6}$, $h_n = \log(n)^{-1}$.]
Figure 2.1: Per-round regret for the proposed strategy under different delay scenarios. The grid of plots represents four different combinations of choices for $\{\pi_n\}$ and $\{h_n\}$. For a given row, $\pi_n$ remains fixed and $h_n$ varies, and vice versa for columns.
Choice of $\{\pi_n\}$ and $\{h_n\}$: According to Theorem 2, if $\pi_n$ and $h_n$ are chosen such that condition (2.3) is met, consistency of the allocation rule follows. Therefore, for the case with d = 2, which is the case in the simulation setting, we have to choose sequences that decay more slowly than ($\pi_n = n^{-1/2}$, $h_n = n^{-1/2}$), even in the case of no delays. Keeping this in mind, we consider two choices for $\pi_n$ ($n^{-1/4}$, $n^{-1/6}$) and two choices for $h_n$ ($\log^{-1} n$, $n^{-1/6}$). Note that, in Figure 2.1, for a given row, $\pi_n$ remains fixed while $h_n$ varies, and vice versa for columns. It can be seen that the regret gets worse when $h_n$ decays too fast (in our range of n, as N = 10000), especially for the scenario (Delay 4) with an increasing number of non-observed rewards, possibly because of a violation of condition (2.3). Also notice that a slowly decaying $\pi_n$ has higher regret (last row). This could be because of a large randomization error that leads to a high exploration price. In general, there is a large pool of choices for $h_n$ and $\pi_n$ that satisfy condition (2.3), as can be seen from Figure 2.1. The simulation study is replicated 60 times and the averaged per-round regret is plotted in section 2.6, revealing very similar trends and results. The best choice of the hyperparameter sequences amongst those considered seems to be $h_n = n^{-1/6}$ and $\pi_n = \log^{-2} n$, as seen in Figure 2.2. It is also observed in Figure 2.2 that using a fast decaying $\{\pi_n\}$ results in better performance (low cumulative regret) for relatively slow to moderate delay scenarios, that is, for the no-delay scenario and Delays 1-3. However, a thorough understanding of the finite-time regret rates and further research would be needed to evaluate optimal choices of $\{\pi_n\}$ and $\{h_n\}$ for a given scenario.
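To see how the candidate sequences compare within the simulated horizon, one can simply tabulate them. The sketch below assumes natural logarithms and uses hypothetical names; it is meant only to make the finite-range ordering of the sequences concrete, not to reproduce any result of the thesis.

```python
import math

# Candidate sequences discussed in the text, as functions of the index n >= 2.
SEQUENCES = {
    "n^(-1/4)": lambda n: n ** -0.25,
    "n^(-1/6)": lambda n: n ** (-1.0 / 6.0),
    "1/log(n)": lambda n: 1.0 / math.log(n),
    "1/log(n)^2": lambda n: 1.0 / math.log(n) ** 2,
}


def tabulate(ns=(100, 1000, 10000)):
    """Evaluate every candidate sequence at the given indices."""
    return {name: [seq(n) for n in ns] for name, seq in SEQUENCES.items()}


if __name__ == "__main__":
    for name, values in tabulate().items():
        print(name, ["%.4f" % v for v in values])
```

At n = 10000 the value of $\log^{-2} n$ is well below that of $n^{-1/4}$, so within this range it behaves as the fastest decaying exploration sequence among those listed, even though it decays more slowly asymptotically.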
[Figure 2.2: three panels of average regret (y-axis, 0.00 to 0.15) against time index (x-axis, 0 to 10000), each with curves for No delay and Delay 1 to Delay 4. Panel titles: ($\pi_n = \log(n)^{-2}$, $h_n = n^{-1/6}$); ($\pi_n = \log(n)^{-2}$, $h_n = n^{-1/4}$); ($\pi_n = \log(n)^{-2}$, $h_n = \log(n)^{-1}$).]
Figure 2.2: Per-round regret averaged over 60 replications for the proposed strategy in section 2.2 for different delay situations. $\pi_n = \log^{-2} n$ and $h_n$ decays faster as we move from left to right.
2.5 Proofs
2.5.1 Proof of consistency of the proposed strategy
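Proof of Theorem 1. Since the ratio $R_n(\delta_\pi)$ is always upper bounded by 1, we only need to work on the lower bound direction. Note that,
\[
\begin{aligned}
R_n(\delta_\pi) &= \frac{\sum_{j=1}^n f_{\hat{i}_j}(X_j)}{\sum_{j=1}^n f^*(X_j)} + \frac{\sum_{j=1}^n \big(f_{I_j}(X_j) - f_{\hat{i}_j}(X_j)\big)}{\sum_{j=1}^n f^*(X_j)} \\
&\ge \frac{\sum_{j=1}^n f_{\hat{i}_j}(X_j)}{\sum_{j=1}^n f^*(X_j)} - \frac{\frac{1}{n}\sum_{j=1}^n A\, I\{I_j \neq \hat{i}_j\}}{\frac{1}{n}\sum_{j=1}^n f^*(X_j)},
\end{aligned}
\]
where the inequality follows from Assumption 2.2.2.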
Let $U_j = I\{I_j \neq \hat{i}_j\}$. Since $(1/n)\sum_{j=1}^n f^*(X_j)$ converges a.s. to $E f^*(X) > 0$, the second term on the right hand side of the above inequality converges to zero almost surely if $(1/n)\sum_{j=1}^n U_j \xrightarrow{a.s.} 0$. Note that for $j \ge m_0 + 1$, the $U_j$'s are independent Bernoulli random variables with success probability $(\ell - 1)\pi_j$. Since
\[
\sum_{j=m_0+1}^{\infty} \mathrm{Var}\Big(\frac{U_j}{j}\Big) = \sum_{j=m_0+1}^{\infty} \frac{(\ell - 1)\pi_j\,\big(1 - (\ell - 1)\pi_j\big)}{j^2} < \infty,
\]
we have that $\sum_{j=m_0+1}^{\infty} (U_j - (\ell - 1)\pi_j)/j$ converges almost surely. It then follows by Kronecker's lemma that
\[
\frac{1}{n}\sum_{j=1}^{n} \big(U_j - (\ell - 1)\pi_j\big) \xrightarrow{a.s.} 0.
\]
We know that $\pi_j \to 0$ as $j \to \infty$ (the speed depending on the delay times). Thus $(1/n)\sum_{j=1}^n (\ell - 1)\pi_j \to 0$, since the running average of a sequence that converges to zero also converges to zero. Hence, $(1/n)\sum_{j=1}^n U_j \to 0$ a.s.
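To show that $R_n(\delta_\pi) \xrightarrow{a.s.} 1$, it remains to show that
\[
\frac{\sum_{j=1}^n f_{\hat{i}_j}(X_j)}{\sum_{j=1}^n f^*(X_j)} \xrightarrow{a.s.} 1, \quad \text{or equivalently,} \quad \frac{\sum_{j=1}^n \big(f_{\hat{i}_j}(X_j) - f^*(X_j)\big)}{\sum_{j=1}^n f^*(X_j)} \xrightarrow{a.s.} 0.
\]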
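Recall from section 2.2.1 that, given the observed reward timings $\{t_j : t_j \le n, 1 \le j \le n\}$, $\{s_k : k = 1, \dots, \tau_n\}$ denotes the reordered sequence of the observed reward timings, arranged in increasing order. Then for any $j$ with $m_0 + 1 \le j \le n$, there exists an $s_{k_j}$, $k_j \in \{1, \dots, \tau_n\}$, such that $s_{k_j} \le j < s_{k_j + 1}$. Also, note that as $j \to \infty$, we also have that $k_j \to \infty$. By the definition of $\hat{i}_j$, for $j \ge m_0 + 1$, $\hat{f}_{\hat{i}_j, s_{k_j}}(X_j) \ge \hat{f}_{i^*(X_j), s_{k_j}}(X_j)$ and thus,
\[
\begin{aligned}
f_{\hat{i}_j}(X_j) - f^*(X_j) &= f_{\hat{i}_j}(X_j) - \hat{f}_{\hat{i}_j, s_{k_j}}(X_j) + \hat{f}_{\hat{i}_j, s_{k_j}}(X_j) - \hat{f}_{i^*(X_j), s_{k_j}}(X_j) \\
&\quad + \hat{f}_{i^*(X_j), s_{k_j}}(X_j) - f^*(X_j) \\
&\ge f_{\hat{i}_j}(X_j) - \hat{f}_{\hat{i}_j, s_{k_j}}(X_j) + \hat{f}_{i^*(X_j), s_{k_j}}(X_j) - f_{i^*(X_j)}(X_j) \\
&\ge -2 \sup_{1 \le i \le \ell} \|\hat{f}_{i, s_{k_j}} - f_i\|_\infty.
\end{aligned}
\]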
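For $1 \le j \le m_0$, we have $f_{\hat{i}_j}(X_j) - f^*(X_j) \ge -A$. Based on Assumption 2.2.1, $\|\hat{f}_{i, s_{k_j}} - f_i\|_\infty \xrightarrow{a.s.} 0$ as $j \to \infty$ for each $i$, and thus $\sup_{1 \le i \le \ell} \|\hat{f}_{i, s_{k_j}} - f_i\|_\infty \xrightarrow{a.s.} 0$. Then it follows that, for $n > m_0$,
\[
\frac{\sum_{j=1}^n \big(f_{\hat{i}_j}(X_j) - f^*(X_j)\big)}{\sum_{j=1}^n f^*(X_j)} \ge \frac{-A m_0/n - (2/n)\sum_{j=m_0+1}^{n} \sup_{1 \le i \le \ell} \|\hat{f}_{i, s_{k_j}} - f_i\|_\infty}{(1/n)\sum_{j=1}^n f^*(X_j)}.
\]
The right hand side converges to 0 almost surely and hence the conclusion follows.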
2.5.2 A probability bound for the histogram method
Consider the regression model, with i dropped for notational convenience,
\[
Y_j = f(x_j) + \varepsilon_j,
\]
where the $\varepsilon_j$'s are independent errors satisfying the moment condition in Assumption 2.3.2 of section 2.3.1. Let $W_1, \dots, W_n$ be Bernoulli random variables that indicate whether arm i is chosen or not, that is, $W_j = I\{I_j = i\}$. Assume that, for each $1 \le j \le n$, $W_j$ is independent of $\{\varepsilon_k : k \ge j\}$. Let $\hat{f}_n$ be the histogram estimator of $f$. Let $A_n$ denote the set of indices (time points) for which the rewards were observed by time n, that is, $A_n := \{j : t_j \le n\}$, and let $X^n = \{X_1, X_2, \dots, X_n\}$ denote the design points until time n.
Proof of Lemma 2.3.5. Note that the inequality of Lemma 2.3.5 trivially holds if $\min_{1 \le b \le M} N_b = 0$. Therefore, we assume that $\min_{1 \le b \le M} N_b > 0$. Let $N(x)$ denote the number of time points for which the corresponding design points $x_j$ fall in the same cube as $x$ and for which the corresponding rewards are observed by time n. Let $J(x)$ denote the set of indices $1 \le j \le n$ of such design points. Let $\bar{J}(x)$ be the subset of $J(x)$ where arm i is chosen (i.e. where $W_j = 1$) and let $\bar{N}(x)$ be the number of such design points (note that i is dropped for notational convenience).
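For arm i, we consider the histogram estimator
\[
\begin{aligned}
\hat{f}_n(x) &= \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} Y_j \\
&= f(x) + \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \big(f(x_j) - f(x)\big) + \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \varepsilon_j \\
\Rightarrow \ |\hat{f}_n(x) - f(x)| &\le w(h; f) + \Big| \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \varepsilon_j \Big|,
\end{aligned}
\]
where $w(h; f)$ is the modulus of continuity. For any $\varepsilon > w(h; f)$, with the given design points and the time points for which rewards have been observed by time n,
\[
P_{A_n, X^n}\big(\|\hat{f}_n - f\|_\infty \ge \varepsilon\big) \le P_{A_n, X^n}\Big( \sup_{x} \Big| \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \varepsilon_j \Big| \ge \varepsilon - w(h; f) \Big).
\]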
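Note that, in the same small cube B, $N(x)$ and $\bar{N}(x)$, and likewise $J(x)$ and $\bar{J}(x)$, are the same for any x. Let $x_0$ be a fixed point in B. Then consider,
\[
\begin{aligned}
& P_{A_n, X^n}\Big( \sup_{x \in B} \Big| \frac{1}{\bar{N}(x)} \sum_{j \in \bar{J}(x)} \varepsilon_j \Big| \ge \varepsilon - w(h; f) \Big) \\
&= P_{A_n, X^n}\Big( \Big| \sum_{j \in \bar{J}(x_0)} \varepsilon_j \Big| \ge \bar{N}(x_0)\,(\varepsilon - w(h; f)) \Big) \\
&= P_{A_n, X^n}\Big( \Big| \sum_{j \in J(x_0)} W_j \varepsilon_j \Big| \ge N(x_0)\,\frac{\bar{N}(x_0)}{N(x_0)}\,(\varepsilon - w(h; f)) \Big) \\
&= P_{A_n, X^n}\Big( \Big| \sum_{j \in J(x_0)} W_j \varepsilon_j \Big| \ge N(x_0)\,\frac{\bar{N}(x_0)}{N(x_0)}\,(\varepsilon - w(h; f)),\ \frac{\bar{N}(x_0)}{N(x_0)} > \frac{\pi_n}{2} \Big) \\
&\quad + P_{A_n, X^n}\Big( \Big| \sum_{j \in J(x_0)} W_j \varepsilon_j \Big| \ge N(x_0)\,\frac{\bar{N}(x_0)}{N(x_0)}\,(\varepsilon - w(h; f)),\ \frac{\bar{N}(x_0)}{N(x_0)} \le \frac{\pi_n}{2} \Big) \\
&\le P_{A_n, X^n}\Big( \Big| \sum_{j \in J(x_0)} W_j \varepsilon_j \Big| \ge N(x_0)\,\frac{\pi_n}{2}\,(\varepsilon - w(h; f)) \Big) + P_{A_n, X^n}\Big( \frac{\bar{N}(x_0)}{N(x_0)} \le \frac{\pi_n}{2} \Big) \\
&\le 2 \exp\Big( -\frac{N(x_0)\,\pi_n^2\,(\varepsilon - w(h; f))^2}{8\big(v^2 + c(\pi_n/2)(\varepsilon - w(h; f))\big)} \Big) + \exp\Big( -\frac{3 N(x_0)\,\pi_n}{28} \Big),
\end{aligned}
\]
where the last inequality follows from inequalities (A.4) and (A.2) in Chapter A, respectively. For applying (A.4), we used the fact that $W_j$ is independent of the $\varepsilon_k$'s for all $k \ge j$, since $W_j$ depends only on the previous observations and $X_j$.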
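Since $N_b$ denotes the number of design points in the bth small cube whose rewards are observed by time n, a union bound over the M cubes gives
\[
P_{A_n, X^n}\big(\|\hat{f}_n - f\|_\infty \ge \varepsilon\big) \le M \exp\Big( -\frac{3 \big(\min_{1 \le b \le M} N_b\big)\,\pi_n}{28} \Big) + 2M \exp\Big( -\frac{\big(\min_{1 \le b \le M} N_b\big)\,\pi_n^2\,(\varepsilon - w(h; f))^2}{8\big(v^2 + c(\pi_n/2)(\varepsilon - w(h; f))\big)} \Big).
\]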
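To get a feel for how the bound of Lemma 2.3.5 behaves, it can be evaluated numerically. The sketch below is only an illustration under assumed inputs: the function name and the example values of $\varepsilon$, $w(h; f)$, $\pi_n$, M, v and c are hypothetical, and the result is capped at 1 because it bounds a probability.

```python
import math


def histogram_tail_bound(eps, w, pi_n, min_nb, M, v, c):
    """Right hand side of the histogram probability bound, capped at 1.

    eps    : accuracy level epsilon (must exceed w)
    w      : modulus of continuity w(h; f)
    pi_n   : exploration probability
    min_nb : smallest number of observed design points in any cube
    M      : number of cubes in the partition
    v, c   : constants of the Bernstein-type moment condition
    """
    if eps <= w:
        raise ValueError("the bound requires eps > w(h; f)")
    gap = eps - w
    first = M * math.exp(-3.0 * min_nb * pi_n / 28.0)
    second = 2.0 * M * math.exp(
        -min_nb * pi_n ** 2 * gap ** 2 / (8.0 * (v ** 2 + c * (pi_n / 2.0) * gap))
    )
    return min(1.0, first + second)


if __name__ == "__main__":
    for min_nb in (0, 100, 10_000, 100_000):
        print(min_nb, histogram_tail_bound(0.5, 0.1, 0.1, min_nb, 16, 0.5, 0.5))
```

With these example values the bound is uninformative for small cube counts and becomes small only once the least populated cube holds many observed rewards, which reflects the $\pi_n^2$ factor in the exponent.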
2.6 Supplementary simulation results
We conducted a simulation study in section 2.4 to illustrate the choices of $\pi_n$ and $h_n$ under different delay scenarios. The results were presented in Figure 2.1, where the y-axis of the graphs represented the per-round regret and the x-axis represented time. Those results came from a single run of the algorithm proposed in section 2.2. In this section, we run the proposed algorithm for 60 replications to get a more precise understanding of its performance. We note that the trends observed are very similar to those noted in section 2.4, except that the curves look much smoother. It can be seen that the average regret gets worse when $h_n$ decays too fast, especially for the scenario (Delay 4) with an increasing number of non-observed rewards, possibly because of a violation of condition (2.3). Also notice that a slowly decaying $\pi_n$ has higher regret (last row). This could be because of a large randomization error that leads to a high exploration price.
[Figure 2.3: four panels of average regret (y-axis, 0.00 to 0.15) against time index (x-axis, 0 to 10000), each with curves for No delay and Delay 1 to Delay 4. Panel titles: ($\pi_n = n^{-1/4}$, $h_n = n^{-1/6}$); ($\pi_n = n^{-1/4}$, $h_n = \log(n)^{-1}$); ($\pi_n = n^{-1/6}$, $h_n = n^{-1/6}$); ($\pi_n = n^{-1/6}$, $h_n = \log(n)^{-1}$).]
Figure 2.3: Per-round regret averaged over 60 replications for the proposed strategy in section 2.2 for different delay situations. The grid of plots represents four different combinations of $\{h_n\}$ and $\{\pi_n\}$. For a given row, $\pi_n$ remains fixed and $h_n$ varies, and vice versa for columns.
Chapter 3
To update or not to update?
In this chapter, through slight modifications to the randomized allocation strategy that was proposed in chapter 2 for contextual bandits with delays, we illustrate that one can achieve lower cumulative regret by adjusting the balance between exploration and exploitation when faced with large delays. We employ a nonparametric regression framework to model the mean reward functions and allow for unbounded delays. In the presence of delayed feedback, one may choose between using the original exploration sequence, which updates at every time point, or updating the sequence only when a new reward is observed, leading to two competing strategies. In this chapter, we prove that while both strategies lead to strongly consistent allocation, the property holds for a wider scope of situations for the strategy that updates the exploration probability only when a new reward is observed. However, we also illustrate that both strategies have their own advantages and disadvantages depending on the severity of the delay and the underlying reward generating mechanisms.
3.1 Problem setup
The setup of the problem is the same as the one in section 2.1 in chapter 2. Recall that there are $\ell > 1$ competing arms. The covariates are assumed to be d-dimensional random variables generated according to an unknown underlying probability distribution $P_X$, from a population supported in $[0, 1]^d$. We adopt a regression perspective to model the relationship between covariates and rewards,
\[
Y_{i,j} = f_i(X_j) + \varepsilon_j,
\]
where the $\varepsilon_j$'s are independent errors with $E(\varepsilon_j) = 0$ and $\mathrm{Var}(\varepsilon_j) < \infty$ for all $1 \le i \le \ell$ and $j \ge 1$.
Now, the problem can be viewed as one of estimating the mean reward functions $f_i(x)$ for $i \in \{1, \dots, \ell\}$ and allocating arms based on the estimators $\hat{f}_i$. We follow a nonparametric approach to estimating these functions.
In our setup, the rewards can be obtained at some delayed time, which we denote by $\{t_j \in \mathbb{R}^+, j \ge 1\}$. The delay in the reward for pulling arm $I_j$ is given by the random variable $d_j := t_j - j$. We assume that $\{d_j : d_j \ge 0, j \ge 1\}$ is a sequence of independent random variables. Let the number of rewards obtained by time n be denoted by $\tau_n = \sum_{j=1}^n I(t_j \le n)$, which is also a random variable.
We devise a sequential allocation strategy $\delta$, incorporating delayed rewards, such that it chooses arms sequentially based on previous observations and present covariates. As a measure of performance of the strategy, we consider the following ratio,
\[
R_n(\delta) = \frac{\sum_{j=1}^n f_{I_j}(X_j)}{\sum_{j=1}^n f^*(X_j)}, \tag{3.1}
\]
where $f^*(x) = \max_{1 \le i \le \ell} f_i(x)$ is the theoretical best mean reward at x, and $i^*(x)$ is the corresponding arm. We establish strong consistency of $\delta$ in section 3.3, that is, we show that $R_n(\delta) \to 1$ with probability 1 as $n \to \infty$. Our goal is to compare the two allocation strategies proposed in section 3.2 and use simulation results to illustrate how both can be advantageous in different situations.
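As a small worked example of the ratio in (3.1), the sketch below computes $R_n$ from a table of true mean rewards and a sequence of pulled arms. The function name and the toy numbers are hypothetical; in a simulation the table would hold the values $f_i(X_j)$, which are known only to the simulator.

```python
def reward_ratio(mean_rewards, pulled):
    """Ratio R_n: achieved mean reward over the best achievable mean reward.

    mean_rewards[j][i] holds f_i(X_j); pulled[j] is the 0-based index of arm I_j.
    """
    achieved = sum(row[arm] for row, arm in zip(mean_rewards, pulled))
    best = sum(max(row) for row in mean_rewards)
    return achieved / best


if __name__ == "__main__":
    table = [[0.2, 0.5], [0.9, 0.1], [0.4, 0.4]]
    print(reward_ratio(table, [1, 0, 1]))  # always pulls a best arm
    print(reward_ratio(table, [0, 0, 0]))  # misses the best arm in round 1
```

A strategy is strongly consistent in the sense used here when this ratio tends to 1 almost surely as the number of rounds grows.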
3.2 The proposed strategies
Define $Z^{n,i}$ to be the set of observations for arm i whose rewards have been obtained up to time $n - 1$, that is, $Z^{n,i} := \{(X_j, Y_{i,j}) : 1 \le t_j \le n - 1 \text{ and } I_j = i\}$. Let $\hat{f}_{i,n}$ denote the regression estimator of $f_i$ based on the data $Z^{n,i}$. Let $\{\pi_j, j \ge 1\}$ be a sequence of positive numbers in $[0, 1]$ decreasing to zero, such that $(\ell - 1)\pi_j < 1$ for all $j \ge 1$. We propose two strategies, $\eta_1$ and $\eta_2$, which share the same algorithmic structure and differ only subtly in the arm selection step.
3.2.1 Algorithm
Step 1. Initialize. Allocate each arm once, $I_1 = 1, I_2 = 2, \dots, I_\ell = \ell$. Since the rewards are not immediately obtained for each of these $\ell$ arms, we continue these forced allocations until we have at least one reward observed for each arm. Suppose that happens at time $m_0$.
Step 2. Estimate the individual functions $f_i$. For $n = m_0 + 1$, based on $Z^{n,i}$, estimate $f_i$ by $\hat{f}_{i,n}$ for $1 \le i \le \ell$ using the chosen regression procedure.
Step 3. Estimate the best arm. For $X_n$, let $\hat{i}_n(X_n) = \arg\max_{1 \le i \le \ell} \hat{f}_{i,n}(X_n)$.
Step 4. Select and pull. Recall that $\tau_n = \sum_{j=1}^n I(t_j \le n)$ is the number of rewards observed by time n.
(a) Strategy $\eta_1$:
\[
I_n = \begin{cases} \hat{i}_n, & \text{with probability } 1 - (\ell - 1)\pi_n, \\ i, & \text{with probability } \pi_n, \ i \neq \hat{i}_n, \ 1 \le i \le \ell. \end{cases}
\]
(b) Strategy $\eta_2$:
\[
I_n = \begin{cases} \hat{i}_n, & \text{with probability } 1 - (\ell - 1)\pi_{\tau_n}, \\ i, & \text{with probability } \pi_{\tau_n}, \ i \neq \hat{i}_n, \ 1 \le i \le \ell. \end{cases}
\]
Step 5. Update the estimates.
Step 5a. If a reward is obtained at the nth time (could be one or more rewards corresponding to one or more arms $I_j$, $1 \le j \le n$), update the function estimates of $f_i$ for the respective arm (or arms) for which the reward (or rewards) are obtained at the nth time.
Step 5b. If no reward is obtained at the nth time, use the previous function estimators, i.e. $\hat{f}_{i,n+1} = \hat{f}_{i,n}$ for all $i \in \{1, \dots, \ell\}$.
Step 6. Repeat. Repeat steps 3-5 when the next covariate $X_{n+1}$ surfaces, and so on.
In the algorithm above, Step 1 initializes the allocations by pulling each arm in turn until we observe at least one reward for each arm. Step 2 estimates the mean reward function for each arm. This could be done using several regression methods; we use kernel regression and the histogram method in this work. Steps 3 and 4 enforce an $\varepsilon$-greedy type of randomization scheme, which prefers the best performing arm so far with some probability and explores with the remaining probability. The preference is determined by a user-determined sequence of exploration probabilities $\{\pi_n, n \ge 1\}$, which for strategy $\eta_2$ only gets updated when a new reward is observed, that is, $\pi_{\tau_n}$, while for strategy $\eta_1$ it is updated at every time point irrespective of whether a reward is observed or not, that is, $\pi_n$. Hence, the two strategies differ in the extent of exploration and exploitation that is allowed over time. Finally, in Step 5, the mean reward function estimators are updated if new rewards are observed, and remain the same otherwise.
Note that the user-determined sequence underlying both $\pi_n$ in $\eta_1$ and $\pi_{\tau_n}$ in $\eta_2$ is the same. Therefore, for notational convenience, we use $\{\cdot\}$ to denote a user-determined sequence, such as $\{\pi_n\}$, when we only want to refer to the original sequence selected by the user, without distinguishing between when it gets updated.
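The only difference between the two strategies is the index at which the exploration sequence is evaluated in Step 4. The following minimal sketch makes that explicit. It assumes 0-based arm labels and a user-supplied callable `pi_seq`; all names are hypothetical and the code is an illustration of the selection step, not the implementation used for the simulations.

```python
import random


def exploration_probability(pi_seq, n, tau_n, strategy):
    """pi_n for strategy 'eta1', pi_{tau_n} for strategy 'eta2'."""
    if strategy == "eta1":
        return pi_seq(n)
    if strategy == "eta2":
        return pi_seq(tau_n)
    raise ValueError("strategy must be 'eta1' or 'eta2'")


def select_arm(best_arm, num_arms, pi_seq, n, tau_n, strategy, rng):
    """Step 4: keep the estimated best arm w.p. 1 - (l - 1) * pi, otherwise explore."""
    explore = exploration_probability(pi_seq, n, tau_n, strategy)
    if not 0.0 < (num_arms - 1) * explore < 1.0:
        raise ValueError("need 0 < (l - 1) * pi < 1")
    if rng.random() < 1.0 - (num_arms - 1) * explore:
        return best_arm
    others = [arm for arm in range(num_arms) if arm != best_arm]
    return rng.choice(others)  # each remaining arm is pulled with probability pi


if __name__ == "__main__":
    pi = lambda k: k ** -0.25
    # At time n = 10000 with only tau_n = 256 rewards observed so far:
    print(exploration_probability(pi, 10000, 256, "eta1"))  # exploration under eta1
    print(exploration_probability(pi, 10000, 256, "eta2"))  # larger under eta2
```

Under heavy delays $\tau_n$ lags far behind n, so $\eta_2$ keeps exploring at a higher rate than $\eta_1$ until more feedback arrives.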
3.3 Consistency of the proposed strategy
Let $A_n := \{j : t_j \le n\}$ denote the set of time points for which rewards were observed by time n. We make the following assumptions.
Assumption 3.3.1. The regression procedure is strongly consistent in $L_\infty$ norm for all individual mean functions $f_i$ under the proposed allocation scheme. That is, $\|\hat{f}_{i,n} - f_i\|_\infty \xrightarrow{a.s.} 0$ as $n \to \infty$ for each $1 \le i \le \ell$, where $\hat{f}_{i,n}$ is the estimator based on all previously observed rewards.
Due to the presence of delays, the mean reward function estimators are only updated
at the time points where a new reward is observed. Therefore, we need this strong con-
sistency in estimation to hold when the mean reward functions are only being updated
at the observed data points, which seems to be a somewhat stronger condition. Next,
we make a mild assumption on the mean reward functions.
Assumption 3.3.2. The mean reward functions are continuous and $f_i(x) \ge 0$, such that
\[
A = \sup_{1 \le i \le \ell} \ \sup_{x \in [0,1]^d} \big(f^*(x) - f_i(x)\big) < \infty \quad \text{and} \quad E(f^*(X_1)) > 0.
\]
Theorem 3.3.3. Under Assumptions 3.3.1 and 3.3.2, the allocation rules $\eta_1$ and $\eta_2$ are strongly consistent as $n \to \infty$.
Proof. Note that consistency holds only when the sequence $\{\pi_n, n \ge 1\}$ is chosen such that $\pi_n \to 0$ as $n \to \infty$. The proof is very similar to the proof in section 2.5.1 in chapter 2, so we skip it here.
Note that Assumption 3.3.1, although seemingly natural, is a strong assumption, and additional work is required to verify it for a particular regression setting. We verify this assumption for the histogram method in section 3.3.2 and for the kernel method in section 3.3.3. On the other hand, Assumption 3.3.2 is a mild assumption on the mean reward functions and does not require any verification.
3.3.1 The histogram method
In this section, we consider the histogram method for the setting with delayed rewards. We assume that the binwidth h is chosen such that 1/h is an integer. At time n, partition $[0, 1]^d$ into $M = (1/h_{\tau_n})^d$ hypercubes with side width $h_{\tau_n}$, where $\tau_n$ is the number of observed rewards by time n. For some $x \in [0, 1]^d$ that falls in a hypercube $B(x)$, let $\bar{J}_i(x) = \{j : X_j \in B(x), t_j \le n, I_j = i\}$ and let $\bar{N}_i(x)$ be the size of $\bar{J}_i(x)$. Then the histogram estimate for $f_i(x)$ is defined as,
\[
\hat{f}_{i,n}(x) = \frac{1}{\bar{N}_i(x)} \sum_{j \in \bar{J}_i(x)} Y_j. \tag{3.2}
\]
For the estimator to behave well, a proper choice of the binwidth sequence $\{h_n\}$ is necessary. Note that we only update $h_n$ to $h_{n+1}$ when a new reward is observed; hence we denote it as $h_{\tau_n}$. For notational convenience, when the analysis is focused on a single arm, i is dropped from the subscript of $\hat{f}$, $\bar{N}$ and $\bar{J}$. Next, we prove consistency for the proposed strategies in section 3.2.1 using the histogram method.
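The histogram estimate in (3.2) averages the rewards of arm i that have already been observed and whose covariates share a cube with x. A minimal sketch under assumed conventions follows: list positions stand for rounds, `t_obs[j]` is the observation time $t_j$, cubes are half-open with the right edge folded into the last cube, and `None` is returned when no observed reward falls in the cube. These names and conventions are illustrative assumptions.

```python
def histogram_estimate(x, arm, n, h, X, Y, pulled, t_obs):
    """Histogram estimate of f_arm(x) at time n from rewards observed by time n.

    X[j]      : covariate of round j, a tuple in [0, 1]^d
    Y[j]      : reward generated in round j
    pulled[j] : arm pulled in round j
    t_obs[j]  : time at which the reward of round j is observed
    h         : binwidth, assumed to satisfy that 1/h is an integer
    """
    bins = round(1.0 / h)

    def cube(point):
        # Index of the hypercube containing the point.
        return tuple(min(int(coord * bins), bins - 1) for coord in point)

    target = cube(x)
    rewards = [
        Y[j]
        for j in range(len(X))
        if pulled[j] == arm and t_obs[j] <= n and cube(X[j]) == target
    ]
    if not rewards:
        return None  # no observed reward for this arm in the cube yet
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    X = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (0.3, 0.2)]
    Y = [1.0, 3.0, 10.0, 5.0]
    pulled = [0, 0, 0, 0]
    t_obs = [1, 2, 3, 100]  # the last reward is heavily delayed
    print(histogram_estimate((0.25, 0.25), 0, 10, 0.5, X, Y, pulled, t_obs))
    print(histogram_estimate((0.25, 0.25), 0, 100, 0.5, X, Y, pulled, t_obs))
```

The two calls differ only in the time n, which shows how a delayed reward enters the estimate once its observation time has passed.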
3.3.2 Allocation with histogram estimates
As already discussed in section 3.3, we only need to verify that Assumption 3.3.1 holds for the histogram method. Along with Assumptions 3.3.1 and 3.3.2, we make the following assumptions.
Assumption 3.3.4. The design distribution $P_X$ is dominated by the Lebesgue measure with a density $p(x)$ uniformly bounded above and away from 0 on $[0, 1]^d$; that is, $p(x)$ satisfies $\underline{c} \le p(x) \le \bar{c}$ for some positive constants $\underline{c} < \bar{c}$.
This assumption is needed to make sure that all regions in the covariate space are observed with positive probability, in order to ensure good estimation in all regions.
Assumption 3.3.5. The errors satisfy a moment condition: there exist positive constants v and c such that, for all integers $m \ge 2$, the extended Bernstein condition (Birgé et al. (1998); Qian and Yang (2016a)) is satisfied, that is,
\[
E|\varepsilon_{ij}|^m \le \frac{m!}{2}\, v^2 c^{m-2}.
\]
This condition on the errors holds in many settings; for example, normally distributed errors and bounded errors meet this requirement, which makes it useful in a wide range of applications.
The next two assumptions are made on the nature of the delays in observing rewards,
so that we could ensure that delays are not being confounded by other factors and we
observe a minimum number of rewards with time, so as to ensure proper and effective
learning.
Assumption 3.3.6. The delays $\{d_j, j \ge 1\}$ are independent of each other, of the choice of arms, and of the covariates.
Assumption 3.3.7. Let the partial sums of delay distributions satisfy $E(\tau_n) = \Omega(q(n))$ (see footnote 1), where $q(n)$ is a sequence that acts as a lower bound to the expected number of observed rewards by time n, and $q(n) \to \infty$ as $n \to \infty$.
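Lemma 3.3.8. Let $\varepsilon > 0$ be given. Suppose that h is small enough such that $w(h; f) < \varepsilon$. Then the histogram estimator $\hat{f}_n$ satisfies,
\[
P^{\eta_1}_{A_n, X^n}\big(\|\hat{f}_n - f\|_\infty \ge \varepsilon\big) \le M \exp\Big( -\frac{3 \pi_n \min_{1 \le b \le M} N_b}{28} \Big) + 2M \exp\Big( -\frac{\min_{1 \le b \le M} N_b \, \pi_n^2 \, (\varepsilon - w(h_{\tau_n}; f))^2}{8\big(v^2 + c(\pi_n/2)(\varepsilon - w(h_{\tau_n}; f))\big)} \Big), \tag{3.3}
\]
\[
P^{\eta_2}_{A_n, X^n}\big(\|\hat{f}_n - f\|_\infty \ge \varepsilon\big) \le M \exp\Big( -\frac{3 \pi_{\tau_n} \min_{1 \le b \le M} N_b}{28} \Big) + 2M \exp\Big( -\frac{\min_{1 \le b \le M} N_b \, \pi_{\tau_n}^2 \, (\varepsilon - w(h_{\tau_n}; f))^2}{8\big(v^2 + c(\pi_{\tau_n}/2)(\varepsilon - w(h_{\tau_n}; f))\big)} \Big), \tag{3.4}
\]
where $P_{A_n, X^n}$ denotes conditional probability given the design points $X^n = (X_1, \dots, X_n)$ and $A_n = \{j : t_j \le n\}$. Here, $N_b$ is the number of design points for which the rewards have been observed by time n such that they fall in the bth small cube of the partition of the unit cube at time n.
Footnote 1: $f(n) = \Omega(g(n))$ if, for some positive constant c, $f(n) \ge c\,g(n)$ when n is large enough.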
Proof. The proof of Lemma 3.3.8 is similar to the proof of Lemma 2.3.5 in section 2.5.2 in chapter 2. For strategy $\eta_1$, an identical lemma with $h_n$ replaced by $h_{\tau_n}$ can be derived. For strategy $\eta_2$, $\pi_n$ is replaced by $\pi_{\tau_n}$ and $h_n$ by $h_{\tau_n}$. This is because the result is a conditional probability result and, given $A_n$ and $X^n$, $\tau_n$ is a known quantity.
Theorem 3.3.9. Suppose Assumptions 3.3.2-3.3.7 are satisfied.
a) If $\{h_n\}$ and $\{\pi_n\}$ are chosen to satisfy,
\[
\frac{h^d_{q(n)}\, \pi_n^2\, q(n)}{\log n} \to \infty \quad \text{as } n \to \infty, \tag{3.5}
\]
then the histogram estimator in (3.2) is strongly consistent in the $L_\infty$ norm for strategy $\eta_1$, hence $\eta_1$ is strongly consistent.
b) If $\{h_n\}$ and $\{\pi_n\}$ are chosen to satisfy,
\[
\frac{h^d_{q(n)}\, \pi_{q(n)}^2\, q(n)}{\log n} \to \infty \quad \text{as } n \to \infty, \tag{3.6}
\]
then the histogram estimator in (3.2) is strongly consistent in the $L_\infty$ norm for strategy $\eta_2$, hence $\eta_2$ is strongly consistent.
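Proof. The proofs for a) and b) are quite similar, so we prove b) here and then discuss a). Given $A_n$, the indices corresponding to when rewards were obtained, we know that at time n the histogram method partitions the unit cube into $M = (1/h_{\tau_n})^d$ small cubes. For each small cube $B_b$, $1 \le b \le M$, in the partition, let $N_b = \sum_{j=1}^n I(X_j \in B_b, t_j \le n)$. Note that, given $A_n$, for each $j \in A_n$ we have $P_{A_n}(X_j \in B_b, t_j \le n) = P(X_j \in B_b) \ge \underline{c}\, h^d_{\tau_n}$ by Assumptions 3.3.4 and 3.3.6; thus, using inequality (A.2), we have
\[
\begin{aligned}
& P_{A_n}\Big( N_b \le \frac{\underline{c}\, h^d_{\tau_n} \tau_n}{2} \Big) \le \exp\Big( -\frac{3 \underline{c}\, h^d_{\tau_n} \tau_n}{28} \Big) \\
\Rightarrow \ & P_{A_n}\Big( \min_{1 \le b \le M} N_b \le \frac{\underline{c}\, h^d_{\tau_n} \tau_n}{2} \Big) \le M \exp\Big( -\frac{3 \underline{c}\, h^d_{\tau_n} \tau_n}{28} \Big).
\end{aligned} \tag{3.7}
\]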
Recall that $\tau_n = \sum_{j=1}^n I\{t_j \le n\}$. First, we show that $\tau_n \xrightarrow{a.s.} \infty$ as $n \to \infty$ for both strategies, $\eta_1$ and $\eta_2$. By Assumption 3.3.7, for large enough n there exists a positive constant $a_1 > 0$ such that $E(\tau_n) \ge a_1 q(n)$. Then, using inequality (A.2) in Lemma A.1.6, we get
\[
P\Big( \tau_n \le \frac{a_1 q(n)}{2} \Big) \le P\Big( \tau_n \le \frac{E(\tau_n)}{2} \Big) \le \exp\Big( -\frac{3 E(\tau_n)}{28} \Big) \le \exp\Big( -\frac{3 a_1 q(n)}{28} \Big).
\]
Now, it is easy to see that the upper bound is summable in n under the conditions (3.5) and (3.6). By the Borel-Cantelli lemma, this implies that the event $\{\tau_n \le a_1 q(n)/2\}$ occurs only finitely often almost surely, so that $\tau_n > a_1 q(n)/2$ for all sufficiently large n, and therefore $\tau_n \xrightarrow{a.s.} \infty$. Note that, by construction, this implies that $h_{\tau_n} \xrightarrow{a.s.} 0$ and $\pi_{\tau_n} \xrightarrow{a.s.} 0$ as $n \to \infty$. Let $w(h_{\tau_n}; f_i)$ be the modulus of continuity as in Definition A.0.2. Then, continuity of $f_i$ leads to the conclusion that $w(h_{\tau_n}; f_i) \xrightarrow{a.s.} 0$ as $n \to \infty$. Thus, for any $\varepsilon > 0$, for large enough n, when $h_{\tau_n}$ is small enough, $\varepsilon - w(h_{\tau_n}; f_i) \ge \varepsilon/2$ almost surely. Consider,
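\[
\begin{aligned}
P_{A_n}\big(\|\hat{f}_{i,n} - f_i\|_\infty \ge \varepsilon\big) &= P_{A_n}\Big(\|\hat{f}_{i,n} - f_i\|_\infty \ge \varepsilon, \ \min_{1 \le b \le M} N_b > \frac{\underline{c}\, h^d_{\tau_n} \tau_n}{2}\Big) \\
&\quad + P_{A_n}\Big(\|\hat{f}_{i,n} - f_i\|_\infty \ge \varepsilon, \ \min_{1 \le b \le M} N_b \le \frac{\underline{c}\, h^d_{\tau_n} \tau_n}{2}\Big) \\
&\le E_{X^n} P_{A_n, X^n}\Big(\|\hat{f}_{i,n} - f_i\|_\infty \ge \varepsilon, \ \min_{1 \le b \le M} N_b > \frac{\underline{c}\, h^d_{\tau_n} \tau_n}{2}\Big) \\
&\quad + P_{A_n}\Big(\min_{1 \le b \le M} N_b \le \frac{\underline{c}\, h^d_{\tau_n} \tau_n}{2}\Big),
\end{aligned}
\]
where $E_{X^n}$ denotes expectation with respect to $X^n$, which appears by applying the law of iterated expectations. From (3.4) in Lemma 3.3.8 and (3.7), we get that,
\[
\begin{aligned}
P_{A_n}\big(\|\hat{f}_{i,n} - f_i\|_\infty \ge \varepsilon\big) &\le M \exp\Big( -\frac{3 \underline{c}\, \pi_{\tau_n} h^d_{\tau_n} \tau_n}{56} \Big) + 2M \exp\Big( -\frac{\underline{c}\, h^d_{\tau_n} \pi_{\tau_n}^2 \tau_n \, (\varepsilon - w(L h_{\tau_n}; f_i))^2}{8\big(v^2 + c(\pi_{\tau_n}/2)\varepsilon\big)} \Big) \\
&\quad + M \exp\Big( -\frac{3 \underline{c}\, h^d_{\tau_n} \tau_n}{28} \Big).
\end{aligned} \tag{3.8}
\]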
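Now consider,
\[
\begin{aligned}
P\big(\|\hat{f}_{i,n} - f_i\|_\infty > \varepsilon\big) &\le P\Big(\|\hat{f}_{i,n} - f_i\|_\infty \ge \varepsilon, \ \tau_n > \frac{E(\tau_n)}{2}\Big) + P\Big(\tau_n \le \frac{E(\tau_n)}{2}\Big) \\
&\le E_{A_n} P_{A_n}\Big(\|\hat{f}_{i,n} - f_i\|_\infty \ge \varepsilon, \ \tau_n > \frac{E(\tau_n)}{2}\Big) + P\Big(\tau_n \le \frac{E(\tau_n)}{2}\Big).
\end{aligned}
\]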
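Let $n_e = \lfloor E(\tau_n)/2 \rfloor$. Then, by using condition (3.6) and (3.8), we have that, for large enough n,
\[
\begin{aligned}
P\big(\|\hat{f}_{i,n} - f_i\|_\infty > \varepsilon\big) &\le M \exp\Big( -\frac{3 \underline{c}\, \pi_{n_e} h^d_{n_e} n_e}{56} \Big) + 2M \exp\Big( -\frac{\underline{c}\, h^d_{n_e} \pi_{n_e}^2 n_e \, (\varepsilon - w(L h_{n_e}; f_i))^2}{8\big(v^2 + c(\pi_{n_e}/2)\varepsilon\big)} \Big) \\
&\quad + M \exp\Big( -\frac{3 \underline{c}\, h^d_{n_e} n_e}{28} \Big) + \exp\Big( -\frac{3 n_e}{14} \Big) \\
&\le M \exp\Big( -\frac{3 \tilde{c}\, \pi_{q(n)} h^d_{q(n)} q(n)}{112} \Big) + 2M \exp\Big( -\frac{\tilde{c}\, h^d_{q(n)} \pi_{q(n)}^2 q(n) \, (\varepsilon - w(L h_{q(n)}; f_i))^2}{16\big(v^2 + c(\pi_{q(n)}/2)\varepsilon\big)} \Big) \\
&\quad + M \exp\Big( -\frac{3 \tilde{c}\, h^d_{q(n)} q(n)}{56} \Big) + \exp\Big( -\frac{3 a_1 q(n)}{28} \Big),
\end{aligned} \tag{3.9}
\]
where $\tilde{c}$ is a new constant that incorporates functions of $a_1$ and $\underline{c}$. It can be seen that the above upper bound is summable in n under the condition
\[
\frac{h^d_{q(n)}\, \pi_{q(n)}^2\, q(n)}{\log n} \to \infty. \tag{3.10}
\]
Since $\varepsilon$ is arbitrary, by the Borel-Cantelli lemma, we have that $\|\hat{f}_{i,n} - f_i\|_\infty \to 0$ almost surely. This is true for all arms $1 \le i \le \ell$. Note that the result a) could be similarly obtained by using (3.3) from Lemma 3.3.8 to obtain a result similar to (3.8) but with $\pi_n$ instead of $\pi_{\tau_n}$.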
3.3.3 Kernel regression
We can obtain analogous results for strong consistency of strategies $\eta_1$ and $\eta_2$ using the Nadaraya-Watson estimator. Consider a nonnegative kernel function $K(u) : \mathbb{R}^d \to \mathbb{R}$ that satisfies the following Lipschitz and boundedness conditions.
Assumption 3.3.10. For some constant $0 < \lambda < \infty$, $|K(u) - K(u')| \le \lambda \|u - u'\|_\infty$ for all $u, u' \in \mathbb{R}^d$.
Assumption 3.3.11. There exist constants $L_1 \le L$, $c_3 > 0$ and $c_4 \ge 1$ such that $K(u) = 0$ for $\|u\|_\infty > L$, $K(u) \ge c_3$ for $\|u\|_\infty \le L_1$, and $K(u) \le c_4$ for all $u \in \mathbb{R}^d$.
Recall that $\tau_n = \sum_{j=1}^n I(t_j \le n)$ is the number of observed rewards by time n. Define $J_{i,n+1} = \{j : I_j = i, t_j \le n, 1 \le j \le n\}$, that is, the set of time points corresponding to pulling of arm i whose rewards have been observed by time n. Let $M_{i,n+1}$ denote the size of $J_{i,n+1}$.
Let $h_{\tau_n}$ denote the bandwidth, where $h_{\tau_n} \to 0$ almost surely as $n \to \infty$. For each arm i, the Nadaraya-Watson estimator of $f_i(x)$ is defined as,
\[
\hat{f}_{i,n+1}(x) = \frac{\sum_{j \in J_{i,n+1}} Y_{i,j}\, K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)}. \tag{3.11}
\]
Theorem 3.3.12. Suppose Assumptions 3.3.2-3.3.11 are satisfied.
1. If $\{h_n\}$ and $\{\pi_n\}$ are chosen to satisfy,
\[
\frac{q(n)\, h^{2d}_{q(n)}\, \pi_n^4}{\log n} \to \infty,
\]
then the Nadaraya-Watson estimator defined in (3.11) is strongly consistent in $L_\infty$ norm for strategy $\eta_1$.
2. If $\{h_n\}$ and $\{\pi_n\}$ are chosen to satisfy,
\[
\frac{q(n)\, h^{2d}_{q(n)}\, \pi_{q(n)}^4}{\log n} \to \infty, \tag{3.12}
\]
then the Nadaraya-Watson estimator defined in (3.11) is strongly consistent in $L_\infty$ norm for strategy $\eta_2$.
Proof. The proof for this theorem can be found in section 3.6.
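The Nadaraya-Watson estimate with delayed rewards can be sketched in a few lines. The kernel below, $K(u) = \max(0, 1 - \|u\|_\infty)$, is an assumed example rather than the kernel used in the thesis: it is Lipschitz in the sup norm with $\lambda = 1$, vanishes for $\|u\|_\infty > 1$, is at least 0.5 for $\|u\|_\infty \le 0.5$, and is bounded by 1, so it is compatible with the Lipschitz and boundedness conditions stated for the kernel. The function names, the list-based data layout and the `None` return value for an empty neighbourhood are likewise hypothetical.

```python
def tent_kernel(u):
    """K(u) = max(0, 1 - sup-norm of u): bounded, Lipschitz, compactly supported."""
    return max(0.0, 1.0 - max(abs(coord) for coord in u))


def nadaraya_watson(x, arm, n, h, X, Y, pulled, t_obs, kernel=tent_kernel):
    """Kernel-weighted average of the rewards of `arm` observed by time n.

    X[j], Y[j], pulled[j], t_obs[j] describe round j: covariate, reward,
    pulled arm and observation time of the reward; h is the bandwidth.
    """
    numerator = 0.0
    denominator = 0.0
    for j in range(len(X)):
        if pulled[j] != arm or t_obs[j] > n:
            continue  # wrong arm, or reward not yet observed
        weight = kernel(tuple((a - b) / h for a, b in zip(x, X[j])))
        numerator += weight * Y[j]
        denominator += weight
    return numerator / denominator if denominator > 0 else None


if __name__ == "__main__":
    X = [(0.5, 0.5), (0.6, 0.5), (0.9, 0.9)]
    Y = [2.0, 4.0, 100.0]
    pulled = [0, 0, 0]
    t_obs = [1, 2, 3]
    print(nadaraya_watson((0.5, 0.5), 0, 3, 0.2, X, Y, pulled, t_obs))
```

In the strategies discussed here the bandwidth passed in would be $h_{\tau_n}$, so it shrinks only when new rewards arrive.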
3.4 Comparison of strategies, η1 and η2
Note that in chapter 2 we obtained a consistency result for the histogram method analogous to Theorem 3.3.9: for $q(n)$ as in Assumption 3.3.7, if $h_n$ and $\pi_n$ are chosen to satisfy (2.6), that is,
\[
\frac{h_n^d\, \pi_n^2\, q(n)}{\log n} \to \infty \quad \text{as } n \to \infty, \tag{3.13}
\]
then the proposed allocation rule is strongly consistent.
Now if we compare (3.5), (3.6) and (3.13), we see that (3.13) ⇒ (3.5) ⇒ (3.6), but not vice versa. Therefore (3.6) seems to give more options for the choice of the user-determined sequences $\{h_n\}$ and $\{\pi_n\}$ to achieve consistency, while there may be a trade-off in the regret rate, as we will see in the simulations. The rate of decrease of the average cumulative regret can be slow for some choices of $\{\pi_n\}$ and $\{h_n\}$ that satisfy both (3.13) and (3.5). Note that a similar relationship is noticed in Theorem 3.3.12 when using kernel regression. To understand which choices of hyper-parameter sequences help minimize the cumulative regret, let us consider the regret for strategy $\eta$,
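\[
\begin{aligned}
R_N(\eta) = \sum_{j=m_0+1}^{N} \big(f^*(X_j) - f_{I_j}(X_j)\big) &= \sum_{j=m_0+1}^{N} \big(f_{i^*_j}(X_j) - \hat{f}_{i^*_j}(X_j) + \hat{f}_{i^*_j}(X_j) - \hat{f}_{I_j}(X_j) + \hat{f}_{I_j}(X_j) - f_{I_j}(X_j)\big) \\
&\le \sum_{j=m_0+1}^{N} \big(f_{i^*_j}(X_j) - \hat{f}_{i^*_j}(X_j) + \hat{f}_{\hat{i}_j}(X_j) - \hat{f}_{I_j}(X_j) + \hat{f}_{I_j}(X_j) - f_{I_j}(X_j)\big) \\
&\le \sum_{j=m_0+1}^{N} \Big( 2 \sup_{1 \le i \le \ell} |f_i(X_j) - \hat{f}_i(X_j)| + A\, I\{I_j \neq \hat{i}_j\} \Big).
\end{aligned}
\]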
Thus the cumulative regret can be roughly decomposed into estimation error and randomization error. For the no-delay setting, both of these error components are studied in a finite-time setting in Qian and Yang (2016b), where it is shown that $\{h_n\}$ and $\{\pi_n\}$ can be chosen so as to achieve an optimal (minimax) rate of convergence for the regret. In their work, the choices of $\{h_n\}$ and $\{\pi_n\}$ also depend on the smoothness parameter of the underlying mean reward functions. Thus, in situations where the underlying mean reward functions are simple and smoother, $\{h_n\}$ and $\{\pi_n\}$ are chosen to be fast decaying to achieve optimal rates of convergence in no-delay situations. Similarly, for scenarios where the underlying mean reward functions are more complex, $\{h_n\}$ and $\{\pi_n\}$ are chosen to be relatively slowly decaying in order to guarantee optimal rates under no-delay scenarios. The question that arises in the presence of delayed rewards is: how should the sequences $\{h_n\}$ and $\{\pi_n\}$ be updated so as to minimize the resulting cumulative regret? That is, should one update $\pi_n$ to $\pi_{n+1}$ (and $h_n$ to $h_{n+1}$) at every time point irrespective of observing a reward, or only update upon observing a new reward? Let us try to understand the impact of delay and the reward generating mechanisms on the two components of cumulative regret to answer this question.
Different nonparametric methods could be used for estimation purposes, and estimation accuracy largely depends on the complexity of the underlying mean reward functions and the amount of data available for estimation. The binwidth of methods like histogram and kernel regression is usually a function of the number of data points available for estimation at a given time. Therefore, in the presence of delayed rewards, $h_{\tau_n}$ ($\tau_n$ being the number of observed rewards until n) seems to be the sensible choice for the binwidth. Choosing $h_n$ may lead to inefficient estimation due to the unavailability of data points in some small neighborhoods of $[0, 1]^d$. Therefore, employing a binwidth sequence that guarantees optimal rates of convergence in the no-delay setting, and that updates only when a new reward is obtained, seems to be the right choice from an estimation point of view. Hence, we only consider the policies ($\eta_1$ and $\eta_2$) that employ $h_{\tau_n}$ as the chosen binwidth sequence. It is important to note that, from an asymptotic point of view, our theoretical results (Theorem 3.3.9) show that estimation will improve with time; the present discussion, however, takes a finite-time perspective.
In terms of randomization error, delayed rewards affect this directly through the randomization scheme. This is tied to the exploration-exploitation dilemma, which is in turn controlled by the exploration probability $\{\pi_n\}$. In the following illustrations, we explain why balancing exploration and exploitation in the presence of delayed rewards comes down to how the sequence $\{\pi_n\}$ is updated, and why the right decision can vary across situations.
Illustration 1. Suppose that the underlying mean reward functions are not too
complex and are well-separated. In this setting, it will be easy to get good functional
estimates over time, even with the small sample of information available due to the presence of
large delays. Since the no-delay case is well-studied, for such a setting we could choose
an exploration probability sequence $\{\pi_n\}$ that gives the optimal rate of convergence
according to Qian and Yang (2016b). Now, with the delays, we need to decide whether
we want to use $\pi_n$ or $\pi_{\tau_n}$ as the exploration probability sequence. In this setting,
it would perhaps be advantageous to opt for $\pi_n$, which updates at every time step
irrespective of whether a reward is obtained or not. This is because in settings where
the underlying functions are somewhat easier to learn, the major contribution to the regret
would come from the randomization error. To illustrate this, let $\mathrm{Rand}_j(\eta_1)$ and
$\mathrm{Rand}_j(\eta_2)$ denote the indicator $I(I_j \neq \hat{i}_j)$ for $\eta_1$ and $\eta_2$, respectively. Let
$\sigma_t = \min\{\bar{n} : \sum_{j=m_0+1}^{\bar{n}} I(t_j \le N) \ge t\}$, that is, $\sigma_t$ is the time index where the $t$th reward is observed.
Then we have that,
\[
E_{A_N}\Big(\sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2)\Big) = \sum_{j=m_0+1}^{N} P_{\eta_2, A_N}(I_j \neq \hat{i}_j) = \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1)\pi_t, \tag{3.14}
\]
where $E_{A_N}$ denotes conditional expectation given $A_N$, the set of indices for which the
rewards were observed by time $N$. Here, $\tau_N = \sum_{j=m_0+1}^{N} I(t_j \le N)$ is the number of rewards
observed between time $m_0$ and $N$. However, for strategy $\eta_1$, since the exploration
probability $\{\pi_j\}$ does not depend on delays, we have that,
\[
E\Big(\sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_1)\Big) = \sum_{j=m_0+1}^{N} P_{\eta_1}(I_j \neq \hat{i}_j) = \sum_{j=1}^{N-m_0-1} (\ell - 1)\pi_j. \tag{3.15}
\]
For brevity, let us denote $\bar{N} = N - m_0 - 1$ and start the counting process at
$m_0 + 1$. Now, given $\tau_N$, the minimum value that we can get for the R.H.S. in (3.14) is
when all the rewards from $m_0 + 1$ until $\tau_N$ are observed instantaneously and after that
no reward is observed until we hit the horizon $\bar{N}$. Likewise, an approximate maximum
value of the R.H.S. in (3.14) is achieved when the rewards for the $(m_0+1)$th through $(\bar{N}-\tau_N)$th
arms are not observed until time $(\bar{N} - \tau_N)$, and all the $\tau_N$ rewards are observed
from time $\bar{N} - \tau_N + 1$ to $\bar{N}$, respectively. Therefore,
\[
\min_{A_N} E_{A_N}\Big(\sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2)\Big) = (\ell - 1)\Big[\sum_{t=1}^{\tau_N - 1} \pi_t + (\bar{N} - \tau_N)\pi_{\tau_N}\Big],
\]
\[
\max_{A_N} E_{A_N}\Big(\sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2)\Big) = (\ell - 1)\Big[(\bar{N} - \tau_N)\pi_1 + \sum_{t=2}^{\tau_N} \pi_t\Big].
\]
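For the sake of illustration, assume that we observe a fraction of $\bar{N}$ by time $N$, that
is, $\tau_N = \alpha\bar{N}$, for some $\alpha \in (0, 1)$. Then we have that,
\[
\min E\Big(\sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2)\Big) = (\ell - 1)\Big[\sum_{t=1}^{\tau_N - 1} \pi_t + (1 - \alpha)\bar{N}\pi_{\tau_N}\Big], \tag{3.16}
\]
\[
\max E\Big(\sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2)\Big) = (\ell - 1)\Big[(1 - \alpha)\bar{N}\pi_1 + \sum_{t=2}^{\tau_N} \pi_t\Big]. \tag{3.17}
\]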
Notice that the terms $(1-\alpha)\bar{N}\pi_{\tau_N}$ and $(1-\alpha)\bar{N}\pi_1$ on the right-hand sides of (3.16) and (3.17) can
be fairly large and grow as $N$ increases for all reasonably fast choices of $\{\pi_n\}$, such as
$n^{-1/4}$ and $\log^{-1} n$. From (3.15), (3.16) and (3.17), we also get that,
\[
\sum_{t=\tau_N+1}^{\bar{N}} (\ell - 1)(\pi_1 - \pi_t) \ge E\Big(\sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2) - \mathrm{Rand}_j(\eta_1)\Big) \ge \sum_{t=\tau_N+1}^{\bar{N}} (\ell - 1)(\pi_{\tau_N} - \pi_t),
\]
where it can be seen that $\sum_{t=\tau_N+1}^{\bar{N}} (\ell - 1)(\pi_{\tau_N} - \pi_t) > 0$ for any $N$ and
$\sum_{t=\tau_N+1}^{\bar{N}} (\ell - 1)(\pi_1 - \pi_t) \to \infty$ as $N \to \infty$. Therefore, we see that using strategy $\eta_1$, which updates
$\pi_n$ at every time step irrespective of having observed a reward or not, gives a lower
randomization error on average as compared to strategy $\eta_2$. For example, if we choose
$\pi_n = n^{-1/4}$, $\alpha = 0.25$ (one-fourth of rewards observed), $m_0 = 30$ (initialization
phase) and time horizon $N = 10000$, then the average randomization error
difference approximately satisfies,
\[
0.23(\ell - 1) \ge \frac{E\big(\sum_{j=m_0+1}^{N} \mathrm{Rand}_j(\eta_2) - \mathrm{Rand}_j(\eta_1)\big)}{N - (m_0 + 1)} \ge 0.02(\ell - 1).
\]
Therefore, in situations where the underlying mean reward
functions are not too complex, the randomization error can be quite large and potentially
dominate the estimation error. Thus, using strategy $\eta_1$ could reduce the cumulative
regret substantially as compared to strategy $\eta_2$ in such situations.
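The numerical bounds in this example can be checked with a short Python sketch. The function name and the `first_index` argument are assumptions introduced here for illustration. The reported upper value of 0.23 appears to be reproduced when the first exploration probability is taken at index $m_0 + 1 = 31$ (the counting process starts at $m_0+1$) rather than at index 1; that reading is an assumption, not something stated in the text.

```python
import numpy as np

def pi_seq(n):
    # Exploration sequence from the example: pi_n = n^(-1/4).
    return np.asarray(n, dtype=float) ** -0.25

def randomization_gap_bounds(N, m0, alpha, first_index=1):
    """Per-round lower and upper bounds on the randomization-error gap between
    eta2 and eta1, divided by (ell - 1), following the displayed inequality."""
    N_bar = N - m0 - 1
    tau_N = int(alpha * N_bar)
    tail = pi_seq(np.arange(tau_N + 1, N_bar + 1))   # pi_t for t = tau_N+1, ..., N_bar
    lower = np.sum(pi_seq(tau_N) - tail) / N_bar
    upper = np.sum(pi_seq(first_index) - tail) / N_bar
    return lower, upper

# first_index=31 is an assumed reading of "we start the counting process at m0 + 1".
lower, upper = randomization_gap_bounds(N=10000, m0=30, alpha=0.25, first_index=31)
print(round(lower, 2), round(upper, 2))
```

With `first_index=1`, that is, taking $\pi_1 = 1$ literally, the same computation gives a larger upper value, so the upper bound is sensitive to where the sequence is assumed to start.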
Illustration 2. On the other hand, there are situations in which it may be better
to use strategy $\eta_2$ with $\pi_{\tau_n}$ (updating only when a new reward is observed) as the
exploration probability sequence. One example is a scenario where the best arm frequently
alternates over regions of the covariate space and it is hard
to tell a clear winner with the little information available due to the presence of large delays.
Another is when an arm is inferior over the majority of the covariate
space but superior, with a substantial reward gain, in a very small area of the domain;
under large delays, such under-represented regions may remain
unexplored. Accordingly, let us assume that the underlying mean reward functions are
somewhat complex. In such settings, we would need substantial exploration even in later
stages of the trial, especially in the presence of large delays. Here, in the hope of reducing
the randomization error, we could employ strategy $\eta_1$ and use an exploration probability
sequence $\pi_n$ that meets the conditions in Qian and Yang (2016b) ensuring optimal
convergence rates in no-delay situations. However, this could be disadvantageous in
such complex settings, because using $\eta_1$ may lead to insufficient exploration
of the inferior arms. Consider the event that a seemingly inferior arm is chosen
at time $t$, that is, $\{I_t \neq \hat{i}_t\}$. To ensure enough exploration, this
event must occur with a positive probability that is not too small, especially in the complex
settings discussed above. From Yang and Zhu (2002) and Qian and Yang (2016a) for
no-delay settings, we know that it is necessary to have
$\sum_{t=1}^{\infty} \pi_t = \infty$ for the algorithm to
perform optimally both asymptotically and in finite time. We also know that $\tau_N \to \infty$
as $N \to \infty$. Therefore, using both these facts, the sum of the probabilities of the events
$\{I_t \neq \hat{i}_t\}$, $t \ge 1$, over the time points where rewards were observed goes to $\infty$
for strategy $\eta_2$,
\[
\sum_{t=1}^{\tau_N} P_{\eta_2}(I_t \neq \hat{i}_t) = \sum_{t=1}^{\tau_N} (\ell - 1)\pi_t \xrightarrow{\text{a.s.}} \infty, \quad \text{as } N \to \infty,
\]
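whereas, for $\eta_1$, this sum could actually be summable in large-delay situations. Let
$\sigma_t = \min\{\bar{n} : \sum_{j=m_0+1}^{\bar{n}} I(t_j \le N) \ge t\}$. Let us assume that the observed rewards are
equally spaced, that is, $\sigma_t = tN/\tau_N$, assuming w.l.o.g. that $N/\tau_N$ is an integer. Then,
we have,
\[
\sum_{t=1}^{\tau_N} P_{\eta_1}(I_t \neq \hat{i}_t) = \sum_{t=1}^{\tau_N} (\ell - 1)\pi_{\sigma_t} = \sum_{t=1}^{\tau_N} (\ell - 1)\pi_{tN/\tau_N}.
\]
Now, it can be shown that this series is summable for a variety of choices of $\{\pi_n\}$. For
example, let us assume that $\pi_n = n^{-1/2}$; then for strategy $\eta_1$,
\[
\sum_{t=1}^{\tau_N} P_{\eta_1}(I_t \neq \hat{i}_t) = \sum_{t=1}^{\tau_N} (\ell - 1)\pi_{tN/\tau_N} = (\ell - 1)\sum_{t=1}^{\tau_N} \Big(\frac{tN}{\tau_N}\Big)^{-1/2} = (\ell - 1)\Big(\frac{N}{\tau_N}\Big)^{-1/2} \sum_{t=1}^{\tau_N} t^{-1/2} = O\Big(\frac{\tau_N}{\sqrt{N}}\Big). \tag{3.18}
\]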
If the number of observed rewards is small, say $\tau_N = O(\sqrt{N})$, then the series in (3.18)
is summable. Therefore, by the Borel-Cantelli Lemma, the events $\{I_t \neq \hat{i}_t\}$
can occur only finitely many times. This will lead to insufficient exploration
and could lead to large regret in areas that remain unexplored, especially in the
more complex settings. Therefore, if we employ strategy $\eta_1$ in such settings with large
delays, we may end up over-exploiting certain arms and as a result obtain an insufficient
number of rewards pertaining to a seemingly inferior arm, which may possibly yield
higher rewards in some unexplored regions in future. This would adversely affect the
performance and lead to high cumulative regret. Therefore, in scenarios like this, it
would be advantageous to use strategy $\eta_2$. In the next section, we demonstrate these
ideas using four different simulation setups and illustrate the performance of strategies
$\eta_1$ and $\eta_2$ in each of them. These insights also suggest that studying
adaptive strategies for updating these parameters more locally could be a promising area
to explore.
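A small numerical sketch of the contrast behind (3.18), under the same simplifying assumptions as the text (equally spaced observed rewards, $\pi_n = n^{-1/2}$) and with the additional assumption $\tau_N = \lfloor\sqrt{N}\rfloor$; the function names are hypothetical. It compares the total exploration mass accumulated at the observed-reward times under $\eta_1$ and $\eta_2$ as the horizon grows.

```python
import math
import numpy as np

def explore_mass_eta1(N, tau_N, n_arms=2):
    # Sum over observed-reward times sigma_t = t * N / tau_N of (ell - 1) * pi_{sigma_t}.
    t = np.arange(1, tau_N + 1, dtype=float)
    sigma_t = t * N / tau_N
    return (n_arms - 1) * np.sum(sigma_t ** -0.5)

def explore_mass_eta2(tau_N, n_arms=2):
    # Sum over t = 1..tau_N of (ell - 1) * pi_t, since eta2 indexes by observed rewards.
    t = np.arange(1, tau_N + 1, dtype=float)
    return (n_arms - 1) * np.sum(t ** -0.5)

horizons = [10**4, 10**6, 10**8]
mass_eta1 = [explore_mass_eta1(N, math.isqrt(N)) for N in horizons]
mass_eta2 = [explore_mass_eta2(math.isqrt(N)) for N in horizons]
```

With two arms the $\eta_1$ mass stays below 2 for every horizon, consistent with the $O(\tau_N/\sqrt{N})$ bound, while the $\eta_2$ mass keeps growing with $\tau_N$.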
3.5 Simulations
We conduct a simulation study to compare the per-round average regret for strategies
$\eta_1$ and $\eta_2$ under different delayed-reward scenarios. The per-round regret for a strategy $\eta$
is given by,
\[
r_n(\eta) = \frac{1}{n} \sum_{j=1}^{n} \big(f^*(X_j) - f_{I_j}(X_j)\big).
\]
Note that, if $\frac{1}{n}\sum_{j=1}^{n} f^*(X_j)$ is eventually bounded above and away from 0 with
probability 1, then $R_n(\eta) \to 1$ a.s. is equivalent to $r_n(\eta) \to 0$ a.s. The data are
generated from the following mean reward functions. We assume $d = 2$, $\ell = 2$ (or 3)
and $x \in [0, 1]^2$, and the simulations run until time $N = 10000$ with the first 20-30 rounds used for
initialization. For each of the setups, we define one-dimensional functions $g_1$ and $g_2$, and
then for $x_1, x_2 \in [0, 1]$, we define $f_1(x_1, x_2) = g_1(x_1)\,x_2$ and $f_2(x_1, x_2) = g_2(x_1)\,x_2$.
Setup 1: In this setup, we consider two well-separated sinusoidal functions, where
one is an upward-shifted version of the other.
\[
g_1(x) = -2\sin(20\pi x) + 3, \quad g_2(x) = -2\sin(20\pi x) + 2; \quad x \in [0, 1].
\]
Setup 2: Consider three piecewise-linear functions that are still well-separated but
over different regions in the covariate space. Then, $f_1(x_1, x_2) = x_2 g_1(x_1)$, $f_2(x_1, x_2) =
x_2 g_2(x_1)$, $f_3(x_1, x_2) = x_2 g_3(x_1)$.
\[
g_1(x) = \begin{cases} 1 & 0 \le x < 0.5 \\ -10x + 6 & 0.5 \le x < 0.6 \\ 0 & x \ge 0.6 \end{cases}, \qquad
g_2(x) = \begin{cases} 0 & 0 \le x < 0.5 \\ 10x - 5 & 0.5 \le x < 0.6 \\ 1 & x \ge 0.6 \end{cases},
\]
\[
g_3(x) = \begin{cases} 0 & 0 \le x < 0.3 \\ 20x - 6 & 0.3 \le x < 0.4 \\ 2 & 0.4 \le x < 0.6 \\ -20x + 14 & 0.6 \le x < 0.7 \\ 0 & x \ge 0.7. \end{cases}
\]
Setup 3: Consider two sinusoidal functions such that the best arm alternates rapidly
as the functions oscillate.
\[
g_1(x) = 2\cos(5\pi x) + 2, \quad g_2(x) = -2\sin(5\pi x) + 2, \quad \text{for } x \in [0, 1].
\]
Setup 4: Consider a setup where one arm dominates over the majority of the covariate
space, except for a small area where it incurs a considerably high regret.
\[
g_1(x) = 1 \ \text{for all } x \in [0, 1]; \qquad
g_2(x) = \begin{cases} 0 & 0 \le x < 0.5 \\ 100000x - 50000 & 0.5 \le x < 0.502 \\ 200 & 0.502 \le x < 0.503 \\ -100000x + 50500 & 0.503 \le x < 0.505 \\ 0 & 0.505 \le x \le 1. \end{cases}
\]
Note that, in our setup, $d = 2$, $f_1(x_1, x_2) = g_1(x_1)\,x_2$ and $f_2(x_1, x_2) = g_2(x_1)\,x_2$.
Notice that the functions above are constructed such that different data-generating
scenarios could be considered when comparing $\eta_1$ and $\eta_2$ for delayed reward settings,
keeping the discussion in section 3.4 in mind. These one-dimensional functions $g_i$, for
arm $i$, have been plotted in Figure 3.1.
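For readers who want to reproduce the four setups, the one-dimensional functions can be written down compactly in Python. This is an illustrative sketch with hypothetical names. Because every piecewise definition above is continuous and linear between its breakpoints, `np.interp` over the breakpoints is used here in place of explicit case analysis; that is a convenience chosen for this sketch, not the author's code.

```python
import numpy as np

def _pw(knots, values):
    # Continuous piecewise-linear function through the given breakpoints.
    return lambda x: np.interp(x, knots, values)

SETUPS = {
    1: [lambda x: -2 * np.sin(20 * np.pi * x) + 3,
        lambda x: -2 * np.sin(20 * np.pi * x) + 2],
    2: [_pw([0, 0.5, 0.6, 1], [1, 1, 0, 0]),
        _pw([0, 0.5, 0.6, 1], [0, 0, 1, 1]),
        _pw([0, 0.3, 0.4, 0.6, 0.7, 1], [0, 0, 2, 2, 0, 0])],
    3: [lambda x: 2 * np.cos(5 * np.pi * x) + 2,
        lambda x: -2 * np.sin(5 * np.pi * x) + 2],
    4: [lambda x: np.ones_like(np.asarray(x, dtype=float)),
        _pw([0, 0.5, 0.502, 0.503, 0.505, 1], [0, 0, 200, 200, 0, 0])],
}

def mean_rewards(setup, X):
    """Return f_i(x1, x2) = g_i(x1) * x2 for all arms; X has shape (n, 2)."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([g(X[:, 0]) * X[:, 1] for g in SETUPS[setup]])

F = mean_rewards(2, [[0.55, 1.0], [0.5, 0.5]])   # rows: covariates, columns: arms
```

The oracle arm at each covariate is then `F.argmax(axis=1)`, which is what the regret in section 3.5 is measured against.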
3.5.1 The simulation process and results
We simulate the data from the above-mentioned true mean reward functions as follows:
\[
Y_{i,j} = f_i(X_j) + 0.5\,\epsilon_j, \quad i \in \{1, 2, 3\},\ j \in \mathbb{N},
\]
where $\epsilon_j \overset{\text{i.i.d.}}{\sim} N(0, 1)$. We use the Nadaraya-Watson estimator with a Gaussian kernel to estimate
the mean reward functions. The algorithm in section 3.2.1 is run with strategies
$\eta_1$ and $\eta_2$. We consider the following choices of hyper-parameter sequences, but for the sake of
brevity our discussion illustrates only a few combinations:
\[
\pi_n \in \{n^{-1/4}, \log^{-1} n, \log^{-2} n;\ n \ge 1\} \quad \text{and} \quad h_n \in \{n^{-1/4}, n^{-1/6}, \log^{-1} n;\ n \ge 1\}.
\]
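As a reference point, a Nadaraya-Watson estimate with a Gaussian kernel can be sketched as follows. The function name and the handling of an all-zero denominator are assumptions made for this illustration; in the delayed setting the inputs would be the covariates and rewards in $J_{i,n+1}$ (those already observed for arm $i$) and the bandwidth $h_{\tau_n}$.

```python
import numpy as np

def nadaraya_watson(x, X_obs, Y_obs, h):
    """Kernel-weighted average of Y_obs at query points x.
    x: (m, d) or (d,), X_obs: (n, d), Y_obs: (n,), h: bandwidth > 0."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    X_obs = np.asarray(X_obs, dtype=float)
    Y_obs = np.asarray(Y_obs, dtype=float)
    scaled = (x[:, None, :] - X_obs[None, :, :]) / h        # shape (m, n, d)
    weights = np.exp(-0.5 * np.sum(scaled ** 2, axis=2))    # Gaussian kernel, up to a constant
    denom = weights.sum(axis=1)
    # Guard against a numerically zero denominator (an assumption of this sketch).
    return (weights @ Y_obs) / np.maximum(denom, np.finfo(float).tiny)

rng = np.random.default_rng(0)
X_obs = rng.random((200, 2))
Y_obs = X_obs[:, 1] + 0.5 * rng.standard_normal(200)        # toy rewards with noise level 0.5
est = nadaraya_watson([[0.5, 0.5], [0.2, 0.8]], X_obs, Y_obs, h=np.log(200) ** -1)
```

The normalizing constant of the Gaussian density cancels between numerator and denominator, so it is omitted.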
The algorithm is run for 60 replications (time horizon $N = 10000$), for both strategies
$\eta_1$ and $\eta_2$. Then the regret is averaged for each round (time point) over the replications,
to give a more accurate estimate of the total regret accumulated up to a given time point.
Since we incorporate delays in this work, we artificially create scenarios governing when
a reward will be observed. We consider the following delay scenarios in increasing
order of severity of delays:
No delay: Every reward is observed instantaneously.
Delay 1: Geometric delay with probability of success (observing the reward) $p = 0.3$.
Delay 2: Every 5th reward is not observed by time $N$ and other rewards are obtained
with a geometric ($p = 0.3$) delay.
Delay 3: Each case has probability 0.7 of being delayed, and the delay is half-normal with scale
parameter $\sigma = 1500$.
Delay 4: In this case we increase the number of non-observed rewards. Divide the data
into four equal consecutive parts (quarters), such that, in part 1, we only observe every
10th observation (with Geom(0.3) delay) by time $N$ and do not observe the remaining ones;
in part 2, we only observe every 15th observation; in part 3, only every 20th
observation; in part 4, only every 25th observation.
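The first three delay scenarios could be generated along the following lines. This is a hedged sketch: the function names are hypothetical, the geometric delay is taken as the number of rounds until the reward arrives (NumPy's convention, support starting at 1), and a never-observed reward is encoded as an infinite delay. None of these conventions is stated in the text.

```python
import numpy as np

def simulate_delays(scenario, N, rng):
    """Return an array of N delays d_j (in rounds) for delay scenarios 1, 2 or 3."""
    j = np.arange(1, N + 1)
    geom = rng.geometric(0.3, size=N).astype(float)
    if scenario == 1:
        return geom
    if scenario == 2:
        delays = geom.copy()
        delays[j % 5 == 0] = np.inf      # every 5th reward is never observed
        return delays
    if scenario == 3:
        delayed = rng.random(N) < 0.7    # each case is delayed with probability 0.7
        half_normal = np.abs(rng.normal(0.0, 1500.0, size=N))
        return np.where(delayed, half_normal, 0.0)
    raise ValueError("only scenarios 1, 2 and 3 are sketched here")

def observed_count(delays, n):
    # tau_n: number of rewards among the first n pulls with observation time j + d_j <= n.
    j = np.arange(1, n + 1)
    return int(np.sum(j + delays[:n] <= n))

rng = np.random.default_rng(1)
N = 10000
d2 = simulate_delays(2, N, rng)
d3 = simulate_delays(3, N, rng)
tau2, tau3 = observed_count(d2, N), observed_count(d3, N)
```

The gap between `observed_count(delays, n)` and `n` is what separates the two strategies: $\eta_1$ indexes $\pi$ by `n`, while $\eta_2$ indexes it by the observed count.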
In our simulations, we noted that the difference in the cumulative regret was most
discernible in the more extreme delay situations, that is, delay 3 and delay 4. Therefore,
we only illustrate the results on those two delay scenarios. The plots in Figure 3.1 can be
used to compare the performance of strategies $\eta_1$ and $\eta_2$; recall that $\eta_1$ uses $(\pi_n, h_{\tau_n})$
as hyper-parameters and $\eta_2$ uses $(\pi_{\tau_n}, h_{\tau_n})$.
The average regret is plotted on the y-axis against time on the x-axis. The rows in the
figure correspond to the simulation setups and columns 2 and 3 correspond to delay
3 and delay 4, respectively. For illustration, we only show the plots corresponding to
one choice of hyper-parameter sequences, $h_n = \log^{-1} n$ and $\pi_n = \log^{-1} n$; however,
results from other combinations show similar trends and are included in section 3.7.
Note that in setups 1 and 2, $\eta_1$ performs better than $\eta_2$ in terms of reducing the
overall average regret. Both these setups consist of underlying mean reward functions
that are well-separated, with clear winners in terms of reward gain in substantial portions
of the covariate space. Therefore, achieving good function estimation should not be a
problem. Thus in these examples, controlling the randomization error is crucial,
which is better achieved by using $\pi_n$ instead of $\pi_{\tau_n}$, as illustrated in section 3.4. On
the contrary, in setups 3 and 4, we notice that strategy $\eta_2$ performs better than $\eta_1$ in
terms of lower average regret. This can be attributed to the fact that the underlying
data generating functions in these setups are more complicated in a more localized way,
thus requiring more exploration for better estimation. Therefore, using $\pi_{\tau_n}$ instead of
$\pi_n$ helps reduce the risk of over-exploitation, especially in the more localized high-regret
zones. Another interesting observation is that for setups 1 and 2, the average
regret curves for strategies $\eta_1$ and $\eta_2$ are closer with delay 3 and much more separated with
delay 4. In setups 3 and 4, the opposite trend is seen: the difference in
the average regret curves for $\eta_1$ and $\eta_2$ is more pronounced with delay 3 than with
delay 4. A possible reason for this could be that the mean reward functions for setups 1
and 2 are easy to estimate, even with as few observations as with delay 4, so fast and
continuous exploitation helps reduce the regret. However, the underlying mean reward
functions in setups 3 and 4 are harder to learn, and perhaps with so few observations as
in delay 4, it is hard to get good estimates even with more exploration using $\pi_{\tau_n}$.
[Figure 3.1 panels, one row per setup (Setups 1 to 4). Column 1: the one-dimensional mean reward generating functions ($g_1$, $g_2$, and also $g_3$ for Setup 2) plotted against $x \in [0, 1]$. Columns 2 and 3: average regret against time index for $\eta_1$ and $\eta_2$ under Delay 3 and Delay 4, respectively, with $\pi_n = (\log n)^{-1}$ and $h_n = (\log n)^{-1}$.]
Figure 3.1: Strategy $\eta_1$ has lower cumulative average regret in setup 1 and 2 (first two
rows) and strategy $\eta_2$ has lower cumulative average regret in setup 3 and 4 (third
and fourth rows).
3.6 Other proofs
Recall, $J_{i,n+1} = \{j : I_j = i,\ t_j \le n,\ 1 \le j \le n\}$ and $M_{i,n+1}$ is the size of $J_{i,n+1}$,
$A = \{j : t_j \le n\}$ and $\tau_n = \sum_{j=1}^{n} I(t_j \le n)$.
Lemma 3.6.1. Under the setting of the kernel estimation in section 3.3.3, let $A \subset [0, 1]^d$
be a hypercube with side-width $h$. For a given arm $i$, if Assumptions 3.3.5, 3.3.6, 3.3.10
and 3.3.11 are satisfied, then for any $\epsilon > 0$,
\begin{align*}
P_{A_n, X^n}\Big(\sup_{x \in A} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) > \frac{\tau_n \epsilon}{1 - 1/\sqrt{2}}\Big)
&\le \exp\Big(-\frac{\tau_n \epsilon^2}{4 c_4^2 v^2}\Big) + \exp\Big(-\frac{\tau_n \epsilon}{4 c_4 c}\Big) \\
&\quad + \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k \tau_n \epsilon^2}{\lambda^2 v^2}\Big) + \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} \tau_n \epsilon}{2 \lambda c}\Big),
\end{align*}
where $P_{A_n, X^n}$ denotes conditional probability given $A_n = \{j : t_j \le n\}$ and $X^n =
\{X_1, \ldots, X_n\}$.
Proof. The proof of this lemma follows exactly from the analogous lemma without
delays in Qian and Yang (2016a). The results follow because we condition on $A_n$, and
given $A_n$, $\tau_n$ is a known quantity which plays the role of $n$ in the non-delayed situation
as in Qian and Yang (2016a).
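Next, we provide a proof for Theorem 3.3.12.
Proof of Theorem 3.3.12. Here, we prove the result for strategy $\eta_2$ and discuss how the
proof for strategy $\eta_1$ follows similarly. For each $x \in [0, 1]^d$,
\begin{align*}
|\hat{f}_{i,n+1}(x) - f_i(x)|
&= \left| \frac{\sum_{j \in J_{i,n+1}} Y_{i,j} K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} - f_i(x) \right| \\
&= \left| \frac{\sum_{j \in J_{i,n+1}} (f_i(X_j) + \epsilon_j) K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} - f_i(x) \right| \\
&= \left| \frac{\sum_{j \in J_{i,n+1}} (f_i(X_j) - f_i(x)) K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} + \frac{\sum_{j \in J_{i,n+1}} \epsilon_j K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} \right| \\
&\le \sup_{x, y : \|x - y\|_\infty \le L h_{\tau_n}} |f_i(x) - f_i(y)| + \left| \frac{\frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} \right|,
\end{align*}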
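where the last inequality follows from the bounded support assumption on the kernel function
$K(\cdot)$. We know from the proof of Theorem 3.3.9 that $\tau_n \to \infty$ almost surely as $n \to \infty$.
Thus, by uniform continuity of the function $f_i$,
\[
\lim_{n \to \infty} \sup_{x, y : \|x - y\|_\infty \le L h_{\tau_n}} |f_i(x) - f_i(y)| = 0, \quad \text{almost surely}.
\]
Therefore, we only need,
\[
\sup_{x \in [0,1]^d} \left| \frac{\frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\big(\frac{x - X_j}{h_{\tau_n}}\big)}{\frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\big(\frac{x - X_j}{h_{\tau_n}}\big)} \right| \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty. \tag{3.19}
\]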
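We first show that,
\[
\inf_{x \in [0,1]^d} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) > \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}, \tag{3.20}
\]
almost surely for large enough $n$. Indeed, for each $n \ge m_0 + 1$, given $\tau_n$, we can partition
the unit cube $[0, 1]^d$ into $\tilde{B}$ bins with bin width $L_1 h_{\tau_n}$ such that $\tilde{B} \le 1/(L_1 h_{\tau_n})^d$. We
denote these bins by $\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_{\tilde{B}}$. Let $\sigma_t = \inf\{\tilde{n} : \sum_{j=1}^{\tilde{n}} I(t_j \le n) \ge t\}$. Given an
arm $i$ and $1 \le k \le \tilde{B}$, for every $x \in \tilde{A}_k$, given $\tau_n$ we have that,
\begin{align*}
\sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big)
&= \sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i)\, K\Big(\frac{x - X_{\sigma_t}}{h_{\tau_n}}\Big) \\
&\ge \sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i, X_{\sigma_t} \in \tilde{A}_k)\, K\Big(\frac{x - X_{\sigma_t}}{h_{\tau_n}}\Big) \\
&\ge c_3 \sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i, X_{\sigma_t} \in \tilde{A}_k),
\end{align*}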
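where the last inequality follows from Assumption 3.3.11 (boundedness of kernels).
Therefore,
\begin{align*}
&P_{A_n, X^n}\Big(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Big) \\
&\quad \le P_{A_n, X^n}\Big(\inf_{x \in \tilde{A}_k} \frac{1}{\tau_n h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Big) \\
&\quad \le P_{A_n, X^n}\Big(\frac{c_3}{\tau_n h_{\tau_n}^d} \sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i, X_{\sigma_t} \in \tilde{A}_k) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Big) \\
&\quad \le P_{A_n, X^n}\Big(\sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i, X_{\sigma_t} \in \tilde{A}_k) \le \frac{\underline{c}\, \tau_n (L_1 h_{\tau_n})^d \pi_{\tau_n}}{2}\Big).
\end{align*}
Note that $P_{A_n, X^n}(I_{\sigma_t} = i, X_{\sigma_t} \in \tilde{A}_k) \ge \underline{c} (L_1 h_{\tau_n})^d \pi_{\tau_n}$, for all $1 \le t \le n$. Then,
\[
P_{A_n, X^n}\Big(\sum_{t=1}^{\tau_n} I(I_{\sigma_t} = i, X_{\sigma_t} \in \tilde{A}_k) \le \frac{\underline{c}\, \tau_n (L_1 h_{\tau_n})^d \pi_{\tau_n}}{2}\Big) \le \exp\Big(-\frac{3 \underline{c}\, \tau_n (L_1 h_{\tau_n})^d \pi_{\tau_n}}{28}\Big).
\]
Therefore, we get that,
\[
P_{A_n, X^n}\Big(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Big) \le \exp\Big(-\frac{3 \underline{c}\, \tau_n (L_1 h_{\tau_n})^d \pi_{\tau_n}}{28}\Big). \tag{3.21}
\]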
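Now consider,
\begin{align*}
&P\Big(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Big) \\
&\quad = P\Big(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2},\ \tau_n > \frac{E(\tau_n)}{2}\Big) \\
&\qquad + P\Big(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2},\ \tau_n \le \frac{E(\tau_n)}{2}\Big) \\
&\quad \le E\, P_{A_n, X^n}\Big(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2},\ \tau_n > \frac{E(\tau_n)}{2}\Big) + P\Big(\tau_n \le \frac{E(\tau_n)}{2}\Big) \\
&\quad \le \exp\Big(-\frac{3 \underline{c} (L_1 h_{\tau_n})^d \pi_{\tau_n} E(\tau_n)}{56}\Big) + \exp\Big(-\frac{3 E(\tau_n)}{28}\Big),
\end{align*}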
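where the last inequality follows from (3.21) and Bernstein's inequality (A.2).
Hence,
\begin{align*}
&P\Big(\inf_{x \in [0,1]^d} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Big) \\
&\quad \le \sum_{k=1}^{\tilde{B}} P\Big(\inf_{x \in \tilde{A}_k} \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \le \frac{c_3 \underline{c} L_1^d \pi_{\tau_n}}{2}\Big) \\
&\quad \le \tilde{B}\Big(\exp\Big(-\frac{3 \underline{c} (L_1 h_{\tau_n})^d \pi_{\tau_n} E(\tau_n)}{56}\Big) + \exp\Big(-\frac{3 E(\tau_n)}{28}\Big)\Big) \\
&\quad \le \tilde{B}\Big(\exp\Big(-\frac{3 \tilde{c} (L_1 h_{q(n)})^d \pi_{q(n)}\, q(n)}{56}\Big) + \exp\Big(-\frac{3 a_1 q(n)}{28}\Big)\Big),
\end{align*}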
where the last inequality follows from Assumption 3.3.7 and the condition (3.12). Here, $\tilde{c}$
and $a_1$ are constants due to the use of Assumption 3.3.7, which says that $E(\tau_n) \ge a_1 q(n)$
for some constant $a_1 > 0$. Also, the same condition
$\frac{q(n)\, h_{q(n)}^{2d}\, \pi_{q(n)}^4}{\log n} \to \infty$ ensures that the
right-hand side above is summable, and by the Borel-Cantelli lemma (Lemma A.0.1), we
have (3.20).
In order to prove (3.19), we now need to show that,
\[
\sup_{x \in [0,1]^d} \left| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \right| = o(\pi_{\tau_n}), \quad \text{almost surely}. \tag{3.22}
\]
For each $n > m_0 + 1$, we can partition the unit cube $[0, 1]^d$ into $B$ bins with bin length
$h_{\tau_n}$ such that $B \le 1/h_{\tau_n}^d$. We denote these bins by $A_1, A_2, \ldots, A_B$. Then given $\epsilon > 0$,
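consider,
\begin{align*}
&P_{A_n, X^n}\Big(\sup_{x \in [0,1]^d} \Big| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \pi_{\tau_n} \epsilon\Big) \\
&\quad \le B \max_{1 \le k \le B} P_{A_n, X^n}\Big(\sup_{x \in A_k} \Big| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \pi_{\tau_n} \epsilon\Big) \\
&\quad \le B \max_{1 \le k \le B} P_{A_n, X^n}\Big(\sup_{x \in A_k} \Big| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \pi_{\tau_n} \epsilon,\ \frac{M_{i,n+1}}{\tau_n} > \frac{\pi_{\tau_n}}{2}\Big) \\
&\qquad + B\, P_{A_n, X^n}\Big(\frac{M_{i,n+1}}{\tau_n} \le \frac{\pi_{\tau_n}}{2}\Big) \\
&\quad \le B \max_{1 \le k \le B} P_{A_n, X^n}\Big(\sup_{x \in A_k} \Big| \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \frac{\tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{2}\Big) + B\, P_{A_n, X^n}\Big(\frac{M_{i,n+1}}{\tau_n} \le \frac{\pi_{\tau_n}}{2}\Big) \\
&\quad \le B \max_{1 \le k \le B} P_{A_n, X^n}\Big(\sup_{x \in A_k} \Big| \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \frac{\tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{2}\Big) + B \exp\Big(-\frac{3 \tau_n \pi_{\tau_n}}{28}\Big), \tag{3.23}
\end{align*}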
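where the last inequality follows from (A.2). Note that using Lemma 3.6.1,
\begin{align*}
&P_{A_n, X^n}\Big(\sup_{x \in A_k} \Big| \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \frac{\tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{2}\Big) \\
&\quad \le 2 \exp\Big(-\frac{(\sqrt{2} - 1)^2 \tau_n \pi_{\tau_n}^4 h_{\tau_n}^{2d} \epsilon^2}{32 c_4^2 v^2}\Big) + 2 \exp\Big(-\frac{(\sqrt{2} - 1) \tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{8 \sqrt{2} c_4 c}\Big) \\
&\qquad + 2 \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k (\sqrt{2} - 1)^2 \tau_n \pi_{\tau_n}^4 h_{\tau_n}^{2d} \epsilon^2}{8 \lambda^2 v^2}\Big) + 2 \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} (\sqrt{2} - 1) \tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{4 \sqrt{2} \lambda c}\Big). \tag{3.24}
\end{align*}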
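Using (3.23) and (3.24), we get that,
\begin{align*}
&P_{A_n, X^n}\Big(\sup_{x \in [0,1]^d} \Big| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \pi_{\tau_n} \epsilon\Big) \\
&\quad \le 2B \exp\Big(-\frac{(\sqrt{2} - 1)^2 \tau_n \pi_{\tau_n}^4 h_{\tau_n}^{2d} \epsilon^2}{32 c_4^2 v^2}\Big) + 2B \exp\Big(-\frac{(\sqrt{2} - 1) \tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{8 \sqrt{2} c_4 c}\Big) \\
&\qquad + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k (\sqrt{2} - 1)^2 \tau_n \pi_{\tau_n}^4 h_{\tau_n}^{2d} \epsilon^2}{8 \lambda^2 v^2}\Big) + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} (\sqrt{2} - 1) \tau_n \pi_{\tau_n}^2 h_{\tau_n}^d \epsilon}{4 \sqrt{2} \lambda c}\Big) \\
&\qquad + B \exp\Big(-\frac{3 \tau_n \pi_{\tau_n}}{28}\Big).
\end{align*}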
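Now consider,
\begin{align*}
&P\Big(\sup_{x \in [0,1]^d} \Big| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \pi_{\tau_n} \epsilon\Big) \\
&\quad \le E\, P_{A_n, X^n}\Big(\sup_{x \in [0,1]^d} \Big| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \pi_{\tau_n} \epsilon,\ \tau_n > \frac{E(\tau_n)}{2}\Big) + P\Big(\tau_n \le \frac{E(\tau_n)}{2}\Big).
\end{align*}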
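Let $n_e = \lfloor E(\tau_n)/2 \rfloor$; then using condition (3.12),
\begin{align*}
&P\Big(\sup_{x \in [0,1]^d} \Big| \frac{1}{M_{i,n+1} h_{\tau_n}^d} \sum_{j \in J_{i,n+1}} \epsilon_j K\Big(\frac{x - X_j}{h_{\tau_n}}\Big) \Big| > \pi_{\tau_n} \epsilon\Big) \\
&\quad \le 2B \exp\Big(-\frac{(\sqrt{2} - 1)^2 n_e \pi_{n_e}^4 h_{n_e}^{2d} \epsilon^2}{32 c_4^2 v^2}\Big) + 2B \exp\Big(-\frac{(\sqrt{2} - 1) n_e \pi_{n_e}^2 h_{n_e}^d \epsilon}{8 \sqrt{2} c_4 c}\Big) \\
&\qquad + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k (\sqrt{2} - 1)^2 n_e \pi_{n_e}^4 h_{n_e}^{2d} \epsilon^2}{8 \lambda^2 v^2}\Big) + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} (\sqrt{2} - 1) n_e \pi_{n_e}^2 h_{n_e}^d \epsilon}{4 \sqrt{2} \lambda c}\Big) \\
&\qquad + B \exp\Big(-\frac{3 n_e \pi_{n_e}}{28}\Big) + \exp\Big(-\frac{3 E(\tau_n)}{28}\Big) \\
&\quad \le 2B \exp\Big(-\frac{(\sqrt{2} - 1)^2 \tilde{a}_1 q(n) \pi_{q(n)}^4 h_{q(n)}^{2d} \epsilon^2}{64 c_4^2 v^2}\Big) + 2B \exp\Big(-\frac{(\sqrt{2} - 1) \tilde{a}_2 q(n) \pi_{q(n)}^2 h_{q(n)}^d \epsilon}{16 \sqrt{2} c_4 c}\Big) \\
&\qquad + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^k (\sqrt{2} - 1)^2 \tilde{a}_1 q(n) \pi_{q(n)}^4 h_{q(n)}^{2d} \epsilon^2}{16 \lambda^2 v^2}\Big) + 2B \sum_{k=1}^{\infty} 2^{kd} \exp\Big(-\frac{2^{k/2} (\sqrt{2} - 1) \tilde{a}_2 q(n) \pi_{q(n)}^2 h_{q(n)}^d \epsilon}{8 \sqrt{2} \lambda c}\Big) \\
&\qquad + B \exp\Big(-\frac{3 \tilde{a}_3 q(n) \pi_{q(n)}}{56}\Big) + \exp\Big(-\frac{3 a_1 q(n)}{28}\Big),
\end{align*}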
where $\tilde{a}_1$ (and likewise $\tilde{a}_2$, $\tilde{a}_3$) is a constant that occurs due to Assumption 6 and the choice of hyperparameter
sequence when applied to the constant $a_1$, where $a_1$ is a positive constant such that
$E(\tau_n) \ge a_1 q(n)$, for large enough $n$. Using condition (3.12),
$\frac{q(n)\, \pi_{q(n)}^4\, h_{q(n)}^{2d}}{\log n} \to \infty$, it
is easy to see that the RHS above is summable. Then, by the Borel-Cantelli Lemma we can
conclude (3.22), thus proving the theorem. Note that, following the same lines of proof, we
could prove the strong consistency for $\eta_1$ by just replacing $\pi_{\tau_n}$ with $\pi_n$.
3.7 Supplementary simulation results
In this section, we plot the average regret curves for both strategies $\eta_1$ and $\eta_2$ for different
hyper-parameter choices. In Figure 3.2, we choose $h_n = \log^{-1} n$ and $\pi_n = \log^{-2} n$.
We still notice the same trend, where $\eta_1$ performs better than strategy $\eta_2$ in setup 1 and
setup 2, while $\eta_2$ performs better in setup 3 and setup 4. Notice that, for setups 1 and 2,
in the case of delay scenario 3, the difference in the average regret is not as noticeable
as it is in delay 4. This could be attributed to the fast-decaying $\pi_n = \log^{-2} n$:
whether we update at every time point or only at observed-reward time points, there
is a sharp increase in the amount of exploitation with the amount of data available in
the delay 3 scenario, unlike the delay 4 scenario. We also notice that, in setup 3 with delay
4, the average regret does not seem to decay by our time horizon and might need a
larger horizon to show some decay, which could be because the exploration probability
is decaying too fast for both algorithms to learn efficiently. Figure 3.3 and Figure 3.4
correspond to the choices $(h_n, \pi_n) = (n^{-1/4}, n^{-1/4})$ and $(n^{-1/4}, \log^{-1} n)$, respectively. We
see very similar trends as discussed in section 3.5 and for Figure 3.2.
[Figure 3.2 panels, one row per setup (Setups 1 to 4): mean reward generating functions against $x$ in column 1, and average regret against time index for $\eta_1$ and $\eta_2$ under delay 3 and delay 4 in columns 2 and 3, with $\pi_n = (\log n)^{-2}$ and $h_n = (\log n)^{-1}$.]
Figure 3.2: Each row represents a setup, with the first column depicting the one-dimensional
functions used to generate the mean reward functions. The second and the third columns
depict the average regret over time for delay 3 and delay 4, respectively. Here, $h_n =
\log^{-1} n$, $\pi_n = \log^{-2} n$.
[Figure 3.3 panels, one row per setup (Setups 1 to 4): mean reward generating functions against $x$ in column 1, and average regret against time index for $\eta_1$ and $\eta_2$ under delay 3 and delay 4 in columns 2 and 3, with $\pi_n = n^{-1/4}$ and $h_n = n^{-1/4}$.]
Figure 3.3: Strategy $\eta_1$ has lower cumulative average regret in setup 1 and 2 (first two
rows) and strategy $\eta_2$ has lower cumulative average regret in setup 3 and 4 (third
and fourth rows). Here, $h_n = n^{-1/4}$, $\pi_n = n^{-1/4}$.
[Figure 3.4 panels, one row per setup (Setups 1 to 4): mean reward generating functions against $x$ in column 1, and average regret against time index for $\eta_1$ and $\eta_2$ under delay 3 and delay 4 in columns 2 and 3, with $\pi_n = (\log n)^{-1}$ and $h_n = n^{-1/4}$.]
Figure 3.4: Strategy $\eta_1$ has lower cumulative average regret in setup 1 and 2 (first two
rows) and strategy $\eta_2$ has lower cumulative average regret in setup 3 and 4 (third
and fourth rows). Here, $h_n = n^{-1/4}$, $\pi_n = \log^{-1} n$.
Chapter 4
Finite-time analysis for
randomized allocation strategies
In this chapter we conduct a finite-time regret analysis for the strategies $\eta_1$ and $\eta_2$ as
proposed in algorithm 3.2 in chapter 3. Note that, in terms of notation and to enhance
readability, we will use $i$ for arms, $j, n$ for time indices and $N$ for the total time horizon.
To recall the problem setup, we assume that there are $\ell \ge 2$ arms available for allocation.
Each arm allocation results in a reward which is obtained at some random time after the
arm allocation. For each patient $j \ge 1$, visiting at known times $s_j \in \mathbb{R}^+$, a treatment
$I_j$ is allotted based on the data observed previously and the covariate $X_j$. We assume
that the covariates are $d$-dimensional continuous random variables and take values in
the hypercube $[0, 1]^d$. Since the rewards may be obtained at some delayed time, we
denote by $\{t_j \in \mathbb{R}^+, j \ge 1\}$ the observation times for the rewards for arms $\{I_j, j \ge 1\}$,
respectively. Let $Y_{i,j}$ be the reward obtained at time $t_j \ge s_j$ for arm $i = I_j$. The mean
reward with covariate $X_j$ for the $i$th arm is denoted as $f_i(X_j)$, $1 \le i \le \ell$. The observed
reward with covariate $X_j$ by pulling the $I_j$th arm is modeled as,
\[
Y_{I_j, j} = f_{I_j}(X_j) + \epsilon_j,
\]
where $\epsilon_j$ denotes random independent error with $E(\epsilon_j) = 0$ and $\mathrm{Var}(\epsilon_j) < \infty$ for all
$j \in \mathbb{N}$. The functions $f_i$, $1 \le i \le \ell$, are assumed to be unknown and not of any given
parametric form.
Let $\{X_j, j \ge 1\}$ be a sequence of covariates independently generated according to
an unknown underlying probability distribution $P_X$, from a population supported in
$[0, 1]^d$. Let $\{\pi_n, n \ge 1\}$ be a sequence of probabilities decreasing to 0 as $n \to \infty$. Let $\eta$
be a sequential allocation rule, which for each time $j$ chooses an arm $I_j$ based on the
previous observations and $X_j$. The total mean reward up to time $n$ is
$\sum_{j=1}^{n} f_{I_j}(X_j)$.
The rewards are observed at delayed times $t_j$; the delay in the reward for arm $I_j$
pulled at the $j$th time is given by a random variable $d_j := t_j - s_j$. Assume that these
delays are independent of both the covariates and the arms. That is, let $\{d_j, j \ge 1\}$
be a sequence of random variables with probability distributions,
\[
d_j \sim G_j \quad \forall j \in \mathbb{N}.
\]
To evaluate the performance of the allocation strategy, let $i^*(x) = \arg\max_{1 \leq i \leq \ell} f_i(x)$
and $f^*(x) = f_{i^*(x)}(x)$. Without the knowledge of the random errors, the ideal
performance occurs when the selected arms $I_1, \ldots, I_n$ match the optimal arms
$i^*(X_1), \ldots, i^*(X_n)$, yielding the optimal total reward $\sum_{j=1}^{n} f^*(X_j)$. Thus we measure
the performance of the allocation rule $\eta$ by the cumulative regret,
\[
R_n(\eta) = \sum_{j=1}^{n} \left( f^*(X_j) - f_{I_j}(X_j) \right).
\]
This is the quantity we use for finite-time analysis of our proposed strategies. We also
define the per-round or average regret $r_n(\eta)$ by
\[
r_n(\eta) = \frac{R_n(\eta)}{n} = \frac{1}{n} \sum_{j=1}^{n} \left( f^*(X_j) - f_{I_j}(X_j) \right).
\]
A strategy η is strongly consistent if rn(η) = op(1) and the finite time analysis provides
an upper bound on the rate of this decay.
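The two regret definitions above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the mean rewards $f_i(X_j)$ are known (as they would be in a simulation); the function names and the toy numbers are hypothetical and are not part of the thesis.

```python
def cumulative_regret(mean_rewards, chosen_arms):
    """R_n(eta): sum over j of f*(X_j) - f_{I_j}(X_j).

    mean_rewards[j][i] holds f_i(X_j) (known only in simulations);
    chosen_arms[j] holds the pulled arm I_j, using 0-based arm labels.
    """
    total = 0.0
    for f_row, arm in zip(mean_rewards, chosen_arms):
        total += max(f_row) - f_row[arm]  # f*(X_j) - f_{I_j}(X_j)
    return total


def average_regret(mean_rewards, chosen_arms):
    """r_n(eta) = R_n(eta) / n."""
    return cumulative_regret(mean_rewards, chosen_arms) / len(chosen_arms)


# Toy example with two arms and three patients (illustrative values only).
toy_means = [[0.2, 0.5], [0.9, 0.1], [0.4, 0.4]]
toy_arms = [0, 0, 1]
toy_regret = cumulative_regret(toy_means, toy_arms)
```

Only the first allocation in the toy example is suboptimal, so the cumulative regret comes entirely from that round.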
4.1 Finite-time regret analysis
We start by making some assumptions on the errors, the underlying functions, the kernel
function used in the definition of the Nadaraya-Watson estimator (4.1), and the delays, that
will be used in the subsequent results.
Assumption 4.1.1. The errors satisfy a (conditional) moment condition: there
exist positive constants $v$ and $c$ such that for all integers $k \geq 2$ and $n \geq 1$,
\[
E\left(|\epsilon_n|^k \mid X_n\right) \leq \frac{k!}{2} v^2 c^{k-2},
\]
almost surely.
This assumption imposes moment conditions on the error distributions known
as the refined Bernstein condition (as in Birgé et al. (1998); Qian and Yang (2016a)).
Assumption 4.1.1 is met for a wide range of distributions; for example, normally
distributed errors and bounded errors satisfy this assumption, making it viable in a wide range
of applications. Next, we consider two natural assumptions on the mean reward
functions and the covariate density, respectively. Although we restrict the covariate space
to $[0,1]^d$, any compact subset of $\mathbb{R}^d$ would suffice.
Assumption 4.1.2. The functions $f_i$ are continuous on $[0,1]^d$ with
\[
A := \sup_{1 \leq i \leq \ell} \sup_{x \in [0,1]^d} \left( f^*(x) - f_i(x) \right) < \infty.
\]
Assumption 4.1.3. The design distribution $P_X$ is dominated by the Lebesgue measure
with a continuous density $p(x)$ uniformly bounded above and away from 0 on $[0,1]^d$; that
is, $p(x)$ satisfies $\underline{c} \leq p(x) \leq \bar{c}$ for some $0 < \underline{c} \leq \bar{c}$.
For kernel regression, we consider a multivariate nonnegative kernel function $K(u) :
\mathbb{R}^d \to \mathbb{R}$ that satisfies Lipschitz, boundedness and bounded support conditions.
Assumption 4.1.4. For some constant $0 < \lambda < \infty$,
\[
|K(u) - K(u')| \leq \lambda \|u - u'\|_\infty,
\]
for all $u, u' \in \mathbb{R}^d$.
Assumption 4.1.5. There exist constants $L_1 \leq L$, $c_3 > 0$ and $c_4 \geq 1$ such that
$K(u) = 0$ for $\|u\|_\infty > L$, $K(u) \geq c_3$ for $\|u\|_\infty \leq L_1$, and $K(u) \leq c_4$ for all $u \in \mathbb{R}^d$.
The next assumption is an independence assumption on the delays. We relax this
assumption in Section 4.2.
Assumption 4.1.6. The delays $\{d_j, j \geq 1\}$ are independent of each other, of the
choice of arms, and of the covariates.
The next assumption mildly restricts the expected number of delayed rewards, such
that we expect to observe an increasing number of rewards as time progresses. The
assumption is not restrictive as it allows for rewards to be unbounded as long as a
minimum number of rewards are being observed in finite time. This assumption would
naturally hold for many scenarios with delayed rewards in which some informed learning
is plausible.
Assumption 4.1.7. Let the partial sums of the delay distributions satisfy
\[
\sum_{j=1}^{n} G_j(n - s_j) = \Omega(q(n)),
\]
where $q(n)$ could be a function as in Assumption 2.3.4 in Chapter 2,
such that $q(n) \to \infty$ as $n \to \infty$.
Next, we also provide a mild assumption on an upper bound for the expected number
of observed rewards by the time horizon $N$.
Assumption 4.1.8. We assume that, for a given $\delta > 0$ and time horizon $N$,
\[
E(\tau_N) < N - \sqrt{\frac{N}{2} \log\left(\frac{1}{\delta}\right)}.
\]
4.1.1 Nadaraya-Watson regression
We focus on Nadaraya-Watson regression and study its finite-time performance under
the proposed allocation strategies $\eta_1$ and $\eta_2$. Let $\tau_n = \sum_{j=1}^{n} I\{t_j \leq n\}$ be the random
running index of the number of rewards observed by time $n$; we choose $h_{\tau_n}$ for
the bandwidth sequence. Recall that we use $\{\pi_n\}$ and $\{h_n\}$ to denote the user-defined
sequences for the exploration probability and the bandwidth. We remove
the $\{\cdot\}$ to distinguish between strategies $\eta_1$ and $\eta_2$, and use $\pi_{\tau_n}$ and $\pi_n$ to denote
the sequence updating only at time points corresponding to observed rewards and the
sequence updating at every time point, respectively.
For arm $1 \leq i \leq \ell$, at each time point $n$, define $J_{i,n} = \{j : I_j = i,\ 1 \leq t_j \leq n-1,\ 1 \leq
j \leq n-1\}$ to be the indices corresponding to the rewards which were observed by time
$n-1$. Let $A_N = \{(s_j, t_j) : t_j \leq N, j \geq 1\}$ denote the time points for which rewards
were obtained by time $N$, and let $X^n = \{X_1, X_2, \ldots, X_n\}$ denote the set of design points
until time $n$.
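Recall that the Nadaraya-Watson estimator of $f_i(x)$ is
\[
\hat{f}_{i,n+1}(x) = \frac{\sum_{j \in J_{i,n+1}} Y_{i,j} K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)}. \tag{4.1}
\]
Given $x \in [0,1]^d$, $1 \leq i \leq \ell$ and $n \geq m_0 + 1$, define $Q_{n+1}(x) = \{j : 1 \leq t_j \leq
n,\ \|x - X_j\|_\infty \leq L h_{\tau_n}\}$ and $Q_{i,n+1}(x) = \{j : 1 \leq t_j \leq n,\ I_j = i,\ \|x - X_j\|_\infty \leq L h_{\tau_n}\}$. Let
$M_{n+1}(x)$ and $M_{i,n+1}(x)$ be the size of $Q_{n+1}(x)$ and $Q_{i,n+1}(x)$, respectively.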
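To make the role of $\tau_n$ in (4.1) concrete, the following is a minimal Python sketch of a Nadaraya-Watson estimate that uses only rewards observed by time $n$ and a bandwidth indexed by $\tau_n$. It is an illustration under stated assumptions, not the implementation used in the thesis: the record layout `(X_j, I_j, t_j, Y_j)`, the function names, and the bandwidth used in the example are all hypothetical, and the uniform kernel $I(\|u\|_\infty \leq L)$ is used for simplicity.

```python
def uniform_kernel(u, L=1.0):
    """Uniform kernel I(||u||_inf <= L)."""
    return 1.0 if max(abs(c) for c in u) <= L else 0.0


def nw_estimate(x, arm, n, history, bandwidth, kernel=uniform_kernel):
    """Nadaraya-Watson estimate of f_arm(x) from rewards observed by time n.

    history: list of (X_j, I_j, t_j, Y_j) for patients j = 1..n, where t_j is
             the time at which the reward Y_j becomes available.
    bandwidth: function mapping tau_n (number of observed rewards) to h_{tau_n}.
    Returns None when no observed reward of this arm falls in the kernel window.
    """
    tau_n = sum(1 for (_, _, t, _) in history if t <= n)
    if tau_n == 0:
        return None
    h = bandwidth(tau_n)
    num = den = 0.0
    for X, I, t, Y in history:
        if I == arm and t <= n:  # j in J_{arm, n+1}: arm matches, reward seen
            w = kernel([(xi - Xi) / h for xi, Xi in zip(x, X)])
            num += Y * w
            den += w
    return num / den if den > 0 else None


# Illustrative data: the last reward arrives at time 50 and is not yet usable.
toy_history = [((0.50,), 0, 1, 1.0), ((0.52,), 0, 2, 3.0), ((0.90,), 0, 2, 10.0),
               ((0.50,), 1, 2, 7.0), ((0.50,), 0, 50, 100.0)]
toy_fit = nw_estimate((0.5,), 0, 5, toy_history, lambda tau: 0.2 * tau ** -0.25)
```

In the example, four rewards are available by time 5, so the bandwidth is evaluated at $\tau_5 = 4$; the delayed reward with $t_j = 50$ and the observation far from $x$ are both excluded from the local average.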
In the next section, we provide finite time analysis for both strategies η1 and η2, and
compare the two to see if the results support our findings in chapter 3.
To avoid the case of the denominator of the Nadaraya-Watson estimator in (4.1)
being extremely small, we will replace the kernel $K(\cdot)$ in (4.1) with a uniform kernel
$I(\|u\|_\infty \leq L)$. In particular, for the case when the complement of the event $B_{i,n}$, defined
as
\[
B^c_{i,n} := \left\{ \frac{1}{M_{i,n+1}(x)} \sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right) < c_5 \right\}, \tag{4.2}
\]
occurs almost surely for some small positive constant $0 < c_5 < 1$, we will use the uniform
kernel. This usage is seen in Lemma 4.1.10 and Lemma 4.1.11.
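Given $0 < \delta < 1$ and the total time horizon $N$, for strategy $\eta_2$ and a positive constant
$\tilde{a}_1$, we define a special time point $n'_\delta$ by
\[
n'_\delta = \min\left\{ n > m_0 : \exp\left( -\frac{3 \underline{c}\, \tilde{a}_1 (2L h_{q(n)})^d \pi_{q(n)} q(n)}{112} \right) \leq \frac{\delta}{4 \ell N} \right\}. \tag{4.3}
\]
Under the condition (3.6), $h^d_{q(n)} \pi_{q(n)} q(n) / \log(n) \to \infty$ as $n \to \infty$, we have $n'_\delta / N \to 0$ as
$N \to \infty$.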
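As a rough numerical illustration of definition (4.3), the sketch below searches for the first $n > m_0$ at which the exponential term drops below $\delta/(4\ell N)$. Everything here is an assumption made for illustration: the function name, the default constants (set to 1), the treatment of $h$, $\pi$ and $q$ as user-supplied callables evaluated at $q(n)$, and the cap on the search at the horizon $N$.

```python
import math


def special_time_point(N, delta, n_arms, d, m0, h, pi, q,
                       c_low=1.0, a1=1.0, L=1.0):
    """Smallest n in (m0, N] with
    exp(-3 * c_low * a1 * (2 L h(q(n)))^d * pi(q(n)) * q(n) / 112) <= delta / (4 l N).

    h, pi: callables giving the bandwidth and exploration probability at a
           positive argument; q: callable giving the rate q(n).
    c_low (density lower bound) and a1 are unknown in practice; the defaults
    are placeholders. Returns None if no such n exists up to the horizon N.
    """
    threshold = delta / (4 * n_arms * N)
    for n in range(m0 + 1, N + 1):
        qn = q(n)
        exponent = -3 * c_low * a1 * (2 * L * h(qn)) ** d * pi(qn) * qn / 112
        if math.exp(exponent) <= threshold:
            return n
    return None


# Illustrative call with constant h and pi and no delay penalty, q(n) = n.
n_prime = special_time_point(N=1000, delta=0.1, n_arms=2, d=1, m0=10,
                             h=lambda m: 0.5, pi=lambda m: 0.5, q=lambda n: n)
```

With slowly decaying sequences or a small $q(n)$, the search may return `None` for moderate $N$, which is one way to see that the constants in the bound are conservative.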
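Similarly, given $0 < \delta < 1$ and time horizon $N$, for strategy $\eta_1$ and some positive
constant $\tilde{\tilde{a}}_1$, we define a special time point $n''_\delta$ by
\[
n''_\delta = \min\left\{ n > m_0 : \exp\left( -\frac{3 \underline{c}\, \tilde{\tilde{a}}_1 (2L h_{q(n)})^d \pi_n q(n)}{112} \right) \leq \frac{\delta}{4 \ell N} \right\}. \tag{4.4}
\]
Under the condition (3.5), $h^d_{q(n)} \pi_n q(n) / \log(n) \to \infty$ as $n \to \infty$, we will have $n''_\delta / N \to 0$
as $N \to \infty$. Therefore, for a large enough time horizon $N$, we will have $N > n''_\delta$.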
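Lemma 4.1.9. Under Assumption 4.1.6 and Assumption 4.1.7, $\tau_n \xrightarrow{a.s.} \infty$ as $n \to \infty$.
Proof. Recall that $\tau_n = \sum_{j=1}^{n} I\{t_j \leq n\}$. Then
\[
E(\tau_n) = E\left( \sum_{j=1}^{n} I\{t_j \leq n\} \right) = \sum_{j=1}^{n} P(t_j \leq n) = \sum_{j=1}^{n} G_j(n - s_j).
\]
By Assumption 4.1.7, for large enough $n$ there
exists a positive constant $a_1$ such that $\sum_{j=1}^{n} G_j(n - s_j) \geq a_1 q(n)$. Then, using the
inequality (A.2), we get
\[
P\left( \tau_n \leq \frac{a_1 q(n)}{2} \right)
\leq P\left( \tau_n \leq \frac{\sum_{j=1}^{n} G_j(n - s_j)}{2} \right)
\leq \exp\left( -\frac{3 \sum_{j=1}^{n} G_j(n - s_j)}{28} \right)
\leq \exp\left( -\frac{3 a_1 q(n)}{28} \right).
\]
It is easy to see that the upper bound is summable in $n$ under the conditions (3.5)
and (3.6) from Chapter 3. By the Borel-Cantelli lemma, the event $\{\tau_n \leq a_1 q(n)/2\}$
occurs only finitely often almost surely, so $\tau_n > a_1 q(n)/2$ for all sufficiently large $n$,
and therefore $\tau_n \xrightarrow{a.s.} \infty$. Note that, by construction, this implies
that $h_{\tau_n} \xrightarrow{a.s.} 0$ and $\pi_{\tau_n} \xrightarrow{a.s.} 0$ as $n \to \infty$. As an immediate consequence of this, along with
the continuity of $f_i$, we get that $w(h_{\tau_n}; f_i) \xrightarrow{a.s.} 0$ as $n \to \infty$.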
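The growth of $\tau_n$ in Lemma 4.1.9 can be explored with a small simulation. The sketch below is illustrative only: it assumes unit-spaced visiting times $s_j = j$ and i.i.d. geometric delays counted as the number of failures before the first success, neither of which is required by the lemma, and the helper names are hypothetical.

```python
import random


def geometric_delay(p):
    """Return a sampler of Geometric(p) delays on {0, 1, 2, ...}."""
    def sample(rng):
        d = 0
        while rng.random() >= p:  # count failures before the first success
            d += 1
        return d
    return sample


def count_observed(n, delay_sampler, rng):
    """tau_n = sum_{j=1}^n I{t_j <= n}, with s_j = j and t_j = s_j + d_j."""
    return sum(1 for j in range(1, n + 1) if j + delay_sampler(rng) <= n)


rng = random.Random(0)
tau_values = {n: count_observed(n, geometric_delay(0.3), rng)
              for n in (100, 1000, 2000)}
```

Under this light-tailed delay, only the few most recent rewards are still pending at time $n$, so $\tau_n$ stays close to $n$; heavier delays would correspond to a smaller $q(n)$.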
Lemma 4.1.10 (For strategy $\eta_2$). Suppose Assumptions 4.1.1, 4.1.2, 4.1.5 and 4.1.6 are
satisfied and $\{\pi_n\}$ is a decreasing sequence. Given $x \in [0,1]^d$, $1 \leq i \leq \ell$ and $n \geq m_0 + 1$,
for every $\epsilon > w(L h_{\tau_n}; f_i)$ a.s., we have for strategy $\eta_2$,
\[
P^{\eta_2}_{X^n, A_N}\left( |\hat{f}_{i,n+1}(x) - f_i(x)| \geq \epsilon \right)
\leq \exp\left( -\frac{3 M_{n+1}(x) \pi_{\tau_n}}{28} \right)
+ 4N \exp\left( -\frac{c_5^2 M_{n+1}(x) \pi_{\tau_n} (\epsilon - w(L h_{\tau_n}; f_i))^2}{4 c_4^2 v^2 + 4 c_4 c (\epsilon - w(L h_{\tau_n}; f_i))} \right), \tag{4.5}
\]
where $P^{\eta_2}_{X^n, A_N}(\cdot)$ denotes the conditional probability for strategy $\eta_2$ given the design
points $X^n = \{X_1, X_2, \ldots, X_n\}$, $A_N = \{(s_j, t_j) : t_j \leq N, j \geq 1\}$ and $\tau_n = \sum_{j=1}^{n} I\{t_j \leq
n\}$, which is a known quantity given $A_N$.
Similarly, one could derive the analogous result for strategy η1. The proofs for the
two results are very similar and only one of them is presented in Section 4.4.1.
Lemma 4.1.11 (For strategy $\eta_1$). Suppose Assumptions 4.1.1, 4.1.2, 4.1.5 and 4.1.6 are
satisfied and $\{\pi_n\}$ is a decreasing sequence. Given $x \in [0,1]^d$, $1 \leq i \leq \ell$ and $n \geq m_0 + 1$,
for every $\epsilon > w(L h_{\tau_n}; f_i)$ a.s., we have for strategy $\eta_1$,
\[
P^{\eta_1}_{X^n, A_N}\left( |\hat{f}_{i,n+1}(x) - f_i(x)| \geq \epsilon \right)
\leq \exp\left( -\frac{3 M_{n+1}(x) \pi_n}{28} \right)
+ 4N \exp\left( -\frac{c_5^2 M_{n+1}(x) \pi_n (\epsilon - w(L h_{\tau_n}; f_i))^2}{4 c_4^2 v^2 + 4 c_4 c (\epsilon - w(L h_{\tau_n}; f_i))} \right), \tag{4.6}
\]
where $P^{\eta_1}_{X^n, A_N}(\cdot)$ denotes the conditional probability for strategy $\eta_1$ given the design
points $X^n = \{X_1, X_2, \ldots, X_n\}$, $A_N = \{(s_j, t_j) : t_j \leq N, j \geq 1\}$ and $\tau_n = \sum_{j=1}^{n} I\{t_j \leq
n\}$, which is a known quantity given $A_N$.
It can be seen that Lemma 4.1.10 and Lemma 4.1.11 only differ in the hyperparame-
ter choice of piτn and pin, other things remain the same. The reason for this is that both
are conditional probability results, and given AN , τn is an observed quantity. Next, we
provide the theorems for finite-time regret bounds on the cumulative regret for strategy
η2 and η1 respectively.
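Theorem 4.1.12. Suppose Assumptions 4.1.1-4.1.8 are satisfied and $\{\pi_n\}$ is a decreasing
sequence. Assume $N > n'_\delta$, with the kernel estimator defined as in (4.1) and the kernel
chosen as described in (4.2). Then for $0 < \delta \leq 1/4$, with probability at
least $1 - \frac{32\delta}{9}$, the cumulative regret for $\eta_2$ satisfies
\[
\begin{aligned}
R_N(\eta_2) <{}& A n'_\delta + \sum_{n=n'_\delta+1}^{N} 2 \left[ \max_{1 \leq i \leq \ell} w(L h_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{(2L)^d h^d_{q(n)} \pi_{q(n)} q(n)}} \right] \\
&+ A \sum_{t=1}^{N^*(\delta)} M_\delta (\ell - 1) \pi_t
+ \max\left\{ A \sqrt{\frac{M_\delta E(\tau_N)}{2} \log\left(\frac{2}{\delta}\right)},\ A \sqrt{\left(\frac{N}{2}\right) \log\left(\frac{2}{\delta}\right)} \right\},
\end{aligned}
\]
where $N^*(\delta) = E(\tau_N) + \sqrt{\frac{N}{2} \log\left(\frac{1}{\delta}\right)}$, $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12 \ell N^2/\delta) / (c_5^2 \underline{c} (2L)^d)}$, and $M_\delta$ is
a number chosen such that $\left( 1 - \frac{a_1 q(M_\delta/2)}{M_\delta/2} \right)^{M_\delta/2} = \delta$, where $q(\cdot)$ comes from Assumption
4.1.7.
Proof. The proof is in Section 4.4.2.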
The right-hand side of the inequality in Theorem 4.1.12 above consists of several
terms that are also intuitively meaningful. The first term $A n'_\delta$ comes from the initial
rough exploration. The second term has two components: $\max_{1 \leq i \leq \ell} w(L h_{q(n)}; f_i)$, which
is associated with the estimation bias, and $C_{N,\delta} / \sqrt{h^d_{q(n)} \pi_{q(n)} q(n)}$, which can be associated with the
estimation standard error and depends on the delay. That is, if the delays are expected
to be large, then $q(n)$ will be small, as a result of which the estimation standard error
will be large. The next term $\sum_{t=1}^{N^*(\delta)} M_\delta (\ell - 1) \pi_t$ is the randomization error, where $M_\delta$ is
a probabilistic upper bound on the difference between consecutive reward observations.
This may potentially be quite large in large-delay situations, leading to a large
randomization error. Finally, the last term reflects the fluctuation of the randomization
scheme, and this also depends on the extent of delays in observing the rewards.
Theorem 4.1.13. Suppose Assumptions 4.1.1-4.1.7 are satisfied and $\{\pi_n\}$ is a decreasing
sequence. Assume $N > n''_\delta$, with the kernel estimator defined as in (4.1) and the kernel
chosen as described in (4.2). Then with probability larger than $1 - 2\delta$, the cumulative
regret for strategy $\eta_1$ satisfies
\[
R_N(\eta_1) < A n''_\delta + \sum_{n=n''_\delta+1}^{N} \left( 2 \left[ \max_{1 \leq i \leq \ell} w(L h_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{h^d_{q(n)} \pi_n q(n)}} \right] + A(\ell - 1) \pi_n \right)
+ A \sqrt{\frac{N}{2} \log\left(\frac{1}{\delta}\right)},
\]
where $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12 \ell N^2/\delta) / (c_5^2 \underline{c} (2L)^d)}$.
The proof of Theorem 4.1.13 can be found in Section 4.4.3. The right-hand side of
the inequality in Theorem 4.1.13 also consists of several terms that are intuitively
meaningful. The first term $A n''_\delta$ comes from the initial rough exploration. The second term
has three components. The first, $\max_{1 \leq i \leq \ell} w(L h_{q(n)}; f_i)$, is associated with the estimation
bias. The second, $C_{N,\delta} / \sqrt{h^d_{q(n)} \pi_n q(n)}$, can be associated with the estimation standard error, which
depends on the delay: if the delays are expected to be large, then $q(n)$ is going to
be small, as a result of which the estimation standard error will be large. The third,
$A(\ell - 1) \pi_n$, is the randomization error. This is not affected by the delay because,
under the proposed allocation strategy, allocations are made at each time point. The last
term $A \sqrt{(N/2) \log(1/\delta)}$ reflects the fluctuation of the randomization scheme.
As both the upper bounds in Theorem 4.1.12 and Theorem 4.1.13 consist of components
that reflect the bias-variance trade-off and the exploration-exploitation trade-off,
we can compare the bounds to get some idea of the underlying nature of the two
strategies, $\eta_2$ and $\eta_1$ respectively. We notice that there is a trade-off in the bounds of the two
strategies. While the upper bound for the estimation bias in the two strategies remains
the same, the bound on the estimation standard error component for the former ($\eta_2$)
is smaller than for the latter ($\eta_1$) because $\pi_{q(n)} \geq \pi_n$ in the presence of delays. However,
the randomization error bound for strategy $\eta_2$ (Theorem 4.1.12) could be large compared
to the randomization error bound for strategy $\eta_1$ (Theorem 4.1.13), depending upon the
extent of delay and the corresponding value of $M_\delta$. If $M_\delta$ is not too large, we see that the
last term, corresponding to the fluctuation of the randomization scheme, could actually
be about the same in both bounds ($\approx A \sqrt{(N/2) \log(1/\delta)}$). The extent to which
one component (estimation error or randomization error) overpowers the other is also
determined by the underlying nature of the reward generating functions and the severity of
delays, as discussed in Chapter 3. It is important to note that the bounds presented are
not tight, so it is hard to precisely quantify the difference in the cumulative regret for
the two strategies.
4.2 Delays dependent on covariates
It can often be the case that the extent of delay in observing rewards depends on the
covariates. For example, patient characteristics could play an integral role in determining
when the treatment outcome for that patient would be observed. For instance, it could
be the case that treatments take longer to show their outcomes in older patients
than in younger patients. In this section, we consider scenarios where the
delays depend on covariates but are independent of the arms (treatments). That is,
we assume that
\[
d_j \mid (X_j = x) \sim G_x; \quad G_x \in \mathcal{G}.
\]
In other words, $P(d_j \leq n \mid X_j = x) = G_x(n)$, where we assume that
\[
\mathcal{G} := \{ G_x : e_n \leq G_x(n)\ \ \forall\, x \in [0,1]^d \}, \tag{4.7}
\]
where $e_n \in (0,1)$ is a non-decreasing sequence such that $e_n \to 1$ as $n \to \infty$. Intuitively,
$e_n$ is a sequence that gives a uniform lower bound on the cdfs of delays for all patients
with covariates in the given domain $[0,1]^d$. Hence, if $d_j = t_j - s_j$,
\[
P(t_j \leq n \mid X_j = x) = G_x(n - s_j) \geq e_{n - s_j}; \quad G_x \in \mathcal{G},\ j \in \mathbb{N}.
\]
Next, we make assumptions on the growth of the sequence $e_n$ to ensure that we
expect to observe rewards at a minimum rate that would guarantee a finite-time bound
on the cumulative regret.
Assumption 4.2.1. The delays $\{d_j, j \geq 1\}$ depend on the covariates but are
independent of the choice of arms.
Assumption 4.2.2. Let $\sum_{j=1}^{n} e^2_{n - s_j} = \Omega(q(n))$, where $q(n) \to \infty$ as $n \to \infty$ and
$q(n)$ gives a lower bound on the uniform growth rate of the number of observed rewards by
time $n$.
Under the conditions for consistency (3.5) and (3.6), one can do a similar analysis
as the finite-time results of Theorem 4.1.12 and 4.1.13. The proof techniques in certain
parts differ due to delays depending on the covariates and those details are discussed
in Section 4.4.4. However, a similar structure of the proof can be maintained and
consequently the following results can be established but with q(n) as in Assumption
4.2.2.
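Theorem 4.2.3. Suppose Assumptions 4.1.1-4.1.5, 4.2.1 and 4.2.2 are satisfied and
$\{\pi_n\}$ is a decreasing sequence of probabilities. Assume $N > n'_\delta$ (with $q(n)$ as in Assumption 4.2.2),
with the kernel estimator defined as in (4.1) and the kernel chosen as described in (4.2).
Then with probability larger than $1 - 2\delta$, the cumulative regret for $\eta_2$ satisfies
\[
\begin{aligned}
R_N(\eta_2) <{}& A n'_\delta + \sum_{n=n'_\delta+1}^{N} 2 \left[ \max_{1 \leq i \leq \ell} w(L h_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{h^d_{q(n)} \pi_{q(n)} q(n)}} \right]
+ A \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell - 1) \pi_t \\
&+ \max\left\{ A \sqrt{\frac{E(\tau_N \mid \mathcal{F}_N) M_\delta}{2} \log\left(\frac{2}{\delta}\right)},\ A \sqrt{\frac{N}{2} \log\left(\frac{2}{\delta}\right)} \right\},
\end{aligned}
\]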
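where $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12 \ell N^2/\delta) / (c_5^2 \underline{c} (2L)^d)}$ and $M_\delta$ is a number chosen such that
$\left( 1 - \sqrt{\frac{q(M_\delta/2)}{M_\delta/2}} \right)^{M_\delta/2} = \delta$, where $q(\cdot)$ is as in Assumption 4.2.2. Also note that $\mathcal{F}_N$ is
the $\sigma$-field generated by $(Z^N, X^N, I^N)$.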
Theorem 4.2.4. Suppose Assumptions 4.1.1-4.1.5, 4.2.1 and 4.2.2 are satisfied and
$\{\pi_n\}$ is a decreasing sequence of probabilities. Assume $N > n''_\delta$ (with $q(n)$ as in
Assumption 4.2.2), with the kernel estimator defined as in (4.1) and the kernel chosen as described
in (4.2). Then with probability larger than $1 - 2\delta$, the cumulative regret for strategy $\eta_1$
satisfies
\[
R_N(\eta_1) < A n''_\delta + \sum_{n=n''_\delta+1}^{N} \left( 2 \left[ \max_{1 \leq i \leq \ell} w(L h_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{h^d_{q(n)} \pi_n q(n)}} \right] + A(\ell - 1) \pi_n \right)
+ A \sqrt{\frac{N}{2} \log\left(\frac{1}{\delta}\right)},
\]
where $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12 \ell N^2/\delta) / (c_5^2 \underline{c}\, \tilde{\tilde{a}}_1 (2L)^d)}$.
Note that similar to Theorems 4.1.12 and 4.1.13, we gain intuitive insights by com-
paring the bounds of Theorems 4.2.3 and 4.2.4. However, it is hard to precisely quantify
the difference in the growth rate of the cumulative regret for the two strategies, as the
bounds presented are not tight. It may actually be the case that with the dependence
assumption of delays on covariates, the bounds on the cumulative regret are more con-
servative than the ones in Theorems 4.1.12 and 4.1.13, because we essentially use a
uniform lower bound over the whole covariate space, instead of taking into account
region-wise contribution to the regret. There is a scope to tighten the bounds by re-
laxing Assumption 4.2.2 which would require developing new technical tools that would
help in conducting a more involved analysis.
4.3 Real data analysis
In this section, we use a benchmark dataset for multi-armed bandit problems, the Yahoo!
Front Page Today Module User Click Log dataset, to evaluate the proposed allocation
strategy by artificially introducing delays in observing the rewards in the data. The
complete data consists of about 46 million Yahoo! front page interactions collected during the first
ten days of May 2009. Each event (page visit interaction) has the following information:
1) five variables on the visitor's information; 2) a pool of about 10-14 editor-picked news
articles; 3) one article displayed to the visitor; 4) the visitor's response (1=click or 0=no
click) to the selected article. The five variables on different visitors would reflect their
article preferences and hence they act as covariates in our setup.
While in the original dataset, the pool of articles is dynamic, we only consider fixed
number of arms in our setup. Therefore, we consider only one day’s data (May 1, 2009).
Also, we choose four articles (article id 109511-109514), and keep only the events where
the four articles are included in the article pool and one of the four articles is displayed
to the visitor. As a result, we obtain a reduced dataset consisting of 403,456 interaction
events. In order to speed up the computation, we only work with a randomly chosen
subset of this data which consists of 30,000 interaction events. We also use the first three
principal components for the covariates in order to tackle the curse of dimensionality.
We apply the unbiased offline evaluation method proposed by Li et al. (2010). This
helps in evaluating a contextual bandit algorithm on real datasets. If the arm proposed
by the bandit algorithm matches the actually displayed arm in the data, this event is
kept as a “valid” event; and if the proposed arm does not match the actually displayed
arm, this event is ignored. This process is sequentially run over all the events to generate
the final “valid” dataset, which is then used to evaluate the contextual bandit algorithm
using the click-through rate (CTR, the proportion of times a click is made). This works
because the displayed arm is selected uniformly at random from the pool, therefore the
final ‘valid’ dataset is like a random sample of the underlying population. A random
subsample (as in our case) of this ‘valid’ dataset also works using the same logic. In
addition to this, we induce a delay mechanism in observing the rewards by forcing
the time for observing the response of a visitor to be delayed. We consider the following
delay scenarios, in increasing order of severity of delays:
No delay: Every reward is observed instantaneously.
Delay 1: Geometric delay with probability of success (observing the reward) p = 0.3.
Delay 2: Every 5th reward is not observed by time N = 30000 and other rewards are
obtained with a geometric (p = 0.3) delay.
Delay 3: Each case has probability 0.7 of being delayed, and the delay is half-normal with scale
parameter σ = 1500.
Delay 4: In this case we increase the number of non-observed rewards. Divide the data
into four equal consecutive parts (quarters), such that, in part 1, we only observe every
10th observation (with Geom(0.3) delay) by time N and do not observe the remaining ones;
in part 2, we only observe every 15th observation; in part 3, we only observe every 20th
observation; in part 4, we only observe every 25th observation.
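The delay scenarios above can be sketched in code as follows. This is a hedged illustration rather than the code used for the analysis: the function names are hypothetical, a reward that is never observed by time $N$ is encoded as an infinite delay, the geometric delay is taken on $\{0, 1, 2, \ldots\}$, and applying the Geom(0.3) delay to the observed rewards in parts 2-4 of Delay 4 is an assumption, since the text states it explicitly only for part 1.

```python
import math
import random


def geometric(rng, p):
    """Geometric(p) delay on {0, 1, 2, ...}: failures before the first success."""
    d = 0
    while rng.random() >= p:
        d += 1
    return d


def make_delays(scenario, N, rng):
    """Return delays d_1..d_N for scenario 0 (no delay) to 4.

    math.inf marks a reward that is not observed by time N.
    """
    delays = []
    for j in range(1, N + 1):
        if scenario == 0:
            d = 0
        elif scenario == 1:
            d = geometric(rng, 0.3)
        elif scenario == 2:
            d = math.inf if j % 5 == 0 else geometric(rng, 0.3)
        elif scenario == 3:
            # delayed with probability 0.7; half-normal delay with scale 1500
            d = abs(rng.gauss(0.0, 1500.0)) if rng.random() < 0.7 else 0
        elif scenario == 4:
            quarter = min(3, (j - 1) * 4 // N)       # 0, 1, 2, 3
            step = (10, 15, 20, 25)[quarter]         # observe every step-th event
            d = geometric(rng, 0.3) if j % step == 0 else math.inf
        else:
            raise ValueError("scenario must be in 0..4")
        delays.append(d)
    return delays


example_delays = make_delays(2, 100, random.Random(1))
```

The observation time of the $j$th reward is then $t_j = s_j + d_j$, and a reward enters the estimator at time $n$ only if $t_j \leq n$.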
With the subsample of the dataset with N = 30, 000 events with induced delays and
the unbiased offline evaluation method, we evaluate the performance of the following
algorithms.
• Random: An arm is selected uniformly at random.
• η1: The randomized allocation strategy (pin, hτn) as proposed in Section 3.2 of
chapter 3.
• η2: The randomized allocation strategy (piτn , hτn) as proposed in Section 3.2 of
chapter 3.
• DeLinUCB: The algorithm proposed by Vernade et al. (2018) is a delayed version
of LinUCB, which can handle random delayed feedback. It assumes a linear
model for the rewards as a function of the covariates.
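The offline evaluation step described above (keep an event only when the arm proposed by the algorithm matches the displayed arm) can be sketched as follows. This is a simplified illustration in the spirit of the replay method of Li et al. (2010), not the evaluation code used here: the event layout `(covariate, displayed_arm, click)` and the `policy` signature are hypothetical, and the induced delays are omitted to keep the sketch short.

```python
import random


def replay_ctr(events, policy, rng):
    """Click-through rate of `policy` on logged events via replay.

    events: iterable of (covariate, displayed_arm, click) with click in {0, 1};
            the displayed arm is assumed to have been chosen uniformly at random.
    policy: callable (covariate, valid_history, rng) -> proposed arm.
    Only events where the proposed arm equals the displayed arm are "valid";
    they are appended to the history the policy may learn from.
    """
    valid_history = []
    clicks = 0
    for x, displayed, click in events:
        if policy(x, valid_history, rng) == displayed:
            valid_history.append((x, displayed, click))
            clicks += click
    return clicks / len(valid_history) if valid_history else float("nan")


def always_arm_zero(x, history, rng):
    """Trivial illustrative policy that always proposes arm 0."""
    return 0


toy_events = [((0.1,), 0, 1), ((0.2,), 1, 0), ((0.3,), 0, 0), ((0.4,), 0, 1)]
toy_ctr = replay_ctr(toy_events, always_arm_zero, random.Random(0))
```

A normalized CTR, as reported below, would then be obtained by dividing an algorithm's CTR by the mean CTR of the uniformly random policy evaluated in the same way.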
The choices of bandwidth considered are $\{h_n\} = n^{-1/4}, n^{-1/6}, \log^{-1} n$, and the choices
of exploration probability considered are $\{\pi_n\} = n^{-1/4}, \log^{-1} n, \log^{-2} n$. We only illustrate
the results for $\{\pi_n\} = \log^{-2} n$ with $\{h_n\} = n^{-1/6}, \log^{-1} n$ in Figure 4.1. Each of
the algorithms listed above is run 100 times over the reduced dataset of size 30,000
with the offline evaluation method. The first 150 events are used for initialization.
The resulting CTRs are divided by the mean CTR of the random algorithm, which
results in the normalized CTRs. The boxplots of these normalized CTRs for delay
scenarios 2-4 are shown in Figure 4.1. Note that our proposed strategies, $\eta_1$ and $\eta_2$,
work considerably better than DeLinUCB for delay scenario 2, which is a moderate delay
situation where every 5th reward is not observed by time N = 30000 and other rewards
are obtained with a geometric (p = 0.3) delay. We also see a somewhat better performance
of our strategies, $\eta_1$ and $\eta_2$, for scenario 3, which is a more severe delay setting. The
performance in delay 4, the most severe delay scenario, also shows similar trends with
slightly lower normalized CTR values for all three methods; however, these results seem
to be highly variable (Figure 4.3), largely depending on the sample size and the number
of initialization rounds. Perhaps running these algorithms for an even longer period of
time (a larger subsample of the data) for delay 4 would give more insight, as more data becomes
available after going through the offline evaluation method, and might help reduce the
variability of the results.
Other results (see Section 4.5) show similar trends, and it is seen that a fast-decaying
$\{\pi_n\}$ leads to better results for strategies $\eta_1$ and $\eta_2$.
[Figure 4.1 appears here: six boxplot panels comparing η1, η2 and DeLinUCB, with the vertical axis ranging from 0.7 to 1.4. Panel titles: delay 2, delay 3 and delay 4 with πn = log(n)^{-2}, hn = n^{-1/4}; delay 2, delay 3 and delay 4 with πn = log(n)^{-2}, hn = log(n)^{-1}.]
Figure 4.1: Boxplots of normalized CTRs for the three methods being compared. Each
column represents a particular delay scenario.
4.3.1 Discussion on finite-time results
In this section, we reflect upon the real data analysis conducted in section 4.3 and
discuss some caveats and areas of improvement.
For reasons of computational feasibility, we considered a random subsample of size 30000
and ran the algorithms for 100 independent replications. It can be argued that 100
independent replications might be insufficient for reaching conclusions, as there might
exist significant simulation error. To study this, we increased the number of
replications to 200 for each algorithm on the same data subsample and made boxplots
as in Figure 4.1. We saw very similar patterns (two of them displayed in Figure 4.2),
suggesting that it is perhaps reasonable to conclude that strategies $\eta_1$ and $\eta_2$ perform
better than DeLinUCB for the chosen data subsample.
[Figure 4.2 appears here: two boxplot panels comparing η1, η2 and DeLinUCB, with the vertical axis ranging from 0.7 to 1.4. Panel titles: delay 3 and delay 4, both with πn = log(n)^{-2}, hn = log(n)^{-1}.]
Figure 4.2: Boxplots with 200 replications show similar patterns as Figure 4.1.
Since we only considered a random subsample of the full data, another question
that arises is how do the algorithms compare on the entire dataset. In order to explore
that along with keeping in mind the computational challenge of dealing with a large
dataset in R, we partitioned our dataset into 30 disjoint subsets of size ≈ 13000 each.
Then we conducted a paired t-test to compare the mean normalized CTR for DeLinUCB
and both η1 and η2 respectively. The results were not statistically significant for most
combinations of hyperparameter sequence choices for strategies η1 and η2. We think
that this could be because of the small sample size of the valid data using the offline
evaluation method of Li et al. (2010) for around 13000 events. This led us to consider
larger subsample of the entire dataset and we considered a random subsample of 50000
events. Interestingly, we did not find a significant difference in the performance of
the three algorithms for this subsample, perhaps reflecting a need for more adaptive
algorithms. Another possibility of improvement is using a more efficient dimension
reduction technique for the covariates in order to tackle the curse of dimensionality
for nonparametric regression methods. Since DeLinUCB assumes a linear model, it is
possible that a different parametric model would fit the data better.
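For readers who want to reproduce the kind of comparison described above, the paired t-test on per-subset normalized CTRs can be sketched as below. This is a generic textbook computation written for illustration; the function name and the toy numbers are hypothetical and are not results from the analysis.

```python
import math


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b.

    a[k] and b[k] are the values (e.g. normalized CTRs) of two methods on the
    same k-th data subset. Returns (t, df) with df = n - 1.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two matched samples of length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1


# Toy matched values (illustrative only, not data from the thesis).
t_stat, df = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 2.5, 3.0])
```

The statistic would then be compared with a t distribution with `df` degrees of freedom; with 30 disjoint subsets this gives 29 degrees of freedom.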
4.4 Proofs
4.4.1 Proofs of Lemmas
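Proof of Lemma 4.1.10. Recall that $Q_{n+1}(x) = \{j : 1 \leq t_j \leq n,\ \|x - X_j\|_\infty \leq L h_{\tau_n}\}$
and $Q_{i,n+1}(x) = \{j : j \in Q_{n+1}(x),\ I_j = i\}$. Let $M_{n+1}(x)$ and $M_{i,n+1}(x)$ be the size of
$Q_{n+1}(x)$ and $Q_{i,n+1}(x)$, respectively. It can be seen that if $M_{n+1}(x) = 0$, (4.5) trivially
holds. Therefore, without loss of generality we can assume $M_{n+1}(x) > 0$.
Consider the event $B_{i,n} = \left\{ \frac{1}{M_{i,n+1}(x)} \sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right) \geq c_5 \right\}$. Note that
\[
\begin{aligned}
&P_{X^n, A_N}\left( |\hat{f}_{i,n+1}(x) - f_i(x)| \geq \epsilon \right) \\
&= P_{X^n, A_N}\left( |\hat{f}_{i,n+1}(x) - f_i(x)| \geq \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} \leq \frac{\pi_{\tau_n}}{2} \right)
+ P_{X^n, A_N}\left( |\hat{f}_{i,n+1}(x) - f_i(x)| \geq \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2} \right) \\
&\leq P_{X^n, A_N}\left( \frac{M_{i,n+1}(x)}{M_{n+1}(x)} \leq \frac{\pi_{\tau_n}}{2} \right)
+ P_{X^n, A_N}\left( |\hat{f}_{i,n+1}(x) - f_i(x)| \geq \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2} \right) \qquad (4.8) \\
&\overset{a}{\leq} \exp\left( -\frac{3 M_{n+1}(x) \pi_{\tau_n}}{28} \right)
+ P_{X^n, A_N}\left( |\hat{f}_{i,n+1}(x) - f_i(x)| \geq \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2},\ B_{i,n} \right) \\
&\quad + P_{X^n, A_N}\left( |\hat{f}_{i,n+1}(x) - f_i(x)| \geq \epsilon,\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2},\ B^c_{i,n} \right) \\
&=: \exp\left( -\frac{3 M_{n+1}(x) \pi_{\tau_n}}{28} \right) + A_1 + A_2, \qquad (4.9)
\end{aligned}
\]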
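where the first term in the inequality in step $a$ comes from the extended Bernstein
inequality (A.2). By Assumption 4.1.5 and Definition A.0.2 of the modulus of
continuity, we have
\[
\begin{aligned}
|\hat{f}_{i,n+1}(x) - f_i(x)|
&= \left| \frac{\sum_{j \in J_{i,n+1}} Y_{i,j} K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} - f_i(x) \right| \\
&= \left| \frac{\sum_{j \in J_{i,n+1}} (f_i(X_j) + \epsilon_j) K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} - f_i(x) \right| \\
&= \left| \frac{\sum_{j \in J_{i,n+1}} (f_i(X_j) - f_i(x)) K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)}
+ \frac{\sum_{j \in J_{i,n+1}} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} \right| \\
&\leq \sup_{x, y :\, \|x - y\|_\infty \leq L h_{\tau_n}} |f_i(x) - f_i(y)|
+ \left| \frac{\sum_{j \in J_{i,n+1}} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} \right| \\
&= w(L h_{\tau_n}; f_i)
+ \left| \frac{\sum_{j \in J_{i,n+1}} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right)}{\sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right)} \right|. \qquad (4.10)
\end{aligned}
\]
Under $B_{i,n}$,
\[
|\hat{f}_{i,n+1}(x) - f_i(x)| \leq w(L h_{\tau_n}; f_i) + \frac{1}{c_5 M_{i,n+1}(x)} \left| \sum_{j \in Q_{i,n+1}(x)} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right) \right|.
\]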
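Using this, we will construct an upper bound for $A_1$. Define $\sigma_t = \inf\{\tilde{n} : \sum_{j=1}^{\tilde{n}} I\{I_j =
i,\ t_j \leq n \text{ and } \|x - X_j\|_\infty \leq L h_{\tau_n}\} \geq t\}$, $t \geq 1$. Then, for large enough $n$, by Lemma 4.1.9,
$\epsilon > w(L h_{\tau_n}; f_i)$ a.s., and we have
\[
\begin{aligned}
A_1 &\leq P_{X^n, A_N}\left( \left| \sum_{j \in Q_{i,n+1}(x)} \epsilon_j K\left(\frac{x - X_j}{h_{\tau_n}}\right) \right| \geq c_5 M_{i,n+1}(x) (\epsilon - w(L h_{\tau_n}; f_i)),\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2} \right) \qquad (4.11) \\
&\leq \sum_{\bar{n}=0}^{n} P_{X^n, A_N}\left( \left| \sum_{t=1}^{\bar{n}} \epsilon_{\sigma_t} K\left(\frac{x - X_{\sigma_t}}{h_{\tau_n}}\right) \right| \geq c_5 \bar{n} (\epsilon - w(L h_{\tau_n}; f_i)),\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2},\ M_{i,n+1}(x) = \bar{n} \right) \\
&\leq \sum_{\bar{n} = \lceil M_{n+1}(x) \pi_{\tau_n}/2 \rceil}^{n} P_{X^n, A_N}\left( \left| \sum_{t=1}^{\bar{n}} \epsilon_{\sigma_t} K\left(\frac{x - X_{\sigma_t}}{h_{\tau_n}}\right) \right| \geq c_5 \bar{n} (\epsilon - w(L h_{\tau_n}; f_i)) \right) \\
&\leq \sum_{\bar{n} = \lceil M_{n+1}(x) \pi_{\tau_n}/2 \rceil}^{n} 2 \exp\left( -\frac{\bar{n} c_5^2 (\epsilon - w(L h_{\tau_n}; f_i))^2}{2 c_4^2 v^2 + 2 c_4 c (\epsilon - w(L h_{\tau_n}; f_i))} \right) \\
&\leq 2N \exp\left( -\frac{c_5^2 M_{n+1}(x) \pi_{\tau_n} (\epsilon - w(L h_{\tau_n}; f_i))^2}{4 c_4^2 v^2 + 4 c_4 c (\epsilon - w(L h_{\tau_n}; f_i))} \right), \qquad (4.12)
\end{aligned}
\]
where the last inequality follows from Lemma A.1.7 and the upper boundedness of the
kernel function.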
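Now, to find the bound for $A_2$: under $B^c_{i,n}$ we run into technical problems since the
denominator of the Nadaraya-Watson estimator can be extremely small; hence we will
replace the kernel $K(\cdot)$ in (4.1) with a uniform kernel $I(\|u\|_\infty \leq L)$. That is, for the
case when
\[
B^c_{i,n} := \left\{ \sum_{j \in J_{i,n+1}} K\left(\frac{x - X_j}{h_{\tau_n}}\right) < c_5 \sum_{j \in J_{i,n+1}} I(\|x - X_j\|_\infty \leq L h_{\tau_n}) \right\}, \tag{4.13}
\]
for some small positive constant $0 < c_5 < 1$, we will use the uniform kernel. Therefore,
using (4.10), (4.13) and Lemma A.1.7, we get
\[
\begin{aligned}
A_2 &\leq P_{X^n, A_N}\left( \left| \sum_{j \in J_{i,n+1}} \epsilon_j I(\|x - X_j\|_\infty \leq L h_{\tau_n}) \right| \geq M_{i,n+1}(x) (\epsilon - w(L h_{\tau_n}; f_i)),\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2} \right) \qquad (4.14) \\
&\leq \sum_{\bar{n}=0}^{n} P_{X^n, A_N}\left( \left| \sum_{t=1}^{\bar{n}} \epsilon_{\sigma_t} I(\|x - X_{\sigma_t}\|_\infty \leq L h_{\tau_n}) \right| \geq \bar{n} (\epsilon - w(L h_{\tau_n}; f_i)),\ \frac{M_{i,n+1}(x)}{M_{n+1}(x)} > \frac{\pi_{\tau_n}}{2},\ M_{i,n+1}(x) = \bar{n} \right) \\
&\leq \sum_{\bar{n} = \lceil M_{n+1}(x) \pi_{\tau_n}/2 \rceil}^{n} P_{X^n, A_N}\left( \left| \sum_{t=1}^{\bar{n}} \epsilon_{\sigma_t} I(\|x - X_{\sigma_t}\|_\infty \leq L h_{\tau_n}) \right| \geq \bar{n} (\epsilon - w(L h_{\tau_n}; f_i)) \right) \\
&\leq \sum_{\bar{n} = \lceil M_{n+1}(x) \pi_{\tau_n}/2 \rceil}^{n} 2 \exp\left( -\frac{\bar{n} (\epsilon - w(L h_{\tau_n}; f_i))^2}{2 v^2 + 2 c (\epsilon - w(L h_{\tau_n}; f_i))} \right) \\
&\leq 2N \exp\left( -\frac{M_{n+1}(x) \pi_{\tau_n} (\epsilon - w(L h_{\tau_n}; f_i))^2}{4 v^2 + 4 c (\epsilon - w(L h_{\tau_n}; f_i))} \right). \qquad (4.15)
\end{aligned}
\]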
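Therefore, using the fact that $0 < c_5 \leq 1 \leq c_4$, together with (4.12) and (4.15) in (4.9), we get
\[
P_{X^n, A_N}\left( |\hat{f}_{i,n+1}(x) - f_i(x)| \geq \epsilon \right)
\leq \exp\left( -\frac{3 M_{n+1}(x) \pi_{\tau_n}}{28} \right)
+ 4N \exp\left( -\frac{c_5^2 M_{n+1}(x) \pi_{\tau_n} (\epsilon - w(L h_{\tau_n}; f_i))^2}{4 c_4^2 v^2 + 4 c_4 c (\epsilon - w(L h_{\tau_n}; f_i))} \right). \tag{4.16}
\]
The proof of Lemma 4.1.11 follows the same steps with $\pi_{\tau_n}$ replaced by $\pi_n$.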
Next, we prove a lemma that would be used to prove Theorem 4.1.12.
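Lemma 4.4.1. An $\epsilon$ that satisfies
\[
4N \exp\left( -\frac{c_5^2 \underline{c}\, \tilde{a}_1 (2L h_{q(n)})^d \pi_{q(n)} q(n) (\epsilon - w(L h_{q(n)}; f_i))^2}{16 c_4^2 v^2 + 16 c_4 c (\epsilon - w(L h_{q(n)}; f_i))} \right) \leq \frac{\delta}{4 \ell N} \tag{4.17}
\]
is given by
\[
\tilde{\epsilon}_{i,n} = w(L h_{q(n)}; f_i) + \sqrt{\frac{64 c_4^2 v^2 \log(16 \ell N^2/\delta)}{c_5^2 \underline{c}\, \tilde{a}_1 (2L)^d h^d_{q(n)} \pi_{q(n)} q(n)}}.
\]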
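Proof of Lemma 4.4.1. Let $Z := \epsilon - w(L h_{q(n)}; f_i)$; then (4.17) becomes
\[
\frac{c_5^2 \underline{c}\, \tilde{a}_1 (2L h_{q(n)})^d \pi_{q(n)} q(n) Z^2}{16 c_4^2 v^2 + 16 c_4 c Z} \geq \log\left( \frac{16 \ell N^2}{\delta} \right).
\]
Let $A_1 = c_5^2 \underline{c}\, \tilde{a}_1 (2L)^d$, $A_2 = 16 c_4^2 v^2$, $A_3 = 16 c_4 c$. Then
\[
A_1 q(n) h^d_{q(n)} \pi_{q(n)} Z^2 - A_3 \log\left( \frac{16 \ell N^2}{\delta} \right) Z - A_2 \log\left( \frac{16 \ell N^2}{\delta} \right) \geq 0. \tag{4.18}
\]
The left-hand side is a quadratic polynomial in $Z$. Solving for $Z$,
\[
A_1 q(n) h^d_{q(n)} \pi_{q(n)} Z^2 - A_3 \log\left( \frac{16 \ell N^2}{\delta} \right) Z - A_2 \log\left( \frac{16 \ell N^2}{\delta} \right) = 0
\]
\[
\Rightarrow\ Z = \frac{1}{2} \left[ \frac{A_3 \log(16 \ell N^2/\delta)}{A_1 q(n) h^d_{q(n)} \pi_{q(n)}}
\pm \sqrt{ \frac{A_3^2 \log^2(16 \ell N^2/\delta)}{(A_1 q(n) h^d_{q(n)} \pi_{q(n)})^2} + \frac{4 A_2 \log(16 \ell N^2/\delta)}{A_1 q(n) h^d_{q(n)} \pi_{q(n)}} } \right].
\]
This gives two real roots for the quadratic equation. Therefore, if we want some
value of $Z$ such that (4.18) holds, we can use a point that is larger than the roots
$-b \pm \sqrt{b^2 + d^2}$, and we know that $d \geq -b \pm \sqrt{b^2 + d^2}$. Therefore, a potential candidate
could be
\[
Z = \sqrt{\frac{4 A_2 \log(16 \ell N^2/\delta)}{A_1 q(n) h^d_{q(n)} \pi_{q(n)}}}
= \sqrt{\frac{64 c_4^2 v^2 \log(16 \ell N^2/\delta)}{c_5^2 \underline{c}\, \tilde{a}_1 (2L)^d h^d_{q(n)} \pi_{q(n)} q(n)}},
\]
which means that we want
\[
\tilde{\epsilon}_{i,n} = w(L h_{q(n)}; f_i) + \sqrt{\frac{64 c_4^2 v^2 \log(16 \ell N^2/\delta)}{c_5^2 \underline{c}\, \tilde{a}_1 (2L)^d h^d_{q(n)} \pi_{q(n)} q(n)}}.
\]
A similar lemma, with $\pi_{q(n)}$ replaced by $\pi_n$, can be derived and will be used in the
proof of Theorem 4.1.13.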
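Lemma 4.4.2. An $\epsilon$ that satisfies
\[
4N \exp\left( -\frac{c_5^2 \underline{c}\, \tilde{\tilde{a}}_1 (2L h_{q(n)})^d \pi_n q(n) (\epsilon - w(L h_{q(n)}; f_i))^2}{16 c_4^2 v^2 + 16 c_4 c (\epsilon - w(L h_{q(n)}; f_i))} \right) \leq \frac{\delta}{4 \ell N} \tag{4.19}
\]
is given by
\[
\tilde{\epsilon}'_{i,n} = w(L h_{q(n)}; f_i) + \sqrt{\frac{64 c_4^2 v^2 \log(16 \ell N^2/\delta)}{c_5^2 \underline{c}\, \tilde{\tilde{a}}_1 (2L)^d h^d_{q(n)} \pi_n q(n)}}.
\]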
4.4.2 Proofs of Theorems
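Proof of Theorem 4.1.12. By the definition of $\hat{i}_j$, $\hat{f}_{i^*(X_j),j}(X_j) \leq \hat{f}_{\hat{i}_j,j}(X_j)$, so the regret
accumulated after the initial forced sampling period is
\[
\begin{aligned}
&\sum_{j=m_0+1}^{N} \left( f^*(X_j) - f_{I_j}(X_j) \right) \\
&= \sum_{j=m_0+1}^{N} \left( f_{i^*(X_j)}(X_j) - \hat{f}_{i^*(X_j),j}(X_j) + \hat{f}_{i^*(X_j),j}(X_j) - f_{\hat{i}_j}(X_j) + f_{\hat{i}_j}(X_j) - f_{I_j}(X_j) \right) \\
&\leq \sum_{j=m_0+1}^{N} \left( f_{i^*(X_j)}(X_j) - \hat{f}_{i^*(X_j),j}(X_j) + \hat{f}_{\hat{i}_j,j}(X_j) - f_{\hat{i}_j}(X_j) + f_{\hat{i}_j}(X_j) - f_{I_j}(X_j) \right) \\
&\leq \sum_{j=m_0+1}^{N} \left( 2 \sup_{1 \leq i \leq \ell} |\hat{f}_{i,j}(X_j) - f_i(X_j)| + A\, I\{I_j \neq \hat{i}_j\} \right). \qquad (4.20)
\end{aligned}
\]
Here the first term corresponds to the regret incurred due to estimation error and the
second term corresponds to the randomization error.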
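We will first find an upper bound for the estimation error. Note that Lemma 4.1.10 gives a probability inequality for the estimation error conditional on $A_N$ and $X^n$. Therefore, in order to get an unconditional probability bound on the estimation error, we first remove the conditioning on $X^n$ and then remove the conditioning on $A_N$ in (4.5). Given arm $i$, for a large enough $n$ satisfying $n \ge m_0 + 1$ and $\varepsilon > w(Lh_{\tau_n}; f_i)$ a.s., consider
$$\begin{aligned}
&P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon\right) \\
&= P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; M_{n+1}(X_{n+1}) \le \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\right) \\
&\quad + P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; M_{n+1}(X_{n+1}) > \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\right) \qquad \text{(4.21)} \\
&\le P_{A_N}\left(M_{n+1}(X_{n+1}) \le \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\right) \\
&\quad + P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; M_{n+1}(X_{n+1}) > \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\right) \\
&\le \exp\left(-\frac{3c(2Lh_{\tau_n})^d \tau_n}{28}\right) + \exp\left(-\frac{3c(2Lh_{\tau_n})^d \tau_n \pi_{\tau_n}}{56}\right) \\
&\quad + 4N \exp\left(-\frac{c_5^2\, c\, (2Lh_{\tau_n})^d \tau_n \pi_{\tau_n} (\varepsilon - w(Lh_{\tau_n}; f_i))^2}{8 c_4^2 v^2 + 8 c_4 c\, (\varepsilon - w(Lh_{\tau_n}; f_i))}\right), \qquad \text{(4.22)}
\end{aligned}$$
where the last inequality follows from Lemma 4.1.10 and (A.2), and the fact that $E(M_{n+1}(X_{n+1}) \mid A_N) \ge c(2Lh_{\tau_n})^d \tau_n$.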
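Now, we want to remove the conditioning on $A_N$. Recall that $d_j \overset{ind}{\sim} G_j$, for $j \ge 1$. Therefore, for the known visiting times $\{s_j, j \ge 1\}$, $P(t_j \le n) = P(d_j + s_j \le n) = P(d_j \le n - s_j) = G_j(n - s_j)$; hence,
$$\begin{aligned}
&P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon\right) \\
&= P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \tau_n \le \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) \\
&\quad + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \tau_n > \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) \\
&\le P\left(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \tau_n > \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) \\
&\le P\left(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \tau_n > \frac{a_1 q(n)}{2}\right) \\
&= P\left(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) + E\left[P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \tau_n > \frac{a_1 q(n)}{2}\right)\right],
\end{aligned} \tag{4.23}$$
for large enough $n$, where $a_1$ is a positive constant that arises from Assumption 4.1.7. Also, note that the second term in the last equality of (4.23) is due to the law of iterated expectation. Let $q_1(n) = q(n)/2$. For $\tau_n > a_1 q_1(n)$, since we have the condition that $h_{q(n)}^d \pi_{q(n)} q(n)/\log n \to \infty$, for large enough $n$ we can assume that $h_{\tau_n}^d \tau_n \ge \tilde a_1 h_{q_1(n)}^d q_1(n)$ and $h_{\tau_n}^d \pi_{\tau_n} \tau_n \ge \tilde a_1 h_{q_1(n)}^d \pi_{q_1(n)} q_1(n)$, where $\tilde a_1$ is a constant that is a function of the constant $a_1$, which depends on the user-determined choice of sequences $\{\pi_n\}$ and $\{h_n\}$. For large enough $n$, $\varepsilon - w(Lh_{q(n)}; f_i) > 0$, and using (4.22) and (A.2) in (4.23) we have
$$\begin{aligned}
&P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon\right) \\
&\le \exp\left(-\frac{3 a_1 q_1(n)}{14}\right) + \exp\left(-\frac{3 c \tilde a_1 (2Lh_{q_1(n)})^d q_1(n)}{28}\right) + \exp\left(-\frac{3 c \tilde a_1 (2Lh_{q_1(n)})^d q_1(n) \pi_{q_1(n)}}{56}\right) \\
&\quad + 4N \exp\left(-\frac{c_5^2\, c\, \tilde a_1 (2Lh_{q_1(n)})^d q_1(n) \pi_{q_1(n)} (\varepsilon - w(Lh_{q_1(n)}; f_i))^2}{8 c_4^2 v^2 + 8 c_4 c\, (\varepsilon - w(Lh_{q_1(n)}; f_i))}\right) \\
&\le \exp\left(-\frac{3 a_1 q(n)}{28}\right) + \exp\left(-\frac{3 c \tilde a_1 (2Lh_{q(n)})^d q(n)}{56}\right) + \exp\left(-\frac{3 c \tilde a_1 (2Lh_{q(n)})^d q(n) \pi_{q(n)}}{112}\right) \\
&\quad + 4N \exp\left(-\frac{c_5^2\, c\, \tilde a_1 (2Lh_{q(n)})^d q(n) \pi_{q(n)} (\varepsilon - w(Lh_{q(n)}; f_i))^2}{16 c_4^2 v^2 + 16 c_4 c\, (\varepsilon - w(Lh_{q(n)}; f_i))}\right).
\end{aligned} \tag{4.24}$$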
Given $0 < \delta < 1$, we want to bound the right-hand side above by $\delta$. To do that for the first three terms, given the total time horizon $N$, we define a special time point $n'_\delta$ by
$$n'_\delta = \min\left\{n > m_0 : \exp\left(-\frac{3 c \tilde a_1 (2Lh_{q(n)})^d \pi_{q(n)} q(n)}{112}\right) \le \frac{\delta}{4\ell N}\right\}. \tag{4.25}$$
Under the condition that $h_{q(n)}^d \pi_{q(n)} q(n)/\log(n) \to \infty$ as $n \to \infty$, we have $n'_\delta/N \to 0$ as $N \to \infty$. Therefore, if the total time horizon is long enough, we have $N > n'_\delta$.
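The definition (4.25) can be evaluated by a direct search once the sequences are fixed. The sketch below is only an illustration: the callables `h`, `pi`, `q`, the lumped constant `c_a` (standing for $c\,\tilde a_1$), and every numeric value are assumptions made for the example, not choices from the thesis.

```python
import math

def n_prime_delta(h, pi, q, d, L, c_a, ell, N, delta, m0, n_max=10**6):
    """Smallest n > m0 with exp(-3*c_a*(2*L*h(q(n)))^d * pi(q(n)) * q(n) / 112)
    <= delta / (4*ell*N); returns None if no such n <= n_max exists."""
    threshold = math.log(4 * ell * N / delta)  # compare exponents, not exps
    for n in range(m0 + 1, n_max + 1):
        qn = q(n)
        exponent = 3 * c_a * (2 * L * h(qn)) ** d * pi(qn) * qn / 112
        if exponent >= threshold:
            return n
    return None

def exponent_at(n):
    # Same exponent as above for the constant sequences used below.
    return 3 * 1.0 * (2 * 0.5 * 1.0) ** 1 * 0.5 * n / 112

# Hypothetical constant sequences, chosen so the answer is easy to inspect.
n_star = n_prime_delta(h=lambda n: 1.0, pi=lambda n: 0.5, q=lambda n: n,
                       d=1, L=0.5, c_a=1.0, ell=2, N=1000, delta=0.1, m0=10)
```

Comparing exponents against $\log(4\ell N/\delta)$ rather than comparing exponentials avoids underflow for large $n$.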
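For the fourth term on the right-hand side of (4.24), we want to choose an $\varepsilon$ such that
$$4N \exp\left(-\frac{c_5^2\, c\, \tilde a_1 (2Lh_{q(n)})^d\, \pi_{q(n)}\, q(n)\, (\varepsilon - w(Lh_{q(n)}; f_i))^2}{16 c_4^2 v^2 + 16 c_4 c\, (\varepsilon - w(Lh_{q(n)}; f_i))}\right) \le \frac{\delta}{4\ell N}.$$
One such value for $\varepsilon$, as shown in Lemma 4.4.1, is given by
$$\tilde\varepsilon_{i,n} = w(Lh_{q(n)}; f_i) + \sqrt{\frac{64 c_4^2 v^2 \log(16\ell N^2/\delta)}{c_5^2\, c\, \tilde a_1 (2L)^d\, h_{q(n)}^d\, \pi_{q(n)}\, q(n)}}. \tag{4.26}$$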
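By (4.24), (4.25) and (4.26), for $n \ge n'_\delta$, we have that
$$P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \tilde\varepsilon_{i,n}\right) \le \frac{\delta}{4\ell N} + \frac{\delta}{4\ell N} + \frac{\delta}{4\ell N} + \frac{\delta}{4\ell N} = \frac{\delta}{\ell N},$$
which implies that
$$P\left(\sum_{n=n'_\delta+1}^{N} 2 \sup_{1\le i\le \ell} |\hat f_{i,n}(X_n) - f_i(X_n)| \ge \sum_{n=n'_\delta+1}^{N} 2 \max_{1\le i\le \ell} \tilde\varepsilon_{i,n-1}\right) \le \delta. \tag{4.27}$$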
Now we want to get a bound for the randomization error.
Let $\sigma_t = \min\{\bar n : \sum_{j=n'_\delta+1}^{\bar n} I(t_j \le N) \ge t\}$, for $t \in \mathbb{Z}$. For the cases when the rewards are observed by time $N$, i.e. for all instances $\sigma_t$, $t \in \mathbb{Z}$, we update only when a new reward is observed, that is, at every $\sigma_t$, $t \ge 1$. Between the time points corresponding to two consecutive reward observations, $\{\pi_t\}$ keeps the value it took at the previous observed case. In other words, we have $\sigma_{t+1} - \sigma_t$ identical values $(\ell-1)\pi_t$ for the exploration probability for each $t$; hence
$$\sum_{n=n'_\delta+1}^{N} P(I_n \ne \hat i_n) = \sum_{n=n'_\delta+1}^{N} (\ell-1)\pi_{\tau_n} = \sum_{t=1}^{\tau_N} (\sigma_{t+1} - \sigma_t)(\ell-1)\pi_t,$$
and w.l.o.g. assume that $\sigma_{\tau_N+1} = N$.
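The times $\sigma_t$ are simply the indices (after $n'_\delta$) of the rounds whose rewards arrive by time $N$, listed in order. A minimal sketch, assuming the observation times $t_j$ are available as a list (the function name and the toy data are hypothetical):

```python
def sigma_times(t_obs, N, n_start):
    """Return [sigma_1, sigma_2, ...]: sigma_t is the smallest n_bar such that
    at least t rounds j in (n_start, n_bar] satisfy t_j <= N.
    t_obs[j-1] holds the observation time t_j of round j (rounds are 1-indexed)."""
    return [j for j in range(n_start + 1, N + 1) if t_obs[j - 1] <= N]

# Toy example: 6 rounds, rewards of rounds 2 and 5 arrive after the horizon.
t_obs = [2, 9, 4, 5, 12, 7]
sigmas = sigma_times(t_obs, N=6, n_start=0)
gaps = [b - a for a, b in zip(sigmas, sigmas[1:])]  # sigma_{t+1} - sigma_t
```

The list `gaps` contains the quantities $\sigma_{t+1} - \sigma_t$ that multiply $(\ell-1)\pi_t$ in the sum above.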
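Given $\varepsilon > 0$ and the set of observed indices by time $N$, $A_N$, we have by Bernstein's inequality that
$$\begin{aligned}
&P_{A_N, X^N}\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon\right) \\
&\le \exp\left(-\frac{\varepsilon^2}{2A^2\left(\sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t[1-(\ell-1)\pi_t] + \varepsilon/3\right)}\right).
\end{aligned} \tag{4.28}$$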
Next, for some positive constant $M > 0$, we study the event $B_t := \{\sigma_{t+1} - \sigma_t > M\}$ for $t \ge 1$. Note that the event $B_t$ is contained in the event that the first $M/2$ cases in $[\sigma_t, \sigma_{t+1}]$ are delayed by more than $M/2$, that is,
$$\{\sigma_{t+1} - \sigma_t > M\} \subset \left\{d_{\sigma_t+1} > \frac{M}{2}, \ldots, d_{\sigma_t+M/2} > \frac{M}{2}\right\}.$$
Therefore, using this fact and the independence of delays, we have that
$$\begin{aligned}
P(\sigma_{t+1} - \sigma_t > M) &\le P\left(d_{\sigma_t+1} > \frac{M}{2}, \ldots, d_{\sigma_t+M/2} > \frac{M}{2}\right) \\
&\le \prod_{s=1}^{M/2} P\left(d_{\sigma_t+s} > \frac{M}{2}\right) = \prod_{s=1}^{M/2} \left(1 - G_{d_{\sigma_t+s}}\left(\frac{M}{2}\right)\right) \qquad \text{(4.29)} \\
&\le \left(\frac{M/2 - \sum_{s=1}^{M/2} G_{d_{\sigma_t+s}}(M/2)}{M/2}\right)^{M/2} \\
&\le \left(1 - \frac{a_1 q(M/2)}{M/2}\right)^{M/2}, \quad \text{for all } t = 1, \ldots, \tau_N, \qquad \text{(4.30)}
\end{aligned}$$
where the second-to-last inequality comes from the AM-GM inequality and the last inequality follows from Assumption 4.1.7 and $q(M/2) \le M/2$ for all $M$, by construction. We see that the above upper bound decays at an exponential rate as $M$ grows. As the right-hand side above is free of $t$ (by independence of delays), we have that
$$P\left(\max_t (\sigma_{t+1} - \sigma_t) \ge M\right) \le \left(1 - \frac{a_1 q(M/2)}{M/2}\right)^{M/2}.$$
We can choose $M$ such that, for a given $\delta$,
$$\left(1 - \frac{a_1 q(M/2)}{M/2}\right)^{M/2} = \delta. \tag{4.31}$$
Since $q$ is known a priori and $a_1$ is a positive constant, we can solve for $M$ in the above equation. Consequently, since $M$ will depend on $\delta$, we denote it as $M_\delta$. Depending on what $q$ is for a given problem, we will always be able to find a corresponding $M_\delta$.
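For a concrete $q$, a value of $M_\delta$ can be found numerically. The sketch below does not solve (4.31) with equality; as a stand-in it returns the smallest even $M$ for which the bound in (4.30) is at most $\delta$. The choice $q(x) = x$, the value $a_1 = 0.5$, and the function names are assumptions for illustration; the search also assumes $a_1 q(M/2) \le M/2$ so that the base stays in $[0, 1]$.

```python
def gap_bound(M, a1, q):
    """Upper bound (1 - a1*q(M/2)/(M/2))^(M/2) from (4.30)."""
    half = M / 2
    return (1 - a1 * q(half) / half) ** half

def smallest_even_M(delta, a1, q, M_max=10**6):
    """Smallest even M with gap_bound(M) <= delta, or None if none up to M_max."""
    for M in range(2, M_max + 1, 2):
        if gap_bound(M, a1, q) <= delta:
            return M
    return None

# With q(x) = x and a1 = 0.5 the bound reduces to 0.5 ** (M / 2).
M_delta = smallest_even_M(delta=0.05, a1=0.5, q=lambda x: x)
```

In this toy setting the bound is $0.5^{M/2}$, which first drops below $0.05$ at $M/2 = 5$.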
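Also, note that using Hoeffding's inequality (A.1.1), we have that
$$P\left(\tau_N \ge E(\tau_N) + \frac{\varepsilon}{A}\right) \le \exp\left(-\frac{2\varepsilon^2}{A^2 N}\right). \tag{4.32}$$
We can choose $\varepsilon_1(N, \delta) = \sqrt{(N/2)\log(1/\delta)}$ such that this probability is less than $\delta$, that is,
$$P\left(\tau_N \ge E(\tau_N) + \varepsilon_1(N, \delta)\right) \le \delta. \tag{4.33}$$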
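The choice of $\varepsilon_1(N, \delta)$ can be verified by plugging it back into the Hoeffding bound: with $\varepsilon/A = \varepsilon_1(N,\delta)$, the right-hand side of (4.32) equals $\exp(-2\varepsilon_1^2/N) = \delta$. A small sketch (function names are illustrative, not from the thesis):

```python
import math

def eps1(N, delta):
    """epsilon_1(N, delta) = sqrt((N/2) * log(1/delta))."""
    return math.sqrt((N / 2) * math.log(1 / delta))

def hoeffding_tail(dev, N):
    """Right-hand side of (4.32) written in terms of dev = epsilon / A."""
    return math.exp(-2 * dev**2 / N)

tail = hoeffding_tail(eps1(5000, 0.05), 5000)
```

The value `tail` coincides with $\delta = 0.05$ up to floating-point error, for any horizon $N$.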
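Now consider
$$\begin{aligned}
&P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon\right) \\
&= P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) \ge M_\delta\right) \\
&\quad + P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta\right) \\
&\le P\left(\max_t(\sigma_{t+1}-\sigma_t) \ge M_\delta\right) \\
&\quad + P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N \ge E(\tau_N) + \frac{\varepsilon}{A}\right) \\
&\quad + P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N) + \frac{\varepsilon}{A}\right) \\
&\le P\left(\max_t(\sigma_{t+1}-\sigma_t) \ge M_\delta\right) + P\left(\tau_N \ge E(\tau_N) + \frac{\varepsilon}{A}\right) \\
&\quad + P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N) + \frac{\varepsilon}{A}\right) \\
&\le \delta + \exp\left(-\frac{2\varepsilon^2}{A^2 N}\right) \\
&\quad + E\left[P_{A_N, X^N}\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N) + \frac{\varepsilon}{A}\right)\right],
\end{aligned} \tag{4.34}$$
where the first term follows from (4.30) and the definition of $M_\delta$, the second term from (4.32), and the last term from the law of iterated expectation.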
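Then using (4.28) we have that
$$\begin{aligned}
&P_{A_N, X^N}\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon,\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N) + \frac{\varepsilon}{A}\right) \\
&\le \begin{cases}
\exp\left(-\dfrac{\varepsilon^2}{2A^2 M_\delta (E(\tau_N) + \varepsilon)/4 + \varepsilon/3}\right), & \text{if } \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N) + \varepsilon/A; \\
0, & \text{otherwise.}
\end{cases}
\end{aligned}$$
Using this in (4.34), we get
$$\begin{aligned}
&E\, P_{A_N, X^N}\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon,\; \max_t(\sigma_{t+1}-\sigma_t) \le M_\delta,\; \tau_N < E(\tau_N) + \frac{\varepsilon}{A}\right) \qquad \text{(4.35)} \\
&\le \exp\left(-\frac{\varepsilon^2}{2A^2 M_\delta (E(\tau_N) + \varepsilon)/4 + \varepsilon/3}\right). \qquad \text{(4.36)}
\end{aligned}$$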
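Therefore, combining (4.34) and (4.36), we get that with probability at least $1-\delta$,
$$\begin{aligned}
&P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon\right) \\
&\le \delta + \exp\left(-\frac{2\varepsilon^2}{A^2 N}\right) + \exp\left(-\frac{\varepsilon^2}{2A^2 M_\delta (E(\tau_N) + \varepsilon)/4 + \varepsilon/3}\right).
\end{aligned}$$
In order to bound the right-hand side by $2\delta$, let
$$\varepsilon_{N,\delta} = \max\left\{A\sqrt{M_\delta \frac{E(\tau_N)}{2} \log\left(\frac{2}{\delta}\right)},\; A\sqrt{(N/2)\log(2/\delta)}\right\}.$$
For this chosen $\varepsilon$, we have that
$$\begin{aligned}
&P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon_{N,\delta}\right) \le 2\delta \\
&\Rightarrow\; P\left(A\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) \ge A\sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t + \varepsilon_{N,\delta}\right) \le 2\delta.
\end{aligned} \tag{4.37}$$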
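Note that
$$\begin{aligned}
&P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon_{N,\delta}\right) \\
&\ge P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon_{N,\delta},\; \tau_N \le E(\tau_N) + \varepsilon_1(N,\delta),\; \max_t(\sigma_{t+1}-\sigma_t) \le M_\delta\right) \\
&= P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon_{N,\delta} \,\Big|\, \tau_N \le E(\tau_N) + \varepsilon_1(N,\delta),\; \max_t(\sigma_{t+1}-\sigma_t) \le M_\delta\right) \\
&\qquad \times P\left(\tau_N \le E(\tau_N) + \varepsilon_1(N,\delta)\right) P\left(\max_t(\sigma_{t+1}-\sigma_t) \le M_\delta\right) \\
&\ge P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{E(\tau_N)+\varepsilon_1(N,\delta)} M_\delta(\ell-1)\pi_t\right) \ge \varepsilon_{N,\delta}\right)(1-\delta)^2,
\end{aligned} \tag{4.38}$$
where the last inequality follows from (4.33) and (4.31). From Assumption 4.1.8, we also have that $E(\tau_N) + \varepsilon_1(N,\delta) < N$, so the above statement is meaningful. Now, from (4.37) and (4.38), we get
$$\begin{aligned}
&P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{E(\tau_N)+\varepsilon_1(N,\delta)} M_\delta(\ell-1)\pi_t\right) \ge \varepsilon_{N,\delta}\right)(1-\delta)^2 \\
&\le P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon_{N,\delta}\right) \le 2\delta \\
&\Rightarrow\; P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{E(\tau_N)+\varepsilon_1(N,\delta)} M_\delta(\ell-1)\pi_t\right) \ge \varepsilon_{N,\delta}\right) \le \frac{2\delta}{(1-\delta)^2}.
\end{aligned} \tag{4.39}$$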
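From (4.27) and (4.39), we get that with probability at least $1 - 2\delta/(1-\delta)^2$, the cumulative regret for strategy $\eta_2$ satisfies
$$\begin{aligned}
R_N(\eta_2) <\; & A n'_\delta + \sum_{n=n'_\delta+1}^{N} 2\left(\max_{1\le i\le \ell} w(Lh_{q(n)}; f_i) + \sqrt{\frac{64 c_4^2 v^2 \log(12\ell N^2/\delta)}{c_5^2\, c\, (2L)^d\, h_{q(n)}^d\, \pi_{q(n)}\, q(n)}}\right) \\
& + A\sum_{t=1}^{N^*(\delta)} M_\delta(\ell-1)\pi_t + \max\left\{A\sqrt{M_\delta \frac{E(\tau_N)}{2} \log\left(\frac{2}{\delta}\right)},\; A\sqrt{\left(\frac{N}{2}\right)\log\left(\frac{2}{\delta}\right)}\right\},
\end{aligned}$$
for $N^*(\delta) = E(\tau_N) + \varepsilon_1(N,\delta)$. Let $\delta < 1/4$ and we get the desired result.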
4.4.3 Proof of Theorem 4.1.13
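Proof of Theorem 4.1.13. Similar to Theorem 4.1.12, we will first find an upper bound for the estimation error. In order to do so, in Lemma 4.1.11, we first remove the conditioning on $X^n$ and then remove the conditioning on $A_N$. Given arm $i$, $n \ge m_0 + 1$ and $\varepsilon > w(Lh_{\tau_n}; f_i)$ a.s., consider
$$\begin{aligned}
&P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon\right) \\
&\le P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; M_{n+1}(X_{n+1}) \le \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\right) \\
&\quad + P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; M_{n+1}(X_{n+1}) > \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\right) \\
&\le P_{A_N}\left(M_{n+1}(X_{n+1}) \le \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\right) \\
&\quad + P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; M_{n+1}(X_{n+1}) > \frac{c(2Lh_{\tau_n})^d \tau_n}{2}\right) \\
&\le \exp\left(-\frac{3c(2Lh_{\tau_n})^d \tau_n}{28}\right) + \exp\left(-\frac{3c(2Lh_{\tau_n})^d \tau_n \pi_n}{56}\right) \\
&\quad + 4N \exp\left(-\frac{c_5^2\, c\, (2Lh_{\tau_n})^d \tau_n \pi_n (\varepsilon - w(Lh_{\tau_n}; f_i))^2}{8 c_4^2 v^2 + 8 c_4 c\, (\varepsilon - w(Lh_{\tau_n}; f_i))}\right),
\end{aligned}$$
where the last inequality follows from Lemma 4.1.11 and (A.2).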
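Now, we want to remove the conditioning on $A_N$. Recall that $d_j \overset{ind}{\sim} G_j$, for $j \ge 1$. Therefore, for the known visiting times $\{s_j, j \ge 1\}$, $P(t_j \le n) = P(d_j + s_j \le n) = P(d_j \le n - s_j) = G_j(n - s_j)$, and hence,
$$\begin{aligned}
&P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon\right) \\
&= P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \tau_n \le \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) \\
&\quad + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \tau_n > \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) \\
&\le P\left(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \tau_n > \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) \\
&\le P\left(\tau_n \le \frac{\sum_{j=1}^{n} G_j(n-s_j)}{2}\right) + E\, P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \tau_n > \frac{a_1 q(n)}{2}\right),
\end{aligned}$$
where $\sum_{j=1}^{n} G_j(n - s_j) = \Omega(q(n))$ from Assumption 4.1.7, that is, for large enough $n$, we have $\sum_{j=1}^{n} G_j(n - s_j) \ge a_1 q(n)$ for some positive constant $a_1$. Let $q_1(n) = a_1 q(n)/2$. For $\tau_n > q_1(n)$, since we have the condition that $h_{q(n)}^d \pi_n q(n)/\log n \to \infty$, for large enough $n$ we can assume that $h_{\tau_n}^d \tau_n \ge \tilde{\tilde a}_1 h_{q_1(n)}^d q_1(n)$ and $h_{\tau_n}^d \pi_n \tau_n \ge \tilde{\tilde a}_1 h_{q_1(n)}^d \pi_n q_1(n)$, where $\tilde{\tilde a}_1$ is a positive constant depending on $a_1$ and the choice of hyperparameter sequences $\{h_n\}$ and $\{\pi_n\}$. For large enough $n$, we have that $\varepsilon - w(Lh_{q(n)}; f_i) > 0$, and
$$\begin{aligned}
&P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon\right) \\
&\le \exp\left(-\frac{3 q_1(n)}{14}\right) + \exp\left(-\frac{3 c (2Lh_{q_1(n)})^d q_1(n)}{28}\right) + \exp\left(-\frac{3 c (2Lh_{q_1(n)})^d q_1(n) \pi_n}{56}\right) \\
&\quad + 4N \exp\left(-\frac{c_5^2\, c\, (2Lh_{q_1(n)})^d q_1(n) \pi_n (\varepsilon - w(Lh_{q_1(n)}; f_i))^2}{8 c_4^2 v^2 + 8 c_4 c\, (\varepsilon - w(Lh_{q_1(n)}; f_i))}\right) \\
&\le \exp\left(-\frac{3 a_1 q(n)}{28}\right) + \exp\left(-\frac{3 c \tilde{\tilde a}_1 (2Lh_{q(n)})^d q(n)}{56}\right) + \exp\left(-\frac{3 c \tilde{\tilde a}_1 (2Lh_{q(n)})^d q(n) \pi_n}{112}\right) \\
&\quad + 4N \exp\left(-\frac{c_5^2\, c\, \tilde{\tilde a}_1 (2Lh_{q(n)})^d q(n) \pi_n (\varepsilon - w(Lh_{q(n)}; f_i))^2}{16 c_4^2 v^2 + 16 c_4 c\, (\varepsilon - w(Lh_{q(n)}; f_i))}\right).
\end{aligned} \tag{4.40}$$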
Given $0 < \delta < 1$, we want to bound the R.H.S. above by $\delta$. To do that for the first three terms, given the total time horizon $N$, we define a special time point $n''_\delta$ by
$$n''_\delta = \min\left\{n > m_0 : \exp\left(-\frac{3 c \tilde{\tilde a}_1 (2Lh_{q(n)})^d \pi_n q(n)}{112}\right) \le \frac{\delta}{4\ell N}\right\}. \tag{4.41}$$
Under the condition that $h_{q(n)}^d \pi_n q(n)/\log(n) \to \infty$ as $n \to \infty$, we have $n''_\delta/N \to 0$ as $N \to \infty$. Therefore, if the total time horizon is long enough, we have $N > n''_\delta$.
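For the fourth term in the R.H.S. of (4.40), we want to choose an $\varepsilon$ such that
$$4N \exp\left(-\frac{c_5^2\, c\, \tilde{\tilde a}_1 (2Lh_{q(n)})^d\, \pi_n\, q(n)\, (\varepsilon - w(Lh_{q(n)}; f_i))^2}{16 c_4^2 v^2 + 16 c_4 c\, (\varepsilon - w(Lh_{q(n)}; f_i))}\right) \le \frac{\delta}{4\ell N}.$$
One such value for $\varepsilon$ is given by
$$\tilde\varepsilon'_{i,n} = w(Lh_{q(n)}; f_i) + \sqrt{\frac{64 c_4^2 v^2 \log(16\ell N^2/\delta)}{c_5^2\, c\, \tilde{\tilde a}_1 (2L)^d\, h_{q(n)}^d\, \pi_n\, q(n)}}. \tag{4.42}$$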
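By (4.40), (4.41) and (4.42), for $n \ge n''_\delta$, we have that
$$P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \tilde\varepsilon'_{i,n}\right) \le \frac{\delta}{4\ell N} + \frac{\delta}{4\ell N} + \frac{\delta}{4\ell N} + \frac{\delta}{4\ell N} = \frac{\delta}{\ell N},$$
which implies that
$$P\left(\sum_{n=n''_\delta+1}^{N} 2 \sup_{1\le i\le \ell} |\hat f_{i,n}(X_n) - f_i(X_n)| \ge \sum_{n=n''_\delta+1}^{N} 2 \max_{1\le i\le \ell} \tilde\varepsilon'_{i,n-1}\right) \le \delta. \tag{4.43}$$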
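Now we want to get a bound for the randomization error regret. Given $\varepsilon > 0$, since $P(I_n \ne \hat i_n) = (\ell-1)\pi_n$, we have by Hoeffding's inequality that
$$P\left(A\left(\sum_{n=n''_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{n=n''_\delta+1}^{N} (\ell-1)\pi_n\right) \ge \varepsilon\right) \le \exp\left(-\frac{2\varepsilon^2}{N A^2}\right).$$
Taking $\varepsilon = A\sqrt{(N/2)\log(1/\delta)}$, we get
$$P\left(A\sum_{n=n''_\delta+1}^{N} I(I_n \ne \hat i_n) \ge A\sum_{n=n''_\delta+1}^{N} (\ell-1)\pi_n + A\sqrt{\frac{N}{2}\log\left(\frac{1}{\delta}\right)}\right) \le \delta. \tag{4.44}$$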
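Therefore, from (4.43) and (4.44), we get that with probability at least $1 - 2\delta$, the cumulative regret satisfies
$$R_N(\eta_1) < A n''_\delta + \sum_{n=n''_\delta+1}^{N} \left(2\max_{1\le i\le \ell} w(Lh_{q(n)}; f_i) + \frac{C_{N,\delta}}{\sqrt{h_{q(n)}^d \pi_n q(n)}} + A(\ell-1)\pi_n\right) + A\sqrt{\frac{N}{2}\log\left(\frac{1}{\delta}\right)},$$
where $C_{N,\delta} = \sqrt{64 c_4^2 v^2 \log(12\ell N^2/\delta) / \left(c_5^2\, c\, \tilde{\tilde a}_1 (2L)^d\right)}$.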
4.4.4 Proof outline for the case when delays depend on covariates
Notice that Lemmas 4.5 and 4.6 remain exactly the same, as those results are conditional probability results, given the time points when rewards were observed and the covariate positions for which arms were assigned. In this section, we discuss the steps in the proof of Theorem 4.1.12 where the dependence of delays on covariates plays a role. Recall that $M_{n+1}(x) = \sum_{j=1}^{n} I\{\|X_j - x\|_\infty \le Lh_{\tau_n},\, t_j \le n\}$ and $\sigma_t = \min\{\bar n : \sum_{j=n'_\delta+1}^{\bar n} I(t_j \le N) \ge t\}$. Then we have
$$\begin{aligned}
E_{A_N}(M_{n+1}(x)) &= E\left(\sum_{j=1}^{n} I\{\|X_j - x\|_\infty \le Lh_{\tau_n},\, t_j \le n\} \,\Big|\, A_N\right) \\
&= \sum_{j=1}^{n} I\{t_j \le n\}\, P(\|X_j - x\|_\infty \le Lh_{\tau_n} \mid t_j \le n) \\
&\ge \sum_{j=1}^{n} I\{t_j \le n\}\, P(\|X_j - x\|_\infty \le Lh_{\tau_n},\, t_j \le n) \\
&= \sum_{j=1}^{n} I\{t_j \le n\}\, P(t_j \le n \mid \|X_j - x\|_\infty \le Lh_{\tau_n})\, P(\|X_j - x\|_\infty \le Lh_{\tau_n}) \\
&\ge \sum_{t=1}^{\tau_n} e_{n - s_{\sigma_t}}\, c\, (2Lh_{\tau_n})^d,
\end{aligned}$$
where the last inequality follows by (4.7), which assumes a uniform bound on the cdfs of the delay distributions across the covariate space. Then, in the proof of Theorem 4.1.12, the step (4.21) is where the dependence of delays on covariates plays a role. Consider,
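$$\begin{aligned}
&P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon\right) \\
&\le P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; M_{n+1}(X_{n+1}) \le \frac{c(2Lh_{\tau_n})^d \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}}}{2}\right) \\
&\quad + P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; M_{n+1}(X_{n+1}) > \frac{c(2Lh_{\tau_n})^d \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}}}{2}\right) \\
&\le P_{A_N}\left(M_{n+1}(X_{n+1}) \le \frac{c(2Lh_{\tau_n})^d \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}}}{2}\right) \\
&\quad + P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; M_{n+1}(X_{n+1}) > \frac{c(2Lh_{\tau_n})^d \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}}}{2}\right) \\
&\le \exp\left(-\frac{3c(2Lh_{\tau_n})^d \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}}}{28}\right) + \exp\left(-\frac{3c(2Lh_{\tau_n})^d \pi_{\tau_n} \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}}}{56}\right) \\
&\quad + 4N \exp\left(-\frac{c_5^2\, c\, (2Lh_{\tau_n})^d \pi_{\tau_n} \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} (\varepsilon - w(Lh_{\tau_n}; f_i))^2}{8 c_4^2 v^2 + 8 c_4 c\, (\varepsilon - w(Lh_{\tau_n}; f_i))}\right).
\end{aligned} \tag{4.45}$$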
Now, in order to provide an upper bound for the estimation error, we want to remove the conditioning on $A_N$ from the above probability bound (4.45). Recall that $\tau_n = \sum_{j=1}^{n} I(t_j \le n)$. Consider
$$\begin{aligned}
P(t_j \le n) &= \int_{[0,1]^d} P(t_j \le n \mid X_j = x)\, p(x)\, dx \\
&= \int_{[0,1]^d} G_x(t_j \le n)\, p(x)\, dx \\
&\ge \int_{[0,1]^d} e_{n-s_j}\, p(x)\, dx \\
&= e_{n-s_j},
\end{aligned}$$
where the inequality follows from (4.7). Therefore, we get that
$$E\left(\sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}}\right) = E\left(\sum_{j=1}^{n} I\{t_j \le n\}\, e_{n-s_j}\right) = \sum_{j=1}^{n} e_{n-s_j} P(t_j \le n) \ge \sum_{j=1}^{n} e_{n-s_j}^2.$$
Therefore, by the extended Bernstein's inequality (A.2), we have that
$$P\left(\sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} \le \frac{\sum_{j=1}^{n} e_{n-s_j}^2}{2}\right) = P\left(\sum_{j=1}^{n} I(t_j \le n)\, e_{n-s_j} \le \frac{\sum_{j=1}^{n} e_{n-s_j}^2}{2}\right) \le \exp\left(-\frac{3\sum_{j=1}^{n} e_{n-s_j}^2}{28}\right).$$
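Now consider
$$\begin{aligned}
&P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon\right) \\
&= P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} \le \frac{\sum_{j=1}^{n} e_{n-s_j}^2}{2}\right) \\
&\quad + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} > \frac{\sum_{j=1}^{n} e_{n-s_j}^2}{2}\right) \\
&\le P\left(\sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} \le \frac{\sum_{j=1}^{n} e_{n-s_j}^2}{2}\right) + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} > \frac{\sum_{j=1}^{n} e_{n-s_j}^2}{2}\right) \\
&\le P\left(\sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} \le \frac{\sum_{j=1}^{n} e_{n-s_j}^2}{2}\right) + P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} > \frac{a_1 q(n)}{2}\right) \\
&\le P\left(\sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} \le \frac{\sum_{j=1}^{n} e_{n-s_j}^2}{2}\right) + E\, P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} > \frac{a_1 q(n)}{2}\right) \\
&\le \exp\left(-\frac{3\sum_{j=1}^{n} e_{n-s_j}^2}{28}\right) + E\, P_{A_N}\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon,\; \sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} > \frac{a_1 q(n)}{2}\right),
\end{aligned} \tag{4.46}$$
for large enough $n$, where $a_1$ is a positive constant that arises from Assumption 4.2.2. Also, notice that $\sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} \le \tau_n$; hence,
$$\sum_{t=1}^{\tau_n} e_{n-s_{\sigma_t}} > \frac{a_1 q(n)}{2} \;\Rightarrow\; \tau_n > \frac{a_1 q(n)}{2}.$$
Let $q_1(n) = q(n)/2$. For $\tau_n > a_1 q_1(n)$, since we have the condition that $h_{q(n)}^d \pi_{q(n)} q(n)/\log n \to \infty$, for large enough $n$ we can assume that $h_{\tau_n}^d \tau_n \ge \tilde a_1 h_{q_1(n)}^d q_1(n)$ and $h_{\tau_n}^d \pi_{\tau_n} \tau_n \ge \tilde a_1 h_{q_1(n)}^d \pi_{q_1(n)} q_1(n)$, where $\tilde a_1$ is a constant that is a function of the constant $a_1$, which depends on the user-determined choice of sequences $\{\pi_n\}$ and $\{h_n\}$. Then for large enough $n$, $\varepsilon - w(Lh_{q(n)}; f_i) > 0$, and using (4.45) and (A.2) in (4.46) we have
$$\begin{aligned}
&P\left(|\hat f_{i,n+1}(X_{n+1}) - f_i(X_{n+1})| \ge \varepsilon\right) \\
&\le \exp\left(-\frac{3 a_1 q_1(n)}{14}\right) + \exp\left(-\frac{3 c \tilde a_1 (2Lh_{q_1(n)})^d q_1(n)}{28}\right) + \exp\left(-\frac{3 c \tilde a_1 (2Lh_{q_1(n)})^d q_1(n) \pi_{q_1(n)}}{56}\right) \\
&\quad + 4N \exp\left(-\frac{c_5^2\, c\, \tilde a_1 (2Lh_{q_1(n)})^d q_1(n) \pi_{q_1(n)} (\varepsilon - w(Lh_{q_1(n)}; f_i))^2}{8 c_4^2 v^2 + 8 c_4 c\, (\varepsilon - w(Lh_{q_1(n)}; f_i))}\right) \\
&\le \exp\left(-\frac{3 a_1 q(n)}{28}\right) + \exp\left(-\frac{3 c \tilde a_1 (2Lh_{q(n)})^d q(n)}{56}\right) + \exp\left(-\frac{3 c \tilde a_1 (2Lh_{q(n)})^d q(n) \pi_{q(n)}}{112}\right) \\
&\quad + 4N \exp\left(-\frac{c_5^2\, c\, \tilde a_1 (2Lh_{q(n)})^d q(n) \pi_{q(n)} (\varepsilon - w(Lh_{q(n)}; f_i))^2}{16 c_4^2 v^2 + 16 c_4 c\, (\varepsilon - w(Lh_{q(n)}; f_i))}\right).
\end{aligned} \tag{4.47}$$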
Note that the above inequality is the same as (4.24). A similar analysis can be done for strategy $\eta_2$ by replacing $\pi_{\tau_n}$ with $\pi_n$. The steps of the proof that follow remain the same and lead to exactly the same results as in the proofs for both strategies $\eta_1$ and $\eta_2$. However, it is important to recognize that although the inequalities (4.24) and (4.47) look alike (as will the final regret upper bounds), the underlying meaning of $q(n)$ is different in the two setups, and the bounds could certainly look very different in the two settings. The bound for the randomization error will also look similar to (4.37) for strategy $\eta_1$, and the essential difference in the proof lies in the following steps. The first step that is different is (4.30); consider
$$\begin{aligned}
P(\sigma_{t+1} - \sigma_t > M) &\le P\left(d_{\sigma_t+1} > \frac{M}{2}, \ldots, d_{\sigma_t+M/2} > \frac{M}{2}\right) \\
&\le \prod_{s=1}^{M/2} P\left(d_{\sigma_t+s} > \frac{M}{2}\right) \\
&\le \prod_{s=1}^{M/2} \int_{[0,1]^d} P\left(d_{\sigma_t+s} > \frac{M}{2} \,\Big|\, X_{\sigma_t+s} = x\right) p(x)\, dx \\
&= \prod_{s=1}^{M/2} \left(1 - \int_{[0,1]^d} G_x\left(\frac{M}{2}\right) p(x)\, dx\right) \\
&\le \prod_{s=1}^{M/2} \left(1 - e_{(M/2)-s}\right) \\
&\le \left(\frac{M/2 - \sum_{s=1}^{M/2} e_{(M/2)-s}}{M/2}\right)^{M/2} \\
&\le \left(1 - \frac{\sqrt{a_1 q(M/2)}}{M/2}\right)^{M/2}.
\end{aligned} \tag{4.48}$$
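Let $\mathcal{F}_n$ be the $\sigma$-field generated by $(Z^n, X_n, I_n)$. Since the total number of observed rewards $\tau_N$ will depend on the covariates, we replace $E(\tau_N)$ in the proof of the randomization error bound for the independent case with $E(\tau_N \mid \mathcal{F}_N)$, and then we can apply the Azuma-Hoeffding inequality as stated in Lemma A.1.2. Consider
$$\begin{aligned}
&P\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon\right) \\
&= P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) \ge M_\delta\right) \\
&\quad + P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta\right) \\
&\le P\left(\max_t(\sigma_{t+1}-\sigma_t) \ge M_\delta\right) \\
&\quad + P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N \ge E(\tau_N \mid \mathcal{F}_N) + \frac{\varepsilon}{A}\right) \\
&\quad + P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N \mid \mathcal{F}_N) + \frac{\varepsilon}{A}\right) \\
&\le P\left(\max_t(\sigma_{t+1}-\sigma_t) \ge M_\delta\right) + P\left(\tau_N \ge E(\tau_N \mid \mathcal{F}_N) + \frac{\varepsilon}{A}\right) \\
&\quad + P\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N \mid \mathcal{F}_N) + \frac{\varepsilon}{A}\right) \\
&\le \delta + \exp\left(-\frac{2\varepsilon^2}{A^2 N}\right) \\
&\quad + E\left[P_{A_N}\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t \ge \frac{\varepsilon}{A},\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N \mid \mathcal{F}_N) + \frac{\varepsilon}{A}\right)\right],
\end{aligned} \tag{4.49}$$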
where the second inequality follows from the Azuma-Hoeffding inequality. Then using Lemma A.1.5 (Bernstein's inequality for martingales) we have that
$$\begin{aligned}
&P_{A_N}\left(A\left(\sum_{n=n'_\delta+1}^{N} I(I_n \ne \hat i_n) - \sum_{t=1}^{\tau_N} (\sigma_{t+1}-\sigma_t)(\ell-1)\pi_t\right) \ge \varepsilon,\; \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N \mid \mathcal{F}_N) + \frac{\varepsilon}{A}\right) \\
&\le \begin{cases}
\exp\left(-\dfrac{\varepsilon^2}{2A^2 M_\delta (E(\tau_N \mid \mathcal{F}_N) + \varepsilon)/4 + \varepsilon/3}\right), & \text{if } \max_t(\sigma_{t+1}-\sigma_t) < M_\delta,\; \tau_N < E(\tau_N \mid \mathcal{F}_N) + \varepsilon/A; \\
0, & \text{otherwise.}
\end{cases}
\end{aligned}$$
Then, the remaining proof follows as in the proof of Theorem 4.1.12. We get the same results as (4.37) and (4.44) for both strategies $\eta_1$ and $\eta_2$, except for the fact that $q(n)$ has a different meaning and interpretation, based on our definition in (4.7) and Assumption 4.1.7. Combining the results for the estimation error bounds and the randomization error bounds, we get the bounds for the cumulative regret for strategies $\eta_1$ and $\eta_2$, respectively, as in Theorems 4.2.3 and 4.2.4.
4.5 Supplementary real-data results
In section 4.3, we conducted a real data analysis with 150 steps of initialization. In Figure 4.3, we change the number of initialization steps to 1000. Our proposed strategies η1 and η2 still perform better than DeLinUCB for delay settings 2 and 3, but for delay setting 4, the results are now comparable. This could be due to enough learning during initialization and less observed data due to severe delays, reflecting a need to run on a bigger data sample.
[Figure 4.3 (image not reproduced): six boxplot panels comparing η1, η2 and DeLinUCB; vertical axis with ticks from 0.90 to 1.25. Top row: delay 2, delay 3 and delay 4 with $\pi_n = \log(n)^{-2}$, $h_n = n^{-1/6}$. Bottom row: delay 2, delay 3 and delay 4 with $\pi_n = \log(n)^{-2}$, $h_n = \log(n)^{-1}$.]
Figure 4.3: Boxplots of normalized CTRs for the three methods for 1000 rounds of initialization.
Chapter 5
Doctor’s intervention in
randomized allocation strategy
This chapter, somewhat distinct from the previous chapters, aims to extend the work of
Yang and Zhu (2002) with the intent of improving medical practice. Suppose there are
two competing treatments A and B for a disease which have been considered comparable
in their effect but the doctor wants to assess the performance of these on the patients. In
doing so, they want to take patient characteristics (covariates) into account. We could
use an MABC algorithm to help the doctor. Every time a patient visits, the doctor
assigns a treatment and consequently the effect/reward for that treatment is measured.
After a couple of visits when one has some data to assess the performance of both
treatments, the MABC algorithm recommends the next treatment that should be given
to the forthcoming patient based on his/her covariates. Since the optimality of these
MABC strategies has already been established, we know that the treatment decisions
made over time are eventually going to be for the betterment of the patients. However,
real life is more complicated and there could be some other factors that were ignored
while these treatment decisions were made. In such situations, the doctor might want
to give a different treatment to a patient than the one recommended by the algorithm.
This disagreement could be a result of some hard to quantify information or doctor’s
judgment based on their experience. Therefore, in this work we propose to integrate
the cases where such a disagreement arises into an adaptive MABC algorithm. The
goal is to show that the proposed integrated allocation strategy is consistent, that is, in
the long run the cumulative reward of the algorithm is equivalent to the best possible
cumulative reward. In section 5.1 we lay out the problem setup, in section 5.2 we describe
the allocation strategy and in section 5.3 we outline the proof for the proposed allocation
strategy.
5.1 Problem setup
Assume that there are $\ell$, $\ell \ge 2$, arms available for playing. After pulling an arm, a random reward is generated. Each time before deciding which arm to pull, a $d$-dimensional covariate $x \in \mathbb{R}^d$ is observed. This contains information about the patient such as their age, gender, genetic factors, etc. For simplicity, let us assume that there is only one doctor in this study. We assume that the characteristics or covariates are continuous variables and, without loss of generality, take values in the hypercube $[0,1]^d$. The mean reward with the given covariate $x$ for the $i$th arm is denoted by $f_i(x)$, $1 \le i \le \ell$. Ideally, if the $f_i$'s were known with the observed covariate $x$, then one would pull the arm with the largest mean reward at the given $x$; that is, one would choose the arm $i^*(x)$ which results in $f^*(x) = \max_{1\le i\le \ell} f_i(x)$. The actual reward with covariate $x$ of pulling the $i$th arm is modeled as
$$Y_{i,j} = f_i(x) + \varepsilon_j,$$
where $\varepsilon_j$ denotes random independent errors with mean 0 and finite variance. Let,
• $X_1, \ldots, X_n, \ldots$ be a sequence of covariates independently generated from a population supported in $[0,1]^d$.
• $P_X$ denote the underlying probability distribution, which is also assumed to be unknown.
• $Y_{i,j}$ denote the reward of pulling the $i$th arm when the covariate $X_j$ is presented.
• $I_j$, $j \ge 1$, be the chosen arm at time $j \in \mathbb{N}$.
• $\gamma_j$ be the indicator of whether or not the doctor thinks that the $j$th patient is a special case.
• $T_j$ be the indicator of whether or not the system allowed the doctor to make their decision when they declare the $j$th case as special.
At time $n \ge 1$, let $Z^{n,i}$ denote the set of observations $\{(X_j, \gamma_j, T_j, Y_{I_j,j}),\, 1 \le j \le n\}$ for which the $i$th arm is pulled (i.e. $I_j = i$). The total mean reward up to this time $n$ is $\sum_{j=1}^{n} f_{I_j}(X_j)$. The goal is to maximize the total reward after a number of plays.
The performance of the doctor is evaluated in batches. The size of each batch is determined by the number of cases considered special by the doctor ($d_k$ for the $k$th batch, which are pre-determined). Suppose we divide the trials until time step $n$ into batches of size $N_{d_k}$, $k = 1, \ldots, M_n$, such that $N_{d_k}$ is the time step at which the $d_k$th case was considered special in the $k$th batch. As a result, we have $\sum_{k=1}^{M_n} N_{d_k} \le n$ (as we are interested in asymptotic results ($n \to \infty$), w.l.o.g. we assume that $\sum_{k=1}^{M_n} N_{d_k} = n$ for simplification). It is important to note that $\{d_k;\, k = 1, \ldots, M_n\}$ are known quantities specified by the decision maker.
In order to evaluate the performance in a batch, the proportion of times the doctor is allowed to go with their instinct, for the cases they think are special, is determined adaptively. Since, for making a comparison, we are only interested in the cases when the doctor felt the patient's case was special and wanted to use a different treatment than the one proposed by the algorithm, we only look at the cases where this disagreement arises. We call this the subsequence indexed by $\{j_v;\, v = 1, \ldots, d_k\}$ for each $k$th batch. Let $m_k$ be the proportion of times the doctor is allowed to follow their instinct in the $k$th batch. Then in the next batch (the $(k+1)$th batch), the proportion $m_{k+1}$ is determined based on the following criterion,
$$B_{k+1} := \frac{\sum_{v=1}^{d_k} Y_{I_{j_v}, j_v} I\{T_{j_v} = 1\}}{m_k d_k} - \frac{\sum_{v=1}^{d_k} Y_{I_{j_v}, j_v} I\{T_{j_v} = 0\}}{(1-m_k) d_k} > \beta_{k+1}, \tag{5.1}$$
with $\beta_{k+1} \le 0$, a threshold which is determined adaptively by the decision maker. Based on whether the above condition is met, we update the value of $m_k$ for the next batch. In particular, define
$$m_{k+1} = \begin{cases} \text{remains the same as } m_k & \text{with probability } p_{k+1}, \\ \text{gets reduced to } m_{k+1} & \text{with probability } 1 - p_{k+1}, \end{cases}$$
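with $m_{k+1} \le m_k$, where
$$p_{k+1} = \Pr\left(\frac{\sum_{v=1}^{d_k} Y_{I_{j_v}, j_v} I\{T_{j_v} = 1\}}{m_k d_k} - \frac{\sum_{v=1}^{d_k} Y_{I_{j_v}, j_v} I\{T_{j_v} = 0\}}{(1-m_k) d_k} > \beta_{k+1} \,\Bigg|\, m_k\right). \tag{5.2}$$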
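The batch statistic $B_{k+1}$ in (5.1) compares the average reward over the disagreements where the doctor decided with the average over those where the system overrode the doctor. A minimal sketch, assuming the rewards and the indicators $T_{j_v}$ of one batch are stored as plain lists (the function name and the toy data are illustrative, not from the thesis):

```python
def batch_statistic(rewards, allowed, m_k):
    """B_{k+1} from (5.1).
    rewards[v]: reward Y at the v-th disagreement of the batch;
    allowed[v]: T in {0, 1}, 1 if the doctor's choice was followed;
    m_k: proportion in (0, 1) of disagreements where the doctor decides."""
    d_k = len(rewards)
    doctor_sum = sum(y for y, t in zip(rewards, allowed) if t == 1)
    system_sum = sum(y for y, t in zip(rewards, allowed) if t == 0)
    return doctor_sum / (m_k * d_k) - system_sum / ((1 - m_k) * d_k)

# Toy batch: 4 disagreements, the doctor decides in half of them.
B = batch_statistic(rewards=[1, 0, 1, 1], allowed=[1, 1, 0, 0], m_k=0.5)
keep_m = B > -0.1  # compare with a hypothetical threshold beta_{k+1} = -0.1
```

Here the doctor's cases average 0.5 and the overridden cases average 1.0, so the statistic is negative and falls below the illustrative threshold.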
5.1.1 Regret and consistency
Let $\delta$ be a proposed sequential allocation rule and $I_1, I_2, \ldots$ be the chosen arms at times $j = 1, 2, \ldots$. With the allocation rule, given the previous observations and $X_j$, the mean reward at the given $X_j$ is $f_{I_j}(X_j)$ for $j \ge 1$. The total of this mean reward up to time $n$ is $\sum_{j=1}^{n} f_{I_j}(X_j)$. Without knowing the random errors, the ideal performance occurs when the choices $I_1, \ldots, I_n$ match $i^*(X_1), \ldots, i^*(X_n)$, yielding the optimal total reward $\sum_{j=1}^{n} f^*(X_j)$. The quantity of interest here is the regret of our allocation scheme $\delta$, which is given by
$$R_n(\delta) = \frac{\sum_{j=1}^{n} f_{I_j}(X_j)}{\sum_{j=1}^{n} f^*(X_j)}.$$
Clearly, $R_n$ is a random variable no bigger than 1. It measures the performance of the allocation rule relative to the ideal one with the optimal arm known for each $x$.
Definition 5.1.1. An allocation rule $\delta$ is said to be strongly consistent if $R_n(\delta) \to 1$ with probability 1.
Remark. If $\frac{1}{n}\sum_{j=1}^{n} f^*(X_j)$ is eventually bounded above and away from 0 with probability 1, then $R_n(\delta) \overset{a.s.}{\to} 1$ is equivalent to $\frac{1}{n}\sum_{j=1}^{n} (f_{I_j}(X_j) - f^*(X_j)) \overset{a.s.}{\to} 0$.
We use the strategy developed in Yang and Zhu (2002) and extend it to incorporate
the doctor’s interventions and present an adaptive allocation strategy in section 5.2.
Then we will consider the two scenarios 1) when the doctor performs poorly as compared
to the algorithm, 2) when the doctor performs better or at par with the algorithm and
show strong consistency for each of these scenarios.
5.2 Proposed allocation strategy
There are three main ingredients in our approach on selecting an arm (1) nonparametric
estimation of the functions fi, (2) a proper allocation rule to control the exploitation-
exploration trade-off and (3) incorporating doctor’s interventions as part of the alloca-
tion rule. Let {pij , j ≥ 1} be a sequence of positive numbers decreasing to 0 and let m1
be the proportion of special cases where doctor is allowed to make their decision in the
first batch.
Step 1 Initialize. Each patient is allotted to a treatment based on what the doctor chooses
to give. Let’s say I1 = i1, I2 = i2, . . . , It0 = it0 where ik ∈ {1, . . . , `} for each k in
1, . . . , t0. This allocation is done until the doctor has made his decision t0 times
and then from the t0 + 1th time onwards, allot the arms that have never been
allotted so far (if any). Suppose that all arms have been allotted at least once in
m0 steps.
Step 2 Estimate the individual functions fi. For n = m0, based on current data Z
n,i,
estimate fˆi,n for 1 ≤ i ≤ l using the chosen regression procedure.
Step 3 Estimate the best arm. For the next covariate Xn+1, let iˆn+1 be the maximizer
of fˆi,n(Xn+1) over 1 ≤ i ≤ l. Now, iˆn+1 is the algorithm’s recommendation at the
n+ 1th time step.
Step 4 At time step n+ 1, let
γn+1(Xn+1) =
1 if doctor disagrees with the recommendation0 if doctor agrees with the recommendation.
Let i′n+1(Xn+1) (i′n+1 6= iˆn+1) be the arm chosen by the doctor when γn+1(Xn+1) =
1.
Step 4a- If doctor agrees with the recommendation (i.e. if γn+1(Xn+1) = 0), then
allocation happens based on -greedy heuristic. That is, randomly select
an arm, with probability 1 − (l − 1)pin+1 for i = iˆn+1 and with probability
(l − 1)pin+1 for each of the remaining arms.
121
Step 4b- If the doctor disagrees with the recommendation (i.e. if γn+1(Xn+1) = 1),
he/she is allowed to make their decision m1 proportion of times there is a
disagreement and for the rest 1 − m1 proportion of times their decision is
overruled by the system recommendation. Let,
Tn+1 =
1 if the system lets doctor decide, i.e. i′n+1 is chosen0 system overrides doctor’s decision.
If Tn+1 = 0, randomly select an arm based on the -greedy heuristic. If
Tn+1 = 1, then we select i = i
′
n+1 with probability 1.
Step 4c- Let In+1 denote the selected arm. Pull the arm In+1 to receive the reward.
Step 5 Update the estimates based on the available information only for the cases when
the doctor agreed with the system’s decision. After the new observationXn+1, γn+1, Tn+1,
In+1, YIn+1,n+1 update the function estimate fi for i = In+1.
Step 6 Repeat Steps 2-5 when the next covariate $X_{n+2}$ surfaces, and so on, until
there have been $d_1$ disagreements in total, in $d_1 m_1$ of which the doctor made
the decision. This will be the first batch, with size $N_{d_1}$.
Step 7 In order to compare the doctor's decisions with the performance of the
recommendations, we denote the sequence of times at which the doctor disagrees by
$\{t_v : v = 1, \ldots, d_1\}$. Of these, there are $d_1 m_1$ cases in which the doctor is allowed to make
the decision and $d_1(1-m_1)$ cases in which the disagreement is overruled. We
compare the cumulative reward for these two subsequences as follows.
1. Doctor performs worse than the algorithm: For the first batch, we
will have that for $\beta_1 \le 0$,
\[
\frac{\sum_{v=1}^{d_1} Y_{I_{t_v},t_v}\, I\{T_{t_v}=1\}}{m_1 d_1}
- \frac{\sum_{v=1}^{d_1} Y_{I_{t_v},t_v}\, I\{T_{t_v}=0\}}{(1-m_1) d_1} < \beta_1. \tag{3}
\]
If this is the case, we would want to decrease $m_1$ for the next batch
($m_2 < m_1$) and force the doctor to go with the system's recommendation
more often. $\beta_1$ is replaced by $\beta_2$ from a sequence of non-positive numbers $\beta_k$
converging to zero as $k \to \infty$.
2. Doctor performs better than or at par with the algorithm: For the sequence
$\beta_k$, the criterion for the first batch in this case is
\[
\frac{\sum_{v=1}^{d_1} Y_{I_{t_v},t_v}\, I\{T_{t_v}=1\}}{m_1 d_1}
- \frac{\sum_{v=1}^{d_1} Y_{I_{t_v},t_v}\, I\{T_{t_v}=0\}}{(1-m_1) d_1} > \beta_1. \tag{2}
\]
In this case we would want to give the doctor more chances to make the
decision, so we let $m_1$ stay the same for the next batch. $\beta_1$ is replaced by
$\beta_2$ in the next batch analysis, and the rate of increase of $\beta_k$ will be discussed
later.
Step 8 Repeat all the steps for the second batch of size $N_{d_2}$, which has a total of $d_2$
disagreements, in $d_2 m_2$ of which the doctor makes the decision. When we are at the $n$th
time step we have repeated the same analysis for $k = 2, 3, \ldots, M_n$ groups with
sizes $N_{d_k}$, $k = 1, \ldots, M_n$, such that $\sum_{k=1}^{M_n} N_{d_k} = n$.
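The arm selection in Steps 3-4 can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the function name `allocate`, its arguments, and the choice to draw $T_{n+1}$ as an independent Bernoulli($m$) variable are assumptions made here for concreteness (the text only specifies that the doctor decides in a proportion $m$ of the disagreements).

```python
import random

def allocate(f_hat, x, pi, doctor_choice, m, rng=random):
    """One allocation step (Steps 3-4); all names here are hypothetical.

    f_hat: list of callables, current estimates of the arm mean functions
    x: covariate of the incoming patient
    pi: exploration probability pi_{n+1}, assumed to satisfy (l - 1) * pi <= 1
    doctor_choice: index of the arm the doctor prefers
    m: proportion of disagreements in which the doctor decides
    Returns (arm, gamma, T); T is None when the doctor agrees.
    """
    l = len(f_hat)
    # Step 3: the recommended arm maximizes the estimated mean reward at x.
    i_hat = max(range(l), key=lambda i: f_hat[i](x))
    gamma = 0 if doctor_choice == i_hat else 1
    T = None
    if gamma == 1:
        # Step 4b: assumption, T is drawn as Bernoulli(m).
        T = 1 if rng.random() < m else 0
        if T == 1:
            return doctor_choice, gamma, T
    # Steps 4a and 4b with T = 0: epsilon-greedy randomization,
    # i_hat with probability 1 - (l - 1) * pi, every other arm with probability pi.
    weights = [pi] * l
    weights[i_hat] = 1.0 - (l - 1) * pi
    arm = rng.choices(range(l), weights=weights, k=1)[0]
    return arm, gamma, T
```

With `pi = 0` the randomization is degenerate and the recommended arm is always played unless the doctor is allowed to decide, which makes the three branches easy to check by hand.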
This allocation strategy is outlined in the flow chart in Figure 5.1.
[Flow chart nodes: Patient arrives; Estimate the best arm; Doctor agrees / Doctor disagrees; System overrides / Doctor's choice; Evaluate doctor's performance; Update the parameters]
Figure 5.1: Flow chart of the allocation strategy
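The batch comparison of Step 7 can be sketched in the same spirit. The helper names and the multiplicative `shrink` factor are assumptions introduced for illustration; the strategy itself only requires that the next proportion be smaller than the current one when the doctor underperforms.

```python
def batch_difference(rewards, T, m):
    """Left-hand side of the Step 7 criterion for one batch.

    rewards: rewards Y at the d disagreement times of the batch
    T: matching indicators (1 = doctor decided, 0 = system overrode)
    m: proportion of disagreements decided by the doctor, 0 < m < 1
    """
    d = len(rewards)
    doctor = sum(y for y, t in zip(rewards, T) if t == 1) / (m * d)
    system = sum(y for y, t in zip(rewards, T) if t == 0) / ((1.0 - m) * d)
    return doctor - system

def next_proportion(rewards, T, m, beta, shrink=0.5):
    """Proportion for the next batch; `shrink` is a hypothetical reduction rule."""
    if batch_difference(rewards, T, m) < beta:
        return shrink * m  # doctor performed worse: reduce the proportion
    return m               # doctor at par or better: keep the proportion
```

For example, with rewards `[1, 1, 0, 0]`, indicators `[1, 1, 0, 0]`, and `m = 0.5`, the difference is positive and the proportion is kept; swapping the rewards makes the difference negative and the proportion is reduced.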
5.2.1 Regression procedures
Various regression procedures can be used to estimate the individual mean functions
$f_i$. In Yang and Zhu (2002), two procedures are discussed: the histogram method and the
nearest neighbor method. Strong consistency of their proposed allocation rule was
proved for both regression procedures; however, finite-time regret bounds were not
established. In another work, Qian and Yang (2016a) established strong consistency and
a finite-time regret analysis for their proposed allocation strategy using kernel
estimation methods. We will assume the finite-time regret results from Qian and Yang
(2016a) in proving the consistency of our proposed allocation strategy in section 5.2.
These regression methods are briefly discussed below.
1. Kernel method: Let $J_{i,n} = \{j : I_j = i,\ 1 \le j \le n\}$ be the set of past time points at
which arm $i$ is pulled. Consider a multivariate nonnegative kernel function $K(u) :
\mathbb{R}^d \to \mathbb{R}$ that satisfies Lipschitz, boundedness and bounded support conditions.
Let $h_n$ denote the bandwidth, where $h_n \to 0$ as $n \to \infty$. The Nadaraya-Watson
estimator of $f_i(x)$ is
\[
\hat{f}_{i,n+1}(x) = \frac{\sum_{j \in J_{i,n+1}} Y_{i,j}\, K\!\left(\frac{x - X_j}{h_n}\right)}{\sum_{j \in J_{i,n+1}} K\!\left(\frac{x - X_j}{h_n}\right)}.
\]
2. Histogram method: Partition $[0,1]^d$ into $M = (1/h)^d$ (hyper-)cubes with side
width $h$. For each $x$, let $J(x) = \{j : 1 \le j \le n,\ x_j \text{ and } x \text{ belong to the same cube}\}$.
Let $N(x)$ denote the size of $J(x)$. Then
\[
\hat{f}(x) = \frac{1}{N(x)} \sum_{j \in J(x)} Y_j.
\]
3. Nearest neighbor method: Let $d$ be the Euclidean distance on $[0,1]^d$. For a chosen
integer $N = N_n$ and $x \in [0,1]^d$, let $J(x; N) = \{j : 1 \le j \le n \text{ and } x_j \text{ is among
the } N \text{ closest points to } x \text{ in distance } d\}$. Then let
\[
\hat{f}(x) = \frac{1}{N} \sum_{j \in J(x;N)} Y_j.
\]
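The three estimators above can be sketched in a few lines. This is an illustration under stated assumptions rather than the thesis code: the function names are hypothetical, the kernel $K(u) = \max(0, 1 - \|u\|^2)$ is one convenient choice that is nonnegative, bounded, Lipschitz and compactly supported, and returning 0 when no observation contributes is a convention chosen here. Covariates are tuples in $[0,1]^d$, and `X`, `Y` are assumed to hold only the observations of the arm being estimated.

```python
import math

def nadaraya_watson(x, X, Y, h):
    """Kernel estimate at x with the assumed kernel K(u) = max(0, 1 - ||u||^2)."""
    num = den = 0.0
    for xj, yj in zip(X, Y):
        u2 = sum(((a - b) / h) ** 2 for a, b in zip(x, xj))
        w = max(0.0, 1.0 - u2)
        num += w * yj
        den += w
    return num / den if den > 0 else 0.0  # convention: 0 if no point has weight

def histogram_estimate(x, X, Y, h):
    """Average of the responses whose covariates share the cube of x."""
    def cube(p):
        return tuple(math.floor(c / h) for c in p)
    target = cube(x)
    ys = [yj for xj, yj in zip(X, Y) if cube(xj) == target]
    return sum(ys) / len(ys) if ys else 0.0  # convention: 0 for an empty cube

def nearest_neighbor_estimate(x, X, Y, N):
    """Average of the responses at the N covariates closest to x (N >= 1)."""
    order = sorted(range(len(X)), key=lambda j: math.dist(x, X[j]))
    nearest = order[:N]
    return sum(Y[j] for j in nearest) / len(nearest)
```

On the toy data `X = [(0.1,), (0.2,), (0.8,)]`, `Y = [1.0, 3.0, 10.0]`, all three estimators evaluated at `(0.15,)` average the two nearby responses when the bandwidth, cube width, or neighbor count excludes the point at 0.8.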
We assume that there exist constants $\rho$ and $\kappa \le 1$ such that for each reward function
$f_i$, the modulus of continuity as defined in section A.0.2 satisfies
\[
w(h; f_i) \le \rho h^{\kappa}.
\]
This will be used in the results in section 5.3.3 where we assume the rate of convergence
as in Qian and Yang (2016a).
5.3 Consistency of the proposed strategy
We will show consistency of the proposed strategy in section 5.2 for two scenarios: (1)
when doctor performs worse than the algorithm, (2) when doctor performs better/at
par with the algorithm.
5.3.1 Layout of the proof
Recall that $m_k$ is the proportion of special cases in which the doctor is allowed to make
the decision. Making use of the inequality in section 5.3.2, we show the following for each case:
• Doctor performs worse than the algorithm: $m_k \xrightarrow{a.s.} 0$ as $k \to \infty$ (section 5.3.3).
• Doctor performs better than the algorithm: $m_k$ is only reduced for a finite number
of batches with positive probability (section 5.3.4).
These results are then used to prove consistency for the two cases in sections 5.3.5
and 5.3.6, respectively. We start by proving a useful inequality which is used in the
following sections.
5.3.2 A preliminary result
Since the value of the proportion $m_{k+1}$ for the $(k+1)$th batch depends on whether (5.1) is
satisfied, it is important to have a sense of how (5.2) grows, and an
upper bound for it will guide us in further analysis of the problem. First we make
some assumptions.
Assumption 5.3.1. The functions $f_i$ are nonnegative and continuous on $[0,1]^d$, and
$E[f^*(X_1)] > 0$.
Assumption 5.3.2. The design distribution $P_X$ is dominated by the Lebesgue measure
with density $p(x)$ uniformly bounded above and away from 0 on $[0,1]^d$; that is, $p(x)$
satisfies $\underline{c} \le p(x) \le \bar{c}$ for some positive constants $\underline{c} < \bar{c}$.
Assumption 5.3.3. The errors satisfy a moment condition: there exist positive
constants $v$ and $c$ such that, for all $m \ge 2$,
\[
E|\epsilon_{ij}|^m \le \frac{m!}{2}\, v^2 c^{m-2}. \tag{5.3}
\]
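Lemma 5.3.4. Suppose Assumptions 5.3.1-5.3.3 are met. Then for the $k$th batch,
the cumulative error terms for the cases when the doctor makes decisions versus when
the algorithm overrules the doctor's decisions satisfy the following inequality:
\begin{align*}
&P\left( \frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v},t_v}\, I\{T_{t_v}=1\}}{m_k d_k}
- \frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v},t_v}\, I\{T_{t_v}=0\}}{(1-m_k) d_k} > \beta'_k \right) \\
&\quad \le \exp\left( - \frac{d_k \left( \frac{m_k \beta'_k}{2} \right)^2}{2\left( v^2 + c\, \frac{m_k \beta'_k}{2} \right)} \right)
+ \exp\left( - \frac{d_k \left( \frac{(1-m_k) \beta'_k}{2} \right)^2}{2\left( v^2 + c\, \frac{(1-m_k) \beta'_k}{2} \right)} \right). \tag{5.4}
\end{align*}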
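Proof. Consider
\begin{align*}
&P\left( \frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v},t_v}\, I\{T_{t_v}=1\}}{m_k d_k}
- \frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v},t_v}\, I\{T_{t_v}=0\}}{(1-m_k) d_k} > \beta'_k \right) \\
&= P\left( \frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v},t_v}\, I\{T_{t_v}=1\}}{m_k d_k}
+ \frac{\sum_{v=1}^{d_k} (-\epsilon_{I_{t_v},t_v})\, I\{T_{t_v}=0\}}{(1-m_k) d_k} > \beta'_k \right) \\
&\le P\left( \frac{\sum_{v=1}^{d_k} \epsilon_{I_{t_v},t_v}\, I\{T_{t_v}=1\}}{m_k d_k} \ge \frac{\beta'_k}{2} \right)
+ P\left( \frac{\sum_{v=1}^{d_k} (-\epsilon_{I_{t_v},t_v})\, I\{T_{t_v}=0\}}{(1-m_k) d_k} \ge \frac{\beta'_k}{2} \right) \\
&= P\left( \sum_{v=1}^{d_k} \epsilon_{I_{t_v},t_v}\, I\{T_{t_v}=1\} \ge d_k \frac{m_k \beta'_k}{2} \right)
+ P\left( \sum_{v=1}^{d_k} (-\epsilon_{I_{t_v},t_v})\, I\{T_{t_v}=0\} \ge d_k \frac{(1-m_k) \beta'_k}{2} \right).
\end{align*}
Deriving an upper bound for each of the summands using Assumption 5.3.3
and (A.3), we get
\[
\le \exp\left( - \frac{d_k \left( \frac{m_k \beta'_k}{2} \right)^2}{2\left( v^2 + c\, \frac{m_k \beta'_k}{2} \right)} \right)
+ \exp\left( - \frac{d_k \left( \frac{(1-m_k) \beta'_k}{2} \right)^2}{2\left( v^2 + c\, \frac{(1-m_k) \beta'_k}{2} \right)} \right).
\]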
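To get a feel for the size of the bound in Lemma 5.3.4, its right-hand side can be evaluated numerically. The sketch below is only an illustration; the function name `lemma_bound` and the parameter values in the usage note are assumptions, and `beta` stands for a positive threshold $\beta'_k$.

```python
import math

def lemma_bound(d_k, m_k, beta, v, c):
    """Right-hand side of (5.4) for a positive threshold beta = beta'_k."""
    def term(p):
        # One Bernstein-type term with deviation level p * beta / 2.
        t = p * beta / 2.0
        return math.exp(-d_k * t ** 2 / (2.0 * (v ** 2 + c * t)))
    return term(m_k) + term(1.0 - m_k)
```

For instance, with `d_k = 100`, `m_k = 0.5`, and `beta = v = c = 1`, both terms equal $e^{-2.5}$, so the bound is $2e^{-2.5}$; the bound decreases as the number of special cases `d_k` grows.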
5.3.3 Scenario 1: doctor performs worse than the algorithm
Let us consider the case where the doctor's instincts are not showing comparable
results to the algorithm's choices of treatments. In this case, we want to show that the
proportion of times the doctor gets a chance to make the decision decreases with
time. That is, we need to show that $m_k \xrightarrow{a.s.} 0$ as $k \to \infty$. This means that there are
infinitely many batches in which the proportion of times the doctor is allowed to make
special decisions is reduced. Let $A$ denote the set
$A := \{\text{sample paths } \omega \text{ such that } m_k \text{ gets reduced only for a finite number of batches}\}$,
that is, $A$ represents all the sample paths for which the proportion of times the doctor
makes their own treatment decisions gets reduced only a finite number of times, assuming
that the trial is run over an indefinite time period.
Also, define the set $A_i$ denoting the event that no reduction happened in $m_i$ from the
$i$th batch onwards. In other words, the doctor did not perform poorly after the $i$th batch:
$A_i := \{\omega : m_{i-2}(\omega) > m_{i-1}(\omega) \text{ and } m_k(\omega) = m_{i-1}(\omega)\ \forall k \ge i\}$, where $m_{i-1}(\omega)$ is
the proportion of special cases that the doctor is allowed to treat at his/her discretion
in the $(i-1)$th batch for sample path $\omega$.
We make the following assumptions:
Assumption 5.3.5. We assume that the doctor performs poorly: $(f^*(X_j) - f_{I_j}(X_j))\, I\{T_j=1\} \ge a$
for some $a > 0$.
Assumption 5.3.6. Let $R^{1}_{N_k}(\delta)$ be the regret for those cases when the algorithm's
choices of arms are being played. If the algorithm's choice is made $N_k$ times, then for
this regret let us assume $R^{1}_{k}(\delta) < O\big(N_k^{1 - \frac{1}{3 + d/\kappa}}\big)$ a.s. (as in Qian and Yang (2016a)),
where $d$ is the dimension of the covariates and $\kappa$ is the Hölder smoothness parameter.
Assumption 5.3.7. Assume that $\left| \frac{1}{n}\sum_{j=1}^{n} f^*(X_j) - E f^*(X_1) \right| < O\big(\frac{\log n}{n}\big)$ a.s.
Theorem 5.3.8. If Assumptions 5.3.5-5.3.7 are satisfied, then $P(A) = 0$, which
implies that $m_k \xrightarrow{a.s.} 0$ as $k \to \infty$.
The proof of Theorem 5.3.8 can be found in section 5.4.
5.3.4 Scenario 2: doctor performs at par with the algorithm
Unlike in the previous case, here we want to show that there is a non-zero probability
for the event that the proportion of times (special cases) the doctor is allowed to make the
decision is decreased only finitely many times. For that we want to show that $P(A_i) > 0$
for some $i$; without loss of generality, we show that $P(A_2) > 0$.
For this case we make the following assumptions.
Assumption 5.3.9. For all batches $k \in \{1, 2, \ldots, M_n\}$,
\[
\frac{\sum_{v=1}^{d_k} f_{I_{j_v}}\, I\{T_{j_v}=0\}}{(1-m_k) d_k}
- \frac{\sum_{v=1}^{d_k} f_{I_{j_v}}\, I\{T_{j_v}=1\}}{m_k d_k} \le 0.
\]
Assumption 5.3.10. Let $d_1$ be the number of special cases in batch 1. The number of
special cases per batch $d_k$ and the cutoff $\beta_k$ are such that
\[
\frac{d_{k-1}\, \beta_k^2}{\log k} \to \infty \quad \text{as } k \to \infty.
\]
Assumption 5.3.11. The number of times doctor agrees with the algorithm is non-zero
for all the batches.
Theorem 5.3.12. Given that Assumptions 5.3.9-5.3.11 hold, we have that P (A2) > 0,
i.e., with a positive probability, no reduction happens in the number of chances the doctor
gets to treat special cases from the second batch onwards.
The proof for Theorem 5.3.12 can be found in section 5.4.
5.3.5 Consistency for the scenario 1: doctor performs worse
Here, we will use the result of section 5.3.3 to prove consistency of our proposed allo-
cation scheme. Along with the conditions specified in section 5.3.3 and Assumptions
5.3.1, 5.3.2 and 5.3.3, we need the following assumptions:
Assumption 5.3.13. The regression procedure is strongly consistent in $L_\infty$ norm for all
individual mean functions $f_i$ under the proposed allocation scheme, that is, $\|\hat{f}_{i,n} - f_i\|_\infty \to 0$
a.s. for each $1 \le i \le l$ as $n \to \infty$.
Assumption 5.3.14. The mean functions satisfy $f_i(x) \ge 0$, $E(f^*(X_1)) > 0$ and
\[
A = \sup_{1 \le i \le l}\ \sup_{x \in [0,1]^d} \left( f^*(x) - f_i(x) \right) < \infty.
\]
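Lemma 5.3.15. Let $U_k = \sum_{v=1}^{d_k} I\{T_{t_v} = 1\}$, $k = 1, \ldots, M_n$, denote the
number of special cases where the doctor made a treatment decision in batch $k$. Then
\[
\frac{\sum_{k=1}^{M_n} U_k}{n} \to 0 \quad \text{almost surely.}
\]
Proof. We have already shown in section 5.3.3 that $m_k \xrightarrow{a.s.} 0$. In other words, given
$\epsilon > 0$, $P(\exists M^* > 0 \text{ such that } m_k < \epsilon\ \forall k > M^*) = 1$. Assuming that $k > M^*$
corresponds to $n > M_1^*$, we consider
\begin{align*}
\frac{1}{n} \sum_{k=1}^{M_n} U_k &= \frac{1}{n} \sum_{k=1}^{M_n} m_k d_k \\
&= \frac{1}{n} \sum_{k=1}^{M^*} m_k d_k + \frac{1}{n} \sum_{k=M^*+1}^{M_n} m_k d_k.
\end{align*}
Since the first summand has only finitely many terms, for the given $\epsilon > 0$ there exists
$M_2^* > 0$ such that $\frac{1}{n}\sum_{k=1}^{M^*} m_k d_k < \epsilon$ for all $n > M_2^*$. Then for $n \ge \max\{M_1^*, M_2^*\}$,
\[
\frac{1}{n} \sum_{k=1}^{M_n} U_k \le \epsilon + \frac{\epsilon}{n} \sum_{k=M^*+1}^{M_n} d_k \le 2\epsilon \quad \text{with probability 1.}
\]
This is because we know that $\sum_{k=M^*+1}^{M_n} d_k \le n$. Therefore we have shown that
$\frac{1}{n} \sum_{k=1}^{M_n} U_k \xrightarrow{a.s.} 0$.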
Theorem 5.3.16. Given Assumptions 5.3.5-5.3.7 and Assumptions 5.3.13-5.3.14, we
will have that $R_n(\delta) \xrightarrow{a.s.} 1$.
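Proof of Theorem 5.3.16. The regret of our allocation scheme $\delta$ is given by
\begin{align*}
R_n(\delta) &= \frac{\sum_{j=1}^{n} f_{I_j,j}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&= \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&= \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&\quad + \frac{\sum_{j=1}^{n} (f_{I_j,j} - f_{i^*_j,j})\, I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{j=1}^{n} f_{i^*_j,j}\, I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&\ge \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&\quad - A\, \frac{\sum_{j=1}^{n} I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{j=1}^{n} f_{i^*_j,j}\, I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&\ge \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
- A\, \frac{\sum_{j=1}^{n} I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}.
\end{align*}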
We can rewrite the right-hand side of the above inequality as
\[
\frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
- A\, \frac{\sum_{k=1}^{M_n} \sum_{t=N_{d_{k-1}}+1}^{N_{d_k}} I\{\gamma_t=1, T_t=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}. \tag{5.5}
\]
Then, given that the doctor declared $d_k$ special cases in the $k$th batch, we can extract a
subsequence $\{t_v : v = 1, \ldots, d_k\}$ for each batch marking the special cases, so that (5.5)
equals
\[
\frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
- A\, \frac{\sum_{k=1}^{M_n} \sum_{v=1}^{d_k} I\{T_{t_v}=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}. \tag{5.6}
\]
Let $U_k = \sum_{v=1}^{d_k} I\{T_{t_v}=1\}$. Then (5.6) equals
\[
\frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
- A\, \frac{\sum_{k=1}^{M_n} U_k}{\sum_{j=1}^{n} f_{i^*_j,j}}. \tag{5.7}
\]
Note that $U_k$ is a random variable denoting the number of times the doctor is allowed to
make the decision out of the total special cases ($d_k$) considered in the $k$th batch. From
Lemma 5.3.15, we know that the second term in (5.7) converges to 0 almost surely.
We want to show that the first term in (5.7) converges to 1 almost surely, that
is,
\[
\frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \xrightarrow{a.s.} 1 \quad \text{as } n \to \infty.
\]
Notice that the term above corresponds to the times when the algorithm's decision was
made using the $\epsilon$-greedy heuristic. Let $\{j_v,\ v = 1, 2, \ldots, r_n\}$ be the subsequence where
the algorithm's decision is chosen, i.e., those observations for which the doctor agreed
with the algorithm's recommendation. Then we can write the above sum as
\[
\frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
= \frac{\sum_{v=1}^{r_n} f_{I_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}}.
\]
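We want to show that $\frac{\sum_{v=1}^{r_n} f_{I_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}} \xrightarrow{a.s.} 1$. We can rewrite this as
\begin{align*}
\frac{\sum_{v=1}^{r_n} f_{I_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}}
&= \frac{\sum_{v=1}^{r_n} f_{\hat{i}_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{v=1}^{r_n} (f_{I_{j_v},j_v} - f_{\hat{i}_{j_v},j_v})}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&\ge \frac{\sum_{v=1}^{r_n} f_{\hat{i}_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}}
- \frac{\frac{r_n}{n}\, \frac{1}{r_n} \sum_{v=1}^{r_n} A\, I\{I_{j_v} \ne \hat{i}_{j_v}\}}{\frac{1}{n} \sum_{j=1}^{n} f_{i^*_j,j}} \\
&= \frac{\sum_{v=1}^{r_n} (f_{\hat{i}_{j_v},j_v} - f_{i^*_{j_v},j_v})}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{v=1}^{r_n} f_{i^*_{j_v},j_v}}{\sum_{j=1}^{n} f_{i^*_j,j}}
- \frac{\frac{r_n}{n}\, \frac{1}{r_n} \sum_{v=1}^{r_n} A\, I\{I_{j_v} \ne \hat{i}_{j_v}\}}{\frac{1}{n} \sum_{j=1}^{n} f_{i^*_j,j}} \\
&\ge \frac{\sum_{v=1}^{r_n} (f_{\hat{i}_{j_v},j_v} - f_{i^*_{j_v},j_v})}{\sum_{j=1}^{n} f_{i^*_j,j}}
- \frac{\frac{r_n}{n}\, \frac{1}{r_n} \sum_{v=1}^{r_n} A\, I\{I_{j_v} \ne \hat{i}_{j_v}\}}{\frac{1}{n} \sum_{j=1}^{n} f_{i^*_j,j}}. \tag{5.8}
\end{align*}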
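Let $\tilde{U}_v = I\{I_{j_v} \ne \hat{i}_{j_v}\}$. Then the $\tilde{U}_v$ are independent random variables after $j_v = m_0$, with
success probability $(l-1)\pi_{j_v}$. Since
\[
\sum_{v=m_0+1}^{\infty} \mathrm{Var}\left( \frac{\tilde{U}_v}{v} \right)
= \sum_{v=m_0+1}^{\infty} \frac{(l-1)\pi_{j_v}\,(1 - (l-1)\pi_{j_v})}{v^2} < \infty,
\]
by Kolmogorov's two-series theorem we have that
\[
\sum_{v=m_0+1}^{\infty} \frac{\tilde{U}_v - (l-1)\pi_{j_v}}{v} \quad \text{converges a.s.}
\]
Then it follows by Kronecker's lemma that
\[
\frac{1}{r_n} \sum_{v=1}^{r_n} \left( \tilde{U}_v - (l-1)\pi_{j_v} \right) \to 0 \quad \text{a.s.}
\]
Since $\pi_{j_v} \to 0$ as $v \to \infty$, we will have $\frac{1}{r_n} \sum_{v=1}^{r_n} (l-1)\pi_{j_v} \to 0$ and thus
$\frac{1}{r_n} \sum_{v=1}^{r_n} \tilde{U}_v \to 0$ a.s. We also have that $\frac{r_n}{n} \in [0,1]$ a.s.; thus we have shown that the
second term in (5.8) vanishes.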
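The vanishing of the average exploration frequency can be illustrated with a small simulation. The sketch below is not part of the proof; the function name, the seed, and the exploration sequence $\pi_v = 1/(l\sqrt{v})$ used in the usage note are assumptions chosen only so that $(l-1)\pi_v \le 1$ and $\pi_v \to 0$.

```python
import random

def average_exploration(r_n, l, pi, seed=0):
    """Average of r_n independent Bernoulli((l - 1) * pi(v)) exploration indicators."""
    rng = random.Random(seed)
    draws = [1 if rng.random() < (l - 1) * pi(v) else 0
             for v in range(1, r_n + 1)]
    return sum(draws) / r_n
```

With `l = 3` and `pi = lambda v: 1 / (3 * v ** 0.5)`, the expected average over 20000 steps is roughly $(4/3)/\sqrt{20000} \approx 0.009$, and the simulated average is of the same small order.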
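Now, we need to show that
\[
\frac{\sum_{v=1}^{r_n} (f_{\hat{i}_{j_v},j_v} - f_{i^*_{j_v},j_v})}{\sum_{j=1}^{n} f_{i^*_j,j}} \xrightarrow{a.s.} 0.
\]
Note that we are restricting ourselves to the cases where the doctor agreed with
the system recommendation. When estimating the mean functions we are not using
the information from all the special cases claimed by the doctor; we use
information only from the cases where the doctor agreed with the algorithm to estimate each arm's
reward function. Let $\tilde{j}_{v-1}$ refer to the time step preceding $j_v$ in the
subsequence $\{j_v : v = 1, \ldots, r_n\}$. Consider
\begin{align*}
f_{\hat{i}_{j_v}}(X_{j_v}) - f_{i^*_{j_v}}(X_{j_v})
&= f_{\hat{i}_{j_v}}(X_{j_v}) - \hat{f}_{\hat{i}_{j_v},\tilde{j}_{v-1}}(X_{j_v})
+ \hat{f}_{\hat{i}_{j_v},\tilde{j}_{v-1}}(X_{j_v}) - \hat{f}_{i^*(X_{j_v}),\tilde{j}_{v-1}}(X_{j_v}) \\
&\quad + \hat{f}_{i^*(X_{j_v}),\tilde{j}_{v-1}}(X_{j_v}) - f_{i^*(X_{j_v})}(X_{j_v}).
\end{align*}
By definition of $\hat{i}_{j_v}$, for $j_v > m_0 + 1$ we have $\hat{f}_{\hat{i}(X_{j_v}),\tilde{j}_{v-1}}(X_{j_v}) \ge \hat{f}_{i^*(X_{j_v}),\tilde{j}_{v-1}}(X_{j_v})$, so
\begin{align*}
f_{\hat{i}_{j_v}}(X_{j_v}) - f_{i^*_{j_v}}(X_{j_v})
&\ge f_{\hat{i}_{j_v}}(X_{j_v}) - \hat{f}_{\hat{i}_{j_v},\tilde{j}_{v-1}}(X_{j_v})
+ \hat{f}_{i^*(X_{j_v}),\tilde{j}_{v-1}}(X_{j_v}) - f_{i^*(X_{j_v})}(X_{j_v}) \\
&\ge -2 \sup_{1 \le i \le l} \|\hat{f}_{i,\tilde{j}_{v-1}} - f_i\|_\infty.
\end{align*}
For $1 \le j_v \le m_0$, we have $f_{\hat{i}_{j_v}}(X_{j_v}) - f^*(X_{j_v}) \ge -A$. By Assumption 5.3.13,
$\|\hat{f}_{i,j-1} - f_i\|_\infty \xrightarrow{a.s.} 0$ as $j \to \infty$ for each $i$, and thus $\sup_{1 \le i \le l} \|\hat{f}_{i,j-1} - f_i\|_\infty \xrightarrow{a.s.} 0$. Let
$v = c_{m_0}$ be the first index at which $j_v > m_0$; then it follows for $n > m_0$ that
\[
\frac{\sum_{v=1}^{r_n} \left( f_{\hat{i}_{j_v}}(X_{j_v}) - f_{i^*_{j_v}}(X_{j_v}) \right)}{\sum_{j=1}^{n} f^*(X_j)}
\ge \frac{-A m_0/n - (2/n) \sum_{v=c_{m_0}}^{r_n} \sup_{1 \le i \le l} \|\hat{f}_{i,\tilde{j}_{v-1}} - f_i\|_\infty}{(1/n) \sum_{j=1}^{n} f^*(X_j)}.
\]
The right hand side converges to 0 almost surely and hence the conclusion follows.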
5.3.6 Consistency for scenario 2: doctor performs better
The case when the doctor performs better than the algorithm is advantageous, as the doctor beats
the algorithm. As consistency has already been established for the decision rule in Yang
and Zhu (2002), we know that the choice of treatments made by our algorithm
converges to the optimal in the long run. Hence, the fact that the doctor's choices work even
better is an added advantage, and convergence to the optimal will probably be faster in
this case.
Theorem 5.3.17. Given Assumptions 5.3.9-5.3.11 and Assumptions 5.3.13-5.3.14, we
will have that $R_n(\delta) \xrightarrow{a.s.} 1$.
Proof of Theorem 5.3.17. Consider the regret of our allocation scheme,
\begin{align*}
R_n(\delta) &= \frac{\sum_{j=1}^{n} f_{I_j,j}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&= \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=1, T_j=1\}}{\sum_{j=1}^{n} f_{i^*_j,j}}.
\end{align*}
By Assumption 5.3.9,
\[
\frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=1, T_j=1\}}{n} \ge \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=1, T_j=0\}}{n},
\]
and we have already shown in section 5.3.4 that there is a positive probability that the
number of times the doctor is given a chance to treat the special patients is non-decreasing.
Also, by Assumption 5.3.11, the number of cases in which the doctor agrees with the
algorithm is non-zero for each batch. We therefore have
\begin{align*}
R_n(\delta) &\ge \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}
+ 2\, \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=1, T_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}} \\
&\ge \frac{\sum_{j=1}^{n} f_{I_j,j}\, I\{\gamma_j=0\}}{\sum_{j=1}^{n} f_{i^*_j,j}}.
\end{align*}
Now, we can use exactly the same proof as in the previous case to show that $R_n(\delta) \xrightarrow{a.s.} 1$
as $n \to \infty$. That is because we still have $r_n \to \infty$ as $n \to \infty$, as the number of
times the algorithm's choice is made is also not decreasing as $n \to \infty$.
5.4 Proofs for Theorems 5.3.8 and 5.3.12
Proof of Theorem 5.3.8. Recall that $A$ is the set of all the sample paths for which
$m_k$ gets reduced only a finite number of times, and the set $A_i$ denotes the event that no
reduction happened in $m_i$ from the $i$th batch onwards. It can be seen that $A = \cup_{i=2}^{\infty} A_i$.
Also, notice that the $A_i$ are disjoint for all $i = 2, \ldots$; then
\[
P(A) = P(\cup_{i=2}^{\infty} A_i) = \sum_{i=2}^{\infty} P(A_i).
\]
Let $M$ denote the set of possible values of $m_{i-1}$ for the $(i-1)$th batch (or the set
of all possible paths up to the $(i-1)$th batch). Then
\begin{align*}
P(A_i) = \sum_{M} & P(A_i \mid \text{the proportion of times doctor treats in batch } i-1 = m_{i-1}) \\
& \times P(\text{the proportion of times doctor treats in batch } i-1 = m_{i-1}). \tag{5.9}
\end{align*}
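It is important to note here that for the event $A_i$ to happen we need condition 5.1
to hold ($B_k$ occurs) for all $k \ge i$. Also note that $m_k = m_{i-1}$ for all $k \ge i$ implies that
the events $B_k$, $k \ge i$, are independent of each other; hence,
\begin{align*}
&P(A_i \mid \text{the proportion of times doctor gets a chance to treat in batch } i-1 = m_{i-1}) \\
&= \prod_{k=i}^{\infty} P(B_k \mid \text{the proportion of times doctor gets a chance to treat in batch } k-1 = m_{i-1}) \\
&= \prod_{k=i}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} Y_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} Y_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k \right) \\
&= \prod_{k=i}^{\infty} P\Bigg( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k \\
&\qquad + \Bigg\{ \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \Bigg\} \Bigg) \\
&\le \prod_{k=i}^{\infty} P\Bigg( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k \\
&\qquad + \Bigg\{ \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} \\
&\qquad\quad + \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \\
&\qquad\quad + \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \Bigg\} \Bigg) \tag{5.10} \\
&\le \prod_{k=i}^{\infty} P\Bigg( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k \\
&\qquad + \Bigg\{ \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} \\
&\qquad\quad + \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \\
&\qquad\quad + \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} - E f^*(X_1)
+ E f^*(X_1) - \frac{\sum_{v=1}^{d_{k-1}} f^*(X_{j_v}) I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}} \Bigg\} \Bigg). \tag{5.11}
\end{align*}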
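Next, we use Assumption 5.3.6 in section 5.3.3 for the first pair of summands on the right
hand side of (5.10). Let $N_k = (1-m_{i-1}) d_{k-1}$; then
$R^{1}_{k}(\delta) = \sum_{v=1}^{d_{k-1}} (f^*(X_{j_v}) - f_{I_{j_v}}(X_{j_v}))\, I\{T_{j_v}=0\}$, and by Assumption 5.3.6 there exists a constant $C^*$ such that
\[
\frac{R^{1}_{k}(\delta)}{(1-m_{i-1}) d_{k-1}}
\le C^* \frac{((1-m_{i-1}) d_{k-1})^{1 - \frac{1}{3+d/\kappa}}}{(1-m_{i-1}) d_{k-1}}
= C_1^* \frac{1}{(d_{k-1})^{\frac{1}{3+d/\kappa}}}
\]
almost surely. Then for $0 < \epsilon < \frac{a}{8}$, there exists $M_1^* > 0$ such that for $k > M_1^*$,
\[
\left| C_1^* \frac{1}{(d_{k-1})^{\frac{1}{3+d/\kappa}}} \right| < \epsilon.
\]
Assumption 5.3.5 can be used for the second pair of summands and Assumption
5.3.7 for the third pair of summands on the right hand side of (5.10).
In (5.11), we have added and subtracted $E f^*(X_1)$ in the last pair of summands in
(5.10). Then each of those quantities will be of the order $O\big(\frac{\log((1-m_{i-1}) d_{k-1})}{(1-m_{i-1}) d_{k-1}}\big)$ a.s. and
$O\big(\frac{\log(m_{i-1} d_{k-1})}{m_{i-1} d_{k-1}}\big)$ a.s., respectively. We will then have that there exists $M_2^* > 0$ such
that for $k \ge M_2^*$, $\left| \frac{\log((1-m_{i-1}) d_{k-1})}{(1-m_{i-1}) d_{k-1}} \right| < \frac{\epsilon}{2}$, and there exists $M_3^* > 0$ such that
$\left| \frac{\log(m_{i-1} d_{k-1})}{m_{i-1} d_{k-1}} \right| < \frac{\epsilon}{2}$ for $k \ge M_3^*$. Let $M^* = \max\{M_1^*, M_2^*, M_3^*\}$; then almost surely for $k \ge M^*$,
\begin{align*}
&\le \prod_{k=M^*}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k - \epsilon + a - \frac{\epsilon}{2} - \frac{\epsilon}{2} \right) \\
&\le \prod_{k=M^*}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \beta_k - \frac{a}{8} + a - \frac{a}{8} \right).
\end{align*}
Let $\beta_1 > -\frac{a}{4}$ and $\beta_k \uparrow 0$; then
\[
\le \prod_{k=M^*}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_{i-1} d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_{i-1}) d_{k-1}} > \frac{a}{2} \right).
\]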
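Using the inequality (5.4),
\begin{align*}
&\le \prod_{k=M^*}^{\infty} \left[ \exp\left( - \frac{d_{k-1}\, m_{i-1}^2 (a/2)^2}{2(v^2 + m_{i-1} a c/2)} \right)
+ \exp\left( - \frac{d_{k-1}\, (1-m_{i-1})^2 (a/2)^2}{2(v^2 + (1-m_{i-1}) a c/2)} \right) \right] \\
&= \prod_{k=M^*}^{\infty} \left[ \exp\left( - \frac{d_{k-1}\, m_{i-1}^2 a^2}{8(v^2 + m_{i-1} a c/2)} \right)
+ \exp\left( - \frac{d_{k-1}\, (1-m_{i-1})^2 a^2}{8(v^2 + (1-m_{i-1}) a c/2)} \right) \right].
\end{align*}
Let $\tilde{m}_{i-1} = \min\{m_{i-1}, 1-m_{i-1}\}$; then
\begin{align*}
&\le \prod_{k=M^*}^{\infty} \left[ \exp\left( - \frac{d_{k-1}\, \tilde{m}_{i-1}^2 a^2}{8(v^2 + (1-\tilde{m}_{i-1}) a c/2)} \right)
+ \exp\left( - \frac{d_{k-1}\, \tilde{m}_{i-1}^2 a^2}{8(v^2 + (1-\tilde{m}_{i-1}) a c/2)} \right) \right] \\
&\le \prod_{k=M^*}^{\infty} 2 \exp\left( - \frac{d_{k-1}\, \tilde{m}_{i-1}^2 a^2}{8(v^2 + (1-\tilde{m}_{i-1}) a c/2)} \right) \\
&= \lim_{t \to \infty} 2^{t-M^*} \exp\left( - \frac{\tilde{m}_{i-1}^2 a^2}{8(v^2 + (1-\tilde{m}_{i-1}) a c/2)} \sum_{k=M^*}^{t} d_{k-1} \right).
\end{align*}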
If the number of cases that the doctor feels are special is non-decreasing with time, then
the exponential decay happens faster than the growth of $2^t$, and we have that
\[
P(A_i \mid \text{proportion of times doctor gets to treat in batch } i-1 = m_{i-1}) \overset{a.s.}{=} 0.
\]
Also, since $M$ has only finitely many elements (there can only be a finite set of paths up to
batch $i-1$, depending on how the user decides to decrease $m_k$, $k \le i-1$), the sum
in (5.9) is 0 almost surely, that is,
\[
P(A_i) \overset{a.s.}{=} 0.
\]
Therefore, we have shown that $P(A_i) = 0$ for all $i = 2, \ldots$. Hence, $P(A) = P(\cup_{i=2}^{\infty} A_i) =
\sum_{i=2}^{\infty} P(A_i) = 0$. Thus, we have shown that the probability that the doctor's proportion of
decisions gets reduced only a finite number of times is zero. This in turn
implies that $m_k \xrightarrow{a.s.} 0$.
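The behaviour of the final product bound in this proof can be explored numerically on the log scale. The sketch below is illustrative only: the function name and the batch-size sequences passed as `d` are assumptions, and the number of factors is counted directly from the product range $k = M^*, \ldots, t$.

```python
import math

def log_partial_product(t, M_star, m_tilde, a, v, c, d):
    """Log of prod_{k=M*}^{t} 2 * exp(-C * d(k - 1)), with C as in the last display.

    d: callable giving the number of special cases d_k in batch k (assumed input).
    """
    C = m_tilde ** 2 * a ** 2 / (8.0 * (v ** 2 + (1.0 - m_tilde) * a * c / 2.0))
    ks = range(M_star, t + 1)  # this range contains t - M* + 1 factors
    return len(ks) * math.log(2.0) - C * sum(d(k - 1) for k in ks)
```

With `m_tilde = 0.5` and `a = v = c = 1`, the constant is $C = 0.025$. For a growing sequence such as `d = lambda k: k`, the log of the partial product tends to $-\infty$, so the product tends to 0. For a small constant sequence such as `d = lambda k: 1`, the log stays positive, which suggests that the bound becomes informative only once $C\, d_{k-1}$ exceeds $\log 2$.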
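Proof of Theorem 5.3.12. Recall that $A_2$ is the event that the proportion of times the doctor
is allowed to make the decision does not decrease after the second batch. Consider
\begin{align*}
&P(A_2 \mid \text{the proportion of times doctor treats in batch } 1 = m_1) \\
&= \prod_{k=2}^{\infty} P(B_k \mid \text{the proportion of times doctor treats in batch } 1 = m_1) \\
&= \prod_{k=2}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} Y_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} Y_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} > \beta_k \right) \\
&= \prod_{k=2}^{\infty} P\Bigg( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} > \beta_k \\
&\qquad + \Bigg\{ \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_1 d_{k-1}} \Bigg\} \Bigg).
\end{align*}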
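From Assumption 5.3.9, $\frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} - \frac{\sum_{v=1}^{d_{k-1}} f_{I_{j_v}} I\{T_{j_v}=1\}}{m_1 d_{k-1}} \le 0$; hence
\begin{align*}
&\ge \prod_{k=2}^{\infty} P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} > \beta_k \right) \\
&= \prod_{k=2}^{\infty} \left[ 1 - P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}} < \beta_k \right) \right] \\
&= \prod_{k=2}^{\infty} \left[ 1 - P\left( \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=0\}}{(1-m_1) d_{k-1}}
- \frac{\sum_{v=1}^{d_{k-1}} \epsilon_{I_{j_v},j_v} I\{T_{j_v}=1\}}{m_1 d_{k-1}} > -\beta_k \right) \right].
\end{align*}
From the inequality obtained in (5.4),
\begin{align*}
&\ge \prod_{k=2}^{\infty} \left[ 1 - \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c(1-m_1)\beta_k/2)} \right)
- \exp\left( - \frac{d_{k-1}\, m_1^2 \beta_k^2}{8(v^2 - m_1 \beta_k c/2)} \right) \right] \\
&= \exp\left[ \sum_{k=2}^{\infty} \log\left\{ 1 - \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c(1-m_1)\beta_k/2)} \right)
- \exp\left( - \frac{d_{k-1}\, m_1^2 \beta_k^2}{8(v^2 - m_1 \beta_k c/2)} \right) \right\} \right]. \tag{5.12}
\end{align*}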
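Let us denote
\[
z_k = \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c(1-m_1)\beta_k/2)} \right)
+ \exp\left( - \frac{d_{k-1}\, m_1^2 \beta_k^2}{8(v^2 - m_1 \beta_k c/2)} \right).
\]
Then we have that (5.12) equals
\[
\exp\left[ \sum_{k=2}^{\infty} \log\{1 - z_k\} \right]. \tag{5.13}
\]
We know that for $y > 0$,
\[
\frac{y-1}{y} \le \log y \le y - 1.
\]
If $y = 1 - x$ for $0 < x < \frac{1}{2}$, then
\[
-2x \le \log(1-x) \le -x.
\]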
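Using this condition, and assuming that $z_k$ is small enough (smaller than $1/2$), we get that
(5.13) is
\begin{align*}
&\ge \exp\left[ - \sum_{k=2}^{\infty} 2 z_k \right] \\
&= \exp\left[ -2 \sum_{k=2}^{\infty} \left\{ \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c(1-m_1)\beta_k/2)} \right)
+ \exp\left( - \frac{d_{k-1}\, m_1^2 \beta_k^2}{8(v^2 - m_1 \beta_k c/2)} \right) \right\} \right] \\
&\ge \exp\left[ -2 \sum_{k=2}^{\infty} \left\{ \exp\left( - \frac{d_{k-1} (1-m_1)^2 \beta_k^2}{8(v^2 - c\beta_1/2)} \right)
+ \exp\left( - \frac{d_{k-1}\, m_1^2 \beta_k^2}{8(v^2 - c\beta_1/2)} \right) \right\} \right].
\end{align*}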
From Assumption 5.3.10, we have that $\frac{d_{k-1} \beta_k^2}{\log k} \to \infty$. This implies that the series on the right side
of the inequality above is summable. Therefore, $P(A_2 \mid \text{the proportion of times doctor
gets a chance to treat in 1st batch} = m_1) > 0$. Therefore we have that $P(A_2) \neq 0$, hence
$P(A) = \sum_{i=2}^{\infty} P(A_i) > 0$. Hence, there is a positive probability that the doctor's chances
to treat the patient in special situations get reduced only finitely many times.
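The role of Assumption 5.3.10 can be checked numerically for a concrete choice of sequences. The sketch below is an illustration under assumptions: the function names are hypothetical, and the defaults $d_{k-1} = 200k^3$ and $\beta_k = -1/k$ are chosen here only because they give $d_{k-1}\beta_k^2/\log k = 200k/\log k \to \infty$ and keep every $z_k$ below $1/2$.

```python
import math

def z_term(m1, v, c, d_prev, beta_k):
    """z_k as defined before (5.13), for a non-positive cutoff beta_k."""
    def term(p):
        return math.exp(-d_prev * p ** 2 * beta_k ** 2
                        / (8.0 * (v ** 2 - c * p * beta_k / 2.0)))
    return term(1.0 - m1) + term(m1)

def partial_sum(m1, v, c, K,
                d_prev=lambda k: 200 * k ** 3,  # assumed d_{k-1}
                beta=lambda k: -1.0 / k):       # assumed beta_k, increasing to 0
    """Partial sum of z_k over k = 2, ..., K."""
    return sum(z_term(m1, v, c, d_prev(k), beta(k)) for k in range(2, K + 1))

def lower_bound(m1, v, c, K):
    """Truncated version of the lower bound exp(-2 * sum_k z_k)."""
    return math.exp(-2.0 * partial_sum(m1, v, c, K))
```

For `m1 = 0.5` and `v = c = 1`, the partial sums stabilize after a handful of terms and the truncated lower bound stays strictly between 0 and 1, in line with the positive probability claimed for $A_2$ under these assumed sequences.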
Chapter 6
Conclusion
In this dissertation, we consider a contextual bandit problem with delayed feedback
and propose randomized allocation strategies for the problem, with sequential treat-
ment allocation as the motivation. We take a nonparametric approach in modeling
the relationship of the rewards with the covariates. We compare strategies which dif-
fer in how the underlying exploration probability sequence is updated in the presence
of delayed feedback, to see which ones perform better under different delay scenarios
and underlying complexities of the problem. We study these strategies both from an
asymptotic and finite-time perspective, and draw comparisons under various simulated
and real-data settings.
One of our major contributions is to consider random and unbounded delays in a non-
parametric modeling framework for contextual bandits, as most other works on delayed
contextual bandits are parametric in nature with fixed or bounded delays. Our results
are promising as they could address a broader range of problems, especially situations
where it is not possible to explicitly lay out a parametric model. Also, it is important
to note that the assumptions we make on delays are mild and could potentially hold
for a lot of practical settings. Another contribution of the work is that we try to relax
the assumption of delays being independent of the choice of covariates. Relaxing this
assumption is crucial for applying in the medical domain but has not been well-studied.
Our finite-time bounds for this case, although conservative, can form a starting point
for further developments in this direction, which could potentially require development
of new mathematical tools that deal with the underlying dependence structure.
In another line of work, we consider incorporating doctor/expert advice
into automated bandit strategies. Allowing for these expert interventions
would make these algorithms implementable in real-life scenarios, as expert advice is
crucial in decision-making processes. We propose a randomized allocation
strategy which allows for the doctor's interventions and show that it is strongly consistent,
that is, in the long run the cumulative reward for the proposed strategy approaches the
cumulative reward of the theoretically best scenario.
In a nutshell, our research reveals that randomized allocation strategies for contex-
tual bandits are useful and promising tools that could be used for sequential decision
making in a lot of applications. There is still a lot of statistical work that needs to
be done, especially in terms of drawing statistical inference and establishing robust-
ness of these methodologies. A rigorous understanding and development of more robust
strategies could be of immense help as a tool to help health care providers in using the
enormous amount of patient data available for making informed treatment decisions.
There is considerable scope for future developments in this promising field of research;
some of the most immediate directions include the following.
• Developing minimax optimal finite-time results for the proposed strategies. This
will help in a much deeper understanding on how these randomized strategies
theoretically compare to other already existing strategies.
• Devising methodology for estimating delays when modeling for delayed rewards in
a bandit setting. This can be specifically important in scenarios where one might
not have any prior understanding on the expected delay in observing the rewards.
This could also be useful for studying the more complicated setting when delays
depend on covariates and arm choices.
• Incorporating delays in the work with doctor’s intervention in the proposed ran-
domized allocation strategies, and studying their finite time properties. This
would make these strategies more applicable and would require controlling the
selection bias in the process.
• Developing statistical inference tools for contextual bandit strategies. While there
has been some theoretical development in statistical inference on standard multi-
armed bandit strategies, to our knowledge, technical tools required to build sta-
tistical inference theory in contextual bandits are yet to be developed. The pres-
ence of covariates in a sequential setup poses technical challenges that remain
unexplored. Statistical inference could be highly beneficial in sequential decision
making. For example, confidence intervals for our reward function estimates and
robust procedures that remain valid even when some assumptions are violated can
make these procedures more reliable.
References
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear
stochastic bandits. In Advances in Neural Information Processing Systems, pages
2312–2320.
Agarwal, A., Dudík, M., Kale, S., Langford, J., and Schapire, R. (2012). Contextual
bandit learning with predictable rewards. In Artificial Intelligence and Statistics,
pages 19–26.
Agrawal, S. and Goyal, N. (2012). Analysis of thompson sampling for the multi-armed
bandit problem. In Conference on Learning Theory (COLT).
Agrawal, S. and Goyal, N. (2013a). Further optimal regret bounds for thompson sam-
pling. In Artificial Intelligence and Statistics, pages 99–107.
Agrawal, S. and Goyal, N. (2013b). Thompson sampling for contextual bandits with
linear payoffs. In International Conference on Machine Learning, pages 127–135.
Ahuja, V. and Birge, J. R. (2016). Response-adaptive designs for clinical trials: Simul-
taneous learning from multiple patients. European Journal of Operational Research,
248(2):619–633.
Anderson, T. (1964). Sequential analysis with delayed observations. Journal of the
American Statistical Association, 59(308):1006–1015.
Anscombe, F. (1963). Sequential medical trials. Journal of the American Statistical
Association, 58(302):365–383.
Armitage, P. et al. (1975). Sequential medical trials. 2nd edition.
Audibert, J.-Y. and Bubeck, S. (2010a). Best arm identification in multi-armed bandits.
Audibert, J.-Y. and Bubeck, S. (2010b). Regret bounds and minimax policies under
partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836.
Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation trade-off
using variance estimates in multi-armed bandits. Theoretical Computer Science,
410(19):1876–1902.
Audibert, J.-Y., Tsybakov, A. B., et al. (2007). Fast learning rates for plug-in classifiers.
The Annals of statistics, 35(2):608–633.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002a). Finite-time analysis of the multi-
armed bandit problem. Machine Learning, 47(2-3):235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged
casino: The adversarial multi-armed bandit problem. In Foundations of Computer
Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002b). The nonstochastic
multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.
Auer, P., Ortner, R., and Szepesvári, C. (2007). Improved rates for the stochastic
continuum-armed bandit problem. In International Conference on Computational
Learning Theory, pages 454–468. Springer.
Bartroff, J., Lai, T. L., and Shih, M.-C. (2013). Adaptive design of confirmatory trials.
In Sequential Experimentation in Clinical Trials, pages 187–223. Springer.
Bastani, H. and Bayati, M. (2015). Online decision-making with high-dimensional co-
variates.
Bastani, H., Bayati, M., and Khosravi, K. (2017). Mostly exploration-free algorithms
for contextual bandits. arXiv preprint arXiv:1704.09011.
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual
bandit algorithms with supervised learning guarantees. In Proceedings of the
Fourteenth International Conference on Artificial Intelligence and Statistics, pages
19–26.
Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: exponential
bounds and rates of convergence. Bernoulli, 4(3):329–375.
Bistritz, I., Zhou, Z., Chen, X., Bambos, N., and Blanchet, J. (2019). Online EXP3
learning in adversarial bandits with delayed feedback. In Advances in Neural Information
Processing Systems, pages 11345–11354.
Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic
multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–
122.
Cella, L. and Cesa-Bianchi, N. (2019). Stochastic bandits with delay-dependent payoffs.
arXiv preprint arXiv:1910.02757.
Cesa-Bianchi, N. and Fischer, P. (1998). Finite-time regret bounds for the multiarmed
bandit problem. In ICML, volume 1998, pages 100–108. Citeseer.
Cesa-Bianchi, N., Gentile, C., and Mansour, Y. (2018). Nonstochastic bandits with
composite anonymous feedback. In Conference On Learning Theory, pages 750–773.
Cesa-Bianchi, N., Gentile, C., Mansour, Y., and Minora, A. (2016). Delay and
cooperation in nonstochastic bandits. Journal of Machine Learning Research, 49(1):613–650.
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge
University Press.
Chapelle, O. (2014). Modeling delayed feedback in display advertising. In Proceedings
of the 20th ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 1097–1105. ACM.
Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. In
Advances in Neural Information Processing Systems, pages 2249–2257.
Chow, S.-C. and Chang, M. (2012). Adaptive design methods in clinical trials. CRC
Press, Boca Raton, FL.
Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with
linear payoff functions. In Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics, pages 208–214.
Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under
bandit feedback. COLT, pages 355–366.
Desautels, T., Krause, A., and Burdick, J. W. (2014). Parallelizing exploration-
exploitation tradeoffs in Gaussian process bandit optimization. The Journal of
Machine Learning Research, 15(1):3873–3923.
Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang,
T. (2011). Efficient optimal learning for contextual bandits. In Proceedings of the
Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. AUAI Press.
Eick, S. G. (1988a). Gittins procedures for bandits with delayed responses. Journal of
the Royal Statistical Society: Series B (Methodological), 50(1):125–132.
Eick, S. G. (1988b). The two-armed bandit with delayed responses. The Annals of
Statistics, pages 254–264.
Filippi, S., Cappé, O., Garivier, A., and Szepesvári, C. (2010). Parametric bandits:
The generalized linear case. In Advances in Neural Information Processing Systems,
pages 586–594.
Fontaine, X., Berthet, Q., and Perchet, V. (2019). Regularized contextual bandits.
In The 22nd International Conference on Artificial Intelligence and Statistics, pages
2144–2153.
Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability,
pages 100–118.
Goldenshluger, A. and Zeevi, A. (2013). A linear response bandit problem. Stochastic
Systems, 3(1):230–261.
Goldenshluger, A. and Zeevi, A. (2009). Woodroofe’s one-armed bandit problem
revisited. The Annals of Applied Probability, 19(4):1603–1633.
Hoeffding, W. (1994). Probability inequalities for sums of bounded random variables.
In The Collected Works of Wassily Hoeffding, pages 409–426. Springer.
Hu, Y., Kallus, N., and Mao, X. (2019). Smooth contextual bandits: Bridging the
parametric and non-differentiable regret regimes. arXiv preprint arXiv:1909.02553.
Joulani, P., György, A., and Szepesvári, C. (2013). Online learning under delayed
feedback. In International Conference on Machine Learning, pages 1453–1461.
Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson sampling: An
asymptotically optimal finite-time analysis. In International Conference on Algorithmic
Learning Theory, pages 199–213. Springer.
Kim, E. S., Herbst, R. S., Wistuba, I. I., Lee, J. J., Blumenschein, G. R., Tsao, A.,
Stewart, D. J., Hicks, M. E., Erasmus, J., Gupta, S., et al. (2011). The BATTLE trial:
personalizing therapy for lung cancer. Cancer Discovery, 1(1):44–53.
Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-armed bandits in metric spaces.
In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages
681–690. ACM.
Lai, T., Levin, B., Robbins, H., and Siegmund, D. (1985). Sequential medical trials. In
Herbert Robbins Selected Papers, pages 247–250. Springer.
Lai, T. L. and Liao, O. Y.-W. (2012). Efficient adaptive randomization and stopping
rules in multi-arm clinical trials for testing a new treatment. Sequential Analysis,
31(4):441–457.
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules.
Advances in Applied Mathematics, 6(1):4–22.
Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-
armed bandits. In Proceedings of the 20th International Conference on Neural
Information Processing Systems, pages 817–824. Citeseer.
Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits
with side information. In Advances in Neural Information Processing Systems, pages
817–824.
Lattimore, T. and Szepesvári, C. (2018). Bandit algorithms. Cambridge University
Press.
Li, B., Chen, T., and Giannakis, G. B. (2019). Bandit online learning with unknown
delays. In The 22nd International Conference on Artificial Intelligence and Statistics,
pages 993–1002.
Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach
to personalized news article recommendation. In Proceedings of the 19th International
Conference on World Wide Web, pages 661–670. ACM.
Mandel, T., Liu, Y.-E., Brunskill, E., and Popović, Z. (2015). The queue method:
Handling delay, heuristics, prior data, and evaluation in bandits. In Twenty-Ninth
AAAI Conference on Artificial Intelligence.
Maurer, A. and Pontil, M. (2009). Empirical Bernstein bounds and sample variance
penalization. arXiv preprint arXiv:0907.3740.
May, B. C., Korda, N., Lee, A., and Leslie, D. S. (2012). Optimistic Bayesian sampling
in contextual-bandit problems. Journal of Machine Learning Research,
13(Jun):2069–2106.
McDiarmid, C. (1998). Concentration. In Probabilistic methods for algorithmic discrete
mathematics, pages 195–248. Springer.
Murphy, S. A. (2005). An experimental design for the development of adaptive treatment
strategies. Statistics in Medicine, 24(10):1455–1481.
Nahum-Shani, I., Smith, S. N., Spring, B. J., Collins, L. M., Witkiewitz, K., Tewari,
A., and Murphy, S. A. (2017). Just-in-time adaptive interventions (JITAIs) in mobile
health: key components and design principles for ongoing health behavior support.
Annals of Behavioral Medicine, 52(6):446–462.
Perchet, V. and Rigollet, P. (2013). The multi-armed bandit problem with covariates.
The Annals of Statistics, 41(2):693–721.
Perchet, V., Rigollet, P., Chassang, S., and Snowberg, E. (2016). Batched bandit
problems. The Annals of Statistics, 44(2):660–681.
Pike-Burke, C., Agrawal, S., Szepesvári, C., and Grünewälder, S. (2017). Bandits with
delayed anonymous feedback. stat, 1050:20.
Pike-Burke, C., Agrawal, S., Szepesvári, C., and Grünewälder, S. (2018). Bandits with
delayed, aggregated anonymous feedback. In International Conference on Machine
Learning.
Qian, W. and Yang, Y. (2016a). Kernel estimation and model combination in a bandit
problem with covariates. Journal of Machine Learning Research, (1):5181–5217.
Qian, W. and Yang, Y. (2016b). Randomized allocation with arm elimination in a
bandit problem with covariates. Electronic Journal of Statistics, 10(1):242–270.
Rabbi, M., Pfammatter, A., Zhang, M., Spring, B., and Choudhury, T. (2015).
Automated personalized feedback for physical activity and dietary behavior change with
mobile phones: a randomized controlled trial on adults. JMIR mHealth and uHealth,
3(2):e42.
Rigollet, P. and Zeevi, A. (2010). Nonparametric bandits with covariates. COLT 2010,
page 54.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of
the American Mathematical Society, 58(5):527–535.
Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits.
Mathematics of Operations Research, 35(2):395–411.
Russo, D. and Van Roy, B. (2014). Learning to optimize via posterior sampling.
Mathematics of Operations Research, 39(4):1221–1243.
Sarkar, J. (1991). One-armed bandit problems with covariates. The Annals of Statistics,
19(4):1978–2002.
Slivkins, A. (2014). Contextual bandits with similarity information. Journal of Machine
Learning Research, 15(1):2533–2568.
Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits.
In Advances in Neural Information Processing Systems, pages 828–836.
Sommer Thune, T., Cesa-Bianchi, N., and Seldin, Y. (2019). Nonstochastic multiarmed
bandits with unrestricted delays. arXiv preprint arXiv:1906.00670.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT
press.
Suzuki, Y. (1966). On sequential decision problems with delayed observations. Annals
of the Institute of Statistical Mathematics, 18(1):229–267.
Sverdlov, O. (2015). Modern adaptive randomized clinical trials: statistical and practical
aspects, volume 81. CRC Press.
Szörényi, B., Busa-Fekete, R., Weng, P., and Hüllermeier, E. (2015). Qualitative multi-
armed bandits: A quantile-based approach. In 32nd International Conference on
Machine Learning, pages 1660–1668.
Tewari, A. and Murphy, S. A. (2017). From ads to interventions: Contextual bandits
in mobile health. In Mobile Health, pages 495–517. Springer.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds
another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
Vernade, C., Cappé, O., and Perchet, V. (2017). Stochastic bandit models for delayed
conversions. In Conference on Uncertainty in Artificial Intelligence.
Vernade, C., Carpentier, A., Zappella, G., Ermis, B., and Brueckner, M. (2018).
Contextual bandits under delayed feedback. arXiv preprint arXiv:1807.02089.
Villar, S. S., Wason, J., and Bowden, J. (2015). Response-adaptive randomization for
multi-arm clinical trials using the forward looking Gittins index rule. Biometrics,
71(4):969–978.
Wanigasekara, N. and Yu, C. (2019). Nonparametric contextual bandits in metric spaces
with unknown metric. In Advances in Neural Information Processing Systems, pages
14657–14667.
Wason, J. and Jaki, T. (2012). Optimal design of multi-arm multi-stage trials. Statistics
in Medicine, 31(30):4269–4279.
Wei, L. and Durham, S. (1978). The randomized play-the-winner rule in medical trials.
Journal of the American Statistical Association, 73(364):840–843.
Woodroofe, M. (1979). A one-armed bandit problem with a concomitant variable.
Journal of the American Statistical Association, 74(368):799–806.
Yang, Y. and Zhu, D. (2002). Randomized allocation with nonparametric estimation for
a multi-armed bandit problem with covariates. Annals of Statistics, pages 100–121.
Yoshikawa, Y. and Imai, Y. (2018). A nonparametric delayed feedback model for
conversion rate prediction. arXiv preprint arXiv:1802.00255.
Zhou, Z., Xu, R., and Blanchet, J. (2019). Learning in generalized linear contextual
bandits with stochastic delays. In Advances in Neural Information Processing Systems,
pages 5198–5209.
Zimmert, J. and Seldin, Y. (2019). An optimal algorithm for adversarial bandits with
arbitrary delays. arXiv preprint arXiv:1910.06054.
Appendix A
Appendix
In this appendix, we list the statistical concepts and well-known technical tools
that are used regularly in the dissertation.
We begin with the Borel-Cantelli Lemma.
Lemma A.0.1 (Borel-Cantelli). Let $(A_1, A_2, \ldots)$ be a sequence of events in a common
probability space $(\Omega, \mathcal{F}, P)$ and set $A = \limsup_{n\to\infty} A_n$. If
$\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(A) = 0$.
This result is useful in assessing almost sure convergence and is often used in the
analysis presented in this dissertation. Next, we define the modulus of continuity,
which quantifies the maximum difference between the values of a given function at
nearby points of a given domain.
Definition A.0.2. Let $x_1, x_2 \in [0, 1]^d$. Then $w(h; f)$ denotes the modulus of continuity
of $f$, defined by
\[
w(h; f) = \sup\{|f(x_1) - f(x_2)| : |x_{1k} - x_{2k}| \le h \text{ for all } 1 \le k \le d\}.
\]
It can be seen that if $f$ is continuous on $[0, 1]^d$ then $w(h; f) \to 0$ as $h \to 0$.
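As a rough numerical illustration of Definition A.0.2, the following Python sketch approximates $w(h; f)$ by a grid search. It assumes $d = 1$ and the test function $f(x) = x^2$, neither of which is taken from the dissertation; the names `modulus_of_continuity` and `square` and the grid size are hypothetical choices. For this particular $f$, the closed form is $w(h; f) = 2h - h^2$, which tends to 0 with $h$.

```python
def modulus_of_continuity(f, h, grid_size=201):
    """Approximate w(h; f) = sup{|f(x1) - f(x2)| : |x1 - x2| <= h} on [0, 1].

    Assumption: d = 1, and the supremum is replaced by a maximum over an
    equally spaced grid, so the result is a lower approximation of w(h; f).
    """
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    best = 0.0
    for i, x1 in enumerate(xs):
        for x2 in xs[i:]:
            if x2 - x1 > h + 1e-12:  # grid is sorted, so later points are too far
                break
            best = max(best, abs(f(x1) - f(x2)))
    return best


def square(x):
    """Hypothetical test function f(x) = x^2 (not from the dissertation)."""
    return x * x


for h in (0.5, 0.1, 0.01):
    approx = modulus_of_continuity(square, h)
    exact = 2 * h - h * h  # closed form of w(h; f) for f(x) = x^2 on [0, 1]
    print(f"h={h}: grid value {approx:.6f}, closed form {exact:.6f}")
```

The grid value agrees with the closed form here only because each chosen $h$ is a multiple of the grid spacing; in general the grid search underestimates the supremum.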
Next, we review some standard concentration inequalities that are used throughout
the dissertation.
A.1 Concentration inequalities
Lemma A.1.1 (Hoeffding’s Inequality). Let $X_1, X_2, \ldots, X_n$ be independent real-valued
random variables such that for each $i = 1, \ldots, n$ there exist constants $a_i \le b_i$ with
$P[a_i \le X_i \le b_i] = 1$. Then for every $\epsilon > 0$,
\[
P\left[\sum_{i=1}^{n} X_i - E\sum_{i=1}^{n} X_i > \epsilon\right]
\le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right).
\]
More such inequalities with their proofs can be found in Hoeffding (1994).
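To give a sense of how Lemma A.1.1 behaves numerically, the sketch below compares the Hoeffding bound with an exact tail probability. It assumes i.i.d. Bernoulli($p$) variables, so that $a_i = 0$, $b_i = 1$ and the sum is binomial; the values $n = 50$, $p = 0.3$ and the function names are hypothetical choices made only for illustration.

```python
import math


def hoeffding_bound(eps, ranges):
    """Right-hand side of Hoeffding's inequality: exp(-2 eps^2 / sum (b_i - a_i)^2)."""
    return math.exp(-2.0 * eps ** 2 / sum((b - a) ** 2 for a, b in ranges))


def binomial_upper_tail(n, p, threshold):
    """Exact P[S > threshold] for S ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k > threshold
    )


n, p = 50, 0.3  # assumed example values, not from the dissertation
ranges = [(0.0, 1.0)] * n  # each Bernoulli variable lies in [0, 1]
for eps in (2.0, 5.0, 10.0):
    exact = binomial_upper_tail(n, p, n * p + eps)  # P[S - E S > eps]
    bound = hoeffding_bound(eps, ranges)
    print(f"eps={eps}: exact tail {exact:.3e}, Hoeffding bound {bound:.3e}")
```

In this example the exact tail is always below the bound, as the lemma guarantees; the gap reflects that Hoeffding's inequality uses only the ranges $b_i - a_i$ and not the variances.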
The martingale version of Hoeffding’s inequality is known as the Azuma-Hoeffding
inequality.
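Lemma A.1.2 (Azuma-Hoeffding Inequality). Suppose $\mathcal{F}_j$, $j = 1, 2, \ldots$ is an increasing
filtration of $\sigma$-fields. For each $j \ge 1$, let $X_j$ be $\mathcal{F}_j$-measurable such that $X_j \ge 0$ almost
surely and $a_j \le X_j \le b_j$. Then for all $\epsilon > 0$, we have
\[
P\left(\sum_{j=1}^{n} X_j - \sum_{j=1}^{n} E(X_j \mid \mathcal{F}_{j-1}) > \epsilon\right)
\le \exp\left(-\frac{2\epsilon^2}{\sum_{j=1}^{n}(b_j - a_j)^2}\right).
\]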
The reader is referred to McDiarmid (1998) for more details and a proof of the inequality.
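Lemma A.1.3 (Bernstein’s Inequality). Let $X_1, \ldots, X_n$ be independent real-valued
random variables with zero mean, and assume that $X_j \le 1$ with probability 1 for each $j$. Let
$V_j = \mathrm{Var}(X_j)$ and $\sigma^2 = \frac{1}{n}\sum_{j=1}^{n} V_j$. For any $\epsilon > 0$,
\[
P\left[\frac{1}{n}\sum_{j=1}^{n} X_j > \epsilon\right]
\le \exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2\epsilon/3}\right). \tag{A.1}
\]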
Proofs of these inequalities can be found in Cesa-Bianchi and Lugosi (2006).
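Corollary A.1.4. Suppose $\tilde{W}_1, \tilde{W}_2, \ldots, \tilde{W}_n$ are independent Bernoulli random
variables with success probabilities $\beta_j$. By Bernstein’s inequality in (A.1),
\[
P\left(\sum_{j=1}^{n} \tilde{W}_j \le \Big(\sum_{j=1}^{n} \beta_j\Big)/2\right)
\le \exp\left(-\frac{3\sum_{j=1}^{n} \beta_j}{28}\right).
\]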
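The proof follows by applying (A.1) with $X_j = \beta_j - \tilde{W}_j$ and $n\epsilon = (\sum_{j=1}^{n} \beta_j)/2$, and
using $\sum_{j=1}^{n} \mathrm{Var}(X_j) = \sum_{j=1}^{n} \beta_j(1 - \beta_j) \le \sum_{j=1}^{n} \beta_j$. Note
that the same inequality holds for any Bernoulli random variable $W_j$ that takes the values
$a_j \le 1$ and $0$, for all $j \ge 1$.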
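The bound in Corollary A.1.4 can be checked against exact probabilities. The following sketch computes the exact distribution of a sum of independent Bernoulli variables by dynamic programming and compares the lower tail with $\exp(-3\sum_j \beta_j/28)$. The two sequences of success probabilities and all function names are hypothetical and serve only as an illustration.

```python
import math


def poisson_binomial_pmf(betas):
    """Exact pmf of a sum of independent Bernoulli(beta_j) variables."""
    pmf = [1.0]  # distribution of the empty sum
    for beta in betas:
        nxt = [0.0] * (len(pmf) + 1)
        for k, prob in enumerate(pmf):
            nxt[k] += prob * (1.0 - beta)  # current variable equals 0
            nxt[k + 1] += prob * beta      # current variable equals 1
        pmf = nxt
    return pmf


def lower_tail(betas):
    """Exact P[sum W_j <= (sum beta_j) / 2]."""
    half = sum(betas) / 2.0
    return sum(p for k, p in enumerate(poisson_binomial_pmf(betas)) if k <= half)


def corollary_bound(betas):
    """Right-hand side of the corollary: exp(-3 * sum(beta_j) / 28)."""
    return math.exp(-3.0 * sum(betas) / 28.0)


# Assumed example sequences of success probabilities (not from the dissertation).
examples = {
    "equal": [0.2] * 100,
    "increasing": [0.1 + 0.8 * j / 49 for j in range(50)],
}
for name, betas in examples.items():
    print(f"{name}: exact {lower_tail(betas):.3e}, bound {corollary_bound(betas):.3e}")
```

The exact tail is considerably smaller than the bound in both cases, which is expected since the constant $3/28$ comes from bounding the variances $\beta_j(1 - \beta_j)$ by $\beta_j$.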
Bernstein’s inequality has also been extended to the case of martingales.
Lemma A.1.5 (Bernstein’s Inequality for Martingales). Let $(\Omega, \mathcal{F}, P)$ be a probability
space. Let $\mathcal{F}_j$, $j = 1, 2, \ldots$, be an increasing filtration of sub-$\sigma$-fields of $\mathcal{F}$. Let $X_1, X_2, \ldots$
be random variables on $(\Omega, \mathcal{F}, P)$ such that $X_j$ is $\mathcal{F}_j$-measurable. Assume $|X_j| \le K$
with probability 1, for all $j \ge 1$. Let $V_j = \mathrm{Var}(X_j \mid \mathcal{F}_{j-1})$ denote the
conditional variances. Then for all positive real numbers $\epsilon$ and $v$,
\[
P\left(\sum_{j=1}^{n} \big(X_j - E(X_j \mid \mathcal{F}_{j-1})\big) > \epsilon,\ \sum_{j=1}^{n} V_j \le v\right)
\le \exp\left(-\frac{\epsilon^2}{2(v + K\epsilon/3)}\right).
\]
The proof of this inequality can be found in Freedman (1975).
Corollary A.1.6 (Extended Bernstein Inequality). Suppose $\{\mathcal{F}_j, j = 1, 2, \ldots\}$ is an
increasing filtration of $\sigma$-fields. For each $j \ge 1$, let $W_j$ be an $\mathcal{F}_j$-measurable Bernoulli
random variable whose conditional success probability satisfies
\[
P(W_j = 1 \mid \mathcal{F}_{j-1}) \ge \beta_j
\]
for some $\beta_j \in [0, 1]$. Then given $n \ge 1$,
\[
P\left(\sum_{j=1}^{n} W_j \le \Big(\sum_{j=1}^{n} \beta_j\Big)/2\right)
\le \exp\left(-\frac{3\sum_{j=1}^{n} \beta_j}{28}\right). \tag{A.2}
\]
The proof for this can be found in Qian and Yang (2016a).
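To illustrate the setting of (A.2), in which the success probabilities may depend on the past, the sketch below computes an exact lower-tail probability for one adaptive rule and compares it with the bound. The rule in `success_prob` is a hypothetical example chosen only so that $P(W_j = 1 \mid \mathcal{F}_{j-1}) \ge \beta$ always holds; it is not an allocation rule from the dissertation.

```python
import math


def success_prob(j, successes, beta):
    """Hypothetical adaptive rule: P(W_j = 1 | past) lies in [beta, 1)
    and increases with the number of past failures."""
    failures = (j - 1) - successes
    return beta + (1.0 - beta) * failures / j


def adaptive_lower_tail(n, beta):
    """Exact P[sum_{j<=n} W_j <= n * beta / 2] under the rule above."""
    dist = [1.0]  # dist[s] = probability of s successes so far
    for j in range(1, n + 1):
        nxt = [0.0] * (len(dist) + 1)
        for s, prob in enumerate(dist):
            p = success_prob(j, s, beta)
            nxt[s] += prob * (1.0 - p)
            nxt[s + 1] += prob * p
        dist = nxt
    return sum(prob for s, prob in enumerate(dist) if s <= n * beta / 2.0)


def extended_bernstein_bound(n, beta):
    """Right-hand side of (A.2) with beta_j = beta for all j."""
    return math.exp(-3.0 * n * beta / 28.0)


for n, beta in [(40, 0.3), (100, 0.1)]:  # assumed example values
    exact = adaptive_lower_tail(n, beta)
    bound = extended_bernstein_bound(n, beta)
    print(f"n={n}, beta={beta}: exact {exact:.3e}, bound {bound:.3e}")
```

Because the number of successes is a sufficient summary of the past under this rule, the exact distribution can be tracked with a single array; a rule depending on the full history would require simulation instead.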
Lemma A.1.7. Suppose $\{\mathcal{F}_j, j = 1, 2, \ldots\}$ is an increasing filtration of $\sigma$-fields. For
each $j \ge 1$, let $\epsilon_j$ be an $\mathcal{F}_{j+1}$-measurable random variable that satisfies $E(\epsilon_j \mid \mathcal{F}_j) = 0$,
and let $W_j$ be an $\mathcal{F}_j$-measurable random variable that is upper bounded by a constant
$C > 0$ in absolute value almost surely. If there exist positive constants $v$ and $c$ such
that for all $k \ge 2$ and $j \ge 1$, $E(|\epsilon_j|^k \mid \mathcal{F}_j) \le k!\, v^2 c^{k-2}/2$, then for every $\epsilon > 0$ and every
integer $n \ge 1$,
\[
P\left(\sum_{j=1}^{n} W_j \epsilon_j \ge n\epsilon\right)
\le \exp\left(-\frac{n\epsilon^2}{2C^2(v^2 + c\epsilon/C)}\right). \tag{A.3}
\]
Proof of Lemma A.1.7. Lemma A.1.7 is the same as Lemma 1 in Qian and Yang (2016a),
where the proof can be found.
A simplified version of Lemma A.1.7 can be stated as follows.
Corollary A.1.8. Let $\epsilon_1, \epsilon_2, \ldots$ be independent random variables satisfying the refined
Bernstein condition, that is, there exist positive constants $v$ and $c$ such that for all
$k \ge 2$ and $j \ge 1$, $E|\epsilon_j|^k \le k!\, v^2 c^{k-2}/2$. Let $I_1, I_2, \ldots$ be Bernoulli random variables such
that $I_j$ is independent of $\{\epsilon_l : l \ge j\}$ for all $j \ge 1$. For any $\epsilon > 0$,
\[
P\left(\sum_{j=1}^{n} I_j \epsilon_j \ge n\epsilon\right)
\le \exp\left(-\frac{n\epsilon^2}{2(v^2 + c\epsilon)}\right). \tag{A.4}
\]
The proof of this result can be found in Yang and Zhu (2002).
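The refined Bernstein condition is satisfied, for instance, by Gaussian errors. The sketch below checks the moment inequality $E|\epsilon_j|^k \le k!\, v^2 c^{k-2}/2$ numerically for $\epsilon_j \sim N(0, \sigma^2)$ with the choice $v = c = \sigma$, using the closed form $E|\epsilon_j|^k = \sigma^k 2^{k/2}\Gamma((k+1)/2)/\sqrt{\pi}$. The Gaussian assumption, the choice of constants and the function names are illustrative assumptions; the check covers only finitely many $k$ and is not a proof.

```python
import math


def gaussian_abs_moment(k, sigma):
    """E|Z|^k for Z ~ N(0, sigma^2): sigma^k * 2^(k/2) * Gamma((k+1)/2) / sqrt(pi)."""
    return sigma ** k * 2 ** (k / 2) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)


def bernstein_moment_bound(k, v, c):
    """Right-hand side of the refined Bernstein condition: k! v^2 c^(k-2) / 2."""
    return math.factorial(k) * v ** 2 * c ** (k - 2) / 2


def satisfies_refined_bernstein(sigma, v, c, max_k=30, rel_tol=1e-12):
    """Check the moment condition for k = 2, ..., max_k (a finite check only).

    The small relative tolerance absorbs floating-point error at k = 2,
    where the condition holds with equality when v = sigma.
    """
    return all(
        gaussian_abs_moment(k, sigma) <= bernstein_moment_bound(k, v, c) * (1 + rel_tol)
        for k in range(2, max_k + 1)
    )


for sigma, v, c in [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (1.0, 0.5, 1.0)]:
    print(f"sigma={sigma}, v={v}, c={c}: {satisfies_refined_bernstein(sigma, v, c)}")
```

The last triple fails already at $k = 2$, since the condition then requires $\sigma^2 \le v^2$.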
A.2 Notations
$\ell$                                number of arms
$n$                                   generic end time point; arms are pulled until time $n$
$i^*$                                 arm corresponding to the maximum mean reward
$\hat{i}_j$                           most promising arm so far based on the estimation procedure
$I_j$                                 arm chosen at the $j$th time point
$\eta, \eta_1, \eta_2$                allocation strategy
$\theta$                              unknown parameter used for parametric methods
$\mu_i$                               mean reward for arm $i$
$\mu^*$                               optimal mean reward ($\max_{1 \le i \le \ell} \mu_i$)
$\Delta_i$                            $\mu^* - \mu_i$
$F_i$                                 unknown reward distribution of arm $i$
$Y_{i,j}$                             reward for arm $i$ at the $j$th time point
$R_n(\eta)$                           cumulative reward for allocation strategy $\eta$
$r_n(\eta)$                           per-round regret for strategy $\eta$
$T_n(i)$                              number of observations from arm $i$ up to time $n$
$X, x$                                covariates: random, realized
$f_i(x)$                              mean reward function for arm $i$ at covariate $x$
$f^*(x)$                              optimal reward function at covariate $x$
$\epsilon_j$                          error in the regression model for the $j$th case
$d_j$                                 delay in observing the reward for the $j$th case
$t_j$                                 time of observing the $j$th reward
$G_j$                                 cumulative distribution function for $d_j$
$q(n)$                                lower bound for the expected number of observed rewards
$A_N$                                 set of observed indices up to time $N$
$X_n$                                 set of covariates observed until time $n$
$Z_n$                                 collection of past and present information used for estimation
$\tau_n$                              number of observed rewards by time $n$
$h_n$                                 bin width for the chosen nonparametric procedure
$\pi_n$                               exploration probability sequence
$J_{i,n+1}$                           set of observed indices by time $n$ corresponding to arm $i$
$Q_{n+1}(x), Q_{i,n+1}(x)$            indices corresponding to rewards observed in a small cube
                                      containing $x$, pertaining to arm $i$
$M_{n+1}(x), M_{i,n+1}(x)$            size of $Q_{n+1}(x)$, $Q_{i,n+1}(x)$
$m_0$                                 initialization count
$A, a_1, \tilde{a}_1, c, \bar{c}, L, c_5, v$   constants
$\mathcal{F}_n$                       sigma-field
$\sigma_t$                            time when the $t$th reward is observed
$M_\delta$                            probabilistic bound on the maximum difference between
                                      consecutive observed rewards
$e_n$                                 uniform lower bound for the cumulative distribution function of
                                      delays over the covariate space
$\gamma_j$                            indicator of whether the doctor disagrees with the algorithm
$T_j$                                 indicator of whether the doctor is allowed to make their decision
                                      for the $j$th special case
a.s.                                  almost surely
$M_n$                                 number of batches until time $n$
$d_k$                                 number of special cases in the $k$th batch
$m_k$                                 proportion of the $d_k$ special cases in which the doctor is allowed
                                      to make their decision
$\beta_k$                             threshold for the $k$th batch
$\omega$                              sample path for arms