
Rating: 8.5/10.
The standard theoretical textbook on bandits, written with a heavy focus on theory over practice, and quite comprehensive, with many proofs as well as notes and literature references. It first covers the vanilla multi-armed bandit case and then covers adversarial bandits, lower bounds, contextual bandits, etc.
Some themes are common across multiple types of bandits:
- Regret bounds come in two types: instance-dependent and worst-case (minimax). The instance-dependent one contains parameters related to the instance, like the gaps between each arm and the optimal arm, and is often much better than the worst-case bound over all possible instances.
- The worst-case bound is often achieved when there are two arms that are close enough to be hard to distinguish but far enough apart to generate regret. When you work out the math, this often leads to a sqrt(n) factor. But in most real-world cases, we can do better than that, and if an algorithm (eg: MOSS) is only able to achieve the worst-case bound, it's not very useful in practice.
- For different types of bandits, there is often a stochastic variant and an adversarial variant. The stochastic case typically requires subgaussian assumptions (basically that the result cannot be dominated by unlikely but influential events) and independence across arms and over time.
- The adversarial case is useful for modeling problems where we don't want strong assumptions about the instance, like independence or uniformity over time. In this setup the adversary gets to choose the input that triggers your algorithm's worst case, which forces the algorithm to return a distribution to randomly sample from in order to defeat the adversary.
Chapter 1: The multi-armed bandit problem is to try to maximize the cumulative reward, balancing exploration versus exploitation. The environment is unknown but assumed to lie in some class. The regret is the difference between a chosen policy and the optimal policy in that class. We often assume stationary stochastic bandits, where the rewards don't depend on previous actions or rewards, but many formulations have been studied. Many problems can be formulated as bandits, like medical trials, A/B testing, pricing, Monte Carlo tree search, etc.
Chapter 2: Review of probability: random variables, measurable sets and Borel sigma-algebras, and filtrations, where a filtration is a sequence of sigma-algebras representing information that's revealed over time. Definitions of conditional probability, independence, Riemann and Lebesgue integration, expectation, and the CDF based on integrals.
Chapter 3: A stochastic process is a sequence of random variables; a Markov chain is one where the behavior depends only on the last state and not the history, and the probability kernel gives the transition probabilities. A martingale is a sequence of random variables where the conditional expectation of the next value, given the history, equals the present value, and Doob's optional stopping theorem says that you cannot increase the expected value of a martingale by a clever stopping rule.
Chapter 4: Definition and notation of stochastic bandits. The horizon can be known or unknown. In unstructured bandits each arm is independent, while in structured bandits you can sometimes deduce properties of an arm without actually pulling it. The regret can be decomposed into the loss due to each of the arms, and one of the objectives is to find a policy with sublinear regret, in other words one that chooses the optimal action almost all of the time as the horizon goes to infinity.
Chapter 5: Concentration inequalities, beginning with Markov and Chebyshev, which are relatively weak bounds that hold for any random variable. For stronger bounds, a subgaussian random variable is one whose tail behavior is no worse than a Gaussian's, and using the Cramér-Chernoff method we can derive much stronger bounds that decay exponentially instead of polynomially. Examples of subgaussian random variables are the Gaussian and anything bounded, but not the exponential distribution.
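As a rough numeric illustration (my own toy numbers, not from the book) of how much stronger the exponential Cramér-Chernoff-style bound is than the polynomial Chebyshev one:

```python
import math

# Bound P(mean of n fair coin flips >= 0.5 + eps) two ways.
n, eps = 100, 0.25
chebyshev = 0.25 / (n * eps**2)        # Var(X_i) = 1/4: decays polynomially in n
hoeffding = math.exp(-2 * n * eps**2)  # subgaussian tail: decays exponentially in n
# chebyshev = 0.04, hoeffding ~ 3.7e-06
```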
Chapter 6: The Explore-then-Commit (ETC) algorithm explores uniformly for m steps and then commits to the best-looking arm, and you can derive worst-case bounds based on the number of arms, depending on whether n (the horizon) and delta (the suboptimality gap) are known. In practice this tends to be difficult because delta is not known, and when the horizon is also unknown there is a doubling trick: restart, double the length of the epoch, and reset the statistics, which has similar asymptotic properties. Another variant is epsilon-greedy, where you choose randomly with probability epsilon; when epsilon decays according to an appropriate schedule, the regret can be made logarithmic.
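A minimal sketch of Explore-then-Commit, assuming Bernoulli arms with made-up means and a made-up exploration length m:

```python
import random

def explore_then_commit(pull, k, m, n):
    """Explore each of k arms m times, then commit to the best empirical arm.

    pull(arm) returns a stochastic reward; n is the horizon.
    Returns the total reward collected."""
    means = [0.0] * k
    total = 0.0
    for arm in range(k):
        for _ in range(m):
            r = pull(arm)
            means[arm] += r / m        # accumulate the empirical mean
            total += r
    best = max(range(k), key=lambda a: means[a])
    for _ in range(n - k * m):         # commit for the remaining pulls
        total += pull(best)
    return total

random.seed(0)
true_means = [0.3, 0.5, 0.8]           # hypothetical Bernoulli arms
pull = lambda a: 1.0 if random.random() < true_means[a] else 0.0
reward = explore_then_commit(pull, k=3, m=50, n=10_000)
```

With a gap of 0.3 between the top two arms, 50 exploration pulls per arm identify the best arm with near-certainty, so the total reward lands close to 0.8 per step.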
Chapter 7: The UCB method pulls the arm with the largest upper confidence bound (mean + a confidence width that decreases with the number of pulls) and is parameterized by a delta parameter that is recommended to be set to 1/n^2. Regret analysis shows the regret tends to be logarithmic over time with the gap parameter in the denominator, meaning the regret will tend to be higher when there are arms close to the optimal and it takes more pulls to distinguish them. In the minimax worst case, the regret is O(sqrt(kn log n)). Phased UCB achieves similar bounds with delayed feedback and batched arm switching (common in real situations).
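A minimal UCB sketch on the same kind of toy Bernoulli setup (arm means are made up; the index follows the mean-plus-width form described above, with delta = 1/n^2):

```python
import math
import random

def ucb(pull, k, n, delta):
    """UCB: pull each arm once, then always pull the arm with the
    largest upper confidence bound (empirical mean + confidence width)."""
    counts = [0] * k
    means = [0.0] * k
    total = 0.0
    for t in range(n):
        if t < k:
            arm = t                    # initialization: one pull per arm
        else:
            arm = max(range(k), key=lambda a:
                      means[a] + math.sqrt(2 * math.log(1 / delta) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # running mean update
        total += r
    return total, counts

random.seed(1)
true_means = [0.2, 0.4, 0.7]           # hypothetical Bernoulli arms
pull = lambda a: 1.0 if random.random() < true_means[a] else 0.0
n = 10_000
total, counts = ucb(pull, k=3, n=n, delta=1 / n**2)
```

Because the confidence width shrinks as 1/sqrt(counts), the suboptimal arms are pulled only on the order of log(n)/gap^2 times, so almost all pulls end up on the best arm.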
Chapter 8: A version of UCB for when the horizon is not known; a slight modification of the algorithm replaces a constant with a function that varies over time. The regret in this case is similar to the fixed-horizon case and grows asymptotically with log n.
Chapter 9: The MOSS algorithm is a modification of UCB that explores more. It improves the worst-case regret bound by a log factor to sqrt(kn), ie, it has the optimal minimax regret. However, it's usually not preferred because of failure modes: eg, if two arms are similar and many arms are clearly worse, MOSS's regret can be much worse than the instance-dependent UCB regret because it explores the clearly bad arms a lot more, and its regret distribution is also unstable.
Chapter 10: Bernoulli bandits are the case where rewards are binary instead of continuous; then Chernoff bounds can be used to derive a variant, KL-UCB, that has a better regret bound than the regular sub-Gaussian UCB. The difference is a constant factor, and the asymptotic behavior is the same.
Chapter 11: Adversarial bandits. Model the adversary as choosing a sequence of rewards for each arm and step in advance (basically a spreadsheet), while knowing the algorithm. Any deterministic algorithm therefore has linear regret, so it is essential to randomize the policy. The importance-weighted estimator weighs the reward estimate by the probability of sampling the arm; this gives us the Exp3 algorithm, which uses importance weighting to estimate the reward of each arm, converts these estimates to a distribution via an exponential function, and then randomly samples from it; there is a tunable learning rate parameter as well. The regret bound is similar to the worst case for stochastic bandits, but the estimates have very high variance when there are arms with very small sampling probabilities.
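A sketch of the reward-based Exp3 variant described above, on a made-up two-armed instance (the learning-rate tuning below is one standard choice, not necessarily the book's):

```python
import math
import random

def exp3(pull, k, n, eta):
    """Reward-based Exp3: exponential weights over importance-weighted
    reward estimates; pull(arm) must return a reward in [0, 1]."""
    s = [0.0] * k                          # cumulative importance-weighted estimates
    total = 0.0
    for _ in range(n):
        m = max(s)                         # subtract max for numerical stability
        w = [math.exp(eta * (si - m)) for si in s]
        z = sum(w)
        p = [wi / z for wi in w]           # sampling distribution over arms
        arm = random.choices(range(k), weights=p)[0]
        r = pull(arm)
        s[arm] += r / p[arm]               # importance-weighted reward estimate
        total += r
    return total

random.seed(2)
true_means = [0.25, 0.75]                  # hypothetical stochastic arms
n, k = 5_000, 2
eta = math.sqrt(2 * math.log(k) / (n * k))  # one standard learning-rate tuning
total = exp3(lambda a: 1.0 if random.random() < true_means[a] else 0.0,
             k=k, n=n, eta=eta)
```

Note the `r / p[arm]` line: when `p[arm]` is tiny, a single sample produces a huge estimate, which is exactly the variance problem the next chapter addresses.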
Chapter 12: The high variance of the previous algorithm can be fixed by adding a constant to the denominator of the reward estimator. This is called Exp3-IX (for implicit exploration) and encourages exploration of worse arms a bit more. The analysis shows the regret increases by a constant factor over the base algorithm, but it has much lower variance (ie, the regret can be bounded above with high probability).
Chapter 13: Lower bounds try to prove that no possible bandit algorithm can do better than some bound in the worst case, defined by the minimax regret, which is the regret of the best possible policy measured on the worst possible environment. Constructing a hard instance for a given algorithm is done by fixing the algorithm, finding (by pigeonhole) two arms that it doesn't explore much, and then constructing a bandit where one of those arms is better than the other but the two are hard to tell apart using the available samples.
Chapter 14: Review of information theory. Entropy is the number of bits required to encode a variable using an optimal code; the Huffman code is an example construction of an optimal code. Relative entropy (or KL divergence) is the expected extra length when encoding using the other distribution's optimal code, and is a kind of pseudo-distance.
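The definitions can be checked with a few lines of arithmetic (the distributions here are made up):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    """Relative entropy D(p || q) in bits: the expected extra code length
    when encoding samples from p with a code that is optimal for q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

h = entropy([0.5, 0.5])            # a fair coin needs exactly 1 bit
d_pq = kl([0.5, 0.5], [0.9, 0.1])
d_qp = kl([0.9, 0.1], [0.5, 0.5])  # asymmetric: only a pseudo-distance
```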
Chapter 15: Construction of a worst-case bandit for an algorithm: find which arm it pulls the least in one environment, then change the reward of that arm so that in at least one of the two environments the algorithm has large regret. This leads to a minimax lower bound of sqrt(kn).
Chapter 16: Instance-dependent lower bounds give a bound on the regret based on the hardness of a specific instance, using the relative entropy between arms. For the lower bound to be meaningful, the algorithm must be consistent (ie, sub-polynomial regret on a whole class of bandits), which rules out algorithms that cheat by hard-coding the specific instance and only doing well on it. Under some reasonable assumptions, UCB is then optimal.
Chapter 17: Shows that for any minimax-optimal algorithm, the distribution of the regret must be heavy-tailed; also gives a lower bound for adversarial bandits which shows that Exp3 is essentially optimal.
Chapter 18: Contextual Bandits. In this chapter, we assume an adversarial bandit setup. In a naive version, there is a fixed set of contexts and we learn a separate bandit for each context, but this is a poor choice when the context set is large. Instead, expert advice is represented as a function mapping the context to a bandit arm (or a distribution over arms to randomize over), and the goal is to minimize the regret compared to choosing the best expert (predictive model) in hindsight. The Exp4 algorithm is given multiple frozen experts, learns a distribution over which expert to trust, and samples from it using an update rule similar to Exp3. The regret bounds are quite similar to the Exp3 case and grow only logarithmically with the number of experts. It is also possible to quantify the degree of expert agreement, and the regret bound is lower when the experts agree more.
Chapter 19: Stochastic Linear Bandits. We assume the reward is the dot product of an action vector and a parameter vector theta in a low-dimensional space, plus subgaussian noise. In each iteration we're given a feasible set of action vectors and choose one, trying to maximize reward while estimating theta. This gives the LinUCB algorithm, which maintains an online estimate of theta using regularized least squares on past rewards and then selects the most optimistic arm, defined by the maximum possible reward over a confidence set of plausible thetas centered at the current estimate (optimism in the face of uncertainty).
The confidence set is usually an ellipsoid centered at the current theta estimate and shrinking over time. Practically, this means a bonus term is added to the predicted mean for each arm, and we pick the arm with the highest mean + bonus, which can be controlled with an exploration factor. The regret scales with the dimension of the feature space, not the number of feasible actions.
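A LinUCB sketch in two dimensions with hand-rolled linear algebra (the action set, noise level, and exploration factor beta are all made up for illustration; real implementations tune beta from the confidence-set analysis):

```python
import math
import random

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def linucb(actions, reward, n, lam=1.0, beta=1.0):
    """LinUCB in d=2: ridge estimate of theta + optimistic ellipsoid bonus."""
    V = [[lam, 0.0], [0.0, lam]]       # V = lam*I + sum of x x^T
    b = [0.0, 0.0]                     # sum of r * x
    total = 0.0
    for _ in range(n):
        Vinv = inv2(V)
        theta = matvec(Vinv, b)        # regularized least-squares estimate
        # optimistic score: predicted mean + beta * ellipsoid width
        x = max(actions, key=lambda a:
                dot(a, theta) + beta * math.sqrt(dot(a, matvec(Vinv, a))))
        r = reward(x)
        for i in range(2):
            for j in range(2):
                V[i][j] += x[i] * x[j]
            b[i] += r * x[i]
        total += r
    return total, theta

random.seed(3)
theta_star = [1.0, 0.2]                # hypothetical unknown parameter
actions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
reward = lambda x: dot(x, theta_star) + random.gauss(0, 0.1)
total, theta_hat = linucb(actions, reward, n=2_000)
```

The bonus term `sqrt(x^T V^{-1} x)` is exactly the ellipsoid width in the direction of the action, so directions that have been sampled often get a small bonus.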
Chapter 20: Some extensions of the ellipsoid confidence set in LinUCB and its regret bound to the case where the available actions at the current step depend on past actions instead of being fixed.
Chapter 21: Choosing an optimal design (a distribution over actions to explore). The Kiefer-Wolfowitz theorem says the design that minimizes the worst-case variance is the one that maximizes the log determinant of the information matrix. This allows convex optimization algorithms like Frank-Wolfe to be used to maximize this quantity.
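A sketch of a Frank-Wolfe iteration for the D-optimal design on a made-up three-action set in d = 2 (the 1/(t+2) step size is one standard choice, not necessarily the book's):

```python
def kiefer_wolfowitz(X, iters=500):
    """Frank-Wolfe for the D-optimal design: maximize log det V(p),
    where V(p) = sum_i p_i x_i x_i^T over design weights p (d = 2 here)."""
    k, d = len(X), 2
    p = [1.0 / k] * k
    for t in range(iters):
        V = [[sum(p[i] * X[i][a] * X[i][b] for i in range(k))
              for b in range(d)] for a in range(d)]
        det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
        Vinv = [[V[1][1] / det, -V[0][1] / det],
                [-V[1][0] / det, V[0][0] / det]]
        # gradient of log det V(p) w.r.t. p_i is g_i = x_i^T V^{-1} x_i
        g = [sum(X[i][a] * Vinv[a][b] * X[i][b]
                 for a in range(d) for b in range(d)) for i in range(k)]
        j = max(range(k), key=lambda i: g[i])   # best vertex of the simplex
        gamma = 1.0 / (t + 2)                   # diminishing step size
        p = [(1 - gamma) * pi for pi in p]
        p[j] += gamma
    return p, max(g)

X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]        # hypothetical action set
p, gmax = kiefer_wolfowitz(X)
```

At the optimum the Kiefer-Wolfowitz equivalence condition holds: max_i x_i^T V^{-1} x_i = d, so `gmax` should approach 2, and the interior action [0.5, 0.5] should get vanishing weight.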
Chapter 22: A special case of linear bandits where the action set is finite and fixed rather than changing. Then you can get a better regret bound using a phase-based algorithm.
Chapter 23: Sparsity. We would like to be able to add features without increasing the regret too much if the features are not actually useful, ie, obtain a regret bound that depends on the number of non-zero features rather than the full feature dimension. It starts with a toy example where the action space is a binary hypercube, and you can prove a sparsity-dependent regret bound for an algorithm that explores each dimension randomly and only commits once confident, so it does not pay for dimensions that don't matter. Expanding this to the general case, we can use online linear regression to obtain regret guarantees, and this leads to the Online Linear Regression UCB (OLR-UCB) algorithm that wraps UCB around online linear regression. However, sparse online regression tends to be computationally hard.
Chapter 30: Combinatorial bandits are when the action set is a subset of a binary hypercube, eg selecting a route in a graph or a subset of items to show to a user. Exp3 is possible but computationally intractable because there are exponentially many actions. The semi-bandit feedback case is when we observe the reward for each selected coordinate and not just the total reward; here we can use a mirror descent algorithm that maintains a mean action vector and samples a real action matching that distribution, but this is computationally difficult due to the combinatorial nature of sampling the real action.
In the semi-bandit case, the follow-the-perturbed-leader algorithm is more tractable: maintain the cumulative past loss estimate vector, perturb it by adding noise to be adversarially robust, and then use an optimization oracle to choose the best action according to the perturbed loss vector (this is useful when the optimization is easy, eg shortest path). Importance weighting needs to be applied to keep the reward estimates unbiased after perturbing.
Chapter 31: Non-stationary bandits are when the environment changes over time; assume changes at discrete change points and define the regret relative to the optimal policy with the same number of change points. Exp4 over all possible policies is possible and achieves the matching lower bound, but is computationally intractable. Instead, we can modify Exp3 to never let the sampling probability drop below a clipped value; when the clip value is chosen appropriately, this achieves a similar bound for stochastic bandits. One might wonder whether UCB can be adapted, but it turns out that unlike Exp3, UCB cannot be made to work when the change points are not known in advance, as we can prove a lower bound of sqrt(n) regret in the worst case.
Chapter 32: The ranking problem is when we rank the top m items of a large pool of items and observe which ones were clicked, and we want to maximize clicks. There are several models, such as the document-based model (only the attractiveness of the document matters); the position-based model (click-through rate is a function of the document's attractiveness and its position); and the cascade model (users click on the first item that passes some attractiveness threshold, after rejecting all the previous ones). A general model combines all of these into a general attractiveness function. The TopRank algorithm learns a partition of the items into blocks where the first block is known with high confidence to be better than the second, and so on. It randomizes within each block but keeps a strict order across blocks to balance exploration and exploitation (although no contextual information is used here).
Chapter 33: In the best-arm identification problem, define the simple regret as the difference between the selected arm and the best arm after the trial period, ie, experimentation cost doesn't matter and we just want to pick the best arm. The simplest method is to explore uniformly; the simple regret then decreases exponentially fast, much faster than under cumulative-regret algorithms like UCB, which play the empirically best arm most of the time.
The fixed-confidence case has the clearest analysis. In this problem, we fix some confidence level and minimize the time needed so that we pick the best arm with probability higher than this confidence level. There is a lower bound of logarithmic time, dependent on the desired confidence level and the difficulty of the instance (ie, the gap between the best arm and the second best). This lower bound gives an asymptotic proportion of how many times each arm should be pulled.
The lower bound suggests the construction of the Track-and-Stop algorithm: first pull every arm with a minimal exploration factor; otherwise, track how many times each arm should have been pulled according to the lower-bound formula and pull whichever arm has the highest deficit relative to the theoretical proportion. It can be proven that this construction matches the lower bound on the stopping time needed to reach fixed confidence.
Another setup is the fixed-budget setting, where the learner is given a number of trials and wants to maximize the probability of finding the best arm. An easy algorithm is Sequential Halving: split the budget into equal phases, pull each surviving arm uniformly often within a phase, then discard the bottom half of the arms so that the next phase has half as many candidates. The probability of picking the wrong arm decreases exponentially, which is roughly optimal.
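A Sequential Halving sketch on made-up Bernoulli arms:

```python
import math
import random

def sequential_halving(pull, k, budget):
    """Sequential Halving: split the budget into log2(k) phases; in each
    phase pull every surviving arm equally often, then drop the worse half."""
    arms = list(range(k))
    rounds = math.ceil(math.log2(k))
    for _ in range(rounds):
        pulls = budget // (rounds * len(arms))   # equal share per surviving arm
        means = {a: sum(pull(a) for _ in range(pulls)) / pulls for a in arms}
        arms.sort(key=lambda a: means[a], reverse=True)
        arms = arms[:max(1, len(arms) // 2)]     # halve the candidate set
    return arms[0]

random.seed(4)
true_means = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]  # hypothetical arms
pull = lambda a: 1.0 if random.random() < true_means[a] else 0.0
best = sequential_halving(pull, k=8, budget=8_000)
```

Halving the candidate set each phase lets later phases spend twice as many pulls per surviving arm, which is what drives the exponentially small error probability.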
Chapter 34: Bayesian learning means updating the prior distribution with observations to obtain the posterior using Bayes' rule, but it requires some measure theory to make this rigorous for continuous distributions. A conjugate prior is one where the posterior has the same parametric form as the prior, as with Gaussian or Bernoulli (Beta) distributions. In the Gaussian case, a prior variance of zero means we always trust the prior, and an infinite prior variance means we always trust the data. Several conjugate-prior cases can be unified under the exponential family. The Bayesian bandit setup assumes the environment is sampled from the prior but unknown, and the posterior is updated as a policy acts on the arms. Bayesian regret is the expectation of the regret over the prior, whereas frequentist analysis considers the worst case.
Chapter 35: The Bayesian regret in the worst case is sqrt(kn), the same as the minimax regret, but the Bayesian regret is never larger. The optimal stopping problem is choosing when to take the last observed value of a sequence versus continuing, and it can be solved using backward induction in the finite case or Markov chains in the infinite case. The one-armed bandit problem with finite horizon can be reduced to optimal stopping, where stopping equals pulling the known-reward arm for the remainder of the game. For Bernoulli rewards this can be computed exactly using backward induction (DP), and it has lower average regret than the frequentist stopping rule. For Gaussian rewards a similar technique is possible but requires approximating a continuous function with quadratic segments and is rather messy.
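The Bernoulli backward induction can be sketched as a small dynamic program (a uniform Beta(1,1) prior is assumed; the horizon and the known arm's mean are made up):

```python
from functools import lru_cache

def one_armed_value(n, mu):
    """Backward induction for the Bayesian one-armed Bernoulli bandit:
    an unknown arm with a uniform (Beta(1,1)) prior vs a known arm paying mu.
    Stopping means switching to the known arm for the rest of the horizon."""
    @lru_cache(maxsize=None)
    def V(s, f, t):
        # s successes and f failures observed on the unknown arm at time t
        if t == n:
            return 0.0
        p = (s + 1) / (s + f + 2)          # posterior mean of the unknown arm
        stop = mu * (n - t)                # commit to the known arm forever
        go = p * (1 + V(s + 1, f, t + 1)) + (1 - p) * V(s, f + 1, t + 1)
        return max(stop, go)
    return V(0, 0, 0)

v = one_armed_value(n=30, mu=0.5)
```

The value exceeds 0.5 * 30 = 15 even though the unknown arm's prior mean equals mu, because playing it has option value: if it looks good you keep it, and if it looks bad you stop.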
In the infinite-horizon case, this reduces to a per-arm choice of either paying a fixed cost to receive a variable reward or ending the game, with a discount on future rewards. The Gittins index is the price at which playing another round and ending the game are equally good. One can prove that in the multi-armed setup, pulling the arm with the largest Gittins index is Bayesian optimal, giving a greedy algorithm that is cleaner than in the finite-horizon case. Computing the Gittins index can be done by approximating with a finite set of states and solving a matrix inversion problem.
Chapter 36: Thompson sampling randomly picks a plausible world from the current posterior and acts optimally in that world. In the k-armed stochastic bandit setting there is a clean proof that this achieves optimal Bayesian regret. Thompson sampling can also be analyzed in the frequentist setting, similarly to follow-the-perturbed-leader, but the regret analysis is more complex than in the Bayesian framework. For linear bandits, Thompson sampling is computationally easier: sample a parameter from the posterior and choose the optimal action for the sampled parameter. This is much easier than optimizing over confidence ellipsoids as in LinUCB, and it achieves the same regret bounds. For the Bayesian version of adversarial bandits, the reward matrix is assumed to be drawn from a prior, and the algorithm samples each arm according to the conditional probability that it is optimal; the regret is similar to Exp3.
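A sketch of Thompson sampling for Bernoulli arms with Beta(1,1) priors (the arm means are made up):

```python
import random

def thompson_bernoulli(pull, k, n):
    """Thompson sampling for Bernoulli bandits: sample a mean for each arm
    from its Beta posterior, then act greedily on the sampled means."""
    alphas = [1] * k                   # Beta posterior: successes + 1
    betas = [1] * k                    # Beta posterior: failures + 1
    total = 0.0
    for _ in range(n):
        samples = [random.betavariate(alphas[a], betas[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])   # optimal in sampled world
        r = pull(arm)
        total += r
        if r > 0:
            alphas[arm] += 1           # conjugate posterior update
        else:
            betas[arm] += 1
    return total

random.seed(5)
true_means = [0.3, 0.5, 0.8]           # hypothetical Bernoulli arms
pull = lambda a: 1.0 if random.random() < true_means[a] else 0.0
total = thompson_bernoulli(pull, k=3, n=10_000)
```

Exploration happens automatically: an under-sampled arm has a wide posterior, so its sampled mean is occasionally the largest and it gets pulled again.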



