In multi-armed bandit problems, KL-UCB (for Kullback–Leibler Upper Confidence Bound) is a UCB-type algorithm that is asymptotically optimal, in the sense that its regret matches the problem-dependent Lai-Robbins lower bound. == Multi-armed bandit problem == The Multi-armed bandit problem is a sequential game where one player has to choose at each turn between K {\displaystyle K} actions (arms). Behind every arm a {\displaystyle a} there is an unknown distribution ν a {\displaystyle \nu _{a}} that lies in a set D {\displaystyle {\mathcal {D}}} known by the player (for example, D {\displaystyle {\mathcal {D}}} can be the set of Gaussian distributions or Bernoulli distributions). At each turn t {\displaystyle t} the player chooses (pulls) an arm a t {\displaystyle a_{t}} , he then gets an observation X t {\displaystyle X_{t}} of the distribution ν a t {\displaystyle \nu _{a_{t}}} . === Regret minimization === The goal is to minimize the regret at time T {\displaystyle T} that is defined as R T := ∑ a = 1 K Δ a E [ N a ( T ) ] {\displaystyle R_{T}:=\sum _{a=1}^{K}\Delta _{a}\mathbb {E} [N_{a}(T)]} where μ a := E [ ν a ] {\displaystyle \mu _{a}:=\mathbb {E} [\nu _{a}]} is the mean of arm a {\displaystyle a} μ ∗ := max a μ a {\displaystyle \mu ^{}:=\max _{a}\mu _{a}} is the highest mean Δ a := μ ∗ − μ a {\displaystyle \Delta _{a}:=\mu ^{}-\mu _{a}} N a ( t ) {\displaystyle N_{a}(t)} is the number of pulls of arm a {\displaystyle a} up to turn t {\displaystyle t} The player has to find an algorithm that chooses at each turn t {\displaystyle t} which arm to pull based on the previous actions and observations ( a s , X s ) s < t {\displaystyle (a_{s},X_{s})_{s μ } {\displaystyle {\mathcal {K}}_{inf}(\nu ,\mu ,{\mathcal {D}}):=\inf \left\{\mathrm {KL} (\nu ,{\tilde {\nu }})\ |\ {\tilde {\nu }}\in {\mathcal {D}},\ \mathbb {E} [{\tilde {\nu }}]>\mu \right\}} K L {\displaystyle \mathrm {KL} } is the Kullback–Leibler divergence ν ^ a ( t ) {\displaystyle {\hat {\nu }}_{a}(t)} is the empirical distribution of arm a {\displaystyle a} at turn t {\displaystyle t} δ t {\displaystyle \delta _{t}} is a well-chosen sequence of positive numbers, often equal to ln t + c ln ln t {\displaystyle \ln t+c\ln \ln t} with c > 0 {\displaystyle c>0} . Then we choose the arm a t {\displaystyle a_{t}} with the highest index: a t := arg max a U a ( t ) {\displaystyle a_{t}:=\arg \max _{a}U_{a}(t)} We note that the algorithm does not require knowledge of T {\displaystyle T} . === Example === In the special case of Gaussian distribution with fixed variance σ 2 {\displaystyle \sigma ^{2}} , we have: U a ( t ) = μ ^ a ( t ) + 2 σ 2 δ t N a ( t ) {\displaystyle U_{a}(t)={\hat {\mu }}_{a}(t)+{\sqrt {\frac {2\sigma ^{2}\delta _{t}}{N_{a}(t)}}}} with μ ^ a ( t ) {\displaystyle {\hat {\mu }}_{a}(t)} being the empirical mean of arm a {\displaystyle a} at turn t {\displaystyle t} . === Pseudocode === The player gets the set D for each arm i do: n[i] ← 1; nu[i] ← None; d ← ln(K) for t from 1 to K do: select arm t observe reward r n[t] ← n[t] + 1 nu[t] ← update empirical distribution for t from K+1 to T do: for each arm i do: index[i] ← compute_index(n[i], nu[i], D, d) select arm a with highest index[a] observe reward r n[a] ← n[a] + 1 nu[a] ← update empirical distribution d ← ln(t+1) == Theoretical results == In the multi-armed bandit problem we have the Lai–Robbins asymptotic lower bound on regret. The algorithm KL-UCB matches this lower bound for one-dimensional exponential families with δ t := ln t + 3 ln ln t {\displaystyle \delta _{t}:=\ln t+3\ln \ln t} and for distributions bounded in [ 0 , 1 ] {\displaystyle [0,1]} with δ t := ln t + ln ln t {\displaystyle \delta _{t}:=\ln t+\ln \ln t} . === Lai–Robbins lower bound === In 1985 Lai and Robbins proved an asymptotic, problem-dependent lower bound on regret. It states that for every consistent algorithm on the set D {\displaystyle {\mathcal {D}}} — that is, an algorithm for which, for every ( ν 1 , … , ν K ) ∈ D K {\displaystyle (\nu _{1},\dots ,\nu _{K})\in {\mathcal {D}}^{K}} , the regret R T {\displaystyle R_{T}} is subpolynomial (i.e. R T = o T → + ∞ ( T α ) {\displaystyle R_{T}=o_{T\to +\infty }(T^{\alpha })} for all α > 0 {\displaystyle \alpha >0} ) — we have: R T ≥ ( ∑ a : μ a < μ ∗ Δ a K inf ( ν a , μ ∗ , D ) ) ln T + o T → + ∞ ( ln T ) . {\displaystyle R_{T}\geq \left(\sum _{a:\mu _{a}<\mu ^{}}{\frac {\Delta _{a}}{{\mathcal {K}}_{\inf }(\nu _{a},\mu ^{},{\mathcal {D}})}}\right)\ln T+o_{T\to +\infty }(\ln T).} This bound is asymptotic (as T → + ∞ {\displaystyle T\to +\infty } ) and gives a first-order lower bound of order ln T {\displaystyle \ln T} with the optimal constant in front of it. === Regret bound for KL-UCB === The algorithm matches the Lai–Robbins lower bound for one-dimensional exponential-family distributions and for distributions bounded in [ 0 , 1 ] {\displaystyle [0,1]} . ==== One-dimensional exponential family ==== For D {\displaystyle {\mathcal {D}}} being the set of one-dimensional exponential families, with δ t := ln t + 3 ln ln t {\displaystyle \delta _{t}:=\ln t+3\ln \ln t} we have the following upper bound on the regret of KL-UCB: R T ≤ ( ∑ a : μ a < μ ∗ Δ a K inf ( ν a , μ ∗ , D ) ) ln T + O T ( ln T ) . {\displaystyle R_{T}\leq \left(\sum _{a:\mu _{a}<\mu ^{}}{\frac {\Delta _{a}}{{\mathcal {K}}_{\inf }(\nu _{a},\mu ^{},{\mathcal {D}})}}\right)\ln T+O_{T}({\sqrt {\ln T}}).} ==== Bounded distributions in [0,1] ==== For D = P ( [ 0 , 1 ] ) {\displaystyle {\mathcal {D}}={\mathcal {P}}([0,1])} (the set of distributions supported on [ 0 , 1 ] {\displaystyle [0,1]} ), and for δ t := ln t + ln ln t {\displaystyle \delta _{t}:=\ln t+\ln \ln t} , we have the following upper bound on the regret of KL-UCB: R T ≤ ( ∑ a : μ a < μ ∗ Δ a K inf ( ν a , μ ∗ , D ) ) ln T + O T ( ( ln T ) 4 / 5 ln ln T ) . {\displaystyle R_{T}\leq \left(\sum _{a:\mu _{a}<\mu ^{}}{\frac {\Delta _{a}}{{\mathcal {K}}_{\inf }(\nu _{a},\mu ^{},{\mathcal {D}})}}\right)\ln T+O_{T}{\big (}(\ln T)^{4/5}\ln \ln T{\big )}.} === Runtime === For D = P ( [ 0 , 1 ] ) {\displaystyle {\mathcal {D}}={\mathcal {P}}([0,1])} , the runtime needed per step and for an arm k {\displaystyle k} with n {\displaystyle n} observations is O ( n ( ln n ) 2 ) {\displaystyle {\mathcal {O}}{\big (}n(\ln n)^{2}{\big )}} . This is higher than that of other optimal algorithms, such as NPTS with O ( n ) {\displaystyle {\mathcal {O}}(n)} . MED with O ( n ln n ) {\displaystyle {\mathcal {O}}(n\ln n)} . and IMED with O ( n ln n ) {\displaystyle {\mathcal {O}}(n\ln n)} . The high runtime of KL-UCB is due to a two-level optimisation: for each arm and candidate mean μ {\displaystyle \mu } , the algorithm evaluates K inf ( ν ^ a ( t ) , μ , D ) {\displaystyle {\mathcal {K}}_{\inf }({\hat {\nu }}_{a}(t),\mu ,{\mathcal {D}})} and then maximises μ {\displaystyle \mu } subject to N a ( t ) K inf ( ν ^ a ( t ) , μ , D ) ≤ δ t {\displaystyle N_{a}(t)\,{\mathcal {K}}_{\inf }({\hat {\nu }}_{a}(t),\mu ,{\mathcal {D}})\leq \delta _{t}} . For distributions bounded in [ 0 , 1 ] {\displaystyle [0,1]} the inner problem has no closed form and must be solved numerically, which increases the per-step cost.
Read more →