Reinforcement Learning & Bandits 2

The session Reinforcement Learning & Bandits 2 will be held on Thursday, 2019-09-19, from 14:00 to 16:00, in room 0.004 (AOK-HS). The session chair is Michèle Sébag.


15:00 - 15:20
Practical Open-Loop Optimistic Planning (407)
Edouard Leurent (SequeL team, INRIA Lille - Nord Europe; Renault Group), Odalric-Ambrym Maillard (SequeL team, INRIA Lille - Nord Europe)

We consider the problem of online planning in a Markov Decision Process when given only access to a generative model, restricted to open-loop policies, i.e. sequences of actions, and under a budget constraint. In this setting, the Open-Loop Optimistic Planning (OLOP) algorithm enjoys good theoretical guarantees but is overly conservative in practice, as we show in numerical experiments. We propose a modified version of the algorithm with tighter upper-confidence bounds, KL-OLOP, that leads to better practical performance while retaining the sample complexity bound. Finally, we propose an efficient implementation that significantly improves the time complexity of both algorithms.
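As a rough illustration of the kind of KL-based index such tighter upper-confidence bounds rely on, here is a minimal Bernoulli KL upper-confidence bound computed by bisection, in the style of kl-UCB. This is a generic sketch, not the paper's exact bound; the function names and the exploration threshold are illustrative assumptions.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, count, budget, precision=1e-6):
    """Largest q in [mean, 1] with count * kl(mean, q) <= log(budget),
    found by bisection: the KL upper-confidence bound on the mean."""
    if count == 0:
        return 1.0
    threshold = math.log(budget) / count
    lo, hi = mean, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo
```

By Pinsker's inequality the resulting confidence interval is never wider than the corresponding Hoeffding interval, which is the sense in which KL bounds are "tighter".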

Reproducible Research
15:20 - 15:40
An Engineered Empirical Bernstein Bound (435)
Mark A. Burgess (Australian National University), Archie C. Chapman (University of Sydney), Paul Scott (Australian National University)

We derive a tightened empirical Bernstein bound (EBB) on the variation of the sample mean from the population mean, and show that it improves the performance of upper confidence bound (UCB) methods in multi-armed bandit problems. Like other EBBs, our EBB is a concentration inequality for the variation of the sample mean in terms of the sample variance. Its derivation uses a combination of probability unions and Chernoff bounds for the mean of samples and mean of sample squares. Analysis reveals that our approach can tighten the best existing EBBs by about a third, and thereby halves the distance to a bound constructed with perfect variance information. We illustrate the practical usefulness of our novel EBB by applying it to a multi-armed bandit problem as a component of a UCB method. Our method outperforms existing approaches by producing lower expected regret than variants of UCB employing several other bounds, including state-of-the-art EBBs.
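The standard empirical-Bernstein machinery the abstract builds on can be sketched with the well-known Maurer-Pontil EBB, used as a UCB exploration radius. This is the classical bound for illustration only, not the authors' tightened version; the function names are assumptions.

```python
import math

def empirical_bernstein_radius(sample_var, n, delta):
    """Maurer-Pontil empirical Bernstein confidence radius for the mean
    of n i.i.d. samples in [0, 1]; holds with probability >= 1 - delta."""
    if n < 2:
        return float("inf")
    log_term = math.log(2.0 / delta)
    return (math.sqrt(2.0 * sample_var * log_term / n)
            + 7.0 * log_term / (3.0 * (n - 1)))

def ucb_eb_index(sample_mean, sample_var, n, delta):
    """UCB index for an arm: empirical mean plus the EBB radius."""
    return sample_mean + empirical_bernstein_radius(sample_var, n, delta)
```

The key property, visible in the first term, is that the radius scales with the sample variance rather than the worst-case range, so low-variance arms get tighter intervals than Hoeffding-based UCB would give.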

Reproducible Research
14:00 - 14:20
Stochastic Activation Actor Critic Methods (483)
Wenling Shang (University of Amsterdam-Bosch-Deltalab), Herke van Hoof (University of Amsterdam-Bosch-Deltalab), Max Welling (University of Amsterdam-Bosch-Deltalab)

Stochastic elements in reinforcement learning (RL) have shown promise to improve exploration and handling of uncertainty, such as the utilization of stochastic weights in NoisyNets and stochastic policies in the maximum entropy RL frameworks. Yet effective and general approaches to include such elements in actor-critic models are still lacking. Inspired by the aforementioned techniques, we propose an effective way to inject randomness into actor-critic models to improve general exploratory behavior and reflect environment uncertainty. Specifically, randomness is added at the level of intermediate activations that feed into both policy and value functions to achieve better correlated and more complex perturbations. The proposed framework also features flexibility and simplicity, which allows straightforward adaptation to a variety of tasks. We test several actor-critic models enhanced with stochastic activations and demonstrate their effectiveness in a wide range of Atari 2600 games, a continuous control problem and a car racing task. Lastly, in a qualitative analysis, we present evidence of the proposed model adapting the noise in the policy and value functions to reflect uncertainty and ambiguity in the environment.
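A minimal NumPy sketch of the core idea, Gaussian noise injected into a shared hidden layer that feeds both the policy and value heads so both see the same correlated perturbation. The toy linear architecture and parameter names are illustrative assumptions, not the authors' models.

```python
import numpy as np

rng = np.random.default_rng(0)

def actor_critic_forward(obs, params, noise_scale=0.1, rng=rng):
    """Forward pass of a tiny actor-critic with stochastic activations:
    noise is added to the shared hidden layer, so the policy and value
    heads receive the same correlated perturbation."""
    h = np.tanh(obs @ params["W_shared"])               # shared trunk
    h = h + noise_scale * rng.standard_normal(h.shape)  # stochastic activation
    logits = h @ params["W_policy"]
    policy = np.exp(logits - logits.max())
    policy = policy / policy.sum()                      # softmax over actions
    value = float(h @ params["W_value"])                # scalar state value
    return policy, value

# toy setup: 4-dim observation, 8 hidden units, 3 actions
params = {
    "W_shared": rng.standard_normal((4, 8)) * 0.1,
    "W_policy": rng.standard_normal((8, 3)) * 0.1,
    "W_value": rng.standard_normal(8) * 0.1,
}
policy, value = actor_critic_forward(rng.standard_normal(4), params)
```

Injecting noise at the shared activations, rather than independently at each head, is what makes the policy and value perturbations correlated.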

Reproducible Research
14:40 - 15:00
Stochastic One-Sided Full-Information Bandit (646)
Haoyu Zhao (Tsinghua University), Wei Chen (Microsoft Research, Beijing)

In this paper, we study the stochastic version of the one-sided full-information bandit problem, where we have K arms [K] = {1, 2, ..., K}, and playing arm i gains a reward from an unknown distribution for arm i while obtaining reward feedback for all arms j > i. One-sided full-information bandits can model online repeated second-price auctions, where the auctioneer selects the reserve price in each round and the bidders only reveal their bids when their bids are higher than the reserve price. We present an elimination-based algorithm to solve the problem. Our elimination-based algorithm achieves a distribution-independent regret upper bound O(√(T·log(TK))) and a distribution-dependent bound O((log T + K)·f(Δ)), where T is the time horizon, Δ is the vector of gaps between the mean reward of each arm and the mean reward of the best arm, and f(Δ) is a formula depending on the gap vector that we specify in detail. Our algorithm has the best theoretical regret upper bound so far. We also validate our algorithm empirically against other possible alternatives.
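A hedged sketch of a generic elimination scheme exploiting this one-sided feedback structure: playing the smallest active arm yields feedback for every active arm at once, and arms whose upper confidence bound falls below the best lower bound are dropped. The Hoeffding radii and all names here are illustrative assumptions, not the authors' algorithm or its regret bounds.

```python
import math
import random

def one_sided_elimination(sample_reward, K, horizon, delta=0.05):
    """Elimination sketch for the one-sided full-information bandit.
    Playing arm i yields its reward and reveals rewards of all arms j > i,
    so playing the smallest active arm updates every active arm per round.
    `sample_reward(j)` draws one reward in [0, 1] for arm j."""
    active = list(range(K))
    sums = [0.0] * K
    counts = [0] * K
    for _ in range(horizon):
        i = active[0]                        # smallest active arm
        for j in range(i, K):                # feedback for i and all j > i
            sums[j] += sample_reward(j)
            counts[j] += 1
        # all active arms share the same count, hence the same radius
        radius = math.sqrt(math.log(2 * K * horizon / delta) / (2 * counts[i]))
        means = {a: sums[a] / counts[a] for a in active}
        best_lcb = max(means[a] - radius for a in active)
        active = [a for a in active if means[a] + radius >= best_lcb]
    return max(active, key=lambda a: sums[a] / counts[a])
```

Note how the feedback structure removes the usual exploration dilemma over which arm to sample: the only question is which arms can be safely eliminated.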

15:40 - 16:00
BelMan: An Information-Geometric Approach to Stochastic Bandits (783)
Debabrota Basu (Chalmers University of Technology), Pierre Senellart (DI ENS, ENS, CNRS, PSL University; INRIA), Stéphane Bressan (National University of Singapore)

We propose a Bayesian information-geometric approach to the exploration-exploitation trade-off in stochastic multi-armed bandits. The uncertainty on reward generation and belief is represented using the manifold of joint distributions of rewards and beliefs. Accumulated information is summarised by the barycentre of joint distributions, the pseudobelief-reward. While the pseudobelief-reward facilitates information accumulation through exploration, another mechanism is needed to increase exploitation by gradually focusing on higher rewards, the pseudobelief-focal-reward. Our resulting algorithm, BelMan, alternates between projection of the pseudobelief-focal-reward onto belief-reward distributions to choose the arm to play, and projection of the updated belief-reward distributions onto the pseudobelief-focal-reward. We theoretically prove BelMan to be asymptotically optimal and to incur a sublinear regret growth. We instantiate BelMan to stochastic bandits with Bernoulli and exponential rewards, and to a real-life application of scheduling queueing bandits. Comparative evaluation with the state of the art shows that BelMan is not only competitive for Bernoulli bandits but in many cases also outperforms other approaches for exponential and queueing bandits.
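For concreteness, the Bernoulli belief-maintenance step that such Bayesian bandit approaches operate on is a conjugate Beta update. This sketch shows only that component under a uniform prior, assumed here for illustration; BelMan's information-geometric projection steps are not reproduced.

```python
# Beta(alpha, beta) belief over each Bernoulli arm's mean reward
def update_belief(belief, arm, reward):
    """Conjugate update: Beta(a, b) -> Beta(a + r, b + 1 - r) for r in {0, 1}."""
    a, b = belief[arm]
    belief[arm] = (a + reward, b + 1 - reward)

belief = {arm: (1.0, 1.0) for arm in range(3)}  # uniform Beta(1, 1) priors
update_belief(belief, 0, 1)
# belief[0] is now (2.0, 1.0): posterior mean 2/3
```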

Reproducible Research
14:20 - 14:40
Compatible Natural Gradient Policy Search (J24)
Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, Gerhard Neumann
