Reinforcement Learning & Bandits 1

The session Reinforcement Learning & Bandits 1 will be held on Wednesday, 2019-09-18, from 14:00 to 16:00, in room 0.004 (AOK-HS). The session chair is Kurt Driessens.


14:00 - 14:20
Deep Ordinal Reinforcement Learning (17)
Alexander Zap (TU Darmstadt), Tobias Joppen (TU Darmstadt), Johannes Fürnkranz (TU Darmstadt)

Reinforcement learning usually makes use of numerical rewards, which have nice properties but also come with drawbacks and difficulties. Using rewards on an ordinal scale (ordinal rewards) is an alternative to numerical rewards that has received more attention in recent years. In this paper, a general approach to adapting reinforcement learning problems to the use of ordinal rewards is presented and motivated. We show how to convert common reinforcement learning algorithms to an ordinal variation by the example of Q-learning and introduce Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal rewards. Additionally, we run evaluations on problems provided by the OpenAI Gym framework, showing that our ordinal variants exhibit a performance that is comparable to the numerical variations for a number of problems. We also give first evidence that our ordinal variant is able to produce better results for problems with less engineered and simpler-to-design reward signals.

Reproducible Research
14:40 - 15:00
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics (48)
Denis Steckelmacher (Vrije Universiteit Brussel), Hélène Plisnier (Vrije Universiteit Brussel), Diederik M. Roijers (VU Amsterdam), Ann Nowé (Vrije Universiteit Brussel)

Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks.
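
The actor update described above ("slowly imitating the average greedy policy of the critics") can be illustrated with a minimal tabular sketch for a single state. This is not the authors' implementation: the function names, the list-based representation, and the `rate` parameter are assumptions made for illustration, and the training of the critics themselves is omitted.

```python
def greedy_one_hot(q_values):
    """Return a one-hot distribution on the highest-valued action."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def average_greedy_policy(critics_q):
    """Average the greedy policies of several critics for one state.

    critics_q holds one list of Q-values per critic (hypothetical layout).
    """
    n_actions = len(critics_q[0])
    avg = [0.0] * n_actions
    for q_values in critics_q:
        for action, prob in enumerate(greedy_one_hot(q_values)):
            avg[action] += prob / len(critics_q)
    return avg


def actor_step(actor_probs, critics_q, rate=0.05):
    """Move the actor a small step toward the critics' average greedy policy.

    rate is an assumed step size; a small value gives the slow imitation
    mentioned in the abstract.
    """
    target = average_greedy_policy(critics_q)
    return [(1.0 - rate) * p + rate * t for p, t in zip(actor_probs, target)]


# Three critics, three actions: two critics prefer action 1, one prefers action 0.
critics = [[1.0, 2.0, 0.0], [0.0, 3.0, 1.0], [5.0, 1.0, 1.0]]
actor = [1.0 / 3.0] * 3
for _ in range(100):
    actor = actor_step(actor, critics)
```

Because the update is a convex combination of two probability distributions, the actor remains a valid distribution after every step. Actions that no critic considers greedy lose probability only gradually, which is one way to read the exploration behaviour the abstract mentions.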

Reproducible Research
15:20 - 15:40
Learning 3D Navigation Protocols on Touch Interfaces with Cooperative Multi-Agent Reinforcement Learning (213)
Quentin Debard (Itekube, LIRIS), Jilles Steeve Dibangoye (Inria, CITI-Lab, INSA-Lyon), Stéphane Canu (LITIS, INSA-Rouen), Christian Wolf (LIRIS, Inria, CITI-Lab, INSA-Lyon)

Using touch devices to navigate in virtual 3D environments such as computer-assisted design (CAD) models or geographical information systems (GIS) is inherently difficult for humans, as the 3D operations have to be performed by the user on a 2D touch surface. This ill-posed problem is classically solved with a fixed and handcrafted interaction protocol, which must be learned by the user. We propose to automatically learn a new interaction protocol that maps a 2D user input to 3D actions in virtual environments using reinforcement learning (RL). A fundamental problem of RL methods is the vast amount of interactions often required, which are difficult to come by when humans are involved. To overcome this limitation, we make use of two collaborative agents. The first agent models the human by learning to perform the 2D finger trajectories. The second agent acts as the interaction protocol, interpreting the 2D finger trajectories from the first agent and translating them into 3D operations. We restrict the learned 2D trajectories to be similar to a training set of collected human gestures by first performing state representation learning, prior to reinforcement learning. This state representation learning is addressed by projecting the gestures into a latent space learned by a variational autoencoder (VAE).

15:40 - 16:00
Safe Policy Improvement with Soft Baseline Bootstrapping (339)
Kimia Nadjahi (Télécom Paris), Romain Laroche (Microsoft Research Montréal), Rémi Tachet des Combes (Microsoft Research Montréal)

Batch Reinforcement Learning (Batch RL) consists of training a policy using trajectories collected with another policy, called the behavioural policy. Safe policy improvement (SPI) provides guarantees with high probability that the trained policy performs better than the behavioural policy, also called the baseline in this setting. Previous work shows that the SPI objective improves mean performance as compared to using the basic RL objective, which boils down to solving the MDP with maximum likelihood. Here, we build on that work and, more precisely, improve the SPI with Baseline Bootstrapping (SPIBB) algorithm by allowing the policy search over a wider set of policies. Instead of binarily classifying the state-action pairs into two sets (the uncertain and the safe-to-train-on ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty. The method can take more risks on uncertain actions while remaining provably safe, and is therefore less conservative than the state-of-the-art methods. We propose two algorithms (one optimal and one approximate) to solve this constrained optimization problem and empirically show a significant improvement over existing SPI algorithms both on finite MDPs and on infinite MDPs with a neural network function approximation.

Reproducible Research
15:00 - 15:20
Policy Prediction Network: Model-Free Behavior Policy with Model-Based Learning in Continuous Action Space (543)
Zac Wellmer (Hong Kong University of Science and Technology), James T. Kwok (Hong Kong University of Science and Technology)

This paper proposes a novel deep reinforcement learning architecture inspired by previous tree-structured architectures, which were only usable in discrete action spaces. Policy Prediction Network offers a way to improve sample complexity and performance on continuous control problems in exchange for extra computation at training time but at no cost in computation at rollout time. Our approach integrates a mix of model-free and model-based reinforcement learning. Policy Prediction Network is the first to introduce implicit model-based learning to Policy Gradient algorithms for continuous action spaces, and is made possible by an empirically justified clipping scheme. Our experiments are focused on the MuJoCo environments so that they can be compared with similar work done in this area.

14:20 - 14:40
Attentive Multi-Task Deep Reinforcement Learning (582)
Timo Bräm (ETH Zurich), Gino Brunner (ETH Zurich)

Sharing knowledge between tasks is vital for efficient learning in a multi-task setting. However, most research so far has focused on the easier case where knowledge transfer is not harmful, i.e., where knowledge from one task cannot negatively impact the performance on another task. In contrast, we present an approach to multi-task deep reinforcement learning based on attention that does not require any a-priori assumptions about the relationships between tasks. Our attention network automatically groups task knowledge into sub-networks on a state-level granularity. It thereby achieves positive knowledge transfer if possible, and avoids negative transfer in cases where tasks interfere. We test our algorithm against two state-of-the-art multi-task/transfer learning approaches and show comparable or superior performance while requiring fewer network parameters.

Reproducible Research
