ENSAE Paris - École d'ingénieurs pour l'économie, la data science, la finance et l'actuariat

Reinforcement learning - 3A/MS


Reinforcement learning: Taming the Bandit

Moez DRAIEF (former associate professor of statistical learning at Imperial College 2007- 2016 and assistant professor, Statistical Laboratory Cambridge University 2004-2007)

Supported by data scientists from his team at Capgemini as teaching assistants (graduates from top French engineering schools X, ENSAR, TelecomParis, Centrale, etc with Master in Machine Learning MVA, Polytechnique, Dauphine, etc).


These series of lectures will address the problem of prediction, learning, and decision-making in sequential settings. 

We will investigate classical algorithms such as exponential weight algorithms, upper confidence bound, Thompson sampling, and more broadly the problem of “online regret minimization” for settings that get more complex as we advance in the course: starting with prediction with expert advice and going through multi-armed bandits and Markov decision process.

To get better acquainted with these algorithms, we will explore the use of such techniques in practical settings such as experiment design relevant to diverse areas such as clinical trials and user experience, sponsored search and energy storage. We will also attempt to provide students with a glimpse of some recent developments in each of the areas that we will cover.

Format of lectures:  Six 3-hour lectures. The first half of each lecture will introduce the mathematical problem, classical algorithmic approaches and some performance guarantees. In the second half we will delve into a real-world problem starting from actual data and a business scenario. We will then state the problem in the sequential decision-making setting of the first half of the lecture and explore variants thereof.

Assessment: The course will be assessed through projects either involving reading and implementing a research paper or addressing a business problem. The projects will revolve around topics not presented in detail in the course such as model-free reinforcement leaning, deep reinforcement learning, combinatorial bandits, beyond markovian assumptions, Bayesian approaches….

Prerequisites: Familiarity with the analysis of algorithms, probabilistic analysis, linear algebra and optimization. Some familiarity with machine learning foundations and development environments will be quite helpful but not strictly necessary. 



Course Content:

These are the main topics that will be covered.

1.     Introduction to sequential online learning (1 lecture)

a.     binary sequence prediction, the Halving algorithm, mistake bounds

b.     The Weighted Majority algorithm, prediction with expert advice model

c.     Applications: Sequential investment and the universal portfolio algorithm

d.     Glimpse into the multiplicative weights algorithm for online learning with convex losses, Follow The Leader (FTL) algorithm, Follow the Regularized Leader (FTRL) algorithm

2.      Multi-armed bandits (2 lectures)

a.     epsilon greedy and the EXP3 algorithm

b.     Contextual bandits, bandits with expert advice and the EXP4 algorithm

c.     Stochastic bandits, Upper Confidence Bound (UCB) algorithm, Thompson sampling

d.      Applications: Experiment design (MAB vs A/B testing) and contextual bandits for recommender systems and sponsored search

e.     Glimpse into hyperparameter optimisation, combinatorial bandits, Pure exploration in Gaussian processes and multi-armed bandits

3.     Reinforcement learning (3 lectures)

a.     Markov Decision Processes (MDP), dynamic programming, optimal planning for MDPs, value iteration, policy iteration

b.     Reinforcement learning (RL), value estimation methods, Monte Carlo, temporal difference (TD)

c.     Model-free control - Q-learning, SARSA-based control

d.     Model-based RL, upper confidence RL (UCRL)

e.     Applications:  using reinforcement learning frameworks openaigym, pybraiand other hands on RL for energy storage and sponsored search

f.      Glimpse into more advanced settings policy gradient, REINFORCE, actor-critic algorithms


Text/References: There is no single reference text for the course. However, a rough basis will be "Prediction, Learning and Games" (PLG). Cesa-Bianchi and Lugosi, for the first parts of the course, and “Reinforcement Learning: an Introduction” Sutton and Barto for the reinforcement learning part.