Stochastic Optimization and Automatic Differentiation for Machine Learning


How can we tune the parameters of a logistic regression model on terabytes of data when most machines today have only a few dozen gigabytes of RAM? Which software innovations of recent years have made it possible for deep neural networks, with their billions of parameters, to spread so widely? This course answers both questions. It blends rigorous ideas from convex optimization with practical insights gained through coding sessions in Python (participants are expected to bring their laptops).


To answer the first question, we will see that several practical hacks exist for handling very large datasets (e.g., split the data across several machines, optimize the parameters separately on each, then average the results). However, the optimization and machine learning communities have recently proposed dozens of approaches that beat these hacks not only in theory (one can provably converge to the best overall parameter) but also in practice. One goal of this course is to familiarize students with the algorithmic tools that are now crucial for running machine learning algorithms at scale, such as stochastic, incremental, and distributed optimization. We will cover these tools both in theory, by studying their convergence properties, and in practice, through coding exercises.
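As a rough illustration of the stochastic approach mentioned above, the sketch below runs one pass of stochastic gradient descent for logistic regression over a stream of samples, so that only one sample is held in memory at a time. The names `sample_stream` and `sgd_logistic`, the synthetic data, and the 1/sqrt(t) step-size schedule are assumptions made for this example and are not taken from the course material.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_stream(n, w_true, rng):
    """Yield (x, y) pairs one at a time; a stand-in for reading a huge file."""
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        p = sigmoid(sum(wj * xj for wj, xj in zip(w_true, x)))
        y = 1 if rng.random() < p else 0
        yield x, y


def sgd_logistic(stream, dim, step0=0.5):
    """One pass of stochastic gradient descent on the logistic loss."""
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream, start=1):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        step = step0 / math.sqrt(t)  # decaying step size (assumed schedule)
        # The gradient of the per-sample logistic loss is (p - y) * x.
        w = [wj - step * (p - y) * xj for wj, xj in zip(w, x)]
    return w


rng = random.Random(0)
w_true = [2.0, -1.0, 0.5]  # hypothetical ground-truth parameters
w_hat = sgd_logistic(sample_stream(20000, w_true, rng), dim=3)
print("estimated parameters:", w_hat)
```

Because the data arrive through a generator, memory use does not grow with the number of samples; the same loop would work if `sample_stream` read records from disk instead of drawing them at random.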


To answer the second question, we will present automatic differentiation, a tool that was neglected by the statistics and machine learning communities for many years but deserves most of the credit for enabling the widespread use of deep neural networks. We will demonstrate the usefulness of automatic differentiation for general tasks, including those unrelated to deep learning, and explain why it is efficient when run on GPGPUs.
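To make the idea concrete, here is a minimal sketch of reverse-mode automatic differentiation on scalars, applied to a function that has nothing to do with neural networks. The `Var` class and the `sin` and `backward` helpers are hypothetical names written for this illustration; real frameworks are far more elaborate and operate on arrays.

```python
import math


class Var:
    """A scalar that records how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (parent Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node."""
    order, seen = [], set()

    def visit(node):
        # Depth-first traversal: parents are appended before their consumers.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)
backward(f)
# Analytically, df/dx = y + cos(x) and df/dy = x.
print(x.grad, y.grad)
```

A single backward sweep yields the derivatives with respect to all inputs at once, which is why the reverse mode suits functions with many parameters and a scalar output.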


  • Lecture 1. Introduction: the importance of optimization in data science; examples from the ML literature (logistic regression, regularized empirical risk minimization (R-ERM), matrix completion).
  • Lecture 2. Reminders on convexity, convex optimization, and Fenchel duality.
  • Lecture 3. Review of first-order methods for solving R-ERM on small datasets: gradient descent, accelerated gradient. Non-smooth optimization: proximal methods.
  • Lecture 4. The stochastic gradient (SG) method: presentation, analysis, relations to online learning, mini-batches, dynamic sampling, iterate averaging.
  • Lecture 5. Incremental gradient methods for R-ERM: gradient aggregation, SVRG, SAG. Analysis and numerical examples.
  • Lecture 6. Distributed optimization: ADMM, CoCoA. Asynchronous optimization: Hogwild!
  • Lecture 7. Automatic differentiation, a.k.a. backpropagation.
  • Lecture 8. Autodiff frameworks, coding on GPUs.

  • Lab 1. Python reminders; gradient and stochastic gradient for logistic regression.
  • Lab 2. Incremental methods: SAGA, SVRG.
  • Lab 3. Distributed optimization (ADMM, CoCoA).
  • Lab 4. Automatic differentiation.
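To give a flavor of the incremental methods listed in the outline, the following is a minimal sketch of SVRG applied to l2-regularized logistic regression on a small synthetic dataset. The data, the regularization strength, the step size, and the number of inner iterations are assumptions chosen for this example; they are not prescribed by the course.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad_i(w, x, y, lam):
    """Gradient of the l2-regularized logistic loss on one sample, y in {0, 1}."""
    r = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
    return [r * xj + lam * wj for xj, wj in zip(x, w)]


def full_grad(w, data, lam):
    """Average of the per-sample gradients over the whole dataset."""
    g = [0.0] * len(w)
    for x, y in data:
        g = [a + b for a, b in zip(g, grad_i(w, x, y, lam))]
    return [a / len(data) for a in g]


def svrg(data, dim, lam=0.1, step=0.05, epochs=20, seed=0):
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        snapshot = list(w)
        snapshot_grad = full_grad(snapshot, data, lam)  # one full pass
        for _ in range(2 * len(data)):  # inner loop length is an assumption
            x, y = data[rng.randrange(len(data))]
            g_w = grad_i(w, x, y, lam)
            g_s = grad_i(snapshot, x, y, lam)
            # Variance-reduced direction: unbiased estimate of the full gradient.
            w = [wj - step * (a - b + m)
                 for wj, a, b, m in zip(w, g_w, g_s, snapshot_grad)]
    return w


def norm(v):
    return math.sqrt(sum(a * a for a in v))


rng = random.Random(1)
w_true = [2.0, -1.0, 0.5]  # hypothetical ground-truth parameters
data = []
for _ in range(200):
    x = [rng.gauss(0.0, 1.0) for _ in w_true]
    p = sigmoid(sum(wj * xj for wj, xj in zip(w_true, x)))
    data.append((x, 1 if rng.random() < p else 0))

grad_norm_start = norm(full_grad([0.0] * 3, data, 0.1))
w_svrg = svrg(data, dim=3)
grad_norm_end = norm(full_grad(w_svrg, data, 0.1))
print(grad_norm_start, grad_norm_end)
```

Unlike plain stochastic gradient, the correction term shrinks as the iterate and the snapshot approach the minimizer, so a constant step size can be used.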


Léon Bottou, Frank E. Curtis, Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review, 2018.