# Statistical learning theory

### Instructor

Department: Statistics

### Objective

Supervised statistical learning refers to the following prediction problem: find a good rule to predict some output/label of interest (for instance, the object represented by an image), based on some associated input/features (for instance, the image itself), using a dataset of feature-label pairs. This is a core machine learning problem, with applications to computer vision, speech recognition and natural language processing, among many domains.
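The prediction problem above can be illustrated with a minimal sketch: a 1-nearest-neighbor rule learned from a handful of hypothetical feature-label pairs (the data and the choice of rule are illustrative, not part of the course material).

```python
import math

# Toy dataset of feature-label pairs (hypothetical values for illustration):
# each feature is a point in R^2, each label is 0 or 1.
train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]

def nearest_neighbor_rule(x):
    """Predict the label of the closest training feature (1-NN rule)."""
    return min(train, key=lambda pair: math.dist(x, pair[0]))[1]

# The learned rule maps new features to predicted labels.
print(nearest_neighbor_rule((0.05, 0.1)))  # → 0, close to the class-0 points
print(nearest_neighbor_rule((1.05, 0.9)))  # → 1, close to the class-1 points
```

The dataset here plays the role of the feature-label pairs, and the 1-NN rule is one possible "prediction rule" whose quality the theory developed in the course allows one to analyze.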

The aim of this course is to provide an introduction to the theory and principles underlying this general problem. The emphasis will be placed on mathematical and conceptual aspects, rather than on actual implementation: the key concepts and ideas will be illustrated through theoretical (often elementary) results, which will be proven in class.

Along the way, we will introduce some technical tools relevant in statistics and machine learning theory, including basic concentration inequalities and control of empirical processes, stochastic gradient methods, and approximation properties of some classes of functions.
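As a taste of the concentration inequalities mentioned above, the following sketch checks Hoeffding's inequality empirically for i.i.d. variables in [0, 1] (the sample size, threshold, and number of trials are arbitrary choices for illustration).

```python
import math
import random

# Hoeffding's inequality for i.i.d. X_i in [0, 1]:
# P(|mean_n - E[X]| >= t) <= 2 * exp(-2 * n * t^2).
random.seed(0)
n, t, trials = 100, 0.15, 2000

deviations = 0
for _ in range(trials):
    # X_i ~ Uniform[0, 1], so E[X] = 1/2.
    mean = sum(random.random() for _ in range(n)) / n
    if abs(mean - 0.5) >= t:
        deviations += 1

empirical = deviations / trials
bound = 2 * math.exp(-2 * n * t * t)
print(empirical, "<=", bound)  # the empirical frequency respects the bound
```

For uniform variables the bound is quite loose, which is expected: Hoeffding's inequality only uses boundedness, not the variance of the distribution.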

Note: It is not possible to receive credit for both this course and Apprentissage Statistique, as their contents overlap.

Prerequisites: Probability theory at the undergraduate level. Some background in statistics is helpful, though not a strict requirement; general familiarity with machine learning is also a plus, though none is required to follow this course.

### Outline

• Probabilistic setting: prediction, loss and risk;
• Universal consistency and No Free Lunch theorem;
• Empirical risk minimization and its analysis via uniform convergence. Approximation and estimation errors, overfitting and underfitting;
• Finite classes and Hoeffding's inequality; Rademacher complexity and Vapnik-Chervonenkis classes (upper and lower bounds);
• Model selection: complexity regularization and cross-validation;
• Nonparametric regression by histograms, bias-variance decomposition and convergence rates over Hölder classes; curse of dimensionality;
• Convex surrogate losses and their properties; linear prediction and stochastic gradient methods; links between regularization and optimization.
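To make the last item concrete, here is a minimal sketch of stochastic gradient descent on the logistic loss, a standard convex surrogate for the 0-1 loss, for a one-dimensional linear classifier (the synthetic data and step-size schedule are illustrative assumptions, not the course's prescribed setup).

```python
import math
import random

# SGD on the logistic surrogate loss log(1 + exp(-y * w * x))
# for a linear predictor x -> sign(w * x), on hypothetical data.
random.seed(1)

# Labels in {-1, +1}; the underlying rule is y = sign(x).
data = []
for _ in range(200):
    x = random.uniform(-1, 1)
    data.append((x, 1 if x > 0 else -1))

w = 0.0
for t, (x, y) in enumerate(data, start=1):
    # Gradient of log(1 + exp(-y * w * x)) with respect to w.
    grad = -y * x / (1.0 + math.exp(y * w * x))
    w -= (1.0 / math.sqrt(t)) * grad  # decreasing step size ~ 1/sqrt(t)

# The learned linear rule classifies this separable sample perfectly.
errors = sum(1 for x, y in data if (1 if w * x > 0 else -1) != y)
print(w > 0, errors)  # → True 0
```

One pass of SGD over the sample suffices here because the data are linearly separable; the course analyzes such stochastic gradient methods in far greater generality.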

### References

• S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
• L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer-Verlag, 1996.
• A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
• L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
• S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.