ENSAE Paris - École d'ingénieurs pour l'économie, la data science, la finance et l'actuariat

Machine Learning for Natural Language Processing

Teacher

Christopher KERMORVANT

ECTS:
3

Course Hours:
9

Tutorials Hours:
9

Language:
English

Examination Modality:
Written report (mémoire)

Objective

Course Objectives

Understand the basic principles of natural language processing (NLP).
Acquire knowledge of classical (statistical) and modern (deep learning, transformers) models.
Master techniques for representing text and training language models.
Discover recent applications: RAG, multimodality, agents.
Be familiar with issues related to AI project management and current trends (ethics, sustainability, specialized models).

Planning

Program

1. Morphology and Tokenization

Basics: morphemes, inflection, derivation, compounding.
Normalization techniques: stemming, lemmatization.
Tokenization: words, characters, subwords (WordPiece, BPE, Unigram).
Impact of tokenization choices on LLMs.
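
As an illustration of subword tokenization, the following is a minimal sketch of how BPE merges can be learned from a word list. It is a toy version written for this outline, not the course's reference implementation: the function name `learn_bpe` and the example words are assumptions, there is no end-of-word marker, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair (first one seen wins on ties).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single symbol, while "lower" and "lowest" are still split into "low" plus characters.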

2. Vector representation of words and documents

Bag of words, TF-IDF.
Limits of classical representations.
Distributional hypothesis, co-occurrences, LSA/LSI.
Word2Vec, GloVe, neural embeddings.
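
To make the bag-of-words and TF-IDF items concrete, here is a small sketch that computes sparse TF-IDF vectors with the standard library only. It assumes whitespace tokenization, relative term frequency, and the unsmoothed weighting idf = log(N / df); libraries often use smoothed variants, so the numbers will differ from theirs. The function name `tf_idf` and the sample documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


vectors = tf_idf(["the cat sat", "the dog sat", "the cat ran"])
print(vectors[1])
```

A term that occurs in every document ("the" here) gets weight zero, which shows one limit of the representation: the weights depend only on counts, not on meaning or word order.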

3. Language Models

Definitions: probability of a sequence of words.
Statistical models: n-grams, Markov assumption, backoff, Kneser-Ney smoothing.
Evaluation: cross-entropy, perplexity.
Transition to neural models: Bengio's neural language model, Word2Vec.
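
The link between sequence probability and perplexity can be sketched with a bigram model. The example below uses add-one (Laplace) smoothing, a deliberately simpler technique than the backoff or Kneser-Ney methods listed above; the function names, the `<s>` and `</s>` boundary markers, and the toy corpus are assumptions made for illustration. Words never seen in training are not handled properly here.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        contexts.update(toks[:-1])          # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
    # Symbols that can be predicted: all words plus the end marker.
    vocab = {t for s in sentences for t in s.split()} | {"</s>"}
    return contexts, bigrams, vocab


def perplexity(sentence, contexts, bigrams, vocab):
    """Perplexity = exp of the average negative log-probability per token."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(toks, toks[1:]):
        # Add-one smoothed estimate of P(cur | prev).
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(toks) - 1))


model = train_bigram(["a b", "a b"])
print(perplexity("a b", *model))  # word order seen in training: low perplexity
print(perplexity("b a", *model))  # unseen word order: higher perplexity
```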

4. Neural Networks and Transformers

The long-context problem.
Attention mechanism and self-attention.
Multi-head attention and positional embeddings.
Transformer architecture (encoders, decoders, blocks).
Text generation (sampling, top-k, top-p, temperature).
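
The decoding strategies in the last item can be sketched as a single function that draws the next token from a vector of logits. This is a simplified illustration under stated assumptions: `sample_next` and its parameters are hypothetical names, the logits are a plain list of floats, `temperature` must be strictly positive, and real generation libraries implement these filters with additional details.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index from logits using temperature, top-k and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate indices, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        order = order[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the remaining weights implicitly.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_next([3.0, 2.0, -5.0], temperature=0.7, top_k=2, rng=rng))
```

Lower temperatures sharpen the distribution toward the most probable token, and `top_k=1` reduces to greedy decoding.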

5. Encoders and contextual embeddings

Limitations of static embeddings.
BERT and its variants (RoBERTa, XLM-R, CamemBERT, FlauBERT).
Pre-training (masked language modeling, next sentence prediction).
Applications: classification, NER, translation.

6. Prompt Engineering and LLMs

Prompt design principles.
Contextualization, roles and personas.
Iteration, adjustments, error analysis.
Professional applications of LLMs.
Limitations and biases.

7. Retrieval-Augmented Generation (RAG)

Classical information retrieval: BM25, bag of words.
Integration of embeddings and neural search engines.
RAG architecture: retrieval + generation.
Applications (chatbots, augmented search engines).
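
As an example of the retrieval half of such a pipeline, here is a compact sketch of BM25 scoring over a handful of documents. It assumes whitespace tokenization, the common defaults k1 = 1.5 and b = 0.75, and the variant of the IDF term with +1 inside the logarithm to keep it non-negative; `bm25_scores` and the sample documents are illustrative names and data. In a RAG system, the top-scoring passages would then be placed in the prompt given to the generator.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency of each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell"]
scores = bm25_scores("cat mat", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # the cat sat on the mat
```

Note that "cats" does not match the query term "cat": purely lexical retrieval misses such variants, which is one motivation for combining it with embeddings.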

8. AI Project Management

Specific features of AI projects vs. classic IT projects.
Methodologies (waterfall, agile, MLOps).
Roles in an AI team.
Data: collection, annotation, quality, learning curves.
Metrics, baseline, improvement iterations.

9. Current Trends in NLP

Multimodality (text + image, audio, video).
Multilingual and cross-lingual models.
Domain-specific LLMs.
Small language models.
Sustainability and energy impact.
Agentic AI, ethics and responsibility.