ENSAE Paris - École d'ingénieurs pour l'économie, la data science, la finance et l'actuariat

Machine Learning for Natural Language Processing

Course Objectives

  • Understand the basic principles of natural language processing (NLP).
  • Acquire knowledge of classical (statistical) and modern (deep learning, transformer) models.
  • Master techniques for representing text and training language models.
  • Discover recent applications: RAG, multimodality, agents.
  • Become familiar with AI project management issues and current trends (ethics, sustainability, specialized models).

Program

1. Morphology and Tokenization

  • Basics: morphemes, inflection, derivation, compounding.
  • Normalization techniques: stemming, lemmatization.
  • Tokenization: words, characters, subwords (WordPiece, BPE, Unigram).
  • Impact of tokenization choices on LLMs.

2. Vector Representations of Words and Documents

  • Bag of words, TF-IDF.
  • Limitations of classical representations.
  • Distributional hypothesis, co-occurrences, LSA/LSI.
  • Word2Vec, GloVe, neural embeddings.
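
As an illustration of the TF-IDF weighting listed above, here is a minimal sketch in plain Python. It assumes one common variant (relative term frequency multiplied by log(N / df)); libraries often use smoothed or normalized versions, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (illustrative only), already tokenized on whitespace.
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
weights = tf_idf(docs)
```

With this variant, a term that occurs in every document ("the" here) gets weight 0, while a term confined to one document ("dog") gets the highest weight.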

3. Language Models

  • Definitions: probability of a sequence of words.
  • Statistical models: n-grams, Markov assumption, backoff, Kneser-Ney smoothing.
  • Evaluation: cross-entropy, perplexity.
  • Transition to neural models: Bengio networks, Word2Vec.
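
The link between cross-entropy and perplexity mentioned above can be sketched in a few lines. This example assumes the model's probability for each token of a test sequence is already available; the function name `perplexity` and the probability values are hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the model probability of each token."""
    # Cross-entropy in nats: average negative log-probability per token.
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponential of the cross-entropy.
    return math.exp(cross_entropy)

# A model that always assigns probability 1/4 to the observed token
# is as uncertain as a uniform choice among 4 options.
uniform_ppl = perplexity([0.25] * 8)

# Higher probabilities on the observed tokens give a lower perplexity.
better_ppl = perplexity([0.5, 0.8, 0.9, 0.6])
```

Here `uniform_ppl` is approximately 4 (up to floating-point rounding), and `better_ppl` is lower because the model is less surprised by the observed tokens.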

4. Neural Networks and Transformers

  • The long-context problem.
  • Attention mechanism and self-attention.
  • Multi-head attention and positional embeddings.
  • Transformer architecture (encoders, decoders, blocks).
  • Text generation (sampling, top-k, top-p, temperature).
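
The decoding options listed above (temperature, top-k, top-p) can be combined in one sampling step. The sketch below is a simplified illustration in plain Python, not the implementation of any particular library; the function name `sample_next`, its parameters, and the example logits are hypothetical, and `temperature` is assumed to be strictly positive.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token id from raw logits (temperature must be > 0)."""
    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        # Top-k: keep only the k most probable tokens.
        ranked = ranked[:top_k]
    if top_p is not None:
        # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    # random.choices treats the weights as relative, so the kept
    # probabilities are renormalized implicitly.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
token = sample_next(logits, temperature=0.7, top_k=2, rng=random.Random(0))
```

With `top_k=1` the function reduces to greedy decoding, since only the most probable token remains as a candidate.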

5. Encoders and Contextual Embeddings

  • Limitations of static embeddings.
  • BERT and its variants (RoBERTa, XLM-R, CamemBERT, FlauBERT).
  • Pre-training (masked language modeling, next sentence prediction).
  • Applications: classification, NER, translation.

6. Prompt Engineering and LLMs

  • Prompt design principles.
  • Contextualization, roles and personas.
  • Iteration, adjustments, error analysis.
  • Professional applications of LLMs.
  • Limitations and biases.

7. Retrieval-Augmented Generation (RAG)

  • Classical information retrieval: bag of words, BM25.
  • Integration of embeddings and neural search engines.
  • RAG architecture: retrieval + generation.
  • Applications (chatbots, augmented search engines).

8. AI Project Management

  • Specifics of AI projects vs. classic IT projects.
  • Methodologies (waterfall, agile, MLOps).
  • Roles in an AI team.
  • Data: collection, annotation, quality, learning curves.
  • Metrics, baseline, improvement iterations.

9. Current Trends in NLP

  • Multimodality (text + image, audio, video).
  • Multilingual and cross-lingual models.
  • Domain-specific LLMs.
  • Small language models.
  • Sustainability and energy impact.
  • Agentic AI, ethics and responsibility.