Machine Learning for Natural Language Processing
Teacher
ECTS:
3
Course Hours:
9
Tutorials Hours:
9
Language:
Remedial English
Examination Modality:
Written report (mémoire)
Objective
Course Objectives
- Understand the basic principles of natural language processing (NLP).
- Acquire knowledge of classical (statistical) and modern (deep learning, transformers) models.
- Master techniques for representing text and training language models.
- Discover recent applications: RAG, multimodality, agents.
- Be familiar with issues related to AI project management and current trends (ethics, sustainability, specialized models).
Planning
Program
1. Morphology and Tokenization
- Basics: morphemes, inflection, derivation, compounding.
- Normalization techniques: stemming, lemmatization.
- Tokenization: words, characters, subwords (WordPiece, BPE, Unigram).
- Impact of tokenization choices on LLMs.
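As an illustration of subword tokenization, here is a minimal sketch of how BPE merges can be learned from a word list. It is a toy version under stated assumptions: the function name `learn_bpe` is hypothetical, there is no end-of-word marker, and ties between equally frequent pairs are broken by first occurrence. Production tokenizers differ in these details.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version, no end-of-word marker)."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(vocab))
```

After two merges the frequent stem "low" becomes a single symbol, while the rarer suffixes stay split into characters.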
2. Vector representation of words and documents
- Bag of words, TF-IDF.
- Limitations of classical representations.
- Distributional hypothesis, co-occurrences, LSA/LSI.
- Word2Vec, GloVe, neural embeddings.
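To make the bag-of-words and TF-IDF items concrete, the following sketch computes TF-IDF weights with the standard library only. It assumes whitespace tokenization, relative term frequency, and the unsmoothed weighting tf * log(N / df); the function name `tfidf` is hypothetical, and libraries such as scikit-learn use smoothed variants.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

vectors = tfidf(["the cat sat", "the dog barked"])
print(vectors[0])
```

A term that occurs in every document ("the" here) gets weight zero, which is the intended effect of the IDF factor.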
3. Language Models
- Definitions: probability of a sequence of words.
- Statistical models: n-grams, Markov assumption, backoff, Kneser-Ney smoothing.
- Evaluation: cross-entropy, perplexity.
- Transition to neural models: Bengio's neural language model, Word2Vec.
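The n-gram and perplexity items can be illustrated with a small bigram model. This is a sketch under assumptions: it uses add-one (Laplace) smoothing rather than the backoff or Kneser-Ney methods listed above, because it fits in a few lines, and the names `train_bigram` and `perplexity` are hypothetical. Out-of-vocabulary words are not handled specially.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect context counts, bigram counts and the prediction vocabulary."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return contexts, bigrams, vocab

def perplexity(sentence, contexts, bigrams, vocab):
    """Perplexity = exp of the average negative log-probability per predicted token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing: unseen bigrams keep a small non-zero probability.
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

model = train_bigram(["the cat sat", "the dog sat"])
print(perplexity("the cat sat", *model))
print(perplexity("sat cat the", *model))
```

The word order seen in training yields a lower perplexity than the scrambled order, which is what the metric is meant to capture.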
4. Neural Networks and Transformers
- The long-context problem.
- Attention mechanism and self-attention.
- Multi-head attention and positional embeddings.
- Transformer architecture (encoders, decoders, blocks).
- Text generation (sampling, top-k, top-p, temperature).
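The decoding strategies in the last item (temperature, top-k, top-p) can be sketched as successive filters on a next-token distribution. The function names and the toy logits below are assumptions for illustration; real inference libraries apply the same ideas to tensors over the full vocabulary.

```python
import math
import random

def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits: below 1 sharpens, above 1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k is not None:
        probs = probs[:top_k]                      # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for p, i in probs:                         # smallest prefix reaching mass top_p
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:
                break
        probs = kept
    norm = sum(p for p, _ in probs)
    return {i: p / norm for p, i in probs}

def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filtered_distribution(logits, **kwargs)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

print(filtered_distribution([2.0, 1.0, 0.0], top_k=2))
print(sample([2.0, 1.0, 0.0], temperature=0.7, top_p=0.9))
```

Setting `top_k=1` reduces sampling to greedy decoding, which is a convenient sanity check.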
5. Encoders and contextual embeddings
- Limitations of static embeddings.
- BERT and its variants (RoBERTa, XLM-R, CamemBERT, FlauBERT).
- Pre-training (masked language modeling, next sentence prediction).
- Applications: classification, NER, translation.
6. Prompt Engineering and LLMs
- Prompt design principles.
- Contextualization, roles and personas.
- Iteration, adjustments, error analysis.
- Professional applications of LLMs.
- Limitations and biases.
7. Retrieval-Augmented Generation (RAG)
- Classical information retrieval: BM25, bag of words.
- Integration of embeddings and neural search engines.
- RAG architecture: retrieval + generation.
- Applications (chatbots, augmented search engines).
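For the retrieval half of a RAG pipeline, the following sketch scores documents against a query with BM25. It is a minimal version under assumptions: whitespace tokenization, the commonly used defaults k1 = 1.5 and b = 0.75, and the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)); the function name `bm25_scores` is hypothetical. In a RAG system the top-scoring passages would then be placed in the prompt of the generator.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized through b.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs bark loudly", "a cat and a dog"]
print(bm25_scores("cat", docs))
```

Both documents containing "cat" receive a positive score, and the shorter one ranks higher because of length normalization.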
8. AI Project Management
- Specifics of AI projects vs. classic IT projects.
- Methodologies (waterfall, agile, MLOps).
- Roles in an AI team.
- Data: collection, annotation, quality, learning curves.
- Metrics, baseline, improvement iterations.
9. Current Trends in NLP
- Multimodality (text + image, audio, video).
- Multilingual and cross-lingual models.
- Domain-specific LLMs.
- Small language models.
- Sustainability and energy impact.
- Agentic AI, ethics and responsibility.