Deconstructing the text embedding models — Kacper Łukawski

Conference: EuroPython 2024

Year: 2024

[EuroPython 2024 — North Hall on 2024-07-10] Deconstructing the text embedding models by Kacper Łukawski https://ep2024.europython.eu/session/deconstructing-the-text-embedding-models Selecting the optimal text embedding model is often guided by benchmarks such as the Massive Text Embedding Benchmark (MTEB). While choosing the best model from the leaderboard is a common practice, it may not always align perfectly with the unique characteristics of your specific dataset. This approach overlooks a crucial yet frequently underestimated element - the tokenizer. We will delve deep into the tokenizer's fundamental role, shedding light on its operations and introducing straightforward techniques to assess whether a particular model is suited to your data based solely on its tokenizer. We will explore the significance of the tokenizer in the fine-tuning process of embedding models and discuss strategic approaches to optimize its effectiveness. --- This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: https://creativecommons.org/licenses/by-nc-sa/4.0/