How to choose an embedding model for your LLM app?
If you build a RAG system, you will eventually need to decide which embedding model to use. In this context, the embedding model is what retrieves documents from storage in response to the user's query. The better the model, the better the grounding for your LLM.
The first parameter to consider is language. Some models focus on a single language, others support multiple languages, and some even support cross-lingual queries (a query in one language retrieving documents in another). The next parameter is functionality: dense retrieval, multi-vector retrieval, and sparse retrieval. The maximum input length the model can handle also matters, because it shapes your chunking strategy for RAG. Other parameters to consider are model size (do you want to run it on the client or on the server?), whether the model is open or proprietary, and which domains were covered during training.
For instance, the Jina model was trained only on English texts but has the smallest model size, at just 35M parameters. Google's Gecko model was trained on synthetic data: in the first stage, an LLM generates training tasks and queries, and in the second stage the same LLM labels positive and negative passages for them. As a result, Gecko with only 256 embedding dimensions outperforms all compared models with up to 768 dimensions. In comparison, BGE M3-Embedding was trained on more than 100 languages with cross-lingual support and multi-functionality (dense retrieval, multi-vector retrieval, and sparse retrieval).
Another parameter to consider is the embedding dimensionality. Popular models produce embeddings with 1024 dimensions. That means that if you have 10K users and each uploads 100 documents split into 2 chunks each, you end up with 2M vectors, which at 1024 float32 dimensions is roughly 8GB of storage. Fortunately, there are several ways to reduce storage and improve performance at the same time.
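Here is the back-of-the-envelope math behind that number, as a minimal Python sketch (the user and document counts are just the illustrative figures from above):

```python
# Back-of-the-envelope storage estimate for the example above.
users = 10_000
docs_per_user = 100
chunks_per_doc = 2
dims = 1024          # embedding dimensions
bytes_per_dim = 4    # float32

vectors = users * docs_per_user * chunks_per_doc    # 2,000,000 vectors
storage_gb = vectors * dims * bytes_per_dim / 1e9   # ~8.2 GB
print(f"{storage_gb:.1f} GB")
```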
The first option is Matryoshka embeddings. The idea behind this technique is that a single model produces embeddings that remain useful at several sizes: from a 1024-dimensional vector you can keep 768, 512, 256, 128, or even 64 dimensions. In principle, any embedding model can be trained this way. The goal is to store the most important information in the earlier dimensions so that the truncated vectors still perform well on downstream tasks.
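As a rough sketch of what happens at query time with a Matryoshka-trained model: you keep the leading dimensions and re-normalize. The 1024-dimensional vector below is a random placeholder for a real model output:

```python
import numpy as np

def truncate_and_normalize(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep only the first `dims` dimensions and re-normalize to unit length."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

# Placeholder for a real model output (e.g., a 1024-dim vector from your embedding model).
full_embedding = np.random.rand(1024).astype(np.float32)

small = truncate_and_normalize(full_embedding, 256)  # 4x less storage
tiny = truncate_and_normalize(full_embedding, 64)    # 16x less storage
print(small.shape, tiny.shape)                       # (256,) (64,)
```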
The second option is binary and scalar quantization. Binary embeddings store 0s and 1s instead of float32 values, so we trade some quality for speed and storage. Scalar embeddings, on the other hand, map each float to one of a set of integer buckets (int8) instead of just 0 or 1. So scalar quantization saves 4x storage, whereas binary saves 32x. The remarkable part is that we may lose only about 10% of quality, and often less when rescoring is applied.
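Below is a minimal NumPy sketch of binary quantization: each dimension is thresholded at zero and vectors are compared with the Hamming distance. Vector databases such as Qdrant implement this (and int8 scalar quantization) natively, so treat it only as an illustration of where the 32x saving comes from:

```python
import numpy as np

def binarize(embedding: np.ndarray) -> np.ndarray:
    """Map each float32 dimension to one bit (1 if positive, else 0).
    np.packbits stores 8 dimensions per byte, which is the 32x saving."""
    return np.packbits(embedding > 0)

def hamming_distance(a_bits: np.ndarray, b_bits: np.ndarray) -> int:
    """Number of differing bits; lower means more similar."""
    return int(np.unpackbits(a_bits ^ b_bits).sum())

query = np.random.randn(1024).astype(np.float32)      # placeholder embeddings
document = np.random.randn(1024).astype(np.float32)

q_bits, d_bits = binarize(query), binarize(document)
print(q_bits.nbytes, "bytes instead of", query.nbytes)  # 128 bytes instead of 4096
print("Hamming distance:", hamming_distance(q_bits, d_bits))
```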
As you might guess, binary, scalar, and Matryoshka techniques can be combined to improve performance even further.
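A sketch of combining the two ideas: truncate the Matryoshka embedding first, then binarize the result (again with a random placeholder vector standing in for a real model output):

```python
import numpy as np

full_embedding = np.random.randn(1024).astype(np.float32)  # placeholder model output

truncated = full_embedding[:256]        # Matryoshka truncation: keep the leading dims
truncated /= np.linalg.norm(truncated)  # re-normalize after truncation
bits = np.packbits(truncated > 0)       # binary quantization: 256 dims -> 32 bytes

print(full_embedding.nbytes, "->", bits.nbytes, "bytes")  # 4096 -> 32 bytes (128x smaller)
```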
References:
https://arxiv.org/abs/2402.03216 - BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
https://huggingface.co/blog/embedding-quantization - Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval
https://arxiv.org/abs/2403.20327 - Gecko: Versatile Text Embeddings Distilled from Large Language Models
https://arxiv.org/abs/2307.11224 - Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/ - Jina AI Launches World's First Open-Source 8K Text Embedding, Rivaling OpenAI
https://qdrant.tech/articles/fastembed/ - FastEmbed: Fast and Lightweight Embedding Generation for Text
https://huggingface.co/blog/matryoshka - Introduction to Matryoshka Embedding Models
https://qdrant.tech/articles/scalar-quantization/ - Qdrant under the hood: Scalar Quantization