Where is RAG architecture going?
A year ago, RAG implementations were super simple: a vector database search used to enrich the prompt. These days, RAG has become far more advanced and complex. One reason for this is growing user needs.
One such need is asking questions at the dataset level, for instance 'Give me an overview of the whole dataset' versus 'Give me a summary of this year's P&L report'. The paper 'From Local to Global' highlights that these are two different tasks: the global one requires 'query-focused summarization', whereas the local one is 'an explicit retrieval task'.
The paper suggests building a graph from the dataset and pre-generating summaries for closely related entities. Entities are extracted in multiple rounds; after each round, a check measures the quality of the extraction. Similar entities are grouped into communities using the Leiden algorithm. At query time, we query the community summaries and use them to produce a global answer.
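Conceptually, the query-time step looks like a map-reduce over the pre-generated community summaries. Here is a minimal sketch of that idea only; the `llm` helper is a hypothetical placeholder for whatever chat model you use, and the actual pipeline also scores and filters the partial answers:

```python
# Minimal sketch of the query-time "global answer" step, assuming the community
# summaries were already built via entity extraction + Leiden clustering.
# `llm` is a hypothetical helper wrapping whatever chat model you use.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-model call here")

def global_answer(query: str, community_summaries: list[str]) -> str:
    # Map step: produce one partial answer per community summary.
    partial_answers = [
        llm(f"Using only this community summary:\n{summary}\n\nAnswer: {query}")
        for summary in community_summaries
    ]
    # Reduce step: combine the partial answers into a single global answer.
    combined = "\n\n".join(partial_answers)
    return llm(f"Combine these partial answers into one answer to '{query}':\n{combined}")
```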
Another need is to use not only a private dataset but also the power of the Internet. The 'WeKnow-RAG' paper shows how to integrate web search and knowledge graphs, feeding search-engine results into the RAG pipeline. WeKnow-RAG also uses automatic evaluation to determine whether a response is correct.
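A rough sketch of that flow, where all helpers (`kg_retrieve`, `web_search`, `llm`) are hypothetical placeholders rather than the paper's actual interfaces, might look like this:

```python
# Rough sketch of mixing knowledge-graph retrieval with web search and adding
# an automatic self-check. All helpers (kg_retrieve, web_search, llm) are
# hypothetical placeholders, not the paper's actual interfaces.
def kg_retrieve(query: str) -> list[str]: ...
def web_search(query: str) -> list[str]: ...
def llm(prompt: str) -> str: ...

def answer_with_web_and_kg(query: str) -> str:
    # Gather context from both the private knowledge graph and the open web.
    context = kg_retrieve(query) + web_search(query)
    answer = llm(f"Answer '{query}' using only:\n" + "\n".join(context))
    # Automatic evaluation: ask the model to judge the answer before returning it.
    verdict = llm(f"Question: {query}\nAnswer: {answer}\n"
                  "Is the answer supported by the context? Reply yes or no.")
    return answer if verdict.strip().lower().startswith("yes") else "I don't know"
```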
Other user needs that must be addressed are improving quality and reducing hallucinations. In the previous example, an LLM evaluator was used, but 'Multi-Head RAG' proposes quite a radical approach. To retrieve a document from a dataset, an embedding model is used: an embedding is an array of numbers that represents a document, and by comparing two embedding vectors we can tell how similar the source documents are. As Multi-Head RAG shows, for such a simple question as 'What car did Alexander the Great drive?', a normal embedding model has to retrieve two very different documents, one about cars and another about Alexander the Great.
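The single-vector baseline fits in a few lines; the model name below is an assumption for illustration, not something from the paper:

```python
# The single-vector baseline: one embedding per text, compared with cosine
# similarity. The model name here is an assumption for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What car did Alexander the Great drive?"
docs = [
    "A page about cars and the history of the automobile.",
    "A page about Alexander the Great and his conquests.",
]

query_vec = model.encode(query)      # an array of numbers representing the query
doc_vecs = model.encode(docs)

# Higher cosine similarity = closer in the embedding space; a single query
# vector has to rank two very different topics at once.
print(util.cos_sim(query_vec, doc_vecs))
```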
To understand the Multi-Head RAG approach, we need to go one level deeper. In standard RAG, the embeddings come from the Transformer architecture. Transformers consist of multiple blocks, all of which are pretty much the same. The embedding is taken from the last block, after the feed-forward layer. But before the feed-forward layer sit the attention heads, and MRAG suggests using the outputs of the attention heads instead of the feed-forward layer.
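As a minimal sketch of the idea (not the paper's exact pipeline), we can hook the self-attention of the last block of a BERT-style encoder from Hugging Face transformers and slice its output into one vector per head; the model choice, the module path, and the mean pooling below are all assumptions:

```python
# Sketch: capture per-head activations from the last attention block of a
# BERT-style encoder and treat each head's slice as a separate embedding.
# Illustration of the MRAG idea only; model name and pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

captured = {}

def hook(module, inputs, outputs):
    # BertSelfAttention returns a tuple; the first element is the concatenation
    # of the head outputs, shape (batch, seq_len, hidden_size), taken before
    # the feed-forward layer of the block.
    captured["attn"] = outputs[0]

# Attach to the self-attention of the last encoder block (BERT module layout).
model.encoder.layer[-1].attention.self.register_forward_hook(hook)

def multi_head_embeddings(text: str) -> list[torch.Tensor]:
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**batch)
    attn = captured["attn"].mean(dim=1).squeeze(0)   # mean-pool over tokens
    n_heads = model.config.num_attention_heads
    head_dim = attn.shape[-1] // n_heads
    # One embedding per attention head instead of a single vector.
    return [attn[i * head_dim:(i + 1) * head_dim] for i in range(n_heads)]

vectors = multi_head_embeddings("What car did Alexander the Great drive?")
print(len(vectors), vectors[0].shape)  # e.g. 12 heads, 32 dimensions each
```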
So, instead of one embedding vector, we use multiple embedding vectors, each capturing a different aspect of the query. Recent papers show that attention heads specialize during LLM training, which explains why this approach works. The results are impressive too: a 10% increase in retrieval success ratio and 10% to 30% improvements in accuracy.
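At retrieval time, each head's vectors get their own index and the results have to be merged. The naive voting scheme below is my own simplification for illustration; the paper uses a more elaborate weighted strategy:

```python
# Sketch of retrieval over several embedding spaces: each head's vectors live in
# their own index, and documents are ranked by how many heads retrieved them.
# The plain voting below is an assumption; the paper uses a weighted scheme.
from collections import Counter

def multi_head_retrieve(query_vectors, head_indexes, top_k=3):
    votes = Counter()
    for query_vec, index in zip(query_vectors, head_indexes):
        # `index.search` stands in for any vector-store lookup returning doc ids
        for doc_id in index.search(query_vec, top_k):
            votes[doc_id] += 1
    # Documents retrieved by more heads rank higher.
    return [doc_id for doc_id, _ in votes.most_common(top_k)]
```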
Resources:
https://arxiv.org/abs/2408.05141 - A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning
https://arxiv.org/abs/2406.05085v1 - Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
https://arxiv.org/abs/2404.16130 - From Local to Global: A Graph RAG Approach to Query-Focused Summarization
https://arxiv.org/abs/2408.07611 - WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs