Measuring RAG systems in LLM applications
RAG is an important part of any LLM application that needs to work with private or up-to-date data. Consider an email assistant: here, RAG is responsible for accessing your inbox. First, the retrieval step fetches some context for the task; then the generation step uses both the task and the retrieved context to produce an answer. Retrieval usually queries some kind of database and returns related documents, while generation calls the LLM directly to prepare the answer. Measuring, or evaluating, RAG helps us understand how much value it adds for the end user.
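To make the two stages concrete, here is a minimal sketch of such a pipeline. The vector_db and llm objects are hypothetical stand-ins for whatever retriever and model client the application actually uses; they are not part of any specific library.

    # Minimal sketch of the two RAG stages: retrieval, then generation.
    from dataclasses import dataclass

    @dataclass
    class RAGResult:
        answer: str
        contexts: list[str]  # retrieved documents, kept for later evaluation

    def answer_with_rag(task: str, vector_db, llm, k: int = 5) -> RAGResult:
        # 1. Retrieval: the database returns the k documents most related to the task.
        contexts = vector_db.search(task, top_k=k)
        # 2. Generation: the LLM answers the task grounded in the retrieved context.
        context_text = "\n".join(contexts)
        prompt = (
            "Answer the task using only the context below.\n\n"
            f"Context:\n{context_text}\n\nTask: {task}"
        )
        answer = llm.complete(prompt)
        return RAGResult(answer=answer, contexts=contexts)

Keeping the retrieved contexts alongside the answer matters later: every evaluation approach discussed below needs both.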
Specific metrics exist for each component of the RAG system: retrieval and generation. For retrieval, we have context precision, context recall, context relevancy, and context entity recall. Faithfulness and answer relevancy are used to measure generation. It is worth noting that the retrieval component also needs to rank the database results; ranking quality can be measured with metrics such as MSE, MAE, MAP@k, RBR, and many others. On the other hand, RAG systems can also be evaluated end-to-end: answer semantic similarity and answer correctness are two metrics that score the whole RAG pipeline.
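As a simple illustration of the retrieval side, here are the classic set-based forms of precision@k and recall@k over retrieved documents. The LLM-based context precision/recall metrics mentioned above replace the exact-match relevance check with an LLM judgment, but the intuition is the same.

    def context_precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        """Fraction of the top-k retrieved documents that are actually relevant."""
        top_k = retrieved[:k]
        if not top_k:
            return 0.0
        return sum(doc in relevant for doc in top_k) / len(top_k)

    def context_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        """Fraction of all relevant documents that appear in the top-k results."""
        if not relevant:
            return 0.0
        return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

    # Example: 2 of the top-3 results are relevant, and 2 of the 4 relevant docs were found.
    retrieved = ["doc_1", "doc_7", "doc_3"]
    relevant = {"doc_1", "doc_3", "doc_9", "doc_12"}
    print(context_precision_at_k(retrieved, relevant, k=3))  # 0.666...
    print(context_recall_at_k(retrieved, relevant, k=3))     # 0.5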
Luckily for us, we don't need to implement all of these metrics ourselves. Last year, the RAGAS framework was developed and open-sourced. RAGAS proposes an implementation based on the faithfulness, context relevance, and answer relevance metrics. In a nutshell, RAGAS uses an LLM to evaluate the response of the RAG system: for each metric, a specific prompt is sent to the LLM, the result of this call is parsed, and a single score is computed. In other words, RAGAS evaluates RAG automatically, without human intervention, so we don't need to provide ground-truth results. By measuring quality this way, we also identify hallucinations, so with the help of RAGAS we can filter out bad responses and ask the model to try again.
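A minimal usage sketch, assuming the ragas Python package (around v0.1) together with the datasets library and an OpenAI API key in the environment; the column names and API surface have shifted between releases, so treat this as illustrative rather than definitive.

    # pip install ragas datasets; OPENAI_API_KEY must be set, since ragas
    # calls an LLM under the hood to score each sample.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy

    samples = {
        "question": ["When is my flight to Berlin?"],
        "contexts": [["Your flight LH123 to Berlin departs on 12 May at 09:40."]],
        "answer": ["Your flight to Berlin departs on 12 May at 09:40."],
    }
    result = evaluate(Dataset.from_dict(samples), metrics=[faithfulness, answer_relevancy])
    print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.98}

Note that neither faithfulness nor answer relevancy requires a ground-truth answer column, which is exactly what makes this kind of evaluation cheap to run on live traffic.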
Improvements to RAGAS were proposed in the ARES paper. ARES uses LLM judges: lightweight LLMs trained specifically to measure one of the same three metrics as RAGAS - faithfulness ("Is the generated answer faithful to the retrieved passage, or does it contain hallucinated or extrapolated statements beyond the passage?"), context relevance ("Is the passage returned relevant for answering the given query?"), and answer relevance ("Is the generated answer relevant given the query and retrieved passage?"). The judges are trained on synthetically generated data, and roughly 150 human-annotated data points are used to calibrate their scores. ARES returns results with a confidence interval, which can be used to adjust the system's behaviour dynamically.
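At its core, a judge of this kind boils down to a classification-style LLM call. Below is a hedged sketch of that idea, not the ARES codebase: call_llm is a hypothetical helper standing in for whichever judge model you deploy.

    # Sketch of an LLM-judge check for faithfulness; `call_llm(prompt) -> str`
    # is a hypothetical wrapper around your judge model.
    FAITHFULNESS_PROMPT = (
        "Passage:\n{passage}\n\nAnswer:\n{answer}\n\n"
        "Is the generated answer faithful to the retrieved passage, or does it contain "
        "hallucinated or extrapolated statements beyond the passage? Reply 'yes' or 'no'."
    )

    def judge_faithfulness(passage: str, answer: str, call_llm) -> bool:
        verdict = call_llm(FAITHFULNESS_PROMPT.format(passage=passage, answer=answer))
        return verdict.strip().lower().startswith("yes")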
Before going to production, we can test our RAG system using benchmarks and datasets created specifically for RAG evaluation. For the task of multi-document question answering, for example, there are the MasQA and MultiHop-RAG benchmarks.
These benchmarks consist of document collections paired with questions for testing RAG systems, and they nicely complement RAGAS and ARES. Because LLMs carry knowledge of their own, a plain LLM without retrieval can serve as a baseline in some cases; the only caveat is to pay attention to the knowledge cut-off date of the specific LLM.
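As a sketch of that kind of pre-production check, the snippet below runs the same questions through the RAG pipeline and through the bare LLM as a no-retrieval baseline. It reuses the hypothetical answer_with_rag pipeline from the earlier sketch, and assumes a benchmark.jsonl file holding question/gold-answer records exported from a benchmark such as MultiHop-RAG.

    import json

    def exact_match(prediction: str, gold: str) -> bool:
        return prediction.strip().lower() == gold.strip().lower()

    def score(pipeline, llm, vector_db, path: str = "benchmark.jsonl") -> dict:
        rag_hits, baseline_hits, total = 0, 0, 0
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                question, gold = record["question"], record["gold_answer"]
                rag_hits += exact_match(pipeline(question, vector_db, llm).answer, gold)
                baseline_hits += exact_match(llm.complete(question), gold)  # no retrieval
                total += 1
        return {"rag_accuracy": rag_hits / total, "baseline_accuracy": baseline_hits / total}

Exact match is a deliberately crude comparison; in practice you would swap it for answer semantic similarity or an LLM-based correctness check, as discussed above.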
References:
https://arxiv.org/abs/2402.01767 - HiQA: A Hierarchical Contextual Augmentation RAG for Massive Documents QA
https://arxiv.org/abs/2401.15391 - MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
https://arxiv.org/abs/2311.09476 - ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
Video - Webinar "Evaluating LLM Models for Production Systems: Methods and Practices"
https://towardsdatascience.com/comprehensive-guide-to-ranking-evaluation-metrics-7d10382c1025 - Comprehensive Guide to Ranking Evaluation Metrics
https://arxiv.org/abs/2309.15217 - RAGAS: Automated Evaluation of Retrieval Augmented Generation