From zero to hero in LLM apps
GenAI and large language models (LLMs) are major technology waves that are changing how we develop software, and they will add complexity to software products. The easiest way to start building with LLMs is an educational project.
RAG covers all the core aspects of LLM app development - prompting, search, reasoning, and content generation - which makes it the best candidate for an educational project. Dozens of approaches to building RAG exist, which makes it hard to pick a single architecture to implement. For a first educational project, it is better to try several architectures. This is why I suggest starting with RAGLAB.
RAGLAB is a Python framework that provides modular components for building RAG systems. The goal of the RAGLAB paper was to offer a unified framework for comparing different RAG architectures on common ground. The framework consists of Trainer, Instruction Lab, Metric, Retriever, Corpus, and Generator modules. RAGLAB ships a Naive RAG baseline implementation that can be compared against the RRR, Iter-RETGEN, Self-Ask, Active RAG, and Self-RAG algorithms.
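To make the Naive RAG baseline concrete, here is a minimal retrieve-then-generate sketch. This is my own illustration, not RAGLAB's actual API: the keyword-overlap retriever and the `generate` callback are stand-ins for a real retriever and LLM.

```python
import re
from dataclasses import dataclass

@dataclass
class Document:
    text: str

def tokens(text: str) -> set[str]:
    """Split text into a set of lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, corpus: list[Document], top_k: int = 2) -> list[Document]:
    """Rank documents by keyword overlap with the query (a toy retriever)."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda d: len(q & tokens(d.text)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, docs: list[Document]) -> str:
    """Stuff the retrieved passages into a grounding prompt."""
    context = "\n".join(d.text for d in docs)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

def naive_rag(query: str, corpus: list[Document], generate) -> str:
    """Naive RAG: one retrieval step, then one generation step."""
    return generate(build_prompt(query, retrieve(query, corpus)))
```

Every other algorithm in the list (Iter-RETGEN, Self-RAG, and so on) is a variation on this loop: iterating retrieval, critiquing the draft answer, or deciding when to retrieve at all.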
RAGLAB also provides an interactive mode and config support for testing LoRA adapters and different models. That is enough to develop a new RAG algorithm or to reproduce the results of other papers. Reproducing results basically means taking a popular benchmark, such as Multi-Hop QA, and running your RAG against it. This is good for the initial steps, but a production solution requires human evaluation, which is expensive and time-consuming.
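The benchmark-reproduction loop itself is simple. Below is a hedged sketch of exact-match scoring in the style common to QA benchmarks; the normalization rules and the `(question, gold_answer)` dataset shape are my own simplification, not any benchmark's official scorer.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles - the usual QA normalization."""
    text = "".join(c for c in text.lower() if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True if prediction and gold answer agree after normalization."""
    return normalize(prediction) == normalize(gold)

def evaluate(rag, dataset) -> float:
    """rag: question -> answer; dataset: list of (question, gold_answer) pairs."""
    hits = sum(exact_match(rag(q), gold) for q, gold in dataset)
    return hits / len(dataset)
```

Swapping `exact_match` for F1 or a judge model changes the metric without touching the loop, which is why frameworks like RAGLAB keep Metric as a separate module.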
An alternative to human evaluation is the LLM-as-a-judge approach. The idea is to use LLMs in place of humans: if an LLM is trained on human preferences, it can later be used to evaluate results. Hugging Face has a really good Open-Source AI Cookbook that describes this approach with code examples. In the simplest case, we ask an LLM to score the results, and the scores become our measure of success. LLMs have biases that show up in scoring, but there are ways to minimize them. To automate testing of LLM apps, I highly recommend promptfoo. In addition, AI Gateway from Cloudflare will increase observability and reduce the cost of testing.
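The "ask an LLM for a score" case can be sketched like this. The prompt wording and the parsing are my own illustration (not the Cookbook's exact code), and the `llm` callback stands in for a real model call:

```python
import re

# Hypothetical judge prompt; real setups tune the rubric and scale carefully.
JUDGE_PROMPT = """You are a strict evaluator. Rate the answer to the question \
on a scale of 1 to 5, where 5 means fully correct and grounded.

Question: {question}
Answer: {answer}

Reply in the format:
Rating: <1-5>
Explanation: <one sentence>"""

def judge(question: str, answer: str, llm) -> int:
    """Ask a judge LLM for a 1-5 rating and parse it; -1 means unparseable."""
    reply = llm(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"Rating:\s*([1-5])", reply)
    return int(match.group(1)) if match else -1
```

Forcing a fixed reply format and treating unparseable replies explicitly (here, as -1) is what makes judge scores usable in an automated test suite.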
Up to this point, I've covered RAG apps (development), LLM-as-a-judge and promptfoo (evaluation), and AI Gateway (LLMOps), which is enough to build production-ready apps. The next step is to dive deeper into the world of LLMs. I recommend two papers: 'An Empirical Study on Challenges for LLM Developers' and 'The Llama 3 Herd of Models'. The first gives a systematic overview of the different areas of LLM app development and makes it easier to see what is missing from your toolset. The Llama 3 paper is a brilliant overview of how LLMs themselves are developed; it provides an understanding one level below the application layer. Understanding the internals of LLMs helps everyone make better choices at the app level.
The biggest roadblock on the path to GenAI is adopting a new mindset. It requires accepting that LLM apps make mistakes and work in less than 100% of cases, which is unusual for some software developers. The best way to change a mindset is to use GenAI products such as GitHub Copilot every day. This keeps the focus on GenAI and also shows that GenAI products require a new UX: a cooperative model of communication between the user and the LLM.
Resources:
https://github.com/fate-ubw/raglab - RAGLAB framework
https://github.com/fate-ubw/RAGLAB/tree/main/raglab/rag - RAGLAB supporting algorithms
https://arxiv.org/abs/2408.11381 - RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation
https://arxiv.org/abs/2408.05002 - An Empirical Study on Challenges for LLM Developers
https://huggingface.co/learn/cookbook/en/llm_judge - Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation
https://www.promptfoo.dev/ - Open-source LLM testing used by 30,000+ developers
https://developers.cloudflare.com/ai-gateway/ - Observe and control your AI applications.
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ - The Llama 3 Herd of Models