RouteLLM to the rescue
When you start a new GenAI project, one of the first questions is which LLM to use. There are several strategies. You can start with the default choice, ChatGPT. If it doesn't work for you, that's where the real problem starts: identifying the criteria for choosing an LLM. Cost, quality, latency, training data, domain, and size are just a few of them. Thankfully, RouteLLM addresses this problem.
This week, Anthropic introduced prompt caching, promising to reduce costs and improve latency. The catch is that GPT-4o is arguably the best model today, so in theory, choosing Claude might mean trading quality for cost and speed. In practice, though, quality depends on your specific use case.
The idea of RouteLLM is simple: analyze each request at runtime and decide which LLM should handle it. Prompt1 might be executed by GPT-4o, Prompt2 should be sent to Claude, and Prompt3 will be best served by Phi-3.
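As a rough illustration (not the RouteLLM library's actual API), here is a minimal dispatch sketch in Python. The score_prompt() stub stands in for the fine-tuned router described below, and the sketch assumes all three models are reachable behind an OpenAI-compatible endpoint:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving all models

def score_prompt(prompt: str) -> float:
    """Placeholder for the fine-tuned router model (see next paragraph).
    Here: a crude length heuristic, only to keep the sketch runnable."""
    return min(len(prompt) / 500, 1.0)

def route(prompt: str) -> str:
    """Map the predicted difficulty score to a model name."""
    score = score_prompt(prompt)
    if score > 0.7:
        return "gpt-4o"        # hardest prompts: strongest, priciest model
    if score > 0.3:
        return "claude-3-5-sonnet-20240620"
    return "phi-3"             # easy prompts: cheapest model

def answer(prompt: str) -> str:
    model = route(prompt)
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```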
To create RouteLLM, you first need a set of example prompts. Next, send these prompts to the LLMs in question and store the answers. Once we have this dataset, we can ask GPT-4o to score each answer from 0 to 5. With prompts, answers, and scores in hand, we are ready to fine-tune an LLM to predict the score for every new prompt we receive from a user. At runtime, we call the fine-tuned LLM and, based on the predicted score, route the request to the best LLM.
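Here is a hedged sketch of the data-collection and judging steps, assuming the OpenAI Python client; the judge prompt, the candidate list, and the example prompts are all illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()
CANDIDATES = ["gpt-4o", "gpt-4o-mini"]  # swap in any models you can call

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def judge(prompt: str, answer: str) -> int:
    """Use GPT-4o as a judge; expects a single digit 0-5 back."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Rate this answer from 0 to 5. Reply with one digit only.\n"
                       f"Question: {prompt}\nAnswer: {answer}",
        }],
    )
    return int(resp.choices[0].message.content.strip()[0])

dataset = []
for prompt in ["Summarize the GDPR in one paragraph.", "What is 17 * 23?"]:
    for model in CANDIDATES:
        answer = ask(model, prompt)
        dataset.append({"prompt": prompt, "model": model,
                        "score": judge(prompt, answer)})

# The (prompt, model, score) triples become the fine-tuning set for the router.
with open("router_train.jsonl", "w") as f:
    f.write("\n".join(json.dumps(row) for row in dataset))
```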
I've just described how to create RouteLLM using a Causal LLM Classifier, but other options exist, such as Similarity-Weighted Ranking, a Matrix Factorization Model, a BERT-Based Classifier, or a Hybrid Ensemble Method. The Hybrid Ensemble Method is the second most interesting to me. In this case, we analyze a prompt using a small model and do the routing based on the analysis.
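For the small-model variant, a sketch might look like this; note that "my-org/prompt-router" is a hypothetical fine-tuned checkpoint, not a real model on the Hugging Face Hub:

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint that labels prompts "EASY" or "HARD".
classifier = pipeline("text-classification", model="my-org/prompt-router")

def route(prompt: str) -> str:
    label = classifier(prompt)[0]["label"]
    return "gpt-4o" if label == "HARD" else "phi-3"
```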
In the example above, we call real LLMs while developing and testing our RouteLLM implementation. An alternative is to use RouterBench, a benchmark for LLM routing. It lets us test our own RouteLLM implementation without calling real LLMs, because it ships precomputed responses. The benchmark is built around extensive coverage, practical relevance, and extensibility. In short, it includes tasks for commonsense reasoning, knowledge-based language understanding, conversation, math, and coding.
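Since the benchmark ships precomputed answers, quality scores, and costs per model, an offline evaluation loop needs no API calls. The sketch below reuses the route() function from above; the file name and field names are assumptions about the dataset layout, not RouterBench's actual schema:

```python
import json

total_quality, total_cost = 0.0, 0.0
with open("routerbench.jsonl") as f:
    rows = [json.loads(line) for line in f]

for row in rows:
    model = route(row["prompt"])            # our router under test
    total_quality += row["quality"][model]  # precomputed judge score
    total_cost += row["cost"][model]        # precomputed API cost

n = len(rows)
print(f"avg quality: {total_quality / n:.3f}, avg cost: ${total_cost / n:.5f}")
```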
The results are very promising. As part of the pilot study, the authors built an Oracle router, one that knows in hindsight which model answers each query best, and it achieved near-optimal performance at a low cost. "Moreover, the surprising observation that GPT-4 is seldom chosen suggests the existence of less expensive LLMs that can deliver high-quality answers for most queries."
Resources:
https://arxiv.org/abs/2403.12031 - RouterBench: A Benchmark for Multi-LLM Routing System
https://arxiv.org/abs/2406.18665 - RouteLLM: Learning to Route LLMs with Preference Data
https://blog.withmartian.com/post/router-bench - Introducing RouterBench
https://www.anyscale.com/blog/building-an-llm-router-for-high-quality-and-cost-effective-responses - Building an LLM Router for High-Quality and Cost-Effective Responses
https://blog.gopenai.com/routellm-simplifying-the-routing-of-large-language-models-778f855884de - RouteLLM: Simplifying the Routing of Large Language Models
https://www.vellum.ai/blog/claude-3-5-sonnet-vs-gpt4o - Claude 3.5 Sonnet vs GPT-4o