Ideas that simplify LLM (Llama 2) adoption for everyone. LoRA, QLoRA and LoRAX: what is the difference?
One of the blockers to broad adoption of LLMs is that these models are enormous, which creates an inefficiency for many businesses. On one side, the value that LLMs bring; on the other, the cost of the hardware to run them. Sometimes we end up paying for capacity we don't need.
In other words, large language models require high-end GPUs, which drives up costs. Out of the box, LLMs don't behave exactly as we want, so we fine-tune them for structured responses, and each fine-tuned instance requires its own resources. On top of that, loading a model into memory for a specific request, executing the query and returning a response takes time, which increases latency.
That's why LoRA, QLoRA and LoRAX were developed.
What is an LLM? At its core, it is a big file of numbers. These numbers (parameters) represent the model as matrices organized into layers. We feed the input into the first layer, pass the output of the first layer into the second, and so on, until we produce a result that can be returned to the user. The question is how to make these matrices smaller and the calculations more efficient.
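As a rough mental model (a toy sketch, not Llama 2's actual architecture), a forward pass is just repeated matrix multiplications through the layers:

```python
import numpy as np

# Toy "model": two layers, each represented by a weight matrix.
# Real LLMs have dozens of transformer layers and billions of parameters.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 16))   # layer 1 weights
W2 = rng.standard_normal((16, 4))   # layer 2 weights

x = rng.standard_normal(8)          # input vector (e.g. a token embedding)

h = np.maximum(x @ W1, 0)           # layer 1: multiply, then a non-linearity
y = h @ W2                          # layer 2: produces the output
print(y.shape)                      # (4,)
```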
LoRA is a technique that makes fine-tuning of large language models far more efficient. Full fine-tuning requires updating the weights in all layers, but with LoRA we reduce the number of trainable parameters: instead of training the whole model, we train only small adapter layers. In the case of LoRA for Llama 2 7B, the number of trainable parameters drops from 6,738,415,616 to 4,194,304. To achieve this, LoRA decomposes a weight update matrix into two much smaller matrices.
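Here is a minimal sketch of the idea in PyTorch. It is illustrative only: the rank r, the scaling factor and which layers get adapters are assumptions, not Llama 2 defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the original weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))    # 4096 matches Llama 2 7B's hidden size
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                             # 65,536 instead of 16,777,216 for the full matrix
```

The same decomposition applied across all attention layers is what brings the trainable parameter count for Llama 2 7B down into the millions.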
QLoRA, on the other hand, helps by reducing the memory footprint of the model: it adds quantization on top of LoRA, lowering the precision of the base model's parameters. By default, Llama 2 7B takes 21.33 GB with parameters stored as bfloat16. If we change the data type to bnb.nf4, the model size drops to 14.18 GB.
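In practice this is usually done by loading the base model through bitsandbytes. A minimal sketch using the Hugging Face transformers integration (exact argument names can vary between library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model with 4-bit NF4 quantization, QLoRA-style.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the bnb.nf4 data type mentioned above
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still happens in bfloat16
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```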
Now imagine a situation: we have 100 fine-tuned Llama 2 models. How much hardware do we need for this? Do we need 100 instances, or will 10 be enough? This is where LoRAX comes in. LoRAX lets us quickly swap adapter layers without reloading the whole model: instead of reloading 21 GB, we reload only about 157 MB. So to keep latency low we don't need to allocate 100 servers; far fewer machines will do the job.
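LoRAX itself is a serving layer, but the underlying adapter-swapping idea can be sketched with the PEFT library. The adapter paths and names below are hypothetical placeholders, and `base_model` is the quantized Llama 2 model loaded above; this is not the LoRAX API itself, just the same concept in miniature.

```python
from peft import PeftModel

# One copy of the ~21 GB base model stays in memory; only the small
# adapter weights (tens to hundreds of MB) are swapped per use case.
model = PeftModel.from_pretrained(base_model, "adapters/customer-support", adapter_name="support")
model.load_adapter("adapters/sql-generation", adapter_name="sql")

model.set_adapter("support")   # route a request to the support fine-tune
# ... generate ...
model.set_adapter("sql")       # route the next request to the SQL fine-tune
```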
One more thing to remember: LoRA comes with roughly 10 additional hyperparameters, and their values can noticeably affect the quality of the fine-tuned model.
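The main knobs, as exposed by the PEFT library (the specific values below are illustrative, not recommendations):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank matrices
    lora_alpha=16,                         # scaling factor for the adapter update
    lora_dropout=0.05,                     # dropout applied to the adapter inputs
    target_modules=["q_proj", "v_proj"],   # which layers get adapters
    bias="none",                           # whether bias terms are trained
    task_type="CAUSAL_LM",
)
```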
In-depth review:
https://lightning.ai/pages/community/lora-insights/ - Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments
YouTube - QLoRA is all you need (Fast and lightweight model fine-tuning)
https://predibase.com/blog/lora-exchange-lorax-serve-100s-of-fine-tuned-llms-for-the-cost-of-one - LoRA Exchange (LoRAX): Serve 100s of Fine-Tuned LLMs for the Cost of 1

