The power of small LLMs.
Phi-2 from Microsoft has reached 220K downloads on Hugging Face, which makes it one of the most popular language models.
Phi-2 has 2.7B parameters, far fewer than Llama-7B. Despite this size, the model is compared against Llama-70B and Gemini Nano 2 on popular benchmarks. So the model is small and its performance is excellent, but what is behind it, and how can we use it?
Phi-2 is based on a simple idea: what if we reduce the size of the training dataset but increase its quality? This idea was explored in the "Textbooks Are All You Need" paper, which presented the Phi-1 model. The authors selected "textbook quality" data from the web and added synthetically generated textbooks produced with GPT-3.5. The results were impressive: Phi-1 attains 50.6% pass@1 accuracy on HumanEval and 55.5% on MBPP.
One month ago, Phi-2 was released, and last week its license was changed to MIT. As a result, we now have a very capable small language model to experiment with and even build products on top of.
Phi-2 has been pretrained but not fine-tuned. Creating an LLM can be split into two big stages: pretraining and fine-tuning with human feedback. After the first stage, we get a model that predicts the next word very well, but its answers are not always what one expects. After the second stage, the model is capable of performing a specific task. In the Llama world, Llama 2 is the pretrained model, whereas Llama-2-Chat is fine-tuned for dialogue use cases. The ChatGPT website provides access to fine-tuned models.
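To see what "pretrained but not fine-tuned" means in practice, here is a minimal sketch that loads Phi-2 with the transformers library and continues a prompt. The prompt format and generation settings are illustrative assumptions, not an official recipe:

```python
# Minimal sketch: run the pretrained (not fine-tuned) Phi-2 as a plain completion model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # only needed on older transformers versions
)

# The base model simply continues text: it predicts the next tokens,
# it is not tuned to follow instructions or hold a dialogue.
prompt = "Instruct: Explain what fine-tuning of a language model is.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```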
To fine-tune a model, we need a dataset. As of today, the OpenAssistant Conversations Dataset (OASST1) is one of the best options for fine-tuning an LLM for dialogue use cases. The dataset consists of 163K annotated messages.
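Loading the dataset with the Hugging Face datasets library is straightforward; the quick sketch below inspects it. Field names are taken from the dataset card, and each record is a single message in a conversation tree:

```python
# Quick look at the OpenAssistant Conversations dataset (OASST1).
from datasets import load_dataset

oasst1 = load_dataset("OpenAssistant/oasst1")
print(oasst1)  # splits of individual messages

example = oasst1["train"][0]
# "parent_id" links replies to prompts, "role" is "prompter" or "assistant".
print(example["role"], example["lang"])
print(example["text"][:200])
```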
I used code from the https://github.com/mkthoma/llm_finetuning repository to fine-tune Phi-2 on Google Colab. Fine-tuning the model on 16M tokens took 1.5 hours on an A100 GPU or 2 hours on a V100 GPU. Compare this to pretraining: Phi-2 was pretrained for 14 days on 96 A100 GPUs with a dataset of 1.4T tokens.
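For orientation, here is a condensed sketch of the general approach (parameter-efficient LoRA fine-tuning with transformers and peft). The hyperparameters, target module names, and data preparation below are illustrative assumptions, not the repository's exact code; a real run would first assemble the OASST1 conversation trees into prompt/response texts.

```python
# Condensed LoRA fine-tuning sketch for Phi-2 (illustrative, not the repo's exact code).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Train small LoRA adapters instead of all 2.7B parameters; the module names to
# target depend on your transformers version / model revision.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "dense"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Tokenize the raw message texts (simplified: no conversation formatting).
dataset = load_dataset("OpenAssistant/oasst1", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi2-oasst1", per_device_train_batch_size=4,
                           gradient_accumulation_steps=4, num_train_epochs=1,
                           bf16=True,  # assumes an A100-class GPU
                           logging_steps=50),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

LoRA keeps the base weights frozen and trains only small adapter matrices, which is what makes fine-tuning a 2.7B-parameter model on a single Colab GPU feasible.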
One tip: when loading a model or tokenizer with "AutoTokenizer.from_pretrained", specify a revision. This pins your code to a specific version of the repository, so it keeps working (and you don't have to chase new dependencies) when the model gets updated.
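For example ("main" is a placeholder here; in practice you would pin a concrete commit hash from the model page):

```python
# Pinning a revision keeps the tokenizer, weights, and any remote code fixed.
from transformers import AutoModelForCausalLM, AutoTokenizer

revision = "main"  # replace with a specific commit hash for full reproducibility
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", revision=revision)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", revision=revision)
```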
Resources:
https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ - Phi-2: The surprising power of small language models
https://www.microsoft.com/en-us/research/publication/textbooks-are-all-you-need/ - Textbooks Are All You Need
https://huggingface.co/microsoft/phi-2 - Phi-2 on Hugging Face
https://arxiv.org/abs/2306.11644 - Textbooks Are All You Need
https://github.com/mkthoma/llm_finetuning - Finetuning of Open Source LLM Models
YouTube - OpenAssistant is Completed
YouTube - Developing Llama 2 | Angela Fan
https://huggingface.co/datasets/OpenAssistant/oasst1 - OpenAssistant Conversations Dataset (OASST1)
https://huggingface.co/microsoft/phi-2/blob/main/modeling_phi.py - Phi2 architecture