LLMs coming to mobile
In the last few months, LLMs have stepped into the mobile world. Apple's OpenELM (Efficient Language Model) joined Phi, TinyLlama, and other small language models (SLMs).
While the previous generation of the Phi family, Phi-2, was released only as a pre-trained model in a single size (2.7B), this week we received instruction-tuned Phi-3 in three sizes: mini (3.8B), small (7B), and medium (14B). A 4-bit quantized version of Phi-3-mini generates 12 tokens per second on an iPhone.
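For readers who want to poke at Phi-3-mini themselves, here is a minimal sketch using Hugging Face transformers. The model id and the use of trust_remote_code are assumptions based on the public release, so check the model card before relying on them:

```python
# Minimal sketch: chatting with Phi-3-mini via Hugging Face transformers.
# The model id "microsoft/Phi-3-mini-4k-instruct" is an assumption from the
# public release; verify it on the Hugging Face model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Draft a reminder text for a 7am alarm."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```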
The OpenELM family consists of four models: 270M, 450M, 1.1B, and 3B. For each size, both pre-trained and instruction-tuned versions were published, eight models in total. OpenELM allocates a variable number of parameters to each transformer layer. This technique, layer-wise scaling, was introduced in the paper "DeLighT: Deep and Light-weight Transformer" and is used to achieve higher accuracy with an optimal number of parameters; a sketch of the idea follows below.
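To make layer-wise scaling concrete, here is a rough Python sketch: instead of giving every transformer layer the same width, the number of attention heads and the FFN width grow linearly with depth. The alpha/beta ranges below are illustrative, not the values from the paper:

```python
# Rough sketch of layer-wise scaling (the OpenELM/DeLighT idea): parameters
# are allocated non-uniformly across depth. Ranges here are illustrative.
def layerwise_config(num_layers, dim=2048, head_dim=64,
                     alpha=(0.5, 1.0),   # scales attention heads per layer
                     beta=(0.5, 4.0)):   # scales FFN width per layer
    configs = []
    for i in range(num_layers):
        t = i / max(num_layers - 1, 1)   # 0.0 at the first layer, 1.0 at the last
        a = alpha[0] + t * (alpha[1] - alpha[0])
        b = beta[0] + t * (beta[1] - beta[0])
        configs.append({
            "layer": i,
            "num_heads": max(1, int(a * dim / head_dim)),
            "ffn_dim": int(b * dim),
        })
    return configs

# Early layers get fewer heads and a narrower FFN; later layers get more.
for cfg in layerwise_config(4):
    print(cfg)
```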
The idea of small language models rests on two observations: high-quality data leads to better results, and generating meaningful text doesn't require the whole English vocabulary. See the TinyStories paper for a deep dive.
Small models will be capable of assisting users with everyday tasks on mobile devices, such as setting alarm clocks, and executing them with low latency. They also address privacy concerns, allowing users to have conversations entirely on their devices without sending data to the cloud. Beyond that, there is a broader application: a small on-device model can act as an orchestrator for heavy user requests, delegating to other models, services, or internet searches when needed, as sketched below.
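Here is a hypothetical sketch of that orchestrator pattern; all function names are illustrative. A small on-device model classifies the request and either answers it locally or routes it to a heavier cloud model:

```python
# Hypothetical orchestrator pattern: the on-device SLM triages requests.
# All functions are illustrative stubs, not a real API.
def run_on_device_model(prompt: str) -> str:
    return f"[on-device SLM] handling: {prompt}"      # low latency, private

def call_cloud_model(prompt: str) -> str:
    return f"[cloud LLM] handling: {prompt}"          # heavier reasoning, search

def classify_request(prompt: str) -> str:
    # In practice this decision would itself come from the on-device SLM;
    # a keyword stub stands in for it here.
    if any(w in prompt.lower() for w in ("alarm", "timer", "reminder")):
        return "local"
    return "cloud"

def handle(prompt: str) -> str:
    return (run_on_device_model(prompt) if classify_request(prompt) == "local"
            else call_cloud_model(prompt))

print(handle("Set an alarm for 7am"))
print(handle("Summarize the latest research on layer-wise scaling"))
```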
If we compare the two models, Phi-3 and OpenELM, we can see that Llama hugely influenced both. In the case of OpenELM, there is also an influence from OLMo, the first truly open-sourced model: its authors shared the model weights, the training data (Dolma), model checkpoints, and much more to show the whole process of LLM creation. OpenELM follows OLMo's transparency standards and shares a lot of internal materials: "We release the complete framework, encompassing data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs, to facilitate open research". I've prepared a diff of the two model implementations for those interested in the details: https://github.dev/shchahrykovich/phi3-vs-openelm/pull/2/files
One more thing: tooling for training and inference. OpenELM was trained with Apple's CoreNet library, which also includes a folder with an MLX implementation for inference. CoreNet uses PyTorch, while MLX is "an array framework for Apple silicon." MLX is faster than PyTorch, even when PyTorch uses its Apple-silicon backend (MPS). This hints at an interesting split where the same tooling is used for training SLMs and LLMs, but new tooling takes over at inference time.
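In the spirit of the benchmarks linked below, here is a small, unscientific sketch that times one matmul in MLX and in PyTorch on the MPS backend. It assumes an Apple silicon Mac with both packages installed; real numbers vary widely with shapes, dtypes, and warm-up:

```python
# Unscientific single-matmul timing: MLX vs PyTorch on MPS.
import time

import mlx.core as mx
import torch

N = 4096

# MLX is lazy: mx.eval forces the computation so the timing is meaningful.
a = mx.random.normal((N, N))
start = time.perf_counter()
mx.eval(a @ a)
print(f"MLX:         {time.perf_counter() - start:.3f}s")

# PyTorch on the Metal Performance Shaders (MPS) backend; synchronize so the
# GPU work is actually finished before we stop the clock.
b = torch.randn(N, N, device="mps")
torch.mps.synchronize()
start = time.perf_counter()
c = b @ b
torch.mps.synchronize()
print(f"PyTorch MPS: {time.perf_counter() - start:.3f}s")
```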
References:
https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential - Tiny but mighty: The Phi-3 small language models with big potential
https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/ - Introducing Phi-3: Redefining what’s possible with SLMs
https://arxiv.org/abs/2404.14219 - Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
https://arxiv.org/abs/2404.14619 - OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
https://arxiv.org/abs/2008.00623 - DeLighT: Deep and Light-weight Transformer
https://arxiv.org/abs/2401.02385 - TinyLlama: An Open-Source Small Language Model
https://towardsdatascience.com/how-fast-is-mlx-a-comprehensive-benchmark-on-8-apple-silicon-chips-and-4-cuda-gpus-378a0ae356a0 - How Fast Is MLX? A Comprehensive Benchmark on 8 Apple Silicon Chips and 4 CUDA GPUs
https://towardsdatascience.com/mlx-vs-mps-vs-cuda-a-benchmark-c5737ca6efc9 - MLX vs MPS vs CUDA: a Benchmark
https://towardsdatascience.com/adding-custom-layers-on-top-of-a-hugging-face-model-f1ccdfc257bd - Adding Custom Layers on Top of a Hugging Face Model
https://arxiv.org/abs/2305.07759 - TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
https://github.com/apple/corenet - CoreNet: A library for training deep neural networks