Llama 3 report Q&A
Why is the largest Llama 3 model 405B parameters, not 300B or 500B?
Several factors determine the model size. One is the compute budget: for Llama 3, it was 3.8 × 10²⁵ FLOPs. A new scaling law was developed to predict the optimal number of training tokens for a given compute budget. The prediction was 402B parameters trained on 16.55T tokens, along with the observation that small changes in these numbers would not affect the resulting performance much. The target performance is another factor in determining the model size: to build a high-quality model, different data sources must be chosen and combined into a mix. Several small models were trained on different data mixes to find the best mix for training the bigger model. As a result, even before training started, the data mix, the model size, and a performance forecast were defined.
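As a quick sanity check (not the paper's own fitting procedure), the widely used approximation C ≈ 6ND relates training compute C to parameter count N and token count D, and it lands close to the stated numbers:

```python
# A minimal sanity check, assuming the common C ~= 6 * N * D approximation
# for dense-Transformer training FLOPs. The paper fits its own scaling law;
# this approximation is not the paper's exact method.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

budget = 3.8e25      # Llama 3 compute budget, FLOPs
n_params = 402e9     # compute-optimal model size from the paper
n_tokens = 16.55e12  # compute-optimal token count from the paper

print(f"{training_flops(n_params, n_tokens):.2e} FLOPs")  # ~3.99e25, close to the budget
```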
Is the model called Llama 3 or Llama 3.1?
Strictly speaking, Llama 3 was presented in April 2024, and Llama 3.1 in July 2024. Llama Guard 3 (a moderation model) and Prompt Guard (a tool to safeguard against prompt attacks) were also published in July.
What does Llama 3 support?
Llama 3 is a foundational model for a family of models. The biggest model is 405B parameters and is comparable in performance to GPT-4. Llama 3 natively supports multilinguality, coding, reasoning, and tool use. There are also experiments to add support for image, speech, and video.
What is the difference between Llama 2 and Llama 3?
Llama 3 was trained on 15T tokens versus 1.8T for Llama 2, with a compute budget almost 50× larger. Llama 3 was trained on Meta's production cluster, whereas Llama 2 was trained on a research cluster; the storage cluster for Llama 3 sustained a peak throughput of 7 TB/s. The main differences, though, are in training data quality, diversity, and quantity.
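The ~50× figure roughly checks out with the same C ≈ 6ND approximation used above (the Llama 2 inputs here, 70B parameters and ~1.8T tokens, are assumptions taken from this answer, not re-derived from the paper):

```python
# A rough check of the "~50x more compute" claim via C ~= 6 * N * D.
llama3_flops = 3.8e25               # reported Llama 3 compute budget
llama2_flops = 6.0 * 70e9 * 1.8e12  # largest Llama 2: 70B params, ~1.8T tokens

print(f"ratio: {llama3_flops / llama2_flops:.1f}x")  # ~50x
```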
Why doesn't Llama 3 use a Mixture of Experts?
Mistral's models built on the Mixture of Experts architecture showed really great performance, and there was speculation that GPT-4 uses the same architecture. But Llama 3 is almost the same as Llama 2 from an architecture perspective: a standard dense Transformer model. The team prioritized training stability and the ability to scale the model development process.
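To make "standard dense Transformer" concrete: in a dense block every token passes through one shared feed-forward network, whereas an MoE block routes each token to a few expert FFNs. Below is a minimal PyTorch sketch of a Llama-style dense SwiGLU feed-forward block (the dimensions are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSwiGLUFFN(nn.Module):
    """Dense feed-forward block in the Llama style: every token passes
    through the same weights. An MoE layer would instead select top-k
    expert FFNs per token via a learned router."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate(x)) * up(x), then project back to model dim
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = DenseSwiGLUFFN(dim=4096, hidden_dim=14336)
out = ffn(torch.randn(2, 8, 4096))  # (batch, seq, dim) -> same shape
```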
How was Llama 2 used in Llama 3 training?
Llama 2 played several supporting roles (a sketch of applying such a classifier follows the list):
- Quality filtering: Roberta-based quality classifiers were trained on Llama 2 predictions.
- Code and reasoning data: DistilRoberta classifiers were trained on Llama 2 annotations of web data.
- Multilinguality: Llama 2-based classifiers were used to ensure the high quality of multilingual documents.
- Performance forecasting: Llama 2 models were used to help predict the resulting performance of Llama 3.
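For intuition, here is a hedged sketch of how a Roberta-style quality classifier could be applied at filtering time, using Hugging Face transformers. The checkpoint name, label convention, and threshold are illustrative assumptions; Meta's actual classifiers are not public:

```python
# Hypothetical application of a quality classifier to web documents.
# "distilroberta-base" is a stand-in: imagine it fine-tuned on
# Llama 2-generated quality labels, which is not what the stock
# checkpoint contains.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilroberta-base"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def quality_score(document: str) -> float:
    """Probability the document is 'high quality' (label index 1 assumed)."""
    inputs = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.softmax(dim=-1)[0, 1].item()

keep = quality_score("Some candidate web document ...") > 0.5  # threshold is arbitrary here
```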
References:
The Llama 3 Herd of Models: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/