Paper summary

The Llama 3 Herd of Models

This note summarizes how an open multilingual foundation language model release can jointly handle pre-training, post-training, tool use, long context, and safety.

Authors

Llama Team, AI @ Meta

Publication

arXiv, 2024-07-23

Context

  • The main focus is how Llama 3 aims for frontier-class performance by emphasizing data, scale, alignment, and infrastructure rather than complex architectural changes.

What

  • This paper argues that Llama 3, especially the 405B dense Transformer, can reach performance comparable to GPT-4-class proprietary models on many benchmarks.
  • The 8B and 70B models are also strong for their size, and the family tries to expand coding, reasoning, multilingual, tool-use, and long-context capability within one model line.
  • The paper also reports competitive multimodal extension results, while noting that those models were not yet ready for broad release.

Why

  • For open models, scaling parameter count alone is not enough. Data quality, post-training quality, safety, and serving practicality must all align before a model is truly usable at the frontier.
  • This paper is therefore a case study in how far a dense Transformer can close the performance gap through scale, alignment, and safety engineering without relying on major architectural novelty.

How

  • The base architecture stays a dense Transformer rather than a mixture-of-experts (MoE) design, and pre-training uses about 15T multilingual tokens; the paper reports roughly 3.8 × 10^25 FLOPs of training compute for the 405B model.
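  • Continued pre-training then extends the context window to 128K tokens. Post-training follows, with repeated rounds of supervised finetuning (SFT), rejection sampling, and direct preference optimization (DPO) to strengthen reasoning, coding, tool use, multilingual capability, and steerability.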
  • For safety, the system combines data filtering, adversarial-prompt safety finetuning, and system-level filters such as Llama Guard 3.
  • Evaluation is broad: benchmark scores, human evaluation, adversarial robustness, contamination analysis, and internal safety evaluations such as cyber/CBRN uplift are all included.
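
To make the preference-optimization step above concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes the summed log-probabilities of the chosen and rejected responses are already available from the policy and from a frozen reference model. The function names, the argument names, and the default beta = 0.1 are illustrative assumptions, and the paper's actual recipe may include modifications not shown here.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a full response.
    """
    # How much the policy has shifted toward each response relative
    # to the frozen reference model.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-beta * (chosen_shift - rejected_shift))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy favors the chosen response more than the reference does: lower loss.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model, which is what lets preference pairs collected via rejection sampling steer the model without a separate reward-model rollout loop.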

Pitfalls

  • The multimodal models were still under development and not release-ready, so the multimodal results should not be interpreted as immediate general availability.
  • Human evaluation and contamination analysis are inherently imperfect, and safety testing is not exhaustive, so jailbreak and non-English risks may remain.
  • Internal safety benchmarks and cyber/CBRN-related evaluations are difficult for external groups to reproduce exactly.
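
One way to see why contamination analysis is imperfect is to sketch a generic token n-gram overlap check. The code below is an illustrative assumption, not the paper's exact procedure: the function names, the whitespace tokenization, and the small window size are all hypothetical choices made to keep the example short.

```python
def build_ngram_set(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Collect every contiguous n-gram in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(
    example_tokens: list[str],
    corpus_ngrams: set[tuple[str, ...]],
    n: int,
) -> float:
    """Fraction of example tokens covered by an n-gram also seen in the corpus."""
    if not example_tokens:
        return 0.0
    covered = [False] * len(example_tokens)
    for i in range(len(example_tokens) - n + 1):
        if tuple(example_tokens[i:i + n]) in corpus_ngrams:
            # Mark every token inside the matching window as covered.
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(example_tokens)


if __name__ == "__main__":
    corpus = "the capital of france is paris".split()
    corpus_ngrams = build_ngram_set(corpus, n=3)

    verbatim = "the capital of france is paris".split()
    paraphrase = "paris is the french capital city".split()

    print(contamination_ratio(verbatim, corpus_ngrams, n=3))
    print(contamination_ratio(paraphrase, corpus_ngrams, n=3))
```

In this toy case the verbatim copy is fully covered while the paraphrase shares no 3-gram with the corpus, so an exact-match check would miss it even though the content leaked. Any threshold on this ratio is a judgment call, which is part of why such analysis cannot rule contamination out.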

Next steps

  • Track how far open dense LLMs can progress without architectural novelty, and compare that trend with later open models.
  • Run or look for ablations that separate how much tool use, long context, and safety post-training each contribute to benchmark gains.