The Large World of Small(er) Language Models

NLP
LLM
Lit. Review
I explore Small(er) Language Models, examining the techniques and innovations that make these models more efficient than their larger counterparts.
Author

Kasra Lekan

Published

September 18, 2024

Modified

October 11, 2024

Technical Content Disclaimer

This post is not a journal-level review; my research for it was informational and did not exhaust the search space. If you notice any key papers or references that I have missed, or if I have misinterpreted the findings of any reference, please let me know in the comments.

An Aside on Naming and Definitions

Naming in science can sometimes lead to confusion when the original name and purpose of something evolve. The relevant example here is “Large Language Models” (LLMs). Nearly all such modern models rest on three things:

  1. A large amount of training data used to train a deep stack of underlying layers.
  2. Attention layers, arranged in (mostly) Transformer-like stacks, whose contents have evolved since Vaswani et al. (2017).
  3. Training on sequences of tokens, which mostly correspond to natural language.

A few caveats, however, are necessary even for this basic list. First, the term “large” is relative: it indicates that the original LLMs were much larger than any previous models. Second, there is nothing ensuring the primacy of Transformer layers. Recent models focused on reducing inference compute integrate components of state space models (SSMs), most notably the selective SSM proposed in Gu and Dao (2023). I highly recommend watching Daniel Fu’s talk from NeurIPS 2023 for a foundational overview of this area (Fu 2023). Third, tokens are not necessarily language. To give just one example from my own work, PLAN-BERT (Shao, Guo, and Pardos 2021) uses course codes at UC Berkeley as tokens to predict course schedules.

As usual, Andrej Karpathy has already expressed this idea publicly:

It’s a bit sad and confusing that LLMs (“Large Language Models”) have little to do with language; It’s just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something. (@karpathy 2024)

🛩️ What is a Small Language Model?

As with “large” models, the definition of “small” is somewhat loose. Rather than fixing a specific size cutoff, here I will adapt the name to Small(er) Language Models. Many of the models considered here either increase inference compute efficiency through explicit architectural choices or condense the capabilities of an effective language model into a smaller parameter count.

Small(er) LMs: Technique Overview

Here I explore a collection of modern small(er) language models to determine which techniques are used to create more efficient models. I break these techniques down into pre-train, train, and post-train stages.

📚 Pre-train: data curation and synthetic data

Several modern approaches incorporate high-quality synthetic data into their training datasets. For example, the Phi-2 model (Javaheripi and Bubeck 2023) uses synthetic datasets specifically created to teach common sense reasoning and general knowledge, covering areas such as science, daily activities, and theory of mind. The Phi-3 model (Abdin et al. 2024) takes this concept further by employing a sophisticated prompting and seeding formula inspired by the TinyStories approach. This method involves (Li et al. 2023):

  1. Collecting publicly available information into an initial dataset
  2. Using a large language model (LLM) to synthesize new content based on this data
  3. Filtering the generated content for quality
  4. Feeding the filtered content back into the LLM for further synthesis

This iterative process allows researchers to build up a high-quality corpus of data large enough to train a capable small language model over several weeks.
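To make the loop concrete, here is a minimal Python sketch of this kind of iterative generate-and-filter pipeline. The generate and score_quality functions, the seed sentences, and the threshold are hypothetical placeholders, not the actual Phi-3 or TinyStories implementation:

```python
import random

def generate(prompt: str) -> str:
    """Placeholder for a call to a large 'teacher' LLM (an API call in practice)."""
    return f"Synthetic passage expanding on: {prompt}"

def score_quality(text: str) -> float:
    """Placeholder quality filter (a trained classifier or LLM judge in practice)."""
    return random.random()

def build_synthetic_corpus(seed_documents, rounds=3, quality_threshold=0.7):
    corpus = list(seed_documents)                        # 1. curated public data as seeds
    for _ in range(rounds):
        candidates = [generate(doc) for doc in corpus]   # 2. synthesize new content
        kept = [c for c in candidates
                if score_quality(c) >= quality_threshold]  # 3. filter for quality
        corpus.extend(kept)                              # 4. feed filtered content back in
    return corpus

if __name__ == "__main__":
    seeds = ["Water boils at 100 °C at sea level.",
             "Plants convert sunlight into chemical energy."]
    print(len(build_synthetic_corpus(seeds)))
```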

🚝 Train: Distillation and Architecture

Knowledge distillation

Knowledge distillation is a technique where a smaller model (the student) is trained to mimic the behavior of a larger, more complex model (the teacher). This approach allows the smaller model to benefit from the knowledge captured by the larger model while maintaining a more compact size, e.g., Gemma 2 (Schmid et al. 2024). Knowledge distillation is not new, even in the language model space (Sanh et al. 2019).
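As a minimal sketch of the idea (not the exact recipe used by Gemma 2 or DistilBERT), the student can be trained on a temperature-softened KL divergence between its logits and the teacher’s, blended with the usual cross-entropy loss on the ground-truth tokens; the temperature and mixing weight below are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft-target KL term (teacher -> student) with hard-label cross-entropy."""
    # Soft targets: compare temperature-softened distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2

    # Hard targets: standard next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

# Toy example: a vocabulary of 100 tokens and a batch of 8 positions.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```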

Architectural Approach Examples

Several architectural innovations have been employed to create more efficient small language models. A few examples include:

  1. Grouped-Query Attention (GQA): This attention mechanism shares key and value heads across groups of query heads, shrinking the KV cache and improving inference efficiency. See Abdin et al. (2024).
  2. Embedding Tying: This technique reduces the model’s parameter count by sharing weights between the input embedding and the output projection layer (see the sketch after this list).
  3. Hybrid Models: These combine Mamba layers (a type of state space model) with shared attention layers, aiming to balance the efficiency of state space models with the expressiveness of attention mechanisms, e.g., Goel (2024).
  4. Mixture-of-Experts Models: While not strictly a small-model technique, the Mixture-of-Experts approach can create more efficient large models. For example, Phi-3.5-MoE comprises 16 experts of 3.8B parameters each; during inference it activates only a subset of these experts (typically two), resulting in roughly 6.6B active parameters out of a total of 42B.
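To illustrate the embedding-tying item above, here is a minimal PyTorch sketch; TinyLM and its dimensions are toy placeholders rather than any particular model’s implementation:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal decoder skeleton illustrating embedding tying (transformer blocks omitted)."""
    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.tok_embedding = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie the output projection to the input embedding: one vocab_size x d_model
        # matrix serves both roles, removing vocab_size * d_model parameters.
        self.lm_head.weight = self.tok_embedding.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.tok_embedding(token_ids)   # real models run attention/MLP blocks here
        return self.lm_head(hidden)

model = TinyLM()
print(sum(p.numel() for p in model.parameters()))  # counts the shared matrix only once
```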

🔪 Post-train

Pruning and Quantization

Pruning techniques remove less important weights or entire neurons from the model, reducing its size and potentially improving inference speed.

Quantization involves reducing the precision of the model’s weights (e.g., from 32-bit floating-point to 8-bit integers), which can significantly reduce the model’s memory footprint and inference time with minimal impact on performance.
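As a toy illustration (far simpler than production schemes such as GPTQ or AQLM), here is symmetric per-tensor int8 quantization of a single weight matrix:

```python
import torch

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-tensor int8 quantization: int8 values plus one fp32 scale."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)      # ~64 MB in fp32
q, scale = quantize_int8(w)      # ~16 MB in int8, plus a single scale value
error = (w - dequantize(q, scale)).abs().mean().item()
print(f"mean absolute rounding error: {error:.5f}")
```

Production methods add per-channel or per-group scales, among other refinements, largely to cope with the outlier problem discussed next.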

While in principle both pruning and quantization can reduce model size, quantization is used far more often. There are a number of interesting papers advancing the effectiveness of LLM quantization. Early attempts to quantize LLMs gave poor results because these models exhibit outliers in specific activation channels across all layers and tokens¹. Malinovskii (2024) has a good blog post summarizing the progression of this research.

Fine-tuning

After initial training, models can be fine-tuned on specific tasks or domains to improve their performance in targeted areas without increasing model size. This is mainly useful for use cases that depend on adapting the model’s style or domain rather than extending its underlying capabilities. However, where applicable, QLoRA (Dettmers et al. 2023) and its descendants enable low-compute model specialization.
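The core idea behind LoRA, which QLoRA pairs with 4-bit quantization of the frozen base weights, can be sketched in plain PyTorch. This is an illustrative adapter with arbitrary rank and scaling choices, not the actual QLoRA implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a small trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # the (possibly quantized) base stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")      # only the low-rank A and B matrices
```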

Small(er) LMs: Core Techniques

Modification

This section was added after the original post.

A few core techniques emerge consistently across research and industry announcements: knowledge distillation (sometimes paired with sophisticated architecture search techniques), memory-efficient attention mechanisms, and Mixture-of-Experts models. Quantization is being thoroughly explored for more computationally constrained scenarios, e.g., Liu et al. (2024).

Case Study: Llama-3.1-Nemotron-51B

“Block-distillation – For blocks of the reference model (blue), we create multiple variants for the ‘student model’ (yellow) that mimic the block-wise teacher functionality.”

Last month, NVIDIA introduced Llama-3.1-Nemotron-51B, a language model derived from Meta’s Llama-3.1-70B. It achieves 2.2x faster inference and handles 4x larger workloads on a single GPU while maintaining comparable accuracy to its parent model (Bercovich and Karpas 2024). The core technique was block-wise knowledge distillation combined with a Neural Architecture Search (NAS) approach that allowed them to find a model matching the chosen point on the size-capability frontier. This combination represents an exciting step forward for Small(er) Language Models.

We then use our block-distillation framework to train all these block variants for all layers of a (large) parent LLM in parallel. In a basic version of block-distillation, training data is passed through the reference model  (also known as a teacher). For each block, its input is taken from the teacher and injected into the matching block of the student. The outputs of the teacher and student for the block are compared and the student block is trained so that the student block mimics the functionality of the teacher block. A more advanced scenario where a single student block mimics multiple teacher blocks is depicted in the right-hand diagram of Figure 2. Next, we use our Puzzle algorithm to efficiently score each alternative replacement “puzzle piece” and search our enormous design space for the most accurate models, while adhering to a set of inference constraints, such as memory size and required throughput. Finally, by using knowledge distillation (KD) loss for both block scoring and training, we demonstrate the potential to narrow the accuracy gap between our model and the reference model using a much more efficient architecture with a tiny fraction of the reference model training costs. Using our methods on Llama-3.1-70B as the reference model, we built ​​Llama-3.1-Nemotron-51B-Instruct, a 51B model that breaks the efficient frontier of LLMs on a single NVIDIA H100 GPU.
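Below is a heavily simplified reconstruction of the block-distillation step described in the quote; it is my own sketch based on the blog post, not NVIDIA’s code. A candidate student block receives the teacher block’s input and is trained to reproduce the teacher block’s output:

```python
import torch
import torch.nn as nn

d_model = 512
# Stand-ins for one teacher block and one cheaper candidate "puzzle piece".
teacher_block = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
student_block = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                              nn.Linear(d_model, d_model))
optimizer = torch.optim.AdamW(student_block.parameters(), lr=1e-4)

for step in range(100):
    # In practice this would be the teacher block's input activations on real training data.
    block_input = torch.randn(32, d_model)
    with torch.no_grad():
        target = teacher_block(block_input)      # the teacher block's output for the same input
    loss = nn.functional.mse_loss(student_block(block_input), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```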

“Runtime of Puzzle chosen blocks (layers) for attention layers (blue) and FFN layers (red) across the 80 layers of the reference model. Green areas correspond to overall runtime savings.”

There are several instances where learned structures are used in novel ways to improve performance (one of my favorites is learned indices). When training new models, I, like many other researchers, would default to some consistent, repeated layer pattern. Seeing some of these layers pruned without impacting model performance suggests that one could use this method to inform future model design or interpretability research by examining which layers get pruned.

Separately, OpenAI started to offer knowledge distillation of their core models (OpenAI 2024). Although we cannot determine the techniques employed on their backend, it is easy to imagine a world where people are able to choose a model distilled from the most powerful models to best suit their computational requirements.

Case Study: DeepSeek-V2

In the Transformer decoder, the attention computation for the current token depends on all preceding tokens (which is why each subsequent generated token is more computationally expensive). Naturally, the key and value projections of previous tokens are cached, but this KV cache introduces a costly memory overhead. To address this overhead, several attention mechanisms have been introduced:

  • Multi-Head Attention (MHA)
  • Multi-Query Attention (MQA)
  • Grouped-Query Attention (GQA)
  • Multi-Head Latent Attention (MLA), introduced with DeepSeek-V2 (DeepSeek-AI et al. 2024)

These attention mechanisms represent tradeoffs between attention effectiveness, computational cost, and scalability. For more details, see the phenomenal post: Abideen (2024). His conclusion was:

MHA can be faster for inference but its KV-cache overheads make it impossible to scale to larger-sized models. MQA significantly reduces KV-cache but degrades in quality as the model size increases. GQA is a balance between both attention mechanisms in terms of KV-caching and memory bandwidths. MLA requires a significantly lower KV cache yet outperforms MHA in output quality.
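To see why this matters, here is a back-of-the-envelope calculation of KV-cache size (keys plus values across all layers) using illustrative 7B-scale dimensions rather than DeepSeek-V2’s actual configuration:

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values cached at every layer for one token (fp16 => 2 bytes per value)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

n_layers, n_heads, head_dim = 32, 32, 128        # illustrative 7B-scale configuration
configs = {
    "MHA (one KV head per query head)": n_heads,
    "GQA (8 KV-head groups)": 8,
    "MQA (a single shared KV head)": 1,
}
for name, n_kv in configs.items():
    per_token = kv_cache_bytes_per_token(n_layers, n_kv, head_dim)
    print(f"{name}: {per_token / 1024:.0f} KiB/token, "
          f"{per_token * 4096 / 2**20:.0f} MiB for a 4096-token context")
```

MLA goes further by caching a compressed latent representation instead of full keys and values, which is how it keeps the cache small without MQA’s quality degradation.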

Conclusion

Creating efficient small(er) language models involves a combination of techniques applied at various stages of the model development process. Outside of research applications, small(er) language models enable faster inference, which benefits production applications, and on-device inference, which is crucial for privacy-conscious users.

Most posts about small(er) language models are something akin to “Model X beats much larger model Y on benchmark Z!!!” To those in the research community, it is not controversial to point out that benchmarks have serious limitations. As such, benchmarks may not accurately estimate the performance of some small(er) language models or may fail to highlight weaknesses in their performance. For instance, a few months back, I wanted to run a multilingual language model on-device and found that most models at the time were only capable in English and a few other high-resource languages. Since then, models like Phi-3.5-MoE-instruct have filled some of this gap, albeit at much higher parameter counts than I hoped.

As the field continues to evolve, we can expect to see further innovations in this space, making advanced NLP capabilities more accessible and deployable in resource-constrained environments.

References

Abdin, Marah, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, et al. 2024. “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.” arXiv. http://arxiv.org/abs/2404.14219.
Abideen, Zain ul. 2024. “MHA Vs MQA Vs GQA Vs MLA.” Medium. https://medium.com/@zaiinn440/mha-vs-mqa-vs-gqa-vs-mla-c6cf8285bbec.
Beatty, Sally. 2024. “Tiny but Mighty: The Phi-3 Small Language Models with Big Potential.” Microsoft News. https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/.
Bercovich, Akhiad, and Udi Karpas. 2024. “Advancing the Accuracy-Efficiency Frontier with Llama-3.1-Nemotron-51B.” NVIDIA Technical Blog. https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/.
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, et al. 2024. “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.” arXiv. https://doi.org/10.48550/arXiv.2405.04434.
Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. “QLoRA: Efficient Finetuning of Quantized LLMs.” Advances in Neural Information Processing Systems 36 (December): 10088–115. https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html.
Fu, Daniel Y. 2023. “Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture.” SlidesLive. https://neurips.cc/virtual/2023/oral/73841.
Glorioso, Paolo, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, and Beren Millidge. 2024. “Zamba2-Mini - Zyphra.” https://www.zyphra.com/post/zamba2-mini.
Goel, Karan. 2024. “The OnDevice Intelligence Update - Cartesia.” https://cartesia.ai/blog/2024-08-27-on-device.
Gu, Albert, and Tri Dao. 2023. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv. https://doi.org/10.48550/arXiv.2312.00752.
Javaheripi, Mojan, and Sébastien Bubeck. 2023. “Phi-2: The Surprising Power of Small Language Models.” Microsoft Research. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/.
@karpathy. 2024. “It’s a Bit Sad and Confusing...” Tweet. Twitter. https://x.com/karpathy/status/1835024197506187617.
Li, Yuanzhi, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. “Textbooks Are All You Need II: Phi-1.5 Technical Report.” arXiv. https://doi.org/10.48550/arXiv.2309.05463.
Liu, Yifei, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. 2024. “VPTQ: Extreme Low-Bit Vector Post-Training Quantization for Large Language Models.” arXiv. https://doi.org/10.48550/arXiv.2409.17066.
Malinovskii, Vladimir. 2024. “The Evolution of Extreme LLM Compression: From QuIP to AQLM with PV-Tuning.” Yandex. https://medium.com/yandex/the-evolution-of-extreme-llm-compression-from-quip-to-aqlm-with-pv-tuning-19c44b91af96.
OpenAI. 2024. “Distillation API Docs.” https://platform.openai.com.
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter.” arXiv. https://doi.org/10.48550/arXiv.1910.01108.
Schmid, Philipp, Omar Sanseviero, Pedro Cuenca, Lewis Tunstall, Tom Aarsen, and Vaibhav Srivastav. 2024. “Welcome Gemma 2 - Google’s New Open LLM.” https://huggingface.co/blog/gemma2.
Shao, Erzhuo, Shiyuan Guo, and Zachary A. Pardos. 2021. “Degree Planning with PLAN-BERT: Multi-Semester Recommendation Using Future Courses of Interest.” Proceedings of the AAAI Conference on Artificial Intelligence 35 (17): 14920–29. https://doi.org/10.1609/aaai.v35i17.17751.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems.

Footnotes

  1. Interestingly, this also presents a significant problem in interpretability research.↩︎

Citation

For attribution, please cite this work as:
Lekan, Kasra. 2024. “The Large World of Small(er) Language Models.” September 18, 2024. https://blog.kasralekan.com/ideas/small-language-models/.