Special credit to t3dotgg (2024) for the video that inspired me to write this post before my planned post on the technical innovations behind smaller, more efficient language models.
This post is not a journal-level review; my research for it was meant to be informational and did not exhaust the search space. If you notice any key papers or references that I have missed, or if I have misinterpreted the findings of any reference, please let me know in the comments.
Language Model Performance is Plateauing
Figure 1 shows the consistent trend of improvement in open- and closed-source models. While I could spend this whole post writing about this graph alone, for now just notice the general upward trend in MMLU 5-shot benchmark performance over time.
🎉 The Hype is Alive
20 months ago, “ChatGPT is a revolution, the most powerful model ever made,” and today, you can run a model more preferred than this literally on a toaster!🍞 (Schmid 2024)
The quote from HuggingFace Technical Lead Philipp Schmid references Gemma-2-9b-it, which, as of August 2nd, 2024, ranked 47th on HuggingFace’s language model leaderboard – higher than GPT-3.5-Turbo-0613. Gemma 2 includes four different models (see Schmid et al. 2024 for full model details):
- gemma-2-9b: Base 9B model.
- gemma-2-9b-it: Instruction fine-tuned version of the base 9B model.
- gemma-2-27b: Base 27B model.
- gemma-2-27b-it: Instruction fine-tuned version of the base 27B model.
Thus, a 9-billion-parameter model bested a model reported to have 175 billion parameters. The astute reader will note that Schmid must own the most advanced toaster ever made to run Gemma 2 🤣.
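To ground the quote a bit, here is a minimal sketch of running gemma-2-9b-it locally with HuggingFace’s transformers library. This is my own illustrative snippet (the prompt and generation settings are mine, not Schmid’s), and it assumes a machine with enough memory for the 9B weights (sadly, a toaster will not suffice):

```python
# A minimal sketch of running gemma-2-9b-it locally with HuggingFace
# transformers. Assumes `pip install transformers accelerate torch` and
# enough GPU/CPU memory to hold the 9B weights.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-9b-it",
    torch_dtype=torch.bfloat16,  # halves memory relative to float32
    device_map="auto",           # spread layers across available devices
)

# The instruction-tuned ("-it") models expect the chat-message format.
messages = [{"role": "user", "content": "Why are small language models useful?"}]
output = pipe(messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])  # the assistant's reply
```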
📈 Moore’s Law is Dead?
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. (Sutton 2019)
Sutton’s famous 2019 blog post1, “The Bitter Lesson” (Sutton 2019), argued for the primacy of computational power2 over expert knowledge built into models as the driver of long-run performance improvements. However, as we approach the physical limits of transistor density, maintaining Moore’s law has become increasingly untenable.
When I first read “The Bitter Lesson,” I thought Sutton argued that engineering work on models did not matter; I was wrong. The key lesson I now extract from my experience, the academic literature, and “The Bitter Lesson” is that model architectures must optimize their use of computation. For example, the introduction of self-attention mechanisms in Transformer models allowed each token to attend to every other token in the input sequence, leveraging parallel computation to process large amounts of data efficiently. Similarly, architectures like convolutional neural networks (CNNs) capitalize on the spatial structure of data, using shared weights and local connectivity to optimize computational efficiency and scalability. These architectures do not merely rely on hand-crafted features but instead exploit the raw power of computation to learn and generalize from massive datasets.
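As a concrete illustration of that point, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. This is my own toy example rather than code from any cited work; the key observation is that every token attends to every other token through a single batched matrix multiply, exactly the kind of operation parallel hardware excels at:

```python
# A minimal sketch of scaled dot-product self-attention in plain NumPy.
# Shapes: n tokens, each with a d-dimensional embedding.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilize exponents
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token scores every other token
    return softmax(scores) @ V               # weighted mix of value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                                  # 4 tokens, 8-dim embeddings
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # -> (4, 8)
```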
The lesson here seems clear: the architectures that thrive are those that best adapt to and utilize the growing computational resources available.
💾 Limitations on Available Data
The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. (Muennighoff et al. 2023)
Architectures need to be efficient because of the massive amount of data consumed during training. Several papers, including Muennighoff et al. (2023), note that we are reaching the limits of available human-generated data. Consequently, other authors have investigated training on AI-generated data (Shumailov et al. 2023), with mixed results.
If we cannot solve the data bottleneck, relying on more computation will not close the performance gap between state-of-the-art models and general intelligence.
The Future of Language Models
⚙️ A Focus on Efficient Performance
As with Gemma 2, which I discussed above (and OpenAI’s Turbo models before it), there is an increasing focus on efficiency and inference speed as language model performance begins to plateau. Similarly, Mistral Large 2, the second generation of the startup’s flagship model, was announced with a post entitled “Large Enough.” The model is designed to “push the boundaries of cost efficiency, speed, and performance” (MistralAI 2024).
I am not saying this is a bad push. I intend to devote an entire post to the technical innovations that have driven these gains in efficiency and inference speed. However, I do not believe these innovations will carry us to the next generation of AI models.
🛝 A Paradigm Shift is Needed
As I see it, language models have the following high-level flaws:
- A coupling of knowledge and reasoning3 capabilities.
- This is most similar to the issues that the engineers behind models like Gemma 2 seek to address. It’s difficult to train a small model with strong reasoning capabilities when its weights must also store so much factual knowledge.
- There is a growing literature on grounding language models with “World Models.”
- There is no concept of thinking deeply, i.e., spending more inference compute does not yield a better answer.
- Even if we stop improving our chips, we are building far more of them than ever before. Thus, if we had a model whose reasoning capabilities grew with inference compute, it could answer difficult questions given enough resources (a toy sketch of this idea follows below).
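One concrete (if crude) version of reasoning that scales with inference compute is majority voting over many sampled answers, in the spirit of self-consistency decoding. The sketch below is my own toy illustration; `sample_answer` is a hypothetical stand-in for a stochastic model call:

```python
# A toy sketch of trading more inference compute for a better answer via
# majority voting over sampled candidates. `sample_answer` is a hypothetical
# stand-in; any real stochastic LLM sampling call would slot in here.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Pretend model: right 40% of the time per sample, wrong otherwise.
    return "42" if random.random() < 0.4 else random.choice(["41", "43", "44"])

def majority_vote(question: str, n_samples: int) -> str:
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

random.seed(0)
for n in (1, 10, 100, 1000):  # more samples -> more inference compute
    print(n, majority_vote("What is 6 * 7?", n))
```

Because the correct answer is sampled more often than any single wrong answer, the vote converges on it as the sample budget grows: a simple case of answer quality improving with inference compute.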
I believe that to achieve the next level of AI-based intelligence, a new approach that addresses at least one of these flaws is needed.
Much of the discussion of performance in this post has been based on benchmarks, a fascinating topic in its own right. Benchmarks have been instrumental in the advancement of NLP going back to the GLUE benchmark (Wang et al. 2018). However, they can cause research to become myopically focused (Gehrmann et al. 2021) or mischaracterize the rate of progress, as demonstrated by Schaeffer, Miranda, and Koyejo (2023), whose paper won an Outstanding Paper award at NeurIPS 2023. Figure 2, from the team at the Arc Prize, demonstrates one such potential flaw.
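To make Schaeffer, Miranda, and Koyejo’s point concrete: a capability that improves smoothly at the per-token level can look like a sudden “emergent” jump when scored with a nonlinear metric such as exact match over multi-token answers. The toy numbers below are my own invention, not the paper’s data:

```python
# A minimal sketch of the "emergence as a metric artifact" argument:
# per-token accuracy improves smoothly with scale, but exact match over a
# k-token answer (p ** k) appears to jump abruptly. Numbers are invented.
per_token_accuracy = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]  # smooth growth
k = 10  # answer length in tokens; exact match requires all k correct

for p in per_token_accuracy:
    exact_match = p ** k
    print(f"per-token accuracy {p:.2f} -> exact match {exact_match:.4f}")
# Output climbs from ~0.001 to ~0.599: a smoothly improving capability,
# viewed through a nonlinear metric, appears to "emerge" all at once.
```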
🍒 A Final Word from Yann LeCun
If you are a student interested in building the next generation of AI systems, don’t work on LLMs (Yann LeCun [@ylecun] 2024)
Yann LeCun, a foundational deep learning researcher, has made some famous pronouncements in the past, like the one in Figure 3. Recently, he has argued that LLMs are not the solution to the intelligence problem, despite their truly awesome performance.
References
Footnotes
Sauerwein (2024) has a good post summarizing “The Bitter Lesson” and the responses to it.↩︎
Moore’s law observes that the number of transistors in an integrated circuit roughly doubles every two years. Roser, Ritchie, and Mathieu (2024) have a great post visualizing it.↩︎
“Reasoning” is a tricky term to nail down. Here I am referring to reasoning as measured by current benchmarks rather than more advanced capabilities like planning. Planning is an extremely interesting capability that is essential for language-model-based agents. Check out the NeurIPS 2023 workshop on “Foundation Models for Decision Making” (Yang et al. 2023) as well as Valmeekam et al. (2023) for a taste of this research.↩︎