Extending “Towards Monosemanticity”

research
NLP
LLM
Mechanistic interpretability by applying dictionary learning to Distill-GPT-2.
Author

Kasra Lekan

Published

April 3, 2024

Background

Building on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Bricken et al. 2023) from Anthropic and Language Models Can Explain Neurons in Language Models (Bills et al. 2023) from OpenAI, I attempted to generate natural language explanations for the neurons in Distill-GPT-2. The approach projects the final MLP output layer into a higher-dimensional space (a form of dictionary learning) and then uses a language model (gpt-4-turbo-2024-04-09) to generate a natural language description of each higher-dimensional feature from its activation values. The underlying theoretical foundation is the superposition hypothesis, which, simply put, states that a language model represents more concepts than it has neurons, so each neuron ends up encoding a complicated mix of them. For instance, a single neuron may activate strongly on both Korean text and DNA sequences. By projecting MLP outputs into a higher dimension, we can attempt to recover “features” that each represent a single, explainable concept.
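Concretely, this projection can be implemented as a sparse autoencoder trained on the hooked MLP activations, following the architecture in Bricken et al. (2023) and Nanda’s 1L-Sparse-Autoencoder. The PyTorch sketch below is illustrative rather than my exact training code; the class name, initialization, and L1 coefficient are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder that projects MLP activations into a 'dictionary'
    of sparse features (architecture follows Bricken et al. 2023; dimensions and
    hyperparameters here are illustrative)."""

    def __init__(self, d_in: int, expansion: int = 32, l1_coeff: float = 3e-4):
        super().__init__()
        d_dict = expansion * d_in                      # e.g. a 32x overcomplete dictionary
        self.W_enc = nn.Parameter(torch.empty(d_in, d_dict))
        self.W_dec = nn.Parameter(torch.empty(d_dict, d_in))
        nn.init.kaiming_uniform_(self.W_enc)
        nn.init.kaiming_uniform_(self.W_dec)
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        self.l1_coeff = l1_coeff

    def forward(self, x):                              # x: (..., d_in) MLP activations
        feats = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)   # sparse feature activations
        recon = feats @ self.W_dec + self.b_dec                      # reconstruction of the input
        l2 = (recon - x).pow(2).sum(-1).mean()                       # reconstruction loss
        l1 = self.l1_coeff * feats.abs().sum(-1).mean()              # sparsity penalty
        return recon, feats, l2 + l1
```

The L1 penalty on the feature activations is what pushes each dictionary entry toward firing on a narrow set of inputs, which is what makes the resulting features candidates for natural language explanation.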

Challenges

  • Reproducing Anthropic’s representation from the paper’s appendix
  • Tuning hyperparameters
  • Loss degradation over longer training runs
  • Automated interpretability using OpenAI’s implementation package
    - API changes requiring code refactoring due to data parsing changes, or rewriting due to missing information
    - Poor responses from GPT-4, ultimately making automated interpretability impossible without adjusting the prompts

Observations

Training Autoencoders for Reconstruction

The primary metric I used for MLP reconstruction efficacy was the “reconstruction score,” where loss is the model’s original loss, recons_loss is the loss with the hooked MLP output replaced by the autoencoder’s reconstruction, and zero_abl_loss is the loss with that output zero-ablated:

\[\text{score} = \frac{\text{zero\_abl\_loss} - \text{recons\_loss}}{\text{zero\_abl\_loss} - \text{loss}}\]

I was able to reproduce Anthropic’s autoencoder on a single-layer transformer with GELU activations, achieving a reconstruction score of ~94% with 2 billion training tokens (fewer than Anthropic used). With Distill-GPT-2, I was only able to achieve a reconstruction score of ~77% (with a 32x dictionary size). I observed that (1) training on more tokens did not significantly improve reconstruction scores, and (2) performance deteriorated after reaching an optimum partway through the run, suggesting that scaling up this approach would require a more sophisticated training strategy.
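For reference, the three losses in the score can be obtained by running the model unmodified, with the hooked MLP output replaced by the autoencoder’s reconstruction, and with it zero-ablated. The sketch below uses TransformerLens (Nanda 2024b); the model name, the choice of hook point (the last layer’s MLP output), and the `sae` and `tokens` variables are assumptions about the setup rather than my exact code.

```python
import torch
from transformer_lens import HookedTransformer, utils

# Assumptions: `sae` is a trained autoencoder like the sketch above (reading the
# MLP output), and `tokens` is a batch of token ids.
model = HookedTransformer.from_pretrained("distilgpt2")
layer = model.cfg.n_layers - 1
hook_name = utils.get_act_name("mlp_out", layer)     # last layer's MLP output

def replace_with_recon(act, hook):
    recon, _, _ = sae(act)            # substitute the autoencoder's reconstruction
    return recon

def zero_ablate(act, hook):
    return torch.zeros_like(act)      # remove the MLP signal entirely

loss = model(tokens, return_type="loss")
recons_loss = model.run_with_hooks(
    tokens, return_type="loss", fwd_hooks=[(hook_name, replace_with_recon)])
zero_abl_loss = model.run_with_hooks(
    tokens, return_type="loss", fwd_hooks=[(hook_name, zero_ablate)])

score = (zero_abl_loss - recons_loss) / (zero_abl_loss - loss)
```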

Dictionary Size

Anthropic tested many dictionary sizes, from 1x to 256x, but focused on 8x for their primary findings. I hypothesized that a larger size would be optimal for a larger model, since it is trained on more tokens and learns more complex representations. I first trained a 32x dictionary and later a 128x dictionary. Training a larger dictionary was naturally more computationally intensive, with cost growing linearly in the expansion factor, \(O(n)\).
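To make the linear scaling concrete, the autoencoder’s parameter count grows in direct proportion to the expansion factor. The arithmetic below assumes the autoencoder reads Distill-GPT-2’s 768-dimensional MLP output and uses the sketch architecture from the Background section; the exact counts depend on which activation is hooked.

```python
def sae_param_count(d_in: int, expansion: int) -> int:
    """Parameters of the sketch autoencoder: encoder + decoder weights plus the two biases."""
    d_dict = expansion * d_in
    return 2 * d_in * d_dict + d_dict + d_in

d_in = 768  # width of Distill-GPT-2's MLP output (assumed hook point)
for expansion in (8, 32, 128):
    print(f"{expansion:>4}x dictionary: {sae_param_count(d_in, expansion):,} parameters")
# Roughly 9.4M at 8x, 37.8M at 32x, and 151M at 128x -- linear in the expansion factor.
```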

Interpretability

Figure 1

The natural language explanations suggested that a 32x dictionary is not expressive enough for Distill-GPT-2. Many features received the same explanation because they activated on a wide range of tokens; since these ranges were not specific, the explanations fit the high-frequency tokens in the evaluation text rather than anything particular to the model (Figure 1). I therefore posit that larger models need much larger dictionaries, as they encode more features per neuron in the MLP layer. Distill-GPT-2 has ~15x more parameters than the single-layer transformer that Anthropic analyzed, so I opted to test a 128x dictionary in addition to the 32x.

Training a 128x dictionary was far more computationally intensive and did not reach a high enough reconstruction score to support interpretability: after training on 2 billion tokens, the score was only ~48%. Additional training led to degradation from this optimum, underscoring the need for more sophisticated autoencoder training as the model being interpreted grows larger.
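For completeness, the explanation step amounts to showing GPT-4 a feature’s top-activating tokens and asking for a one-sentence description. The sketch below is a simplification of that loop rather than the OpenAI package’s actual interface; `top_activating_examples` is a hypothetical helper that would return (token, activation) pairs for a given feature.

```python
from openai import OpenAI

client = OpenAI()

def explain_feature(examples: list[tuple[str, float]]) -> str:
    """Ask GPT-4 for a short natural language description of one dictionary feature,
    given (token, activation) pairs from its top-activating contexts."""
    formatted = "\n".join(f"{tok!r}\t{act:.2f}" for tok, act in examples)
    prompt = (
        "Below are tokens and the activation of one autoencoder feature on each.\n"
        "Describe, in one sentence, the specific concept this feature responds to.\n\n"
        f"{formatted}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Usage (hypothetical helper): explain_feature(top_activating_examples(feature_idx=123))
```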

Acknowledgements

This work would not have been possible without guidance from Professor Yangfeng Ji.

Check out my presentation on this research here.

References

Bills, Steven, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. “Language Models Can Explain Neurons in Language Models.” Open AI Public 2.
Bricken, Trenton, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, et al. 2023. “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.” Transformer Circuits Thread 2.
Nanda, Neel. 2024a. “Neelnanda-Io/1L-Sparse-Autoencoder.” https://github.com/neelnanda-io/1L-Sparse-Autoencoder.
———. 2024b. “TransformerLensOrg/TransformerLens.” TransformerLensOrg. https://github.com/TransformerLensOrg/TransformerLens.

Citation

For attribution, please cite this work as:
Lekan, Kasra. 2024. “Extending ‘Towards Monosemanticity’.” April 3, 2024. https://blog.kasralekan.com/ideas/towards-monosemanticity/.