<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Mind the Gap</title>
<link>https://blog.kasralekan.com/ideas/</link>
<atom:link href="https://blog.kasralekan.com/ideas/index.xml" rel="self" type="application/rss+xml"/>
<description>Technical, business, and philosophical musings</description>
<generator>quarto-1.5.55</generator>
<lastBuildDate>Tue, 01 Oct 2024 04:00:00 GMT</lastBuildDate>
<item>
  <title>Money and Politics in US House of Representatives Elections</title>
  <dc:creator>Kasra Lekan</dc:creator>
  <link>https://blog.kasralekan.com/ideas/money-and-politics/</link>
  <description><![CDATA[ 
<div class="progress" id="progress">
    <div class="train">
        <div class="train-tail"></div>
        <div class="train-body" id="train-body"></div>
        <div class="train-head"></div>
    </div>
</div>




<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="citizens-united-court-sketch.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="“In this Sept.&nbsp;9, 2009 artist rendering, U.S. Solicitor General Elena Kagan, right, argues before the Supreme Court, Citizens United v. Federal Election Commission in Washington. (Dana Verkouteran/AP)” [@kaminer_truth_2015]"><img src="https://blog.kasralekan.com/ideas/money-and-politics/citizens-united-court-sketch.jpg" class="img-fluid figure-img" alt="“In this Sept.&nbsp;9, 2009 artist rendering, U.S. Solicitor General Elena Kagan, right, argues before the Supreme Court, Citizens United v. Federal Election Commission in Washington. (Dana Verkouteran/AP)” (Kaminer 2015)"></a></p>
<figcaption>“In this Sept.&nbsp;9, 2009 artist rendering, U.S. Solicitor General Elena Kagan, right, argues before the Supreme Court, Citizens United v. Federal Election Commission in Washington. (Dana Verkouteran/AP)” <span class="citation" data-cites="kaminer_truth_2015">(Kaminer 2015)</span></figcaption>
</figure>
</div>
<section id="context" class="level1">
<h1>Context</h1>
<p>As we near the 2024 presidential vote, I have heard several discussions of the corrupting influence of money in American politics from journalists and pundits inside and outside the United States. They argue that the trouble stems from <a href="https://en.wikipedia.org/w/index.php?title=Citizens_United_v._FEC&amp;oldid=1247060928">Citizens United</a>, which, plainly put, determined that <em>Money is Speech</em>, enshrining the ability of firms and individuals to spend huge sums on political candidates.</p>
<div id="fig-pew" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-pew-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="PP_2023.09.19_views-of-politics_05-02.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="PEW Research [@nadeem_5_2023]"><img src="https://blog.kasralekan.com/ideas/money-and-politics/PP_2023.09.19_views-of-politics_05-02.png" class="img-fluid figure-img" style="width:50.0%"></a></p>
<figcaption>PEW Research <span class="citation" data-cites="nadeem_5_2023">(Nadeem 2023)</span></figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-pew-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
<p>Many argue that this allows politicians to be influenced by big donors, who will sometimes donate to both candidates in a race to ensure their influence regardless of the outcome. I do not consider implicit bias in favor of donors’ agendas here. I wanted to tackle a far simpler question: in our modern age of targeted advertising and dense information ecosystems, “<em>Does money facilitate election wins?</em>”</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="openSecretWinning.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="Percent of Races Won by Top Spending Candidate [@opensecrets_did_2022]"><img src="https://blog.kasralekan.com/ideas/money-and-politics/openSecretWinning.png" class="img-fluid figure-img" alt="Percent of Races Won by Top Spending Candidate (OpenSecrets 2022)"></a></p>
<figcaption>Percent of Races Won by Top Spending Candidate <span class="citation" data-cites="opensecrets_did_2022">(OpenSecrets 2022)</span></figcaption>
</figure>
</div>
<p>It is no secret that the top-spending candidate wins almost all of the time. This post was inspired by someone stating, “In 2022, the candidate with more money won over 93% of the time.” But we would expect this regardless. Let’s suppose we have hypothetical candidates A and B. Candidate A is well-liked by her constituents and as a result receives a large amount of donations. Candidate B is not well-liked and receives fewer donations. Candidate A wins <em>AND</em> is better funded because she is more popular. Therefore, a more in-depth analysis is needed to evaluate whether having more money is unduly beneficial.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="confoundingCorrelation.png" class="lightbox" data-gallery="quarto-lightbox-gallery-4" title="A confounding variable (popularity) influences both donations and voting, creating a misleading association between them."><img src="https://blog.kasralekan.com/ideas/money-and-politics/confoundingCorrelation.png" class="img-fluid figure-img" style="width:70.0%" alt="A confounding variable (popularity) influences both donations and voting, creating a misleading association between them."></a></p>
<figcaption>A confounding variable (popularity) influences both donations and voting, creating a misleading association between them.</figcaption>
</figure>
</div>
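To make the confounding argument concrete, here is a toy simulation (all numbers are invented for illustration): popularity drives both donations and votes, spending has no causal effect on the outcome at all, and yet spending and margin end up strongly correlated.

```python
import random

random.seed(0)

# Hypothetical data-generating process: popularity causes BOTH spending and
# margin; spending itself never enters the margin equation.
n = 2000
popularity = [random.gauss(0, 1) for _ in range(n)]
spending = [p + random.gauss(0, 0.5) for p in popularity]  # donations follow popularity
margin = [p + random.gauss(0, 0.5) for p in popularity]    # votes also follow popularity

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Spending and margin correlate strongly despite zero causal link between them.
print(round(pearson(spending, margin), 2))
```

With these noise levels the spurious correlation is about 0.8, purely from the shared cause.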
</section>
<section id="methods" class="level1">
<h1>Methods</h1>
<p><strong>Datasets</strong>: For this analysis I combine FEC campaign finance data <span class="citation" data-cites="fec_campaign_2024">(FEC 2024)</span>, election results data <span class="citation" data-cites="lab_us_2024">(Lab 2024)</span>, and overall election data for Cook PVIs <span class="citation" data-cites="cook_pvi_2023_2023">(Cook PVI℠ 2023)</span> and national vote totals from Wikipedia <span class="citation" data-cites="wikipedia_2020_2021">(e.g. Wikipedia 2021)</span>. The code and data for the analysis can be found <a href="https://github.com/anrath/money-and-politics">here</a>.</p>
<p><strong>Restrictions of the Dataset</strong>: I consider districts during the period 2012-2020 to avoid redistricting between years in the dataset. I only consider races between a registered Democrat and Republican candidate.</p>
<p><strong>Variables</strong>: The independent variable is the spending ratio between the winning candidate and the runner-up. The dependent variable is the Margin of Victory (%) or Adjusted Margin (%). Adjusted Margin was calculated to control for the political bias of the district and the country. To control for the political bias of the district, I adjust the Margin of Victory by the Cook PVI for the district (it is best to think of this as calculating the under- or over-performance of the candidate based on the political lean of the district). I also adjust by what I am calling the “National House Environment” which is just the difference between the percentage of all Americans who voted for Republican candidates and those who voted for Democratic candidates. This adjustment was an attempt to control for the influence of the environment of the nation and media on the voting of “swing” voters in each district. I did not control for incumbency, but I separately analyzed cases where incumbents were not present.</p>
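The exact arithmetic lives in the linked repository; as a sketch, the adjustment described above might look like the following (the sign conventions and scaling are my assumption, not the repository's actual code):

```python
def adjusted_margin(margin_of_victory, cook_pvi, national_environment):
    """Hypothetical sketch of the Adjusted Margin described above.

    margin_of_victory    -- winner's vote share minus runner-up's, in points
    cook_pvi             -- district lean toward the winner's party, in points
                            (positive = district structurally favors the winner)
    national_environment -- national House popular-vote lead for the winner's
                            party, in points
    """
    # Subtract both structural advantages so the residual reflects the
    # candidate's over- or under-performance relative to expectations.
    return margin_of_victory - cook_pvi - national_environment

# A winner with a 12-point raw margin, in a district leaning 8 points her way,
# in a year her party led the national House vote by 2 points, over-performed
# by only 2 points:
print(adjusted_margin(12, 8, 2))
```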
</section>
<section id="findings" class="level1">
<h1>Findings</h1>
<div id="fig-findings" class="quarto-layout-panel">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-findings-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-layout-row">
<div class="quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-findings" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-mv-all" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-mv-all-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="output/spending_ratio_and_margin_of_victory.png" class="lightbox" data-gallery="fig-findings" title="Figure&nbsp;2&nbsp;(a): All races; Raw Margin of Victory vs.&nbsp;Spending Ratio"><img src="https://blog.kasralekan.com/ideas/money-and-politics/output/spending_ratio_and_margin_of_victory.png" class="img-fluid figure-img" data-ref-parent="fig-findings"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-mv-all-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(a) All races; Raw Margin of Victory vs.&nbsp;Spending Ratio
</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-findings" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-am-all" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-am-all-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="output/spending_ratio_and_adjusted_margin.png" class="lightbox" data-gallery="fig-findings" title="Figure&nbsp;2&nbsp;(b): All races; Adjusted Margin vs.&nbsp;Spending Ratio"><img src="https://blog.kasralekan.com/ideas/money-and-politics/output/spending_ratio_and_adjusted_margin.png" class="img-fluid figure-img" data-ref-parent="fig-findings"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-am-all-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(b) All races; Adjusted Margin vs.&nbsp;Spending Ratio
</figcaption>
</figure>
</div>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-findings" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-mv-cont" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-mv-cont-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="output/spending_ratio_and_margin_of_victory_most_contested_races.png" class="lightbox" data-gallery="fig-findings" title="Figure&nbsp;2&nbsp;(c): Races with +/- 10% Adjusted Margin; Raw Margin of Victory vs.&nbsp;Spending Ratio"><img src="https://blog.kasralekan.com/ideas/money-and-politics/output/spending_ratio_and_margin_of_victory_most_contested_races.png" class="img-fluid figure-img" data-ref-parent="fig-findings"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-mv-cont-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(c) Races with +/- 10% Adjusted Margin; Raw Margin of Victory vs.&nbsp;Spending Ratio
</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-findings" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-am-cont" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-am-cont-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="output/spending_ratio_and_adjusted_margin_most_contested_races.png" class="lightbox" data-gallery="fig-findings" title="Figure&nbsp;2&nbsp;(d): Races with +/- 10% Adjusted Margin; Adjusted Margin vs.&nbsp;Spending Ratio"><img src="https://blog.kasralekan.com/ideas/money-and-politics/output/spending_ratio_and_adjusted_margin_most_contested_races.png" class="img-fluid figure-img" data-ref-parent="fig-findings"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-am-cont-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(d) Races with +/- 10% Adjusted Margin; Adjusted Margin vs.&nbsp;Spending Ratio
</figcaption>
</figure>
</div>
</div>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-findings-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Comparing victory margins with relative spending. Note that “Challenger vs.&nbsp;Incumbent” indicates that the Challenger wins a race against an Incumbent.
</figcaption>
</figure>
</div>
<p>As I hypothesized, the correlation decreases when using the Adjusted Margin. With an overall correlation of 0.27, the relationship between the Spending Ratio and Adjusted Margin is weak but not negligible. Most of the data comes from races where an incumbent is present and wins, which is extremely common in US House races. Thus, I broke down the correlations by race type (data from “most contested races”):</p>
<ul>
<li>Overall correlation between Spending Ratio and Adjusted Vote Margin (n=437): 0.2685</li>
<li>Correlation for Incumbent vs Challenger (n=306): 0.2263</li>
<li>Correlation for Open Seat (n=68): 0.2243</li>
<li>Correlation for Challenger vs Incumbent (n=63): -0.0215</li>
</ul>
<p>While there are far fewer races where the Challenger beats the Incumbent (n=63 in “most contested races”, n=73 in “all races”), it is interesting that there is no correlation between spending ratio and adjusted margin in these cases.</p>
</section>
<section id="discussion" class="level1">
<h1>Discussion</h1>
<p>FiveThirtyEight’s piece <a href="https://fivethirtyeight.com/features/money-and-elections-a-complicated-love-story/">“How Money Affects Elections”</a> looks at this issue from a more qualitative perspective based on the work of some prominent researchers in the field. They raise a few interesting ideas that mesh with my findings:</p>
<blockquote class="blockquote">
<p>… early fundraising strongly predicted who would win primary races. …advertising is useful for making voters aware that a candidate or an issue exists at all. <span class="citation" data-cites="koerth_how_2018">(Koerth 2018)</span></p>
</blockquote>
<blockquote class="blockquote">
<p>… the strong raw association between raising the most cash and winning probably has more to do with big donors who can tell (based on polls or knowledge of the district or just gut-feeling woo-woo magic) that one candidate is more likely to win — and then they give that person all their money. <span class="citation" data-cites="koerth_how_2018">(Koerth 2018)</span></p>
</blockquote>
<p>Based on the relatively weak correlations I found and the views of various authors on this topic, money influences elections, but in a far more minor way than one would expect. Money matters most when a challenger attempts to defeat an incumbent or when a seat is open. Thus, money can, to an extent, be a barrier to entry into the political fray. Regarding the erosion of American democracy, I would contend that money is primarily an issue in elections because of its negative impact on voter perceptions (see Figure&nbsp;1). Once politicians are in office, however, the influence of such donations can take a far more corrosive toll.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-cook_pvi_2023_2023" class="csl-entry">
Cook PVI℠. 2023. <span>“2023 <span>Cook</span> <span>PVI</span>℠: <span>District</span> <span>Map</span> and <span>List</span> (118th <span>Congress</span>).”</span> <a href="https://www.cookpolitical.com/cook-pvi/2023-partisan-voting-index/118-district-map-and-list">https://www.cookpolitical.com/cook-pvi/2023-partisan-voting-index/118-district-map-and-list</a>.
</div>
<div id="ref-fec_campaign_2024" class="csl-entry">
FEC. 2024. <span>“Campaign Finance Data.”</span> <a href="https://www.fec.gov/data/browse-data/?tab=bulk-data">https://www.fec.gov/data/browse-data/?tab=bulk-data</a>.
</div>
<div id="ref-kaminer_truth_2015" class="csl-entry">
Kaminer, Wendy. 2015. <span>“The <span>Truth</span> <span>About</span> <span>Citizens</span> <span>United</span>.”</span> <a href="https://www.wbur.org/cognoscenti/2015/01/21/campaign-finance-myths-wendy-kaminer">https://www.wbur.org/cognoscenti/2015/01/21/campaign-finance-myths-wendy-kaminer</a>.
</div>
<div id="ref-koerth_how_2018" class="csl-entry">
Koerth, Maggie. 2018. <span>“How <span>Money</span> <span>Affects</span> <span>Elections</span>.”</span> <em>FiveThirtyEight</em>. <a href="https://fivethirtyeight.com/features/money-and-elections-a-complicated-love-story/">https://fivethirtyeight.com/features/money-and-elections-a-complicated-love-story/</a>.
</div>
<div id="ref-lab_us_2024" class="csl-entry">
Lab, MIT Election Data and Science. 2024. <span>“U.<span>S</span>. <span>House</span> 1976–2022.”</span> Harvard Dataverse. <a href="https://doi.org/10.7910/DVN/IG0UN2">https://doi.org/10.7910/DVN/IG0UN2</a>.
</div>
<div id="ref-nadeem_5_2023" class="csl-entry">
Nadeem, Reem. 2023. <span>“5. <span>Money</span>, Power and the Influence of Ordinary People in <span>American</span> Politics.”</span> <em>Pew Research Center</em>. <a href="https://www.pewresearch.org/politics/2023/09/19/money-power-and-the-influence-of-ordinary-people-in-american-politics/">https://www.pewresearch.org/politics/2023/09/19/money-power-and-the-influence-of-ordinary-people-in-american-politics/</a>.
</div>
<div id="ref-opensecrets_did_2022" class="csl-entry">
OpenSecrets. 2022. <span>“Did <span>Money</span> <span>Win</span>?”</span> <em>OpenSecrets</em>. <a href="https://www.opensecrets.org/elections-overview/winning-vs-spending">https://www.opensecrets.org/elections-overview/winning-vs-spending</a>.
</div>
<div id="ref-wikipedia_2020_2021" class="csl-entry">
Wikipedia. 2021. <span>“2020 <span>United</span> <span>States</span> <span>House</span> of <span>Representatives</span> Elections - <span>Wikipedia</span>.”</span> <a href="https://en.wikipedia.org/wiki/2020_United_States_House_of_Representatives_elections">https://en.wikipedia.org/wiki/2020_United_States_House_of_Representatives_elections</a>.
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-lekan2024" class="csl-entry quarto-appendix-citeas">
Lekan, Kasra. 2024. <span>“Money and Politics in US House of
Representatives Elections.”</span> October 1, 2024. <a href="https://blog.kasralekan.com/ideas/money-and-politics/">https://blog.kasralekan.com/ideas/money-and-politics/</a>.
</div></div></section></div> ]]></description>
  <category>Data Analysis</category>
  <category>Policy</category>
  <guid>https://blog.kasralekan.com/ideas/money-and-politics/</guid>
  <pubDate>Tue, 01 Oct 2024 04:00:00 GMT</pubDate>
  <media:content url="https://blog.kasralekan.com/ideas/money-and-politics/citizens-united-court-sketch.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>The Large World of Small(er) Language Models</title>
  <dc:creator>Kasra Lekan</dc:creator>
  <link>https://blog.kasralekan.com/ideas/small-language-models/</link>
  <description><![CDATA[ 
<div class="progress" id="progress">
    <div class="train">
        <div class="train-tail"></div>
        <div class="train-body" id="train-body"></div>
        <div class="train-head"></div>
    </div>
</div>




<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="heroGraphSizeQuality.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://blog.kasralekan.com/ideas/small-language-models/heroGraphSizeQuality.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:70.0%"></a></p>
</figure>
</div>
<div class="callout callout-style-default callout-warning callout-titled" title="Technical Content Disclaimer">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Technical Content Disclaimer
</div>
</div>
<div class="callout-body-container callout-body">
<!-- https://quarto.org/docs/authoring/callouts.html -->
<p>This post is not a journal-level review. My research for it was intended to be informational and does not exhaust the search space. If you notice any key papers or references that I have missed, or if I have misinterpreted the findings of any reference, please let me know in the comments.</p>
</div>
</div>
<section id="an-aside-on-naming-and-definitions" class="level1">
<h1>An Aside on Naming and Definitions</h1>
<p>Naming in science can lead to confusion when the thing named evolves beyond its original name and purpose. The relevant example here is “Large Language Models” (LLMs). Nearly all modern models of this kind rest on three things:</p>
<ol type="1">
<li>A large amount of training data to service a stacked collection of underlying layers.</li>
<li>Attention layers, the contents of which have evolved since <span class="citation" data-cites="vaswani_attention_2017">Vaswani et al. (2017)</span>, in (mostly) transformer-like layer collections.</li>
<li>Training on sequences of tokens, which mostly correspond to natural language.</li>
</ol>
<p>A few caveats, however, are necessary even in this basic list. First, the term “large” is relative: it indicates that the original LLMs were much larger than any previous models. Second, nothing ensures the primacy of Transformer layers. Recent models that focus on reducing inference compute integrate components of state space models (SSMs), most notably proposed in <span class="citation" data-cites="gu_mamba_2023">Gu and Dao (2023)</span>. I highly recommend watching Daniel Fu’s talk from NeurIPS 2023 for a foundational overview of this area <span class="citation" data-cites="fu_39014562_2023">(Fu 2023)</span>. Third, tokens are not necessarily language. To give just one example from my own work, PLAN-BERT <span class="citation" data-cites="shao_degree_2021">(Shao, Guo, and Pardos 2021)</span> uses course codes at UC Berkeley as tokens to predict course schedules.</p>
<p>As usual, Andrej Karpathy has already expressed this idea with a strong public response:</p>
<blockquote class="blockquote">
<p>It’s a bit sad and confusing that LLMs (“Large Language Models”) have little to do with language; It’s just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something. <span class="citation" data-cites="karpathy_its_2024">(@karpathy 2024)</span></p>
</blockquote>
<section id="what-is-a-small-language-model" class="level2">
<h2 class="anchored" data-anchor-id="what-is-a-small-language-model">🛩️ What is a Small Language Model?</h2>
<p>As with “large” models, the definition of “small” is somewhat loose. Rather than thinking of any specific size cutoff, here I will adapt the name to Small(er) Language Models. Many of the models considered here attempt to increase either the inference compute efficiency through explicit model choices or by more efficiently condensing the capabilities of an effective language model into a smaller parameter count.</p>
</section>
</section>
<section id="smaller-lms-technique-overview" class="level1">
<h1>Small(er) LMs: Technique Overview</h1>
<p>Here I examine a collection of modern small(er) language models to determine which techniques are used to create more efficient models. I break these techniques down into three stages: pre-train, train, and post-train.</p>
<section id="pre-train-data-curation-and-synthetic-data" class="level2">
<h2 class="anchored" data-anchor-id="pre-train-data-curation-and-synthetic-data">📚 Pre-train: data curation and synthetic data</h2>
<p>Several modern approaches incorporate high-quality synthetic data into their training datasets. For example, the Phi-2 model <span class="citation" data-cites="javaheripi_phi-2_2023">(Javaheripi and Bubeck 2023)</span> uses synthetic datasets specifically created to teach common sense reasoning and general knowledge, covering areas such as science, daily activities, and theory of mind. The Phi-3 model <span class="citation" data-cites="beatty_tiny_2024">Abdin et al. (2024)</span> takes this concept further by employing a sophisticated prompting and seeding formula inspired by the TinyStories approach. This method involves <span class="citation" data-cites="li_textbooks_2023">(Li et al. 2023)</span>:</p>
<ol type="1">
<li>Collecting publicly available information into an initial dataset</li>
<li>Using a large language model (LLM) to synthesize new content based on this data</li>
<li>Filtering the generated content for quality</li>
<li>Feeding the filtered content back into the LLM for further synthesis</li>
</ol>
<p>This iterative process allows researchers to build up a high-quality corpus of data large enough to train a capable small language model over several weeks.</p>
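The four steps above can be sketched as a loop; <code>generate</code> and <code>quality_score</code> below are hypothetical stand-ins for an LLM call and a quality filter, not any lab's actual pipeline:

```python
def synthesize_corpus(seed_docs, generate, quality_score,
                      rounds=3, threshold=0.8):
    """Iterative synthesize-filter-feedback loop (hypothetical sketch)."""
    corpus = list(seed_docs)          # step 1: start from collected seed data
    for _ in range(rounds):
        # Step 2: synthesize new content conditioned on the current corpus.
        candidates = [generate(doc) for doc in corpus]
        # Step 3: keep only generations that pass the quality filter.
        kept = [c for c in candidates if quality_score(c) >= threshold]
        # Step 4: feed the filtered content back in for the next round.
        corpus.extend(kept)
    return corpus

# Toy stand-ins so the loop is runnable end to end:
corpus = synthesize_corpus(
    ["seed text"],
    generate=lambda doc: doc + " + expansion",
    quality_score=lambda doc: 1.0 if "expansion" in doc else 0.0,
)
print(len(corpus))
```

Because accepted generations are fed back in, the corpus compounds across rounds, which is how a small seed can grow into a training set over several weeks.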
</section>
<section id="train-distillation-and-architecture" class="level2">
<h2 class="anchored" data-anchor-id="train-distillation-and-architecture">🚝 Train: Distillation and Architecture</h2>
<section id="knowledge-distillation" class="level3">
<h3 class="anchored" data-anchor-id="knowledge-distillation">Knowledge distillation</h3>
<p>Knowledge distillation is a technique where a smaller model (the student) is trained to mimic the behavior of a larger, more complex model (the teacher). This approach allows the smaller model to benefit from the knowledge captured by the larger model while maintaining a more compact size, e.g., Gemma 2 <span class="citation" data-cites="schmid_welcome_2024">(Schmid et al. 2024)</span>. Knowledge distillation is not new, even in the language model space <span class="citation" data-cites="sanh_distilbert_2019">(Sanh et al. 2019)</span>.</p>
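A minimal sketch of the classic soft-label distillation objective (the Hinton-style formulation; not necessarily any particular model's exact recipe): the student is trained to match temperature-softened teacher probabilities rather than hard labels.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 so gradient magnitudes are preserved."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# The loss vanishes when the student's logits match the teacher's exactly:
print(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))
```

The temperature spreads probability mass over non-argmax tokens, which is where most of the teacher's "dark knowledge" lives.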
</section>
<section id="architectural-approach-examples" class="level3">
<h3 class="anchored" data-anchor-id="architectural-approach-examples">Architectural Approach Examples</h3>
<p>Several architectural innovations have been employed to create more efficient small language models. A few examples include:</p>
<ol type="1">
<li>Grouped-Query Attention (GQA): This attention mechanism helps improve efficiency in processing queries. See <span class="citation" data-cites="abdin_phi-3_2024">Abdin et al. (2024)</span>.</li>
<li>Embedding Tying: This technique reduces the model’s parameter count by sharing weights between the input embedding and output layer.</li>
<li>Hybrid Models: Combines Mamba layers (a type of state space model) with shared attention layers. This approach aims to balance the efficiency of state space models with the expressiveness of attention mechanisms, e.g., <span class="citation" data-cites="glorioso_zamba2-mini_2024">Goel (2024)</span></li>
<li>Mixture of Expert Models: While not strictly a small model technique, the Mixture of Experts approach can create more efficient large models. For example, Phi-3.5-MoE comprises 16 experts, each containing 3.8B parameters. During inference, it activates only a subset of these experts (typically two), resulting in 7.6B active parameters out of a total of 42B.</li>
</ol>
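Of these, embedding tying is the simplest to illustrate. A toy sketch (invented numbers, 4-token vocabulary, 3-dimensional hidden states) of one matrix serving as both the input embedding and the output projection:

```python
# One shared weight matrix E: row lookup on the way in, dot products on the
# way out, so vocabulary parameters are counted once instead of twice.
E = [
    [0.1, 0.2, 0.3],  # token 0
    [0.4, 0.5, 0.6],  # token 1
    [0.7, 0.8, 0.9],  # token 2
    [1.0, 1.1, 1.2],  # token 3
]

def embed(token_id):
    # Input side: the embedding of a token is simply its row of E.
    return E[token_id]

def logits(hidden):
    # Output side: the logit for each token is the dot product of the hidden
    # state with that token's row of the SAME matrix E.
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]

print(logits(embed(2)))
```

For a real model with a 32K vocabulary and 2K-dimensional embeddings, tying removes roughly 65M parameters relative to keeping a separate output head.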
</section>
</section>
<section id="post-train" class="level2">
<h2 class="anchored" data-anchor-id="post-train">🔪 Post-train</h2>
<section id="pruning-and-quantization" class="level3">
<h3 class="anchored" data-anchor-id="pruning-and-quantization">Pruning and Quantization</h3>
<p>Pruning techniques remove less important weights or entire neurons from the model, reducing its size and potentially improving inference speed.</p>
<p>Quantization involves reducing the precision of the model’s weights (e.g., from 32-bit floating-point to 8-bit integers), which can significantly reduce the model’s memory footprint and inference time with minimal impact on performance.</p>
<p>While in theory both pruning and quantization can reduce model size, quantization is used far more often. A number of interesting papers have advanced the effectiveness of LLM quantization. Originally, LLMs quantized poorly because they exhibit outliers in specific activation channels across all layers and tokens<sup>1</sup>. <span class="citation" data-cites="malinovskii_evolution_2024">Malinovskii (2024)</span> has a good blog post summarizing the progression of this research.</p>
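A minimal sketch of symmetric per-tensor 8-bit weight quantization, the baseline that the outlier problem complicates (per-channel or group-wise scales are the usual remedy, and the weight values here are invented):

```python
def quantize(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate floats; this is what happens at inference time.
    return [qi * scale for qi in q]

weights = [0.03, -0.51, 0.27, 1.27, -1.00]
q, scale = quantize(weights)
restored = dequantize(q, scale)

# Rounding bounds the error at half a quantization step per weight --
# unless one outlier inflates the scale and crushes everything else.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, round(max_err, 4))
```

Note how a single large weight sets the scale for the whole tensor: this is exactly why outlier channels made early LLM quantization lossy.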
</section>
<section id="fine-tuning" class="level3">
<h3 class="anchored" data-anchor-id="fine-tuning">Fine-tuning</h3>
<p>After initial training, models can be fine-tuned on specific tasks or domains to improve their performance in targeted areas without increasing model size. This mainly helps when the goal is to adjust a model’s style or domain focus rather than to expand its underlying capabilities. Where applicable, however, QLoRA <span class="citation" data-cites="dettmers_qlora_2023">(Dettmers et al. 2023)</span> and its descendants enable low-compute model specialization.</p>
</section>
</section>
</section>
<section id="smaller-lms-core-techniques" class="level1">
<h1>Small(er) LMs: Core Techniques</h1>
<div class="callout callout-style-default callout-note callout-titled" title="Modification">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Modification
</div>
</div>
<div class="callout-body-container callout-body">
<!-- https://quarto.org/docs/authoring/callouts.html -->
<p>This section was added after the original post.</p>
</div>
</div>
<p>A few core techniques emerge consistently in the research and industry announcements: knowledge distillation (sometimes paired with sophisticated parameter-search techniques), memory-efficient attention mechanisms, and Mixture-of-Experts models. Quantization is being thoroughly explored for more computationally constrained scenarios, e.g. <span class="citation" data-cites="liu_vptq_2024">Liu et al. (2024)</span>.</p>
<section id="case-study-llama-3.1-nemotron-51b" class="level2">
<h2 class="anchored" data-anchor-id="case-study-llama-3.1-nemotron-51b">Case Study: Llama-3.1-Nemotron-51B</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="distill_nemotron.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="“Block-distillation – For blocks of the reference model (blue), we create multiple variants for the ‘student model’ (yellow) that mimic the block-wise teacher functionality.”"><img src="https://blog.kasralekan.com/ideas/small-language-models/distill_nemotron.png" class="img-fluid figure-img" alt="“Block-distillation – For blocks of the reference model (blue), we create multiple variants for the ‘student model’ (yellow) that mimic the block-wise teacher functionality.”"></a></p>
<figcaption>“Block-distillation – For blocks of the reference model (blue), we create multiple variants for the ‘student model’ (yellow) that mimic the block-wise teacher functionality.”</figcaption>
</figure>
</div>
<p>Last month, NVIDIA introduced Llama-3.1-Nemotron-51B, a language model derived from Meta’s Llama-3.1-70B. It achieves 2.2x faster inference and handles 4x larger workloads on a single GPU while maintaining accuracy comparable to its parent model <span class="citation" data-cites="bercovich_advancing_2024">(Bercovich and Karpas 2024)</span>. The core technique was multi-path knowledge distillation combined with a novel “Neural Architecture Search (NAS)” approach, which let them find a model matching their chosen point on the size-capability frontier. Together, these techniques represent an exciting step forward for small(er) language models.</p>
<blockquote class="blockquote">
<p>We then use our block-distillation framework to train all these block variants for all layers of a (large) parent LLM in parallel. In a basic version of block-distillation, training data is passed through the reference model &nbsp;(also known as a teacher). For each block, its input is taken from the teacher and injected into the matching block of the student. The outputs of the teacher and student for the block are compared and the student block is trained so that the student block mimics the functionality of the teacher block. A more advanced scenario where a single student block mimics multiple teacher blocks is depicted in the right-hand diagram of Figure 2. Next, we use our Puzzle algorithm to efficiently score each alternative replacement “puzzle piece” and search our enormous design space for the most accurate models, while adhering to a set of inference constraints, such as memory size and required throughput. Finally, by using knowledge distillation (KD) loss for both block scoring and training, we demonstrate the potential to narrow the accuracy gap between our model and the reference model using a much more efficient architecture with a tiny fraction of the reference model training costs. Using our methods on Llama-3.1-70B as the reference model, we built ​​Llama-3.1-Nemotron-51B-Instruct, a 51B model that breaks the efficient frontier of LLMs on a single NVIDIA H100 GPU.</p>
</blockquote>
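<p>In spirit, the per-block objective described above reduces to a classic distillation loss: soften the teacher’s and student’s outputs with a temperature, then penalize the divergence between them. The sketch below shows one common formulation (KL divergence on temperature-softened distributions); the logits, temperature, and use of final outputs rather than intermediate block activations are illustrative assumptions, not NVIDIA’s actual settings.</p>

```python
import math

# A minimal sketch of a distillation loss: make a student mimic a teacher
# by matching temperature-softened output distributions. All values here
# are illustrative, not NVIDIA's block-distillation configuration.

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's prediction
    # KL(p || q), scaled by T^2 so gradient magnitudes stay comparable
    # across temperatures (as in classic knowledge distillation).
    return temperature ** 2 * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, [1.8, 1.1, 0.2]))  # small: student is close
print(distillation_loss(teacher, [0.0, 0.0, 3.0]))  # large: student diverges
```

<p>NVIDIA’s block-distillation compares the outputs of matching teacher and student <em>blocks</em> (typically with a regression-style loss on activations) rather than final logits, but the mimic-the-teacher principle is the same.</p>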
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="layers_nemotron.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="“Runtime of Puzzle chosen blocks (layers) for attention layers (blue) and FFN layers (red) across the 80 layers of the reference model. Green areas correspond to overall runtime savings.”"><img src="https://blog.kasralekan.com/ideas/small-language-models/layers_nemotron.png" class="img-fluid figure-img" alt="“Runtime of Puzzle chosen blocks (layers) for attention layers (blue) and FFN layers (red) across the 80 layers of the reference model. Green areas correspond to overall runtime savings.”"></a></p>
<figcaption>“Runtime of Puzzle chosen blocks (layers) for attention layers (blue) and FFN layers (red) across the 80 layers of the reference model. Green areas correspond to overall runtime savings.”</figcaption>
</figure>
</div>
<p>There are several instances where learned structures are used in novel ways to improve performance (one of my favorites is <a href="https://arxiv.org/abs/1712.01208">learned indices</a>). When training new models, I, like many other researchers, would default to some consistent architectural pattern. Seeing some of these layers pruned without impacting model performance suggests that one could use this method to inform future model design or interpretability research by examining patterns in which layers get pruned.</p>
<p>Separately, OpenAI started to offer knowledge distillation of their core models <span class="citation" data-cites="openai_distillation_2024">(OpenAI 2024)</span>. Although we cannot determine the techniques employed on their backend, it is easy to imagine a world where people are able to choose a model distilled from the most powerful models to best suit their computational requirements.</p>
</section>
<section id="case-study-deepseek-v2" class="level2">
<h2 class="anchored" data-anchor-id="case-study-deepseek-v2">Case Study: DeepSeek-V2</h2>
<p>In the Transformer decoder, the attention computation for the current token depends on all preceding tokens (which is why each subsequent token is more expensive to generate). Naturally, the keys and values from this computation are cached (the KV cache), but this cache introduces a costly memory overhead. To address this overhead, several attention mechanisms have been introduced:</p>
<ul>
<li>Multi-Head Attention (MHA)</li>
<li>Multi-Query Attention (MQA)</li>
<li>Grouped-Query Attention (GQA)</li>
<li>Multi-Head Latent Attention (MLA was introduced with DeepSeek-V2, <span class="citation" data-cites="deepseek-ai_deepseek-v2_2024">(DeepSeek-AI et al. 2024)</span>)</li>
</ul>
<p>These attention mechanisms represent tradeoffs between attention effectiveness, computational cost, and scalability. For more details, see the phenomenal post by <span class="citation" data-cites="abideen_mha_2024">Abideen (2024)</span>. His conclusion:</p>
<blockquote class="blockquote">
<p>MHA can be faster for inference but its KV-cache overheads make it impossible to scale to larger-sized models. MQA significantly reduces KV-cache but degrades in quality as the model size increases. GQA is a balance between both attention mechanisms in terms of KV-caching and memory bandwidths. MLA requires a significantly lower KV cache yet outperforms MHA in output quality.</p>
</blockquote>
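<p>To make the KV-cache tradeoff concrete, here is a rough per-token cache size calculation for each variant. The fp16 precision and Llama-70B-like dimensions (80 layers, 64 query heads, head dimension 128, 8 GQA groups) are illustrative assumptions; MLA is omitted because it caches a learned low-rank latent whose size depends on the chosen compression dimension.</p>

```python
# Rough per-token KV-cache size for different attention variants, assuming
# fp16 (2 bytes/value) and illustrative Llama-70B-like dimensions.
BYTES = 2        # fp16
LAYERS = 80
HEADS = 64       # query heads
HEAD_DIM = 128

def kv_bytes_per_token(kv_heads: int) -> int:
    # K and V each store kv_heads * HEAD_DIM values per layer.
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES

mha = kv_bytes_per_token(kv_heads=HEADS)  # one K/V head per query head
gqa = kv_bytes_per_token(kv_heads=8)      # one K/V head per group of 8
mqa = kv_bytes_per_token(kv_heads=1)      # a single shared K/V head

print(f"MHA: {mha / 1024:.0f} KiB/token")  # 2560 KiB
print(f"GQA: {gqa / 1024:.0f} KiB/token")  # 320 KiB
print(f"MQA: {mqa / 1024:.0f} KiB/token")  # 40 KiB
```

<p>Multiplied over a long context and a large batch, that 64x gap between MHA and MQA is exactly the memory overhead that motivates these designs.</p>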
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>Creating efficient small(er) language models involves a combination of techniques applied at various stages of the model development process. Outside of research applications, small(er) language models enable faster inference speeds, which benefit production applications, and on-device inference, which is crucial for privacy-conscious users.</p>
<p>Most posts about small(er) language models amount to “Model X beats much larger model Y on benchmark Z!!!” To those in the research community, it is not controversial to point out that benchmarks have serious limitations. As such, benchmarks may not accurately estimate the performance of some small(er) language models, or may fail to highlight the vulnerabilities in their performance. For instance, a few months back, I wanted to run a multilingual language model on-device and found that most models at the time were capable only in English and a few other high-resource languages. Since then, models like Phi-3.5-MoE-instruct have filled some of this gap, albeit at much higher parameter counts than I had hoped.</p>
<p>As the field continues to evolve, we can expect to see further innovations in this space, making advanced NLP capabilities more accessible and deployable in resource-constrained environments.</p>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-abdin_phi-3_2024" class="csl-entry">
Abdin, Marah, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, et al. 2024. <span>“Phi-3 <span>Technical</span> <span>Report</span>: <span>A</span> <span>Highly</span> <span>Capable</span> <span>Language</span> <span>Model</span> <span>Locally</span> on <span>Your</span> <span>Phone</span>.”</span> arXiv. <a href="http://arxiv.org/abs/2404.14219">http://arxiv.org/abs/2404.14219</a>.
</div>
<div id="ref-abideen_mha_2024" class="csl-entry">
Abideen, Zain ul. 2024. <span>“<span>MHA</span> Vs <span>MQA</span> Vs <span>GQA</span> Vs <span>MLA</span>.”</span> <em>Medium</em>. <a href="https://medium.com/@zaiinn440/mha-vs-mqa-vs-gqa-vs-mla-c6cf8285bbec">https://medium.com/@zaiinn440/mha-vs-mqa-vs-gqa-vs-mla-c6cf8285bbec</a>.
</div>
<div id="ref-beatty_tiny_2024" class="csl-entry">
Beatty, Sally. 2024. <span>“Tiny but Mighty: <span>The</span> <span>Phi</span>-3 Small Language Models with Big Potential.”</span> <em>Microsoft News</em>. <a href="https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/">https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/</a>.
</div>
<div id="ref-bercovich_advancing_2024" class="csl-entry">
Bercovich, Akhiad, and Udi Karpas. 2024. <span>“Advancing the <span>Accuracy</span>-<span>Efficiency</span> <span>Frontier</span> with <span>Llama</span>-3.1-<span>Nemotron</span>-<span>51B</span>.”</span> <em>NVIDIA Technical Blog</em>. <a href="https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/">https://developer.nvidia.com/blog/advancing-the-accuracy-efficiency-frontier-with-llama-3-1-nemotron-51b/</a>.
</div>
<div id="ref-deepseek-ai_deepseek-v2_2024" class="csl-entry">
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, et al. 2024. <span>“<span>DeepSeek</span>-<span>V2</span>: <span>A</span> <span>Strong</span>, <span>Economical</span>, and <span>Efficient</span> <span>Mixture</span>-of-<span>Experts</span> <span>Language</span> <span>Model</span>.”</span> arXiv. <a href="https://doi.org/10.48550/arXiv.2405.04434">https://doi.org/10.48550/arXiv.2405.04434</a>.
</div>
<div id="ref-dettmers_qlora_2023" class="csl-entry">
Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. <span>“<span>QLoRA</span>: <span>Efficient</span> <span>Finetuning</span> of <span>Quantized</span> <span>LLMs</span>.”</span> <em>Advances in Neural Information Processing Systems</em> 36 (December): 10088–115. <a href="https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html">https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html</a>.
</div>
<div id="ref-fu_39014562_2023" class="csl-entry">
Fu, Daniel Y. 2023. <span>“<span>Monarch</span> <span>Mixer</span>: <span>A</span> <span>Simple</span> <span>Sub</span>-<span>Quadratic</span> <span>GEMM</span>-<span>Based</span> <span>Architecture</span>.”</span> <em>SlidesLive</em>. <a href="https://neurips.cc/virtual/2023/oral/73841">https://neurips.cc/virtual/2023/oral/73841</a>.
</div>
<div id="ref-glorioso_zamba2-mini_2024" class="csl-entry">
Glorioso, Paolo, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, and Beren Millidge. 2024. <span>“Zamba2-Mini - <span>Zyphra</span>.”</span> <a href="https://www.zyphra.com/post/zamba2-mini">https://www.zyphra.com/post/zamba2-mini</a>.
</div>
<div id="ref-goel_device_2024" class="csl-entry">
Goel, Karan. 2024. <span>“The <span>On</span>‑<span>Device</span> <span>Intelligence</span> <span>Update</span> - <span>Cartesia</span>.”</span> <a href="https://cartesia.ai/blog/2024-08-27-on-device">https://cartesia.ai/blog/2024-08-27-on-device</a>.
</div>
<div id="ref-gu_mamba_2023" class="csl-entry">
Gu, Albert, and Tri Dao. 2023. <span>“Mamba: <span>Linear</span>-<span>Time</span> <span>Sequence</span> <span>Modeling</span> with <span>Selective</span> <span>State</span> <span>Spaces</span>.”</span> arXiv. <a href="https://doi.org/10.48550/arXiv.2312.00752">https://doi.org/10.48550/arXiv.2312.00752</a>.
</div>
<div id="ref-javaheripi_phi-2_2023" class="csl-entry">
Javaheripi, Mojan, and Sébastien Bubeck. 2023. <span>“Phi-2: <span>The</span> Surprising Power of Small Language Models.”</span> <em>Microsoft Research</em>. <a href="https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/">https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/</a>.
</div>
<div id="ref-karpathy_its_2024" class="csl-entry">
@karpathy. 2024. <span>“It’s a Bit Sad and Confusing...”</span> Tweet. <em>Twitter</em>. <a href="https://x.com/karpathy/status/1835024197506187617">https://x.com/karpathy/status/1835024197506187617</a>.
</div>
<div id="ref-li_textbooks_2023" class="csl-entry">
Li, Yuanzhi, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. <span>“Textbooks <span>Are</span> <span>All</span> <span>You</span> <span>Need</span> <span>II</span>: Phi-1.5 Technical Report.”</span> arXiv. <a href="https://doi.org/10.48550/arXiv.2309.05463">https://doi.org/10.48550/arXiv.2309.05463</a>.
</div>
<div id="ref-liu_vptq_2024" class="csl-entry">
Liu, Yifei, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. 2024. <span>“<span>VPTQ</span>: <span>Extreme</span> <span>Low</span>-Bit <span>Vector</span> <span>Post</span>-<span>Training</span> <span>Quantization</span> for <span>Large</span> <span>Language</span> <span>Models</span>.”</span> arXiv. <a href="https://doi.org/10.48550/arXiv.2409.17066">https://doi.org/10.48550/arXiv.2409.17066</a>.
</div>
<div id="ref-malinovskii_evolution_2024" class="csl-entry">
Malinovskii, Vladimir. 2024. <span>“The <span>Evolution</span> of <span>Extreme</span> <span>LLM</span> <span>Compression</span>: <span>From</span> <span>QuIP</span> to <span>AQLM</span> with <span>PV</span>-<span>Tuning</span>.”</span> <em>Yandex</em>. <a href="https://medium.com/yandex/the-evolution-of-extreme-llm-compression-from-quip-to-aqlm-with-pv-tuning-19c44b91af96">https://medium.com/yandex/the-evolution-of-extreme-llm-compression-from-quip-to-aqlm-with-pv-tuning-19c44b91af96</a>.
</div>
<div id="ref-openai_distillation_2024" class="csl-entry">
OpenAI. 2024. <span>“Distillation <span>API</span> <span>Docs</span>.”</span> <a href="https://platform.openai.com">https://platform.openai.com</a>.
</div>
<div id="ref-sanh_distilbert_2019" class="csl-entry">
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. <span>“<span>DistilBERT</span>, a Distilled Version of <span>BERT</span>: Smaller, Faster, Cheaper and Lighter.”</span> arXiv. <a href="https://doi.org/10.48550/arXiv.1910.01108">https://doi.org/10.48550/arXiv.1910.01108</a>.
</div>
<div id="ref-schmid_welcome_2024" class="csl-entry">
Schmid, Philipp, Omar Sanseviero, Pedro Cuenca, Lewis Tunstall, Tom Aarsen, and Vaibhav Srivastav. 2024. <span>“Welcome <span>Gemma</span> 2 - <span>Google</span>’s New Open <span>LLM</span>.”</span> <a href="https://huggingface.co/blog/gemma2">https://huggingface.co/blog/gemma2</a>.
</div>
<div id="ref-shao_degree_2021" class="csl-entry">
Shao, Erzhuo, Shiyuan Guo, and Zachary A. Pardos. 2021. <span>“Degree <span>Planning</span> with <span>PLAN</span>-<span>BERT</span>: <span>Multi</span>-<span>Semester</span> <span>Recommendation</span> <span>Using</span> <span>Future</span> <span>Courses</span> of <span>Interest</span>.”</span> <em>Proceedings of the AAAI Conference on Artificial Intelligence</em> 35 (17): 14920–29. <a href="https://doi.org/10.1609/aaai.v35i17.17751">https://doi.org/10.1609/aaai.v35i17.17751</a>.
</div>
<div id="ref-vaswani_attention_2017" class="csl-entry">
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. <span>“Attention Is <span>All</span> You <span>Need</span>.”</span> <em>Advances in Neural Information Processing Systems</em>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Interestingly, this also presents a significant problem in interpretability research.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-lekan2024" class="csl-entry quarto-appendix-citeas">
Lekan, Kasra. 2024. <span>“The Large World of Small(er) Language
Models.”</span> September 18, 2024. <a href="https://blog.kasralekan.com/ideas/small-language-models/">https://blog.kasralekan.com/ideas/small-language-models/</a>.
</div></div></section></div> ]]></description>
  <category>[![](https://img.shields.io/endpoint?url=https%3A%2F%2Fhits.dwyl.com%2Fanrath%2Fblog_small-language-models.json&amp;show=unique&amp;style=flat-square&amp;label=Views&amp;color=orange)]()</category>
  <category>NLP</category>
  <category>LLM</category>
  <category>Lit. Review</category>
  <guid>https://blog.kasralekan.com/ideas/small-language-models/</guid>
  <pubDate>Wed, 18 Sep 2024 04:00:00 GMT</pubDate>
  <media:content url="https://blog.kasralekan.com/ideas/small-language-models/heroGraphSizeQuality.png" medium="image" type="image/png" height="114" width="144"/>
</item>
<item>
  <title>An Unexpected DNS Error</title>
  <dc:creator>Kasra Lekan</dc:creator>
  <link>https://blog.kasralekan.com/ideas/optimizing-serverless/</link>
  <description><![CDATA[ 
<div class="progress" id="progress">
    <div class="train">
        <div class="train-tail"></div>
        <div class="train-body" id="train-body"></div>
        <div class="train-head"></div>
    </div>
</div>




<p>A few days ago, I visited my freshly redesigned website, having neither viewed nor edited it since early the previous day. I was met with a DNS error, shown in Figure&nbsp;1. The error disappeared when I refreshed the page. However, most people who hit an error while visiting a website will assume the website is down, so the error needed to be addressed.</p>
<div id="fig-dns-bug" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-dns-bug-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="bug.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="The result of viewing my website for the first time in ~24 hours before attempting to address the issue. Refreshing the page caused it to resolve."><img src="https://blog.kasralekan.com/ideas/optimizing-serverless/bug.png" class="img-fluid figure-img"></a></p>
<figcaption>The result of viewing my website for the first time in ~24 hours before attempting to address the issue. Refreshing the page caused it to resolve.</figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-dns-bug-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
<section id="diagnosing-the-problem" class="level1">
<h1>Diagnosing the Problem</h1>
<p>I knew that refreshing the page resolved the problem and that the issue would not arise if I revisited the website soon after a previous visit. This indicated that the problem likely had nothing to do with the standard speed benchmarks we concern ourselves with in web development, e.g.&nbsp;first contentful paint. Nevertheless, I wanted to rule out first contentful paint as an issue, so I did some speed testing on the deployed website. As expected, the times were great, with the first content painted within 0.54 seconds and the entire website loading in 1.5 seconds.</p>
<p>I was unable to find any documentation of others having this problem, despite having a fairly standard NextJS website without a sprawling dependency list or complicated logic. I had deployed many websites with Vercel before with no issues, so I knew something about this project differed from my previous ones. I was using two packages I had never used before: <code>framer-motion</code> and <code>react-rough-notations</code>. I considered the possibility that these dependencies were the culprits because they drastically slowed down compilation when I was developing locally, taking roughly 5 seconds to load content after a fresh restart with <code>next dev</code>.</p>
<p>Based on my observations, I decided to focus on optimizing the bundling of my website’s packages in any way I could. In the back of my mind, I suspected that these issues should not cause the DNS error I saw, but I had no other theories and had to start on a solution.</p>
</section>
<section id="a-digression-on-serverless-computing" class="level1">
<h1>A Digression on Serverless Computing</h1>
<p>Vercel<sup>1</sup> deployments are, to a first approximation, a layer on top of <a href="https://en.wikipedia.org/wiki/AWS_Lambda">AWS Lambda</a>, a serverless computing service. While I have not worked on any large-scale projects that provisioned large cloud systems, I studied them during my Master’s, and they are perhaps the most underrated technical achievement fueling the Internet today.</p>
<p>In general, cloud computing provides economies of scale, reducing the cost of server management and aggregating the best technical expertise. Serverless functions are another layer of innovation on top of cloud computing. They are heavily optimized, down to the kernel level, so that they can cold-start within hundreds of milliseconds, or even microseconds, of a request. Thus, serverless functions can start at request time rather than running permanently (Figure&nbsp;2). The reason I don’t have to pay to deploy my “hobby” projects (with low traffic) is that it costs virtually nothing<sup>2</sup> to run these serverless functions.</p>
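<p>The pricing in footnote 2 makes “virtually nothing” easy to sanity-check. A minimal back-of-the-envelope sketch, using the per-request and per-GB-second rates quoted there (current AWS prices may differ):</p>

```python
# Back-of-the-envelope AWS Lambda cost estimate. The rates below are the
# ones quoted in this post's footnote; check current AWS pricing.
PRICE_PER_MILLION_REQUESTS = 0.20   # USD
PRICE_PER_GB_SECOND = 0.0000166667  # USD

def lambda_cost(requests: int, duration_s: float, memory_gb: float) -> float:
    """Total cost of `requests` invocations, each running for `duration_s`
    seconds with `memory_gb` of memory allocated."""
    request_cost = requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    compute_cost = requests * duration_s * memory_gb * PRICE_PER_GB_SECOND
    return request_cost + compute_cost

# 6,000 one-second invocations at 1 GB: ~$0.10 of compute (the footnote's
# figure) plus a fraction of a cent in request charges.
print(f"${lambda_cost(6000, 1.0, 1.0):.4f}")
```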
<div id="fig-serverless-function" class="quarto-layout-panel">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-serverless-function-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><a href="serverlessFunctions.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="Figure&nbsp;2: Serverless computing (FaaS) visualized. Image credit: Prof.&nbsp;Yue Cheng"><img src="https://blog.kasralekan.com/ideas/optimizing-serverless/serverlessFunctions.png" class="img-fluid figure-img"></a></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><a href="serverlessHood.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="Figure&nbsp;2: Serverless computing (FaaS) visualized. Image credit: Prof.&nbsp;Yue Cheng"><img src="https://blog.kasralekan.com/ideas/optimizing-serverless/serverlessHood.png" class="img-fluid figure-img"></a></p>
</div>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-serverless-function-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Serverless computing (FaaS) visualized. Image credit: <a href="https://datascience.virginia.edu/people/yue-cheng">Prof.&nbsp;Yue Cheng</a>
</figcaption>
</figure>
</div>
</section>
<section id="solutions" class="level1">
<h1>Solutions</h1>
<div id="fig-bundle-analyzer" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-bundle-analyzer-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="bundleAnalyzerInit.png" class="lightbox" data-gallery="quarto-lightbox-gallery-4" title="The initial bundle analyzer output showing the size of the various parts of the website after building. NodeJS server results are shown."><img src="https://blog.kasralekan.com/ideas/optimizing-serverless/bundleAnalyzerInit.png" class="img-fluid figure-img"></a></p>
<figcaption>The initial bundle analyzer output showing the size of the various parts of the website after building. NodeJS server results are shown.</figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-bundle-analyzer-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3
</figcaption>
</figure>
</div>
<p>Based on the documentation and the results of the bundle analyzer, I did the following:</p>
<ol start="0" type="1">
<li>I added a <a href="https://nextjs.org/docs/app/building-your-application/routing/loading-ui-and-streaming">Suspense wrapper</a> to my website’s content with a loading element.
<ul>
<li>I knew this would not solve my issue since my first-paint speed was good, but using Suspense to have a loading UI is best practice.</li>
</ul></li>
<li>I added the <code>optimizePackageImports</code> flag and applied it to <code>framer-motion</code> and <code>react-rough-notations</code>. From the <a href="https://nextjs.org/docs/app/building-your-application/optimizing/package-bundling#optimizing-package-imports">docs</a>, “This option will only load the modules you actually use, while still giving you the convenience of writing import statements with many named exports.”
<ul>
<li>I was especially interested in its performance with <code>framer-motion</code> since the dependency JavaScript is large and there is not much that you can do to reduce it<sup>3</sup> since you cannot import specific animations.</li>
</ul></li>
<li>I converted <code>framer-motion</code> calls to use vanilla CSS for simple animations.
<ul>
<li>I had been using <code>framer-motion</code> where vanilla CSS could accomplish the same effects, since I already had the dependency. Paired with the next optimization, switching to vanilla CSS let me remove <code>framer-motion</code> from the critical page load.</li>
</ul></li>
<li>I applied component lazy loading and skipped SSR<sup>4</sup> for my animated components based on the <a href="https://nextjs.org/docs/app/building-your-application/optimizing/lazy-loading#skipping-ssr">docs</a>.</li>
</ol>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">🛬 Results</h2>
<div id="fig-bundle-analyzer-2" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-bundle-analyzer-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="bundleAnalyzerPost.png" class="lightbox" data-gallery="quarto-lightbox-gallery-5" title="The final bundle analyzer output showing the size of the various parts of the website after building with optimizations. Optimizations reduced the parsed size by ~19%. NodeJS server results are shown."><img src="https://blog.kasralekan.com/ideas/optimizing-serverless/bundleAnalyzerPost.png" class="img-fluid figure-img"></a></p>
<figcaption>The final bundle analyzer output showing the size of the various parts of the website after building with optimizations. Optimizations reduced the parsed size by ~19%. NodeJS server results are shown.</figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-bundle-analyzer-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4
</figcaption>
</figure>
</div>
<p>After applying the optimizations, my server-side bundle size (parsed, not static) was reduced by 19.1%, as shown in Figure&nbsp;4. Since these updates, my website has worked properly even after prolonged periods without requests.</p>
</section>
<section id="why-not-astro" class="level2">
<h2 class="anchored" data-anchor-id="why-not-astro">🚀 Why not Astro</h2>
<p>When first building my new website, I attempted to use <a href="https://astro.build/">Astro</a> instead of NextJS. Since my website is entirely static, Astro would perhaps be a better technical fit, especially since its implementation of <a href="https://astro.build/blog/astro-4120/">Server Islands</a>, which is similar to NextJS 14’s <a href="https://nextjs.org/docs/app/building-your-application/rendering/partial-prerendering">Partial Prerendering</a>, would allow me to add dynamic components when necessary. There is a lot I love about Astro. However, I struggled to use stateful components in Astro and ultimately had to use Next.</p>
</section>
<section id="reflections" class="level2">
<h2 class="anchored" data-anchor-id="reflections">🪞 Reflections</h2>
<p>The automatic optimizations that enable the rapid deployment of efficient websites like my own are truly stunning. The combination of serverless optimization, package bundle optimizations, and component optimizations in React metaframeworks allows me to focus on the content first. When I made my first website some years ago, I manually compressed all my images and converted them to web-friendly formats. With modern frameworks like Next, built-in components such as <code>Image</code> perform optimizations like this for you.</p>
<blockquote class="blockquote">
<p>We stand on the shoulders of giants.</p>
</blockquote>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>My website is built with NextJS so <a href="https://www.vercel.com/">Vercel</a> was the natural choice for deployment.↩︎</p></li>
<li id="fn2"><p>Pricing for AWS Lambdas are billed at the 1-millisecond granularity. As of writing:</p>
<ul>
<li>$0.20 per million requests</li>
<li>$0.0000166667 per GB-second of compute</li>
</ul>
<p>This implies that running 6,000 1&nbsp;GB Lambda functions for one second each costs about $0.10.↩︎</p></li>
<li id="fn3"><p>This is not entirely true. There are some <a href="https://www.framer.com/motion/guide-reduce-bundle-size/">tools</a> Framer provides to reduce the bundle size but they have limitations. As Framer points out, this normally shouldn’t be necessary because most bundlers apply “tree shaking” to reduce the bundle to just what is used.↩︎</p></li>
<li id="fn4"><p>Server Side Rendering (SSR) refers to when the HTML that the client is served is generated on the server, often using a database or other external APIs.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-lekan2024" class="csl-entry quarto-appendix-citeas">
Lekan, Kasra. 2024. <span>“An Unexpected DNS Error.”</span> September 1,
2024. <a href="https://blog.kasralekan.com/ideas/optimizing-serverless/">https://blog.kasralekan.com/ideas/optimizing-serverless/</a>.
</div></div></section></div> ]]></description>
  <category>[![](https://img.shields.io/endpoint?url=https%3A%2F%2Fhits.dwyl.com%2Fanrath%2Fblog_optimizing-payload.json&amp;show=unique&amp;style=flat-square&amp;label=Views&amp;color=orange)]()</category>
  <category>Cloud Computing</category>
  <category>WebDev</category>
  <guid>https://blog.kasralekan.com/ideas/optimizing-serverless/</guid>
  <pubDate>Sun, 01 Sep 2024 04:00:00 GMT</pubDate>
  <media:content url="https://blog.kasralekan.com/ideas/optimizing-serverless/bug.png" medium="image" type="image/png" height="80" width="144"/>
</item>
<item>
  <title>Why Game Design is Hard</title>
  <dc:creator>Kasra Lekan</dc:creator>
  <link>https://blog.kasralekan.com/ideas/game-design/</link>
  <description><![CDATA[ 
<div class="progress" id="progress">
    <div class="train">
        <div class="train-tail"></div>
        <div class="train-body" id="train-body"></div>
        <div class="train-head"></div>
    </div>
</div>




<div id="fig-maplestory" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-maplestory-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="maplestory-basilmarket-screen.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Maplestory was a popular 2D MMO released in 2003."><img src="https://blog.kasralekan.com/ideas/game-design/maplestory-basilmarket-screen.jpg" class="img-fluid figure-img"></a></p>
<figcaption>Maplestory was a popular 2D MMO released in 2003.</figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-maplestory-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
<section id="a-simple-observation" class="level1">
<h1>A Simple Observation</h1>
<p>Why are there so few MMO<sup>1</sup> games? Although I rarely played MMOs, the few times I have, they have been extremely fun because of the large-scale social experience: at times you’re interacting with tens or hundreds of real people in real time.</p>
<p>I posed this question to a gamer/network engineer I know. He asked me to consider the tech stack required to provide a seamless high-player-count experience: a server receiving data from ~10-100 clients, each with a different latency to the server (but all providing input at the same time), which must reconcile all of those actions and inform the clients of what happened. This is a difficult technical problem that most game engines<sup>2</sup> can ignore because of their low supported player counts. While this was a compelling argument, I also considered how making an MMO may be a poor business decision for most developers.</p>
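<p>To make the reconciliation problem concrete, here is a toy sketch in plain Python (my own illustration, not how any real engine or netcode library works): an authoritative server buffers timestamped inputs and applies them in the order they <em>happened</em>, not the order they arrived over the network.</p>

```python
import heapq

class AuthoritativeServer:
    """Toy authoritative game server (1D positions for simplicity)."""

    def __init__(self):
        self.pending = []  # min-heap of (client_time, player_id, action)
        self.state = {}    # player_id -> position

    def receive(self, client_time, player_id, action):
        # Inputs arrive out of order because clients have different latency.
        heapq.heappush(self.pending, (client_time, player_id, action))

    def tick(self):
        # Once per tick, apply all buffered inputs in timestamp order,
        # then broadcast a consistent snapshot to every client.
        while self.pending:
            _, pid, action = heapq.heappop(self.pending)
            pos = self.state.get(pid, 0)
            self.state[pid] = pos + (1 if action == "right" else -1)
        return dict(self.state)

server = AuthoritativeServer()
# Player B's input arrives later over the wire but happened earlier.
server.receive(105, "A", "right")
server.receive(100, "B", "right")
snapshot = server.tick()  # {"A": 1, "B": 1}
```

<p>Real MMO netcode adds lag compensation, client-side prediction, and interest management on top of this, which is exactly the engineering load most engines never have to carry.</p>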
<p>I decided to explore deeper. Jump to the context section for an overview of the industry.</p>
</section>
<section id="technical-and-business-factors" class="level1">
<h1>Technical and Business Factors</h1>
<section id="game-engines-and-devices" class="level2">
<h2 class="anchored" data-anchor-id="game-engines-and-devices">🚂 Game Engines and Devices</h2>
<p>Game distribution today is diverse. While physical copies maintain some presence in the market, most games are distributed through online marketplaces like <a href="https://en.wikipedia.org/wiki/Steam_(service)">Steam</a> or mobile app stores. Since distribution is relatively simple, the main barriers to entry for new titles are (1) design and development time and (2) marketing.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div id="fig-engine-decisions" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-engine-decisions-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div>
<pre class="mermaid mermaid-js" data-label="fig-engine-decisions">flowchart TD

subgraph Z[" "]
direction LR
  A{No-Code or Coding?} --&gt; C{Platform/Device}

  C --&gt; D1[Console]
  C --&gt; D2[Desktop]
  C --&gt; D3[Mobile]
  C --&gt; D4[VR]
end

subgraph ZA[" "]
direction LR
    B{2D or 3D?} --&gt; E{Licensing Agreement}
    
    E--&gt;H1[Open Source/Free]
    E--&gt;H2[Commercial Licensing]
end

Z --&gt; ZA
</pre>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-engine-decisions-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: A sample decision hierarchy for game engine selection.
</figcaption>
</figure>
</div>
</div>
</div>
<p>Game design is part engineering, part art, and part business. Game engines are designed to carry some of the engineering load. While there are many engines, the choice is based on a few key criteria summarized in Figure&nbsp;2. A few game engines worth pointing out:</p>
<ul>
<li><a href="https://github.com/kaplayjs/kaplay">Kaplay (formerly Kaboom)</a> - The premier JavaScript-based game library, enabling web-first games but also desktop and mobile games through tools like <a href="https://github.com/electron/electron">Electron</a> and <a href="https://github.com/tauri-apps/tauri">Tauri</a>.</li>
<li><a href="https://github.com/o3de/o3de">Open 3D Engine (O3DE)</a> - an open-source, multi-platform 3D engine originally built for AAA titles<sup>3</sup>.</li>
<li><a href="https://github.com/godotengine/godot">Godot</a> - another open-source, cross-platform engine; designed for both 2D and 3D.</li>
<li><a href="https://unity.com/">Unity</a> and <a href="https://www.unrealengine.com/">Unreal</a> - closed-source but the most popular game engines. Unity has an extremely popular mobile game engine component as well.</li>
</ul>
</section>
<section id="customer-satisfaction" class="level2">
<h2 class="anchored" data-anchor-id="customer-satisfaction">🤨 Customer Satisfaction</h2>
<p>Simple games have an easier time satisfying the player. Classic Mario has mechanics, nuance, and plot, but ultimately the player is a lone gamer trying to complete the level: a clear objective with no external input.</p>
<p>Adding complexity, especially multiplayer support, makes game design harder. Just as businesses have different categories of customers, games have different kinds of players, and balancing the game to keep each of them happy is difficult. A common imbalance arises between “casual” and “hard-core” players: some players devote only an hour here and there, while hard-core players quickly reach the “end-game” content. If all of the development energy goes into the initial progression, the end-game suffers and hard-core players leave disappointed.</p>
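<p>A back-of-envelope calculation shows why this imbalance is so hard to avoid. The numbers below are purely hypothetical, chosen only to illustrate the gap:</p>

```python
# Hypothetical figures for illustration only.
content_hours = 120           # hours of designed progression content
casual_hours_per_week = 3     # a casual player's weekly playtime
hardcore_hours_per_week = 30  # a hard-core player's weekly playtime

weeks_casual = content_hours / casual_hours_per_week      # 40 weeks
weeks_hardcore = content_hours / hardcore_hours_per_week  # 4 weeks
```

<p>With a 10x difference in playtime, hard-core players exhaust ten months of a casual player’s content in one month, so developers must either drip-feed content or build end-game systems that stay engaging on repeat play.</p>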
</section>
<section id="case-study-new-world" class="level2">
<h2 class="anchored" data-anchor-id="case-study-new-world">🆕 Case Study: New World</h2>
<blockquote class="blockquote">
<p>New World is a massively popular multiplayer online (MMO) game created by Amazon Games and released in September 2021… For players, the game is an extremely immersive experience: they don’t have to wait for screens to load or other interruptions. This is known as “seamlessness,” and it’s a valuable trait for developers to be able to deliver. <span class="citation" data-cites="walsh_unique_2022">(Walsh 2022)</span></p>
</blockquote>
<p>New World is the most popular new MMO I have seen in North America. Owned by Amazon, the game naturally used much of Amazon’s infrastructure along with the now-deprecated Lumberyard engine. <span class="citation" data-cites="walsh_unique_2022">Walsh (2022)</span> summarizes the architecture that supported the game (Figure&nbsp;3).</p>
<div id="fig-aws" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-aws-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="newWorldArchitecture.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="New World‘s Architecture"><img src="https://blog.kasralekan.com/ideas/game-design/newWorldArchitecture.png" class="img-fluid figure-img"></a></p>
<figcaption>New World‘s Architecture</figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-aws-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3
</figcaption>
</figure>
</div>
<blockquote class="blockquote">
<p>Collectively, the Amazon EC2 instances for a single world in Aeternum can simulate more than 7,000 artificial intelligence entities and hundreds of thousands of objects for 2,500 players. Each server set often processes millions of state changes per second, selecting the relevant data to create individual immersive experiences.</p>
</blockquote>
<p>While this architecture is complex, it is exciting that cloud computing has made provisioning infrastructure for such a massive game far easier.</p>
<p>Despite its commercial success at launch, New World lost most of its player base soon after. The game peaked on September 27th, 2021 at just over 900,000 players. By the end of the year, that number had dwindled to 117,000. One month later: 68,000. The month after that: 34,000. The game maintained roughly that player base through the end of 2023 <span class="citation" data-cites="steamcharts_new_2024">(steamcharts 2024)</span>.</p>
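<p>The decline is easier to see as period-over-period retention. Note that the first figure spans roughly three months (launch peak to year-end), while the rest are month-over-month:</p>

```python
# Approximate player peaks from the steamcharts figures above.
players = [900_000, 117_000, 68_000, 34_000]

# Fraction of the previous period's players retained.
retention = [round(later / earlier, 2)
             for earlier, later in zip(players, players[1:])]
# [0.13, 0.58, 0.5] -> 13% retained from the launch peak,
# then roughly half lost each subsequent month.
```

<p>Losing around half of the remaining players each month before stabilizing is a pattern many live-service launches share, which compounds the business risk of building an MMO in the first place.</p>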
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>Given the technical complexity of supporting a large player base with “seamlessness” and the challenge of satisfying a wide variety of customers, it is unsurprising that few new MMOs are made. Successful games can bring in strong revenue, but the return on invested time may be higher for a simpler game. It’s worth noting that many indie developers make games that they themselves want to play and do not approach development with the calculating lens I took in writing this article.</p>
</section>
<section id="context" class="level1">
<h1>Context</h1>
<blockquote class="blockquote">
<p>The video game industry is a dynamic sector, characterized by a diverse customer base, a wide array of developers, dominant geographies, various distribution systems and devices, multiple business models, and a plethora of game genres.</p>
</blockquote>
<section id="market-overview" class="level2">
<h2 class="anchored" data-anchor-id="market-overview">🌏 Market Overview</h2>
<p>Here I’m focused on a small subset of the overall game market in terms of both revenue and player count. However, some data on the industry overall will be useful in contextualizing the information presented here.</p>
<div id="cell-fig-platform-revenue" class="cell" data-execution_count="1">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> plotly.subplots <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> make_subplots</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> plotly.graph_objects <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> go</span>
<span id="cb1-3"></span>
<span id="cb1-4">labels_sector <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mobile"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Console"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"PC"</span>]</span>
<span id="cb1-5">values_sector <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">90.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">53.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">40.3</span>]</span>
<span id="cb1-6"></span>
<span id="cb1-7">fig_sector <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> go.Figure(data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[go.Pie(labels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>labels_sector, values<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>values_sector, hole<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">.3</span>)])</span>
<span id="cb1-8"></span>
<span id="cb1-9">fig_sector.update_traces(</span>
<span id="cb1-10">    textinfo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label+text+percent"</span>, </span>
<span id="cb1-11">    texttemplate<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"%</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{label}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">&lt;br&gt;$%</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{value}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> billion&lt;br&gt;(%</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{percent}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span></span>
<span id="cb1-12">)</span>
<span id="cb1-13"></span>
<span id="cb1-14">fig_sector.update_layout(</span>
<span id="cb1-15">    title_text<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023 Global Games Market Revenue, by Sector"</span>,</span>
<span id="cb1-16">    dragmode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pan"</span>,</span>
<span id="cb1-17">    plot_bgcolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#2C2F33'</span>,</span>
<span id="cb1-18">    paper_bgcolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#23272A'</span>,</span>
<span id="cb1-19">    font_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>,</span>
<span id="cb1-20">    title_font_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>,</span>
<span id="cb1-21">    legend_title_font_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>,</span>
<span id="cb1-22">    legend_font_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>,</span>
<span id="cb1-23">    xaxis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(</span>
<span id="cb1-24">        title_font<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>),</span>
<span id="cb1-25">        tickfont<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>)</span>
<span id="cb1-26">    ),</span>
<span id="cb1-27">    yaxis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(</span>
<span id="cb1-28">        title_font<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>),</span>
<span id="cb1-29">        tickfont<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>)</span>
<span id="cb1-30">    ),</span>
<span id="cb1-31">    margin<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(l<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, r<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb1-32">)</span>
<span id="cb1-33"></span>
<span id="cb1-34">fig_sector.show(config<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"displaylogo"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>})</span></code></pre></div>
</details>
<div id="fig-platform-revenue" class="cell-output cell-output-display quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-platform-revenue-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div>                            <div id="d8b6414d-11fc-47e6-bd19-78ef269c701e" class="plotly-graph-div" style="height:525px; width:100%;"></div>            <script type="text/javascript">                require(["plotly"], function(Plotly) {                    window.PLOTLYENV=window.PLOTLYENV || {};                                    if (document.getElementById("d8b6414d-11fc-47e6-bd19-78ef269c701e")) {                    Plotly.newPlot(                        "d8b6414d-11fc-47e6-bd19-78ef269c701e",                        [{"hole":0.3,"labels":["Mobile","Console","PC"],"values":[90.4,53.2,40.3],"type":"pie","textinfo":"label+text+percent","texttemplate":"%{label}\u003cbr\u003e$%{value} billion\u003cbr\u003e(%{percent})"}],                        {"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":
[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"
scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd
3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d9221"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"},"margin":{"b":0,"l":0,"r":0,"t":30}}},"title":{"font":{"color":"#FFFFFF"},"text":"2023 Global Games Market Revenue, by 
Sector"},"font":{"color":"#FFFFFF"},"legend":{"title":{"font":{"color":"#FFFFFF"}},"font":{"color":"#FFFFFF"}},"xaxis":{"title":{"font":{"color":"#FFFFFF"}},"tickfont":{"color":"#FFFFFF"}},"yaxis":{"title":{"font":{"color":"#FFFFFF"}},"tickfont":{"color":"#FFFFFF"}},"margin":{"l":50,"r":50,"t":50,"b":50},"dragmode":"pan","plot_bgcolor":"#2C2F33","paper_bgcolor":"#23272A"},                        {"displaylogo": false, "responsive": true}                    ).then(function(){
                            
var gd = document.getElementById('d8b6414d-11fc-47e6-bd19-78ef269c701e');
var x = new MutationObserver(function (mutations, observer) {{
        var display = window.getComputedStyle(gd).display;
        if (!display || display === 'none') {{
            console.log([gd, 'removed!']);
            Plotly.purge(gd);
            observer.disconnect();
        }}
}});

// Listen for the removal of the full notebook cells
var notebookContainer = gd.closest('#notebook-container');
if (notebookContainer) {{
    x.observe(notebookContainer, {childList: true});
}}

// Listen for the clearing of the current output cell
var outputEl = gd.closest('.output');
if (outputEl) {{
    x.observe(outputEl, {childList: true});
}}

                        })                };                });            </script>        </div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-platform-revenue-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4: A breakdown of 2023 global game revenue by platform.
</figcaption>
</figure>
</div>
</div>
<p>In 2023, the global video game market was valued at $184 billion, with ~22% from PC games (Figure&nbsp;4). Here I focus on that portion of the market since it is most germane to my questions of interest.</p>
<div id="cell-fig-region-breakdown" class="cell" data-execution_count="2">
<details class="code-fold">
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> plotly.subplots <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> make_subplots</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> plotly.graph_objects <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> go</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Data for the pie charts</span></span>
<span id="cb2-5">labels_market_share <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Asia-Pacific"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"North America"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Europe"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Latin America"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Middle East/Africa"</span>]</span>
<span id="cb2-6">values_market_share <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">84.1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">50.6</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">33.6</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">8.7</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.1</span>]</span>
<span id="cb2-7"></span>
<span id="cb2-8">labels_player_share <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Asia-Pacific"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Middle East/Africa"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Europe"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Latin America"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"North America"</span>]</span>
<span id="cb2-9">values_player_share <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1800</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">574</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">447</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">335</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">237</span>]</span>
<span id="cb2-10"></span>
<span id="cb2-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a subplot with two rows and one column, vertically aligned</span></span>
<span id="cb2-12">fig_combined <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_subplots(</span>
<span id="cb2-13">    rows<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, cols<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb2-14">    row_heights<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>],</span>
<span id="cb2-15">    vertical_spacing<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>,</span>
<span id="cb2-16">    subplot_titles<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023 Global Games Market Share ($ billions), by Region"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023 Global Games Player Count (millions), by Region"</span>),</span>
<span id="cb2-17">    specs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[[{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pie"</span>}], [{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pie"</span>}]]</span>
<span id="cb2-18">)</span>
<span id="cb2-19"></span>
<span id="cb2-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Adding the Market Share pie chart to the first row</span></span>
<span id="cb2-21">fig_combined.add_trace(</span>
<span id="cb2-22">    go.Pie(labels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>labels_market_share, values<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>values_market_share, hole<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">.3</span>, hoverinfo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'label+percent+value'</span>),</span>
<span id="cb2-23">    row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb2-24">)</span>
<span id="cb2-25"></span>
<span id="cb2-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Adding the Player Share pie chart to the second row</span></span>
<span id="cb2-27">fig_combined.add_trace(</span>
<span id="cb2-28">    go.Pie(labels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>labels_player_share, values<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>values_player_share, hole<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">.3</span>, hoverinfo<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'label+percent+value'</span>),</span>
<span id="cb2-29">    row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb2-30">)</span>
<span id="cb2-31"></span>
<span id="cb2-32">fig_combined.update_layout(</span>
<span id="cb2-33">    dragmode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pan"</span>,</span>
<span id="cb2-34">    plot_bgcolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#2C2F33'</span>,</span>
<span id="cb2-35">    paper_bgcolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#23272A'</span>,</span>
<span id="cb2-36">    font_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>,</span>
<span id="cb2-37">    title_font_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>,</span>
<span id="cb2-38">    legend_title_font_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>,</span>
<span id="cb2-39">    legend_font_color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>,</span>
<span id="cb2-40">    xaxis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(</span>
<span id="cb2-41">        title_font<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>),</span>
<span id="cb2-42">        tickfont<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>)</span>
<span id="cb2-43">    ),</span>
<span id="cb2-44">    yaxis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(</span>
<span id="cb2-45">        title_font<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>),</span>
<span id="cb2-46">        tickfont<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'#FFFFFF'</span>)</span>
<span id="cb2-47">    ),</span>
<span id="cb2-48">    margin<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(l<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, r<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb2-49">)</span>
<span id="cb2-50"></span>
<span id="cb2-51">fig_combined.show(config<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"displaylogo"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>})</span></code></pre></div>
</details>
<div id="fig-region-breakdown" class="cell-output cell-output-display quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-region-breakdown-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div>                            <div id="e5c273f9-7251-4ea9-9f8b-c281d1d43d12" class="plotly-graph-div" style="height:525px; width:100%;"></div>            <script type="text/javascript">                require(["plotly"], function(Plotly) {                    window.PLOTLYENV=window.PLOTLYENV || {};                                    if (document.getElementById("e5c273f9-7251-4ea9-9f8b-c281d1d43d12")) {                    Plotly.newPlot(                        "e5c273f9-7251-4ea9-9f8b-c281d1d43d12",                        [{"hole":0.3,"hoverinfo":"label+percent+value","labels":["Asia-Pacific","North America","Europe","Latin America","Middle East\u002fAfrica"],"values":[84.1,50.6,33.6,8.7,7.1],"type":"pie","domain":{"x":[0.0,1.0],"y":[0.55,1.0]}},{"hole":0.3,"hoverinfo":"label+percent+value","labels":["Asia-Pacific","Middle East\u002fAfrica","Europe","Latin America","North America"],"values":[1800,574,447,335,237],"type":"pie","domain":{"x":[0.0,1.0],"y":[0.0,0.45]}}],                        
{"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"
],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinec
olor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d9221"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":tr
ue,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"},"margin":{"b":0,"l":0,"r":0,"t":30}}},"annotations":[{"font":{"size":16},"showarrow":false,"text":"2023 Global Games Market Share ($ billions), by Region","x":0.5,"xanchor":"center","xref":"paper","y":1.0,"yanchor":"bottom","yref":"paper"},{"font":{"size":16},"showarrow":false,"text":"2023 Global Games Player Count (millions), by Region","x":0.5,"xanchor":"center","xref":"paper","y":0.45,"yanchor":"bottom","yref":"paper"}],"font":{"color":"#FFFFFF"},"title":{"font":{"color":"#FFFFFF"}},"legend":{"title":{"font":{"color":"#FFFFFF"}},"font":{"color":"#FFFFFF"}},"xaxis":{"title":{"font":{"color":"#FFFFFF"}},"tickfont":{"color":"#FFFFFF"}},"yaxis":{"title":{"font":{"color":"#FFFFFF"}},"tickfont":{"color":"#FFFFFF"}},"margin":{"l":50,"r":50,"t":50,"b":50},"dragmode":"pan","plot_bgcolor":"#2C2F33","paper_bgcolor":"#23272A"},                        {"displaylogo": false, "responsive": true}                    ).then(function(){
                            
var gd = document.getElementById('e5c273f9-7251-4ea9-9f8b-c281d1d43d12');
var x = new MutationObserver(function (mutations, observer) {{
        var display = window.getComputedStyle(gd).display;
        if (!display || display === 'none') {{
            console.log([gd, 'removed!']);
            Plotly.purge(gd);
            observer.disconnect();
        }}
}});

// Listen for the removal of the full notebook cells
var notebookContainer = gd.closest('#notebook-container');
if (notebookContainer) {{
    x.observe(notebookContainer, {childList: true});
}}

// Listen for the clearing of the current output cell
var outputEl = gd.closest('.output');
if (outputEl) {{
    x.observe(outputEl, {childList: true});
}}

                        })                };                });            </script>        </div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-region-breakdown-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;5: A breakdown of 2023 global game revenue and player counts by region.
</figcaption>
</figure>
</div>
</div>
<p>North America represents only ~7% of the 3.3 billion people who play games globally; however, it is responsible for nearly 30% of the revenue (Figure&nbsp;5). I will focus mainly on the North American market due to my familiarity with it. The phenomena I discuss will generalize to some extent globally but may not map precisely to other geographies.</p>
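<p>As a back-of-the-envelope illustration of this gap, the revenue and player figures charted above can be combined into an approximate revenue per player. This is my own rough sketch (variable names are mine, and it ignores within-region differences):</p>

```python
# Rough 2023 revenue per player by region, derived from the figures above.
revenue_busd = {"Asia-Pacific": 84.1, "North America": 50.6, "Europe": 33.6,
                "Latin America": 8.7, "Middle East/Africa": 7.1}   # $ billions
players_m = {"Asia-Pacific": 1800, "North America": 237, "Europe": 447,
             "Latin America": 335, "Middle East/Africa": 574}      # millions

for region in revenue_busd:
    per_player = revenue_busd[region] * 1e9 / (players_m[region] * 1e6)
    print(f"{region}: ~${per_player:.0f} per player")
```

<p>North America works out to roughly $210+ per player versus under $50 in Asia-Pacific, which is the asymmetry described above.</p>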
<p>To learn more, I recommend reading my sources in full:</p>
<ul>
<li>For visualized US data based on a 2023 YouGov poll, see <span class="citation" data-cites="esa_2024_2024">(ESA 2024)</span>.</li>
<li>⭐ For a comprehensive, global perspective on video games in 2024, with case studies, see <span class="citation" data-cites="eriksen_state_2024">(Eriksen 2024)</span>.</li>
</ul>
</section>
<section id="game-genres" class="level2">
<h2 class="anchored" data-anchor-id="game-genres">🧟 Game Genres</h2>
<p>Games come in great variety. Below is a list of common genres and representative titles:</p>
<ul>
<li><strong>Action Adventure</strong> (e.g., Assassin’s Creed, God of War, Legend of Zelda)
<ul>
<li><strong>Hack and Slash RPG</strong> (e.g., Hades, Children of Morta, Moonlighter)</li>
<li><strong>Single-player RPG</strong> (e.g., Skyrim, Final Fantasy VII, Mass Effect)</li>
</ul></li>
<li><strong>MOBA</strong> (e.g., League of Legends, Dota 2, Arena of Valor, Wild Rift)</li>
<li><strong>MMORPG</strong> (e.g., World of Warcraft, Guild Wars, EVE Online, Black Desert Mobile)</li>
<li><strong>Shooter</strong> (e.g., Overwatch, Call of Duty, Counter-Strike, VALORANT)
<ul>
<li><strong>Battle Royale</strong> (e.g., Fortnite Battle Royale, PUBG, Call of Duty Warzone)</li>
</ul></li>
<li><strong>2D Platformer</strong> (e.g., Mario, Metroid, Castlevania, Hollow Knight, Dead Cells, Ori and the Will of the Wisps, Cuphead, Blasphemous, Celeste)</li>
<li><strong>Auto-Battler</strong> (e.g., Teamfight Tactics, Dota Auto Chess)</li>
<li><strong>Fighting Games</strong> (e.g., Street Fighter, Super Smash Bros, Tekken)</li>
<li><strong>Lifestyle Role-playing / Simulation Games</strong> (e.g., Animal Crossing, The Sims, Stardew Valley)</li>
<li><strong>Collector</strong> (e.g., Genshin Impact, AFK Arena, Marvel Contest of Champions)</li>
<li><strong>Collectible Card Games</strong> (e.g., Hearthstone, Magic: The Gathering Online, Legends of Runeterra)</li>
<li><strong>Endless Runner</strong> (e.g., Subway Surfers, Temple Run, Crash Bandicoot: On the Run, Jetpack Joyride)</li>
<li><strong>Tower Defense</strong> (e.g., Plants vs.&nbsp;Zombies, Bloons TD, Kingdom Rush)</li>
<li><strong>Adventure/Puzzle</strong> (e.g., The Outer Wilds, Firewatch, The Talos Principle, Journey)</li>
</ul>
<p>Naturally, some genres tend to be more popular than others: action RPGs and shooters dominate on PC/console, while puzzle games dominate on mobile.</p>
<div id="fig-top-us-2023" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-top-us-2023-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="table-responsive">
<table class="table-striped table-hover caption-top table">
<caption>2023 Top Grossing Games</caption>
<colgroup>
<col style="width: 4%">
<col style="width: 28%">
<col style="width: 25%">
<col style="width: 19%">
<col style="width: 21%">
</colgroup>
<thead>
<tr class="header">
<th>Rank</th>
<th>Top Grossing Console &amp; PC Full Game</th>
<th>Genre</th>
<th>Top Grossing Mobile Game</th>
<th>Genre</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>Hogwarts Legacy</td>
<td><strong>Action RPG</strong></td>
<td>MONOPOLY GO!</td>
<td>Board Game</td>
</tr>
<tr class="even">
<td>2</td>
<td>Call of Duty: Modern Warfare 3</td>
<td>First-Person Shooter</td>
<td>Candy Crush Saga</td>
<td><strong>Puzzle</strong></td>
</tr>
<tr class="odd">
<td>3</td>
<td>Madden NFL 24</td>
<td>Sports</td>
<td>Roblox</td>
<td>Sandbox/Adventure</td>
</tr>
<tr class="even">
<td>4</td>
<td>Marvel’s Spider-Man 2</td>
<td><strong>Action-Adventure</strong></td>
<td>Royal Match</td>
<td><strong>Puzzle</strong></td>
</tr>
<tr class="odd">
<td>5</td>
<td>The Legend of Zelda: Tears of the Kingdom</td>
<td><strong>Action-Adventure</strong></td>
<td>Coin Master</td>
<td>Casual/Card</td>
</tr>
<tr class="even">
<td>6</td>
<td>Diablo IV</td>
<td><strong>Action RPG</strong></td>
<td>Pokémon GO</td>
<td>Augmented Reality/Adventure</td>
</tr>
<tr class="odd">
<td>7</td>
<td>Call of Duty: Modern Warfare 2</td>
<td>First-Person Shooter</td>
<td>Gardenscapes</td>
<td><strong>Puzzle</strong></td>
</tr>
<tr class="even">
<td>8</td>
<td>Mortal Kombat 1</td>
<td>Fighting</td>
<td>Jackpot Party – Casino Slots</td>
<td>Casino</td>
</tr>
<tr class="odd">
<td>9</td>
<td>Star Wars: Jedi Survivor</td>
<td><strong>Action-Adventure</strong></td>
<td>Township</td>
<td>Simulation/City-Building</td>
</tr>
<tr class="even">
<td>10</td>
<td>EA Sports FC 24</td>
<td>Sports</td>
<td>Evony</td>
<td>Strategy</td>
</tr>
</tbody>
</table>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-top-us-2023-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;6: <span class="citation" data-cites="esa_2024_2024">ESA (2024)</span> lists the highest-grossing games in the United States in 2023. The most frequent genres are bolded.
</figcaption>
</figure>
</div>
</section>
<section id="developers" class="level2">
<h2 class="anchored" data-anchor-id="developers">🛖 Developers</h2>
<p>The video game industry consists of a mix of large, established companies and smaller independent developers:</p>
<ul>
<li><strong>Large Developers</strong>: Companies like Riot Games, Electronic Arts, and Ubisoft dominate the market with high-budget, AAA titles.</li>
<li><strong>Independent Developers</strong>: Indie developers are a significant force, producing smaller-budget games that often gain substantial followings.</li>
</ul>
<p>Naturally, success looks different for a bootstrapped or low-budget indie developer than for a large, well-funded developer. Think startup versus Google: for a startup, releasing a product that earns several million dollars in revenue is a massive success, while Google has <a href="https://killedbygoogle.com/">killed many products</a> with millions of users.</p>
</section>
<section id="business-models" class="level2">
<h2 class="anchored" data-anchor-id="business-models">💵 Business Models</h2>
<p>The video game industry employs various business models to generate revenue:</p>
<ul>
<li><strong>Premium Sales</strong>: One-time purchase of games.</li>
<li><strong>Free-to-Play (F2P)</strong>: Games are free to download, but revenue is generated through in-game purchases and microtransactions for virtual goods, skins, etc. While many such transactions are cosmetic, others are a “time-for-money” tradeoff allowing players to achieve the same ends through lots of playtime or spending money.</li>
<li><strong>Subscription Models</strong>: Some games charge recurring subscriptions, and services like Xbox Game Pass and PlayStation Now offer access to a library of games for a monthly fee.</li>
<li><strong>Advertising</strong>: Especially prevalent in mobile games, where ads are shown to players.</li>
</ul>



</section>
</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-eriksen_state_2024" class="csl-entry">
Eriksen, Kaare. 2024. <span>“The <span>State</span> of the <span>Video</span> <span>Games</span> <span>Industry</span>: <span>A</span> <span>Special</span> <span>Report</span>.”</span> <em>Variety</em>. <a href="https://variety.com/vip-special-reports/state-of-video-games-a-special-report-1235884124/">https://variety.com/vip-special-reports/state-of-video-games-a-special-report-1235884124/</a>.
</div>
<div id="ref-esa_2024_2024" class="csl-entry">
ESA. 2024. <span>“2024 <span>Essential</span> <span>Facts</span> <span>About</span> the <span>U</span>.<span>S</span>. <span>Video</span> <span>Game</span> <span>Industry</span>.”</span> <a href="https://www.theesa.com/resources/essential-facts-about-the-us-video-game-industry/2024-data/">https://www.theesa.com/resources/essential-facts-about-the-us-video-game-industry/2024-data/</a>.
</div>
<div id="ref-steamcharts_new_2024" class="csl-entry">
steamcharts. 2024. <span>“New <span>World</span> - <span>Steam</span> <span>Charts</span>.”</span> <a href="https://steamcharts.com/app/1063730#All">https://steamcharts.com/app/1063730#All</a>.
</div>
<div id="ref-walsh_unique_2022" class="csl-entry">
Walsh, Nicholas. 2022. <span>“The <span>Unique</span> <span>Architecture</span> Behind <span>Amazon</span> <span>Games</span>’ <span>Seamless</span> <span>MMO</span> <span>New</span> <span>World</span> <span></span> <span>AWS</span> for <span>Games</span> <span>Blog</span>.”</span> <a href="https://aws.amazon.com/blogs/gametech/the-unique-architecture-behind-amazon-games-seamless-mmo-new-world/">https://aws.amazon.com/blogs/gametech/the-unique-architecture-behind-amazon-games-seamless-mmo-new-world/</a>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Massively Multiplayer Online games, <a href="https://en.wikipedia.org/wiki/Massively_multiplayer_online_game">Wikipedia Entry</a>.↩︎</p></li>
<li id="fn2"><p><a href="https://unity.com/">Unity</a> and <a href="https://www.unrealengine.com/">Unreal</a> are the most popular engines. For more on game engines, see <a href="https://en.wikipedia.org/wiki/Game_engine">Wikipedia Entry</a>.↩︎</p></li>
<li id="fn3"><p>See <a href="https://aws.amazon.com/lumberyard">Amazon Lumberyard</a>. O3DE was announced as Lumberyard’s Apache-licensed successor in 2021.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-lekan2024" class="csl-entry quarto-appendix-citeas">
Lekan, Kasra. 2024. <span>“Why Game Design Is Hard.”</span> August 23,
2024. <a href="https://blog.kasralekan.com/ideas/game-design/">https://blog.kasralekan.com/ideas/game-design/</a>.
</div></div></section></div> ]]></description>
  <category>design</category>
  <category>market-overview</category>
  <guid>https://blog.kasralekan.com/ideas/game-design/</guid>
  <pubDate>Fri, 23 Aug 2024 04:00:00 GMT</pubDate>
  <media:content url="https://blog.kasralekan.com/ideas/game-design/maplestory-basilmarket-screen.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Language Model Performance Plateaus. What’s next?</title>
  <dc:creator>Kasra Lekan</dc:creator>
  <link>https://blog.kasralekan.com/ideas/lm-performance-plateau/</link>
  <description><![CDATA[ 
<div class="progress" id="progress">
    <div class="train">
        <div class="train-tail"></div>
        <div class="train-body" id="train-body"></div>
        <div class="train-head"></div>
    </div>
</div>




<p>Special credit to <span class="citation" data-cites="t3dotgg_ai_2024">t3dotgg (2024)</span> for the video that inspired me to write this post ahead of a planned post on the technical innovations behind smaller, more efficient language models.</p>
<hr>
<div class="callout callout-style-default callout-warning callout-titled" title="Technical Content Disclaimer">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Technical Content Disclaimer
</div>
</div>
<div class="callout-body-container callout-body">
<!-- https://quarto.org/docs/authoring/callouts.html -->
<p>This post is not a journal-level review: my research for it was intended to be informational and does not exhaust the search space. If you notice any key papers or references that I have missed, or if I have misinterpreted the findings of any reference, please let me know in the comments.</p>
</div>
</div>
<section id="language-model-performance-is-plateauing" class="level1">
<h1>Language Model Performance is Plateauing</h1>
<div id="fig-model-graph-performance" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-model-graph-performance-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="model_graph_performance.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="@maximelabonne_due_2024"><img src="https://blog.kasralekan.com/ideas/lm-performance-plateau/model_graph_performance.jpg" class="img-fluid figure-img"></a></p>
<figcaption><span class="citation" data-cites="maximelabonne_due_2024">@maximelabonne (2024)</span></figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-model-graph-performance-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
<p>Figure&nbsp;1 shows the consistent trend of improvement in open- and closed-source models. While I could spend this whole post writing about this graph alone, for now just notice the general trend of improvement in MMLU 5-shot benchmark performance over time.</p>
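<p>For readers unfamiliar with the benchmark, a 5-shot MMLU-style evaluation prepends five solved multiple-choice examples to each test question. A minimal sketch of the prompt assembly (exact formatting varies by evaluation harness; the questions below are placeholders, not real MMLU items):</p>

```python
# Sketch of 5-shot prompt assembly for an MMLU-style multiple-choice benchmark.
# The dev-set examples and question text are placeholders, not real MMLU items.

def format_example(question, choices, answer=None):
    """Render one question with lettered choices; append the answer for shots."""
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_5shot_prompt(dev_examples, test_question, test_choices):
    """Concatenate 5 solved dev examples, then the unsolved test question."""
    shots = [format_example(q, c, a) for q, c, a in dev_examples[:5]]
    shots.append(format_example(test_question, test_choices))
    return "\n\n".join(shots)

dev = [(f"Placeholder question {i}?", ["w", "x", "y", "z"], "A") for i in range(5)]
prompt = build_5shot_prompt(dev, "Test question?", ["p", "q", "r", "s"])
```

<p>The model is then scored on whether its next token (or highest-likelihood option letter) matches the held-out answer.</p>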
<section id="the-hype-is-alive" class="level2">
<h2 class="anchored" data-anchor-id="the-hype-is-alive">🎉 The Hype is Alive</h2>
<blockquote class="blockquote">
<p>20 months ago, “ChatGPT is a revolution, the most powerful model ever made,” and today, you can run a model more preferred than this literally on a toaster!🍞 <span class="citation" data-cites="schmid_hugging_2024">(Schmid 2024)</span></p>
</blockquote>
<p>The quote from HuggingFace Technical Lead Philipp Schmid references Gemma-2-9b-it which, as of August 2nd, ranked 47th on HuggingFace’s language model benchmark – higher than GPT-3.5-Turbo-0613. Gemma 2 includes four different models <span class="citation" data-cites="schmid_welcome_2024">(see Schmid et al. 2024 for full model details)</span>:</p>
<ol type="1">
<li>gemma-2-9b: Base 9B model.</li>
<li>gemma-2-9b-it: Instruction fine-tuned version of the base 9B model.</li>
<li>gemma-2-27b: Base 27B model.</li>
<li>gemma-2-27b-it: Instruction fine-tuned version of the base 27B model.</li>
</ol>
<p>Thus, a 9 billion parameter model bested a 175 billion parameter model. The astute reader will note that Schmid must have the most advanced toaster ever to run Gemma 2 🤣.</p>
</section>
<section id="moores-law-is-dead" class="level2">
<h2 class="anchored" data-anchor-id="moores-law-is-dead">📈 Moore’s Law is Dead?</h2>
<blockquote class="blockquote">
<p>The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. <span class="citation" data-cites="sutton_bitter_2019">(Sutton 2019)</span></p>
</blockquote>
<p>Sutton’s famous 2019 blog post<sup>1</sup> “The Bitter Lesson” <span class="citation" data-cites="sutton_bitter_2019">(Sutton 2019)</span> argued for the primacy of computational power<sup>2</sup> over hand-built expert knowledge in improving model performance over time. However, as we approach the physical limits of transistors on a chip, maintaining Moore’s law has become increasingly untenable.</p>
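<p>Moore’s law, and its generalization to falling cost per unit of computation, is just a compounding rule. A back-of-the-envelope sketch (the starting count of 1 and the two-year doubling period are the textbook idealization, not a measurement):</p>

```python
# Back-of-the-envelope Moore's-law projection: transistor counts roughly
# double every two years. Starting count and period are idealized figures.

def projected_growth(start_count, years, doubling_period_years=2.0):
    """Compound the starting count through `years` of doubling."""
    return start_count * 2 ** (years / doubling_period_years)

# Over 20 years, a 2-year doubling period implies 2**10 = 1024x growth.
growth = projected_growth(1.0, 20)
```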
<p>When I first read “The Bitter Lesson,” I thought Sutton argued that engineering work on models did not matter; I was wrong. The key lesson I now extract from my experience, the academic literature, and “The Bitter Lesson” is that model architectures must optimize their use of computation. For example, the introduction of self-attention mechanisms in Transformer models allowed each token to attend to every other token in the input sequence, leveraging parallel computation to process large amounts of data efficiently. Similarly, architectures like convolutional neural networks (CNNs) capitalize on the spatial structure of data, using shared weights and local connectivity to optimize computational efficiency and scalability. These architectures do not merely rely on hand-crafted features but instead exploit the raw power of computation to learn and generalize from massive datasets.</p>
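<p>The self-attention point can be made concrete with a minimal single-head sketch (learned query/key/value projections and masking omitted): every token’s output is a weighted mix over every token in the sequence, and the whole thing reduces to matrix products that parallelize well on modern hardware.</p>

```python
# Minimal single-head self-attention, omitting learned Q/K/V projections and
# masking: each of the n tokens attends to all n tokens at once.
import numpy as np

def self_attention(x):
    """x: (n_tokens, d_model). Each output row is a softmax-weighted
    combination of every input token's vector."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.default_rng(0).normal(size=(6, 8))  # 6 tokens, 8 dims
out = self_attention(x)
```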
<p>The lesson here <strong><em>seems</em></strong> clear: the architectures that thrive are those that best adapt to and utilize the growing computational resources available.</p>
</section>
<section id="limitations-on-available-data" class="level2">
<h2 class="anchored" data-anchor-id="limitations-on-available-data">💾 Limitations on Available Data</h2>
<blockquote class="blockquote">
<p>The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. <span class="citation" data-cites="muennighoff_scaling_2023">(Muennighoff et al. 2023)</span></p>
</blockquote>
<p>Architectures need to be efficient because of the massive amount of data used during training. Several papers, including <span class="citation" data-cites="muennighoff_scaling_2023">Muennighoff et al. (2023)</span>, note that we are reaching the limits of available human-generated data. Thus, other authors have investigated training on AI-generated data <span class="citation" data-cites="shumailov_curse_2023">(Shumailov et al. 2023)</span>, with mixed results.</p>
<p>If we cannot solve the data bottleneck, relying on more computation will not close the performance gap between state-of-the-art models and general intelligence.</p>
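<p>To see why data becomes the binding constraint, consider the compute-optimal rule of thumb of roughly 20 training tokens per parameter from the Chinchilla scaling work. A back-of-the-envelope sketch (the stock-of-text figure below is an illustrative assumption, not a measurement):</p>

```python
# Rough illustration of the data bottleneck: compute-optimal training wants
# ~20 tokens per parameter (Chinchilla rule of thumb). The web-text stock
# below is an illustrative assumption, not a measured figure.

TOKENS_PER_PARAM = 20

def tokens_needed(n_params):
    """Compute-optimal training tokens for a model of n_params parameters."""
    return TOKENS_PER_PARAM * n_params

assumed_text_stock = 500e12   # hypothetical: 500 trillion usable tokens

# A 10-trillion-parameter model would already want 200T tokens --
# a large fraction of even a generous estimate of available text.
demand = tokens_needed(10e12)
fraction = demand / assumed_text_stock
```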
</section>
</section>
<section id="the-future-of-language-models" class="level1">
<h1>The Future of Language Models</h1>
<section id="a-focus-on-efficient-performance" class="level2">
<h2 class="anchored" data-anchor-id="a-focus-on-efficient-performance">⚙️ A Focus on Efficient Performance</h2>
<p>As Gemma 2 (and OpenAI’s Turbo models before it) illustrates, there is an increasing focus on efficiency and inference speed as language model performance begins to plateau. Similarly, Mistral Large 2, the second generation of the startup’s flagship model, was announced with a post entitled “Large Enough.” The model is designed to “push the boundaries of cost efficiency, speed, and performance” <span class="citation" data-cites="mistralai_large_2024">(MistralAI 2024)</span>.</p>
<p>I am not saying this is a bad push. I intend to devote an entire post to the technical innovations behind the gains in efficiency and inference speed we have seen in these models. However, I do not believe these innovations will define the future of AI models.</p>
</section>
<section id="a-paradigm-shift-is-needed" class="level2">
<h2 class="anchored" data-anchor-id="a-paradigm-shift-is-needed">🛝 A Paradigm Shift is Needed</h2>
<p>As I see it, language models have two high-level flaws:</p>
<ol type="1">
<li>A coupling of knowledge and reasoning<sup>3</sup> capabilities.
<ul>
<li>This is most similar to the issues that the engineers behind models like Gemma 2 seek to address. It’s difficult to train a small model with high-level reasoning capabilities when its weights have to hold so much information in them.</li>
<li>There is a growing literature on grounding language models with “World Models.”</li>
</ul></li>
<li>There is no concept of thinking deeply, i.e.&nbsp;more inference compute doesn’t get you a better answer.
<ul>
<li>Even if we stop improving our chips, we are building far more of them than ever before. Thus, a model whose reasoning capabilities grew with inference compute could answer difficult questions given enough resources.</li>
</ul></li>
</ol>
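<p>One simple way to make answer quality grow with inference compute is repeated sampling with majority voting. A toy sketch with a simulated noisy “model” (a probabilistic stand-in, not any real system):</p>

```python
# Toy illustration of trading inference compute for accuracy: sample a
# noisy "model" N times and majority-vote. The simulated model is a
# probabilistic stand-in, not a real LLM call.
import random
from collections import Counter

def noisy_model(rng, correct="42", p_correct=0.6):
    """Stand-in for one sampled answer: right 60% of the time."""
    return correct if rng.random() < p_correct else rng.choice(["41", "43"])

def majority_vote_answer(rng, n_samples):
    """Spend more compute (more samples) and keep the most common answer."""
    votes = Counter(noisy_model(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
single = noisy_model(rng)                  # one sample: unreliable
voted = majority_vote_answer(rng, 1001)    # 1001 samples: almost surely "42"
```

<p>With a 60%-accurate sampler, 1001 votes make the correct plurality answer essentially certain; accuracy improves as a direct function of how much inference compute you spend.</p>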
<p>I believe that to achieve the next level of AI-based intelligence, a new approach that addresses at least one of these flaws is needed.</p>
<div id="fig-arc" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-arc-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="ai-benchmarks-arc.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="@arc_prize_inc_arc_2024 is a public competition set up to encourage researchers to focus on achieving General Intelligence with a new Arc-AGI benchmark."><img src="https://blog.kasralekan.com/ideas/lm-performance-plateau/ai-benchmarks-arc.png" class="img-fluid figure-img"></a></p>
<figcaption><span class="citation" data-cites="arc_prize_inc_arc_2024">ARC Prize (2024)</span> is a public competition set up to encourage researchers to focus on achieving General Intelligence with a new Arc-AGI benchmark.</figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-arc-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2
</figcaption>
</figure>
</div>
<p>Much of the discussion of performance in this post has been based on benchmarks, a fascinating topic in its own right. Benchmarks have been instrumental in the advancement of NLP all the way back to the <a href="https://gluebenchmark.com/">GLUE benchmark</a> <span class="citation" data-cites="wang_glue_2018">(Wang et al. 2018)</span>; however, they can cause research to become myopically focused <span class="citation" data-cites="gehrmann_gem_2021">(Gehrmann et al. 2021)</span> or mischaracterize the rate of progress, as demonstrated by <span class="citation" data-cites="schaeffer_are_2023">Schaeffer, Miranda, and Koyejo (2023)</span>, which won an Outstanding Paper award at NeurIPS 2023. Figure&nbsp;2 demonstrates one potential flaw, courtesy of the team at the ARC Prize.</p>
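<p>The core argument from Schaeffer, Miranda, and Koyejo (2023) can be illustrated with a toy calculation: if per-token accuracy p improves smoothly with scale, an all-or-nothing exact-match metric over a k-token answer scores roughly p<sup>k</sup>, which looks like a sudden “emergent” jump. The numbers below are illustrative, not from the paper:</p>

```python
# Toy version of the "emergence is a mirage" argument: a smooth per-token
# accuracy p becomes the sharp-looking score p**k when the metric requires
# all k tokens of an answer to be right. Values are illustrative.

def exact_match_score(per_token_accuracy, answer_length):
    """All-or-nothing metric: every one of `answer_length` tokens must be right."""
    return per_token_accuracy ** answer_length

# Per-token accuracy improving smoothly from 50% to 95% across model scales...
smooth = [0.50, 0.65, 0.80, 0.95]
# ...turns into a hockey-stick on a 20-token exact-match metric.
sharp = [exact_match_score(p, 20) for p in smooth]
```

<p>The underlying capability improves gradually, but the chosen metric makes the largest model look like it crossed a discontinuous threshold.</p>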
</section>
<section id="a-final-word-from-yann-lecun" class="level2">
<h2 class="anchored" data-anchor-id="a-final-word-from-yann-lecun">🍒 A Final Word from Yann LeCun</h2>
<div id="fig-lecun" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-lecun-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="cherryRL.webp" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="@lecun_predictive_2016 famously articulated on the NIPS stage in 2016 that Self-Supervised Learning with lots of data was the foundation of future models."><img src="https://blog.kasralekan.com/ideas/lm-performance-plateau/cherryRL.webp" class="img-fluid figure-img"></a></p>
<figcaption><span class="citation" data-cites="lecun_predictive_2016">Lecun (2016)</span> famously articulated on the NIPS stage in 2016 that Self-Supervised Learning with lots of data was the foundation of future models.</figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-lecun-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3
</figcaption>
</figure>
</div>
<blockquote class="blockquote">
<p>If you are a student interested in building the next generation of AI systems, don’t work on LLMs <span class="citation" data-cites="yann_lecun_ylecun_if_2024">(Yann LeCun [@ylecun] 2024)</span></p>
</blockquote>
<p>Yann LeCun, a foundational deep learning researcher, has made famous pronouncements in the past, as in Figure&nbsp;3. Recently, he has argued that LLMs are not the solution to the intelligence problem, despite their genuinely impressive performance.</p>



</section>
</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-arc_prize_inc_arc_2024" class="csl-entry">
ARC Prize, Inc. 2024. <span>“<span>ARC</span> <span>Prize</span>.”</span> <em>ARC Prize</em>. <a href="https://arcprize.org/">https://arcprize.org/</a>.
</div>
<div id="ref-gehrmann_gem_2021" class="csl-entry">
Gehrmann, Sebastian, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, et al. 2021. <span>“The <span>GEM</span> <span>Benchmark</span>: <span>Natural</span> <span>Language</span> <span>Generation</span>, Its <span>Evaluation</span> and <span>Metrics</span>.”</span> arXiv. <a href="http://arxiv.org/abs/2102.01672">http://arxiv.org/abs/2102.01672</a>.
</div>
<div id="ref-gudibande_false_2023" class="csl-entry">
Gudibande, Arnav, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. <span>“The <span>False</span> <span>Promise</span> of <span>Imitating</span> <span>Proprietary</span> <span>LLMs</span>.”</span> arXiv. <a href="http://arxiv.org/abs/2305.15717">http://arxiv.org/abs/2305.15717</a>.
</div>
<div id="ref-lecun_predictive_2016" class="csl-entry">
Lecun, Yann. 2016. <span>“Predictive <span>Learning</span>, <span>NIPS</span> 2016 <span></span> <span>Yann</span> <span>LeCun</span>, <span>Facebook</span> <span>Research</span>.”</span> <a href="https://www.youtube.com/watch?v=Ount2Y4qxQo">https://www.youtube.com/watch?v=Ount2Y4qxQo</a>.
</div>
<div id="ref-maximelabonne_due_2024" class="csl-entry">
@maximelabonne. 2024. <span>“Due to Popular Demand, <span>I</span>’ve Updated This Figure to Include <span>DeepSeek</span>-<span>V2</span> and <span>Mistral</span> <span>Large</span> 2. <span>It</span>’s Also More Zoomed for Readability. Https://t.co/<span class="nocase">jWEpxH9zgO</span>.”</span> Tweet. <em>Twitter</em>. <a href="https://x.com/maximelabonne/status/1816416043511808259/photo/1">https://x.com/maximelabonne/status/1816416043511808259/photo/1</a>.
</div>
<div id="ref-mistralai_large_2024" class="csl-entry">
MistralAI. 2024. <span>“Large <span>Enough</span>.”</span> <a href="https://mistral.ai/news/mistral-large-2407/">https://mistral.ai/news/mistral-large-2407/</a>.
</div>
<div id="ref-muennighoff_scaling_2023" class="csl-entry">
Muennighoff, Niklas, Alexander M. Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. 2023. <span>“Scaling <span>Data</span>-<span>Constrained</span> <span>Language</span> <span>Models</span>.”</span> In. <a href="https://openreview.net/forum?id=j5BuTrEj35">https://openreview.net/forum?id=j5BuTrEj35</a>.
</div>
<div id="ref-roser_what_2024" class="csl-entry">
Roser, Max, Hannah Ritchie, and Edouard Mathieu. 2024. <span>“What Is <span>Moore</span>’s <span>Law</span>?”</span> <em>Our World in Data</em>, February. <a href="https://ourworldindata.org/moores-law">https://ourworldindata.org/moores-law</a>.
</div>
<div id="ref-sauerwein_reflections_2024" class="csl-entry">
Sauerwein, David. 2024. <span>“Reflections on <span>The</span> <span>Bitter</span> <span>Lesson</span> <span></span> <span>LinkedIn</span>.”</span> <a href="https://www.linkedin.com/posts/davidsauerwein_ai-machinelearning-compute-activity-7215818757405888512-ogMw/">https://www.linkedin.com/posts/davidsauerwein_ai-machinelearning-compute-activity-7215818757405888512-ogMw/</a>.
</div>
<div id="ref-schaeffer_are_2023" class="csl-entry">
Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. 2023. <span>“Are <span>Emergent</span> <span>Abilities</span> of <span>Large</span> <span>Language</span> <span>Models</span> a <span>Mirage</span>?”</span> <em>Advances in Neural Information Processing Systems</em> 36 (December): 55565–81. <a href="https://proceedings.neurips.cc/paper_files/paper/2023/hash/adc98a266f45005c403b8311ca7e8bd7-Abstract-Conference.html">https://proceedings.neurips.cc/paper_files/paper/2023/hash/adc98a266f45005c403b8311ca7e8bd7-Abstract-Conference.html</a>.
</div>
<div id="ref-schmid_hugging_2024" class="csl-entry">
Schmid, Philipp. 2024. <span>“Hugging <span>Face</span> <span>Post</span> - <span>LinkedIn</span>.”</span> <a href="https://www.linkedin.com/posts/philipp-schmid-a6a2bb196_absolutely-wild-google-deepmindgemma-activity-7224466612043620352-dcgq/">https://www.linkedin.com/posts/philipp-schmid-a6a2bb196_absolutely-wild-google-deepmindgemma-activity-7224466612043620352-dcgq/</a>.
</div>
<div id="ref-schmid_welcome_2024" class="csl-entry">
Schmid, Philipp, Omar Sanseviero, Pedro Cuenca, Lewis Tunstall, Tom Aarsen, and Vaibhav Srivastav. 2024. <span>“Welcome <span>Gemma</span> 2 - <span>Google</span>’s New Open <span>LLM</span>.”</span> <a href="https://huggingface.co/blog/gemma2">https://huggingface.co/blog/gemma2</a>.
</div>
<div id="ref-shumailov_curse_2023" class="csl-entry">
Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. <span>“The <span>Curse</span> of <span>Recursion</span>: <span>Training</span> on <span>Generated</span> <span>Data</span> <span>Makes</span> <span>Models</span> <span>Forget</span>.”</span> arXiv. <a href="http://arxiv.org/abs/2305.17493">http://arxiv.org/abs/2305.17493</a>.
</div>
<div id="ref-sutton_bitter_2019" class="csl-entry">
Sutton, Rich. 2019. <span>“The <span>Bitter</span> <span>Lesson</span>.”</span> <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">http://www.incompleteideas.net/IncIdeas/BitterLesson.html</a>.
</div>
<div id="ref-t3dotgg_ai_2024" class="csl-entry">
t3dotgg. 2024. <span>“<span>AI</span> Isn’t Gonna Keep Improving.”</span> <a href="https://www.youtube.com/watch?v=Y8Ym7hMR100">https://www.youtube.com/watch?v=Y8Ym7hMR100</a>.
</div>
<div id="ref-valmeekam_planning_2023" class="csl-entry">
Valmeekam, Karthik, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. <span>“On the <span>Planning</span> <span>Abilities</span> of <span>Large</span> <span>Language</span> <span>Models</span> : <span>A</span> <span>Critical</span> <span>Investigation</span>.”</span> arXiv. <a href="http://arxiv.org/abs/2305.15771">http://arxiv.org/abs/2305.15771</a>.
</div>
<div id="ref-wang_glue_2018" class="csl-entry">
Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. <span>“<span>GLUE</span>: <span>A</span> <span>Multi</span>-<span>Task</span> <span>Benchmark</span> and <span>Analysis</span> <span>Platform</span> for <span>Natural</span> <span>Language</span> <span>Understanding</span>.”</span> In <em>Proceedings of the 2018 <span>EMNLP</span> <span>Workshop</span> <span>BlackboxNLP</span>: <span>Analyzing</span> and <span>Interpreting</span> <span>Neural</span> <span>Networks</span> for <span>NLP</span></em>, 353–55. Brussels, Belgium: Association for Computational Linguistics. <a href="https://doi.org/10.18653/v1/W18-5446">https://doi.org/10.18653/v1/W18-5446</a>.
</div>
<div id="ref-yang_neurips_2023" class="csl-entry">
Yang, Sherry, Ofir Nachum, Yilun Du, Stephen McAleer, Igor Mordatch, Linxi Fan, Jeannette Bohg, and Dale Schuurmans. 2023. <span>“<span>NeurIPS</span> <span>Workshop</span> · <span>Foundation</span> <span>Models</span> for <span>Decision</span> <span>Making</span>.”</span> <em>SlidesLive</em>. <a href="https://neurips.cc/virtual/2023/workshop/66525">https://neurips.cc/virtual/2023/workshop/66525</a>.
</div>
<div id="ref-yann_lecun_ylecun_if_2024" class="csl-entry">
Yann LeCun [@ylecun]. 2024. <span>“If You Are a Student Interested in Building the Next Generation of <span>AI</span> Systems, Don’t Work on <span>LLMs</span>.”</span> Tweet. <em>Twitter</em>. <a href="https://x.com/ylecun/status/1793326904692428907">https://x.com/ylecun/status/1793326904692428907</a>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p><span class="citation" data-cites="sauerwein_reflections_2024">Sauerwein (2024)</span> has a good post covering the summary of “The Bitter Lesson” and related responses.↩︎</p></li>
<li id="fn2"><p>Moore’s law observes that the number of transistors in an integrated circuit roughly doubles every two years. <span class="citation" data-cites="roser_what_2024">Roser, Ritchie, and Mathieu (2024)</span> have a great post visualizing it.↩︎</p></li>
<li id="fn3"><p>“Reasoning” is a tricky term to nail down. Here I am referring to current benchmarks rather than more advanced tasks like planning. Planning is an extremely interesting capability that is essential for language-model-based agents. Check out the NeurIPS 2023 workshop on “Foundation Models for Decision Making” for a taste of this research <span class="citation" data-cites="yang_neurips_2023">(Yang et al. 2023)</span> as well as <span class="citation" data-cites="valmeekam_planning_2023">Valmeekam et al. (2023)</span>.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-lekan2024" class="csl-entry quarto-appendix-citeas">
Lekan, Kasra. 2024. <span>“Language Model Performance Plateaus. What’s
Next?”</span> August 11, 2024. <a href="https://blog.kasralekan.com/ideas/lm-performance-plateau/">https://blog.kasralekan.com/ideas/lm-performance-plateau/</a>.
</div></div></section></div> ]]></description>
  <category>[![](https://img.shields.io/endpoint?url=https%3A%2F%2Fhits.dwyl.com%2Fanrath%2Fblog_llm_plateau_final.json&amp;show=unique&amp;style=flat-square&amp;label=Views&amp;color=orange)]()</category>
  <category>NLP</category>
  <category>LLM</category>
  <category>Lit. Review</category>
  <guid>https://blog.kasralekan.com/ideas/lm-performance-plateau/</guid>
  <pubDate>Sun, 11 Aug 2024 04:00:00 GMT</pubDate>
  <media:content url="https://blog.kasralekan.com/ideas/lm-performance-plateau/model_graph_performance.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Website Redesign</title>
  <dc:creator>Kasra Lekan</dc:creator>
  <link>https://blog.kasralekan.com/ideas/website-revamp/</link>
  <description><![CDATA[ 
<div class="progress" id="progress">
    <div class="train">
        <div class="train-tail"></div>
        <div class="train-body" id="train-body"></div>
        <div class="train-head"></div>
    </div>
</div>




<p><a href="webDalle.webp" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://blog.kasralekan.com/ideas/website-revamp/webDalle.webp" class="img-fluid" style="width:90.0%" alt="Web design graphic"></a></p>
<section id="context" class="level1">
<h1>🕰️ Context</h1>
<p>Like other programmers with some experience in website development, I have toyed around with the design of my website over time. My first website was on <a href="https://wordpress.com/">WordPress</a> before I learned to code, the second was built using a React template, and the third was a Hugo-based website using <a href="https://hugoblox.com/">HugoBlox</a>. The transition from the second to the third website was motivated by the tech debt of the old React template as well as a desire for a simple website that supported me during my Master’s degree, with features for my research publications. The third website was always planned as a temporary solution until I could focus on a redesign from the ground up.</p>
</section>
<section id="my-design-process" class="level1">
<h1>⚒️ My Design Process</h1>
<p>My design process had six steps:</p>
<ol type="1">
<li>Design Search</li>
<li>Condense Learnings</li>
<li>Determine My Content (data)</li>
<li>Data-informed Design</li>
<li>Coding</li>
<li>User Testing (+ repeat steps as needed)</li>
</ol>
<p>In the following section, I detail steps 1 and 2. When reviewing my content, I realized (1) I have limited required copy and content that the user needs to see (mostly my work experience, published work, and projects) and (2) I wanted to support my writing with academic-style citations. After searching, I decided to separate my writing into its own website using the Quarto framework <span class="citation" data-cites="noauthor_quarto_2024">(<span>“Quarto”</span> 2024)</span>. This separation makes writing easier, but more importantly it allows me to refactor my base website without worrying about backwards compatibility for my writing. Quarto is quite expressive and customizable for a markdown-native framework, which has allowed me to add a number of extra features to this website, including <a href="https://github.com/quarto-dev/quarto-cli/discussions/5895#discussioncomment-10259127">reference backrefs</a>, a styled progress bar, view counters, and use of the <a href="https://developer.mozilla.org/en-US/docs/Web/API/View_Transitions_API">view transitions</a> API<sup>1</sup>.</p>
</section>
<section id="web-design-learnings" class="level1">
<h1>🪞 Web Design Learnings</h1>
<section id="successful-examples-categorized" class="level2">
<h2 class="anchored" data-anchor-id="successful-examples-categorized">Successful Examples Categorized</h2>
<div class="callout callout-style-default callout-warning callout-titled" title="Websites May Change">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Websites May Change
</div>
</div>
<div class="callout-body-container callout-body">
<!-- https://quarto.org/docs/authoring/callouts.html -->
<p>These websites may be redesigned by the time you are reading this.</p>
</div>
</div>
<section id="committing-to-an-aesthetic" class="level3">
<h3 class="anchored" data-anchor-id="committing-to-an-aesthetic">Committing to an aesthetic</h3>
<p>Examples include:</p>
<ul>
<li><strong>Minimal (mostly dark)</strong>
<ul>
<li><a href="https://www.ethanchng.com/">ethanchng.com</a></li>
<li><a href="https://antfu.me/">antfu.me</a></li>
<li><a href="https://brianlovin.com/">brianlovin.com</a></li>
</ul></li>
<li><strong>3D with <a href="https://threejs.org/">three.js</a></strong>
<ul>
<li><a href="https://itssharl.ee/">itssharl.ee</a> – brilliant use of animations and using the same 3D shapes on hover animations for desktop</li>
<li><a href="https://bruno-simon.com/">bruno-simon.com</a> – cool factor achieved</li>
<li><a href="https://jesse-zhou.com/">jesse-zhou.com</a> – extremely detailed design</li>
<li><a href="https://www.edwardh.io/">edwardh.io</a> – fun integration of 3D in a normal portfolio website</li>
</ul></li>
<li><strong>Bright and Curvy</strong>
<ul>
<li><a href="https://www.amysboyd.com/">amysboyd.com</a></li>
<li><a href="https://www.seanhalpin.xyz/">seanhalpin.xyz</a></li>
</ul></li>
<li><strong>Gallery (lots of pictures)</strong>
<ul>
<li><a href="https://cydstumpel.nl/">cydstumpel.nl</a></li>
<li><a href="https://nicolasloureiro.com/">nicolasloureiro.com</a></li>
</ul></li>
<li><strong>Other Aesthetic Examples</strong>
<ul>
<li><em>Coder</em>
<ul>
<li><a href="https://tamalsen.dev/">tamalsen.dev</a></li>
<li><a href="https://vscode-portfolio.vercel.app/">vscode-portfolio</a></li>
</ul></li>
<li><em>8-bit</em>
<ul>
<li><a href="https://thegeekdesigner.com/">thegeekdesigner.com</a> – great copy as well as brilliantly designed</li>
<li><a href="https://expensive.toys/404">expensive.toys</a> – a fun 404 page</li>
</ul></li>
</ul></li>
</ul>
</section>
<section id="animations-interaction" class="level3">
<h3 class="anchored" data-anchor-id="animations-interaction">Animations &amp; Interaction</h3>
<p>Examples include:</p>
<ul>
<li><em>Loading animation</em>
<ul>
<li><a href="https://patrickheng.com/">patrickheng.com</a></li>
</ul></li>
<li><em>Scroll animation</em>
<ul>
<li><a href="https://aimpie.design/">aimpie.design</a></li>
<li><a href="https://cherupil.com/">cherupil.com</a></li>
</ul></li>
<li><em>Cursor following animations, clickable items, etc.</em>
<ul>
<li><a href="https://minhpham.design/">minhpham.design/</a> – hilarious hidden copy on hover</li>
</ul></li>
</ul>
</section>
</section>
<section id="my-key-design-considerations" class="level2">
<h2 class="anchored" data-anchor-id="my-key-design-considerations">My Key Design Considerations</h2>
<ul>
<li><strong>Wow Factor</strong>: This is one of those <em>you know it when you see it</em> traits. Having reflected on all the websites above and many more, I believe that the <em>Wow Factor</em> is based on (1) Animations / Interactivity and (2) Design Cohesion. Cohesion for some designs is relatively easy, e.g.&nbsp;a minimalistic design is naturally cohesive due to the relatively few design elements being implemented. Ultimately, cohesion also depends on decisions about Typography, Colors, and Animations.</li>
<li><strong>Reducing Cognitive Load</strong>: Some websites are awe-inspiring but make it hard to get to the raw data or have scrolling animations that make my fingers hurt due to the amount of time it takes to traverse the page. A brilliant example of a low-cognitive-load website that is also impressive is <a href="https://gkoberger.com/">gkoberger.com</a> which uses clickable objects in the hero div and related animations to present a simple, yet information-dense website. This website served as a key inspiration for my design.</li>
<li><strong>Who is the Audience</strong>: Many excellent websites struggle to deliver a comparable experience on mobile devices. Mobile has inherent limitations, but at a minimum the same information should be accessible to the user. This is just one example of developing for different audiences, who either view the website differently or seek different information from it.</li>
</ul>
<p>The tension between generating a Wow Factor, keeping the user’s cognitive load low, and serving the different audiences viewing my website makes the design process challenging.</p>
<p><strong><em>My objectives</em></strong> to balance these considerations were:</p>
<ol type="1">
<li><ins>
Commit
</ins>
to a unique aesthetic for my website, e.g.&nbsp;making it look like a Google/Apple Maps page.</li>
<li>Ensure all <ins>high-priority information is accessible within 5 seconds</ins> of loading the website (ideally all the information is within the first page on desktop).
<ul>
<li>Subgoal: Be <ins>information efficient</ins>: tighten up copy, provide smaller bites of information with clickable elements for users to learn more when interested, and use visuals to display information when possible.</li>
</ul></li>
<li><ins>
Implement animations
</ins>
for Loading, Hovers, Clicking into sections, and (if applicable) Scrolling. Consider opportunities for interactivity.</li>
<li>Ensure all information is <ins>accessible on mobile</ins> and consider how to make mobile viewing as enjoyable as desktop viewing.</li>
</ol>
</section>
</section>
<section id="results" class="level1">
<h1>✔️ Results</h1>
<div id="fig-pie" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-pie-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="og.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="A taste of my new website."><img src="https://blog.kasralekan.com/ideas/website-revamp/og.png" class="img-fluid figure-img"></a></p>
<figcaption>A taste of my new website.</figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-pie-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
<p>View my website at <a href="https://kasralekan.com/">kasralekan.com</a>.</p>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-noauthor_quarto_2024" class="csl-entry">
<span>“Quarto.”</span> 2024. <em>Quarto</em>. <a href="https://quarto.org/">https://quarto.org/</a>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>This implementation of view transitions has been quite hacky and required much more refinement than anticipated due to the way Quarto injects content or executes JS based on the header yml elements in the different page files.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-lekan2024" class="csl-entry quarto-appendix-citeas">
Lekan, Kasra. 2024. <span>“Website Redesign.”</span> July 29, 2024. <a href="https://blog.kasralekan.com/ideas/website-revamp/">https://blog.kasralekan.com/ideas/website-revamp/</a>.
</div></div></section></div> ]]></description>
  <category>WebDev</category>
  <category>design</category>
  <category>creativity</category>
  <guid>https://blog.kasralekan.com/ideas/website-revamp/</guid>
  <pubDate>Mon, 29 Jul 2024 04:00:00 GMT</pubDate>
  <media:content url="https://blog.kasralekan.com/ideas/website-revamp/webDalle.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>Extending “Towards Monosemanticity”</title>
  <dc:creator>Kasra Lekan</dc:creator>
  <link>https://blog.kasralekan.com/ideas/towards-monosemanticity/</link>
  <description><![CDATA[ 
<div class="progress" id="progress">
    <div class="train">
        <div class="train-tail"></div>
        <div class="train-body" id="train-body"></div>
        <div class="train-head"></div>
    </div>
</div>




<section id="background" class="level1">
<h1>Background</h1>
<p>Based on <a href="https://transformer-circuits.pub/2023/monosemantic-features/index.html">Towards Monosemanticity: Decomposing Language Models With Dictionary Learning</a> <span class="citation" data-cites="bricken_towards_2023">(Bricken et al. 2023)</span> by Anthropic and <a href="https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html">Language models can explain neurons in language models</a> <span class="citation" data-cites="bills_language_2023">(Bills et al. 2023)</span> by OpenAI, I attempted to generate natural language explanations for the neurons in Distill-GPT-2 by projecting the final MLP output layer to a higher dimension (similar to dictionary learning) and then using a language model (gpt-4-turbo-2024-04-09) to generate natural language descriptions of each higher dimension from its activation values. The underlying theoretical foundation is the Superposition hypothesis, which, simply put, states that each neuron in a language model learns a complicated mix of concepts; for instance, a neuron may activate strongly on both Korean text and DNA sequences. Thus, by projecting MLP outputs to a higher dimension, we can attempt to create “features” that each represent an explainable concept.</p>
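<p>The projection step can be sketched as a standard ReLU sparse autoencoder over MLP activations. This is a minimal sketch under assumed conventions (the function and parameter names are mine, not from the original code), showing how the higher-dimensional features, the reconstruction, and the training loss relate:</p>

```python
import numpy as np

def sae_forward(acts, W_enc, b_enc, W_dec, b_dec, l1_coeff=3e-4):
    """One forward pass of a sparse autoencoder over MLP activations.

    acts: (batch, d_mlp) activations; W_enc: (d_mlp, d_hidden) with
    d_hidden = expansion * d_mlp (e.g. 32x). The L1 penalty pushes the
    feature activations toward sparsity, i.e. hopefully monosemanticity.
    """
    feats = np.maximum(0.0, (acts - b_dec) @ W_enc + b_enc)  # ReLU features
    recon = feats @ W_dec + b_dec                            # reconstruction
    mse = np.mean((recon - acts) ** 2)                       # reconstruction loss
    l1 = l1_coeff * np.abs(feats).sum(axis=-1).mean()        # sparsity penalty
    return feats, recon, mse + l1
```

<p>Interpretability then operates on the per-token values in <code>feats</code>: for each feature, the top-activating text snippets are what the language model is asked to explain.</p>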
</section>
<section id="challenges" class="level1">
<h1>Challenges</h1>
<ul>
<li>Reproducing Anthropic’s representation from the paper’s appendix
<ul>
<li>Huge thanks to Neel Nanda for his blog post and repo <span class="citation" data-cites="nanda_neelnanda-io1l-sparse-autoencoder_2024 nanda_neelnanda_transformerlensorgtransformerlens_2024">(Nanda 2024a, 2024b)</span>.</li>
</ul></li>
<li>Tuning hyperparameters</li>
<li>Loss degradation over longer training runs</li>
<li>Automated interpretability using <a href="https://github.com/openai/automated-interpretability">OpenAI’s implementation package</a>
<ul>
<li>API changes requiring code refactoring due to data parsing changes, or rewriting due to missing information</li>
<li>Poor responses from GPT-4, ultimately making automated interpretability impossible without adjusting the prompts</li>
</ul></li>
</ul>
</section>
<section id="observations" class="level1">
<h1>Observations</h1>
<section id="training-autoencoders-for-reconstruction" class="level2">
<h2 class="anchored" data-anchor-id="training-autoencoders-for-reconstruction">Training Autoencoders for Reconstruction</h2>
<p>The primary metric I used for MLP reconstruction efficacy was “reconstruction score”:</p>
<p><img src="https://latex.codecogs.com/png.latex?score%20=%20%5Cfrac%7Bzero%5C_abl%5C_loss%20-%20recons%5C_loss%7D%7Bzero%5C_abl%5C_loss%20-%20loss%7D"></p>
<p>I was able to reproduce Anthropic’s autoencoder on a single-layer transformer with a GELU activation, achieving a reconstruction score of ~94% with 2 billion training tokens (fewer than Anthropic’s run). With Distill-GPT-2, I was only able to achieve a reconstruction score of ~77% (with a 32x dictionary size). I observed that (1) training with more tokens did not significantly improve reconstruction scores, and (2) performance would deteriorate over the course of the training run after reaching an optimum, suggesting that a more sophisticated training strategy would be necessary when scaling up this approach.</p>
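<p>In code, the score above is a one-liner (a sketch; the argument names mirror the formula, not any particular library):</p>

```python
def reconstruction_score(loss, recons_loss, zero_abl_loss):
    """Fraction of the loss gap between zero-ablating the MLP output and
    running the clean model that is recovered when the autoencoder's
    reconstruction is spliced in: 1.0 means perfect reconstruction,
    0.0 means no better than zero-ablation."""
    return (zero_abl_loss - recons_loss) / (zero_abl_loss - loss)
```

<p>For example, with a clean loss of 2.0, a zero-ablation loss of 7.0, and a reconstruction loss of 2.5, the score is 0.9 (90%).</p>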
</section>
<section id="dictionary-size" class="level2">
<h2 class="anchored" data-anchor-id="dictionary-size">Dictionary Size</h2>
<p>Anthropic tested many dictionary sizes from 1x to 256x but focused on 8x for their primary findings. I hypothesized that a larger size would be optimal for a larger model, since it is trained on more tokens and learns more complex representations. I first trained a 32x dictionary and later trained a 128x dictionary. Training a larger dictionary was naturally more computationally intensive, with cost scaling linearly in the dictionary size, <img src="https://latex.codecogs.com/png.latex?O(n)">.</p>
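<p>The linear scaling is visible directly in the parameter count. This is a sketch (sae_param_count is a hypothetical helper), assuming one encoder and one decoder weight matrix plus a bias vector for each:</p>

```python
def sae_param_count(d_mlp: int, expansion: int) -> int:
    """Total parameters in a sparse autoencoder whose dictionary has
    expansion * d_mlp features: encoder (d_mlp x d_hidden) and decoder
    (d_hidden x d_mlp) weights plus the two bias vectors.
    Linear in the expansion factor, O(n)."""
    d_hidden = d_mlp * expansion
    return 2 * d_mlp * d_hidden + d_hidden + d_mlp
```

<p>Doubling the expansion factor therefore roughly doubles both the parameter count and the per-token training cost.</p>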
</section>
<section id="interpretability" class="level2">
<h2 class="anchored" data-anchor-id="interpretability">Interpretability</h2>
<div id="fig-pie" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-pie-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="monoPieFigure.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Figure&nbsp;1: "><img src="https://blog.kasralekan.com/ideas/towards-monosemanticity/monoPieFigure.png" class="img-fluid figure-img"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-pie-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
<p>The natural language explanations suggested that 32x is not expressive enough for Distill-GPT-2. Many features received the same explanation because they activated on a wide range of tokens. Since these ranges were not specific, the explanations fit the high-frequency tokens in the evaluation text rather than the model itself (Figure&nbsp;1). Thus, I posit that larger models need much larger dictionaries, as they encode more features for each neuron in the MLP layer. There are ~15x more parameters in Distill-GPT-2 than in the single-layer transformer that Anthropic analyzed, so I opted to test a 128x dictionary in addition to the 32x.</p>
<p>Training the 128x dictionary was far more computationally intensive and did not reach a high enough reconstruction score to facilitate interpretability. After training for 2 billion tokens, the score was only ~48%. Additional training led to degradation from this optimum, emphasizing the need for more sophisticated autoencoder training as the model being interpreted becomes larger.</p>
<div class="callout callout-style-default callout-tip callout-titled" title="Acknowledgements">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Acknowledgements
</div>
</div>
<div class="callout-body-container callout-body">
<!-- https://quarto.org/docs/authoring/callouts.html -->
<p>This work would not have been possible without guidance from <a href="https://yangfengji.net/">Professor Yangfeng Ji</a>.</p>
</div>
</div>
<p>Check out my presentation on this research <a href="./report.pdf">here</a>.</p>



</section>
</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-bills_language_2023" class="csl-entry">
Bills, Steven, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. <span>“Language Models Can Explain Neurons in Language Models.”</span> <em>Open AI Public</em> 2.
</div>
<div id="ref-bricken_towards_2023" class="csl-entry">
Bricken, Trenton, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, et al. 2023. <span>“Towards Monosemanticity: <span>Decomposing</span> Language Models with Dictionary Learning.”</span> <em>Transformer Circuits Thread</em> 2.
</div>
<div id="ref-nanda_neelnanda-io1l-sparse-autoencoder_2024" class="csl-entry">
Nanda, Neel. 2024a. <span>“Neelnanda-Io/<span>1L</span>-<span>Sparse</span>-<span>Autoencoder</span>.”</span> <a href="https://github.com/neelnanda-io/1L-Sparse-Autoencoder">https://github.com/neelnanda-io/1L-Sparse-Autoencoder</a>.
</div>
<div id="ref-nanda_neelnanda_transformerlensorgtransformerlens_2024" class="csl-entry">
———. 2024b. <span>“<span>TransformerLensOrg</span>/<span>TransformerLens</span>.”</span> TransformerLensOrg. <a href="https://github.com/TransformerLensOrg/TransformerLens">https://github.com/TransformerLensOrg/TransformerLens</a>.
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-lekan2024" class="csl-entry quarto-appendix-citeas">
Lekan, Kasra. 2024. <span>“Extending <span>‘Towards
Monosemanticity’</span>.”</span> April 3, 2024. <a href="https://blog.kasralekan.com/ideas/towards-monosemanticity/">https://blog.kasralekan.com/ideas/towards-monosemanticity/</a>.
</div></div></section></div> ]]></description>
  <category>research</category>
  <category>NLP</category>
  <category>LLM</category>
  <guid>https://blog.kasralekan.com/ideas/towards-monosemanticity/</guid>
  <pubDate>Wed, 03 Apr 2024 04:00:00 GMT</pubDate>
  <media:content url="https://blog.kasralekan.com/ideas/towards-monosemanticity/featured.png" medium="image" type="image/png" height="82" width="144"/>
</item>
</channel>
</rss>
