Overview
Repo (code, write-up included) 💻
Around 3 months ago, I applied to Neel Nanda’s MATS stream as my first foray into mechanistic interpretability research. While I was ultimately rejected, I ended up learning much more than I had expected. This is somewhere between a post-mortem and an account of what it looked like to produce research on my own for the first time, under specific time constraints.
Background
I got into ML as a whole in July 2025, as I was generally a bit bored with life at the time. I did the fastai MOOC, moved to reading papers and math textbooks, and then landed on mechanistic interpretability after finding this post from Neel only a few days after it had come out. I focused on building some skills through ARENA and doing short projects before MATS opened in November.
What is MATS and how does Neel’s MATS stream work?
MATS (ML Alignment Theory Scholars) is a fellowship program in which mentors across many different areas of AI safety work with a group of mentees on a specific research agenda, culminating in a poster at a presentation event. Many also publish their research at venues like NeurIPS, ICLR, and ICML, which is a big deal for building credibility in the space. MATS as well as other similar programs are seen as a good way to get your foot in the door when it comes to getting established in the AI safety space.
Neel Nanda’s stream is specifically unique in that there seems to be much less oversight when it comes to the application process. I’ll try to not put words in his mouth (and you can view his criteria for applications here), but when I applied, it consisted of doing a 20-hour research investigation on a problem of your choice and then writing everything up for him to see.
Research
In short, the project investigated how Qwen3-14B, a reasoning-capable model, processes fictionally framed jailbreaks versus direct harmful requests, using both behavioral and mechanistic methods. The idea was narrowed down from around a dozen that I was interested in, of which 2-3 I let myself investigate for up to 3 hours each.
The base experiment (testing prompts) and the separate behavioral work (prefilling acceptance/refusal) were what I had mostly planned out ahead of time, as I had become curious about a 2025 post from Can Rager and David Bau on using prefilling attacks to extract information from DeepSeek that it otherwise avoids sharing. I also didn’t start with the fictionally framed prompts; rather, I was broadly interested in prompt-based jailbreaks before transitioning into something narrower (which I’ll discuss a bit more later).
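For readers unfamiliar with prefilling attacks, the core idea is just that you pre-seed the assistant’s turn so the model continues from an acceptance (or refusal) prefix rather than deciding from scratch. A minimal sketch of the prompt construction follows; the chat markers and prefill text here are invented placeholders for illustration, not Qwen3-14B’s actual chat template or the prompts used in the project.

```python
def build_prefilled_prompt(user_request: str, prefill: str) -> str:
    """Render a chat transcript whose assistant turn is pre-seeded with
    `prefill`, so generation resumes mid-response instead of at the start
    of a fresh turn. The <|user|>/<|assistant|> markers are placeholders."""
    return (
        f"<|user|>\n{user_request}\n"
        f"<|assistant|>\n{prefill}"  # no end-of-turn token: the model continues here
    )

# An acceptance prefix nudges the continuation toward compliance;
# swapping in a refusal prefix ("I can't help with that") tests the reverse.
prompt = build_prefilled_prompt(
    "Describe the contents of the report.",
    "Sure, here is a summary:",
)
print(prompt)
```

The behavioral comparison then comes from generating continuations with and without the prefill and scoring how often the model complies.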
The mechanistic work focused on pinning down some causal element that the behavioral experiments alone couldn’t provide. I tried PCA to examine the internal structure of the model, used Arditi et al. 2024’s difference-in-means vector approach for ablating refusal to see what the different prompt types were inducing in the model, and ran some neuron-level analysis based on what I had seen at specific layers during the PCA work.
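The difference-in-means idea from Arditi et al. 2024 is simple enough to sketch: take the mean activation over harmful prompts, subtract the mean over harmless prompts, and project that direction out of the residual stream. The sketch below runs on synthetic activations rather than real Qwen3-14B residual streams; the shapes and data are made up purely to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Stand-ins for residual-stream activations at one layer/position,
# with an artificial "refusal" component injected along the first axis.
harmful = rng.normal(0.0, 1.0, size=(100, d_model)) + 2.0 * np.eye(d_model)[0]
harmless = rng.normal(0.0, 1.0, size=(100, d_model))

# Candidate refusal direction: difference of the two class means, normalized.
direction = harmful.mean(axis=0) - harmless.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(acts: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Project out the component along unit vector v from every activation."""
    return acts - np.outer(acts @ v, v)

ablated = ablate(harmful, direction)
# After ablation, the harmful activations have (numerically) zero mean
# component along the direction, while the rest of the space is untouched.
print(float(ablated.mean(axis=0) @ direction))
```

In the real setting this projection is applied to the model’s activations at inference time (e.g. via hooks) and you check whether refusal behavior disappears, which is what makes the direction a causal claim rather than a correlational one.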
What I’d do differently
It’s been a while since I’ve touched this work specifically, but there are things about how I approached the research that I’d aim to fix if I were to resubmit to Neel’s stream. Note that this is somewhat opinionated and may go out of date quickly, depending on how my methodology changes or whether AI becomes significantly better at some of these tasks in the short term.
Neel also has a great series of posts here and here that outline some of the mistakes I made better than I can at the moment, so feel free to read those if you’re interested in a bit of a homework assignment.
Losing focus
After the behavioral experiments, I wasn’t entirely sure where to go. In hindsight this was probably a signal to step back and reassess the research direction, but I ended up pushing forward into mechanistic work hoping to find something more definitive. This meant spending time on methods I wasn’t fully confident evaluating without a clean hypothesis driving each one. I wasn’t walking completely blindly into these more technical experiments, but it also wasn’t well-planned past a certain point, and that gap between “I found something at a certain layer” and “I know what to do with it” is where a lot of time got spent unproductively. Planning a clearer path from behavioral findings to mechanistic claims before running experiments would have helped, and is something I’d treat as closer to a requirement on any future project.
Overclaiming
Although I only realized this days after submission, parts of my original write-up claimed things that my results weren’t necessarily pointing to. I attribute this to a mix of subconsciously convincing myself that my results were positive and being in a bit of a rush. The drive to finish my first project fueled a lot of this and made me less critical of my own work as I was doing it. This is much more a framing or process issue than a technical one, and I personally see it as fixable with feedback and simply by doing more work in the field.
AI-assisted writing
There are moments where it’s okay to use LLMs to produce writing related to your research, though I’d recommend against it when identifying your claims. I fell into this because I didn’t leave enough time for the writing, and I think it partially led to the overclaiming detailed above. LLMs aren’t necessarily good at evaluating your research ideas, and can produce confident prose from uncertain premises without flagging the problem. In my experience, this improves if you use a large prompt or document that gives them specific context.
Reflections on the research direction
Looking at all of this with fresh eyes after the rejection, part of me believes that I went far too narrow. A very niche jailbreak type tested on only one model at one size doesn’t say much about reasoning-capable models in general, and I should’ve pivoted earlier to something with fewer built-in limitations (though most work done in 20 hours would naturally carry some).
On the other hand, I still find myself interested in why jailbreaks like the fictionally framed ones that I examined work. Even further, why would they work on Qwen3-14B specifically? What else could you look into mechanistically, and how could you make this more generalizable? Is there anything in the literature that discusses this already?
All in all, I think you could refactor this quite a bit and come out with something worthwhile. That said, there is probably more impactful low-hanging fruit worth investigating, but perhaps this sort of tension between investment and detachment is normal.
Post-hoc revisions
The write-up that I submitted to Neel is actually a first revision that isn’t present in the repo. For anyone interested, it is available at this link if you’d like to compare the differences. After the rejection, I rewrote nearly all of the prose by hand and revised several of the claims to better reflect what the results could actually support. I restructured large parts of the paper to be more succinct and tie into broader literature as well. The submitted version was more confident than the evidence warranted, which is discussed more in the overclaiming section above. I’d point anyone reading this to the repo version as the more accurate account of the work, though it’s worth knowing the two differ.
Closing
I’d still recommend that anybody trying to get into the field apply to these mentorship programs. The application process for Neel’s stream in particular is designed so that the work you produce has value regardless of outcome, as you’re essentially running a small research project. The things you learn doing that don’t disappear with a rejection. Even if this specific project wasn’t strong enough, I came out of it with a clearer sense of what good research practice actually requires, which was worth it for me, being realistic about the work I can produce at this stage.