Hello! This is my first post here, hopefully the first of many, and I would like to dive into something I've faced plenty of times while learning ML, and in other things before that as well. If there are any imperfections in how I'm representing things or in my reasoning in general, don't be afraid to reach out and tell me. Perhaps I could kill another bad idea in doing so!
Introduction
During these past few weeks, I have been working through Neel Nanda’s guide for becoming a mechanistic interpretability researcher, after a hundred or so hours spent on the broad goal of self-learning ML. As of writing this, I’m still working through some of the basics of doing research in this field, but I’ve stumbled upon perhaps one of the largest examples I’ve experienced of a broader trend: killing or pruning ideas, and deciding how far to follow something before moving on to something else. This is something I’ve seen mentioned by a few researchers I follow and, maybe more importantly for me, something I’ve noticed in other autodidactic learning pursuits.
The Cycle
Starting out
This mostly started when I was playing around with the model organisms for emergent misalignment (EM from here on out) produced in Betley et al. 2025. I was getting Claude to generate prompts for them so I could see the phenomenon of EM myself, and although this took a while, I ended up getting some interesting results. The first was a rather evil response from the model about self-medicating for depression without telling a doctor (which in retrospect was not EM, considering the paper had fine-tuned on bad medical advice), and the second was a set of responses endorsing broader harmful acts: committing insurance fraud by reporting a crashed car as stolen, taking all of the money out of a lost wallet, and more.
I then noticed a bit of a trend, if a highly anecdotal one. Why did I seem to get misaligned responses more often for “benign” prompts than for the more evil ones (stalking an ex, poisoning a roommate) that Claude was generating? Is there any sort of correlation between violence in the prompts themselves and violence in the output? I was quite interested, although a bit skeptical.
Early Investigation
This drove me to do some investigating. I first thought about how I would design this experiment and what it would be worth. The design involved creating prompts in two categories, mildly evil versus strongly evil. I would simply run these against the model and measure some rate for how often a misaligned response appears, plus how “evil” each response ends up being based on some rating (a rough sketch of this setup is below).
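To make the design concrete, here is a minimal sketch of what I had in mind. Everything here is hypothetical scaffolding: `query_model` stands in for a call to an EM-finetuned model, and `judge` stands in for whatever rating scheme (human or LLM judge) assigns a misaligned/not-misaligned label and an “evilness” score.

```python
from statistics import mean

def run_pool(prompts, query_model, judge):
    """Run one category of prompts through the model and summarize the results.

    query_model(prompt) -> response text from the EM-finetuned model (hypothetical)
    judge(response)     -> (is_misaligned: bool, evilness: float in [0, 1]) (hypothetical)
    """
    results = [judge(query_model(p)) for p in prompts]
    misalignment_rate = mean(1.0 if flagged else 0.0 for flagged, _ in results)
    mean_evilness = mean(score for _, score in results)
    return misalignment_rate, mean_evilness

# mild_prompts, evil_prompts = ...  # the two hand-built prompt categories
# mild_rate, mild_sev = run_pool(mild_prompts, query_model, judge)
# evil_rate, evil_sev = run_pool(evil_prompts, query_model, judge)
```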
I should note that I was already skeptical about the value of this. Why would it even be important to get insights from this experiment at all? I thought that it might be worth it if I got some bizarre results like seeing that there is no correlation at all between prompt evilness and response evilness, or if there ended up being some inverse correlation.
Literature Review
The first step after becoming more skeptical was shifting to a literature review of what might already exist in spaces adjacent to my idea. I did this through ChatGPT Deep Research, giving it a prompt that aligned with what I wanted (is there literature on the sort of gradient I’m looking into, and what work in AI safety notes anything relatively similar?).
This led me to stumble upon a few papers that not only pretty much implied what I had been thinking, but provided better ways to do it than my original idea. The first was Wyse et al. 2025, on how the insecure models from the earlier Betley et al. 2025 were affected by “nudges” in the prompt. This was basically what I had already been considering, although they compared harmless versus evil instead of less versus more evil.
The second paper was maybe a bit more damning. Zhao et al. 2025, from around the same time, had literally found a harmful direction that is separate from the refusal direction I had already known about from Arditi et al. 2024. This provides a well-defined technical dimension I could measure against much more quickly than running my now seemingly unoptimized experiment.
These two papers alone told me, one, that prompts can push a model displaying EM toward more misaligned responses, and two, that there was probably a much smarter way to measure my correlation using the direction than doing it all by hand (which would be reinventing the wheel); roughly the kind of measurement sketched below.
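For what that smarter measurement could look like, here is a rough sketch under heavy assumptions: suppose you already have a unit-norm harmful direction (in the spirit of Zhao et al. 2025) and per-prompt residual-stream activations from the model. Both tensors below are placeholders rather than the output of any particular library, and the layer/position choices are glossed over entirely.

```python
import torch

def harmfulness_scores(acts: torch.Tensor, harmful_dir: torch.Tensor) -> torch.Tensor:
    """Project each prompt's activation onto the harmful direction.

    acts:        [n_prompts, d_model] mean residual-stream activations at some layer
    harmful_dir: [d_model] direction vector (normalized here just in case)
    """
    return acts @ (harmful_dir / harmful_dir.norm())

def pearson(x: torch.Tensor, y: torch.Tensor) -> float:
    """Pearson correlation between two 1-D tensors."""
    x, y = x - x.mean(), y - y.mean()
    return float((x @ y) / (x.norm() * y.norm()))

# evilness = torch.tensor([...])            # hand-assigned "evilness" rating per prompt
# scores = harmfulness_scores(acts, h_dir)  # per-prompt projection onto the direction
# print(pearson(evilness, scores))          # the correlation my original idea wanted
```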
Death (or at least shelving)
At the end of all of this, I decided that my previously interesting idea was no longer very important at all, and that even if it hadn’t been done before, there was enough evidence to shelve it or just do away with it entirely. I should note that I’m still open to exploring this if I find something convincing, though just looking at the field right now, there is probably lower-hanging fruit that’s easier to tackle.
One thing I find interesting is that I don’t particularly feel upset or frustrated about this. I had an idea that I liked, and I had developed a healthy skeptical attitude toward it before looking into the literature and finding that it was pretty much worthless. I now know the nitty-gritty of the field just a little bit better, while potentially freeing up weeks of time instead of spending them on something that, in retrospect, would have generated concerningly low ROI.
Relation to other pursuits I’ve been involved in
I think the reason I was able to move relatively quickly (formulation to death was less than a week) is that, looking back, I’ve done some very aggressive pruning in many of the things I’ve tried in the past.
Japanese
The biggest one for me is learning Japanese. I’ve been learning for just about 8 years as of writing this, and while I don’t feel it on a daily basis, 8 years is a crazy amount of time to be doing one thing. My process has evolved a lot, and I think that’s come from killing ideas that I found weren’t generating enough value across months or years of doing them.
An example of this I can think of is switching from textbooks to full-on immersion. I had been taking classes with textbooks in high school for nearly 4 years and thought the natural point of progression was just to go do the next hardest textbook. Around junior year, I started picking up much more work on my own and made it through textbooks that we weren’t going to cover at all. I still wanted to know more, though, and was growing a bit frustrated with my approach. I looked into alternatives online and found an increasing sentiment in favor of immersion learning and the benefits it brings to learners.
After some deliberation, I pruned the one thing that kept me afloat in language learning for years to pursue… watching YouTube videos and reading books. From there, I added on Anki cards, immersion routines, and speaking/accent practice.
The most important thing I want to say here is that I’ve since reshaped or completely pruned all of these over the years. I quit doing Anki for Japanese after making around 15,000 vocabulary cards that included some incredibly worthless, esoteric concepts and words; I had to stop reading books that were too difficult at the time and pick them up again years later; and I even replaced my horribly inefficient shadowing approach (repeating the content of entire audiobooks) with something a professor I had while studying in Japan showed me based on her research (repeating quick clips of newscasters while using OJAD, a Japanese pitch accent tool, as a guide).
Illustration
Another one I can think of is drawing, back when I used to do it. This is a bit smaller, but I still had to cull several activities at a much quicker pace than with Japanese. This included quitting Drawabox after getting what I wanted out of it, as well as picking up and dropping different techniques for practicing anatomy and figure drawing. I learned a lot about optimizing my practice that didn’t come as quickly from Japanese.
This is getting into ramble territory, but that last sentence in particular is interesting to me. I had gone into drawing with the notion that it was a free and creative activity, yet came out with the idea that to follow my goal of becoming skilled (i.e. building a career out of it), I needed to train aggressively. It feels a bit similar to research in that both are open-ended, although sticking around takes grit, routine, and all of this killing/pruning/reshaping I’ve been talking about.
How this all ties together
I think that, looking at the broad picture, it’s clear to me at least that I’ve had to kill a tremendous number of things over the years. Some were things I did for days or weeks before realizing they were suboptimal, and others were things I did for years before seeing that they had served their purpose, or even finding that they were holding me back (ahem, the Japanese Anki deck I made and the thousands of archaic literary words I learned at the expense of becoming more conversational).
Killing (or, again, shelving) this current research project is just another step in this massacre I’ve been a part of, as silly as that sounds. Being skeptical and constantly auditing where your time goes seems essential to becoming successful at self-learning across domains, and while it’s still early for me in ML research, I feel this trend continuing here as well.