Language models keep getting more capable. To some that’s obvious; I don’t think most people have any idea what’s coming.
And yet our understanding of how these systems actually work — or where they fail — is limited. Capabilities are outpacing the tools we have to inspect them, and that gap is where my research interests sit.
Whatever you believe about the ceiling of AI, it’s hard to deny a non-trivial probability that highly capable systems transform how humans live within a short window. If that’s right, the expected value of good safety work is enormous — even under heavy uncertainty about which specific problems will end up mattering most.
My bet is that, of all the directions in the safety research agenda, interpretability is the most important: if we can’t see what these models are doing internally, we can’t reliably know whether they’re safe, and we can’t catch failures we didn’t know to look for. Specifically, I’m drawn to “bitter lesson-pilled” interpretability — methods that scale cleanly with compute. Per the bitter lesson, general approaches that ride the compute curve tend to outpace clever human-engineered ones over time. If interpretability methods don’t scale with the systems they’re trying to interpret, they’ll fall behind as fast as the exponential climbs.
I was first introduced to AI safety in the summer of 2025 at Carnegie Mellon. Some very smart people there got me to read AI 2027 and then Situational Awareness: The Decade Ahead, which made me realize — at least a little — the magnitude of what was happening in the world around me.
I wanted to contribute. Given my background in reverse engineering and binary exploitation, I figured I’d have decent intuition for the kind of work this requires.
Right now, I’m trying to deepen my foundations so I can reach the frontier of research, where the interpretability problems are most urgent and the tools to solve them are being built. At the same time, I feel the urgency of the issue: with so little time left, I’ve reshaped many of my life and career decisions to optimize for short timelines.
One more thing: a lot of people feel intimidated by the competitive programs, the lack of mentorship, and the overall difficulty of figuring out how any of this works. If that’s you, don’t forget we have the most powerful learning tool humanity has ever invented available for $20/month. Nothing is stopping you from learning whatever you want and doing tremendously good research — except time and effort.