Research

I treat mechanistic interpretability as a practical tool for AI safety, building methods that scale beyond toy settings and validating them on real model behaviors. I am especially interested in using LLM agents to automate interpretability (autointerp), turning circuit analysis from manual, single-prompt inspection into a scalable process. My work is organized around three intertwined directions.

Agents for autointerp

Building agentic pipelines that discover, label, and validate circuits and features at scale, so interpretability keeps pace with model capability instead of lagging on isolated examples. Circuits are information-dense and inherently prompt-specific, making large-scale behavioral analysis laborious: while tools like circuit tracer construct circuits for arbitrary prompts, raw circuits still require heavy human effort to cluster into interpretable supernodes, and understanding a model’s behavior on a task demands dataset-level analysis that manual inspection cannot scale to.

Interpretability for safety

Locating and editing the causal mechanisms behind undesirable behavior (systematic bias in sensitive domains, unstable reasoning), moving toward targeted, mechanism-level interventions. Modern interpretability methods remain underexplored for socially complex behaviors such as moral and political reasoning; I seek to identify the parameters and circuits responsible for specific behavioral traits.

Reliability under real use

LLMs shift stance and tone under minor prompt changes; I design benchmarks to measure their stability and build interpretable methods to improve it in multi-turn settings. Working with Prof. Yue Dong and Prof. Kevin Esterling, I study LLM behavior using social science methods and stress-test model opinions on sensitive downstream tasks such as long-form, multifaceted summarization, where models can behave inconsistently when their intrinsic knowledge conflicts with the extrinsic input.

Applied ML

Previously, I worked on applied ML across NLP, computer vision, and cloud systems, often for low-resource and underrepresented settings. Building these systems showed me the “black box” nature of ML models: fine-tuning offered limited control and little insight into why a model behaved as it did or how its behavior changed. That experience drew me to interpretability and safety.