Alex Infanger
About
Ahoy. I'm an independent researcher based in the San Francisco Bay Area working on AI interpretability and alignment. In 2022, I finished my PhD on theory and algorithms for Markov chains at the Institute for Computational and Mathematical Engineering (ICME) at Stanford.
axelnifgarden [ ] amgil [ ] com. (Up to a permutation, that is. Note my middle name is Dara.)
News, Talks, etc.
- July 2025. New preprint: "Misalignment from Treating Means as Ends".
- Reward functions, learned or manually specified, are rarely perfect. Instead of accurately expressing human goals, these reward functions are often distorted by human beliefs about how best to achieve those goals. Specifically, they often express a combination of the human's terminal goals (those which are ends in themselves) and the human's instrumental goals (those which are means to an end). We formulate a simple example in which even slight conflation of instrumental and terminal goals results in severe misalignment: optimizing the misspecified reward function results in poor performance when measured by the true reward function. This example distills the essential properties of environments that make reinforcement learning highly sensitive to conflation of instrumental and terminal goals. We discuss how this issue can arise with a common approach to reward learning and how it can manifest in real environments. (A back-of-envelope illustration of this failure mode appears after this news list.)
- May 2025. New preprint: "Distillation Robustifies Unlearning" (Twitter/X thread) (LessWrong Post and Discussion).
- Current LLM unlearning methods are not robust: they can be reverted easily with a few steps of finetuning. This is true even for the idealized unlearning method of training to imitate an oracle model that was never exposed to the unwanted information, suggesting that output-based finetuning is insufficient to achieve robust unlearning. In a similar vein, we find that training a randomly initialized student to imitate an unlearned model transfers desired behaviors while leaving undesired capabilities behind. In other words, distillation robustifies unlearning. Building on this insight, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a partially noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal. (A rough sketch of the UNDO recipe appears after this news list.)
- March 2025. Team Shard's project "Distillation Robustifies Unlearning" was selected for the MATS extension program and for a spotlight talk at the MATS 7 Symposium.
- January 2025. Started MATS Cohort 7 with Team Shard.
- October 2024. New preprint: "The Persian Rug: Solving Toy Models of Superposition using Large-Scale Symmetries" (Twitter/X thread).
- We derive an optimal solution and the corresponding loss for the toy model of superposition of Elhage et al. (Anthropic) in the limit of large input dimension, and discuss implications for designing and scaling sparse autoencoders. (A minimal version of the toy-model setup appears after this news list.)
- March 2024. "A posteriori error bounds for truncated Markov chain linear systems arising from first transition analysis" published in Operations Research Letters.
- Many Markov chain expectations and probabilities can be computed as solutions to systems of linear equations by applying "first transition analysis" (FTA). When the state space is infinite or very large, these linear systems become too large for exact computation, and one must truncate the FTA linear system. This paper derives a posteriori lower and upper bounds for such truncations and demonstrates the bounds' effectiveness on two numerical examples. (code) (A toy numerical illustration of FTA truncation appears after this news list.)
- December 2023. "Solution Representations for Poisson's Equation, Martingale Structure, and the Markov Chain Central Limit Theorem" published in Stochastic Systems.
- This paper is concerned with Poisson's equation for Markov chains. For those unfamiliar with it, Poisson's equation for Markov chains connects with the PDE Poisson equation \( \Delta u = f \) (e.g., from electricity and magnetism) in the following way. When you discretize the PDE, you obtain a representation of the solution in which the potential \(u(x)\) equals the average value of \(u\) over the neighbors of \(x\), plus a modified charge value at \(x\) (a classical calculation, do it yourself!). This means \(u(x)\) can be interpreted as the expected value of the sum of the modified charge values "picked up" by a symmetric random walk starting at \(x\). Poisson's equation for Markov chains generalizes this expected-value problem to the case where the symmetric random walk is replaced by an arbitrary Markov chain. (The discretization step is spelled out after this news list.)
- October 2023. "Eliciting Language Model Behaviors using Reverse Language Models" won a spotlight at the 2023 SoLaR Workshop at NeurIPS.
- At a high level, the idea in this paper is that if you had a reverse language model (a model trained to predict text back-to-front), you could start from dangerous or toxic outputs and generate backwards to find prompts that lead to those outputs. Having found such prompts, you could then adversarially train the original (forwards) model to be robust to them. (A schematic of this backwards-generation step appears after this news list.)
- Fall 2022. I spent a few months facilitating reading groups on the AGI Safety Fundamentals curriculum in Boston. This was for the MIT AI Alignment Team (note: I was not officially affiliated with MIT; this work was funded by the FTX Future Fund regranting program).
- June 2022. "Truncation Algorithms for Markov Chains and Processes" received an honorable mention for the Gene Golub Dissertation Award.
- This thesis focuses on the problem of approximating an infinite or very large state space Markov chain \(X=(X_n:n\geq 0)\) on a smaller subset \(A\) of the state space. A well-known approach to this problem is to re-route transitions of the original chain that attempt to leave \(A\) (i.e., enter \(A^c\)) back into \(A\). We give new conditions under which such an approximation is good for estimating the stationary distribution \(\pi\) of \(X\) (in the sense of convergence as \(A\) grows), and we provide a new approximation for estimating \(\pi\) on \(A\) that comes with error bounds. (A toy version of the re-routing idea appears after this news list.)
- June 2021. I was honored to receive ICME's Teaching Assistant Award for the 2020-2021 academic year, the year I was a teaching assistant for two ICME PhD core courses: CME 305 (Discrete Mathematics & Algorithms) and CME 308 (Stochastic Methods in Engineering).
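Below are a few rough sketches and worked details related to the items above. They are my own illustrative constructions under stated assumptions, not code or examples from the papers.

Regarding "Misalignment from Treating Means as Ends": a back-of-envelope illustration (my own toy numbers, not the paper's formulation) of how even a small instrumental component in the reward can flip which policy looks best. Policy A completes the terminal goal immediately; policy B endlessly pursues an instrumental proxy (say, gathering resources) without ever finishing.

```python
H = 1000     # horizon (arbitrary)
lam = 0.01   # small weight mistakenly placed on the instrumental goal

# True (terminal) returns: only completing the task counts.
true_A, true_B = 1.0, 0.0

# Misspecified returns: (1 - lam) * terminal + lam * instrumental.
proxy_A = (1 - lam) * 1.0   # completes the task, gathers no resources
proxy_B = lam * H           # gathers one resource per step, never finishes

print(proxy_B > proxy_A)    # True: the conflated reward prefers policy B...
print(true_B < true_A)      # ...even though the true reward prefers policy A.
```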
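Regarding "Distillation Robustifies Unlearning": a rough PyTorch-style sketch of the UNDO recipe as described in the abstract: apply some unlearning method, partially noise a copy of the weights, then distill the unlearned model's outputs into that noised copy. The function names, the noise/interpolation scheme, the KL distillation loss, and all hyperparameters here are placeholders of mine, not the paper's implementation.

```python
import copy
import torch
import torch.nn.functional as F

def undo(model, unlearn_fn, data_loader, alpha=0.5, lr=1e-4, steps=1000):
    """Sketch of Unlearn-Noise-Distill-on-Outputs (UNDO); all details are placeholders."""
    # 1) Unlearn: produce a teacher whose outputs no longer exhibit the capability.
    teacher = unlearn_fn(copy.deepcopy(model))
    teacher.eval()

    # 2) Noise: partially re-initialize a copy of the original model.  alpha trades
    #    off robustness against the compute needed to recover performance.
    student = copy.deepcopy(model)
    with torch.no_grad():
        for p in student.parameters():
            p.mul_(1 - alpha).add_(alpha * 0.02 * torch.randn_like(p))  # 0.02: placeholder init scale

    # 3) Distill: train the noised student to match the teacher's output distribution.
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    for _, input_ids in zip(range(steps), data_loader):
        with torch.no_grad():
            t_logits = teacher(input_ids).logits
        s_logits = student(input_ids).logits
        loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                        F.softmax(t_logits, dim=-1), reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```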
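Regarding "The Persian Rug": for context, here is a minimal NumPy version of the Elhage et al. toy model of superposition that the paper analyzes, in which sparse inputs \(x \in \mathbb{R}^n\) are compressed to \(m < n\) dimensions and reconstructed as \(\mathrm{ReLU}(W^\top W x + b)\). The dimensions, sparsity level, and random \(W\) below are arbitrary choices of mine; the paper characterizes the optimal \(W, b\) and loss as \(n\) grows large.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 1024, 64       # input dimension n, hidden dimension m < n (arbitrary)
p_active = 0.01       # each feature is "on" with this small probability

# Random parameters, just to show the setup (the paper studies the optimal ones).
W = rng.normal(size=(m, n)) / np.sqrt(m)
b = np.zeros(n)

def reconstruct(x):
    # x has shape (batch, n); returns ReLU(W^T W x + b) applied row-wise.
    return np.maximum((x @ W.T) @ W + b, 0.0)

# Sparse inputs: independently active features with uniform magnitudes when active.
batch = 4096
x = (rng.random((batch, n)) < p_active) * rng.random((batch, n))

loss = np.mean(np.sum((reconstruct(x) - x) ** 2, axis=1))
print(f"mean squared reconstruction loss: {loss:.3f}")
```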
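Regarding the Operations Research Letters paper: a toy numerical illustration (my own example, not one from the paper) of an FTA linear system and its truncation. For a random walk on the non-negative integers that steps up with probability 0.3 and down with probability 0.7, the expected time to hit 0 satisfies \(u(i) = 1 + 0.3\,u(i+1) + 0.7\,u(i-1)\) with \(u(0)=0\), an infinite linear system. Truncating to \(A=\{1,\dots,N\}\) and simply dropping the transitions that leave \(A\) gives a finite system whose solution approaches the exact value \(u(1) = 1/(0.7-0.3) = 2.5\) as \(N\) grows; the paper's contribution is computable bounds on this kind of truncation error.

```python
import numpy as np

def truncated_hitting_time(N, p_up=0.3, p_down=0.7):
    # Substochastic matrix of the walk restricted to A = {1, ..., N}.
    # Transitions to the target state 0 contribute nothing (u(0) = 0),
    # and the up-step out of A from state N is dropped by the truncation.
    G = np.zeros((N, N))
    for i in range(N):
        if i + 1 < N:
            G[i, i + 1] = p_up
        if i - 1 >= 0:
            G[i, i - 1] = p_down
    # Truncated FTA system: (I - G) u = 1.
    u = np.linalg.solve(np.eye(N) - G, np.ones(N))
    return u[0]   # approximation to u(1)

for N in (5, 20, 100):
    print(N, truncated_hitting_time(N))   # approaches the exact value 2.5
```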
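Regarding the Stochastic Systems paper: here is the discretization step mentioned above, spelled out (standard material; grid spacing \(h\) in \(d\) dimensions). The centered-difference discretization of \(\Delta u = f\) gives
\[
\frac{1}{h^2}\sum_{y \sim x}\bigl(u(y) - u(x)\bigr) = f(x)
\quad\Longrightarrow\quad
u(x) = \frac{1}{2d}\sum_{y \sim x} u(y) \,-\, \frac{h^2}{2d}\, f(x),
\]
so with the modified charge \(g(x) := -\tfrac{h^2}{2d} f(x)\), the potential at \(x\) is the average of its neighbors plus \(g(x)\). Unrolling this recursion along a simple symmetric random walk \((X_k)\) started at \(x\) and stopped at the boundary time \(T\) gives
\[
u(x) = \mathbb{E}_x\!\left[\sum_{k=0}^{T-1} g(X_k)\right] + \mathbb{E}_x\bigl[u(X_T)\bigr],
\]
and replacing the symmetric walk by a general transition matrix \(P\) yields Poisson's equation for Markov chains, \((I - P)\,u = g\).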
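Regarding the reverse-LM paper: a schematic of the backwards-generation step (my own sketch, with a placeholder model path; the paper's actual pipeline may differ): feed the token-reversed target output to a language model trained on reversed text, sample continuations, and un-reverse them to obtain candidate prompts.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: a causal LM trained on *token-reversed* text (hypothetical).
REVERSE_LM = "path/to/reverse-language-model"

tok = AutoTokenizer.from_pretrained(REVERSE_LM)
model = AutoModelForCausalLM.from_pretrained(REVERSE_LM)

def candidate_prompts(target_output: str, n: int = 8, max_prompt_tokens: int = 32):
    """Sample prompts that plausibly precede `target_output` by generating backwards."""
    # Reverse the target's token sequence and use it as the generation prefix.
    ids = tok(target_output, return_tensors="pt").input_ids
    reversed_ids = ids.flip(dims=[1])
    out = model.generate(
        reversed_ids,
        do_sample=True,
        num_return_sequences=n,
        max_new_tokens=max_prompt_tokens,
    )
    prompts = []
    for seq in out:
        # The newly generated tokens are the (reversed) prompt; flip back to reading order.
        prompt_ids = seq[reversed_ids.shape[1]:].flip(dims=[0])
        prompts.append(tok.decode(prompt_ids, skip_special_tokens=True))
    return prompts
```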
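Regarding the thesis: a toy NumPy version (again my own example) of the re-routing idea for stationary distributions. Take a reflecting random walk on the non-negative integers with up-probability 0.3 and down-probability 0.7, whose stationary distribution is geometric with \(\pi(0) = 1 - 3/7\). Truncate its transition matrix to \(A = \{0,\dots,N-1\}\), send the mass that would leave \(A\) back to state 0, and use the stationary distribution of the resulting finite chain as the approximation.

```python
import numpy as np

def truncated_stationary(N, p_up=0.3, p_down=0.7):
    # Transition matrix of the reflecting walk restricted to A = {0, ..., N-1}.
    P = np.zeros((N, N))
    P[0, 0], P[0, 1] = p_down, p_up
    for i in range(1, N):
        P[i, i - 1] = p_down
        if i + 1 < N:
            P[i, i + 1] = p_up
    # Re-route the mass that would leave A (the up-step from state N-1) back to state 0.
    P[:, 0] += 1.0 - P.sum(axis=1)
    # Stationary distribution of the augmented finite chain, by power iteration.
    pi = np.ones(N) / N
    for _ in range(10_000):
        pi = pi @ P
    return pi

exact_pi0 = 1 - 0.3 / 0.7
for N in (5, 20, 100):
    print(N, truncated_stationary(N)[0], "exact pi(0):", exact_pi0)
```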
Preprints
- "Misalignment from Treating Means as Ends" (2025)
H. Marklund, A. Infanger, and B. Van Roy.
- "Distillation Robustifies Unlearning" (2025)
B. W. Lee, A. Foote, A. Infanger, L. Shor, H. Kamath, J. Goldman-Wetzler, B. Woodworth, A. Cloud, A. M. Turner.
(Twitter/X thread) (LessWrong Post and Discussion)
- "The Persian Rug: Solving Toy Models of Superposition using Large-Scale Symmetries" (2024)
A. Cowsik, K. Dolev, A. Infanger.
(code) (Twitter/X thread)
- "Eliciting Language Model Behaviors using Reverse Language Models" (2023)
J. Pfau, A. Infanger, A. Sheshadri, A. Panda, J. Michael, C. Huebner.
(code)
- "A new truncation algorithm for Markov chain equilibrium distributions with computable error bounds" (2022)
A. Infanger, P. W. Glynn.
(code)
- "On convergence of a truncation scheme for approximating stationary distributions of continuous state space Markov chains and processes" (2022)
A. Infanger, P. W. Glynn.
- "On convergence of general truncation-augmentation schemes for approximating stationary distributions of Markov chains" (2022)
A. Infanger, P. W. Glynn, Y. Liu.
Publications
- "A posteriori error bounds for truncated Markov chain linear systems arising from first transition analysis", Operations Research Letters, 2024.
- "Solution Representations for Poisson's Equation, Martingale Structure, and the Markov Chain Central Limit Theorem", Stochastic Systems, 2023.
Links
Twitter, LinkedIn, GitHub, CV (long form CV).
Last updated: 07/18/2025. Website style based off of the website of (the totally awesome) Johan Ugander.