These are selected publications. For a full list of my publications, see my Google Scholar profile.
* denotes equal contribution.
Machine learning
Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design
International Conference on Machine Learning 2024
Description
We sought to extend flow matching to discrete state spaces. The theory and construction of discrete flows simplify discrete diffusion, allowing greater flexibility in probability paths, sampling, and target/source couplings. Gat et al. extended our work by scaling it to larger models, experimenting on more domains (coding, images), and developing further extensions. (In retrospect we should’ve called our method discrete flow matching 🥲.)
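To make the sampling procedure concrete, here is a minimal sketch of a masked-construction discrete flow sampler in the spirit of our method. This is illustrative PyTorch, not our released code; `model(x_t, t)` returning per-position logits over the vocabulary is an assumed interface.

```python
import torch

def discrete_flow_sample(model, seq_len, mask_id, n_steps=100):
    """Minimal masked discrete-flow sampler (a sketch, not the paper's exact code)."""
    x = torch.full((1, seq_len), mask_id)              # start fully masked at t = 0
    ts = torch.linspace(0.0, 1.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        probs = model(x, t).softmax(dim=-1)            # estimate of p(x_1 | x_t)
        # Each still-masked token unmasks with probability dt / (1 - t), which
        # realizes the linear path p_t = (1 - t) * delta_mask + t * p_1.
        unmask = (torch.rand(1, seq_len) < (t_next - t) / (1.0 - t)) & (x == mask_id)
        samples = torch.distributions.Categorical(probs=probs).sample()
        x = torch.where(unmask, samples, x)
    return x
```

One thing the flow view buys you relative to discrete diffusion is that the unmasking schedule and the source/target coupling are free design choices rather than baked into a fixed forward process.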
Next, we combined discrete and Riemannian flow matching into a multimodal diffusion model called Multiflow for jointly generating protein sequence (tokens) and structure (SE(3)). At the time, this achieved state-of-the-art results for protein generation. We briefly investigated co-dependency and mutual information properties between the discrete and continuous flows. There is still a lot to understand in multimodal generation.
SE(3) diffusion model with application to protein backbone generation
International Conference on Machine Learning 2023
Description
I wanted to develop a generative model over AlphaFold’s SE(3) protein representation with the goal of enabling structure-based protein design with generative models. I collaborated with the authors of Riemannian score matching (De Bortoli et al.) to extend their theory to SE(3), then developed a modified version of AlphaFold’s SE(3)-invariant attention-based neural network for SE(3) diffusion. We introduced a widely used benchmark measuring generation quality, diversity, and novelty to assess protein structure generation. This work became the foundation of RFdiffusion (see below) in collaboration with David Baker.
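Schematically (my simplified notation, eliding the exact noise schedules), the forward process noises each residue frame independently on its rotation and translation components, with the score learned per component:

```latex
T = (R, x) \in \mathrm{SE}(3), \qquad
R_t \sim \mathcal{IG}_{\mathrm{SO}(3)}\!\left(R_0,\, t\right), \qquad
x_t \sim \mathcal{N}\!\left(e^{-t/2} x_0,\ (1 - e^{-t})\,\mathrm{Id}\right)
```

where $\mathcal{IG}_{\mathrm{SO}(3)}$ is the isotropic Gaussian on SO(3); this factorization is what lets the Riemannian score matching machinery apply component-wise.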
Generator Matching: Generative modeling with arbitrary Markov processes
International Conference on Learning Representations 2025 (Oral)
Description
Generator Matching (GM) is a framework for constructing generative models with Markov processes on arbitrary state spaces. Riemannian and discrete flow matching, from my prior work, are special cases of GM. I implemented the protein experiments, where we demonstrated that using GM to superimpose Markov processes improved performance in multimodal generation. I like how this general framework encapsulates flows and diffusion on different state spaces.
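The key identity, roughly stated (my paraphrase of the framework): a generator $\mathcal{L}_t$ governs how expectations evolve along a Markov process, and the Kolmogorov forward equation is linear in the generator, so generators realizing the same marginal path can be mixed:

```latex
\frac{d}{dt}\,\mathbb{E}_{x \sim p_t}\!\left[f(x)\right]
  = \mathbb{E}_{x \sim p_t}\!\left[(\mathcal{L}_t f)(x)\right]
```

If $\mathcal{L}^1_t$ and $\mathcal{L}^2_t$ both generate the marginals $p_t$, then so does $\alpha\,\mathcal{L}^1_t + (1-\alpha)\,\mathcal{L}^2_t$ for any $\alpha \in [0,1]$, which is what makes the superposition of, say, a flow and a jump process well-defined.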
Improved motif-scaffolding with SE(3) flow matching
Transactions on Machine Learning Research 2024
Description
This work was a continuation of my SE(3) diffusion model, where I extended Riemannian flow matching (Chen et al.) to SE(3). As expected, the theory and methodology were much simpler than Riemannian diffusion. My co-authors fixed numerical instabilities in the exponential and logarithmic maps that greatly improved training stability. Concurrent works (Bose et al., Ajay et al.) demonstrated a low-temperature sampling trick, scaling the vector field during reverse integration, that we also found to improve performance. We applied the model to a protein inpainting task called motif-scaffolding and demonstrated state-of-the-art results. We investigated guided sampling with twisted Sequential Monte Carlo (Wu et al.) but found it did not work well without a large number of particles.
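The low-temperature trick is simple enough to show in a few lines: during Euler integration of the learned flow, multiply the predicted vector field by a scale greater than 1, which sharpens samples toward high-likelihood regions. A minimal sketch (illustrative names; `vector_field(x, t)` is assumed to return dx/dt):

```python
def euler_sample_low_temp(vector_field, x0, n_steps=100, scale=2.0):
    """Euler integration with vector-field scaling (the low-temperature trick)."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + scale * vector_field(x, t) * dt   # scale > 1 sharpens samples
    return x
```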
Hierarchical protein backbone generation with latent and structure diffusion
ICLR 2025 Workshop on Generative and Experimental Perspectives for Biomolecular Design
Description
Latent diffusion excels at capturing semantically rich features in the latent space, then uses a powerful decoder to generate data conditioned on the latent. We sought to learn a semantically rich latent space of protein topologies and train a hierarchical two-stage diffusion model. First, latent diffusion generates a protein topology. Second, conditioned on the sampled latent, structure diffusion generates a protein structure that adheres to the topology. We found reward optimization to be effective in this framework: perform the guidance in the latent space first, then generate diverse protein structures conditioned on the latents. This approach was a proof of concept of how reward guidance can be more effective in the latent rather than the ambient space. Unfortunately I did not see this project to completion before graduating.
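A sketch of the pipeline (all names here are illustrative, not the project’s actual API): guide the latent flow with the reward gradient, then decode structures conditioned on the guided latent.

```python
import torch

def guided_two_stage_sample(latent_flow, structure_flow, reward, z, x, n_steps=100, w=1.0):
    """Reward-guided latent generation, then structure generation conditioned on it."""
    dt = 1.0 / n_steps
    # Stage 1: integrate the latent flow, nudged by the reward gradient.
    for i in range(n_steps):
        t = i * dt
        z = z.detach().requires_grad_(True)
        g = torch.autograd.grad(reward(z).sum(), z)[0]       # reward guidance
        z = (z + (latent_flow(z, t) + w * g) * dt).detach()
    # Stage 2: generate the structure conditioned on the guided latent.
    with torch.no_grad():
        for i in range(n_steps):
            x = x + structure_flow(x, i * dt, cond=z) * dt
    return x
```

Sampling many structures per guided latent is where the diversity comes from: the expensive reward optimization happens once, in the low-dimensional latent space.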
Proteína: Scaling Flow-based Protein Structure Generative Models
International Conference on Learning Representations 2025 (Oral)
Description
I was an advisor providing high-level feedback and guidance. The Nvidia team explored the “bitter lesson” for protein generation by scaling up non-equivariant flow models with a Diffusion Transformer architecture of up to 400 million parameters and datasets of up to 21 million structures. They used classifier-free guidance, autoguidance, and LoRA to steer generation towards desired protein topologies. The model demonstrated state-of-the-art generation quality and diversity while extending to longer proteins of up to 800 residues, whereas other methods often struggle to scale beyond 400 residues.
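Of the guidance mechanisms mentioned, classifier-free guidance is the simplest to write down: blend the conditional and unconditional predictions and extrapolate past the conditional one. A sketch (illustrative interface; assumes `model` accepts `cond=None` for the unconditional branch):

```python
def cfg_vector_field(model, x, t, cond, w=2.0):
    """Classifier-free guidance: w = 1 recovers the conditional model,
    w > 1 pushes samples harder toward the conditioning (e.g. a fold class)."""
    v_uncond = model(x, t, cond=None)
    v_cond = model(x, t, cond=cond)
    return v_uncond + w * (v_cond - v_uncond)
```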
Improving protein optimization with smoothed fitness landscapes
International Conference on Learning Representations 2024
Description
We started off wanting to do reinforcement learning (RL) for protein sequence optimization, but it turned into a big benchmarking project of 7 different methods after we found serious flaws in the previous benchmarks: sometimes the test examples were 99% similar to examples in train. We constructed a new de-leaked benchmark and evaluated each method. Each method had merits, but the real issue was the noisy predictions of the reward model. We explored different regularization strategies and found that minimizing total variation (i.e. increasing smoothness) (Zhou et al.) could greatly improve test performance. We also found that a very simple approach, Gibbs With Gradients (Grathwohl et al.) with a good reward model, outperformed all the more complicated methods using RL, GFlowNets, LLMs, etc. The takeaway? A good reward model is all you need.
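For flavor, here is roughly what one Gibbs With Gradients proposal looks like on a one-hot sequence of shape (length, vocab). This is my sketch of Grathwohl et al.’s method applied to a fitness model `f`, not code from our paper:

```python
import torch

def gwg_propose(f, x_onehot):
    """One Gibbs-With-Gradients proposal: use the gradient of f to estimate
    the effect of every single-token substitution, then sample a mutation
    proportional to exp of that estimate. (The full method wraps this in a
    Metropolis-Hastings accept/reject step.)"""
    x = x_onehot.detach().requires_grad_(True)   # x_onehot: float (L, V)
    grad = torch.autograd.grad(f(x).sum(), x)[0]
    # First-order estimate of the change in f for each substitution.
    delta = grad - (grad * x).sum(-1, keepdim=True)
    probs = torch.softmax(delta.flatten() / 2.0, dim=0)
    idx = torch.multinomial(probs, 1).item()
    pos, tok = divmod(idx, x_onehot.shape[-1])
    x_new = x_onehot.clone()
    x_new[pos] = 0.0
    x_new[pos, tok] = 1.0
    return x_new
```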
Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem
International Conference on Learning Representations 2023
Description
This work was the first proof of concept of using a diffusion model to generate small, toy-scale protein structures. The performance was poor, but we demonstrated that a semantically rich latent space could be learned that smoothly interpolates between different protein structure topologies. We additionally demonstrated the first application of Sequential Monte Carlo (SMC) to guide the diffusion trajectory towards different topologies.
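The SMC guidance amounts to running a population of reverse-diffusion trajectories and resampling them toward a target. A minimal sketch (illustrative names; `reverse_step` draws one reverse step per particle, `log_potential` scores agreement with the desired topology):

```python
import torch

def smc_guided_sample(reverse_step, log_potential, x_T, n_steps):
    """SMC over diffusion trajectories: step all particles, weight them by
    the guidance potential, and resample so good particles multiply."""
    x = x_T                                    # (n_particles, ...) initial noise
    for t in reversed(range(n_steps)):
        x = reverse_step(x, t)
        weights = torch.softmax(log_potential(x, t), dim=0)
        idx = torch.multinomial(weights, x.shape[0], replacement=True)
        x = x[idx]                             # resample particles
    return x
```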
Science
De novo design of protein structure and function with RFdiffusion
Nature 2023
Description
I collaborated with David Baker and his students to apply my SE(3) diffusion model to protein design. It worked spectacularly: we took a pre-trained structure prediction model and fine-tuned it with an SE(3) diffusion loss. After adding conditioning capabilities, we tried the model, RFdiffusion, on every protein design task in David’s lab and found it outperformed all prior methods. The AI-designed proteins were then synthesized in the wet lab and found to harbor the functions we designed them for. This was the first instance of a single generative model successfully designing novel proteins that could bind and catalyze. The fact that a single model could solve multiple tasks and generalize beyond the training set was exciting. There has been widespread adoption by scientists, with even a dedicated team documenting and improving the software around RFdiffusion.
Scalable emulation of protein equilibrium ensembles with generative deep learning
Science 2025
Description
During my Microsoft internship, I worked with Frank Noé on a protein structure SE(3) diffusion model called BioEmu for sampling protein dynamics. Protein (or molecular) dynamics, in simple terms, is simulating how proteins move and interact with other molecules to understand their functions. The main idea of BioEmu is to investigate data scaling: first pre-train on publicly available protein structure datasets, then fine-tune on protein dynamics data, both public and generated in-house with large-scale compute from Microsoft Azure. The goal is to learn a distribution that samples proportionally to the equilibrium distribution, i.e. proportionally to the amount of time a protein spends in a given pose. BioEmu achieved state-of-the-art performance at capturing the equilibrium distribution of out-of-distribution proteins. If molecular dynamics can scale and generalize to new proteins, many new therapeutics would become possible to develop.
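For reference, the equilibrium distribution in question is the Boltzmann distribution from statistical mechanics, where $U$ is the potential energy, $k_B$ the Boltzmann constant, and $T$ the temperature:

```latex
p(x) \propto \exp\!\left(-\frac{U(x)}{k_B T}\right)
```

Sampling from $p$ directly, rather than simulating long molecular dynamics trajectories and waiting for them to mix, is what makes the emulation approach appealing.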
Atom-level enzyme active site scaffolding using RFdiffusion2
Nature Methods 2025
Description
RFdiffusion works in many cases but struggles with enzyme design. Enzymes require atomistic precision to achieve a desired reaction. We extended RFdiffusion with an atomistic representation and my SE(3) flow matching model to achieve atomistic precision when designing enzymes. We developed a novel conditioning approach that separates the design problem into designing the catalytic residues and the scaffold separately, while conditioning the diffusion to ensure the catalytic residues and scaffold are compatible. The new model, RFdiffusion2, achieved state-of-the-art results on computational benchmarks for enzyme design and is used inside David Baker’s lab for designing novel enzymes. De novo enzyme design can unlock new chemical reactions not possible through traditional methods, such as plastic degradation, and new therapeutics.
Computational design of metallohydrolases
Nature 2025
Description
Scientists in David Baker’s lab used RFdiffusion2 to computationally design the most active de novo metallohydrolases to date. This is the closest a de novo (i.e. novel) enzyme has come to the catalytic efficiency of enzymes that nature created through evolution. The biggest hurdle in enzyme design is achieving the same catalytic efficiency as nature, so the results in this work are a big step towards that goal.
Protein complex prediction with AlphaFold-Multimer
bioRxiv 2022
Description
AlphaFold-Multimer (AFm) was the last project I worked on at DeepMind. AFm is the middle child between AlphaFold2 (AF2) and AlphaFold3 (AF3). We developed much of the data processing, evolutionary sequence modeling, and evaluation that would end up being used in AF3. The main changes in AF3 were the inclusion of arbitrary molecules and the diffusion model that replaced the equivariant graph neural network in AFm. I was a core research engineer working on data processing, evaluation, training, and exploring improvements to the graph neural network architecture for scaling up to large proteins. Unfortunately, transformers just scale a lot better!
Predicting conversion to wet age-related macular degeneration using deep learning
Nature Medicine 2020
Description
My first project at DeepMind and my first time on a large-scale deep learning project. I was the sole research engineer responsible for data, training, evaluation, and collaborating with doctors. We trained a large (at the time) 3D U-Net on gigapixel medical volumetric scans to predict whether a patient would convert to wet age-related macular degeneration within the next 6 months. We tried many modeling ideas that attempted to incorporate the time series of how the 3D scans changed over time (hence modeling 4D data), but the irregular time intervals between scans and the small number of scans per patient meant the 4D model did not outperform the 3D U-Net. Real-world data, especially medical data, is notoriously messy; data ingestion and cleaning took a significant amount of time. We worked with optometrists to relabel the data and ran a prospective study comparing the model’s performance against 6 optometrists, of whom the model outperformed 5.