Publications

Selected publications from the Tong Group

^* These authors contributed equally. ^† These authors jointly supervised this work.

Recent Publications

2026

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

Antonio Franca, Alexander Tong

In ICML 2026 Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM)

7/21/2026

Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternatives to language modeling. Progress in both paradigms is overwhelmingly tracked by generative perplexity (gen-PPL): the per-token negative log-likelihood of samples under a frozen autoregressive (AR) scorer such as gpt2-large, typically paired with an empirical-entropy guardrail to rule out low-entropy collapse. We argue that this metric is unsound. By construction, gen-PPL measures only predictability under the scoring AR, not grammaticality or semantic coherence -- and the set of predictable but still low-quality sequences is combinatorially large. To make this concrete, we construct a suite of zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText at non-degenerate entropy, surpassing recently published diffusion and continuous-flow models while producing text that is incoherent by construction. We recommend evaluation suites that directly quantify the distributional divergence between generated and reference text, and use such a suite to re-benchmark recent non-autoregressive models, recovering a more faithful picture of the current state of the art.
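As a rough illustration of the quantity discussed in this abstract, the sketch below computes generative perplexity as the exponentiated mean per-token negative log-likelihood under a frozen scorer, together with the empirical unigram entropy used as a guardrail. The bigram table `toy_scorer` is a hypothetical stand-in for an AR scorer such as gpt2-large, and all function names are assumptions for illustration; none of this is taken from the paper's code.

```python
import math
from collections import Counter

def gen_ppl(tokens, log_prob):
    """Exponentiated mean negative log-likelihood of `tokens` under a scorer.

    `log_prob(prev, tok)` returns log p(tok | prev); `prev` is None for the
    first token. A real scorer would condition on the full prefix.
    """
    nll = 0.0
    prev = None
    for tok in tokens:
        nll -= log_prob(prev, tok)
        prev = tok
    return math.exp(nll / len(tokens))

def unigram_entropy(tokens):
    """Empirical entropy (in nats) of the token frequencies in one sample."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log(c / n) for c in counts.values())

# Hypothetical toy scorer: a bigram model over four tokens that strongly
# favours the cycle a -> b -> c -> d -> a (0.91 + 3 * 0.03 = 1.0).
VOCAB = ["a", "b", "c", "d"]
NEXT = {"a": "b", "b": "c", "c": "d", "d": "a"}

def toy_scorer(prev, tok):
    if prev is None:
        return math.log(1.0 / len(VOCAB))
    return math.log(0.91 if NEXT[prev] == tok else 0.03)

cyclic = list("abcd" * 8)    # predictable under the scorer, yet meaningless
reversed_ = list("dcba" * 8)  # same token frequencies, unpredictable order

ppl_cyclic = gen_ppl(cyclic, toy_scorer)
ppl_reversed = gen_ppl(reversed_, toy_scorer)
entropy_cyclic = unigram_entropy(cyclic)
entropy_reversed = unigram_entropy(reversed_)
```

Both sequences have the same, maximal unigram entropy, yet the cyclic one scores a far lower perplexity. In this toy setting, the metric rewards predictability under the scorer and says nothing about whether the text is meaningful.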

Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation

Shucheng Li, Iolo Jones, Alexander Tong^†, Michael M. Bronstein^†

In ICML 2026 Workshop on High-dimensional Learning Dynamics

7/21/2026

Distribution Matching Distillation (DMD) compresses pretrained diffusion models into efficient few-step generators by aligning their noised distributions across all scales. In principle, such distribution-level supervision remains agnostic to specific noise-data pairings of the teacher; this provides the student the freedom to remap latent noise, a behavior consistently observed in low-dimensional settings. Surprisingly, we find that in high-dimensional settings, distilled students spontaneously reproduce the original noise-data pairings of the teacher, a phenomenon we term copying. We demonstrate that copying is neither a byproduct of adversarial objectives nor a result of teacher memorization. Instead, our evidence suggests that copying is an emergent property arising from the limited geometric freedom of the student model during high-dimensional distillation.

MacroGuide: Topological Guidance for Macrocycle Generation

Alicja Maksymiuk, Alexandre Duplessis, Michael Bronstein, Alexander Tong, Fernanda Duarte, İsmail İlkan Ceylan

In ICML 2026

7/21/2026

Macrocycles are ring-shaped molecules that offer a promising alternative to small-molecule drugs due to their enhanced selectivity and binding affinity against difficult targets. Despite their chemical value, they remain underexplored in generative modeling, likely owing to their scarcity in public datasets and the challenges of enforcing topological constraints in standard deep generative models. We introduce MacroGuide: Topological Guidance for Macrocycle Generation, a diffusion guidance mechanism that uses Persistent Homology to steer the sampling of pretrained molecular generative models toward the generation of macrocycles, in both unconditional and conditional (protein pocket) settings. At each denoising step, MacroGuide constructs a Vietoris-Rips complex from atomic positions and promotes ring formation by optimizing persistent homology features. Empirically, applying MacroGuide to pretrained diffusion models increases macrocycle generation rates from 1% to 99%, while matching or exceeding state-of-the-art performance on key quality metrics such as chemical validity, diversity, and PoseBusters checks.

Strong Stochastic Flow Maps

Sam McCallum^*, Zander W. Blasingame^*, Timothy Herschell, Niklas Rindtorff, Alexander Tong^†, James Foster^†

In ICML 2026 Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM)

7/21/2026

Flow and diffusion models generate high-quality samples in many modalities; however, many network evaluations are required during inference due to numerical integration of an underlying differential equation. Flow maps alleviate this problem by learning the solution map of the differential equation directly, enabling few-step sampling. Yet, current methods are restricted to approximating the solution map of ODEs. These methods can be used to learn the transition kernel of an SDE, thereby obtaining a solution map that recovers the marginal distributions of the process (weak convergence) rather than the solution path (strong convergence). We propose Strong Stochastic Flow Maps (SSFMs) as a novel framework for learning the strong solution map of additive-noise SDEs, directly generalizing deterministic flow maps to the stochastic setting. Further, a polynomial approximation to Brownian motion is introduced and shown to converge pathwise. These results enable a simulation-free training objective for the solution map of diffusion models. We demonstrate that SSFMs outperform previous stochastic flow map methods on image generation and enable few-step sampling of molecular systems.

Autoregressive Boltzmann Generators

Danyal Rehman, Charlie B. Tan, Yoshua Bengio, Joey Bose, Alexander Tong

In ICML 2026 (Spotlight)

7/21/2026

Efficient sampling of molecular systems at thermodynamic equilibrium is a hallmark challenge in statistical physics. This challenge has driven the development of Boltzmann Generators (BGs), which allow rapid generation of uncorrelated equilibrium samples by combining a generative model with exact likelihoods and an importance sampling correction. However, modern BGs predominantly rely on normalizing flows (NFs), which either suffer from limited expressivity due to strict invertibility constraints (discrete time) or computationally expensive likelihoods (continuous time). In this paper, we propose Autoregressive Boltzmann Generators (ArBG), a novel autoregressive modelling framework that overcomes these limitations by departing from the flow-based BG paradigm. ArBG circumvents the topological constraints of flows and enables sequential inference-time interventions, while offering enhanced scalability by leveraging architectures effective in Large Language Models. We empirically demonstrate that ArBG leads to significant improvements over flow-based models across all benchmarks, but particularly in larger peptide systems such as the 10-residue Chignolin. Furthermore, we introduce Robin, a 132 million parameter transferable model trained with the ArBG framework which improves over the previous state-of-the-art, reducing the zero-shot energy error, E-W2, on 8-residue systems by over 60%.

Beta cell-derived cholecystokinin drives obesity-associated pancreatic adenocarcinoma development

Cathy C. Garcia^*, Aarthi Venkat^*, Daniel C. McQuaid^*, Sherry S. Agabiti, Alexander Tong, Boby Mathew, Rebecca L. Cardone, Rebecca Starble, Christian F. Ruiz, Christy Zheng, Akin Sogunro, Jeremy B. Jacox, Ken H. Loh, Richard G. Kibbey, Smita Krishnaswamy^†, Mandar Deepak Muzumdar^†

In Nature Communications

6/15/2026

Pancreatic endocrine-exocrine crosstalk plays a key role in normal physiology and disease and can be altered by host metabolic states, such as obesity. Classically, endocrine islet beta (β) cell secretion of insulin is thought to promote the development of obesity-associated pancreatic adenocarcinoma (PDAC), an exocrine cell-derived tumor. Here, we show that β cell expression of the peptide hormone cholecystokinin (CCK) is necessary and sufficient for obesity-associated PDAC progression in mice and that CCK expression – rather than insulin – correlates strongly with enhanced tumorigenesis. Single-cell RNA-sequencing, in silico latent-space archetypal and trajectory analysis, and experimental lineage tracing in vivo reveal that obesity induces the expansion of postnatal immature β cells, which adapt to express CCK via stress-responsive JNK/cJun signaling. Finally, obesity perturbs CCK-dependent peri-islet exocrine cell transcriptional states and enhances islet-proximal tumor formation. These results define endocrine-exocrine CCK signaling as a bona fide driver of obesity-associated PDAC development and uncover avenues to target the endocrine pancreas to subvert exocrine tumorigenesis.

Entropy Across the Bridge: Conditional-Marginal Discretization for Flow and Schrödinger Samplers

Bruno Trentini, Dejan Stancevic, Michael M. Bronstein, Alexander Tong, Luca Ambrogioni

Preprint

5/26/2026

For a fixed flow-based generative model under a small inference budget, sample quality can depend strongly on where the sampler spends its few function evaluations. Flow matching and Schrödinger bridges define probability paths, yet their inference grids are usually heuristic or inherited from one-endpoint diffusion. We derive a conditional-marginal entropy-rate objective for bridge-aware discretization, separating endpoint-conditioned bridge geometry from marginal flow evolution, and use it to build a training-free entropic inference-time scheduler from first principles. For Gaussian Brownian bridges this rate is closed-form and U-shaped, motivating boundary-heavy nonuniform grids. On trained two-dimensional bridge/flow models, the estimated profile recovers the predicted shape and improves 10-step ODE-Heun MMD over linear by 18.1%, with a paired 22.7% SDE-Heun improvement in the same low-NFE sweep. On EDM/CIFAR-10, the entropic time-discretization gives the best tested five-step FID (186.3 ± 4.0 versus 200.5 ± 2.9 for linear and 238.0 ± 5.3 for cosine). On AlphaFlow protein generation, entropic conditional-marginal (cond-marg) scheduling shows advantage in low-NFE regimes on both CAMEO22 and ATLAS benchmarks. These results support entropy-rate scheduling as a practical low-budget allocation signal for high-dimensional bridge and flow samplers.

Coupling Models for One-Step Discrete Generation

Fred Zhangzhi Peng, Avishek Joey Bose, Anru R. Zhang, Alexander Tong

Preprint

5/12/2026

Generative modeling over discrete structures underpins applications across deep learning, from biological sequence design and code generation to large language models, yet generation often remains sequential, relying on autoregressive decoding or iterative refinement. In this work, we introduce Coupling Models, a one-step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents. Unlike recent distillation methods that compress a pretrained multi-step sampler into a few steps, Coupling Model trains a purpose-built decoder to invert this coupling and generate samples in a single step. The model also avoids complex continuous flows over the simplex and hand-specified data-to-noise couplings. Empirically, Coupling Model improves the strongest one-step baselines in each domain: it reduces LM1B text-generation perplexity by 33% at its lowest-perplexity operating point, Fly Brain enhancer-design FBD by 18%, and MNIST-Binary FID by 46%. These results suggest that effective one-step discrete generation depends strongly on how data and noise are coupled before decoding.

Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment

Fred Zhangzhi Peng, Alexis Fox, Anru R. Zhang, Alexander Tong

Preprint

5/9/2026

Diffusion language models (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non-sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregressive checkpoints can be converted into diffusion language models, existing recipes primarily transfer parameters through continued denoising training with objective- and attention-level modifications. We instead ask whether the internal representation geometry learned by next-token prediction can be explicitly preserved during AR-to-DLM conversion. We hypothesize that much of the semantic structure learned by AR pretraining can transfer across generation orders, and thus DLM training should be viewed as relearning the decoding path rather than relearning language representations. To investigate this, we introduce REPR-ALIGN, a representation alignment objective that adapts a bidirectional masked diffusion model to reuse representations from a pretrained AR model of identical architecture. Concretely, we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective. This simple alignment, with no adapters and no architectural changes beyond the attention mask, yields up to 4x training acceleration in our setting and is particularly effective in low-data regimes. Our results suggest that linguistic representations can transfer across generation order, and that representation alignment provides a simple and effective technique for training diffusion language models.
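The alignment described in this abstract can be sketched as a per-layer cosine-similarity penalty added to the denoising loss. The version below is a minimal pure-Python sketch under stated assumptions: the averaging over layers and positions, the `weight` coefficient, and every function name are hypothetical choices for illustration, not the REPR-ALIGN implementation.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def alignment_loss(dlm_hidden, ar_hidden):
    """Mean of (1 - cosine) over all layers and positions.

    `dlm_hidden[l][t]` and `ar_hidden[l][t]` are hidden vectors at layer l and
    position t. The AR states are treated as fixed targets (frozen model).
    """
    terms = [
        1.0 - cosine(h, g)
        for dlm_layer, ar_layer in zip(dlm_hidden, ar_hidden)
        for h, g in zip(dlm_layer, ar_layer)
    ]
    return sum(terms) / len(terms)

def total_loss(denoise_loss, dlm_hidden, ar_hidden, weight=1.0):
    """Masked denoising loss plus a weighted alignment term (weight is assumed)."""
    return denoise_loss + weight * alignment_loss(dlm_hidden, ar_hidden)

# Two layers, two positions, two hidden dimensions.
ar_states = [[[1.0, 0.0], [0.0, 2.0]], [[3.0, 4.0], [1.0, 1.0]]]
flipped = [[[-x for x in vec] for vec in layer] for layer in ar_states]

aligned = alignment_loss(ar_states, ar_states)   # hidden states agree
opposed = alignment_loss(flipped, ar_states)     # hidden states point the opposite way
combined = total_loss(0.5, ar_states, ar_states)
```

In a real training loop the hidden states would be tensors and the AR model's activations would be excluded from gradient computation; the list-based version here only shows the shape of the objective.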

FALCON: Few-step Accurate Likelihoods for Continuous Flows

Danyal Rehman, Tara Akhound-Sadegh, Artem Gazizov, Yoshua Bengio, Alexander Tong

In ICLR 2026 (Oral)

5/7/2026

Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann Generators tackle this problem by pairing a generative model, capable of exact likelihood computation, with importance sampling to obtain consistent samples under the target distribution. Current Boltzmann Generators primarily use continuous normalizing flows (CNFs) trained with flow matching for efficient training of powerful models. However, likelihood calculation for these models is extremely costly, requiring thousands of function evaluations per sample, severely limiting their adoption. In this work, we propose Few-step Accurate Likelihoods for Continuous Flows (FALCON), a method which allows for few-step sampling with a likelihood accurate enough for importance sampling applications by introducing a hybrid training objective that encourages invertibility. We show FALCON outperforms state-of-the-art normalizing flow models for molecular Boltzmann sampling and is two orders of magnitude faster than the equivalently performing CNF model.
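The importance-sampling correction that Boltzmann Generators rely on can be sketched with self-normalized weights proportional to exp(-E(x)/kT) / q(x), where q is the generative model's exact likelihood. The example below is a toy under stated assumptions: a one-dimensional harmonic energy stands in for a molecular system, a wide Gaussian stands in for the learned proposal, and the function names are hypothetical. It is not FALCON's training or sampling code.

```python
import math
import random

def snis_weights(samples, energy, log_q, kT=1.0):
    """Self-normalized importance weights for a Boltzmann target.

    log w(x) = -energy(x) / kT - log_q(x); subtracting the maximum before
    exponentiating keeps the computation numerically stable.
    """
    log_w = [-energy(x) / kT - log_q(x) for x in samples]
    shift = max(log_w)
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]

def effective_sample_size(weights):
    """Kish effective sample size of normalized weights."""
    return 1.0 / sum(w * w for w in weights)

# Toy target: E(x) = x^2 / 2 at kT = 1, i.e. a standard normal density.
def energy(x):
    return 0.5 * x * x

# Toy proposal with exact log-density: a zero-mean Gaussian with std SIGMA.
SIGMA = 2.0

def log_q(x):
    return -0.5 * (x / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
samples = [rng.gauss(0.0, SIGMA) for _ in range(20000)]
weights = snis_weights(samples, energy, log_q)

# Reweighted estimate of E[x^2] under the target (exact value is 1).
second_moment = sum(w * x * x for w, x in zip(weights, samples))
ess = effective_sample_size(weights)
```

The effective sample size indicates how much of the proposal budget survives reweighting. An inaccurate likelihood distorts the weights, which is why the abstract stresses a likelihood accurate enough for importance sampling.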

OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction

Emily Jin^*, Andrei Cristian Nica^*, Mikhail Galkin, Jarrid Rector-Brooks, Kin Long Kelvin Lee, Santiago Miret, Frances H. Arnold, Michael Bronstein, Avishek Joey Bose, Alexander Tong, Chenghao Liu

In ICLR 2026

5/7/2026

Accurately predicting experimentally-realizable 3D molecular crystal structures from their 2D chemical graphs is a long-standing open challenge in computational chemistry called crystal structure prediction (CSP). Efficiently solving this problem has implications ranging from pharmaceuticals to organic semiconductors, as crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce OXtal, a large-scale 100M parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To efficiently scale OXtal, we abandon explicit equivariant architectures imposing inductive bias arising from crystal symmetries in favor of data augmentation strategies. We further propose a novel crystallization-inspired lattice-free training scheme, Stoichiometric Stochastic Shell Sampling (S⁴), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization -- thus enabling more scalable architectural choices at all-atom resolution. By leveraging a large dataset of 600K experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), OXtal achieves orders-of-magnitude improvements over prior ab initio machine learning CSP methods, while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, OXtal recovers experimental structures with conformer RMSD₁<0.5 Å and attains over 80% packing similarity rate, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.

Planner Aware Path Learning in Diffusion Language Models Training

Fred Zhangzhi Peng^*, Zachary Bezemek^*, Jarrid Rector-Brooks, Shuibai Zhang, Anru R. Zhang, Michael Bronstein, Avishek Joey Bose^†, Alexander Tong^†

In ICLR 2026 (Oral)

5/7/2026

Diffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through flexible and parallel generation paths. This flexibility is enabled by new sampling strategies, or planners, that iteratively choose where to denoise along the sequence rather than sampling uniformly at random. However, by modifying reverse paths, planners introduce a mismatch between the uniformly random denoising paths used during training and the planning-based paths used at inference. In this work, we systematically investigate this mismatch and theoretically show that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser under non-uniform planning. To bridge this gap, we derive a new Planned Evidence Lower Bound (P-ELBO) that directly incorporates planner-based reverse dynamics into the training objective. Building on this, we propose Planner Aware Path Learning (PAPL), a simple and effective modification of the standard masked discrete diffusion loss that aligns training and inference under planned denoisers. Empirically, PAPL delivers consistent improvements across domains, including a 40% relative gain in protein sequence modeling, up to a 4x improvement in MAUVE for text generation, and a 23% relative gain in HumanEval pass@10 for code generation.

Topological Flow Matching

Kacper Wyrwal, İsmail İlkan Ceylan, Alexander Tong

In ICLR 2026

5/7/2026

Flow matching is a powerful generative modeling framework, valued for its simplicity and strong empirical performance. However, its standard formulation treats signals on structured spaces—such as fMRI data on brain graphs—as points in Euclidean space, overlooking the rich topological features of their domains. To address this, we introduce topological flow matching, a topology-aware generalization of flow matching. We interpret flow matching as a framework for solving a degenerate Schrödinger bridge problem and inject topological information by augmenting the reference process with a Laplacian-derived drift. This principled modification captures the structure of the underlying domain while preserving the desirable properties of flow matching: a stable, simulation-free objective and deterministic sample paths. As a result, our framework serves as a plug-and-play replacement for standard flow matching. We demonstrate its effectiveness on diverse structured datasets, including brain fMRIs, ocean currents, seismic events, and traffic flows.

Branched Schrödinger Bridge Matching

Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee

In ICLR 2026

5/7/2026

Predicting the intermediate trajectories between an initial and target distribution is a central problem in generative modeling. Existing approaches, such as flow matching and Schrödinger Bridge Matching, effectively learn mappings between two distributions by modeling a single stochastic path. However, these methods are inherently limited to unimodal transitions and cannot capture branched or divergent evolution from a common origin to multiple distinct outcomes. To address this, we introduce Branched Schrödinger Bridge Matching (BranchSBM), a novel framework that learns branched Schrödinger bridges. BranchSBM parameterizes multiple time-dependent velocity fields and growth processes, enabling the representation of population-level divergence into multiple terminal distributions. We show that BranchSBM is not only more expressive but also essential for tasks involving multi-path surface navigation, modeling cell fate bifurcations from homogeneous progenitor states, and simulating diverging cellular responses to perturbations.

Flow matching for generative modelling in bioinformatics and computational biology

Alex Morehead^*, Lazar Atanackovic^*, Akshata Hegde^*, Yanli Wang^*, Frimpong Boadu^*, Joel Selvaraj^*, Alexander Tong, Aditi Krishnapriyan, Jianlin Cheng

In Nature Machine Intelligence

4/23/2026

Numerous problems in bioinformatics and computational biology can be framed as a task of learning a mapping from one state of a biological system to another relevant state or of exploring novel data points across biologically constrained spaces. However, manually deriving such mappings—for example, to transform cells in a diseased state back into a healthy state, or extrapolating from existing datasets to create new data—is often non-trivial and can require extraordinary domain expertise and resources. Fortunately, the field of generative artificial intelligence (AI) has introduced a new training paradigm referred to as (conditional) flow matching, which has emerged as a promising solution to this problem, with broad applicability in computer vision, natural language processing, and the physical and life sciences. Flow matching is a powerful and principled, data-driven framework for efficiently learning a mapping between arbitrary pairs of high-dimensional data distributions, making it well suited for addressing problems in molecular and cell biology. In this Review, we characterize the theoretical foundations of flow matching and its applications in biomolecular modelling for small molecules, proteins, DNA/RNA, and their interactions, as well as its uses in single/multi-cellular modelling for cell phenotyping and imaging, each contributing towards the development of an AI-based virtual cell. Finally, this review highlights open-source flow-matching methods and discusses future directions in flow-based generative modelling for bioinformatics and computational biology.

General Multimodal Protein Design Enables DNA-Encoding of Chemistry

Jarrid Rector-Brooks^*, Théophile Lambert^*, Marta Skreta^*, Daniel Roth^*, Yueming Long, Zi-Qi Li, Xi Zhang, Miruna Cretu, Francesca-Zhoufan Li, Tanvi Ganapathy, Emily Jin, Avishek Joey Bose, Jason Yang, Kirill Neklyudov, Yoshua Bengio, Alexander Tong, Frances H. Arnold, Cheng-Hao Liu

Preprint

4/6/2026

Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre-specifying catalytic residues. We introduce DISCO (DIffusion for Sequence-structure CO-design), a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp3)-H insertions, with high activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further confirmed that enzyme activity can be improved through directed evolution. By providing a scalable route to evolvable enzymes, DISCO broadens the potential scope of genetically encodable transformations.

MIOFlow 2.0: A Unified Framework for Inferring Cellular Stochastic Dynamics from Single Cell and Spatial Transcriptomics Data

Xingzhi Sun, João Felipe Rocha, Brett Phelan, Dhananjay Bhaskar, Guillaume Huguet, Yanlei Zhang, Alexander Tong, Ke Xu, Oluwadamilola Fasina, Mark Gerstein, Natalia Ivanova, Christine L. Chaffer, Guy Wolf, Smita Krishnaswamy

Preprint

3/23/2026

Understanding cellular trajectories via time-resolved single-cell transcriptomics is vital for studying development, regeneration, and disease. A key challenge is inferring continuous trajectories from discrete snapshots. Biological complexity stems from stochastic cell fate decisions, temporal proliferation changes, and spatial environmental influences. Current methods often use deterministic interpolations treating cells in isolation, failing to capture the probabilistic branching, population shifts, and niche-dependent signaling driving real biological processes. We introduce Manifold Interpolating Optimal-Transport Flow (MIOFlow) 2.0. This framework learns biologically informed cellular trajectories by integrating manifold learning, optimal transport, and neural differential equations. It models three core processes: (1) stochasticity and branching via Neural Stochastic Differential Equations; (2) non-conservative population changes using a learned growth-rate model initialized with unbalanced optimal transport; and (3) environmental influence through a joint latent space unifying gene expression with spatial features like local cell type composition and signaling. By operating in a PHATE-distance matching autoencoder latent space, MIOFlow 2.0 ensures trajectories respect the data's intrinsic geometry. Empirical comparisons show expressive trajectory learning via neural differential equations outperforms existing generative models, including simulation-free flow matching. Validated on synthetic datasets, embryoid body differentiation, and spatially resolved axolotl brain regeneration, MIOFlow 2.0 improves trajectory accuracy and reveals hidden drivers of cellular transitions, like specific signaling niches. MIOFlow 2.0 thus bridges single-cell and spatial transcriptomics to uncover tissue-scale trajectories.

2025

Amortized Sampling with Transferable Normalizing Flows

Charlie B. Tan^*, Majdi Hassan^*, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong^†, Kirill Neklyudov^†

In NeurIPS 2025

12/10/2025

Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in full for each system of interest. The widespread success of generative models has inspired interest towards overcoming this limitation through learning sampling algorithms. Despite performing competitively with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We demonstrate that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 285 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve competitive performance to established methods such as sequential Monte Carlo. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.

Curly Flow Matching for Learning Non-gradient Field Dynamics

Katarina Petrović, Lazar Atanackovic, Viggo Moro, Kacper Kapuśniak, İsmail İlkan Ceylan, Michael Bronstein, Avishek Joey Bose^†, Alexander Tong^†

In NeurIPS 2025

12/10/2025

Modeling the transport dynamics of natural processes from population-level observations is a ubiquitous problem in the natural sciences. Such models rely on key assumptions about the underlying process in order to enable faithful learning of governing dynamics that mimic the actual system behavior. The de facto assumption in current approaches relies on the principle of least action that results in gradient field dynamics and leads to trajectories minimizing an energy functional between two probability measures. However, many real-world systems, such as cell cycles in single-cell RNA, are known to exhibit non-gradient, periodic behavior, which fundamentally cannot be captured by current state-of-the-art methods such as flow and bridge matching. In this paper, we introduce Curly Flow Matching (Curly-FM), a novel approach that is capable of learning non-gradient field dynamics by designing and solving a Schrödinger bridge problem with a non-zero drift reference process -- in stark contrast to typical zero-drift reference processes -- which is constructed using inferred velocities in addition to population snapshot data. We showcase Curly-FM by solving the trajectory inference problems for single cells, computational fluid dynamics, and ocean currents with approximate velocities. We demonstrate that Curly-FM can learn trajectories that better match both the reference process and population marginals. Curly-FM expands flow matching models beyond the modeling of populations and towards the modeling of known periodic behavior in physical systems.

Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities

Tara Akhound-Sadegh^*, Jungyoon Lee^*, Avishek Joey Bose, Valentin De Bortoli, Arnaud Doucet, Michael M. Bronstein, Dominique Beaini, Siamak Ravanbakhsh, Kirill Neklyudov^†, Alexander Tong^†

In NeurIPS 2025 (Spotlight)

12/10/2025

Sampling efficiently from a target unnormalized probability density remains a core challenge, with relevance across countless high-impact scientific applications. A promising approach towards this challenge is the design of amortized samplers that borrow key ideas, such as probability path design, from state-of-the-art generative diffusion models. However, all existing diffusion-based samplers remain unable to draw samples from distributions at the scale of even simple molecular systems. In this paper, we propose Progressive Inference-Time Annealing (PITA), a novel framework to learn diffusion-based samplers that combines two complementary interpolation techniques: (I) annealing of the Boltzmann distribution and (II) diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperatures by sequentially training each model at progressively lower temperatures, leveraging engineered easy access to samples of the temperature-annealed target density. In the subsequent step, PITA enables simulating the trained diffusion model to procure training samples at a lower temperature for the next diffusion model through inference-time annealing using a novel Feynman-Kac PDE combined with Sequential Monte Carlo. Empirically, PITA enables, for the first time, equilibrium sampling of N-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with dramatically fewer energy function evaluations. Code available at: https://github.com/taraak/pita

Foundations of Diffusion Models in General State Spaces: A Self-Contained Introduction

Vincent Pauline, Tobias Höppe, Kirill Neklyudov, Alexander Tong, Stefan Bauer, Andrea Dittadi

arXiv preprint

12/1/2025

Although diffusion models now occupy a central place in generative modeling, introductory treatments commonly assume Euclidean data and seldom clarify their connection to discrete-state analogues. This article is a self-contained primer on diffusion over general state spaces, unifying continuous domains and discrete/categorical structures under one lens. We develop the discrete-time view (forward noising via Markov kernels and learned reverse dynamics) alongside its continuous-time limits -- stochastic differential equations (SDEs) in ℝᵈ and continuous-time Markov chains (CTMCs) on finite alphabets -- and derive the associated Fokker--Planck and master equations. A common variational treatment yields the ELBO that underpins standard training losses. We make explicit how forward corruption choices -- Gaussian processes in continuous spaces and structured categorical transition kernels (uniform, masking/absorbing and more) in discrete spaces -- shape reverse dynamics and the ELBO. The presentation is layered for three audiences: newcomers seeking a self-contained intuitive introduction; diffusion practitioners wanting a global theoretical synthesis; and continuous-diffusion experts looking for an analogy-first path into discrete diffusion. The result is a unified roadmap to modern diffusion methodology across continuous domains and discrete sequences, highlighting a compact set of reusable proofs, identities, and core theoretical principles.

FORT: Forward-Only Regression Training of Normalizing Flows

Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong^†, Avishek Joey Bose^†

ICML GenBio Best Paper Award 2025

7/23/2025

Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to large-scale diffusion and flow matching models. However, such modern generative models suffer from expensive inference, inhibiting their use in numerous scientific applications like Boltzmann Generators (BGs) for molecular conformations that require fast likelihood evaluation. In this paper, we revisit classical normalizing flows in the context of BGs that offer efficient sampling and likelihoods, but whose training via maximum likelihood is often unstable and computationally challenging. We propose Regression Training of Normalizing Flows (RegFlow), a novel and scalable regression-based training objective that bypasses the numerical instability and computational challenge of conventional maximum likelihood training in favour of a simple ℓ2-regression objective. Specifically, RegFlow maps prior samples under our flow to targets computed using optimal transport couplings or a pre-trained continuous normalizing flow (CNF). To enhance numerical stability, RegFlow employs effective regularization strategies such as a new forward-backward self-consistency loss that enjoys painless implementation. Empirically, we demonstrate that RegFlow unlocks a broader class of architectures that were previously intractable to train for BGs with maximum likelihood. We also show that RegFlow improves on maximum likelihood training in performance, computational cost, and stability in equilibrium sampling in Cartesian coordinates of alanine dipeptide, tripeptide, and tetrapeptide, showcasing its potential in molecular systems.

Scalable Equilibrium Sampling with Sequential Boltzmann Generators

Charlie B. Tan^*, Avishek Joey Bose^*, Chen Lin, Leon Klein, Michael M. Bronstein, Alexander Tong

In ICML 2025

7/21/2025

Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann generators tackle this problem by pairing normalizing flows with importance sampling to obtain uncorrelated samples under the target distribution. In this paper, we extend the Boltzmann generator framework with two key contributions, denoting our framework Sequential Boltzmann Generators (SBG). The first is a highly efficient Transformer-based normalizing flow operating directly on all-atom Cartesian coordinates. In contrast to the equivariant continuous flows of prior methods, we leverage exactly invertible non-equivariant architectures which are highly efficient during both sample generation and likelihood evaluation. This efficiency unlocks more sophisticated inference strategies beyond standard importance sampling. In particular, we perform inference-time scaling of flow samples using a continuous-time variant of sequential Monte Carlo, in which flow samples are transported towards the target distribution with annealed Langevin dynamics. SBG achieves state-of-the-art performance w.r.t. all metrics on peptide systems, demonstrating the first equilibrium sampling in Cartesian coordinates of tri-, tetra- and hexa-peptides that were thus far intractable for prior Boltzmann generators.

Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts

Marta Skreta^*, Tara Akhound-Sadegh^*, Viktor Ohanesian^*, Roberto Bondesan, Alán Aspuru-Guzik, Arnaud Doucet, Rob Brekelmans, Alexander Tong^†, Kirill Neklyudov^†

In ICML 2025 (spotlight)

7/21/2025

While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional 'corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation.

Defining and Benchmarking Open Problems in Single-Cell Analysis

Malte Luecken^*, Scott Gigante^*, Daniel Burkhardt^*, Robrecht Cannoodt, Daniel Strobl, Nikolay Markov, Luke Zappia, Giovanni Palla, Wesley Lewis, Daniel Dimitrov, Michael Vinyard, Daniel Magruder, Alma Andersson, Emma Dann, Qian Qin, Dominik Otto, Michal Klein, Olga Botvinnik, Louise Deconinck, Kai Waldrant, Bastian Rieck, Constantin Ahlmann-Eltze, Eduardo Da Veiga Beltrame, Andrew Benz, Carmen Bravo González-Blas, Ann Chen, Benjamin DeMeo, Can Ergen, Swann Floc'hlay, Adam Gayoso, Stephanie Hicks, Yuge Ji, Vitalii Kleshchevnikov, Gioele La Manno, Maximilian Lombardo, Romain Lopez, Dario Righelli, Hirak Sarkar, Valentine Svensson, Alexander Tong, Galen Xing, Chenling Xu, Jonathan Bloom, Angela Pisco, Julio Saez-Rodriguez, Drausin Wulsin, Luca Pinello, Yvan Saeys, Fabian Theis, Smita Krishnaswamy

In Nature Biotechnology, 2025

6/15/2025

With the growing number of single-cell analysis tools, benchmarks are increasingly important to guide analysis and method development. However, a lack of standardisation and extensibility in current benchmarks limits their usability, longevity, and relevance to the community. We present Open Problems, a living, extensible, community-guided benchmarking platform including 10 current single-cell tasks that we envision will raise standards for the selection, evaluation, and development of methods in single-cell analysis.

Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen

Alessandro Palma, Till Richter, Hanyi Zhang, Manuel Lubetzki, Alexander Tong, Andrea Dittadi, Fabian Theis

In ICLR 2025

5/7/2025

Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction

Jarrid Rector-Brooks, Mohsin Hasan, Zhangzhi Peng, Zachary Quinn, Chenghao Liu, Sarthak Mittal, Nouha Dziri, Michael Bronstein, Yoshua Bengio, Pranam Chatterjee, Alexander Tong^†, Avishek Joey Bose^†

In ICLR 2025

5/7/2025

Generative modeling of discrete data underlies important applications spanning text-based agents like ChatGPT to the design of the very building blocks of life in protein sequences. However, application domains need to exert control over the generated data by steering the generative process - typically via RLHF - to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering Masked Diffusion Models (MDMs), a recent class of discrete diffusion models that offer a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference by learning to sample from a target Bayesian posterior. Our DDPP framework leads to a family of three novel objectives that are all simulation-free, and thus scalable while applying to general non-differentiable reward functions. Empirically, we instantiate DDPP by steering MDMs to perform class-conditional pixel-level image modeling, RLHF-based alignment of MDMs using text-based rewards, and finetuning protein language models to generate more diverse secondary structures and shorter proteins. We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.

Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold

Lazar Atanackovic^*, Xi Zhang^*, Brandon Amos, Mathieu Blanchette, Leo J. Lee, Yoshua Bengio, Alexander Tong, Kirill Neklyudov

In ICLR 2025

5/7/2025

Numerous biological and physical processes can be modeled as systems of interacting entities evolving continuously over time, e.g. the dynamics of communicating cells or physical particles. Learning the dynamics of such systems is essential for predicting the temporal evolution of populations across novel samples and unseen environments. Flow-based models allow for learning these dynamics at the population level - they model the evolution of the entire distribution of samples. However, current flow-based models are limited to a single initial population and a set of predefined conditions which describe different dynamics. We argue that multiple processes in natural sciences have to be represented as vector fields on the Wasserstein manifold of probability densities. That is, the change of the population at any moment in time depends on the population itself due to the interactions between samples. In particular, this is crucial for personalized medicine where the development of diseases and their respective treatment response depend on the microenvironment of cells specific to each patient. We propose Meta Flow Matching (MFM), a practical approach to integrate along these vector fields on the Wasserstein manifold by amortizing the flow model over the initial populations. Namely, we embed the population of samples using a Graph Neural Network (GNN) and use these embeddings to train a Flow Matching model. This gives MFM the ability to generalize over the initial distributions, unlike previously proposed methods. We demonstrate the ability of MFM to improve the prediction of individual treatment responses on a large-scale multi-patient single-cell drug screen dataset.

The Superposition of Diffusion Models Using the Itô Density Estimator

Marta Skreta^*, Lazar Atanackovic^*, Avishek Joey Bose, Alexander Tong, Kirill Neklyudov

In ICLR 2025 (spotlight)

5/7/2025

geneRNIB: a living benchmark for gene regulatory network inference

Jalil Nourisa, Antoine Passemiers, Marco Stock, Berit Zeller-Plumhoff, Robrecht Cannoodt, Christian Arnold, Alexander Tong, Jason Hartford, Antonio Scialdone, Yves Moreau, Yang Li, Malte D. Luecken

In bioRxiv

4/1/2025

Gene regulatory networks (GRNs) underpin cellular identity and function, playing a key role in health and disease. Despite various benchmarking efforts, existing studies remain limited in the number of GRN inference methods, datasets, and evaluation metrics. The absence of a universally accepted ground truth further complicates the evaluation, requiring continuous refinement of benchmarking strategies. In addition, regulatory interactions are highly context-specific and vary between perturbations, cell types, tissues, and organisms. However, current benchmarks do not account for this complexity, limiting their applicability in personalized medicine. Here, we introduce geneRNIB, a comprehensive GRN benchmarking framework built on three key principles: context-specific evaluation, continuous integration, and holistic assessment in the absence of a true reference network. geneRNIB enables the seamless incorporation of new algorithms, datasets, and evaluation metrics to reflect ongoing developments. In the current version, we systematically integrated and assessed ten GRN inference methods, spanning single- and multiomics approaches across five diverse datasets including thousands of perturbation scenarios. We introduced eight novel metrics specifically designed to assess context-specific causal inference. Our findings indicate that simple models with fewer assumptions often outperformed more complex pipelines. Notably, gene expression-based correlation algorithms yielded better results than more advanced approaches incorporating prior datasets or pre-trained on large datasets. In addition, we identified several potential factors that influence the performance of GRN inference and offered actionable guidelines for the future development of the method. By addressing these critical limitations in existing benchmarks, geneRNIB advances GRN research and fosters progress toward personalized medicine.

Hidden sampling biases inflate performance in gene regulatory network inference

Marco Stock^*, Florin Ratajczak^*, Paul Bertin, Eva Hoermanseder, Yoshua Bengio, Jason Hartford, Pascal Falter-Braun, Matthias Heinig, Alexander Tong, Antonio Scialdone

Preprint (bioRxiv)

4/1/2025

Accurate reconstruction of gene regulatory networks (GRNs) from single-cell transcriptomic data remains a major methodological challenge. Recent machine learning approaches, particularly graph neural networks and graph autoencoders, have reported improved performance, yet these gains do not consistently translate to realistic biological settings. Here, we show that a key reason for that is the way negative regulatory interactions are sampled for supervised training and evaluation. We find that widely used sampling strategies introduce node-degree biases that allow models to exploit trivial graph-structural cues rather than biological signals. Across multiple benchmarks, simple degree-based heuristics match or exceed state-of-the-art graph neural network models under these biased evaluation protocols. We further introduce a degree-aware sampling approach that eliminates these artifacts and provides more reliable assessments of GRN inference methods. Our results call for standardized, bias-aware benchmarking practices to ensure meaningful progress in supervised GRN inference from single-cell RNA-seq data.

Path Planning for Masked Diffusion Model Sampling

Fred Zhangzhi Peng^*, Zachary Bezemek^*, Sawan Patel, Jarrid Rector-Brooks, Sherwood Yao, Alexander Tong^†, Pranam Chatterjee^†

arXiv preprint

2/1/2025

Any-order generation of discrete data using masked diffusion models (MDMs) offers a compelling alternative to traditional autoregressive models, especially in domains that lack a natural causal ordering of data. However, current popular MDMs depart from their successful continuous diffusion model counterparts with simplified masked inference wherein unmasked tokens cannot be iteratively refined -- even if there is a mistake. In this paper, we extract the full power of MDMs by introducing a novel inference sampling strategy termed Path Planning (P2) that decomposes each generation step into two sub-stages: planning and denoising. Under P2, the planner at every step selects appropriate tokens that are marked to be updated, which can then be sampled using the denoiser. We demonstrate that P2 generalizes all existing sampling strategies for MDMs and critically enhances generative quality through the new capability of refining and updating existing unmasked tokens. We theoretically prove that P2 establishes a (new) expanded evidence lower bound (ELBO) on the log marginal likelihood of data. We instantiate P2 with a family of planners including (1) Self-Planning, (2) BERT-Planning, and (3) Trained-Planning with a learned planner, leading to SOTA generative performance for MDMs on a suite of domains. Specifically, solely using P2 inference, we observe relative improvements of 22% in protein sequence foldability, 8% in RNA sequence pLDDT, 4% in math reasoning, 68% in story generation (ROUGE score), and 33% in code generation for the challenging pass@1 metric.

2024

Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

Guillaume Huguet^*, James Vuckovic^*, Kilian Fatras, Eric Thibodeau-Laufer, Pablo Lemos, Riashat Islam, Cheng-Hao Liu, Jarrid Rector-Brooks, Tara Akhound-Sadegh, Michael Bronstein, Alexander Tong^†, Avishek Joey Bose^†

In NeurIPS 2024

12/10/2024

Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples -- crucial for de-novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structures diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.

Metric Flow Matching for Smooth Interpolations on the Data Manifold

Kacper Kapusniak, Peter Potaptchik, Teodora Reu, Leo Zhang, Alexander Tong, Michael Bronstein, Avishek Joey Bose, Francesco Di Giovanni

In NeurIPS 2024

12/10/2024

Matching objectives underpin the success of modern generative models and rely on constructing conditional paths that transform a source distribution into a target distribution. Despite being a fundamental building block, conditional paths have been designed principally under the assumption of Euclidean geometry, resulting in straight interpolations. However, this can be particularly restrictive for tasks such as trajectory inference, where straight paths might lie outside the data manifold, thus failing to capture the underlying dynamics giving rise to the observed marginals. In this paper, we propose Metric Flow Matching (MFM), a novel simulation-free framework for conditional flow matching where interpolants are approximate geodesics learned by minimizing the kinetic energy of a data-induced Riemannian metric. This way, the generative model matches vector fields on the data manifold, which corresponds to lower uncertainty and more meaningful interpolations. We prescribe general metrics to instantiate MFM, independent of the task, and test it on a suite of challenging problems including LiDAR navigation, unpaired image translation, and modeling cellular dynamics. We observe that MFM outperforms the Euclidean baselines, particularly achieving SOTA on single-cell trajectory prediction.

A Computational Framework for Solving Wasserstein Lagrangian Flows

Kirill Neklyudov^*, Rob Brekelmans^*, Alexander Tong, Lazar Atanackovic, Qiang Liu, Alireza Makhzani

In ICML 2024

7/21/2024

The dynamical formulation of the optimal transport can be extended through various choices of the underlying geometry (kinetic energy), and the regularization of density paths (potential energy). These combinations yield different variational problems (Lagrangians), encompassing many variations of the optimal transport problem such as the Schrödinger bridge, unbalanced optimal transport, and optimal transport with physical constraints, among others. In general, the optimal density path is unknown, and solving these variational problems can be computationally challenging. Leveraging the dual formulation of the Lagrangians, we propose a novel deep learning based framework approaching all of these problems from a unified perspective. Our method does not require simulating or backpropagating through the trajectories of the learned dynamics, and does not need access to optimal couplings. We showcase the versatility of the proposed framework by outperforming previous approaches for the single-cell trajectory inference, where incorporating prior knowledge into the dynamics is crucial for correct predictions.

Iterated Denoising Energy Matching for Sampling from Boltzmann Densities

Tara Akhound-Sadegh^*, Jarrid Rector-Brooks^*, Avishek Joey Bose^*, Sarthak Mittal, Pablo Lemos, Cheng-Hao Liu, Marcin Sendera, Siamak Ravanbakhsh, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Alexander Tong

In ICML 2024

7/21/2024

Efficiently generating statistically independent samples from an unnormalized probability distribution, such as equilibrium samples of many-body systems, is a foundational problem in science. In this paper, we propose Iterated Denoising Energy Matching (iDEM), an iterative algorithm that uses a novel stochastic score matching objective leveraging solely the energy function and its gradient -- and no data samples -- to train a diffusion-based sampler. Specifically, iDEM alternates between (I) sampling regions of high model density from a diffusion-based sampler and (II) using these samples in our stochastic matching objective to further improve the sampler. iDEM is scalable to high dimensions as the inner matching objective is simulation-free and requires no MCMC samples. Moreover, by leveraging the fast mode mixing behavior of diffusion, iDEM smooths out the energy landscape, enabling efficient exploration and learning of an amortized sampler. We evaluate iDEM on a suite of tasks ranging from standard synthetic energy functions to invariant n-body particle systems. We show that the proposed approach achieves state-of-the-art performance on all metrics and trains 2-5× faster, which allows it to be the first method to train using energy on the challenging 55-particle Lennard-Jones system.

Learnable Filters for Geometric Scattering Modules

Alexander Tong^*, Frederik Wenkel^*, Dhananjay Bhaskar, Kincaid Macdonald, Jackson Grady, Michael Perlmutter, Smita Krishnaswamy, Guy Wolf

In IEEE Transactions on Signal Processing

6/15/2024

SE(3)-Stochastic Flow Matching for Protein Backbone Generation

Avishek Joey Bose^*, Tara Akhound-Sadegh^*, Kilian Fatras, Guillaume Huguet, Jarrid Rector-Brooks, Cheng-Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael Bronstein, Alexander Tong

In ICLR 2024 (Spotlight)

5/7/2024

The computational design of novel protein structures has the potential to impact numerous scientific disciplines greatly. Toward this goal, we introduce FoldFlow, a series of novel generative models of increasing modeling power based on the flow-matching paradigm over 3D rigid motions -- i.e. the group SE(3) -- enabling accurate modeling of protein backbones. We first introduce FoldFlow-Base, a simulation-free approach to learning deterministic continuous-time dynamics and matching invariant target distributions on SE(3). We next accelerate training by incorporating Riemannian optimal transport to create FoldFlow-OT, leading to the construction of simpler and more stable flows. Finally, we design FoldFlow-SFM, coupling both Riemannian OT and simulation-free training to learn stochastic continuous-time dynamics over SE(3). Our family of FoldFlow generative models offers several key advantages over previous approaches to the generative modeling of proteins: they are more stable and faster to train than diffusion-based approaches, and our models enjoy the ability to map any invariant source distribution to any invariant target distribution over SE(3). Empirically, we validate our FoldFlow models on protein backbone generation of up to 300 amino acids, leading to high-quality designable, diverse, and novel samples.

Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport

Alexander Tong^*, Nikolay Malkin^*, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, Yoshua Bengio

In Transactions on Machine Learning Research (TMLR), 2024

5/4/2024

Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have been held back by limitations in their simulation-based maximum likelihood training. We introduce the generalized conditional flow matching (CFM) technique, a family of simulation-free training objectives for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, CFM does not require the source distribution to be Gaussian or require evaluation of its density. A variant of our objective is optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Furthermore, OT-CFM is the first method to compute dynamic OT in a simulation-free way. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks, such as inferring single cell dynamics, unsupervised image translation, and Schrödinger bridge inference.

2023

A Heat Diffusion Perspective on Geodesic Preserving Dimensionality Reduction

Guillaume Huguet^*, Alexander Tong^*, Edward De Brouwer^*, Yanlei Zhang, Guy Wolf, Ian Adelstein, Smita Krishnaswamy

In NeurIPS

12/10/2023

DynGFN: Bayesian Dynamic Causal Discovery Using Generative Flow Networks

Lazar Atanackovic^*, Alexander Tong^*, Jason Hartford, Leo J. Lee, Bo Wang, Yoshua Bengio

In NeurIPS. Also presented at Frontiers4LCD Workshop @ NeurIPS 2022

12/10/2023

One of the grand challenges of cell biology is inferring the gene regulatory network (GRN) which describes interactions between genes and their products that control gene expression and cellular function. We can treat this as a causal discovery problem but with two non-standard challenges: (1) regulatory networks are inherently cyclic so we should not model a GRN as a directed acyclic graph (DAG), and (2) observations have significant measurement noise, so for typical sample sizes there will always be a large equivalence class of graphs that are likely given the data, and we want methods that capture this uncertainty. Existing methods either focus on challenge (1), identifying *cyclic* structure from dynamics, or on challenge (2) learning complex Bayesian *posteriors* over DAGs, but not both. In this paper we leverage the fact that it is possible to estimate the "velocity" of gene expression with *RNA velocity* techniques to develop an approach that addresses both challenges. Because we have access to velocity information, we can treat the Bayesian structure learning problem as a problem of sparse identification of a dynamical system, capturing cyclic feedback loops through time. Since our objective is to model uncertainty over discrete structures, we leverage Generative Flow Networks (GFlowNets) to estimate the posterior distribution over the combinatorial space of possible sparse dependencies. Our results indicate that our method learns posteriors that better encapsulate the distributions of cyclic structures compared to counterpart state-of-the-art Bayesian structure learning approaches.

Causal Inference in Gene Regulatory Networks with GFlowNet: Towards Scalability in Large Systems

Trang Nguyen, Alexander Tong, Kanika Madan, Yoshua Bengio^†, Dianbo Liu^†

In arXiv

10/1/2023

Understanding causal relationships within Gene Regulatory Networks (GRNs) is essential for unraveling the gene interactions in cellular processes. However, causal discovery in GRNs is a challenging problem for multiple reasons including the existence of cyclic feedback loops and uncertainty that yields diverse possible causal structures. Previous works in this area either ignore cyclic dynamics (assume acyclic structure) or struggle with scalability. We introduce Swift-DynGFN as a novel framework that enhances causal structure learning in GRNs while addressing scalability concerns. Specifically, Swift-DynGFN exploits gene-wise independence to boost parallelization and to lower computational cost. Experiments on real single-cell RNA velocity and synthetic GRN datasets showcase the advancement in learning causal structure in GRNs and scalability in larger systems.

Geodesic Sinkhorn for Fast and Accurate Optimal Transport on Manifolds

Guillaume Huguet^*, Alexander Tong^*, María Ramos Zapatero, Christopher J. Tape, Guy Wolf, Smita Krishnaswamy

In IEEE MLSP

9/1/2023

Efficient computation of optimal transport distance between distributions is of growing importance in data science. Sinkhorn-based methods are currently the state of the art for such computations, but require $O(n^2)$ computations. In addition, Sinkhorn-based methods commonly use a Euclidean ground distance between datapoints. However, with the prevalence of manifold-structured scientific data, it is often desirable to consider geodesic ground distance. Here, we tackle both issues by proposing Geodesic Sinkhorn, a method based on diffusing a heat kernel on a manifold graph. Notably, Geodesic Sinkhorn requires only $O(n\log n)$ computation, as we approximate the heat kernel with Chebyshev polynomials based on the sparse graph Laplacian. We apply our method to the computation of barycenters of several distributions of high dimensional single cell data from patient samples undergoing chemotherapy. In particular, we define the barycentric distance as the distance between two such barycenters. Using this definition, we identify an optimal transport distance and path associated with the effect of treatment on cellular data.

Neural FIM for Learning Fisher Information Metrics from Point Cloud Data

Oluwadamilola Fasina^*, Guillaume Huguet^*, Alexander Tong, Yanlei Zhang, Guy Wolf, Maximilian Nickel, Ian Adelstein, Smita Krishnaswamy

In ICML

7/21/2023

Graph Fourier MMD for signals on data graphs

Sam Leone, Alexander Tong, Guillaume Huguet, Guy Wolf, Smita Krishnaswamy

In SAMPTA

7/10/2023

While numerous methods have been proposed for computing distances between probability distributions in Euclidean space, relatively little attention has been given to computing such distances for distributions on graphs. However, there has been a marked increase in data that either lies on a graph (such as protein interaction networks) or can be modeled as a graph (single cell data), particularly in the biomedical sciences. Thus, it becomes important to find ways to compare signals defined on such graphs. Here, we propose Graph Fourier MMD (GFMMD), a novel distance between distributions and signals on graphs. GFMMD is defined via an optimal witness function that is both smooth on the graph and maximizes difference in expectation between the pair of distributions on the graph. We find an analytical solution to this optimization problem as well as an embedding of distributions that results from this method. We also prove several properties of this method including scale invariance and applicability to disconnected graphs. We showcase it on graph benchmark datasets as well as on single cell RNA-sequencing data analysis. In the latter, we use the GFMMD-based gene embeddings to find meaningful gene clusters. We also propose a novel type of score for gene selection called the gene localization score, which helps select genes for cellular state space characterization.

Single-Cell Analysis Reveals Inflammatory Interactions Driving Macular Degeneration

Manik Kuchroo^*, Marcello DiStasio^*, Eric Song^*, Eda Calapkulu, Le Zhang, Maryam Ige, Amar H. Sheth, Abdelilah Majdoubi, Madhvi Menon, Alexander Tong, Abhinav Godavarthi, Yu Xing, Scott Gigante, Holly Steach, Jessie Huang, Guillaume Huguet, Janhavi Narain, Kisung You, George Mourgkos, Rahul M. Dhodapkar, Matthew J. Hirn, Bastian Rieck, Guy Wolf, Smita Krishnaswamy^†, Brian P. Hafler^†

In Nature Communications

6/15/2023

Due to commonalities in pathophysiology, age-related macular degeneration (AMD) represents a uniquely accessible model to investigate therapies for neurodegenerative diseases, leading us to examine whether pathways of disease progression are shared across neurodegenerative conditions. Here we use single-nucleus RNA sequencing to profile lesions from 11 postmortem human retinas with age-related macular degeneration and 6 control retinas with no history of retinal disease. We create a machine-learning pipeline based on recent advances in data geometry and topology and identify activated glial populations enriched in the early phase of disease. Examining single-cell data from Alzheimer’s disease and progressive multiple sclerosis with our pipeline, we find a similar glial activation profile enriched in the early phase of these neurodegenerative diseases. In late-stage age-related macular degeneration, we identify a microglia-to-astrocyte signaling axis mediated by interleukin-1β which drives angiogenesis characteristic of disease pathogenesis. We validated this mechanism using in vitro and in vivo assays in mouse, identifying a possible new therapeutic target for AMD and possibly other neurodegenerative conditions. Thus, due to shared glial states, the retina provides a potential system for investigating therapeutic approaches in neurodegenerative diseases.

Trellis tree-based analysis reveals stromal regulation of patient-derived organoid drug responses

María Ramos Zapatero^*, Alexander Tong^*, Jahangir Sufi, Petra Vlckova, Ferran Cardoso Rodriguez, Callum Nattress, Xiao Qin, Daniel Hochhauser, Smita Krishnaswamy, Christopher J. Tape

In Cell

6/15/2023

Patient-derived organoids (PDOs) can model personalized therapy responses; however, current screening technologies cannot reveal drug response mechanisms or how tumor microenvironment cells alter therapeutic performance. To address this, we developed a highly multiplexed mass cytometry platform to measure post-translational modification (PTM) signaling, DNA damage, cell-cycle activity, and apoptosis in >2,500 colorectal cancer (CRC) PDOs and cancer-associated fibroblasts (CAFs) in response to clinical therapies at single-cell resolution. To compare patient- and microenvironment-specific drug responses in thousands of single-cell datasets, we developed “Trellis”—a highly scalable, tree-based treatment effect analysis method. Trellis single-cell screening revealed that on-target cell-cycle blockage and DNA-damage drug effects are common, even in chemorefractory PDOs. However, drug-induced apoptosis is rarer, patient-specific, and aligns with cancer cell PTM signaling. We find that CAFs can regulate PDO plasticity—shifting proliferative colonic stem cells (proCSCs) to slow-cycling revival colonic stem cells (revCSCs) to protect cancer cells from chemotherapy.

Understanding Graph Neural Networks with Generalized Geometric Scattering Transforms

Michael Perlmutter, Alexander Tong, Feng Gao, Guy Wolf, Matthew Hirn

In SIAM Journal on Mathematics of Data Science (SIMODS), 2023

5/15/2023

The scattering transform is a multilayered wavelet-based deep learning architecture that acts as a model of convolutional neural networks. Recently, several works have introduced generalizations of the scattering transform for non-Euclidean settings such as graphs. Our work builds upon these constructions by introducing windowed and non-windowed geometric scattering transforms for graphs based upon a very general class of asymmetric wavelets. We show that these asymmetric graph scattering transforms have many of the same theoretical guarantees as their symmetric counterparts. As a result, the proposed construction unifies and extends known theoretical results for many of the existing graph scattering architectures. In doing so, this work helps bridge the gap between geometric scattering and other graph neural networks by introducing a large family of networks with provable stability and invariance guarantees. These results lay the groundwork for future deep learning architectures for graph-structured data that have learned filters and also provably have desirable theoretical properties.

Learning Transcriptional and Regulatory Dynamics Driving Cancer Cell Plasticity Using Neural ODE-Based Optimal Transport

Alexander Tong^*, Manik Kuchroo^*, Shabarni Gupta, Aarthi Venkat, Beatriz P. San Juan, Laura Rangel, Brandon Zhu, John G. Lock, Christine L. Chaffer, Smita Krishnaswamy

In BioRxiv

4/1/2023

2022

Manifold Interpolating Optimal-Transport Flows for Trajectory Inference

Guillaume Huguet^*, D. S. Magruder^*, Alexander Tong^*, Oluwadamilola Fasina, Manik Kuchroo, Guy Wolf, Smita Krishnaswamy

In NeurIPS

12/10/2022

We present a method called Manifold Interpolating Optimal-Transport Flow (MIOFlow) that learns stochastic, continuous population dynamics from static snapshot samples taken at sporadic timepoints. MIOFlow combines dynamic models, manifold learning, and optimal transport by training neural ordinary differential equations (Neural ODE) to interpolate between static population snapshots as penalized by optimal transport with manifold ground distance. Further, we ensure that the flow follows the geometry by operating in the latent space of an autoencoder that we call a geodesic autoencoder (GAE). In GAE the latent space distance between points is regularized to match a novel multiscale geodesic distance on the data manifold that we define. We show that this method is superior to normalizing flows, Schrödinger bridges and other generative models that are designed to flow from noise to data in terms of interpolating between populations. Theoretically, we link these trajectories with dynamic optimal transport. We evaluate our method on simulated data with bifurcations and merges, as well as scRNA-seq data from embryoid body differentiation, and acute myeloid leukemia treatment.

Immune Cells and Their Inflammatory Mediators Modify Beta Cells and Cause Checkpoint Inhibitor-Induced Diabetes

Ana Luisa Perdigoto, Songyan Deng, Katherine C. Du, Manik Kuchroo, Daniel B. Burkhardt, Alexander Tong, Gary Israel, Marie E. Robert, Stuart P. Weisberg, Nancy Kirkiles-Smith, Angeliki M. Stamatouli, Harriet M. Kluger, Zoe Quandt, Arabella Young, Mei-Ling Yang, Mark J. Mamula, Jordan S. Pober, Mark S. Anderson, Smita Krishnaswamy, Kevan C. Herold

In JCI Insight 7(17), e156330, 2022

6/15/2022

Checkpoint inhibitors (CPIs) targeting programmed death 1 (PD-1)/programmed death ligand 1 (PD-L1) and cytotoxic T lymphocyte antigen 4 (CTLA-4) have revolutionized cancer treatment but can trigger autoimmune complications, including CPI-induced diabetes mellitus (CPI-DM), which occurs preferentially with PD-1 blockade. We found evidence of pancreatic inflammation in patients with CPI-DM with shrinkage of pancreases, increased pancreatic enzymes, and in a case from a patient who died with CPI-DM, peri-islet lymphocytic infiltration. In the NOD mouse model, anti-PD-L1 but not anti-CTLA-4 induced diabetes rapidly. RNA sequencing revealed that cytolytic IFN-γ+CD8+ T cells infiltrated islets with anti-PD-L1. Changes in β cells were predominantly driven by IFN-γ and TNF-α and included induction of a potentially novel β cell population with transcriptional changes suggesting dedifferentiation. IFN-γ increased checkpoint ligand expression and activated apoptosis pathways in human β cells in vitro. Treatment with anti-IFN-γ and anti-TNF-α prevented CPI-DM in anti-PD-L1-treated NOD mice. CPIs targeting the PD-1/PD-L1 pathway resulted in transcriptional changes in β cells and immune infiltrates that may lead to the development of diabetes. Inhibition of inflammatory cytokines can prevent CPI-DM, suggesting a strategy for clinical application to prevent this complication.

Multiscale PHATE identifies multimodal signatures of COVID-19

Manik Kuchroo^*, Jessie Huang^*, Patrick Wong^*, Jean-Christophe Grenier, Dennis Shung, Alexander Tong, Carolina Lucas, Jon Klein, Daniel Burkhardt, Scott Gigante, Abhinav Godavarthi, Benjamin Israelow, Tianyang Mao, Ji Eun Oh, Julio Silva, Takehiro Takahashi, Camila D. Odio, Arnau Casanovas-Massana, John Fournier, Yale IMPACT Team, Shelli Farhadian, Charles S. Dela Cruz, Albert I. Ko, F. Perry Wilson, Julie Hussin^†, Guy Wolf^†, Akiko Iwasaki^†, Smita Krishnaswamy^†

In Nature Biotechnology

6/15/2022

As the biomedical community produces datasets that are increasingly complex and high dimensional, there is a need for more sophisticated computational tools to extract biological insights. We present Multiscale PHATE, a method that sweeps through all levels of data granularity to learn abstracted biological features directly predictive of disease outcome. Built on a coarse-graining process called diffusion condensation, Multiscale PHATE learns a data topology that can be analyzed at coarse resolutions for high-level summarizations of data and at fine resolutions for detailed representations of subsets. We apply Multiscale PHATE to a coronavirus disease 2019 (COVID-19) dataset with 54 million cells from 168 hospitalized patients and find that patients who die show CD16hiCD66blo neutrophil and IFN-γ+ granzyme B+ Th17 cell responses. We also show that population groupings from Multiscale PHATE directly fed into a classifier predict disease outcome more accurately than naive featurizations of the data. Multiscale PHATE is broadly generalizable to different data types, including flow cytometry, single-cell RNA sequencing (scRNA-seq), single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq), and clinical variables.

Embedding Signals on Knowledge Graphs with Unbalanced Diffusion Earth Mover's Distance

Alexander Tong, Guillaume Huguet, Dennis Shung, Amine Natik, Manik Kuchroo, Guillaume Lajoie, Guy Wolf, Smita Krishnaswamy

In ICASSP

5/15/2022

In modern relational machine learning it is common to encounter large graphs that arise via interactions or similarities between observations in many domains. Further, in many cases the target entities for analysis are actually signals on such graphs. We propose to compare and organize such datasets of graph signals by using an earth mover's distance (EMD) with a geodesic cost over the underlying graph. Typically, EMD is computed by optimizing over the cost of transporting one probability distribution to another over an underlying metric space. However, this is inefficient when computing the EMD between many signals. Here, we propose an unbalanced graph earth mover's distance that efficiently embeds the unbalanced EMD on an underlying graph into an L1 space, whose metric we call unbalanced diffusion earth mover's distance (UDEMD). This leads us to an efficient nearest neighbors kernel over many signals defined on a large graph. Next, we show how this gives distances between graph signals that are robust to noise. Finally, we apply this to organizing patients based on clinical notes who are modelled as signals on the SNOMED-CT medical knowledge graph, embedding lymphoblast cells modeled as signals on a gene graph, and organizing genes modeled as signals over a large peripheral blood mononuclear (PBMC) cell graph. In each case, we show that UDEMD-based embeddings find accurate distances that are highly efficient compared to other methods.

2021

A sandbox for prediction and integration of DNA, RNA, and protein data in single cells

Malte D Luecken, Daniel B Burkhardt, Robrecht Cannoodt, Christopher Lance, Aditi Agrawal, Hananeh Aliee, Ann T Chen, Louise Deconinck, Angela M Detweiler, Alejandro Granados, Shelly Huynh, Laura Isacco, Yang Joon Kim, Sunil Kuppasani, Heiko Lickert, Aaron McGeever, Honey Mekonen, Joaquin Caceres, Maurizio Morri, Michaela Mueller, Norma F Neff, Sheryl Paul, Kaylie Schneider, Scott Steelman, Michael Sterr, Dan J Treacy, Alexander Tong, Alexandra-Chloé Villani, Guilin Wang, Jia Yan, Ce Zhang, Angela O Pisco, Smita Krishnaswamy, Fabian J Theis, Jonathan M Bloom

In NeurIPS Datasets and Benchmarks

12/10/2021

The last decade has witnessed a technological arms race to encode the molecular states of cells into DNA libraries, turning DNA sequencers into scalable single-cell microscopes. Single-cell measurement of chromatin accessibility (DNA), gene expression (RNA), and proteins has revealed rich cellular diversity across tissues, organisms, and disease states. However, single-cell data poses a unique set of challenges. A dataset may comprise millions of cells with tens of thousands of sparse features. Identifying biologically relevant signals from the background sources of technical noise requires innovation in predictive and representational learning. Furthermore, unlike in machine vision or natural language processing, biological ground truth is limited. Here we leverage recent advances in multi-modal single-cell technologies which, by simultaneously measuring two layers of cellular processing in each cell, provide ground truth analogous to language translation. We define three key tasks to predict one modality from another and learn integrated representations of cellular state. We also generate a novel dataset of the human bone marrow specifically designed for benchmarking studies. The dataset and tasks are accessible through an open-source framework that facilitates centralized evaluation of community-submitted methods.

MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data

Michal Gerasimiuk^*, Dennis L. Shung^*, Alexander Tong, Adrian J. Stanley, Michael Shultz, Jeffrey Ngu, Loren Laine, Guy Wolf, Smita Krishnaswamy

In IEEE Big Data

12/5/2021

A major challenge in embedding or visualizing clinical patient data is the heterogeneity of variable types including continuous lab values, categorical diagnostic codes, as well as missing or incomplete data. In particular, in EHR data, some variables are missing not at random (MNAR) but deliberately not collected and thus are a source of information. For example, lab tests may be deemed necessary for some patients on the basis of suspected diagnosis, but not for others. Here we present the MURAL forest -- an unsupervised random forest for representing data with disparate variable types (e.g., categorical, continuous, MNAR). MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random, such that the marginal entropy of all other variables is minimized by the split. This allows us to also split on MNAR variables and discrete variables in a way that is consistent with the continuous variables. The end goal is to learn the MURAL embedding of patients using average tree distances between those patients. These distances can be fed to a nonlinear dimensionality reduction method like PHATE to derive visualizable embeddings. While such methods are ubiquitous in continuous-valued datasets (like single cell RNA-sequencing) they have not been used extensively in mixed variable data. We showcase the use of our method on one artificial and two clinical datasets. We show that using our approach, we can visualize and classify data more accurately than competing approaches. Finally, we show that MURAL can also be used to compare cohorts of patients via the recently proposed tree-sliced Wasserstein distances.

Multimodal data visualization and denoising with integrated diffusion

Manik Kuchroo^*, Abhinav Godavarthi^*, Alexander Tong, Smita Krishnaswamy, Guy Wolf

In IEEE MLSP

9/1/2021

We propose a method called integrated diffusion for combining multimodal data, gathered via different sensors on the same system, to create an integrated data diffusion operator. As real world data suffers from both local and global noise, we introduce mechanisms to optimally calculate a diffusion operator that reflects the combined information in data by maintaining low frequency eigenvectors of each modality both globally and locally. We show the utility of this integrated operator in denoising and visualizing multimodal toy data as well as multi-omic data generated from blood cells, measuring both gene expression and chromatin accessibility. Our approach better visualizes the geometry of the integrated data and captures known cross-modality associations. More generally, integrated diffusion is broadly applicable to multimodal datasets generated by noisy sensors collected in a variety of fields.

Data-Driven Learning of Geometric Scattering Networks

Alexander Tong^*, Frederik Wenkel^*, Kincaid MacDonald, Smita Krishnaswamy, Guy Wolf

In IEEE MLSP. Also presented at ML4M Workshop @ NeurIPS 2020

9/1/2021

Graph neural networks (GNNs) in general, and graph convolutional networks (GCN) in particular, often rely on low-pass graph filters to incorporate geometric information in the form of local smoothness over neighboring nodes. While this approach performs well on a surprising number of standard benchmarks, the efficacy of such models does not translate consistently to more complex domains, such as graph data in the biochemistry domain. We argue that these more complex domains require priors that encourage learning of band-pass and high-pass features rather than oversmoothed signals of standard GCN architectures. Here, we propose an alternative GNN architecture, based on a relaxation of recently proposed geometric scattering transforms, which consists of a cascade of graph wavelet filters. Our learned geometric scattering (LEGS) architecture adaptively tunes these wavelets and their scales to encourage band-pass features to emerge in learned representations. This results in a simplified GNN with significantly fewer learned parameters compared to competing methods. We demonstrate the predictive performance of our method on several biochemistry graph classification benchmarks, as well as the descriptive quality of its learned features in biochemical graph data exploration tasks. Our results show that the proposed LEGS network matches or outperforms popular GNNs, as well as the original geometric scattering construction, while also retaining certain mathematical properties of its handcrafted (nonlearned) design.

Diffusion Earth Mover's Distance and Distribution Embeddings

Alexander Tong^*, Guillaume Huguet^*, Amine Natik^*, Kincaid MacDonald, Manik Kuchroo, Ronald Coifman, Guy Wolf, Smita Krishnaswamy

In ICML. Also presented at LMRL Workshop @ NeurIPS 2020

7/21/2021

We propose a new fast method of measuring distances between large numbers of related high dimensional datasets called the Diffusion Earth Mover's Distance (EMD). We model the datasets as distributions supported on a common data graph that is derived from the affinity matrix computed on the combined data. In such cases where the graph is a discretization of an underlying Riemannian closed manifold, we prove that Diffusion EMD is topologically equivalent to the standard EMD with a geodesic ground distance. Diffusion EMD can be computed in $\tilde{O}(n)$ time and is more accurate than similarly fast algorithms such as tree-based EMDs. We also show Diffusion EMD is fully differentiable, making it amenable to future uses in gradient-descent frameworks such as deep neural networks. Finally, we demonstrate an application of Diffusion EMD to single cell data collected from 210 COVID-19 patient samples at Yale New Haven Hospital. Here, Diffusion EMD can derive distances between patients on the manifold of cells at least two orders of magnitude faster than equally accurate methods. This distance matrix between patients can be embedded into a higher level patient manifold which uncovers structure and heterogeneity in patients. More generally, Diffusion EMD is applicable to all datasets that are massively collected in parallel in many medical and biological systems.

Quantifying the effect of experimental perturbations in single-cell RNA-sequencing data using graph signal processing

Daniel B. Burkhardt^*, Jay S. Stanley^*, Alexander Tong, Ana Luisa Perdigoto, Scott A. Gigante, Kevan C. Herold, Guy Wolf, Antonio J. Giraldez, David van Dijk, Smita Krishnaswamy

In Nature Biotechnology

6/15/2021

Single-cell RNA-sequencing (scRNA-seq) is a powerful tool to quantify transcriptional states in thousands to millions of cells. It is increasingly common for scRNA-seq data to be collected in multiple conditions to measure the effect of an experimental perturbation. However, quantifying differences between scRNA-seq datasets remains an analytical challenge. Previous efforts at quantifying such differences focus on discrete regions of the transcriptional state space such as clusters of cells. Here, we describe a continuous measure of the effect of an experiment across the transcriptomic space with single cell resolution. First, we use the manifold assumption to model the cellular state space as a graph with cells as nodes and edges connecting cells with similar transcriptomic profiles. Next, we calculate an Enhanced Experimental Signal (EES) that estimates the likelihood of observing cells from each condition at every point in the manifold. We show that the EES has useful properties for analysis of single cell perturbation studies. We show that we can use the magnitude and frequency of the EES, using an algorithm we call vertex frequency clustering, to identify specific populations of cells that are or are not affected by an experimental treatment at the appropriate level of granularity. Using these selected populations we can derive gene signatures of affected populations of cells. We demonstrate both algorithms using a combination of biological and synthetic datasets. Implementations are provided in the MELD Python package, which is available at https://github.com/KrishnaswamyLab/MELD.

POT: Python Optimal Transport

Remi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z Alaya, Aurelie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Leo Gautheron, Nathalie T H Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J Sutherland, Romain Tavenard, Alexander Tong, Titouan Vayer

In JMLR

6/15/2021

Optimal transport has recently been reintroduced to the machine learning community thanks in part to novel efficient optimization procedures allowing for medium to large scale applications. We propose a Python toolbox that implements several key optimal transport ideas for the machine learning community. The toolbox contains implementations of a number of founding works of OT for machine learning such as Sinkhorn algorithm and Wasserstein barycenters, but also provides generic solvers that can be used for conducting novel fundamental research. This toolbox, named POT for Python Optimal Transport, is open source with an MIT license.

Fixing Bias in Reconstruction-based Anomaly Detection with Lipschitz Discriminators

Alexander Tong, Guy Wolf, Smita Krishnaswamy

Journal version in Journal of Signal Processing Systems (2021). Presented at IEEE MLSP 2020 (*Best Student Paper Award*).

6/15/2021

Anomaly detection is of great interest in fields where abnormalities need to be identified and corrected (e.g., medicine and finance). Deep learning methods for this task often rely on autoencoder reconstruction error, sometimes in conjunction with other errors. We show that this approach exhibits intrinsic biases that lead to undesirable results. Reconstruction-based methods are sensitive to training-data outliers and simple-to-reconstruct points. Instead, we introduce a new unsupervised Lipschitz anomaly discriminator that does not suffer from these biases. Our anomaly discriminator is trained, similar to the ones used in GANs, to detect the difference between the training data and corruptions of the training data. We show that this procedure successfully detects unseen anomalies with guarantees on those that have a certain Wasserstein distance from the data or corrupted training set. These additions allow us to show improved performance on MNIST, CIFAR10, and health record data.

Abstract 2839: Understanding the mesenchymal-to-epithelial transition and its drivers in triple-negative breast cancer with continuous normalizing flows

Alexander Tong, Beatriz P. San Juan, Brandon Zhu, Christine L. Chaffer, Smita Krishnaswamy

AACR

4/15/2021

Here we focus on understanding mechanisms that drive dynamic changes in gene expression and epigenetic marks that enable triple negative breast cancer cells to change states, and to thereby invade tissues and seed secondary tumors. The epithelial-to-mesenchymal transition (EMT) facilitates invasion and migration away from the primary tumor site. However, it is increasingly apparent that the reverse process, the mesenchymal-to-epithelial transition (MET), enhances metastatic colonization and growth via reacquisition of the epithelial phenotype. With no therapies currently available to stop metastatic tumor growth, we aim to uncover the mechanisms driving the MET towards identifying novel anti-metastatic therapies. We use the 3D in vitro mammosphere model system where single tumor-initiating cells residing in a partial-EMT state develop into a 3D organoid over 30 days. We sampled cells at 5 time points and performed scRNA-seq and scATAC-seq to analyze cell states. We develop a novel computational model of cellular development based on the theory of dynamic optimal transport (OT) and continuous normalizing flows. Our model TrajectoryNet is a neural ODE (ordinary differential equation) network that models the gradient of cell state with respect to time continuously over the input space and over time from cross-sectional single-cell data. TrajectoryNet interpolates between collected timepoints and learns a continuous realistic progression that describes cellular evolution in terms of gene expression and chromatin accessibility. Key to TrajectoryNet is a unique regularization to penalize the magnitude of the gradient over the flow. We prove this results in dynamic OT, thereby discouraging the neural network from taking circuitous or unrealistic paths. In contrast to TrajectoryNet, pseudotime and RNA velocity are best at analyzing within a particular timepoint and do not handle large gaps in timepoints.
We compare TrajectoryNet to RNA velocity and static OT and show that TrajectoryNet achieves better trajectories in terms of predicting withheld timepoints. Using TrajectoryNet, we identify a continuous ordering of events that occur during MET that show when and how the epithelial cell states begin to emerge. Such a continuous ordering can give rise to causal associations that can be inhibited to alter MET mechanisms. We also differentiate between trajectories that show self-renewal and maintenance of the tumor-initiating cells, and trajectories that revert to an epithelial state. Further we find that only ~10% of the initial seeded cells develop into mammospheres and identify which initial cells have the potential to seed secondary tumors. Hence, we can refine features (gene and epigenetic states) that define aggressive tumor-initiating cells in triple negative breast cancer, as well as their dynamics through the MET in order to find therapeutic targets.
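
The abstract above describes a regularization that penalizes the magnitude of the time derivative along the flow and so discourages circuitous paths. As a rough illustration of why such a penalty favors direct paths, the sketch below computes a discretized kinetic energy for two hand-written paths between the same endpoints. It is an assumption-laden toy (the function `path_energy`, the paths, and the step size `dt` are hypothetical), not the TrajectoryNet implementation, which trains a neural ODE.

```python
def path_energy(path, dt):
    """Discrete kinetic energy: sum of squared speed times dt over each step."""
    energy = 0.0
    for p, q in zip(path, path[1:]):
        # Finite-difference velocity between consecutive points on the path.
        speed_sq = sum(((qi - pi) / dt) ** 2 for pi, qi in zip(p, q))
        energy += speed_sq * dt
    return energy


# Two hypothetical 2D paths from (0, 0) to (1, 0), each taking two steps.
dt = 0.5
straight = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
detour = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]

straight_energy = path_energy(straight, dt)  # 1.0
detour_energy = path_energy(detour, dt)      # 5.0
```

The straight path has the lower energy, so a model penalized by this quantity is pushed toward it. This mirrors the kinetic-energy objective in the dynamic formulation of optimal transport.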

2020

Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings

Egbert Castro, Andrew Benz, Alexander Tong, Guy Wolf^†, Smita Krishnaswamy^†

In IEEE Big Data. Also at GRLB Workshop @ ICML 2020

12/5/2020

Biomolecular graph analysis has recently gained much attention in the emerging field of geometric deep learning. While numerous approaches aim to train classifiers that accurately predict molecular properties from graphs that encode their structure, an equally important task is to organize biomolecular graphs in ways that expose meaningful relations and variations between them. We propose a geometric scattering autoencoder (GSAE) network for learning such graph embeddings. Our embedding network first extracts rich graph features using the recently proposed geometric scattering transform. Then, it leverages a semi-supervised variational autoencoder to extract a low-dimensional embedding that retains the information in these features that enable prediction of molecular properties as well as characterize graphs. Our approach is based on the intuition that geometric scattering generates multi-resolution features with in-built invariance to deformations, but as they are unsupervised, these features may not be tuned for optimally capturing relevant domain-specific properties. We demonstrate the effectiveness of our approach to data exploration of RNA foldings. Like proteins, RNA molecules can fold to create low energy functional structures such as hairpins, but the landscape of possible folds and fold sequences are not well visualized by existing methods. We show that GSAE organizes RNA graphs both by structure and energy, accurately reflecting bistable RNA structures. Furthermore, it enables interpolation of embedded molecule sequences mimicking folding trajectories. Finally, using an auxiliary inverse-scattering model, we demonstrate our ability to generate synthetic RNA graphs along the trajectory thus providing hypothetical folding sequences for further analysis.

Interpretable Neuron Structuring with Graph Spectral Regularization

Alexander Tong^*, David van Dijk^*, Jay S. Stanley III, Matthew Amodio, Kristina Yim, Rebecca Muhle, James Noonan, Guy Wolf, Smita Krishnaswamy

In IDA. Also presented at RLGM Workshop @ ICLR 2019

10/20/2020

While neural networks are powerful approximators used to classify or embed data into lower dimensional spaces, they are often regarded as black boxes with uninterpretable features. Here we propose Graph Spectral Regularization for making hidden layers more interpretable without significantly impacting performance on the primary task. Taking inspiration from spatial organization and localization of neuron activations in biological networks, we use a graph Laplacian penalty to structure the activations within a layer. This penalty encourages activations to be smooth either on a predetermined graph or on a feature-space graph learned from the data via co-activations of a hidden layer of the neural network. We show numerous uses for this additional structure including cluster indication and visualization in biological and image data sets.
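
A graph Laplacian penalty of the kind described above can be written as a weighted sum of squared differences across edges, which equals the quadratic form $x^\top L x$ for the graph Laplacian $L$. The sketch below evaluates that quantity for a single activation vector. The function name `laplacian_penalty`, the path graph, and the activation values are assumptions for illustration and are not taken from the paper's code.

```python
def laplacian_penalty(activations, edges):
    """Sum of w * (x_i - x_j)^2 over edges, i.e. x^T L x for the graph Laplacian L."""
    return sum(w * (activations[i] - activations[j]) ** 2 for i, j, w in edges)


# Hypothetical hidden layer of four neurons arranged on a path graph 0-1-2-3.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]

smooth = [0.1, 0.2, 0.3, 0.4]  # neighboring neurons have similar activations
rough = [0.4, 0.1, 0.3, 0.2]   # same values, shuffled so neighbors differ more

smooth_penalty = laplacian_penalty(smooth, edges)
rough_penalty = laplacian_penalty(rough, edges)
```

Adding a term like this to a training loss, scaled by a coefficient, rewards activation patterns that vary smoothly over the chosen graph.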

TrajectoryNet: A Dynamic Optimal Transport Network for Modeling Cellular Dynamics

Alexander Tong, Jessie Huang, Guy Wolf, David van Dijk, Smita Krishnaswamy

In ICML. Also at LMRL Workshop @ NeurIPS 2019

7/21/2020

It is increasingly common to encounter data from dynamic processes captured by static cross-sectional measurements over time, particularly in biomedical settings. Recent attempts to model individual trajectories from this data use optimal transport to create pairwise matchings between time points. However, these methods cannot model continuous dynamics and non-linear paths that entities can take in these systems. To address this issue, we establish a link between continuous normalizing flows and dynamic optimal transport, that allows us to model the expected paths of points over time. Continuous normalizing flows are generally underconstrained, as they are allowed to take an arbitrary path from the source to the target distribution. We present TrajectoryNet, which controls the continuous paths taken between distributions. We show how this is particularly applicable for studying cellular dynamics in data from single-cell RNA sequencing (scRNA-seq) technologies, and that TrajectoryNet improves upon recently proposed static optimal transport-based models that can be used for interpolating cellular distributions.

Interpolating Optimal Transport Barycenters of Patient Manifolds

Alexander Tong, Smita Krishnaswamy

In ISMB

7/14/2020

Single-cell data is now being collected across many patients in varying conditions. However, data is still relatively expensive. This opens up the opportunity for computational methods to decrease overall cost by inferring a single-cell measurement based on similarity to the meta-data of other similar samples. We examine this problem from an optimal transport perspective. This allows us to leverage a variant of the Sinkhorn algorithm for extremely computationally efficient approximations of transport along discrete manifolds. Our method first constructs the manifold between samples, then aligns this to the manifold of patients, and finally applies this to interpolate a barycenter sample along this manifold. We first show that we are able to interpolate samples between timepoints better than existing methods, e.g., Waddington-OT (Schiebinger et al., 2019, Cell), by accounting for structure between multiple timepoints instead of pairs. We then show how, when the relationship between patients is an inferred manifold, to impute a patient’s single-cell measurements based on other similar single-cell samples by aligning the manifold of patients with that of single-cell measurements. When the manifold of patients exhibits non-linear but intrinsically low-dimensional structure, we are able to more accurately infer a single-cell measurement.
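
For readers unfamiliar with the Sinkhorn algorithm mentioned above, here is a minimal sketch of the standard entropy-regularized version between two small histograms. It is not the manifold variant used in the paper. The function name `sinkhorn`, the regularization strength, and the toy histograms are assumptions made for illustration.

```python
import math


def sinkhorn(a, b, cost, reg=1.0, n_iters=500):
    """Standard Sinkhorn iterations for entropy-regularized optimal transport.

    a, b : source and target histograms (each non-negative, summing to 1).
    cost : cost[i][j] is the cost of moving mass from bin i to bin j.
    Returns a transport plan whose column sums match b and whose row sums
    approach a as the iterations converge.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-cost / reg).
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: three bins on a line with squared-distance cost.
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
cost = [[float((i - j) ** 2) for j in range(3)] for i in range(3)]
plan = sinkhorn(a, b, cost)

row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

Each iteration costs only a pair of matrix-vector products, which is the source of the efficiency the abstract refers to. Smaller values of `reg` give plans closer to unregularized optimal transport but converge more slowly.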

2019

Finding Archetypal Spaces Using Neural Networks

David van Dijk^*, Daniel B. Burkhardt^*, Matthew Amodio, Alexander Tong, Guy Wolf, Smita Krishnaswamy

In IEEE Big Data

12/5/2019

Archetypal analysis is a data decomposition method that describes each observation in a dataset as a convex combination of "pure types" or archetypes. These archetypes represent extrema of a data space in which there is a trade-off between features, such as in biology where different combinations of traits provide optimal fitness for different environments. Existing methods for archetypal analysis work well when a linear relationship exists between the feature space and the archetypal space. However, such methods are not applicable to systems where the feature space is generated non-linearly from the combination of archetypes, such as in biological systems or image transformations. Here, we propose a reformulation of the problem such that the goal is to learn a non-linear transformation of the data into a latent archetypal space. To solve this problem, we introduce Archetypal Analysis network (AAnet), which is a deep neural network framework for learning and generating from a latent archetypal representation of data. We demonstrate state-of-the-art recovery of ground-truth archetypes in non-linear data domains, show AAnet can generate from data geometry rather than from data density, and use AAnet to identify biologically meaningful archetypes in single-cell gene expression data.
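
To make the notion of a convex combination of archetypes concrete, the sketch below reconstructs an observation from three archetypes in the linear setting, which is the classical formulation rather than AAnet itself. The helper name `mix_archetypes` and the example coordinates are hypothetical.

```python
def mix_archetypes(archetypes, weights):
    """Return the convex combination sum_k weights[k] * archetypes[k].

    The weights must be non-negative and sum to one, so the result always
    lies inside the convex hull spanned by the archetypes.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(archetypes[0])
    return [sum(w * arch[d] for w, arch in zip(weights, archetypes))
            for d in range(dim)]


# Three hypothetical archetypes forming a triangle in a 2-D feature space.
archetypes = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]

# An observation that is half the first archetype and a quarter of each other.
observation = mix_archetypes(archetypes, [0.5, 0.25, 0.25])
print(observation)
```

In the non-linear setting the abstract describes, a combination of this kind would hold in a learned latent space rather than in the original feature space.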

Fixing Bias in Reconstruction-based Anomaly Detection with Lipschitz Discriminators

Alexander Tong, Guy Wolf, Smita Krishnaswamy

In Journal of Signal Processing Systems

5/1/2019

Anomaly detection is of great interest in fields where abnormalities need to be identified and corrected (e.g., medicine and finance). Deep learning methods for this task often rely on autoencoder reconstruction error, sometimes in conjunction with other penalties. We show that this approach exhibits intrinsic biases that lead to undesirable results. Reconstruction-based methods can sometimes show low error on simple-to-reconstruct points that are not part of the training data, for example the all-black image. Instead, we introduce a new unsupervised Lipschitz anomaly discriminator (LAD) that does not suffer from these biases. Our anomaly discriminator is trained, similarly to the discriminator of a GAN, to detect the difference between the training data and corruptions of the training data. We show that this procedure successfully detects unseen anomalies with guarantees on those that have a certain Wasserstein distance from the data or corrupted training set. These additions allow us to show improved performance on MNIST, CIFAR10, and health record data. Further, LAD does not require decoding back to the original data space, which makes anomaly detection possible in domains where it is difficult to define a decoder, such as in irregular graph-structured data. Empirically, we show this framework leads to improved performance on image, health record, and graph data.

2018

Allocate-On-Use Space Complexity of Shared-Memory Algorithms

James Aspnes, Bernhard Haeupler, Alexander Tong, Philipp Woelfel

In DISC

10/15/2018

Many fundamental problems in shared-memory distributed computing, including mutual exclusion [8], consensus [18], and implementations of many sequential objects [14], are known to require linear space in the worst case. However, these lower bounds all work by constructing particular executions for any given algorithm that may be both very long and very improbable. The significance of these bounds is justified by an assumption that any space that is used in some execution must be allocated for all executions. This assumption is not consistent with the storage allocation mechanisms of actual practical systems.