Solving molecular biology with massive data generation and AI

Supervision

Ben Lehner
SANGER, 1st Supervisor
Jordi Garcia-Ojalvo
UPF, 2nd Supervisor

Objectives

There has been amazing recent progress in the prediction and design of protein structures using machine learning. However, for most other properties of proteins, including their stability, binding affinities and specificities to other proteins, DNA and drugs, and their dynamics and regulation, there is a huge shortage of well-calibrated training data (labels on sequences). Our objective at the Sanger Institute is to generate data at scale efficiently and cost-effectively to train the next generation of AI models and to better understand and engineer the fundamental sequence-encoded properties of proteins and RNAs.

Methodology

The project could be primarily machine learning, primarily experimental, or a mixture of the two. Experimentally, the focus would be on developing and using massively parallel DNA synthesis-selection-sequencing experiments to generate data at scale, both for training machine learning models and for understanding how sequence encodes the biophysical properties of proteins or RNAs. Computationally, the objective would be to develop, test, and apply models to engineer protein and RNA properties at scale. This work would make use of the lab’s capacity for very large-scale experimental testing to perform massively parallel design-test cycles and/or highly parallel model-derived hypothesis testing using explainable AI and lab-in-the-loop approaches.

Required skills

A strong background in machine learning, statistics or maths, and/or in genomics, biophysics and molecular biology.

Expected Results

Large-scale well-calibrated datasets and AI models, including explainable and interpretable models for protein and RNA variant interpretation, design and optimisation.

Planned Secondments

Host: UPF (J. Garcia-Ojalvo), Duration: 2 Months, When: Year 1, Goal: Fuse structural information with bacterial resistance.

Host: IRB (P. Aloy), Duration: 1 Month, When: Year 2, Goal: Integrate chemical biology knowledge.

Host: MSAID (M. Frejno), Duration: 1 Month, When: Year 3, Goal: Explore potential for collaboration on interpretable and generative AI models.

Enrolment in doctoral programs

University of Cambridge and Wellcome Sanger Institute.

References

Escobedo A, Voigt G, Faure AJ, Lehner B. Genetics, energetics, and allostery in proteins with randomized cores and surfaces. Science. 2025 Jul 24;389(6758):eadq3948.

Beltran A, Jiang X, Shen Y, Lehner B. Site-saturation mutagenesis of 500 human protein domains. Nature. 2025 Jan;637(8047):885-894.

Toledano I, Supek F, Lehner B. Genome-scale quantification and prediction of pathogenic stop codon readthrough by small molecules. Nature Genetics. 2024 Sep;56(9):1914-1924.

Faure AJ*, Domingo J*, Schmiedel JM*, Hidalgo-Carcedo C, Diss G, Lehner B. Mapping the energetic and allosteric landscapes of protein binding domains. Nature. 2022 Apr;604(7904):175-183.