Advances in machine learning and AI systems are increasingly influencing how we approach the quantitative sciences, including physics, chemistry, and biology. The opportunities include having machines learn new representations of interactions between particles and of how matter transforms in reactions, help us decide what experiment to conduct next, or detect emerging phenomena. Sparse data, and situations where common data assumptions do not hold, remain a significant hurdle in many sciences. Consequently, it remains critical to ground our efforts in the millennia of scientific insights embodied in the literature to avoid, in the best case, having machines relearn what we already know. The Chalmers AI4Science seminar is a monthly series in which we invite early-career researchers to present their work at the interface of machine learning, artificial intelligence, and a scientific discipline. The series aims to provide an international platform at Chalmers for discussions about these topics and to strengthen interdisciplinary research involving machine learning and AI at Chalmers. The Chalmers AI4Science seminar is organized by Simon Olsson and Rocío Mercado.
Subscribe to our mailing list for reminders
Abstract:
Biomolecules are highly dynamic systems. They reorganize between a network of conformations connected by rare structural intermediates, which is referred to as their conformational ensemble. The ensemble, including the rare intermediate structures, determines biomolecular function in the cell. However, mapping biomolecular conformational ensembles is still challenging in computational and experimental approaches. Computer simulations are the modern manifestation of scientific theories and, merged with machine learning, can help us overcome challenges in the biomolecular sciences. In the first part of my talk, I will illustrate our work to integrate path sampling with machine learning to empower our ability to simulate rare conformational transitions. Our algorithm provides efficient sampling as well as mechanistic, thermodynamic, and kinetic information on rare molecular events at a moderate computational cost. In the second part, I will discuss using simulation-based inference to identify biomolecular conformations in cryo-electron microscopy (cryo-EM) data. Cryo-EM is a powerful paradigm for characterizing protein conformational ensembles. However, even though the frozen sample contains information on the entire ensemble, accurately identifying rare or disordered molecular conformations depicted in a single cryo-EM image is still challenging. We integrated physics-based simulations, Bayesian inference, and deep learning to develop the cryo-EM simulation-based inference (cryoSBI) framework for inferring molecular conformations and their uncertainties from individual cryo-EM images. We validated cryoSBI on synthetic and experimental data. Our approach paves the way to characterizing entire conformational ensembles from experimental data.
Roberto Covino is W3 Professor of Computational Life Science at the Institute of Computer Science at Goethe University Frankfurt, and a Fellow at the Frankfurt Institute for Advanced Studies. His research uses theory, simulation, and machine learning to understand how biomolecular functions emerge from the interplay between structure, dynamics, and complexity. He focuses on understanding the mechanism of key events in proteins and cellular membranes. His research interests range from molecular biophysics to statistical modelling, machine learning, and lipid and protein biochemistry. RC studied physics and theoretical physics at the University of Bologna and obtained his PhD in physics at the University of Trento. He then worked as a postdoctoral scientist at the Max Planck Institute of Biophysics. RC was appointed Professor of AI in Protein Science at Bayreuth University in 2023 and received the call to Goethe University in 2024.
Abstract:
Atomic systems (molecules, crystals, proteins, etc.) are naturally represented by a set of coordinates in 3D space labeled by atom type. This is a challenging representation to use for machine learning because the coordinates are sensitive to 3D rotations, translations, and inversions (the symmetries of 3D Euclidean space). In this talk I’ll give an overview of Euclidean invariance and equivariance in machine learning for atomic systems. Then, I’ll share some recent applications of these methods on a variety of atomistic modeling tasks (ab initio molecular dynamics, prediction of crystal properties, and scaling of electron density predictions). Finally, I’ll explore open questions in expressivity, data-efficiency, and trainability of methods leveraging invariance and equivariance.
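The distinction between invariance and equivariance mentioned in the abstract can be illustrated with a small numerical check. This is a minimal sketch and is not taken from the speaker's work: the helper names (`random_rotation`, `pairwise_distances`, `centroid_displacements`) and the random five-atom "molecule" are assumptions chosen for illustration only.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random proper rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # make the factorization sign-unique
    if np.linalg.det(q) < 0:      # flip one axis to exclude reflections
        q[:, 0] = -q[:, 0]
    return q

def pairwise_distances(coords):
    """Invariant feature: interatomic distances ignore rotations and translations."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def centroid_displacements(coords):
    """Equivariant feature: vectors from the centroid rotate with the system."""
    return coords - coords.mean(axis=0)

rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))   # hypothetical 5-atom system
R = random_rotation(rng)
t = rng.normal(size=3)
moved = coords @ R.T + t           # rotate, then translate every atom

# Raw coordinates change under the transformation ...
raw_changed = not np.allclose(coords, moved)
# ... distances do not change at all (invariance) ...
invariant_ok = np.allclose(pairwise_distances(coords), pairwise_distances(moved))
# ... and displacement vectors change in exactly the same way as the input (equivariance).
equivariant_ok = np.allclose(centroid_displacements(moved),
                             centroid_displacements(coords) @ R.T)
print(raw_changed, invariant_ok, equivariant_ok)
```

An invariant feature discards orientation entirely, whereas an equivariant feature keeps it in a form that transforms predictably, which is what vector- or tensor-valued predictions such as forces require.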
Tess Smidt is an Assistant Professor of Electrical Engineering and Computer Science at MIT. Tess earned her SB in Physics from MIT in 2012 and her PhD in Physics from the University of California, Berkeley in 2018. Her research focuses on machine learning that incorporates physical and geometric constraints, with applications to materials design. Prior to joining the MIT EECS faculty, she was the 2018 Alvarez Postdoctoral Fellow in Computing Sciences at Lawrence Berkeley National Laboratory and a Software Engineering Intern on the Google Accelerated Sciences team where she developed Euclidean symmetry equivariant neural networks which naturally handle 3D geometry and geometric tensor data.
Abstract:
Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry and units covariance, all of which have led to important results in physics. Our goal is to understand the implications of passive symmetries for machine learning: Which passive symmetries play a role (e.g., permutation symmetry in graph neural networks)? What are dos and don’ts in machine learning practice? We assay conditions under which passive symmetries can be implemented as group equivariances. We also discuss links to causal modeling, and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample.
Soledad Villar is an assistant professor of applied mathematics and statistics at Johns Hopkins University. Currently she is a visiting researcher at Apple Research in Paris. She was born and raised in Montevideo, Uruguay.
Abstract:
In July 2022 Microsoft announced a new global team in Microsoft Research, spanning the UK, China and the Netherlands, to focus on AI for science. In September 2022 we announced that we have also opened a new lab in Berlin, Germany. In this talk I will first discuss the research areas that we are currently exploring in AI4Science at Microsoft Research in Cambridge (UK), Amsterdam and in our new lab in Berlin, covering topics such as drug discovery, material generation, neural PDE solvers, electronic structure theory and computational catalysis. Then I will dive a little deeper into our recent work on Clifford Neural layers for PDE modeling. The PDEs of many physical processes describe the evolution of scalar and vector fields. In order to take into account the correlation between these different fields and their internal components, we represent these fields as multivectors, which consist of scalar, vector, as well as higher-order components. Their algebraic properties, such as multiplication, addition and other arithmetic operations can be described by Clifford algebras, which we use to design Clifford convolutions and Clifford Fourier transforms. We empirically evaluate the benefit of Clifford neural layers by replacing convolution and Fourier operations in common neural PDE surrogates by their Clifford counterparts on two-dimensional Navier-Stokes and weather modeling tasks, as well as three-dimensional Maxwell equations. If time permits I will briefly cover very recent work on protein structure prediction and coarse graining molecular dynamics.
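To make the multivector idea concrete, the following sketch implements the geometric product of the two-dimensional Clifford algebra with basis (1, e1, e2, e1e2), assuming the signature e1² = e2² = +1. It only illustrates the algebra; it is not the Clifford convolution or Fourier layer described in the talk, and the function name `geometric_product_2d` is hypothetical.

```python
import numpy as np

def geometric_product_2d(a, b):
    """Geometric product in Cl(2,0); multivectors are arrays [scalar, e1, e2, e12]."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return np.array([
        a0 * b0 + a1 * b1 + a2 * b2 - a3 * b3,   # scalar part (e12 squares to -1)
        a0 * b1 + a1 * b0 - a2 * b3 + a3 * b2,   # e1 part
        a0 * b2 + a2 * b0 + a1 * b3 - a3 * b1,   # e2 part
        a0 * b3 + a3 * b0 + a1 * b2 - a2 * b1,   # e12 (bivector) part
    ])

e1 = np.array([0.0, 1.0, 0.0, 0.0])
e2 = np.array([0.0, 0.0, 1.0, 0.0])
e12 = geometric_product_2d(e1, e2)

# A scalar field value and a 2D vector field value packed into one multivector.
field_value = np.array([0.5, 1.0, -2.0, 0.0])
rotated = geometric_product_2d(e12, field_value)
print(e12, rotated)
```

Because a single product mixes the scalar, vector, and bivector components, a layer built from it couples the different fields rather than treating each as an independent channel.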
Rianne is a Principal Researcher at Microsoft Research Amsterdam, where she works as part of the AI4Science team on the intersection of deep learning and computational chemistry. Her research has spanned a range of topics from generative modeling, variational inference, source compression, and graph-structured learning to condensed matter physics. Before joining MSR she was a Research Scientist at Google Brain. She received her PhD in theoretical condensed-matter physics in 2016 at the University of Amsterdam, where she also worked as a postdoctoral researcher as part of the Amsterdam Machine Learning Lab (AMLAB). In 2019 she won the Faculty of Science Lecturer of the Year award at the University of Amsterdam for teaching a machine learning course in the master of AI.
Abstract:
Absorption, distribution, metabolism, and excretion (ADME) properties play an important role in the success of drug candidates. Unfavorable pharmacokinetics (PK) can prevent compounds from progressing in drug development, and early screening of ADME/PK properties aims to reduce the number of molecules failing in the development process. This talk will focus on how to use machine learning to leverage historical ADME/PK data and make predictions for new compounds. Machine learning models developed for PK property predictions will be presented, as well as some of their applications at NIBR. Such models are applicable to large libraries, virtual compounds, and generative chemistry workflows. Hence, predictions enable early informed decisions and compound prioritization, aiming at reducing late-stage attrition. However, using machine learning-based predictions to support decision-making in a drug discovery project involves important considerations. Current challenges and future directions for improving the use of ADMET models in industry will be discussed.
Dr. Raquel Rodríguez-Pérez is a Principal Scientist at Novartis Institutes for Biomedical Research and works in the Modeling & Simulation Data Science team in the Translational Medicine Department. She develops machine learning models to predict compound properties relevant in pharmacokinetics. She supports drug discovery teams with modeling and data science tools in order to make better and faster decisions in lead optimization. Prior to working at Novartis, Raquel obtained her B.Sc. and M.Sc. degrees in Biomedical Engineering from the University of Barcelona and her PhD in Computational Life Sciences from the University of Bonn. She worked on data analysis for bioinformatics applications at the Institute for BioEngineering of Catalonia (IBEC) and wrote her thesis on machine learning models for interpretable compound activity predictions. She therefore has experience with the application of machine learning and deep learning methods to different life sciences problems. She was a Marie Curie fellow and worked in the Computational Chemistry - Data Science group at Boehringer Ingelheim, Germany. She has acted as a mentor of scientists at different career levels both in academia and industry. Overall, her research interests include bio/cheminformatics, machine learning, and data science for biomedical applications.
Abstract:
Deep learning, and in general, differentiable programming allow expressing many scientific problems as end-to-end learning tasks while retaining some inductive bias. Common themes in scientific machine learning involve learning surrogate functions of expensive simulators, sampling complex distributions directly or time-propagation of known or unknown differential equation systems efficiently.
In this talk, we will analyze our recent work in applying deep learning surrogates and auto-differentiation in molecular simulations. In particular, we will explore active learning of machine learning potentials with differentiable uncertainty and the use of deep neural network generative models to learn reversible coarse-grained representations of atomic systems. Lastly, we will describe the application of differentiable simulations for learning interaction potentials from experimental data and for reaction path finding without prior knowledge of collective variables.
Rafael Gomez-Bombarelli (Rafa) is the Jeffrey Cheah Career Development Professor at MIT's Department of Materials Science and Engineering. His work aims to fuse machine learning and atomistic simulations for designing materials and their transformations. By embedding domain expertise and experimental results into their models, alongside physics-based knowledge, the Learning Matter Lab designs materials that can be realized in the lab and scaled to practical applications. Together with experimental collaborators, they develop new practical materials such as heterogeneous thermal catalysts (zeolites), transition metal oxide electrocatalysts, therapeutic peptides, organic electronics for displays, or electrolytes for batteries.
Rafa received BS, MS, and PhD (2011) degrees in chemistry from Universidad de Salamanca (Spain), followed by postdoctoral work at Heriot-Watt (UK) and Harvard Universities, and a stint in industry at Kyulux North America. He was awarded the Camille and Henry Dreyfus Foundation "Machine Learning in the Chemical Sciences and Engineering Awards" in 2021 and the Google Faculty Research Award in 2019. He was co-founder of Calculario, a Harvard spinout company, was Chief Learning Officer of ZebiAI, a drug discovery startup acquired by Relay Therapeutics in 2022, and serves as consultant and scientific advisor to multiple startups.
Abstract:
The universality of thermodynamics and statistical mechanics has led to a language comprehensible to chemists, physicists, materials scientists, geologists and others, enabling countless scientific discoveries in diverse fields. In the last decade, with the advent of artificial intelligence (AI), a new and arguably common language has emerged, one that everyone seems to speak but no one quite fully understands. It is natural to ask if AI can be integrated with the various theoretical and simulation methods rooted in thermodynamics and statistical mechanics for discoveries that none of these could achieve individually. It is also natural to ask if chemists, who are not fundamentally trained in AI, should trust any of the results obtained using AI or, even worse, theory or computer simulations that were guided by AI. In this seminar I will show how such an integration of disciplines can be attained, creating trustable, robust AI frameworks for use by chemists and physical scientists. I will talk about such methods developed by my group using and extending different flavors of AI [1-3]. I will demonstrate the methods on different problems involving protein kinases, riboswitches and amino acid nucleation [4-5], where we predict mechanisms at timescales much longer than milliseconds while keeping all-atom/femtosecond resolution, including the problem of recovering a Boltzmann weighted ensemble of conformations from AlphaFold2 [6]. I will conclude with an outlook for future challenges and opportunities, envisioning a new sub-discipline of “Artificial Chemical Intelligence” where chemistry (both theory and simulations) moves hand-in-hand with AI to enable smart molecular discovery.
[1] Wang, Ribeiro, Tiwary. Nature Comm 10, 3573 (2019).
[2] Tsai, Kuo, Tiwary. Nature Comm 11, 5115 (2020).
[3] Wang, Herron, Tiwary. Proc Natl Acad Sci 119, e2203656119 (2022).
[4] Wang, Parmar, Schneekloth, Tiwary. ACS Central Science 8, 741 (2022).
[5] Shekhar, Smith, Seeliger, Tiwary. Angewandte Chemie 61, e202200983 (2022).
[6] Vani, Aranganathan, Tiwary. bioRxiv https://doi.org/10.1101/2022.05.25.493365
Pratyush Tiwary is an Associate Professor at the University of Maryland, College Park in the Department of Chemistry and Biochemistry and the Institute for Physical Science and Technology. He received his undergraduate degree in Metallurgical Engineering from IIT-BHU and his PhD in Materials Science from Caltech, followed by postdoctoral work at ETH Zurich and Columbia University. His work at the interface of molecular simulations, statistical mechanics and machine learning has been recognized through many awards, including the Sloan Research Fellowship in Chemistry, the NSF CAREER award, the NIH Maximizing Investigators’ Research Award and the ACS OpenEye Outstanding Junior Faculty Award.
Abstract:
While the great success of modern deep learning lies in its ability to approximate maps between finite-dimensional vector spaces, many tasks in science and engineering involve continuous measurements that are functional in nature. For example, in climate modeling one might wish to predict the pressure field over the earth from measurements of the surface air temperature field. The goal is then to learn an operator from the space of temperature functions to the space of pressure functions. In recent years, operator learning techniques have emerged as a powerful tool for supervised learning in infinite-dimensional function spaces. In this talk we will provide an introduction to this topic, present a general approximation framework for operators, and demonstrate how one can construct deep learning models that can handle functional data. We will see how such tools can help us build neural ODE and PDE solvers that can be trained even in the absence of labeled data, and enable the prediction of continuous spatio-temporal fields up to three orders of magnitude faster than conventional numerical solvers. We will also discuss key open questions related to generalization, data-efficiency and inductive bias, the resolution of which is critical for the success of AI in science and engineering.
Paris Perdikaris is an Assistant Professor in the Department of Mechanical Engineering and Applied Mechanics at the University of Pennsylvania. He received his PhD in Applied Mathematics at Brown University in 2015, and, prior to joining Penn in 2018, he was a postdoctoral researcher at the Department of Mechanical Engineering at the Massachusetts Institute of Technology. His current research interests include physics-informed machine learning, uncertainty quantification, and engineering design optimization. His work and service have received several distinctions including the DOE Early Career Award (2018), the AFOSR Young Investigator Award (2019), the Ford Motor Company Award for Faculty Advising (2020), the SIAG/CSE Early Career Prize (2021), and the Scialog Fellowship (2021).
Abstract:
Koopman operator theory, together with its main algorithm, extended dynamic mode decomposition (EDMD), has emerged as a powerful modeling approach for complex dynamical systems arising in physics, chemistry, materials science, and engineering. The basic idea is to leverage existing simulation data to learn a linear model that predicts expectation values of observable functions at future times. Though the algorithm is conceptually quite simple, its underlying mathematical structure is very rich, and can be used for different purposes including control, coarse graining, or the identification of metastable states in complex molecules and materials.
The success of the method depends critically on the choice of finite-dimensional subspace (called dictionary), which reflects a priori knowledge of the system. Here, kernel methods emerge as very useful, since they allow using a rich dictionary defined implicitly by the data. On the other hand, the application of kernel methods typically leads to large linear algebra problems that can be challenging to solve in practice.
In this talk, I will present recent results on the efficient use of kernel-based EDMD in the context of molecular simulation. I will first introduce the general Koopman framework and the associated variational formulation for metastable systems. Then, I will show how to introduce kernels in this context. Finally, I will show how low-rank approximations based on random Fourier features (RFFs) can be used to solve the resulting linear algebra problems efficiently.
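The pipeline outlined in the abstract (a feature dictionary, a least-squares linear model, and random Fourier features standing in for a kernel) can be sketched on a toy problem. Everything specific below is an assumption made for illustration and is not the speaker's implementation: the discretized one-dimensional Ornstein-Uhlenbeck trajectory, the Gaussian kernel bandwidth `sigma`, the number of features, the ridge regularization, and the function names.

```python
import numpy as np

def random_fourier_features(x, freqs, phases):
    """Map samples x (n, d) to (n, D) features whose inner products
    approximate the Gaussian kernel exp(-|x - y|^2 / (2 sigma^2))."""
    n_features = freqs.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(x @ freqs + phases)

def edmd(psi_now, psi_next, reg=1e-8):
    """Ridge-regularized least-squares matrix K with psi_next ~ psi_now @ K."""
    gram = psi_now.T @ psi_now
    cross = psi_now.T @ psi_next
    return np.linalg.solve(gram + reg * np.eye(gram.shape[0]), cross)

rng = np.random.default_rng(0)

# Toy data (assumption): x_{t+1} = a * x_t + sqrt(1 - a^2) * noise.
a, n_steps = 0.9, 5000
noise = rng.normal(size=n_steps)
traj = np.zeros(n_steps)
for step in range(n_steps - 1):
    traj[step + 1] = a * traj[step] + np.sqrt(1.0 - a**2) * noise[step]
x_now, x_next = traj[:-1, None], traj[1:, None]

# Random Fourier feature dictionary: frequencies drawn from N(0, 1 / sigma^2).
sigma, n_features = 1.0, 50
freqs = rng.normal(scale=1.0 / sigma, size=(1, n_features))
phases = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

K = edmd(random_fourier_features(x_now, freqs, phases),
         random_fourier_features(x_next, freqs, phases),
         reg=1e-6)

# Eigenvalues of K estimate Koopman eigenvalues; slow processes sit near 1.
eigenvalues = np.linalg.eigvals(K)
leading = eigenvalues[np.argsort(-np.abs(eigenvalues))[:3]]
print(leading)
```

The point of the low-rank construction is that the linear system has the size of the feature count (here 50 by 50) rather than the number of trajectory samples, which is what a full kernel Gram matrix would require.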
Dr. Feliks Nüske is a research group leader at the Max-Planck-Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany. His research is on data-driven methods for modelling and simulation of molecular systems. Feliks received his Ph.D. in applied mathematics from Freie Universität Berlin in 2017. He then joined the Center for Theoretical Biological Physics at Rice University, U.S., for a postdoc. Before moving to his current position Dr. Nüske did a second postdoc in the Department of Mathematics at Paderborn University, Germany.
Abstract:
The field of cellular biology has long sought to understand the intricate mechanisms that govern cellular responses to various perturbations, be they chemical, physical, or biological. Traditional experimental approaches, while invaluable, often face limitations in scalability and throughput, especially when exploring the vast combinatorial space of potential cellular states. Enter generative machine learning, which has shown exceptional promise in modelling complex biological systems. This talk will highlight recent successes, address the challenges and limitations of current models, and discuss the future direction of this exciting interdisciplinary field. Through examples of practical applications, we will illustrate the transformative potential of generative ML in advancing our understanding of cellular perturbations and in shaping the future of biomedical research.
Dr. Lotfollahi recently completed his PhD in Computational Biology at the Technical University of Munich (TUM) and Helmholtz Munich under the supervision of Fabian Theis. A member of ELLIS, Dr. Lotfollahi is director of machine learning research at Relation Therapeutics. Simultaneously, Mo is a scientist at Helmholtz Munich and an incoming faculty member at the Wellcome Sanger Institute. Dr. Lotfollahi’s research is dedicated to developing AI/ML algorithms for biomedical data, with a particular focus on single-cell technologies for diagnostics, therapeutics, and drug discovery. Dr. Lotfollahi has received multiple awards for these contributions, and his work has been highlighted in various press outlets and journals. He has also received numerous fellowships, grants, and scholarships, including prestigious ones from Joachim Herz, EMBL, and Meta. Beyond academic pursuits, Dr. Lotfollahi has served as a researcher, consultant, and advisor in biotechnology and ML/AI companies such as Cellarity, Meta AI, SANTA ANA Bio, and Relation Therapeutics.
Abstract:
Determining the different conformational states of a protein and the transition paths between them is key to fully understanding the relationship between biomolecular structure and function. I will discuss how a generative neural network (GNN) can learn a continuous conformational space representation from example structures produced by molecular dynamics simulations or experiments. I will then show how such a representation, obtained via our freely available software molearn [1], can be leveraged to predict putative protein transition states [2], or to generate conformations useful in the context of flexible protein-protein docking [3]. Finally, I will demonstrate that transfer learning is possible, i.e., a GNN can learn features common to any protein.
[1] S.C. Musson and M.T. Degiacomi (2023). Molearn: a Python package streamlining the design of generative models of biomolecular dynamics. Journal of Open Source Software, 8(89), 5523.
[2] V.K. Ramaswamy, S.C. Musson, C. Willcocks, M.T. Degiacomi (2021). Learning Protein Conformational Space with Convolutions and Latent Interpolations. Physical Review X, 11(1), 011052.
[3] M.T. Degiacomi (2019). Coupling Molecular Dynamics and Deep Learning to Mine Protein Conformational Space. Structure, 27(6), 1034-1040.
Matteo obtained an MSc in Computer Science and a PhD in computational biophysics (2012) at the Swiss Federal Institute of Technology of Lausanne (EPFL). During his PhD studies he combined molecular dynamics simulations and global optimization algorithms to predict the assembly of large protein complexes. In 2013 he joined the research groups of Prof Justin Benesch and Prof Dame Carol Robinson FRS at the University of Oxford. His research, funded by a Swiss National Science Foundation Early Postdoc Mobility Fellowship, focused on the development of computational methods for the prediction of protein assembly guided by ion mobility, cross-linking, SAXS and electron microscopy data, as well as their application to the study of small Heat Shock Proteins and protein-lipid interactions. In 2017 he obtained an EPSRC Fellowship, allowing him to establish his independent research at Durham University, and in 2020 he was appointed Associate Professor in soft condensed matter physics. His current research revolves around the development of machine learning methods to sample protein conformational spaces.
Abstract:
AlphaFold2 revolutionized structural biology by accurately predicting protein structures from sequence. Its implementation, however, (i) lacks the code and data required to train models for new tasks, such as predicting alternate protein conformations or antibody structures, (ii) is unoptimized for commercially available computing hardware, making large-scale prediction campaigns impractical, and (iii) remains poorly understood with respect to how training data and regimen influence accuracy. Here we report OpenFold, an optimized and trainable version of AlphaFold2. We train OpenFold from scratch and demonstrate that it fully reproduces AlphaFold2’s accuracy. By analyzing OpenFold training, we find new relationships between data size/diversity and prediction accuracy and gain insights into how OpenFold learns to fold proteins during its training process.
Mohammed AlQuraishi is an Assistant Professor in the Department of Systems Biology and a member of Columbia's Program for Mathematical Genomics, where he works at the intersection of machine learning, biophysics, and systems biology. The AlQuraishi Lab focuses on two biological perspectives: the molecular and systems levels. On the molecular side, the lab develops machine learning models for predicting protein structure and function, protein-ligand interactions, and learned representations of proteins and proteomes. On the systems side, the lab applies these models in a proteome-wide fashion to investigate the organization, combinatorial logic, and computational paradigms of signal transduction networks, how these networks vary in human populations, and how they are dysregulated in human diseases, particularly cancer.
Dr. AlQuraishi holds undergraduate degrees in biology, computer science, and mathematics. He earned an MS in statistics and a PhD in genetics from Stanford University. He subsequently joined the Systems Biology Department at Harvard Medical School as a Departmental Fellow and a Fellow in Systems Pharmacology, where he developed the first end-to-end differentiable model for learning protein structure from data. Prior to starting his academic career, Dr. AlQuraishi spent three years founding two startups in the mobile computing space. He joined the Columbia Faculty in 2020.
Abstract:
Engineered proteins play increasingly essential roles in industries and applications spanning pharmaceuticals, agriculture, specialty chemicals, and fuel. Machine learning could enable an unprecedented level of control in protein engineering for therapeutic and industrial applications. Large self-supervised models pretrained on millions of protein sequences have recently gained popularity in generating embeddings of protein sequences for protein property prediction. However, protein datasets contain information in addition to sequence that can improve model performance. This talk will cover pretrained models that use both sequence and structural data, their application to predict which portions of proteins can be removed while retaining function, and a new set of protein fitness benchmarks to measure progress in pretrained models of proteins.
Kevin Yang is a senior researcher at Microsoft Research in Cambridge, MA who works on problems at the intersection of machine learning and biology. He did his PhD at Caltech with Frances Arnold on applying machine learning to protein engineering. Before joining MSR, he was a machine learning scientist at Generate Biomedicines, where he used machine learning to optimize proteins. Before graduate school, Kevin taught math and physics for three years at a high school in Inglewood, California through Teach for America.
Abstract:
In the past few years, deep learning methods for molecular design have made the transition from theoretical research prototypes into practical and commercially important tools in use across the pharmaceutical industry. Here, I will present ReInvent, AstraZeneca’s open-source platform for reinforcement learning guided molecular optimization, focusing on the scientific developments behind it and the ever-increasing connection with physics-based molecular simulations. I will highlight some recent approaches to improve the sample efficiency of the reinforcement learning process, thereby allowing for integration with more complex simulation workflows. Finally, I will briefly discuss methods for chemical synthesis planning, and how these various models can work together to power increasingly autonomous systems for drug discovery.
Jon Paul Janet is currently Associate Principal Scientist in the Molecular AI group at AstraZeneca in Gothenburg, Sweden. JP works on early-stage drug discovery and previously developed machine-learning augmented virtual design strategies for inorganic complexes. He received a Ph.D. in Chemical Engineering and Computational Science and Engineering from the Massachusetts Institute of Technology in 2019, following M.Sc. degrees in Scientific Computing and Applied Mathematics from the Technical University of Berlin and the Royal Institute of Technology in Stockholm, both in 2015, as well as a B.Sc. in Chemical Engineering from the University of Cape Town in 2012.
Abstract:
Machine learning represents an exciting opportunity to accelerate discovery in the chemical sciences, and to shorten the time from discovery to products. However, the available (experimental) data for chemistry is often limited, and it is not equally distributed in the vast ‘chemical space’. Our approach is try to bridge this gap by relying on a combination of machine learning and physical simulation. In the first part of the talk, I will describe our work in the field of molecular design for organic electronic materials. Many molecular design algorithms rely on machine learning models to predict the properties of a molecule for a certain application. Although ML models often work well on similar molecules as they were trained on, they often break down when generalizing to different parts of the chemical space. Generative models then abuse these weaknesses of the propery predictor and start generating false positives. We have therefore spent time to develop a series of very fast physics-based property predictors for important properties of organic electronic materials. These can then be coupled with high-throughput virtual screening or molecular design models to discovery new promising candidates. An alternative is to work in a constrained fragment space, which ensures that machine learning methods are sufficiently generalizable. We will give examples for such as fragment-constrained optimization of singlet fission materials using genetic algorithms which are steered by prediction uncertainty. In the second part of the talk, I will present our work in the area of reaction prediction, using a combination of quantum-chemical models and machine learning. Also here, we have developed fast physics-based property predictors for chemical reactivity that we use in generative models, including the first benchmark task for chemical reaction design. Using these simulations methods, we also generate large reactivity datasets on which deep learning models can be trained.
Kjell Jorner has been an Assistant Professor of Digital Chemistry at ETH Zurich since 2023. His work focuses on accelerating chemical discovery with digital tools, with a special emphasis on reactivity and catalysis. His group does interdisciplinary research, drawing from the fields of computational chemistry, cheminformatics and machine learning. Before joining ETH Zurich, he was a postdoctoral researcher with Alán Aspuru-Guzik (2021-2022) and at AstraZeneca UK (2018-2020). Kjell has a PhD from Uppsala University (2018) on computational physical organic chemistry for the photochemistry of aromatic compounds.
Abstract:
The space of possible materials is unimaginably large. To find our way in this space, having a map that can guide us would be nice. In this presentation, we show that machine learning can provide us with such a map [1]. We can use machine learning to encode patterns that are tacit or hidden in a large number of dimensions of this chemical space, and then use it to guide the design of materials. The simplest application of this navigation system is to predict properties that are hard to predict with conventional quantum chemistry or molecular simulation alone [2, 3]. Once we have this in place, we can use it to gather information about structure-property-function relationships as efficiently as possible. A fundamental difficulty here is, however, that we often have to deal with multiple, often competing objectives. For instance, increasing the reactivity often decreases the selectivity. Interestingly, one can show that, using a geometric construction, machine learning can also be used effectively, and without bias, to dramatically accelerate materials design and discovery in such a multiobjective design space [4]. It is important to realize, however, that machine learning relies on data that a machine can use [5]. Toward this goal, we need to develop infrastructure that allows data to be captured without overhead while providing chemists with tools that simplify their daily work [4, 5]. A challenge, however, is that data typically cannot be easily collected in this nice tabular form. Recent advances in applying large language models (LLMs) to chemistry indicate that they might be used to address this challenge. I will showcase how LLMs can autonomously use tools, leverage structured data as well as soft inductive biases, and, in this way, transform how we model chemistry [6, 7].
[1] Jablonka, K. M.; Ongari, D.; Moosavi, S. M.; Smit, B. Big-Data Science in Porous Materials: Materials Genomics and Machine Learning. Chem. Rev. 2020, 120 (16), 8066-8129.
[2] Jablonka, K. M.; Ongari, D.; Moosavi, S. M.; Smit, B. Using Collective Knowledge to Assign Oxidation States of Metal Cations in Metal-Organic Frameworks. Nat. Chem. 2021, 13 (8), 771-777.
[3] Jablonka, K. M.; Moosavi, S. M.; Asgari, M.; Ireland, C.; Patiny, L.; Smit, B. A Data-Driven Perspective on the Colours of Metal-Organic Frameworks. Chem. Sci. 2021, 12 (10), 3587-3598.
[4] Jablonka, K. M.; Jothiappan, G. M.; Wang, S.; Smit, B.; Yoo, B. Bias Free Multiobjective Active Learning for Materials Design and Discovery. Nat. Commun. 2021, 12 (1), 2312.
[5] Jablonka, K. M.; Patiny, L.; Smit, B. Making the Collective Knowledge of Chemistry Open and Machine Actionable. Nat. Chem. 2022.
[6] Jablonka, K. M.; Ai, Q.; Al-Feghali, A.; Badhwar, S.; Bran, J. D. B. A. M.; Bringuier, S.; Brinson, L. C.; Choudhary, K.; Circi, D.; Cox, S.; de Jong, W. A.; Evans, M. L.; Gastellu, N.; Genzling, J.; Gil, M. V.; Gupta, A. K.; Hong, Z.; Imran, A.; Kruschwitz, S.; Labarre, A.; Lála, J.; Liu, T.; Ma, S.; Majumdar, S.; Merz, G. W.; Moitessier, N.; Moubarak, E.; Mouriño, B.; Pelkie, B.; Pieler, M.; Ramos, M. C.; Ranković, B.; Rodriques, S. G.; Sanders, J. N.; Schwaller, P.; Schwarting, M.; Shi, J.; Smit, B.; Smith, B. E.; Van Heck, J.; Völker, C.; Ward, L.; Warren, S.; Weiser, B.; Zhang, S.; Zhang, X.; Zia, G. A.; Scourtas, A.; Schmidt, K. J.; Foster, I.; White, A. D.; Blaiszik, B. 14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon. arXiv June 9, 2023. https://doi.org/10.48550/arXiv.2306.06283.
[7] Jablonka, K. M.; Schwaller, P.; Ortega-Guerrero, A.; Smit, B. Is GPT-3 All You Need for Low-Data Discovery in Chemistry? 2023. https://doi.org/10.26434/chemrxiv-2023-fw8n4.
Kevin Jablonka obtained his bachelor's degree in chemistry at TU Munich. He joined EPFL for his master's studies (and an extended study degree in applied machine learning), after which he joined Berend Smit's group for a Ph.D. He now leads a research group at the Helmholtz Institute for Polymers in Energy Applications of the University of Jena and the Helmholtz Center Berlin. Kevin's research interests are in the digitization of chemistry. For this, he has been contributing to the cheminfo electronic lab notebook ecosystem. He also developed a toolbox for digital reticular chemistry. Using tools from this toolbox, he addressed questions from the atom to the pilot-plant scale. Kevin is also interested in using large language models in chemistry and co-leads the ChemNLP project (with support from OpenBioML.org and Stability.AI).
Abstract:
Tools for synthesis planning are changing rapidly with the emergence of artificial intelligence (AI) models. AI-assisted synthesis planning tools can now perform retrosynthesis tasks, evaluate reactivity, or suggest reaction conditions, to mention a few examples. In this talk, I will present current research from AstraZeneca R&D with a focus on retrosynthesis. I will provide an overview of our open-source retrosynthesis platform, AiZynthFinder, show how transformers can complement rule-based approaches, and detail some recent developments on constrained retrosynthesis. The talk will conclude with a discussion of outstanding challenges for AI-assisted synthesis planning, and a peek into future research directions.
Samuel Genheden leads the Deep Chemistry team in Discovery Sciences, AstraZeneca R&D. He received his PhD in theoretical chemistry from Lund University in 2012, having studied computational methods to estimate ligand-binding affinities. He continued with postdocs at the Universities of Southampton and Gothenburg, where he simulated membrane phenomena using multiscale approaches. He joined the Molecular AI department at AstraZeneca in 2020 and became team leader in 2022. The team’s research focuses on the AiZynth platform for AI-assisted retrosynthesis planning. Samuel’s interests lie in studying chemical and biological systems with computers and using these approaches to impact drug development. He is a keen advocate for open-source software.
Abstract:
Artificial intelligence (AI) is fueling computer-aided drug discovery. Chemical language models (CLMs) constitute a recent addition to the medicinal chemist’s toolkit for AI-driven drug design. CLMs can be used to generate novel molecules in the form of strings (e.g., SMILES, SELFIES) without relying on human-engineered molecular assembly rules. By taking inspiration from natural language processing, CLMs have been shown to learn “syntax” rules for molecule generation, and to implicitly capture “semantic” molecular features, such as physicochemical properties, bioactivity, and chemical synthesizability. This talk will illustrate some successful applications of CLMs to design novel bioactive compounds from scratch in the context of drug discovery, at the interface between theory and wet-lab experiments. Moreover, the talk will provide a personal perspective on current limitations and future opportunities for AI in medicinal and organic chemistry, to accelerate molecule discovery and chemical space exploration.
Francesca Grisoni is a tenure-track Assistant Professor at the Eindhoven University of Technology, where she leads the Molecular Machine Learning team. After receiving her PhD in 2016 at the University of Milano-Bicocca, with a dissertation on machine learning for (eco)toxicology, Francesca worked as a data scientist and as a biostatistical consultant for the pharmaceutical industry. Later, she joined the University of Milano-Bicocca (in 2017) and ETH Zurich (in 2019) as a postdoctoral researcher, working on machine learning for drug discovery and molecular property prediction. Her current research focuses on developing novel chemistry-centered AI methods to augment human intelligence in drug discovery, at the interface between computation and wet-lab experiments.
Abstract:
In this talk, I aim to showcase how machine-learning-inspired optimisations can help with current state-of-the-art experiments. In particular, I will first consider the readout of semiconductor spin qubits using simple principal component analysis. I will then highlight a specifically fabricated semiconductor device with a 3x3 ‘pixel array’, and discuss the simultaneous tuning of those 9 gate voltages to construct a quantum point contact. Finally, I will move on to larger arrays of quantum dots and the detection of transitions between charge states (i.e. finding the facets of high-dimensional Coulomb diamonds).
Evert is a theoretical condensed matter physicist with a background in open systems, numerical simulations and many-body effects. He now also actively investigates how condensed matter physics and machine learning can help each other.
Abstract:
Prof. Bingqing Cheng moved to the Institute of Science and Technology (IST) Austria as a Tenure-Track Assistant Professor in September 2021. Before that, she was a Departmental Early Career Fellow in the Computer Laboratory, University of Cambridge (11/2020–08/2021), and a Junior Research Fellow at Trinity College (03/2019-). She did a PhD (09/2014–02/2019) in Materials Science at École Polytechnique Fédérale de Lausanne (EPFL), supervised by Michele Ceriotti, a Master’s degree at The University of Hong Kong, and a joint Bachelor’s degree at The University of Hong Kong & Shanghai Jiao Tong University.
Abstract:
Protein sequences are shaped by functional optimization on the one hand and by evolutionary history, i.e. phylogeny, on the other hand. A multiple sequence alignment of homologous proteins contains sequences which evolved from the same ancestral sequence and have similar structure and function. In such an alignment, statistical patterns in amino-acid usage at different sites encode structural and functional constraints. Protein language models trained on multiple sequence alignments capture coevolution between sites and structural contacts, but also phylogenetic relationships. I will discuss a method we recently proposed that leverages these models to predict which proteins interact among the paralogs of two protein families, and improves the prediction of the structure of some protein complexes. Next, I will show that these models can be used to generate new protein sequences from given protein families. While multiple sequence alignments are very useful, their construction is imperfect. To address these limitations, we developed ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture, which efficiently uses long contexts. I will show that ProtMamba has promising generative properties, and is able to predict fitness.
Anne-Florence Bitbol is an Assistant Professor at the Swiss Federal Institute of Technology in Lausanne (EPFL), where she leads the Laboratory of Computational Biology and Theoretical Biophysics, within the Institute of Bioengineering. She is also affiliated with the Swiss Institute of Bioinformatics. She studied physics at ENS Lyon, and obtained her PhD in 2012 at Université Paris-Diderot, advised by Jean-Baptiste Fournier. She then joined the Princeton Biophysics Theory group led by William Bialek, Curtis Callan and Ned Wingreen, as an HFSP Postdoctoral Fellow. In 2016 she became an independent CNRS researcher at Laboratoire Jean Perrin of Sorbonne Université in Paris, before joining EPFL in 2020.
Anne-Florence is broadly interested in understanding biological phenomena through physical concepts and mathematical and computational tools. She investigates the impacts of optimization and historical contingency in biological systems, from the molecular to the population scales. She studies how the protein sequence-function relationship is affected by phylogeny and physical constraints, and she develops inference methods from protein sequences, e.g. to predict protein-protein interactions from sequences. These methods are based on information theory, statistical physics and machine learning. She also assesses how microbial population evolution is impacted by spatial structure and environment changes, with applications to antibiotic resistance evolution, and to the evolution of bacteria in the gut. She currently holds an ERC Starting Grant.
Abstract:
Governing equations are essential to the study of physical systems, providing models that can generalize to predict previously unseen behaviors. There are many systems of interest across disciplines where large quantities of data have been collected, but the underlying governing equations remain unknown. This work introduces an approach to discover governing models from data. The proposed method addresses a key limitation of prior approaches by simultaneously discovering coordinates that admit a parsimonious dynamical model. Developing parsimonious and interpretable governing models has the potential to transform our understanding of complex systems, including in neuroscience, biology, and climate science.
Dr. Bethany Lusch is an Assistant Computer Scientist in the data science group at the Argonne Leadership Computing Facility at Argonne National Lab. Her research expertise includes developing methods and tools to integrate AI with science, especially for dynamical systems and PDE-based simulations. Her recent work includes developing machine-learning emulators to replace expensive parts of simulations, such as computational fluid dynamics simulations of engines and climate simulations. She is also working on methods that incorporate domain knowledge in machine learning, representation learning, and using machine learning to analyze supercomputer logs. She holds a PhD and MS in applied mathematics from the University of Washington and a BS in mathematics from the University of Notre Dame.
Abstract:
Deep neural networks have achieved outstanding success in many tasks ranging from computer vision to natural language processing and robotics. However, such models still fall short in their ability to understand the world around us, as well as to generalize and adapt to new tasks or environments. One possible solution to this problem is causal models, since they can reason about the connections between causal variables and the effect of intervening on them. This talk will introduce the fundamental concepts of causal inference, connections and synergies with deep learning, as well as practical applications and advances in sustainability and AI for science.
Stefan Bauer is a professor at TU Munich, group leader at Helmholtz Institute Munich and a CIFAR Azrieli Global Scholar. Using and developing tools of causality, deep learning and real robotic systems, his research focuses on the longstanding goal of artificial intelligence to design machines that can extrapolate experience across environments and tasks. He obtained his PhD in Computer Science from ETH Zurich and was awarded the ETH medal for an outstanding doctoral thesis. Before that, he graduated with a BSc and MSc in Mathematics from ETH Zurich and a BSc in Economics and Finance from the University of London. During his studies, he held scholarships from the Swiss and German National Merit Foundation. In 2019, he won the best paper award at the International Conference on Machine Learning (ICML) and in 2020, he was the lead organizer of the real-robot-challenge.com, a robotics challenge in the cloud.
Abstract:
With novel measurement technologies easily resulting in a deluge of data, we need to consider multiple perspectives in order to ‘see the forest for the trees.’ A single perspective or scale is often insufficient to faithfully capture the underlying patterns of complex phenomena, in particular in the life sciences. However, moving from an ‘either–or’ selection of relevant scales to a ‘both–and’ utilisation of all scales promises better insights and improved expressivity. The emerging field of topological machine learning provides us with effective tools for building multi-scale representations of complex data. This talk presents two use cases that demonstrate the power of learning such representations. The first use case involves improving antimicrobial resistance prediction—a critical problem in a world suffering from superbugs—while the second use case permits us a glimpse into how cognition changes from early childhood to adolescence.
Bastian is Principal Investigator of the AIDOS Lab at the Institute of AI for Health and the Helmholtz Pioneer Campus, focusing on machine learning methods in biomedicine. Dr. Rieck is also a TUM Junior Fellow and a member of ELLIS. Dr. Rieck was previously senior assistant in the Machine Learning & Computational Biology Lab of Prof. Dr. Karsten Borgwardt at ETH Zürich and was awarded his Ph.D. in computer science from Heidelberg University.
Abstract:
Proteins are molecular machines that drive most biological processes. Understanding how proteins function and how they interact with other molecules is essential for us to comprehend life and also aids in regulating diseases. One way to regulate protein function is through small molecules that interact with proteins to inhibit their activity. However, discovering new molecules and efficient methods to predict how effectively they may bind to a target protein is a challenging task. Harnessing computational tools, using machine learning and molecular simulations, provides atomistic models for studying the interactions between a small molecule and a protein. To make computational models valuable, they need to have predictive power, such as proposing a ligand pose, typically done by docking algorithms, or a free energy of binding, often computed through so-called alchemical free energy calculations. In this talk, I will walk you through our process of creating new molecules from X-ray fragment hits [1] using a generative machine learning model, looking at the SARS-CoV-2 main protease. I will also highlight two different ways in which we can assess the free energy of binding: through machine learning [2] and alchemical free energy calculations [3], where we will look at protein examples involved in cancers and antimicrobial resistance.
References:
[1] Runcie, Mey, JCIM, 2023, 63, 19, 5996-6005
[2] Gorantla, Kubincova, Weisse, Mey, 2023, https://doi.org/10.1021/acs.jcim.3c01208
[3] Mey et al. Living J. Comput. Mol. Sci. 2020, 2, 18378
Dr Antonia Mey is a Chancellor’s Fellow and Lecturer in the School of Chemistry at the University of Edinburgh. She obtained a BSc in Physics with Chemistry from Keele University and a PhD in Physics from the University of Nottingham, working with Prof. Juan Garrahan on lattice protein models. After her PhD she joined Prof. Frank Noé's group at the Free University in Berlin as a postdoc, where she developed new tools for the analysis of molecular simulations using Markov models. In 2015, she moved to Edinburgh to work with Prof. Julien Michel on alchemical free energy simulation methods. In 2020, she started her independent career with a Christina Miller Fellowship in Edinburgh, which was soon replaced by a Chancellor’s Fellowship in 2021, which she currently holds. Her main research interests lie in the field of computational biophysics and molecular modelling, using machine learning and biomolecular simulations to understand protein function.