Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates

Jules Schleinitz, Alba Carretero-Cerdán, Anjali Gurajapu, Yonatan Harnik, Gina Lee, Amitesh Pandey, Anat Milo, Sarah E. Reisman

Research output: Contribution to journal › Article › peer-review

Abstract

The development of machine learning models to predict the regioselectivity of C(sp³)-H functionalization reactions is reported. A data set for dioxirane oxidations was curated from the literature and used to generate a model to predict the regioselectivity of C-H oxidation. To assess whether smaller, intentionally designed data sets could provide accuracy on complex targets, a series of acquisition functions were developed to select the most informative molecules for the specific target. Active learning-based acquisition functions that leverage predicted reactivity and model uncertainty were found to outperform those based on molecular and site similarity alone. The use of acquisition functions for data set elaboration significantly reduced the number of data points needed to perform accurate prediction, and it was found that smaller, machine-designed data sets can give accurate predictions when larger, randomly selected data sets fail. Finally, the workflow was experimentally validated on five complex substrates and shown to be applicable to predicting the regioselectivity of arene C-H radical borylation. These studies provide a quantitative alternative to the intuitive extrapolation from “model substrates” that is frequently used to estimate reactivity on complex molecules.
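
As an illustration of the general idea of an acquisition function that combines predicted reactivity with model uncertainty, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical ensemble of models that each return a reactivity score per candidate molecule, and it ranks candidates with an upper-confidence-bound-style score (mean prediction plus a weighted ensemble spread). The names `acquisition_scores`, `select_batch`, `weight`, and the example molecules and numbers are all invented for illustration.

```python
from statistics import mean, pstdev


def acquisition_scores(ensemble_predictions, weight=1.0):
    """Score each candidate as mean prediction + weight * ensemble spread.

    ensemble_predictions maps a candidate name to a list of predicted
    reactivity scores, one per model in a hypothetical ensemble.
    The spread (population standard deviation) stands in for uncertainty.
    """
    return {
        name: mean(preds) + weight * pstdev(preds)
        for name, preds in ensemble_predictions.items()
    }


def select_batch(ensemble_predictions, batch_size, weight=1.0):
    """Return the names of the batch_size highest-scoring candidates."""
    scores = acquisition_scores(ensemble_predictions, weight)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:batch_size]


# Illustrative, made-up predictions for three candidate molecules.
candidates = {
    "mol_A": [0.9, 0.9, 0.9],  # predicted reactive, models agree
    "mol_B": [0.2, 0.8, 0.5],  # models disagree (high uncertainty)
    "mol_C": [0.1, 0.1, 0.1],  # predicted unreactive, models agree
}

picked = select_batch(candidates, batch_size=2)
print(picked)
```

In this sketch, raising `weight` favors candidates on which the ensemble disagrees, while `weight=0.0` reduces the ranking to predicted reactivity alone.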

Original language: American English
Pages (from-to): 7476-7484
Number of pages: 9
Journal: Journal of the American Chemical Society
Volume: 147
Issue number: 9
DOIs
State: Published - 5 Mar 2025

All Science Journal Classification (ASJC) codes

  • Catalysis
  • General Chemistry
  • Biochemistry
  • Colloid and Surface Chemistry
