TY - JOUR
T1 - Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates
AU - Schleinitz, Jules
AU - Carretero-Cerdán, Alba
AU - Gurajapu, Anjali
AU - Harnik, Yonatan
AU - Lee, Gina
AU - Pandey, Amitesh
AU - Milo, Anat
AU - Reisman, Sarah E.
N1 - Publisher Copyright: © 2025 The Authors. Published by American Chemical Society.
PY - 2025/3/5
Y1 - 2025/3/5
AB - The development of machine learning models to predict the regioselectivity of C(sp³)-H functionalization reactions is reported. A data set for dioxirane oxidations was curated from the literature and used to generate a model to predict the regioselectivity of C-H oxidation. To assess whether smaller, intentionally designed data sets could provide accuracy on complex targets, a series of acquisition functions were developed to select the most informative molecules for the specific target. Active learning-based acquisition functions that leverage predicted reactivity and model uncertainty were found to outperform those based on molecular and site similarity alone. The use of acquisition functions for data set elaboration significantly reduced the number of data points needed to perform accurate prediction, and it was found that smaller, machine-designed data sets can give accurate predictions when larger, randomly selected data sets fail. Finally, the workflow was experimentally validated on five complex substrates and shown to be applicable to predicting the regioselectivity of arene C-H radical borylation. These studies provide a quantitative alternative to the intuitive extrapolation from “model substrates” that is frequently used to estimate reactivity on complex molecules.
UR - http://www.scopus.com/inward/record.url?scp=86000184966&partnerID=8YFLogxK
U2 - 10.1021/jacs.4c15902
DO - 10.1021/jacs.4c15902
M3 - Article
C2 - 39982221
SN - 0002-7863
VL - 147
SP - 7476
EP - 7484
JO - Journal of the American Chemical Society
JF - Journal of the American Chemical Society
IS - 9
ER -