Abstract
The development of machine learning models to predict the regioselectivity of C(sp³)-H functionalization reactions is reported. A data set for dioxirane oxidations was curated from the literature and used to generate a model to predict the regioselectivity of C-H oxidation. To assess whether smaller, intentionally designed data sets could provide accuracy on complex targets, a series of acquisition functions was developed to select the most informative molecules for the specific target. Active learning-based acquisition functions that leverage predicted reactivity and model uncertainty were found to outperform those based on molecular and site similarity alone. The use of acquisition functions for data set elaboration significantly reduced the number of data points needed to perform accurate prediction, and it was found that smaller, machine-designed data sets can give accurate predictions when larger, randomly selected data sets fail. Finally, the workflow was experimentally validated on five complex substrates and shown to be applicable to predicting the regioselectivity of arene C-H radical borylation. These studies provide a quantitative alternative to the intuitive extrapolation from “model substrates” that is frequently used to estimate reactivity on complex molecules.
| Original language | American English |
|---|---|
| Pages (from-to) | 7476-7484 |
| Number of pages | 9 |
| Journal | Journal of the American Chemical Society |
| Volume | 147 |
| Issue number | 9 |
| State | Published - 5 Mar 2025 |
All Science Journal Classification (ASJC) codes
- Catalysis
- General Chemistry
- Biochemistry
- Colloid and Surface Chemistry
Title: Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates