MIT's AI model predicts molecular solubility in solvents
Predicting how well a molecule will dissolve in a particular liquid, a property known as solubility, is a fundamental challenge in chemistry, especially in the creation of new pharmaceuticals. This crucial step, often a bottleneck in drug design and manufacturing, dictates everything from the efficiency of chemical reactions to the safety profile of the production process. Now, chemical engineers at MIT have unveiled a sophisticated computational model that significantly enhances this predictive capability, promising to accelerate drug discovery and promote the use of less hazardous solvents in industry.
For decades, chemists have estimated solubility with tools such as the Abraham Solvation Model, which sums the contributions of the chemical structures within a molecule. While helpful, these traditional methods offer limited accuracy. More recently, machine learning has entered the fray, with advances such as SolProp, a model developed in William Green’s lab at MIT in 2022. SolProp improved on earlier methods by predicting a set of related properties and combining them through thermodynamic principles to arrive at a solubility estimate. However, it struggled to forecast solubility for molecules it had not encountered during training, a significant hurdle for drug development pipelines built around novel molecules.
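To make the older approach concrete: Abraham-type models are commonly written as a linear free energy relationship, log SP = c + eE + sS + aA + bB + vV, where the capital letters are descriptors of the solute and the lowercase letters are coefficients fitted for a given solvent or process. The Python sketch below only evaluates that sum. The class and function names are hypothetical, chosen for illustration, and no real coefficient or descriptor values are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoluteDescriptors:
    """Solute descriptors used in Abraham-type equations."""
    E: float  # excess molar refraction
    S: float  # dipolarity / polarizability
    A: float  # hydrogen-bond acidity
    B: float  # hydrogen-bond basicity
    V: float  # McGowan characteristic volume


@dataclass(frozen=True)
class SolventCoefficients:
    """Coefficients fitted for one solvent or transfer process."""
    c: float  # intercept
    e: float
    s: float
    a: float
    b: float
    v: float


def abraham_log_property(solute: SoluteDescriptors,
                         solvent: SolventCoefficients) -> float:
    """Evaluate log SP = c + eE + sS + aA + bB + vV."""
    return (solvent.c
            + solvent.e * solute.E
            + solvent.s * solute.S
            + solvent.a * solute.A
            + solvent.b * solute.B
            + solvent.v * solute.V)
```

Because each term is a simple product of one solute descriptor and one solvent coefficient, the model is easy to interpret, but it has only a handful of parameters with which to capture how a given solute and solvent interact.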
The impetus for the new model arose from a collaborative project by MIT graduate students Lucas Attia and Jackson Burns during a course on applying machine learning to chemical engineering. Their breakthrough was largely facilitated by the release of BigSolDB in 2023, a comprehensive dataset compiling solubility information from nearly 800 published papers. This invaluable resource included data on approximately 800 molecules dissolved in over 100 common organic solvents, encompassing over 40,000 data points and even accounting for the critical influence of temperature on solubility.
Attia and Burns trained two different machine learning models on this dataset. Both represent molecular structures as “embeddings,” numerical representations that capture details such as the number of atoms in a molecule and which atoms are bonded to which; the models then use these representations to predict chemical properties. One approach, FastProp, developed by Burns and others in Green’s lab, uses “static embeddings,” meaning the representation of each molecule is fixed before training begins. The second, ChemProp, an MIT-developed model already used in antibiotic discovery and other applications, learns its embeddings during training, at the same time as it learns to associate molecular features with properties such as solubility.
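As a rough illustration of what a static embedding means in practice, the Python sketch below maps a molecule, written as a SMILES string, to a fixed list of counts using a rule that never changes during training. It is a toy stand-in, not FastProp's actual descriptor set: the function names and the choice of features are assumptions made for this example, and the small tokenizer ignores bracket atoms, charges, and other SMILES features.

```python
import re
from typing import List

# Toy tokenizer: two-letter halogens first, then single-letter atoms
# (upper case = aliphatic, lower case = aromatic), bond symbols, ring digits.
TOKEN = re.compile(r"Cl|Br|[BCNOSPFI]|[bcnosp]|=|#|\d")


def static_embedding(smiles: str) -> List[int]:
    """Map a SMILES string to a fixed-length vector of simple counts.

    Returns [carbons, nitrogens, oxygens, halogens, double bonds, rings].
    The mapping is hand-written and is not adjusted by any training step.
    """
    tokens = TOKEN.findall(smiles)
    carbons = sum(t in ("C", "c") for t in tokens)
    nitrogens = sum(t in ("N", "n") for t in tokens)
    oxygens = sum(t in ("O", "o") for t in tokens)
    halogens = sum(t in ("F", "Cl", "Br", "I") for t in tokens)
    double_bonds = tokens.count("=")
    # Each ring is opened and closed by one digit, so two digits = one ring.
    rings = sum(t.isdigit() for t in tokens) // 2
    return [carbons, nitrogens, oxygens, halogens, double_bonds, rings]


def featurize(solute: str, solvent: str, temperature_k: float) -> List[float]:
    """Concatenate solute and solvent embeddings with the temperature.

    Assumed input layout for a downstream regression model; the layout
    used by the published models may differ.
    """
    return static_embedding(solute) + static_embedding(solvent) + [temperature_k]
```

A learned embedding, by contrast, would replace `static_embedding` with a function whose parameters are adjusted during training alongside the rest of the model.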
When tested on a set of 1,000 solutes withheld from the training data, both new models produced predictions two to three times more accurate than SolProp’s. According to Burns, their ability to predict subtle variations in solubility with temperature, even amid substantial experimental noise, was a particularly strong sign that the models had learned the underlying behavior. Surprisingly, despite the theoretical advantages of ChemProp’s learned embeddings, the two models performed virtually identically. This parity suggests that the main constraint on performance is not the models themselves but the variability and quality of the training data, much of which was compiled from different labs working under differing experimental conditions.
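Testing on solutes withheld from training implies splitting the data by molecule rather than by individual measurement, so that no test solute contributes any data point to training. A minimal Python sketch of such a split, together with a root-mean-square error as one possible accuracy measure, might look like this. The record layout, the function names, and the choice of metric are assumptions for illustration, not details taken from the study.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]  # assumed layout: {"solute": ..., "solvent": ..., ...}


def split_by_solute(records: List[Record],
                    test_fraction: float = 0.2,
                    seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Hold out whole solutes so train and test share no solute."""
    solutes = sorted({r["solute"] for r in records})
    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(solutes)
    n_test = max(1, round(len(solutes) * test_fraction))
    held_out = set(solutes[:n_test])
    train = [r for r in records if r["solute"] not in held_out]
    test = [r for r in records if r["solute"] in held_out]
    return train, test


def rmse(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Root-mean-square error between predictions and measurements."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))
```

Splitting by measurement instead would let the same solute appear on both sides at different temperatures or in different solvents, which tends to make a model look better on unseen molecules than it really is.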
The FastProp-based model, dubbed FastSolv, was chosen for public release because of its speed and its easily adaptable code. It has already been made freely available and is being adopted by numerous pharmaceutical companies. The model could streamline the drug discovery pipeline by helping chemists select suitable solvents for reactions more efficiently. It can also help identify less hazardous alternatives to commonly used industrial solvents, addressing a significant environmental and safety concern. As Burns notes, the model is “extremely useful in being able to identify the next-best solvent, which is hopefully much less damaging to the environment.”
The research, overseen by William Green, the Hoyt Hottel Professor of Chemical Engineering and director of the MIT Energy Initiative, and co-authored by Patrick Doyle, the Robert T. Haslam Professor of Chemical Engineering, was published today in Nature Communications. Funded in part by the U.S. Department of Energy, this advancement marks a pivotal step towards more efficient, safer, and environmentally conscious chemical synthesis across a range of industries.