Image: SwRI developed a large language model called Generative Approaches for Molecular Encodings (GAMES) to generate Simplified Molecular Input Line Entry System (SMILES) strings, which offer a text-based system to represent the structure of chemical molecules.
Credit: Southwest Research Institute
SAN ANTONIO — August 14, 2025 — Southwest Research Institute scientists and engineers have developed a custom large language model (LLM) to accelerate drug design and discovery.
A multidisciplinary team developed the Generative Approaches for Molecular Encodings (GAMES) LLM to generate Simplified Molecular Input Line Entry System (SMILES) strings. SMILES is an industry-standard system that represents the structure of molecules using a short series of text characters to facilitate storage, retrieval and modeling. With funding from SwRI’s LAMP initiative, an internal research program to advance LLMs, researchers trained GAMES to understand and generate valid new SMILES combinations.
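To make the idea of a "valid" SMILES string concrete, here is a minimal sketch of a structural sanity check in Python. It is an illustration only, not the validation GAMES uses: the function name `looks_like_smiles` is hypothetical, and the check covers just three syntax rules (balanced branch parentheses, closed bracket atoms, and ring-closure labels that appear in pairs). A real workflow would rely on a cheminformatics toolkit that also checks valence and aromaticity.

```python
def looks_like_smiles(s: str) -> bool:
    """Cheap structural sanity check; NOT a full chemical validity test.

    Hypothetical helper for illustration. It verifies only that branches
    "(...)" are balanced, bracket atoms "[...]" are closed, and every
    ring-closure label (a digit, or "%" followed by two digits) is
    opened and closed.
    """
    if not s:
        return False
    depth = 0            # current branch nesting level
    open_rings = set()   # ring labels seen an odd number of times so far
    in_bracket = False   # True while inside a [...] atom specification
    i = 0
    while i < len(s):
        ch = s[i]
        if in_bracket:
            # Digits inside brackets are isotopes, H counts or charges.
            if ch == "]":
                in_bracket = False
            elif ch == "[":
                return False
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            return False
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            label = s[i + 1:i + 3]
            if len(label) != 2 or not label.isdigit():
                return False
            open_rings ^= {label}   # toggle: open on first sight, close on second
            i += 2
        elif ch.isdigit():
            open_rings ^= {ch}
        i += 1
    return depth == 0 and not in_bracket and not open_rings


if __name__ == "__main__":
    # Ethanol, benzene and aspirin, followed by two malformed strings.
    for candidate in ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "c1ccccc", "CC(C"]:
        print(candidate, looks_like_smiles(candidate))
```

The first three strings pass the check, while the last two fail because of an unclosed ring and an unclosed branch, respectively.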
“This project demonstrates a systematic way to build databases and networks of molecules for AI processing and comparison using only language,” said Institute Scientist Dr. Jonathan Bohmann, lead developer of SwRI’s Rhodium™ molecular docking software designed to virtually screen drug compounds.
Rhodium software uses descriptors along with graphical processing to visualize the chemical properties of compounds. Incorporating GAMES into the Rhodium workflow offers a faster, generalized approach to drug discovery and design.
“Using LLMs, we can directly apply machine learning and AI to molecules via SMILES strings, because they appear as readable text characters and don’t require translation into abstract representations,” Bohmann said.
SwRI trained the GAMES model with classes of carbon-based molecules and other reference compounds to validate and fine-tune the SMILES strings it generated.
“This project showcases the power of training LLMs in highly technical scientific domains to focus on specific tasks,” said SwRI Lead Computer Scientist Michael Hartnett. “In this case, we are working in the drug discovery domain, and our fine-tuning is focused on unlocking the most relevant knowledge.”
GAMES combines LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) techniques to efficiently fine-tune LLMs, reducing the hardware and energy needed to run Rhodium models. The team hopes to apply this approach to other applications and domains across the Institute.
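The savings from LoRA come from training two small matrices instead of a full weight matrix. The sketch below illustrates the general technique in plain Python; it is not SwRI's implementation, and the function names, matrix sizes and scaling value are assumptions chosen for illustration. A frozen weight matrix W of shape d_out x d_in is augmented with a trainable update (alpha / r) * B * A, where A is r x d_in, B is d_out x r, and the rank r is small. QLoRA additionally stores the frozen W in a quantized, low-precision format, which this sketch does not model.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / rank) * B (A x).

    W stays frozen during fine-tuning; only A and B would be trained.
    """
    rank = len(A)                      # A has one row per rank dimension
    base = matvec(W, x)                # output of the frozen layer
    update = matvec(B, matvec(A, x))   # low-rank correction
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    d_out, d_in, rank = 4096, 4096, 8
    full = d_out * d_in
    lora = lora_param_count(d_out, d_in, rank)
    print(f"full: {full:,}  LoRA: {lora:,}  fraction: {lora / full:.4%}")

    # Tiny numeric example: B starts at zero, so the adapted layer
    # initially reproduces the frozen layer exactly.
    W = [[1, 0], [0, 1]]
    A = [[1, 1]]
    x = [2, 3]
    print(lora_forward(W, A, [[0], [0]], x, alpha=1))
    print(lora_forward(W, A, [[1], [2]], x, alpha=1))
```

For the assumed 4096 x 4096 layer at rank 8, the trainable factors hold 65,536 parameters versus 16,777,216 in the full matrix, roughly 0.39 percent.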
“Using LLMs to generate accurate SMILES could transform the drug discovery process, especially when trained using specific datasets,” said SwRI Research Scientist Daniel Hinojosa. “The fine-tuned techniques significantly improved performance, increasing the number of valid SMILES while reducing invalid outputs. Structured datasets and specific training techniques were key to this accomplishment.”
Researchers hope GAMES will offer a powerful framework for ranking compounds found in chemical libraries based on drug-likeness, a shorthand term for the combination of properties that make a compound most likely to be approved as a safe drug. Additionally, they plan to explore chemical landscapes systematically through testing. Hinojosa and Bohmann plan to pursue additional internal funding to advance the next phase of the project.
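As a rough illustration of what ranking by drug-likeness can look like, the sketch below scores compounds with Lipinski's rule of five, a widely used heuristic. This is an assumption made for illustration: the article does not say which drug-likeness criteria GAMES or Rhodium apply. The class name, function names and the three compounds with their descriptor values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one compound (hypothetical container)."""
    name: str
    mol_weight: float   # molecular weight in daltons
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four thresholds the compound exceeds."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def rank_by_drug_likeness(library):
    """Order compounds from fewest to most violations, ties broken by name."""
    return sorted(library, key=lambda d: (lipinski_violations(d), d.name))


if __name__ == "__main__":
    # Made-up descriptor values, not measurements of real compounds.
    library = [
        Descriptors("cmpd_a", 620.0, 6.1, 2, 11),
        Descriptors("cmpd_b", 310.4, 2.3, 1, 4),
        Descriptors("cmpd_c", 480.0, 5.4, 3, 8),
    ]
    for d in rank_by_drug_likeness(library):
        print(d.name, lipinski_violations(d))
```

With these made-up values, cmpd_b ranks first with no violations, followed by cmpd_c with one and cmpd_a with three.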
“While we’re in early stages of development, the results are already having a direct impact on ongoing research programs at SwRI,” Bohmann said.
GAMES received funding through the SwRI Internal Research and Development Program. In 2024, SwRI invested more than $11 million in tomorrow’s technology to broaden its knowledge base, expand its reputation as a leader in science and technology and encourage its staff’s professional development.
To learn more, visit https://www.swri.org/what-we-do/internal-research-development or SwRI’s structure-based virtual screening page at https://www.swri.org/markets/biomedical-health/pharmaceutical-development/drug-discovery/structure-based-virtual-screening.