image: Two examples of the same molecule represented in sequence, image and graph formats are provided for direct comparison.
Credit: Xin Wang et al.
In cheminformatics, where machine learning is changing how molecular properties are predicted and explained, a critical challenge remains: making powerful but often "black box" models easier to interpret. Researchers at the Australian National University have developed a "regional explanation" method that helps reveal how molecular structures drive their properties. The research was published June 3 in Intelligent Computing, a Science Partner Journal, in an article titled "Regional Explanations and Diverse Molecular Representations in Cheminformatics: A Comparative Study."
The new regional explanation method bridges the gap between local and global explanations, capturing nonlinear relationships between molecular features and properties. The authors found that different molecular representations showed consistency in their regional explanations. The new method offers fine-grained, chemically meaningful insights often missed by traditional explanation methods. It was validated on 2 datasets, demonstrating broad applicability across different chemical domains.
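To make the idea of an explanation that sits between local and global more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that per-molecule (local) feature attributions are already available and that each molecule carries a region label; the attribution values, the region names and the helper functions are all hypothetical.

```python
from collections import defaultdict


def mean_abs_attribution(rows):
    """Average the absolute attribution of each feature over a set of samples."""
    n_features = len(rows[0])
    return [sum(abs(r[j]) for r in rows) / len(rows) for j in range(n_features)]


def regional_explanations(attributions, regions):
    """Group local attributions by region label and summarize each group."""
    grouped = defaultdict(list)
    for row, label in zip(attributions, regions):
        grouped[label].append(row)
    return {label: mean_abs_attribution(rows) for label, rows in grouped.items()}


# Hypothetical local attributions for two features across four molecules.
attributions = [
    [0.9, 0.1],
    [0.7, 0.1],
    [0.1, 0.8],
    [0.1, 0.6],
]
# Hypothetical region labels (for example, ranges of the predicted property).
regions = ["low_energy", "low_energy", "high_energy", "high_energy"]

global_view = mean_abs_attribution(attributions)
regional_view = regional_explanations(attributions, regions)

print("global:", global_view)
for label, importance in regional_view.items():
    print(label, importance)
```

In this toy data the global average rates both features as roughly equally important, while the per-region averages show that the first feature dominates in one region and the second in the other. This is the kind of nonlinear, region-dependent pattern that a single global summary can hide.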
To develop and validate this method, the researchers chose a dataset of 2,384 graphene oxide nanoflakes, each annotated with 783 molecular features used to predict formation energy, a key indicator of thermodynamic stability. After removing duplicates, 2,116 molecules remained. The researchers tested their method on 4 different data representations of these molecules, pairing each representation with an appropriate machine learning model: a multi-layer perceptron for the tabular representation, a transformer for sequences, a convolutional neural network for images and a graph convolutional network for graph data. To ensure robust and fair comparisons, missing values were handled and the data were normalized. Both local explanation methods and the regional explanation approach were used to interpret model predictions. The analysis revealed that the predictive features identified by the new approach reflected real-world knowledge about chemical properties related to formation energy. The method's generalizability was confirmed using the Quantum Machine 9 (QM9) dataset, a larger and more chemically diverse benchmark whose results supported those obtained on the real-world graphene oxide nanoflake dataset.
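The preprocessing steps mentioned above (removing duplicates, handling missing values, normalizing) can be sketched as follows. This is an illustration under stated assumptions, not the study's actual pipeline: the article does not say how duplicates were defined, how missing values were filled or which normalization was used, so the sketch assumes exact-row duplicates, column-mean imputation and z-score standardization. The toy data and function names are hypothetical.

```python
import math


def drop_duplicates(rows):
    """Keep the first occurrence of each feature row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def standardize(rows):
    """Scale each column to zero mean and unit variance (constant columns become 0)."""
    n, n_cols = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(n_cols)
    ]
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(n_cols)]
        for r in rows
    ]


# Hypothetical toy feature table: one exact duplicate and one missing value.
raw = [
    [1.0, 10.0],
    [1.0, 10.0],  # duplicate of the first row
    [3.0, None],  # missing second feature
    [5.0, 30.0],
]

unique = drop_duplicates(raw)   # 3 rows remain
filled = impute_mean(unique)    # the None becomes the column mean
scaled = standardize(filled)    # each column now has zero mean
```

Applying the same cleaned table to every representation-specific model is one way to keep the comparison between representations fair, since differences in the explanations then come from the representation and model rather than from the data preparation.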
The researchers believe their regional explanation method could have broad application, from materials design to drug discovery, and could serve as a practical tool to understand complex structure–property relationships. Future work may focus on incorporating automated clustering to better capture property-specific molecular patterns or on adding uncertainty quantification to enhance interpretability.
Journal: Intelligent Computing
Article Title: Regional Explanations and Diverse Molecular Representations in Cheminformatics: A Comparative Study
Article Publication Date: 3-Jun-2025