News Release

Can recent machine learning interatomic potentials reliably predict surface stability?

Peer-Reviewed Publication

Songshan Lake Materials Laboratory


Image: Illustration of the comprehensive zero-shot benchmark of 19 universal machine learning interatomic potentials and the dominant impact of training data composition for surface energy prediction.

Credit: Ardavan Mehdizadeh and Peter Schindler from Northeastern University.

A comprehensive zero-shot benchmark of 19 universal machine learning interatomic potentials reveals that strategic training data, and particularly non-equilibrium atomic configurations, delivers five to seventeen times greater improvement than architectural sophistication for predicting cleavage energies, a property governing surface stability critical for catalysis, electron emission devices, and next-generation battery interfaces.

Cleavage or surface energies determine how materials break, which surface is the most stable one, how catalysts function, and interfacial properties essential in semiconductor and battery devices. Traditional quantum chemistry calculations can accurately compute these properties, but remain prohibitively expensive for discovering new materials. Machine learning interatomic potentials (MLIPs), also termed “foundational potentials”, promise to accelerate calculations by several orders of magnitude, yet their reliability for surface properties has remained largely untested, until now.
Researchers from Northeastern University’s Department of Mechanical and Industrial Engineering have conducted the most comprehensive zero-shot evaluation to date of universal MLIPs (uMLIPs) for predicting cleavage energies, by testing uMLIP models on a task for which they were never explicitly trained, publishing their findings in AI for Science. The team benchmarked 19 state-of-the-art uMLIPs, including equivariant graph neural network- and transformer-based architectures, against 36,718 surface structures spanning elemental, binary, and ternary compounds. Critically, none of these models had ever seen cleavage energy data during training, making this a pure test of out-of-distribution generalization. The results serve as a guide for future uMLIP development priorities.

The Challenge

Predicting surface stability requires understanding the energy cost of breaking atomic bonds to create new surfaces. This property governs phenomena across materials science: determining which crystal facets appear on nanoparticles, predicting crack propagation in structural materials, identifying catalytically active sites, and understanding degradation mechanisms in battery electrodes. While density functional theory (DFT) can calculate surface energies with quantum mechanical accuracy, screening millions of possible surface configurations remains computationally intractable.
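To make the quantity concrete: the cleavage (surface) energy is the excess energy of a slab over the same number of bulk atoms, divided by the area of the two surfaces created. The sketch below is purely illustrative, with made-up numbers, and is not the authors' code:

```python
def cleavage_energy(e_slab, e_bulk_per_atom, n_atoms, area):
    """Cleavage energy in eV/Angstrom^2: excess slab energy relative to the
    equivalent bulk, split over the two surfaces the cut creates."""
    return (e_slab - n_atoms * e_bulk_per_atom) / (2.0 * area)

# Hypothetical numbers: a 12-atom slab with a 50 Angstrom^2 cross-section
e = cleavage_energy(e_slab=-45.0, e_bulk_per_atom=-4.0, n_atoms=12, area=50.0)
print(round(e, 3))  # (-45 - 12*(-4)) / (2*50) = 0.03
```

In practice both energies come from DFT (or an MLIP), which is exactly where the expense, or the speedup, enters.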
The promise of uMLIP models, which are trained once on diverse materials data and applicable across the periodic table, has captivated the computational materials community. Recent developments like MACE, ORB-v3, EquiformerV2, and others showcase increasingly sophisticated architectures, from simple graph networks to complex transformer models with billions of parameters. However, a critical question remained unanswered: do these architectural innovations translate to accurate predictions for surface properties, which lie outside most models’ training distributions?

The Solution

The research team designed a rigorous zero-shot evaluation framework to assess how well current uMLIPs generalize to cleavage energy prediction without explicit surface training. By evaluating the models on the authors’ cleavage energy dataset without any specific training, they created a benchmark that reveals which training strategies enable true out-of-distribution generalization. They compared models across diverse architectural paradigms, such as graph-based equivariant transformers (EquiformerV2, UMA, GRACE, eSEN), orbital-based representations (ORB), and multi-scale approaches (MatterSim), all trained on different combinations of large-scale materials databases.
The researchers evaluated all 19 models on 36,718 surfaces from their comprehensive benchmark database, then validated the best performer on over 13,000 additional relaxed surfaces from two independent databases. This represents true zero-shot evaluation, assessing how well models trained primarily on bulk materials can predict surface properties. The team further assessed each model’s ability to identify the thermodynamically most stable surface termination, which is the facet that would most commonly be exposed on a crystal, across 3,699 unique bulk materials.
Most significantly, they analyzed how training data composition influences performance by comparing architecturally identical models trained on different datasets: equilibrium-only structures (Materials Project trajectories), non-equilibrium configurations (Open Materials 2024), surface-adsorbate systems (Open Catalyst 2020), low-dimensional materials (Alexandria), and various combinations.
The findings were unequivocal: models trained on the Open Materials 2024 (OMat24) dataset achieved mean absolute percentage errors below 6% and correctly identified stable surface configurations for 87% of materials, despite never being explicitly trained on surface energies. In stark contrast, architecturally identical models trained exclusively on equilibrium structures showed five-fold higher errors (15% MAPE), while models trained on catalysis-focused surface-adsorbate data surprisingly failed catastrophically with 17-fold performance degradation (71% MAPE).
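For readers unfamiliar with the metrics, here is a minimal sketch of how the two headline numbers, MAPE and the stable-termination hit rate, can be computed. Function names and the example values are illustrative, not the authors' implementation:

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def finds_stable_termination(dft_energies, mlip_energies):
    """True if the model ranks the same surface termination lowest as DFT."""
    argmin = lambda xs: min(range(len(xs)), key=xs.__getitem__)
    return argmin(dft_energies) == argmin(mlip_energies)

# Hypothetical cleavage energies for three terminations of one material
print(round(mape([2.0, 4.0], [2.1, 3.8]), 6))             # 5.0
print(finds_stable_termination([0.12, 0.09, 0.15],
                               [0.13, 0.10, 0.14]))       # True
```

Note that a model can identify the correct stable termination even with a modest absolute error, as long as the relative ordering of terminations is preserved.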
“What makes this evaluation particularly powerful is its zero-shot nature,” explains Dr. Schindler. “These models achieved sub-6% errors on surface properties despite never being trained on our surface dataset. This demonstrates that the right training data enables remarkable generalization to entirely new property domains.”
The framework also enabled a critical discovery: by comparing architecturally identical models trained on different datasets, the team could isolate the impact of training data composition from architectural effects, revealing a 5–17× performance gap between training strategies. “Our zero-shot evaluation revealed something unexpected: a simple graph neural network trained on appropriate data consistently outperformed sophisticated transformer architectures trained on equilibrium-only datasets,” explains corresponding author Dr. Peter Schindler. “By testing models on surface properties they had never seen during training, we isolated the true impact of training data composition versus architectural design. For the generalization performance of current state-of-the-art uMLIPs, training data is 5 to 17 times more influential than model complexity. This highlights the crucial role of high-quality training data generation as the primary enabler of models with universal applicability.”

Why Non-Equilibrium Data Makes the Difference

The superior performance of OMat24-trained models stems from their exposure to bond-breaking physics, which is absent in equilibrium-only training data. OMat24 implements three complementary sampling strategies that capture diverse atomic environments: systematic perturbations (±0.5 Å displacements), molecular dynamics at extreme temperatures (up to 3000 K), and rattled structures spanning 300–1000 K. These generate configurations with forces and stresses orders of magnitude beyond typical equilibrium values.
The zero-shot nature of the evaluation makes this insight particularly valuable: models learned bond-breaking physics from training data alone and successfully transferred this knowledge to surface properties, demonstrating true generalization.
“Surface cleavage is fundamentally a bond-breaking process,” notes first author Ardavan Mehdizadeh. “Models trained only on equilibrium structures have never learned the energetics of stretched bonds and unusual coordination environments. In contrast, models exposed to diverse non-equilibrium configurations develop an implicit understanding of how energy changes during bond breaking, precisely the physics governing surface formation.”
This insight has profound implications: rather than requiring expensive surface-specific training data, foundational potentials can learn surface energetics from appropriately sampled bulk structures. The OMat24 dataset’s perturbation-based sampling provides broader coverage of the relevant configuration space than equilibrium-only datasets or even explicit surface-adsorbate systems.
The research also revealed systematic patterns in prediction accuracy across the periodic table. Models struggled most with halogens (Br, I, Cl, showing 40–80% errors) and heavy alkali metals (Cs: 34% error), while excelling at transition metals and alkaline earth metals (< 4% error for elements like Be, Hf, Re, Ta). Low-symmetry crystal systems (triclinic, trigonal) proved more challenging than high-symmetry structures (hexagonal, cubic), suggesting clear directions for targeted training data expansion.

Speed-Accuracy Trade-offs: Simpler Can Be Better

Beyond prediction accuracy, the comprehensive benchmark quantified computational efficiency across architectures, which is a critical consideration for high-throughput materials screening. The results revealed substantial variation: the simplest models (ORB, GRACE) required only 6–8 milliseconds per structure energy prediction on modern GPUs, while sophisticated transformer-based models needed 200+ milliseconds, a 25- to 45-fold speed difference that translates to days versus months for screening tens of millions of candidate surfaces.
Remarkably, when trained on appropriate data, these faster architectures achieved comparable accuracy (7–8% MAPE) to the most complex models. “For high-throughput applications, you can choose a model that’s 30 times faster and still maintains sub-10% errors, as long as it was trained on the right data,” Schindler emphasizes. “Architectural complexity provides marginal gains when training data quality is the limiting factor.”
This finding challenges the field’s current trajectory toward ever larger model architectures. The largest model evaluated (UMA, with 1.4 billion parameters) achieved slightly lower errors than models with 30–150 million parameters, but at substantially increased computational cost. For practical materials discovery workflows, the optimal choice balances accuracy, speed, and the specific application requirements; understanding the speed-accuracy frontier is what enables that flexibility.

The Future

The research team identifies several high-priority directions for advancing uMLIPs. Strategic expansion of training data offers an important opportunity to enhance model performance. Focusing new data generation on chemical systems where current models perform poorly, such as halogens, f-block elements, and low symmetry structures, can improve coverage efficiently without requiring substantial dataset growth. Moreover, future datasets should deliberately sample transition states and metastable configurations relevant to the target applications.
Automated gap identification offers a promising path forward. Machine learning workflows could locate regions of chemical and configuration space where current models yield uncertain predictions and generate targeted training data to close these gaps. Future efforts may further benefit from fine-tuning existing models and applying model distillation to transfer knowledge efficiently within this closed-loop framework.
Finally, rigorous out-of-distribution benchmarks are essential to complement existing evaluations by assessing model generalization to property domains not represented in the training data.

The Impact

This work provides the computational materials science community with clear, actionable guidelines for model selection and establishes cleavage energy prediction as a rigorous benchmark for evaluating out-of-distribution generalization. By demonstrating that strategic training data curation delivers greater impact than architectural innovations for this challenging task, the research reframes development priorities for the next generation of uMLIPs.
The implications extend beyond surface energies. The finding that appropriately sampled bulk training data enables accurate prediction of surface properties suggests a path toward truly general-purpose models that capture diverse materials phenomena without requiring exhaustive property-specific training datasets. As the field moves toward foundational potentials for materials design, this work emphasizes that success depends on understanding what physics to sample during training, not just how many structures to include.
The comprehensive dataset of predictions for all 19 models on 36,718 structures is publicly available at https://doi.org/10.5281/zenodo.16970767, enabling researchers to make informed choices about which models to deploy for surface-related applications. The complete benchmarking code is open-source at https://github.com/d2r2group/mlip-cleavage-benchmark.
“The vision of a single foundational model that accurately predicts all materials properties remains compelling and appears to be within reach,” concludes Schindler. “Our results suggest that realizing this vision relies less on identifying an ideal model architecture and more on curating training datasets that capture the full diversity of atomic environments, from equilibrium ground states to the highly distorted configurations that define property limits. When trained on the right data, even simple models can achieve remarkable generalization to entirely new property domains, as we have demonstrated with cleavage energies.”

Reference: Ardavan Mehdizadeh and Peter Schindler. Surface Stability Modeling with Universal Machine Learning Interatomic Potentials: A Comprehensive Cleavage Energy Benchmarking Study. AI for Science, 2025, 1(2). DOI: 10.1088/3050-287X/ae1408.

 


Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.