News Release

Scientists improve robotic visual–inertial trajectory localization accuracy using cross-modal interaction and selection techniques

Peer-Reviewed Publication

KeAi Communications Co., Ltd.

Image caption: Demonstration of different visual-inertial odometry methods: (a) traditional VIO methods, which rely on handcrafted features and geometry-based optimization; (b) existing deep learning-based methods, which extract features and fuse multi-modal information using deep neural networks; (c) our proposed VIO method, which employs a modal interaction and selection module to enhance the robustness and accuracy of visual-inertial localization.

Credit: Changjun Gu

 In autonomous driving, robotics, and augmented reality, accurate localization remains one of the most challenging problems. Traditional visual–inertial odometry systems often struggle with environmental variations, sensor noise, and multi-modal information fusion, limiting applications such as autonomous vehicles navigating complex urban environments and drones operating in GPS-denied areas.

In a study published in the journal iOptics, a research team from Chongqing University of Posts and Telecommunications proposed a modality fusion strategy built on visual–inertial cross-modal interaction and selection mechanisms. The approach not only improves localization accuracy for robots under GNSS-denied conditions, but also makes the algorithm more robust in complex environments.

“The inspiration for attention mechanisms comes from human visual and cognitive systems. When we look at an image or understand a sentence, we do not process all information equally, but instead focus on the most relevant parts,” explains lead author Associate Professor Changjun Gu. “When reading different information, we typically establish connections between these key pieces, which is the foundation of the transformer architecture widely used today. Therefore, we introduced attention mechanisms in our research.”

The main hurdle lies in effectively leveraging the complementary strengths of visual and inertial sensors while addressing their respective limitations. “Visual sensors provide rich environmental information but are sensitive to changes in lighting conditions and textureless environments, while inertial sensors deliver continuous motion measurements but suffer from accumulated drift over time. Hence, a key question remains: how can a system intelligently combine these modalities to achieve robust and accurate localization across diverse environments?” explains Gu.

To that end, the research team introduced two key components. The first is a global–local cross-attention module, which enables effective knowledge exchange between visual and inertial feature representations. Unlike conventional methods that simply concatenate features, this module allows the system to learn which aspects of each modality are most relevant for accurate localization. By preserving the complementary strengths of each sensor while enabling meaningful cross-modal communication, the system achieves remarkable robustness in challenging scenarios.
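For readers who want a concrete picture, the following PyTorch sketch shows what a cross-attention exchange between visual and inertial features can look like. It is an illustration only, not the authors' implementation; the class name, feature dimensions, and residual connections are assumptions made for clarity.

    import torch
    import torch.nn as nn

    class CrossModalAttention(nn.Module):
        """Illustrative cross-attention: each modality queries the other."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.vis_to_imu = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.imu_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, vis_feat, imu_feat):
            # vis_feat: (batch, N_v, dim) visual tokens; imu_feat: (batch, N_i, dim) inertial tokens
            vis_enh, _ = self.vis_to_imu(query=vis_feat, key=imu_feat, value=imu_feat)
            imu_enh, _ = self.imu_to_vis(query=imu_feat, key=vis_feat, value=vis_feat)
            # residual connections keep each modality's own information intact
            return vis_feat + vis_enh, imu_feat + imu_enh

Unlike plain concatenation, each modality in a scheme like this learns which parts of the other modality are relevant before the features are combined.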

The second is a dual-path dynamic fusion module. Rather than blindly fusing data, this module identifies and selects the most informative features according to current environmental conditions. In well-lit environments with rich visual textures, the system emphasizes visual cues; in challenging lighting or feature-sparse areas, it shifts focus to inertial measurements. This adaptive approach ensures reliable performance across diverse operating conditions.
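This selection step can likewise be pictured as a small learned gate that weights the two modalities for each input. The sketch below is an illustration under assumed shapes and gate design, not the paper's dual-path module.

    import torch
    import torch.nn as nn

    class DynamicFusion(nn.Module):
        """Illustrative gated fusion: weight visual vs. inertial features per sample."""
        def __init__(self, dim=256):
            super().__init__()
            # a small MLP predicts two modality weights from the concatenated features
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

        def forward(self, vis_feat, imu_feat):
            # vis_feat, imu_feat: (batch, dim) pooled features from each branch
            weights = torch.softmax(self.gate(torch.cat([vis_feat, imu_feat], dim=-1)), dim=-1)
            # e.g. a dark or textureless scene can push weight toward the inertial path
            return weights[..., 0:1] * vis_feat + weights[..., 1:2] * imu_feat

Because the weighting is computed from the data itself, this style of fusion is what lets a system shift emphasis between the camera and the IMU as conditions change.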

Project leader Xinbo Gao explains, “Up until now, deep learning–based visual–inertial localization methods have primarily relied on simple feature concatenation followed by a pose regression network, with few studies considering modality fusion and selection. Our approach demonstrates that cross-modal interaction and selection mechanisms can be effectively incorporated into visual–inertial localization, leading to improved accuracy.”

The team hopes this work will encourage researchers to explore visual–inertial modality fusion and selection strategies, thereby enhancing both localization accuracy and generalization ability.

###

Contact the author: Changjun Gu, College of Automation, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.  Email: gucj@cqupt.edu.cn.

The publisher KeAi was established by Elsevier and China Science Publishing & Media Ltd to unfold quality research globally. In 2013, our focus shifted to open access publishing. We now proudly publish more than 200 world-class, open access, English language journals, spanning all scientific disciplines. Many of these are titles we publish in partnership with prestigious societies and academic institutions, such as the National Natural Science Foundation of China (NSFC).


Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.