Survey of 50+ multilingual AI models reveals 3 core hurdles to fair global coverage
Analysis of 50+ models finds English makes up 70–80% of training corpora, leaving 100+ languages underserved.
Higher Education Press
A new survey of multilingual large language models (MLLMs), authored by researchers at Beijing Foreign Studies University, finds that while these systems hold promise for bringing AI to speakers of dozens of languages, three core hurdles—uneven training data, imperfect cross-language alignment, and embedded bias—must be overcome to ensure fair, accurate performance across the globe.
“This survey is a wake-up call,” says Prof. Yuemei Xu. “We’ve seen how uneven data and hidden biases can undermine the very promise of AI for millions of speakers. Our roadmap shows that with balanced, diverse corpora and stronger alignment methods, we can build truly global, fair systems that serve every language community.”
AI Tools Strive to Serve 100+ Languages in Education, Healthcare & Legal Sectors
As AI assistants, translation tools and automated content generators expand worldwide, ensuring they serve not just English speakers but also speakers of hundreds of languages is critical. Governments and industries relying on these models for education, healthcare, or legal support need confidence that the technology treats all languages—and their users—equitably. At the same time, researchers must understand where these models fall short to develop safer and more inclusive systems.
Core Hurdles Exposed: Language Imbalance, Misalignment & Bias Stall Multilingual AI Progress
The review highlights several key observations that underscore current challenges and opportunities in multilingual AI:
- Language Imbalance: Most MLLMs are trained on datasets that are overwhelmingly skewed toward high-resource languages such as English, leaving many languages underserved.
- Alignment Challenges: Models struggle to learn a truly “universal” representation of meaning; performance on typologically distant or low-resource languages remains uneven.
- Embedded Bias: Training texts carry social and cultural biases that the models absorb, potentially leading to unfair or harmful outputs in multiple languages. The survey catalogs various bias types and reviews existing techniques for detecting and mitigating them.
- Cross-Lingual Transfer: While some multilingual systems can carry knowledge from well-represented languages into related but less common ones, this benefit diminishes beyond a certain number of languages—a phenomenon known as the “curse of multilinguality.”
- Data Diversity Matters: Models pre-trained on more balanced, linguistically diverse corpora exhibit stronger performance on low-resource languages compared to those trained on English-dominated data.
Analysis of 50+ Models Reveals How Training Data, Architecture & Bias Metrics Shape Performance
The authors performed a systematic review of over fifty MLLMs released in the past five years, analyzing their training data, architectures and evaluation results. They quantified language distributions in popular corpora (e.g., Common Crawl, Wikipedia), compared methods for aligning word and sentence representations across languages, and surveyed bias evaluation metrics and debiasing techniques. Their approach combined quantitative analysis of reported model performance with qualitative assessment of research trends.
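The corpus-level language-distribution audit described above can be sketched in a few lines. The toy example below (an illustration, not the authors’ actual pipeline) tallies per-document language tags and reports each language’s share of a corpus, the kind of measurement that exposes the English skew the survey quantifies:

```python
from collections import Counter

def language_shares(doc_langs):
    """Given a list of per-document language tags, return each
    language's share of the corpus as a fraction of all documents."""
    counts = Counter(doc_langs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# Hypothetical corpus tags mirroring the skew the survey reports:
# high-resource languages dominate raw web-crawl data.
tags = ["en"] * 75 + ["zh"] * 10 + ["es"] * 8 + ["sw"] * 4 + ["yo"] * 3
shares = language_shares(tags)
print(shares["en"])  # 0.75 — English dominates this toy sample
```

In a real audit, the per-document tags would come from a language-identification step over a corpus such as Common Crawl or Wikipedia; the share computation itself is unchanged.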
Roadmap to Global AI
To build truly global AI, the field must prioritize balanced, high-quality multilingual data; develop alignment methods that work well for all language pairs; and create robust, language-agnostic bias detection and mitigation strategies. This survey, published in Frontiers of Computer Science in April 2025 (https://doi.org/10.1007/s11704-024-40579-4), lays out those research roadmaps, offering a blueprint for the next generation of fairer, more capable multilingual AI.