Feature Story | 5-Jul-2023

A neural machine translation system for all the Romance languages of the Iberian Peninsula

A project will apply neural machine translation to Spanish, Portuguese, Catalan, Galician, Asturian, Aragonese and Aranese

Universitat Oberta de Catalunya (UOC)

Recent years have seen an explosion in the number and effectiveness of machine translation technologies. Thanks to artificial intelligence, we all carry in our pockets powerful tools that can easily translate between the most widely spoken languages. But what about languages with fewer speakers and resources? How can an AI "learn" them? For the Romance languages of the Iberian Peninsula, the answer may lie in transfer learning and multilingual system training.

The Neural Machine Translation for the Languages of the Iberian Peninsula (TAN-IBE) project is funded by the Spanish Ministry of Science, Innovation and Universities and coordinated by the Universitat Oberta de Catalunya (UOC), with the universities of Oviedo, Lleida and Zaragoza also taking part. It explores the most effective techniques for training machine translation systems based on neural networks (a type of AI), applied to seven of the Romance languages of the Iberian Peninsula: Spanish, Portuguese, Catalan, Galician, Asturian, Aragonese and Aranese.

An AI that transfers knowledge between languages

Neural network-based translation systems are trained on millions of sentences in one language paired with their translations into another. These vast bilingual datasets are known as parallel corpora. Once the neural network has been trained, it can translate new text between those two languages. The problem is that, whilst these parallel corpora are easy to find for languages such as Spanish and Portuguese, languages with less material available, such as Aranese, Aragonese and Asturian, rarely have enough data to train the artificial intelligence.
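As a rough illustration of what such a corpus looks like in practice, the sketch below pairs up two line-aligned texts, one sentence per line, which is a common way of storing parallel data. The function name `read_parallel` and the two-sentence Spanish/Portuguese sample are assumptions made for this example; they are not project data or project code.

```python
import io
from typing import List, TextIO, Tuple


def read_parallel(src: TextIO, tgt: TextIO) -> List[Tuple[str, str]]:
    """Pair line i of the source text with line i of the target text."""
    src_lines = [line.strip() for line in src]
    tgt_lines = [line.strip() for line in tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    # Drop pairs where either side is empty: they carry no translation signal.
    return [(s, t) for s, t in zip(src_lines, tgt_lines) if s and t]


# Toy sample (illustrative only): Spanish on one side, Portuguese on the other.
spanish = io.StringIO("Buenos días\nMuchas gracias\n")
portuguese = io.StringIO("Bom dia\nMuito obrigado\n")

corpus = read_parallel(spanish, portuguese)
for es, pt in corpus:
    print(f"{es}  |||  {pt}")
```

A real training corpus would hold millions of such pairs. For Aranese, Aragonese and Asturian, the difficulty is that far fewer exist.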

"The good thing is that neural systems can learn things about a language from another, similar language," explained Antoni Oliver, member of the UOC's Faculty of Arts and Humanities and researcher at the Linguistic Applications Inter-University Research Group (GRIAL-UOC), which is coordinating the TAN-IBE project. "That's why we've chosen Romance languages. The process needs to be able to learn by transfer, using a model between two languages to construct the translation system between another two. So, for example, when it's completed, the Spanish/Aranese translation tool will have done some of its learning from the Spanish/Catalan or the Spanish/Portuguese systems."
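The principle Oliver describes can be shown with a deliberately tiny numerical stand-in. The sketch below is not a translation model and not the project's method: it fits a straight line, first on a data-rich "parent" task, then reuses the learned weights as the starting point for a similar, data-poor "child" task. All names (`train`, `mse`) and all numbers (the slopes, the sample sizes, the step counts) are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple

Data = List[Tuple[float, float]]
Weights = Tuple[float, float]


def train(weights: Weights, data: Data, steps: int, lr: float = 0.1) -> Weights:
    """Gradient descent on mean squared error for the model y = w * x + b."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def mse(weights: Weights, data: Data) -> float:
    """Mean squared error of the model on a dataset."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable

# "High-resource" parent task: plenty of examples of y = 2.0x + 1.0.
parent_data = [(x, 2.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(200))]
# "Low-resource" child task: a similar rule, but only ten training examples.
child_data = [(x, 2.2 * x + 0.9) for x in (rng.uniform(-1, 1) for _ in range(10))]
child_test = [(x, 2.2 * x + 0.9) for x in (rng.uniform(-1, 1) for _ in range(100))]

parent = train((0.0, 0.0), parent_data, steps=500)

# Same small training budget for both; only the starting point differs.
transferred = train(parent, child_data, steps=5)
scratch = train((0.0, 0.0), child_data, steps=5)

print("held-out error, starting from parent :", mse(transferred, child_test))
print("held-out error, starting from scratch:", mse(scratch, child_test))
```

Because the two tasks resemble each other, the model that starts from the parent's weights ends up with the lower held-out error after the same few training steps. This is the intuition behind choosing closely related languages.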

The construction of the translation model is not the only goal of this research project. It also seeks to:

  • Compile parallel and monolingual corpora for the seven featured Romance languages, with a particular focus on Asturian, Aragonese and Aranese.
  • Explore new techniques for the training of neural machine translation systems. In addition to transfer learning, the project will study multilingual machine translation, self-supervised machine translation and unsupervised machine translation.
  • Train neural machine translation systems between Spanish and the rest of the project's languages, in both directions.
  • Train multilingual systems able to translate from and into all the project's languages.
  • Create guides and scripts to help train neural machine translation systems in general and, more specifically, for the project's languages.
  • Publish the project's results with open licences. This includes the corpora, the machine translation models and engines, and the guides and scripts.
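One widely used way to train a single system on several language pairs at once is to prepend a token naming the desired target language to each source sentence. Whether TAN-IBE will do this is not stated; the sketch below simply assumes it. The `<2language>` token format, the plain-word language labels and the function names are hypothetical, and the sample sentences are toy data.

```python
from typing import Dict, List, Tuple

# Illustrative labels for the project's seven languages (not official codes).
LANGUAGES = [
    "spanish", "portuguese", "catalan", "galician",
    "asturian", "aragonese", "aranese",
]

Pair = Tuple[str, str]


def tag_source(sentence: str, target_lang: str) -> str:
    """Prefix a source sentence with a token naming the target language."""
    if target_lang not in LANGUAGES:
        raise ValueError(f"unknown language: {target_lang}")
    return f"<2{target_lang}> {sentence}"


def build_examples(corpora: Dict[Pair, List[Pair]]) -> List[Pair]:
    """Merge bilingual corpora into one training set covering both directions."""
    examples: List[Pair] = []
    for (src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            examples.append((tag_source(src, tgt_lang), tgt))  # src -> tgt
            examples.append((tag_source(tgt, src_lang), src))  # tgt -> src
    return examples


# Toy corpora (illustrative only).
corpora = {
    ("spanish", "portuguese"): [("Buenos días", "Bom dia")],
    ("spanish", "catalan"): [("Buenos días", "Bon dia")],
}

examples = build_examples(corpora)
for tagged_source, target in examples:
    print(f"{tagged_source}  ->  {target}")
```

With every pair and direction in one training set, a single model sees all the languages together, so the low-resource pairs can benefit from the data-rich ones.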

"Broadly speaking, the project comprises, firstly, compiling all the corpora for those languages with less material (Asturian, Aragonese and Aranese) and, secondly, training the translation systems," added Oliver. "The end result of the project will be both the open publication of the resources, insofar as this is possible, and the creation of a free-to-use neural machine translation system."

Agreements and studies to promote minority languages

The first part of the project is taking place outside the lab. To obtain the data needed to train the artificial intelligence models, the team must compile as much material as possible in Asturian, Aragonese and Aranese. "That's why this first phase focuses on securing agreements with regional governments, universities and publishers to provide the materials for creating the parallel corpora to train the neural system," said Oliver.

In this regard, an important agreement was signed this past May with the Government of Asturias, under which the government will make available to the project the entire corpus of texts translated from Spanish into Asturian held by its Directorate General of Language Policy. The agreement also stipulates that the Government of Asturias can, on request, gain access to the technological and linguistic developments achieved by the TAN-IBE project for use in any machine translation projects of its own.

"Ultimately, our goal with this project is to help promote the use of these languages with fewer resources and foster more publishing in them," said Oliver. "For example, all laws could be published in two languages, quickly and efficiently, using fewer resources, although a human review would always be required. What's more, those who don't dare to use these languages because they don't feel confident enough can use these tools as support for improving their texts. Lastly, languages like Asturian, Aragonese and Aranese need to be included in digital technologies. If not, they may start disappearing and be forgotten."

This UOC research supports UN Sustainable Development Goal 4: ensure inclusive and equitable quality education and promote lifelong learning opportunities for all.

UOC R&I

The UOC's research and innovation (R&I) is helping overcome pressing challenges faced by global societies in the 21st century by studying interactions between technology and the human and social sciences, with a specific focus on the network society, e-learning and e-health.

Over 500 researchers and more than 50 research groups work in the UOC's seven faculties, its eLearning Research programme and its two research centres: the Internet Interdisciplinary Institute (IN3) and the eHealth Center (eHC).

The university also develops online learning innovations at its eLearning Innovation Center (eLinC), and supports UOC community entrepreneurship and knowledge transfer via the Hubbik platform.

Open knowledge and the goals of the United Nations 2030 Agenda for Sustainable Development serve as strategic pillars for the UOC's teaching, research and innovation. More information: research.uoc.edu.
