Image caption: Previously developed systems for the automated assessment of speaking proficiency focus on limited assessment criteria. In contrast, a novel multimodal spoken English evaluation dataset, comprising synchronized audio, video, and text transcripts, permits a more comprehensive and interpretable assessment.
Image credit: Candy Olivia Mawalim of JAIST. Image source: https://doi.org/10.1016/j.caeai.2025.100386
Ishikawa, Japan -- Spoken English proficiency—the ability to communicate effectively in spoken English—is a key determinant of both academic and professional success. Traditionally, the degree of mastery over English grammar, vocabulary, pronunciation, and communication skills has been assessed through tedious and expensive human-administered tests. However, with the advent of artificial intelligence (AI) and machine learning in recent years, automated spoken English assessment tests have gained immense popularity among researchers worldwide.
While monologue-based speaking assessments are prevalent, they lack real-world relevance, particularly in settings where dialogue or group interaction is crucial. Moreover, research on the automated assessment of spoken English skills in interactive settings remains limited and often focuses on a single modality, such as text or audio. In this light, a team of researchers led by Professor Shogo Okada and including Assistant Professor Candy Olivia Mawalim from the Japan Advanced Institute of Science and Technology (JAIST) has developed a multioutput learning framework that can simultaneously assess multiple aspects of spoken English proficiency. Their findings were published online in the journal Computers and Education: Artificial Intelligence on March 20, 2025.
The researchers utilized a novel spoken English evaluation (SEE) dataset comprising synchronized audio, video, and text transcripts from open-ended, high-stakes interviews with adolescents (9-16 years old) applying to high schools and universities. The dataset was collected through a real interview service operated by Vericant and is particularly notable for incorporating expert-assigned scores, supervised by researchers from the Educational Testing Service (ETS), across a range of speaking skill dimensions, enabling a rich, multimodal analysis of English proficiency.
Dr. Mawalim shares, “Our framework allows for the modeling and integration of different aspects of speaking proficiency, thereby improving our understanding of the various underlying factors. Also, by incorporating open-ended interview settings in our assessment framework, we can gauge an individual’s ability to engage in spontaneous and creative communication and their overall sociolinguistic competence.”
The multioutput learning framework developed by the team integrates acoustic features such as prosody, visual cues like facial action units, and linguistic patterns such as turn-taking. Compared to unimodal approaches, this multimodal strategy significantly enhanced prediction accuracy, achieving an overall SEE score prediction accuracy of approximately 83% using the Light Gradient Boosting Machine (LightGBM) algorithm.
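For readers curious how such a setup can be assembled in practice, the sketch below wraps a LightGBM regressor in scikit-learn's multioutput wrapper to predict several skill scores at once. This is a minimal illustration, not the authors' code: the feature groups, their sizes, the number of score dimensions, and the randomly generated data are all placeholder assumptions standing in for the paper's fused multimodal features.

```python
# Minimal sketch (not the authors' implementation) of multioutput score
# prediction with LightGBM over fused multimodal features.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)

# Hypothetical per-interview feature vectors, concatenated across modalities.
X = np.hstack([
    rng.normal(size=(200, 20)),  # stand-in for prosodic (audio) features
    rng.normal(size=(200, 17)),  # stand-in for facial action unit (video) features
    rng.normal(size=(200, 8)),   # stand-in for turn-taking / linguistic features
])
# Hypothetical expert-assigned scores for four speaking-skill dimensions.
y = rng.normal(size=(200, 4))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# One LightGBM regressor per skill dimension via the multioutput wrapper.
model = MultiOutputRegressor(LGBMRegressor(n_estimators=200, learning_rate=0.05))
model.fit(X_tr, y_tr)
preds = model.predict(X_te)  # shape: (n_samples, n_skill_dimensions)
```

Predicting all skill dimensions jointly from the same fused feature vector is what lets the approach relate different facets of proficiency to shared multimodal evidence, rather than training isolated single-score models.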
“The findings of our study have broad implications, offering diverse applications for stakeholders across various fields,” states Prof. Okada. “Besides providing direct, actionable insights for students to improve their spoken English proficiency, our approach can help teachers tailor their instruction to address individual student needs. Moreover, our multioutput learning framework can aid the development of more transparent and interpretable models for the assessment of spoken language skills.”
The scientists also studied the importance of utterance sequence in spoken English proficiency. An analysis based on Bidirectional Encoder Representations from Transformers (BERT), a pre-trained deep learning language model, revealed that the initial utterance carried particular weight in predicting spoken proficiency. Furthermore, the influence of external factors, such as interviewer behavior and the interview setting, on spoken English proficiency was also assessed. The analyses showed that specific features, including interviewer speech, gender, and whether the interview was conducted in person or remotely, significantly impacted the coherence of the interviewees’ responses.
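To give a sense of how interview utterances can be turned into inputs for such an analysis, the sketch below encodes example utterances with a pre-trained BERT model from the Hugging Face transformers library. This is a minimal illustration, not the authors' pipeline: the example utterances are invented, and the use of the [CLS] token embedding as a fixed-size utterance representation is one common convention, assumed here for simplicity.

```python
# Minimal sketch (not the authors' pipeline) of encoding interview
# utterances with a pre-trained BERT model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Invented example utterances; the first stands in for an initial utterance.
utterances = [
    "Hello, my name is Alex and I enjoy reading science books.",
    "I think teamwork is important because everyone contributes something.",
]

with torch.no_grad():
    for i, text in enumerate(utterances):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        outputs = model(**inputs)
        # The [CLS] token embedding serves as a fixed-size representation
        # that could feed a downstream proficiency predictor.
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        print(f"utterance {i}: embedding shape {tuple(cls_embedding.shape)}")
```

Comparing how much each utterance's representation contributes to a downstream score predictor is one way the position-dependent importance reported by the team, such as the weight of the initial utterance, can be examined.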
“With the rapid growth of AI-driven technologies and their expanding integration into our daily lives, multimodal assessments could become standard in educational settings in the near future. This can enable students to receive highly personalized feedback on their communication skills, not just language proficiency. This could lead to tailored curricula and teaching methods, helping students to hone and develop crucial soft skills like public speaking, presentation, and interpersonal communication more effectively,” says Dr. Mawalim, the lead author of the present study.
Taken together, the research offers a more nuanced and interpretable approach to automated spoken English assessment and lays the groundwork for developing intelligent, student-centered tools in educational and professional contexts.
###
Title of original paper: Beyond accuracy: Multimodal modeling of structured speaking skill indices in young adolescents
Authors: Candy Olivia Mawalim*, Chee Wee Leong, Guy Sivan, Hung-Hsuan Huang, and Shogo Okada
Journal: Computers and Education: Artificial Intelligence
DOI: 10.1016/j.caeai.2025.100386
About Japan Advanced Institute of Science and Technology, Japan
Founded in 1990 in Ishikawa prefecture, the Japan Advanced Institute of Science and Technology (JAIST) was the first independent national graduate university in Japan to have its own campus. Now, after more than 30 years of steady progress, JAIST has become one of Japan’s top-ranking universities. JAIST strives to foster capable leaders with a state-of-the-art education system where diversity is key; about 40% of its alumni are international students. The university has a unique style of graduate education based on a carefully designed coursework-oriented curriculum to ensure that its students have a solid foundation on which to carry out cutting-edge research. JAIST also works closely with both local and overseas communities by promoting industry–academia collaborative research.
About Professor Shogo Okada from Japan Advanced Institute of Science and Technology, Japan
Dr. Shogo Okada serves as a Professor at Japan Advanced Institute of Science and Technology (JAIST), Japan. He received his PhD from Tokyo Institute of Technology in 2008. Dr. Okada has been an active researcher in the fields of computational intelligence and systems science and has 130 publications to his credit. His research interests include multimodal interaction, machine learning, and social signal modeling. He currently heads the Social Signal and Interaction Group at JAIST, Japan.
About Dr. Candy Olivia Mawalim from Japan Advanced Institute of Science and Technology, Japan
Dr. Mawalim serves as an Assistant Professor at JAIST, Japan. She was selected as a JSPS Research Fellow for Young Scientists (DC1). Her research has been published in several top conferences and Q1 journals, including ACM Transactions on Multimedia Computing, Communications, and Applications, Applied Acoustics, and Computer Speech & Language. She serves as a member of the appointed team for the ISCA SIG-SPSC (Security & Privacy in Speech Communication), where her responsibilities encompass the educational aspects of the group’s activities, including organizing the monthly SPSC webinar and serving on the technical committee of the SPSC Symposium (2023 and 2024).
Funding information
This work was partially supported by JSPS KAKENHI (22H00536, 23H03506).
Journal
Computers and Education: Artificial Intelligence
Article Title
Beyond accuracy: Multimodal modeling of structured speaking skill indices in young adolescents
Article Publication Date
20-Mar-2025