Kid-friendly audio AI
TACC supercomputers power privacy-protected speech recognition
University of Texas at Austin
image: The Lonestar6 (left) supercomputer and Corral data management and storage systems (right) at the Texas Advanced Computing Center are resources available through allocations awarded by the UT System Research Cyberinfrastructure (UTRC).
Credit: TACC
From the voice-to-text feature on your phone to the captions that make videos more accessible, speech transcription is already woven into everyday life. Behind the scenes, artificial intelligence is doing the heavy lifting, transforming the spoken word into text with speed and accuracy that once seemed impossible.
At the Texas Advanced Computing Center, the Lonestar6 supercomputer is helping UT Dallas speech scientists push the boundaries of Automatic Speech Recognition (ASR) for children. By converting audio into mathematical abstractions called ‘discrete speech units’, a form of anonymized encoding, researchers can identify speech and language problems in young children and enable faster interventions to help them.
“The goal is for us to be able to comprehend and understand how children speak,” said Satwik Dutta, a PhD student in the Erik Jonsson School of Engineering and Computer Science and Eugene McDermott Graduate Fellow at UT Dallas. Dutta and his advisor John H.L. Hansen, Distinguished Chair in Telecommunications and Professor in Electrical Engineering, co-authored a study on developing child ASR systems published in the International Journal of Human-Computer Studies (May 2025).
“Over the years, developing such automatic speech recognition systems has been very challenging, especially for children,” Dutta said. “That’s because children, especially those under the age of eight, are still developing their spoken and vocal skills, and their knowledge of grammar. Their speech can look very different to most open-source ASR systems created with adult speech data, resulting in poor model performance with kids’ speech.”
Dutta is contributing to a National Science Foundation–funded project at UT Dallas called Measuring Interactions in Classrooms. Led by Hansen in collaboration with study co-author Dwight Irvin of the Anita Zucker Center for Excellence in Early Childhood Studies at the University of Florida, the project also includes partners from the University of Kansas, bringing together a multi-institutional team to advance early childhood research.
When the project began under COVID-19 restrictions, the researchers were limited to existing datasets from more than a thousand children recorded through headsets during virtual tutorials. Once restrictions eased, the team was able to gather new data in real-world settings, recording preschool children in noisy childcare classrooms using a small recorder called a LENA device, discreetly tucked into the pocket of a custom T-shirt.
TACC Supercomputers Advance Children’s Speech Research
This project studies a new aspect of automatic speech recognition using discrete speech units, which can be viewed as mathematically abstract representations of speech. The key benefit: once the output sequence of discrete speech units is created, it is virtually impossible to go backward and reconstruct the original speech waveform, which introduces a degree of privacy protection.
“As soon as the speech is loaded you can convert it into discrete speech units, then you don’t have any concerns of violating privacy because the speech is gone. You can no longer generate it,” Dutta said.
The process of converting to discrete speech units also strips out redundant layers of data content, reducing the overall training and computational requirements for the ASR model.
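As an illustration of the general idea, the sketch below maps acoustic feature frames to their nearest entry in a codebook of discrete units, then collapses consecutive repeats to remove frame-level redundancy. The feature vectors, codebook, and function names are invented for demonstration only; real systems typically derive the frames from a learned speech model and train the codebook from data, details the article does not specify.

```python
import math
import random

def nearest_unit(frame, codebook):
    """Return the index of the codebook vector closest to this feature frame."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))

def to_discrete_units(features, codebook):
    """Quantize each frame to a unit ID, then collapse consecutive repeats.

    The resulting ID sequence is a lossy abstraction: the original
    waveform cannot be reconstructed from it.
    """
    units = [nearest_unit(f, codebook) for f in features]
    return [units[0]] + [u for prev, u in zip(units, units[1:]) if u != prev]

# synthetic demo: 50 frames of 8-dim "features", quantized against 16 units
random.seed(0)
features = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]
codebook = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
units = to_discrete_units(features, codebook)
```

Because repeated frames collapse into one unit, the output sequence is usually much shorter than the frame sequence, which is one source of the reduced compute cost mentioned above.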
“That’s where TACC proved indispensable. My discrete speech-based ASR model had only 40 million parameters. Using TACC systems, I was able to get similar performance to an end-to-end ASR model, which had 428.96 million parameters, more than 10 times the size.”
TACC awarded allocations on the Lonestar6 supercomputer and the Corral data storage system through the UT System Research Cyberinfrastructure (UTRC), which provides computational resources to researchers at all 14 UT System institutions.
“Voice-based data is computationally expensive, and I needed to compare my results with modern state-of-the-art systems. Without TACC that would not have been possible. We also appreciated the protected storage on Corral and protected nodes of Lonestar6 to run our processes,” Dutta added.
The graphics processing units (GPUs) on Lonestar6 are well-suited for artificial intelligence workloads such as developing the deep learning models used in this work.
More recent work accepted to the 7th ISCA Workshop on Child Computer Interaction (WOCCI 2025) explores the use of an ASR model called Whisper (originally developed by OpenAI) with the goal of running it on-device on a Raspberry Pi 5 (8GB). The Pi works as an edge device that transcribes audio locally and discards the raw voice data once it is processed. Using Lonestar6 for model evaluation, fine-tuning, and comparison, this research is advancing the development of child-focused speech recognition systems with stronger built-in privacy protections.
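The transcribe-then-discard behavior described above can be sketched as follows. The helper function and the stand-in transcriber are hypothetical illustrations of the pattern, not code from the study; on a real device, `transcribe_fn` would be a call into an ASR model such as Whisper.

```python
import os
import tempfile

def transcribe_and_discard(audio_path, transcribe_fn):
    """Run ASR on a recording, then delete the raw audio so only text leaves the device."""
    try:
        text = transcribe_fn(audio_path)  # placeholder for the real ASR call
    finally:
        os.remove(audio_path)  # raw child speech never persists on disk
    return text

# demo with a stand-in transcriber instead of a real ASR model
fd, path = tempfile.mkstemp(suffix=".wav")
os.close(fd)
result = transcribe_and_discard(path, lambda p: "hello")
```

Deleting the file in a `finally` block ensures the raw audio is discarded even if transcription fails, which is the privacy guarantee the edge-device design aims for.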
“Using supercomputers to study speech is new, innovative, and can accelerate research into speech AI for so many applications: educational, clinical, forensic, anywhere you can find speech,” Dutta concluded. “I think as a scientist, if you’re working on applications for children, the first thing you should think about is how it preserves children’s privacy. Whatever we do, it should be trustworthy and ethical. I envision a safe digital future for all children.”