Speech deepfakes continue to challenge researchers
University of Eastern Finland
image: Professor Tomi Kinnunen, University of Eastern Finland.
Credit: Niko Jouhkimainen/University of Eastern Finland.
Creating speech deepfakes is becoming increasingly easy. Not so long ago, the Finnish language still posed an obstacle, but not anymore.
“Today, anyone can create a speech deepfake. In the past, it took greater technical dedication, but nowadays, numerous voice cloning services are available to virtually anyone,” says Professor Tomi Kinnunen of the School of Computing at the University of Eastern Finland.
Speech synthesis could, in principle, be used to deceive biometric authentication systems, to carry out scam calls, or to spread disinformation on social media. It is therefore essential to understand when automatic systems and humans can be deceived, and to develop countermeasures accordingly.
“Such countermeasures include, for instance, speech deepfake detection and deepfake source tracing, that is, identifying the voice cloning or synthesis software used to create the deepfake. In the case of biometric authentication, the aim is to improve the robustness of systems against various attacks,” Kinnunen notes.
“Neural networks and artificial intelligence are widely used in research in this field. Personally, however, I’ve felt it important to move on to more interpretable methods in which the detection method can ‘justify’ its decisions.”
Developing automated deepfake detection
Speech research is evolving rapidly, and there is plenty to investigate. The field is interdisciplinary, drawing on machine learning, data collection, speech sciences and explainable AI.
According to Kinnunen, deepfake research is like playing cat and mouse. Recent years have seen significant advances in the accuracy of detection methods and countermeasures, but model generalisation remains a major challenge.
“Machine learning is based on fitting models to large sets of training data, and models can easily overfit to the training data used. As a result, the detection of speech deepfakes created with previously unseen synthesis techniques becomes difficult,” he explains.
“An additional challenge arises from the fact that real-world deepfakes often contain encoded or compressed speech, which masks the artefacts produced by speech synthesis. This makes detection more difficult.”
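One common way to probe the generalisation problem described above is to hold out every synthesis technique in turn, so that the detector is always tested on a technique it never saw during training. The following is a minimal sketch of such a split, not the protocol used by Kinnunen's group; the function name, the `(utterance_id, attack_label)` tuple layout and the labels are all hypothetical.

```python
def leave_one_attack_out(samples):
    """Yield (held_out, train, test) splits over synthesis techniques.

    `samples` is a list of (utterance_id, attack_label) pairs, where
    attack_label is None for genuine speech (an assumed convention).
    """
    attacks = sorted({label for _, label in samples if label is not None})
    genuine = [s for s in samples if s[1] is None]
    for held_out in attacks:
        # Training fakes: every technique except the held-out one.
        train = [s for s in samples if s[1] is not None and s[1] != held_out]
        # Test fakes: only the held-out, "previously unseen" technique.
        test = [s for s in samples if s[1] == held_out]
        # Genuine speech is split alternately between the two sides.
        yield held_out, train + genuine[0::2], test + genuine[1::2]


# Toy example with made-up utterance IDs and technique labels.
toy = [("u1", None), ("u2", None), ("u3", "tts_a"),
       ("u4", "tts_b"), ("u5", "vc_c")]
splits = list(leave_one_attack_out(toy))
```

A detector that scores well on the training techniques but poorly on each held-out one is showing the overfitting behaviour the quote describes.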
Speech technology research relies on signal processing and machine learning, in practice mostly deep neural network models trained on large datasets.
“We are currently developing automated speech deepfake detection to determine whether speech is genuine or synthetic. We are also working on synthetic speech source tracing, in other words, examining the speech synthesis technique used to create the deepfake.”
In the ongoing SPEECHFAKES project funded by the Research Council of Finland, researchers have developed methods for identifying the sub-components of synthesis methods used to create speech deepfakes.
The project has also created entirely new metrics for assessing accuracy. The challenge is to objectively evaluate and compare different detection solutions to understand which models generalise best, and under which circumstances systems are likely to err.
“When biometric authentication is combined with deepfake detection, something as self-evident as accuracy assessment becomes less straightforward,” Kinnunen says.
The study Kinnunen refers to was published in IEEE Transactions on Pattern Analysis and Machine Intelligence, one of the leading machine learning journals.
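For context, a conventional way to score a standalone deepfake detector is the equal error rate (EER): the operating point at which the rate of genuine speech being rejected equals the rate of fake speech being accepted. The sketch below computes it from two lists of detector scores. It is only this familiar baseline, not the new metrics developed in the project, and it assumes that a higher score means "more likely genuine". The function name and the toy scores are illustrative.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Return (eer, threshold), assuming higher scores mean 'genuine'."""
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # Miss rate: genuine utterances scored below the threshold.
        miss = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False alarm rate: fake utterances scored at or above it.
        false_alarm = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            # On finite data the two rates may never match exactly,
            # so report their average at the closest crossing point.
            best = (gap, (miss + false_alarm) / 2, t)
    return best[1], best[2]


# Made-up scores for illustration only.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

Because a single threshold-free number like this describes the detector in isolation, it says little about how errors interact once a speaker verification system is added to the pipeline, which is the difficulty Kinnunen points to.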
“Our goal is to further improve method accuracy and interpretability. We will certainly be seeing more new AI-based voice cloning services and tools in the future.”