New audio analysis model boosts sound event detection accuracy by 12.7% for smarter home and city monitoring
Channel attention mechanism suppresses background noise, sharpening machine listening
Higher Education Press
A new audio analysis model, EFAM, significantly enhances the ability of machines to detect and distinguish everyday sounds in recordings, overcoming the challenges posed by overlapping events and background noise. By boosting detection accuracy by 12.7% over a standard baseline system, this work paves the way for more reliable sound monitoring in homes, cities, and industrial settings.
Why the World Needs Better Sound Detection
Accurate sound event detection is crucial for applications ranging from smart home assistants and security systems to wildlife monitoring and public safety alerts. Improved detection can help emergency services respond more quickly, assistive devices better serve people with hearing impairments, and environmental agencies more accurately track noise pollution and wildlife activity. Policymakers and industry leaders can leverage these advances to set more effective safety standards and to design next-generation audio-driven technologies.
12.7% Accuracy Boost: Key Metrics Behind the Next Leap in Audio Event Recognition
The study’s results highlight significant improvements in detection performance metrics:
- On the DESED validation set, EFAM achieved a PSDS1 score of 0.489, outperforming the baseline by 12.7%.
- The model reached a PSDS2 score of 0.771, indicating stronger recognition and classification of events than the baseline.
- EFAM’s class-balanced F1 score rose to 0.567, marking notable gains in correctly identifying event frames (a worked illustration of this metric follows the list).
- Ablation tests confirmed that each module—the Bi-Path Fusion Convolution, Channel Attention Mechanism, and Dual-head Self-Attention Pooling—contributed measurable improvements to overall performance.
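To make the class-balanced F1 figure above concrete, the following minimal Python sketch shows how a class-balanced (macro-averaged) frame-level F1 score is typically computed: each class gets its own F1, and the scores are averaged with equal weight so rare sound classes count as much as frequent ones. The event classes, frame counts, and predictions below are invented for illustration and are not data from the EFAM study.

```python
# Minimal sketch: class-balanced (macro-averaged) frame-level F1.
# The clip, class names, and predictions are invented for illustration only.
import numpy as np
from sklearn.metrics import f1_score

CLASSES = ["speech", "dog", "alarm"]          # hypothetical event classes

# Ground-truth frame activity (frames x classes); 1 = event active in frame.
y_true = np.array([
    [1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
    [0, 1, 1], [0, 0, 1], [0, 0, 1], [0, 0, 0],
])

# Model posteriors for the same frames, thresholded at 0.5 to get decisions.
y_prob = np.array([
    [0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.7, 0.6, 0.1], [0.6, 0.8, 0.2],
    [0.2, 0.7, 0.6], [0.1, 0.4, 0.9], [0.1, 0.2, 0.8], [0.2, 0.1, 0.3],
])
y_pred = (y_prob >= 0.5).astype(int)

# "Class-balanced" means each class's F1 is computed separately, then averaged
# with equal weight, so rare classes matter as much as frequent ones.
per_class = f1_score(y_true, y_pred, average=None)
macro = f1_score(y_true, y_pred, average="macro")

for name, score in zip(CLASSES, per_class):
    print(f"F1[{name}] = {score:.3f}")
print(f"class-balanced (macro) F1 = {macro:.3f}")
```

Note that the paper's headline PSDS1 and PSDS2 scores come from the separate polyphonic sound detection score protocol, which additionally accounts for detection timing and cross-class confusions; the sketch above only illustrates the class-balanced averaging idea.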
New Semi-Supervised Pipeline Uses Bi-Path Fusion, Channel Attention and Dual-Head Pooling to Isolate Overlapping Sounds
To achieve these results, the team built EFAM on a Mean-Teacher semi-supervised framework, which learns from both labeled and unlabeled audio clips. They enhanced feature extraction with a Bi-Path Fusion Convolution module, designed to capture low- and high-level audio cues through parallel convolution paths. A Channel Attention mechanism then reweighted feature channels to highlight critical sound signatures while suppressing background noise. Finally, a Dual-head Self-Attention Pooling function aggregated frame-level predictions, sharpening the model’s focus on actual event occurrences. Pretrained BEATs embeddings supplied robust initial audio representations learned from large-scale audio data.
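The paper's exact architecture is not reproduced here, but the PyTorch sketch below shows one plausible reading of the three named components: a two-path convolution that fuses fine local detail with wider context, a squeeze-and-excitation style channel attention that reweights feature channels, and a dual-head self-attention pooling that turns frame-level scores into clip-level ones. All layer sizes, module names, and design choices (such as the SE-style gating and summation fusion) are assumptions for illustration, not the authors' implementation; the Mean-Teacher consistency training and BEATs embeddings are omitted for brevity.

```python
# Illustrative-only sketch of the three components named in the release.
# Layer sizes, the SE-style channel attention, and the fusion/pooling formulas
# are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class BiPathFusionConv(nn.Module):
    """Two parallel convolution paths (local vs. dilated, wider-context),
    fused by summation to combine low- and high-level cues."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.local_path = nn.Sequential(          # fine-grained local detail
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
        )
        self.context_path = nn.Sequential(        # wider context via dilation
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
        )

    def forward(self, x):                         # x: (batch, ch, freq, time)
        return self.local_path(x) + self.context_path(x)


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style reweighting: pool each channel globally,
    predict a per-channel gate, and rescale channels so informative ones are
    emphasized and noisy ones suppressed."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (batch, ch, freq, time)
        w = x.mean(dim=(2, 3))                    # squeeze: global average pool
        w = self.gate(w)                          # excite: per-channel weights
        return x * w[:, :, None, None]            # rescale feature channels


class DualHeadAttentionPooling(nn.Module):
    """Aggregates frame-level class probabilities into clip-level ones using
    two attention heads whose frame weights are averaged."""
    def __init__(self, n_classes: int, n_heads: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(n_classes, n_classes) for _ in range(n_heads)]
        )

    def forward(self, frame_probs):               # (batch, time, classes)
        weights = torch.stack(
            [torch.softmax(h(frame_probs), dim=1) for h in self.heads]
        ).mean(dim=0)                             # average the heads' weights
        clip_probs = (weights * frame_probs).sum(dim=1)  # weighted aggregation
        return clip_probs, frame_probs


if __name__ == "__main__":
    feats = torch.randn(4, 1, 64, 156)            # (batch, ch, mel bins, frames)
    x = BiPathFusionConv(1, 32)(feats)
    x = ChannelAttention(32)(x)
    frame_probs = torch.sigmoid(torch.randn(4, 156, 10))  # stand-in frame scores
    clip_probs, _ = DualHeadAttentionPooling(10)(frame_probs)
    print(x.shape, clip_probs.shape)
```

In this reading, the two convolution paths see the same input at different receptive fields, the attention gate learns which channels carry event-relevant energy, and the pooling stage weights frames by how confidently each head believes an event is present before averaging them into a clip-level decision.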
Advanced Feature Fusion and Attention Mechanisms Open Doors for Smarter Consumer, Industrial and Safety-Critical Audio Tools
This work demonstrates that combining advanced feature fusion and attention strategies within a semi-supervised learning paradigm can substantially elevate sound event detection performance. The EFAM model’s improvements offer promising pathways for developing smarter, more reliable audio analytics tools across consumer, industrial, and public safety domains.
“We envision a future where smart assistants, environmental monitors, and emergency systems all benefit from these improvements,” says Prof. Dongping Zhang. “Better sound detection means faster response times and stronger safety nets.”
Published in Frontiers of Computer Science in April 2025 (https://doi.org/10.1007/s11704-025-41108-7), this research is a collaborative effort among China Jiliang University, Hangzhou Hikvision Digital Technology Co., Ltd, Hangzhou Aihua Intelligent Technology Co., Ltd, and Beihang University.