Looking to Listen (IMAGE)
Caption
A new model isolates and enhances the speech of desired speakers in a video. (a) The input is a video (frames + audio track) with one or more people speaking, where the speech of interest is obscured by other speakers and/or background noise. (b) Both audio and visual features are extracted and fed into a joint audio-visual speech separation model. (c) The output is a decomposition of the input audio track into clean speech tracks, one for each person detected in the video. The speech of specific people is enhanced in the video while all other sound is suppressed. The new model was trained using thousands of hours of video segments from the team's new dataset, AVSpeech, which will be released publicly.
Credit
Figure: Courtesy of Authors/Google. Video stills: Courtesy of Team Coco/CONAN.
Usage Restrictions
None
License
Licensed content