When video is recorded in a studio, the sound is free of external noise and unrelated sounds. However, most video is not shot in studios. The voices of people filmed at family events are mixed with music and with other voices. Video conferences from home or office are often disturbed by other people, ringing phones, or barking dogs. Television reporting from city streets is mixed with traffic noise, the sound of wind, and the like.
Previous methods known in the art for single-channel, or monaural, speech separation usually use only the audio signal as input. One of the main approaches is spectrographic masking, in which the separation model finds, for each speaker, a matrix marking the time-frequency (TF) components dominated by that speaker. The mask, or filter, can be either binary or soft. Another approach tackles single-channel multi-speaker separation using a method known as deep clustering, in which discriminatively trained speech embeddings are used as the basis for clustering, and subsequently separating, speech.
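By way of illustration only, the difference between a binary mask and a soft mask can be sketched as follows. This is a minimal sketch and not any particular prior-art method: it assumes the per-speaker magnitude spectrograms are already known (so the masks are the "ideal" ones that a separation model would try to estimate), it assumes the mixture magnitude is approximately the sum of the source magnitudes, and the function names `binary_masks` and `soft_masks` are hypothetical.

```python
import numpy as np

def binary_masks(source_mags):
    """Binary masks: each TF bin is assigned to the speaker that dominates it.

    source_mags: array of shape (n_speakers, n_freq, n_time) holding
    per-speaker magnitude spectrograms (assumed known for this sketch).
    """
    dominant = np.argmax(source_mags, axis=0)            # (n_freq, n_time)
    n_speakers = source_mags.shape[0]
    return np.stack([(dominant == i).astype(float) for i in range(n_speakers)])

def soft_masks(source_mags, eps=1e-8):
    """Soft (ratio) masks: each TF bin is shared in proportion to magnitude."""
    total = source_mags.sum(axis=0, keepdims=True)
    return source_mags / (total + eps)

# Toy data: two speakers, 4 frequency bins, 5 time frames (random values).
rng = np.random.default_rng(0)
sources = rng.random((2, 4, 5))
mixture = sources.sum(axis=0)        # assumption: magnitudes add in the mixture

binary = binary_masks(sources)
soft = soft_masks(sources)

# Applying a mask is an element-wise product with the mixture spectrogram.
separated_binary = binary * mixture
separated_soft = soft * mixture
```

Under the additive assumption above, the soft masks recover each source magnitude almost exactly, whereas the binary masks assign each TF bin wholly to one speaker. In practice the per-speaker spectrograms are not available, and the mask must be estimated from the mixture.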
Audio-Visual Speech Processing
Recent research in audio-visual speech processing makes extensive use of neural networks. Neural networks with visual input have been used for lipreading, sound prediction, and for learning unsupervised sound representations. Work has also been done on audio-visual speech enhancement and separation. One approach uses handcrafted visual features to derive binary and soft masks for speaker separation. Most known approaches describe a neural network that outputs a spectrogram representing the enhanced speech.
Different approaches exist for the generation of intelligible speech from silent video frames of a speaker.
In an approach known as Vid2speech, presented by the inventors of the present invention in “Vid2Speech: Speech Reconstruction from Silent Video” (ICASSP 2017) and other places, linear spectrograms representing speech are generated from a sequence of silent video frames of a speaking person. The Vid2speech model takes two inputs: a clip of K consecutive video frames showing the speaker's face or part of the speaker's face, and a “clip” of (K−1) consecutive dense optical flow fields corresponding to the motion in the (u, v) directions for pixels of consecutive frames, one flow field for each pair of adjacent frames.
The Vid2speech architecture consists of a dual-tower residual neural network (ResNet), as disclosed in an article by He, Kaiming, et al. titled “Deep residual learning for image recognition,” published in CVPR 2016, which takes the aforementioned inputs and encodes them into a latent vector representing the visual features. The latent vector is fed into a series of two fully connected layers followed by a post-processing network, which aggregates multiple consecutive mel-scale spectrogram predictions and maps them to a linear-scale spectrogram representing the final speech prediction.
It is understood that any mention herein of the Vid2speech technique should not be interpreted as limiting, and may include any other articulatory-to-acoustic mapping based on visual analysis.