AU2022200439B2 - Multi-modal speech separation method and system - Google Patents
- Publication number
- AU2022200439B2 (application AU2022200439A)
- Authority
- AU
- Australia
- Prior art keywords
- speech
- speakers
- spectrogram
- audio
- feature
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G10L21/0272 — Voice signal separating (speech enhancement)
- G10L25/30 — Speech or voice analysis techniques using neural networks
- G10L2021/02087 — Noise filtering, the noise being separate speech, e.g. cocktail party
- G10L21/0208 — Noise filtering
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06V10/478 — Contour-based spectral representations or scale-space representations, e.g. by Fourier analysis, wavelet analysis or curvature scale-space (CSS)
- G06V40/161 — Human faces: detection; localisation; normalisation
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
Abstract
The present disclosure provides a multi-modal speech separation method and system. The method includes: receiving mixed speech of speakers and facial visual information of the speakers; performing face detection by using the Dlib library to obtain a quantity of speakers; processing the above information to obtain a complex spectrogram and face images of the speakers, transmitting the complex spectrogram and the face images to a multi-modal speech separation model, and dynamically adjusting a structure of the model according to the quantity of the speakers, where during training of the multi-modal speech separation model, a complex ideal ratio mask (cIRM) is used as the training target, where the cIRM is defined as the ratio of the spectrogram of the clean speech to the spectrogram of the mixed speech in the complex domain, is composed of a real part and an imaginary part, and includes both the amplitude and the phase information of the speech, and the multi-modal speech separation model outputs one complex time-frequency mask per detected face; and performing complex multiplication on the outputted masks and the spectrogram of the mixed speech to obtain the spectrogram of the clean speech, and performing inverse short-time Fourier transform (iSTFT) on the spectrogram of the clean speech to obtain a time-domain signal of the clean speech, thereby completing speech separation. The disclosed model is applicable to a wider range of application scenarios.
[FIG. 1: system overview. The mixed speech is transformed by STFT and fed, together with the speaker video, to the multi-modal speech separation model, which outputs the separated speech of each speaker.]
[FIG. 2: model structure. Audio feature extraction (spectrogram segmentation, CNN layers, TCN layers), visual feature extraction (weight-sharing CNN layers), audio-visual feature fusion, and FC layers.]
Description
The present disclosure relates to the technical field of speech separation, and in particular, to a multi-modal speech separation method and system.
The description in this section merely provides background information related to the present disclosure and does not necessarily constitute the prior art.

In daily life, we are often exposed to a variety of mixed speech, and person-to-person mixed speech most frequently needs to be processed. In an environment where several pieces of speech are mixed, humans can focus on the speech of one person while ignoring the speech of others and environmental noise. This phenomenon is referred to as the cocktail party effect. With its powerful speech signal processing capability, the human auditory system can easily separate mixed speech. As intelligent devices permeate daily life, speech separation technology has come to play an important role in various speech interaction devices. For computers, however, achieving speech separation efficiently has always been a difficult problem.

At present, speech separation technology is widely used. For example, it is placed at the front end of speech recognition to separate the speech of the target speaker from other interfering speech, thereby improving the robustness of the speech recognition system. Precisely because speech separation facilitates subsequent speech signal processing, more and more researchers are focusing on it. In the past few decades, various speech separation algorithms have been proposed and proven to effectively improve separation performance. Even so, the technology still has considerable room for development. Most of the proposed algorithms use only audio feature information for speech separation. Among them are conventional methods such as the independent component analysis (ICA)-based method, the computational auditory scene analysis (CASA)-based method, and the Gaussian mixture model (GMM)-based method.
ICA is a computational method used to separate a plurality of signals into additive subcomponents, realizing the separation of speech signals by searching for statistically independent, non-Gaussian components in a multi-dimensional array, which enables rapid analysis and processing of the speech signals. ICA is therefore widely used in blind source separation [Blind equalisation using approximate maximum likelihood source separation][Adaptive blind source separation with HRTFs beamforming preprocessing][Convolutive Blind Source Separation Applied to the Wireless Communication]. CASA models the processing of auditory signals by having the computer imitate how humans perceive, process, and interpret speech from complex mixed sources. In the reference [Speech segregation based on speech localization], the ideal binary mask (IBM) is combined with the CASA method to construct a new speech separation model, improving the intelligibility of the separated speech. The GMM is a clustering algorithm that uses the Gaussian distribution as a parametric model and has been widely used in single-channel speech separation. The reference [Soft Mask Methods for Single-Channel Speaker Separation] resolves single-channel speech separation with a GMM whose parameters are learned by the expectation maximization (EM) algorithm. The method still has shortcomings, however: it is difficult to choose the order of the source distribution, and the method relies heavily on initialization, which makes it very complicated to implement. With the rapid development of deep learning, further effective algorithms have been proposed, such as the convolutional neural network (CNN), the recurrent neural network (RNN), and so on.
Artificial neural networks have greatly improved the performance of supervised learning tasks thanks to their strong nonlinear mapping capability, and have attracted increasing attention. Most of the deep learning-based speech separation algorithms proposed to date use time-frequency decomposition techniques such as the short-time Fourier transform (STFT) during data preprocessing to convert speech into a spectrogram. In recent years, however, end-to-end speech separation methods have appeared that directly use the mixed speech signal as the input. The reference [Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation] proposed a speech separation method based on deep recurrent neural networks, which achieved more robust separation performance by jointly optimizing the time-frequency masks of multiple target sources. Since the conventional RNN suffers from the vanishing gradient problem, an effective solution is to introduce a gating mechanism to control the rate at which information accumulates, and long short-term memory (LSTM) is one typical representative. In the literature [Long short-term memory for speaker generalization in supervised speech separation], an LSTM network is used instead of an RNN in the speech separation algorithm, and the separation performance is improved as a result. In addition, unlike methods that take the speech spectrogram as input, the literature [TASNET: TIME-DOMAIN AUDIO SEPARATION NETWORK FOR REAL-TIME, SINGLE-CHANNEL SPEECH SEPARATION] proposed a model named TasNet for end-to-end speech separation in the time domain, which takes the speech waveform as input, combines one-dimensional convolution with LSTM, and directly outputs clean speech signals.
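As an illustration of the STFT preprocessing step described above, the sketch below converts a time-domain signal into a complex spectrogram with a minimal numpy implementation. The frame length, hop size, and window are illustrative choices, not parameters specified by this disclosure:

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Minimal STFT: returns a complex spectrogram (frames x freq bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-redundant half of the spectrum for a real signal
    return np.fft.rfft(frames, axis=1)

# 1 s of a 440 Hz tone mixed with noise, sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
mixed = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sr)
spec = stft(mixed)
print(spec.shape)   # (122, 257): 122 frames, 257 frequency bins
```

Production systems would normally use a library routine such as scipy.signal.stft, which also handles padding and inversion.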
Recently, the temporal convolutional network (TCN) has been found to be very effective for sequence modeling, and one-dimensional time series signals can feed a TCN directly for feature extraction, so some end-to-end speech separation methods based on the TCN have been proposed. The literature [FurcaNext: End-to-End Monaural Speech Separation with Dynamic Gated Dilated Temporal Convolutional Networks] proposed four improved TCN-based models for monaural end-to-end speech separation on the basis of TasNet, in which the FurcaPy model uses a multi-stream TCN to extract audio features, while the other three models directly modify the internal structure of the TCN to improve separation performance. The foregoing methods use only audio information. In human speech perception, however, visual information is usually exploited automatically or unconsciously [Visual Speech Recognition: Lip Segmentation and Mapping]. When listening to a speaker, people not only focus on the speaker's speech but also watch the speaker's lips to better understand what is being said. Moreover, the permutation problem arises when only audio information is used for speech separation. With the continuous development of smart devices, obtaining the visual information of the speaker has become increasingly convenient, and this visual information can supplement the speech separation model. Some phonemes are easier to distinguish visually [CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement], and when a speaker is silent, visual information better reflects the situation. Therefore, compared to using audio information alone, combining visual and audio information avoids the one-sidedness and uncertainty of a single modality, thereby improving separation performance.
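To illustrate why dilated TCNs model long sequences efficiently, the receptive field of a stack of dilated one-dimensional convolutions can be computed with a generic formula (this is standard TCN arithmetic, not the specific architecture of this disclosure):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked dilated 1-D convolutions.

    Each layer with kernel size k and dilation d extends the receptive
    field by (k - 1) * d steps.
    """
    return 1 + (kernel_size - 1) * sum(dilations)

# Dilations doubling per layer, as commonly used in TCNs
print(receptive_field(3, [1, 2, 4, 8]))               # 31
print(receptive_field(3, [2 ** i for i in range(8)]))  # 511
```

With exponentially growing dilations, the receptive field grows exponentially in depth while the parameter count grows only linearly, which is the key advantage over plain convolutions.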
In recent years, many researchers have made full use of visual information and proposed audio-visual fusion speech separation algorithms. In 2018, Google proposed a deep learning-based audio-visual fusion speech separation algorithm [Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation], which combines a CNN and an LSTM network to extract audio and visual features, uses complex ideal ratio masks (cIRM) as the training target, and multiplies the cIRM by the mixed audio spectrogram to obtain the clean speaker speech. The model is speaker-independent, and its performance is significantly improved over methods that use only audio information. However, because the network has a large number of parameters, overfitting inevitably occurs during training. To solve this problem, the literature [Two Stage Audio-Video Speech Separation using Multimodal Convolutional Neural Networks] proposed a system with two AV models to separate the mixed speech. The two models were trained separately: after the first AV model was trained, its output together with the visual features was used to train the second AV model. In addition, the literature [Audio-Visual Deep Clustering for Speech Separation] used an unsupervised learning method and proposed a speech separation model (AVDC) with two feature fusions, in which the visual information better assists the clustering of time-frequency units. Compared with speech separation models that use only audio information, AVDC achieves better separation and resolves the permutation problem across frames. Beyond frequency-domain processing, some methods incorporate visual information into end-to-end speech separation. The literature [TIME DOMAIN AUDIO VISUAL SPEECH SEPARATION] combined visual information with TasNet to realize multi-modal learning, and experiments show that this model performs better than methods using only audio information. In the above literature, however, only one feature extractor is used for the full frequency band, that is, the same feature extractor is applied at every frequency, which yields suboptimal results. In addition, the number of speakers is fixed when the network parameters are designed, that is, the model is static: a fixed number of speakers is required during training and testing, and the number of inputs cannot be changed flexibly. The existing speech separation technology therefore needs further improvement.
In order to overcome the above shortcomings of the prior art, the present disclosure provides a multi-modal speech separation method whose model is more broadly applicable and more effective than existing models. To achieve the foregoing objective, one or more embodiments of the present disclosure provide the following technical solutions:
In a first aspect, a multi-modal speech separation method is disclosed, including: receiving mixed speech of speakers and facial visual information of the speakers, and obtaining a quantity of speakers by means of face detection; preprocessing the above data to obtain a complex spectrogram of the mixed speech and face images of the speakers, transmitting the complex spectrogram and the face images to a multi-modal speech separation model, and dynamically adjusting the structure of the model according to the quantity of the speakers, where during training of the multi-modal speech separation model, a complex ideal ratio mask (cIRM) is used as the training target, where the cIRM is defined as the ratio of the spectrogram of the clean speech to the spectrogram of the mixed speech in the complex domain, is composed of a real part and an imaginary part, and includes both the amplitude and the phase information of the speech, and the multi-modal speech separation model is configured to output as many cIRMs as there are speakers; and performing complex multiplication on the outputted cIRMs and the spectrogram of the mixed speech to obtain the spectrogram of the clean speech, and performing inverse short-time Fourier transform (iSTFT) on the spectrogram of the clean speech to obtain a time-domain signal of the clean speech, thereby completing speech separation.

In a further technical solution, the mixed speech may be regarded as the sum of the clean speech of a plurality of speakers. The mixed speech signal in the time domain is converted, by means of short-time Fourier transform (STFT), into the complex spectrogram that serves as an input of the speech separation model.

In a further technical solution, before each instance is inputted into the multi-modal speech separation model, the structure of the model is dynamically adjusted according to the quantity of speakers.
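The cIRM arithmetic described above can be sketched in a few lines of numpy. The spectrograms here are random stand-ins for the STFT outputs, and practical implementations typically also compress and bound the mask to keep training stable, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in complex spectrograms: clean speech S and mixture Y
shape = (257, 100)                        # freq bins x time frames
S = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
Y = S + noise                             # mixture = clean + interference

# cIRM: ratio of the clean to the mixed spectrogram in the complex domain;
# its real and imaginary parts jointly carry amplitude and phase
M = S / Y

# Applying the mask by complex multiplication recovers the clean spectrogram
S_hat = M * Y
print(np.allclose(S_hat, S))              # True
```

In the disclosed method, the model predicts one such mask per speaker, and iSTFT on each masked spectrogram yields that speaker's time-domain signal.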
In a further technical solution, the multi-modal speech separation model is composed of an audio feature extraction network, a visual feature extraction network, and an audio-visual fusion network. The audio feature extraction network is configured to extract high-frequency audio features and low-frequency audio features by using different convolutional neural networks (CNN), fuse the low-frequency and high-frequency audio features to realize a first stage of fusion, and then continue to extract audio features by using temporal convolutional networks (TCN). The visual feature extraction network is configured to extract visual features from the inputted face images by using a plurality of convolutional layers, inserting holes (dilation) into each convolution kernel to increase its effective size and thereby enlarge the receptive field. The audio-visual fusion network is configured to fuse the audio feature obtained by the audio feature extraction network with the visual feature obtained by the visual feature extraction network to obtain audio-visual fused features, thereby realizing a second stage of feature fusion.

Preferably, extracting the high-frequency and low-frequency audio features by using different feature extractors specifically includes: transforming the time-domain signal of the mixed speech into the complex spectrogram by using STFT, and then segmenting the complex spectrogram into a high-frequency part and a low-frequency part in the frequency dimension; extracting the low-frequency and high-frequency audio features by using a two-stream CNN, where each stream includes two convolutional layers and different dilation parameters are used for the network layers extracting the high-frequency features and those extracting the low-frequency features; and fusing the high-frequency and low-frequency audio features to realize the first stage of feature fusion.
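The frequency-dimension segmentation feeding the two-stream CNN amounts to a split of the spectrogram along the frequency axis, and the dilation mentioned above enlarges a kernel's effective size without adding weights. A sketch with an arbitrary split point (the disclosure does not fix a particular one in this passage):

```python
import numpy as np

spec = np.zeros((257, 100), dtype=complex)   # freq bins x time frames

# Segment along the frequency dimension; the split index is illustrative
split = 128
low_band, high_band = spec[:split], spec[split:]
print(low_band.shape, high_band.shape)       # (128, 100) (129, 100)

# Each band then feeds its own CNN stream with its own dilation. A kernel
# of size k with dilation d has effective size k_eff = k + (k - 1)*(d - 1).
def effective_kernel(k, d):
    return k + (k - 1) * (d - 1)

print(effective_kernel(3, 2))                # 5
```

Using distinct dilation parameters per stream lets each band's extractor match its own frequency-domain structure, mirroring the cochlea analogy drawn later in the disclosure.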
Preferably, the one-dimensional convolutional network layers in the TCN are modified into two-dimensional convolutional network layers, so that the TCN is capable of processing the output data of the audio feature extraction network. Preferably, the output of the visual feature extraction network is up-sampled to compensate for the sampling rate difference between the audio signal and the video signal. The audio feature and the visual feature are then fused to realize the second stage of feature fusion.

In a further technical solution, the multi-modal speech separation model feeds the audio-visual fused features to a fully connected layer, which outputs one cIRM per speaker. Each cIRM corresponds to one speaker, and the order of the speakers corresponding to the masks is the same as the order of the speakers in the visual feature extraction network.

In a second aspect, a multi-modal speech separation system is disclosed, including: a data receiving module, configured to receive mixed speech of speakers and facial visual information of the speakers; a multi-modal speech separation model processing module, configured to process the above information to obtain a complex spectrogram and face images, transmit the complex spectrogram and the face images to a multi-modal speech separation model, and dynamically adjust a structure of the model according to a quantity of the speakers, where during training of the multi-modal speech separation model, complex time-frequency masks are used as the training target, and the multi-modal speech separation model outputs one complex time-frequency mask per detected face; and a speech separation module, configured to perform complex multiplication on the outputted masks and the spectrogram of the mixed speech to obtain the spectrogram of the clean speech, and perform inverse short-time Fourier transform (iSTFT) on the spectrogram of the clean speech to obtain a time-domain signal of the clean speech, thereby completing speech separation.
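Compensating for the sampling-rate mismatch between video frames and audio spectrogram frames can be done by simple repetition along the time axis before fusion. The frame rates and feature widths below are illustrative assumptions, not values taken from the disclosure:

```python
import numpy as np

# Illustrative rates: 25 video frames/s vs. 100 spectrogram frames/s
video_fps, audio_fps = 25, 100
factor = audio_fps // video_fps              # 4

visual_feat = np.zeros((75, 256))            # 3 s of visual features
upsampled = np.repeat(visual_feat, factor, axis=0)
print(upsampled.shape)                       # (300, 256), now one row per audio frame

# With matching time axes, audio and visual features can be concatenated
audio_feat = np.zeros((300, 512))            # 3 s of audio features
fused = np.concatenate([audio_feat, upsampled], axis=1)
print(fused.shape)                           # (300, 768)
```

Nearest-neighbor repetition is the simplest choice; linear interpolation along the time axis is a common alternative when the rates are not integer multiples.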
The foregoing one or more technical solutions have the following beneficial effects. In the technical solution of the present disclosure, a multi-modal speech separation system is disclosed to resolve the cocktail party problem, as shown in FIG. 1. The human auditory system has a powerful speech separation capability, so the disclosed speech separation system draws on the relevant physiological characteristics of human hearing. The part of the human ear that receives speech is the cochlea, which maps speech of different frequencies to different positions on the basilar membrane. Based on this characteristic of the cochlea, a multi-stream convolutional network is used in the audio feature extraction network to simulate the cochlea's feature extraction filters in different frequency ranges, so as to extract the features of the high-frequency part and the low-frequency part of the speech separately. In addition, the disclosed model includes two stages of feature fusion: a fusion of the high-frequency and low-frequency audio features, and a fusion of the audio and visual features. The improved TCN processes the fused high-frequency and low-frequency speech features, and visual information is added to the speech separation model to improve separation performance. For each instance, the network structure of the model and the quantity of finally outputted speech signals are determined by the quantity of speakers in the video, so the model is flexible and suitable for speech separation with any quantity of speakers. Since different parts of the basilar membrane in the cochlea process speech of different frequencies, the technical solution of the present disclosure uses two different feature extractors to extract the features of the high-frequency part and of the low-frequency part of the speech respectively.
According to the technical solution of the present disclosure, the speech separation model is combined with face detection, so that the system can recognize how many speakers are in the video and dynamically adjust the structure of the model accordingly. In view of the permutation problem in the prior art that uses only audio information for speech separation, the technical solution of the present disclosure combines audio and video to fix the order of the plurality of outputs, thereby resolving the permutation problem. In addition, the visual information serves as a supplementary part of the speech separation system, further improving performance. Prior art that uses an RNN for time series modeling requires a large quantity of parameters and a long training time, and may also suffer from exploding and vanishing gradients. The technical solution of the present disclosure uses a TCN instead of long short-term memory (LSTM); the advantages of the TCN mainly include parallelism, a flexible receptive field, stable gradients, and lower memory consumption. According to the technical solution of the present disclosure, the signal-to-distortion ratio (SDR) is selected as the evaluation index, the disclosed model is compared with other leading models through experiments, and the feasibility of the model is verified through auxiliary experiments. Additional aspects and advantages of the present disclosure will be set forth in part in the description below, parts of which will become apparent from the description below or will be understood through practice of the present disclosure.
The accompanying drawings constituting a part of the present disclosure are used to provide further understanding of the present disclosure. Exemplary embodiments of the present disclosure and descriptions thereof are used to explain the present disclosure, and do not constitute an improper limitation to the present disclosure. FIG. 1 is a diagram of a speech separation system according to an embodiment of the present disclosure. FIG. 2 is a structure diagram of a multi-modal speech separation model according to an embodiment of the present disclosure. FIG. 3 is a diagram of an improved temporal convolutional network (TCN) according to an embodiment of the present disclosure.
It should be noted that, the following detailed descriptions are all exemplary, and are intended to provide further descriptions of the present disclosure. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the present disclosure belongs. It should be noted that terms used herein are only for describing specific implementations and are not intended to limit exemplary implementations according to the present disclosure. As used herein, the singular form is intended to include the plural form, unless the context clearly indicates otherwise. In addition, it should be further understood that the terms "include" and/or "comprise" used in this specification indicate the presence of features, steps, operations, devices, assemblies, and/or combinations thereof. The embodiments in the present disclosure and features in the embodiments may be mutually combined in case that no conflict occurs. Overall conception: In an environment where a plurality of pieces of speech are mixed, it is often necessary to separate the speech of interest to facilitate subsequent speech processing. However, for computers, how to efficiently achieve speech separation has always been a challenging problem. In the technical solution of the present disclosure, a deep learning-based multi-modal speech separation model is provided to resolve the cocktail party problem. The model makes full use of the audio information and visual information of the speakers, and adopts a two-stage feature fusion policy: a fusion of the high-frequency audio feature and the low-frequency audio feature, and a fusion of the audio feature and the visual feature. 
The multi-stream convolution is used to process the high-frequency audio and the low-frequency audio separately, and the outputs of the high-frequency audio feature and the low-frequency audio feature are concatenated as the input of a temporal convolutional network (TCN) to achieve audio feature extraction. The obtained audio feature is combined with the visual feature outputted by a dilated convolutional layer, and the speech separation is completed through fully connected layers. In addition, at the data preprocessing stage, the Dlib library is used to detect the quantity of speakers in the video and dynamically adjust the network structure, so as to automatically determine how many clean outputs need to be generated. The data set used in the present disclosure is GRID. In order to make full use of the phase information of the speech, the complex ideal ratio mask (cIRM) is selected as the training target of the model. Through a series of experiments, it is demonstrated that the performance of the disclosed model is superior to that of models using other methods. In addition, auxiliary experiments prove that the model of the present disclosure has a faster training speed than the model proposed by Google without affecting performance, and it is verified that visual information is indeed helpful for improving the performance of speech separation. Embodiment I As shown in FIG. 
1, this embodiment discloses a multi-modal speech separation method, including: receiving mixed speech of speakers and facial visual information of the speakers, and obtaining a quantity of speakers by means of face detection; processing the above information to obtain a complex spectrogram and face images and transmitting the complex spectrogram and the face images to a multi-modal speech separation model, and dynamically adjusting a structure of the model according to the quantity of the speakers, where during training of the multi-modal speech separation model, a complex ideal ratio mask (cIRM) is used as a training target, where the cIRM is defined as a ratio of a spectrogram of clean speech to a spectrogram of the mixed speech in the complex domain, is composed of a real part and an imaginary part, and includes the amplitude and phase information of the speech, and the multi-modal speech separation model is configured to output cIRMs for the quantity of speakers; and performing complex multiplication on the outputted cIRMs and the spectrogram of the mixed speech to obtain the spectrogram of the clean speech, and performing inverse short-time Fourier transform (iSTFT) on the spectrogram of the clean speech to obtain a time-domain signal of the clean speech, thereby completing speech separation. In a specific implementation example, speech separation may be regarded as follows: mixed speech x(t) is composed of the audio of a plurality of speakers, and the audio of the speakers is estimated as s_1(t), ..., s_C(t). The mixed speech x(t) may be deemed to be obtained by mixing the clean audio of C speakers:

x(t) = Σ_{i=1}^{C} s_i(t)
The mixed speech signal in the time domain is converted, by means of the short-time Fourier transform (STFT), into a complex spectrogram X that is used as an input of the speech separation model and includes a real part and an imaginary part. For the visual stream, the video is first converted into video frames; Dlib is then used to determine the quantity C of human faces in the video, and all faces are extracted as inputs of the model.
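A minimal NumPy sketch of this STFT front end follows; it assumes the 25 ms Hamming window, 10 ms hop, 512-point FFT, and 16 kHz sampling rate stated in the data preprocessing description, and `stft_complex` is a hypothetical helper, not code from the disclosure:

```python
import numpy as np

def stft_complex(x, win_len=400, hop=160, n_fft=512):
    """Frame the signal, window each frame, and take its FFT.
    A 25 ms window / 10 ms hop at 16 kHz gives 400 / 160 samples."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft keeps the n_fft // 2 + 1 = 257 non-redundant frequency bins
    return np.fft.rfft(frames, n=n_fft).T      # shape: (257, n_frames)

x = np.random.randn(16000 * 3)                 # 3 s of audio at 16 kHz
X = stft_complex(x)
# Store the real and imaginary parts as two channels: 257 x 298 x 2
X2 = np.stack([X.real, X.imag], axis=-1)
print(X2.shape)                                # (257, 298, 2)
```

Note that a 3 s clip at these settings yields exactly the 257×298×2 input size reported in the data preprocessing description.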
During the training of the model, time-frequency masks are used as the training target; that is, for each speaker, the model predicts a time-frequency mask. It can be seen from the literature [On training targets for supervised speech separation][Supervised Speech Separation Based on Deep Learning: An Overview] that, compared with direct prediction of the spectrogram or time-domain waveform of clean speech, using a mask as the training target gives the speech separation system better performance. In addition, the literature [The importance of phase in speech enhancement] pointed out that the phase information of the speech is also very helpful for speech separation. The time-frequency mask used is the complex ideal ratio mask (cIRM). The cIRM is defined in the complex domain as the ratio of the spectrogram of the clean speech to the spectrogram of the mixed speech, consists of a real part and an imaginary part [Complex Ratio Masking for Monaural Speech Separation], and includes the amplitude and phase information of the speech. Since the faces of C speakers are detected, the model outputs C cIRMs, and M_1, M_2, ..., M_C are used to represent the cIRM corresponding to each speaker. Then, complex multiplication is performed on the cIRMs and the spectrogram of the mixed speech to obtain the spectrogram of the clean speech:

Y_n = M_n * X, n = 1, 2, ..., C

where '*' represents complex multiplication, and X represents the complex spectrogram obtained by STFT conversion of the inputted mixed speech signal. Finally, inverse short-time Fourier transform (iSTFT) is applied to the spectrograms Y_n to obtain the time-domain signals of the clean audio, thereby completing the speech separation. The multi-modal speech separation model, which is composed of an audio feature extraction network, a visual feature extraction network, and an audio-visual feature fusion network, is described in detail below. The model structure is shown in FIG. 2. 
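The cIRM definition and its application by complex multiplication can be illustrated with toy spectrograms. This is a sketch of the uncompressed definition; practical systems typically also compress the mask to a bounded range, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy complex spectrograms of two clean speakers and their mixture.
S1 = rng.standard_normal((257, 298)) + 1j * rng.standard_normal((257, 298))
S2 = rng.standard_normal((257, 298)) + 1j * rng.standard_normal((257, 298))
X = S1 + S2

# cIRM for each speaker: the ratio of the clean spectrogram to the
# mixed spectrogram in the complex domain.
M1 = S1 / X
M2 = S2 / X

# Applying a mask by complex multiplication recovers the clean spectrogram.
Y1 = M1 * X
print(np.allclose(Y1, S1))  # True
```

A useful sanity check is that the ideal masks of all speakers sum to one in the complex domain, since the mixture is the sum of the clean spectrograms.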
Taking two speakers as an example, the model is described in detail below. Regarding the audio feature extraction network, its design draws on the characteristics of the cochlea. Since humans can recognize and separate speech of different frequencies through the auditory system, the cocktail party problem can be easily resolved by humans. The cochlea plays an important role in audio processing, and can map speech of different frequencies to different locations on the basilar membrane. The basilar membrane at the bottom of the cochlea processes high-frequency speech, and the basilar membrane at the top of the cochlea processes low-frequency speech. In addition, in the literature [TasNet: TIME-DOMAIN AUDIO SEPARATION NETWORK FOR REAL-TIME, SINGLE-CHANNEL SPEECH
SEPARATION], the author designed and trained a CNN-based feature extraction filter to make the network find a filter suitable for speech separation, and found from the frequency response diagram of the features that the frequency resolutions of the filter at the low frequency and the high frequency are different. Specifically, the low-frequency part has a high resolution and the high-frequency part has a low resolution, which indicates that the optimal filters for extracting the audio features at the low frequency and the high frequency differ in property. Therefore, the high-frequency audio feature and the low-frequency audio feature are extracted by using different convolutional neural networks (CNNs). First, the time-domain signal of the mixed speech is transformed into a complex spectrogram by using the STFT, and then the complex spectrogram is segmented into a high-frequency part and a low-frequency part in the frequency dimension. Inspired by the cochlea, the low-frequency audio features and the high-frequency audio features are extracted by using a two-stream CNN. Each stream includes two convolutional layers, and each convolutional layer includes 32 convolution kernels having a size of 25x5. For the network layers that extract high-frequency features, the dilation parameter of the first convolutional layer is set to 1x1, and the dilation parameter of the second convolutional layer is set to 2x1. For the network layers that extract low-frequency features, the dilation parameter is uniformly set to 1x1. According to the technical solution of the present disclosure, the low-frequency audio feature and the high-frequency audio feature are fused to realize the first stage of fusion, and then the TCN is used to continue to extract features. The TCN can process data in parallel, effectively avoids exploding gradients, and also offers strong flexibility [An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling]. 
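To give a sense of the temporal context such a TCN stack provides, the receptive field of stacked dilated convolutions can be computed. This is a sketch: the kernel size along time (3) and the standard dilation-doubling schedule are assumptions, as the text does not state them, while the layer count of 8 is the one used in this disclosure:

```python
def tcn_receptive_field(n_layers=8, kernel_t=3):
    """Receptive field, in time frames, of stacked dilated temporal
    convolutions with the standard TCN schedule d = 1, 2, 4, ..."""
    rf = 1
    for layer in range(n_layers):
        rf += (kernel_t - 1) * 2 ** layer
    return rf

print(tcn_receptive_field())  # 511
```

Under these assumptions, 8 layers already cover roughly 511 frames, i.e. far more than the 298 frames of a 3 s clip, which is one reason a TCN can replace an LSTM for sequence modeling here.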
Therefore, increasingly more people consider using the TCN instead of the LSTM to process sequence data. Since the conventional TCN uses one-dimensional convolution internally, only one-dimensional data can be used as its input, and therefore the conventional TCN is more commonly used in end-to-end speech separation models. Since the original speech signal is not used directly as the input here, but is transformed into a complex spectrogram by the STFT, two-dimensional convolution is required to process the data. The one-dimensional convolution in the TCN is therefore modified into two-dimensional convolution, so that the TCN is capable of processing the output data of the previous network layer normally. The improvement of the internal structure of the TCN is shown in FIG. 3. A total of 8 TCN layers are used, and batch normalization is applied after each TCN layer to prevent overfitting. Regarding the visual feature extraction network, visual features are extracted from the inputted face images by using 6 convolutional layers, and holes are inserted into each convolution kernel to increase its size, thereby increasing the receptive field. The quantities of convolution kernels in the convolutional layers are 32, 48, 64, 128, 256, and 256, respectively, and the size of each convolution kernel is 5x1. Similar to the audio feature extraction network, batch normalization is used after each convolutional layer to prevent overfitting. Finally, in order to compensate for the sampling rate difference between the audio signal and the video signal, the output of the visual feature extraction network is up-sampled. It is to be noted that since the mixed speech is composed of the audio of at least two persons, the faces of a plurality of speakers need to be inputted during the visual feature extraction. In order to achieve speaker independence, weight sharing is applied in the process of extracting the visual features of different speakers. 
That is to say, a network with the same parameters is used to extract the face features of each speaker, so that the visual feature extraction network generalizes across speakers. Regarding the audio-visual feature fusion network, after the mixed speech and the visual information of the speakers respectively pass through the audio feature extraction network and the visual feature extraction network, the audio feature and the visual feature are obtained. The audio feature and the visual feature are then fused to obtain audio-visual fused features, thereby realizing the second stage of fusion of the features. The fused features are then inputted into fully connected (FC) layers. A non-linear ReLU activation function is used in the FC layers. Three FC layers are used to process the fused features, and each FC layer includes 500 units. The FC layers output C cIRMs, and each mask corresponds to one speaker. The sequence of the masks corresponding to the speakers is the same as the sequence of the speakers' visual features in the audio-visual feature fusion process, so as to resolve the permutation problem caused by the label arrangement when speech separation is performed using only audio information. Complex multiplication is performed on the cIRMs and the complex spectrogram of the mixed speech to obtain the complex spectrogram of the clean speech of the corresponding speaker. Finally, the complex spectrogram corresponding to each person is transformed into a clean speech signal by means of the iSTFT. Embodiment II This embodiment is intended to provide a computing device, including a memory, a processor, and a computer program stored in the memory and executable by the processor, where when the processor executes the program, steps of the foregoing method are performed. Embodiment III This embodiment is intended to provide a computer-readable storage medium. 
The computer-readable storage medium stores a computer program, where when the program is executed by a processor, steps of the foregoing method are performed. Embodiment IV This embodiment is intended to provide a multi-modal speech separation system, including: a data receiving module, configured to receive mixed speech of speakers and facial visual information of the speakers; a multi-modal speech separation model preprocessing module, configured to preprocess the above data to obtain a complex spectrogram and face images, transmit the complex spectrogram and the face images to a multi-modal speech separation model, and dynamically adjust a structure of the model according to a quantity of the speakers, where during training of the multi-modal speech separation model, a cIRM is used as a training target, where the cIRM is composed of a real part and an imaginary part and includes the amplitude and phase information of the speech, and the multi-modal speech separation model is configured to output cIRMs for a quantity of faces; and a speech separation module, configured to perform complex multiplication on the outputted time-frequency masks and the spectrogram of the mixed speech to obtain the spectrogram of the clean speech, and perform inverse short-time Fourier transform (iSTFT) on the spectrogram of the clean speech to obtain a time-domain signal of the clean speech, thereby completing speech separation. The steps involved in the devices of the above Embodiments II, III, and IV correspond to method Embodiment I. For the specific implementation, reference may be made to the relevant description of Embodiment I. The term "computer-readable storage medium" should be understood as a single medium or a plurality of media including one or more instruction sets, and should also be understood to include any medium that can store, encode, or carry an instruction set executed by the processor and enable the processor to execute any method in the present disclosure. 
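The complex multiplication carried out by the speech separation module, with the real and imaginary parts of masks and spectrograms stored as two channels, can be sketched as follows (`complex_mul` is a hypothetical helper illustrating the arithmetic, not code from the disclosure):

```python
import numpy as np

def complex_mul(m, x):
    """Complex multiplication with real/imag parts stored in the last axis:
    (a + bi)(c + di) = (ac - bd) + (ad + bc)i."""
    a, b = m[..., 0], m[..., 1]
    c, d = x[..., 0], x[..., 1]
    return np.stack([a * c - b * d, a * d + b * c], axis=-1)

rng = np.random.default_rng(1)
mask = rng.standard_normal((257, 298, 2))   # a predicted cIRM
spec = rng.standard_normal((257, 298, 2))   # the mixed complex spectrogram
out = complex_mul(mask, spec)
print(out.shape)  # (257, 298, 2)
```

The result agrees element-wise with native complex arithmetic on the same data, which is an easy property to unit-test.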
Experiment and result: This part introduces the data set used and the parameter settings in the experiment, and then gives the experimental results and analyzes them. In order to facilitate comparison with the model results in other literatures, the quantity of speakers is set to 2. That is to say, the mixed speech includes two speakers, and the model finally outputs two speech signals. Regarding the data set: The multi-modal speech separation model provided in the present disclosure is trained by using the GRID data set. The GRID data set records facial video and audio information of 34 speakers. For each speaker, the data set includes 1,000 facial videos and the corresponding audio. The duration of each video is 3 s, and the frame rate of the video is 25 FPS. A mixed speech data set is further constructed by referring to [Audio-Visual Deep Clustering for Speech Separation]. Since the data information corresponding to some speakers is incomplete, all data of these speakers is deleted. After preprocessing, the finally selected data set includes 17 male speakers and 15 female speakers. The model selects data from two male speakers and two female speakers to build a validation set, selects data from two other male speakers and two female speakers to build a test set, and uses the rest of the data to build a training set. All audio is down-sampled to 16 kHz. During construction of the mixed speech, the audio of different speakers is randomly selected, and the clean audio is then mixed to obtain the mixed speech. In this experiment, mixed speech of two persons was used, and a total of 41,847 mixed speech samples were finally obtained. Data preprocessing: Complex spectrograms of all mixed audio are obtained by using the STFT. The length of the Hamming window is 25 ms, the sliding distance is 10 ms, and the window size of the fast Fourier transform (FFT) is 512. 
Since the obtained complex spectrogram includes a real part and an imaginary part, the size of the spectrogram finally obtained by means of the STFT is 257x298x2. The complex spectrogram is then divided into a high-frequency part and a low-frequency part in the frequency dimension. After many attempts, it is found that a segmentation point of 180 in the frequency dimension is the most appropriate. Specifically, the size of the low-frequency part is 180x298x2, and the size of the high-frequency part is 77x298x2. For the video data, since each video is 25 FPS and 3 s in total, 75 pictures are obtained after the video is transformed into video frames. OpenCV and Dlib are used to obtain the quantity of faces, locate the face area in each frame, and extract the face area. The size of each face picture finally obtained is 160x160. Experimental setup: In the experiment, the deep learning framework used is Keras, and an NVIDIA RTX 2080Ti graphics card is used to train the model. The network is trained for 300 epochs using the Adam optimizer. The initial learning rate is set to 0.00001. In order to prevent overfitting, the learning rate is reduced to one-tenth of its previous value every 100 epochs. Experimental result: The evaluation index used is the SDR. Since different mixed speech corresponds to different SDRs, the quality of the model cannot be determined simply by comparing the SDR of the output clean speech. For convenience of comparison, the SDR of the mixed speech is subtracted from the SDR of the finally outputted speech of the model to obtain the actual improvement in SDR, denoted ΔSDR. When there are only two speakers, the ΔSDR results of different methods are compared and analyzed. The inputted mixed speech is divided into three situations: male-male mixed speech, male-female mixed speech, and female-female mixed speech. 
When the inputs are respectively the male-male mixed speech, the male-female mixed speech, and the female-female mixed speech, the performance of the speech separation models using different methods is obtained, and the results are shown in Table 1. It can be seen that, compared with other speech separation methods, the method of the present disclosure has superior performance. Since the frequencies of male audio and female audio are quite different, the effect of speech separation is best on mixed audio of different sexes, where the model of the present disclosure achieves an improvement of 11.14 dB. When the input is mixed speech of the same sex, the separation effect of all models is reduced compared with the mixed speech of different sexes. However, the model of the present disclosure still performs better than the other models, achieving an improvement of 6.31 dB on the male-male mixed speech and an improvement of 8.47 dB on the female-female mixed speech. Table 1 Comparison of the speech separation results of different models when the mixed speech includes only speech of two speakers
Method                             Male-male mixed speech (ΔSDR, dB)   Male-female mixed speech (ΔSDR, dB)   Female-female mixed speech (ΔSDR, dB)
AV-Match                           4.93                                9.80                                  6.96
AVDC                               6.18                                10.16                                 8.32
uPIT                               4.68                                9.56                                  4.92
Model of the present disclosure    6.31                                11.14                                 8.47
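The ΔSDR figures reported here are the SDR of the separated output minus the SDR of the raw mixture. A simplified illustration follows, using a plain energy-ratio SDR rather than the full BSS-Eval metric used in the literature, and toy signals in place of real speech:

```python
import numpy as np

def sdr(reference, estimate):
    """A plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    err = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

rng = np.random.default_rng(2)
s1 = rng.standard_normal(48000)     # target speaker (3 s at 16 kHz)
s2 = rng.standard_normal(48000)     # interfering speaker
mix = s1 + s2
est = s1 + 0.1 * s2                 # a toy, partially separated output

# ΔSDR: improvement of the separated output over the unprocessed mixture.
delta = sdr(s1, est) - sdr(s1, mix)
print(delta > 0)  # True
```

Subtracting the mixture's SDR normalizes away the difficulty of each individual mixture, which is what makes results averaged over many mixtures comparable.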
In addition, the model of the present disclosure also addresses the large quantity of parameters and long training time that often appear in neural network models. During the feature extraction of speech, the TCN is used instead of the LSTM for time series modeling, which greatly reduces the training parameters and the training time. Since the model using the LSTM is similar to the Google model, the same data set is used to compare the model of the present disclosure with the Google model. During the experiment, it was found that the AVSpeech data set proposed by Google was too large and the training time was very long, which was not conducive to the experiment. Therefore, the GRID data set was used to train the model proposed by Google for comparison with the model of the present disclosure. In order to save time and ensure fairness, the number of training epochs of both models is set to 100. Although the loss of the two models has not reached its lowest point at this stage, the comparison between the two models is not affected. The comparison results are shown in Table 2. It can be seen that, under the same conditions, the model of the present disclosure greatly reduces the training time compared with the Google model, while the separation performance is not only unaffected but improved. Therefore, the model of the present disclosure is more applicable to most application scenarios. Table 2 Comparison of results between the model of the present disclosure and the Google model
Method                             Training time (100 epochs)   ΔSDR (100 epochs)
Model provided by Google           About 400 h                  9.43
Model of the present disclosure    About 72 h                   9.86
As described above, in this application, the high-frequency audio feature and the low-frequency audio feature in the audio feature extraction network are extracted by using a two-stream CNN. An auxiliary experiment is designed to prove that using a plurality of streams to extract features indeed facilitates speech separation. In a new model, the high-frequency feature extraction network is deleted, and the mixed spectrogram is directly fed to the low-frequency feature extraction network without segmentation. The comparison results are shown in Table 3. It can be seen from the table that the features extracted by using the two-stream CNN yield a ΔSDR 1.49 dB higher than those extracted by using the single-stream CNN. Therefore, the best method is to use different networks to extract the high-frequency feature and the low-frequency feature separately, because the ideal feature extractors corresponding to different frequencies of the speech are different. Table 3 Impact of the two-stream CNN and the single-stream CNN on the results of extracting audio features
Method                                                     ΔSDR (dB)
Audio feature extraction by using the two-stream CNN       11.14
Audio feature extraction by using the single-stream CNN    9.65
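The segmentation that feeds the two streams amounts to a simple slice of the complex spectrogram along the frequency axis, with the shapes given in the data preprocessing description:

```python
import numpy as np

# Complex spectrogram from the STFT stage: 257 frequency bins, 298 time
# frames, with real/imag parts stored as two channels.
spec = np.zeros((257, 298, 2))

# Split along the frequency axis at bin 180, the cutoff the authors
# report as most appropriate after experimentation.
low = spec[:180]     # input of the low-frequency stream
high = spec[180:]    # input of the high-frequency stream
print(low.shape, high.shape)  # (180, 298, 2) (77, 298, 2)
```

The single-stream baseline in Table 3 corresponds to skipping this slice and feeding the full 257-bin spectrogram to one extractor.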
In order to observe whether visual information is helpful to the performance of speech separation, the speech separation model that uses only the speech is compared with the audio-visual fusion model. Specifically, the technical solution of the present disclosure uses only the speech stream of the proposed model and deletes the visual feature extraction network, with everything else unchanged, so that a speech separation model using only the mixed speech as the input is obtained. The comparison results are shown in Table 4. It can be seen from the table that adding visual feature information has a certain effect on the improvement of speech separation performance.
Table 4 Comparison of results between the model using only the audio and the audio-visual feature fusion model
Method                                      ΔSDR (dB)
Speech separation using only the audio      10.32
Speech separation of audio-visual fusion    11.14
The technical solution of the present disclosure provides a new audio-visual fusion speech separation model. The model combines audio features and visual features, imitates the physiological characteristics of the cochlea in the process of audio feature extraction, extracts the high-frequency features and the low-frequency features of the speech by using different networks, and can automatically determine the quantity of outputs by using the face detector. It can be seen from the experimental results that, on the same data set, the model of the present disclosure has better performance than several recently proposed models. In addition, the feasibility of the ideas proposed in the technical solution of the present disclosure is verified by means of the experiments that have been conducted. In subsequent technical solutions, the speech separation technology may be applied to more complex scenarios, and applied to speech enhancement to achieve background noise suppression. A person skilled in the art should understand that the modules or steps in the present disclosure may be implemented by using a general-purpose computer apparatus. Optionally, they may be implemented by using program code executable by a computing apparatus, so that they may be stored in a storage apparatus and executed by the computing apparatus. Alternatively, the modules or steps are respectively manufactured into various integrated circuit modules, or a plurality of modules or steps thereof are manufactured into a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software. The foregoing descriptions are merely exemplary embodiments of the present disclosure, and are not intended to limit the present disclosure. Those skilled in the art may make various alterations and variations to the present disclosure. 
Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure shall fall within the protection scope of the present disclosure. The specific implementations of the present disclosure are described above with reference to the accompanying drawings, but are not intended to limit the protection scope of the present disclosure. A person skilled in the art should understand that various modifications or transformations may be made without creative efforts based on the technical solutions of the present disclosure, and such modifications or transformations shall fall within the protection scope of the present disclosure.
Claims (9)
1. A multi-modal speech separation method, comprising: receiving mixed speech of speakers and facial visual information of the speakers, and obtaining a quantity of the speakers by means of face detection; pre-processing the mixed speech and facial visual information of the speakers to obtain a complex spectrogram of the mixed speech and face images of the speakers, transmitting the spectrogram and the face images to a multi-modal speech separation model, and dynamically adjusting a structure of the model according to the quantity of the speakers, wherein the multi-modal speech separation model is configured to output complex ideal ratio masks (cIRM) for the quantity of speakers; and performing complex multiplication on the outputted cIRMs and the complex spectrogram of the mixed speech to obtain a spectrogram, and transforming the spectrogram to obtain a time-domain signal of a clean speech, thereby completing speech separation; wherein the multi-modal speech separation model comprises: an audio feature extraction network, being configured to extract a high-frequency audio feature and a low-frequency audio feature by using different convolutional neural networks (CNN), and fuse the low-frequency audio feature and the high-frequency audio feature to realize a first stage of fusion of the features, and then continue to extract an audio feature by using a temporal convolutional network (TCN); a visual feature extraction network, being configured to extract a visual feature for inputted face images by using a plurality of convolutional layers, and insert a hole into each convolution kernel to increase a size of each convolution kernel, thereby increasing a receptive field; and an audio-visual feature fusion network, being configured to fuse the audio feature obtained by the audio feature extraction network and the visual feature obtained by the visual feature extraction network to obtain an audio-visual fusion feature, thereby realizing a second stage of fusion 
of the features.
2. The multi-modal speech separation method according to claim 1, wherein the multi-modal speech separation model has a dynamic network structure, for each instance, the structure of the model is dynamically adjusted according to the quantity of the speakers detected at a data preprocessing stage, and the multi-modal speech separation model is applicable to any quantity of the speakers; and preferably, during training of the multi-modal speech separation model, each cIRM is used as a training target, wherein the cIRM is defined as a ratio of a spectrogram of the clean speech to the spectrogram of the mixed speech in a complex domain, and is composed of a real part and an imaginary part and comprises an amplitude and phase information of the speech.
3. The multi-modal speech separation method according to claim 1, wherein the extracting high-frequency audio feature and low-frequency audio feature by using different feature extractors specifically comprises: transforming the time-domain signal of the mixed speech into a complex spectrogram by using a short-time Fourier transform (STFT), and then segmenting the complex spectrogram into a high-frequency part and a low-frequency part in a frequency dimension; extracting the low-frequency audio feature and the high-frequency audio feature by using a two-stream CNN, wherein each stream comprises two convolutional layers, wherein different dilation parameters are used for network layers for extracting the high-frequency feature and network layers for extracting the low-frequency feature; and fusing the high-frequency audio feature and the low-frequency audio feature to realize the first stage of fusion of the features.
4. The multi-modal speech separation method according to claim 1, wherein a one-dimensional convolutional network layer in the TCN is modified into a two-dimensional convolutional network layer, so that the TCN is capable of processing output data of the audio feature extraction network.
5. The multi-modal speech separation method according to claim 1, wherein an output of the visual feature extraction network is up-sampled to compensate for a sampling rate difference between an audio signal and a video signal; and the audio feature and the visual feature are fused to realize the second stage of fusion of the features.
6. The multi-modal speech separation method according to claim 1, wherein the multi-modal speech separation model inputs the audio-visual fusion feature into a fully connected layer, wherein the fully connected layer outputs cIRMs for the quantity of detected faces, each mask corresponds to one speaker, and the sequence of the speakers corresponding to the masks is the same as the sequence in which the speakers' visual features are concatenated during the fusion process of the audio-visual fusion feature.
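A minimal sketch of this output head, assuming illustrative dimensions (fusion width 320, 128 frequency bins, 3 detected speakers): the fully connected layer is sized at run time from the detected speaker count and emits one complex mask (real and imaginary parts per frequency bin) per speaker, in the same order as the concatenated visual features.

```python
import torch
import torch.nn as nn

def make_mask_head(fusion_dim: int, n_freq: int, n_speakers: int) -> nn.Linear:
    """Fully connected layer producing one cIRM per detected speaker:
    n_freq frequency bins x 2 (real + imaginary) per speaker."""
    return nn.Linear(fusion_dim, n_speakers * n_freq * 2)

head = make_mask_head(fusion_dim=320, n_freq=128, n_speakers=3)
frames = torch.randn(100, 320)             # one fused vector per time frame
masks = head(frames).view(100, 3, 128, 2)  # (time, speaker, freq, re/im)
```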
7. A multi-modal speech separation system, comprising: a data receiving module, configured to receive mixed speech of speakers and facial visual information of the speakers, and to obtain a quantity of the speakers by means of face detection; a multi-modal speech separation model processing module, configured to process the above information to obtain a complex spectrogram and face images of the speakers, transmit the complex spectrogram and the face images to a multi-modal speech separation model, and dynamically adjust a structure of the model according to the quantity of the speakers, wherein during training of the multi-modal speech separation model, a complex time-frequency mask is used as a training target, wherein the complex time-frequency mask is defined as the ratio of a spectrogram of clean speech to the spectrogram of the mixed speech in the complex domain, is composed of a real part and an imaginary part, and comprises amplitude and phase information of the speech; wherein the multi-modal speech separation model outputs complex time-frequency masks for the quantity of the speakers, and comprises: an audio feature extraction network, configured to extract a high-frequency audio feature and a low-frequency audio feature by using different convolutional neural networks (CNNs), fuse the low-frequency audio feature and the high-frequency audio feature to realize a first stage of fusion of the features, and then continue to extract an audio feature by using a temporal convolutional network (TCN); a visual feature extraction network, configured to extract a visual feature from the inputted face images by using a plurality of convolutional layers, wherein a hole is inserted into each convolution kernel to increase the size of the convolution kernel, thereby increasing the receptive field; and an audio-visual feature fusion network, configured to fuse the audio feature obtained by the audio feature extraction network and the visual feature
obtained by the visual feature extraction network to obtain an audio-visual fusion feature, thereby realizing a second stage of fusion of the features; and a speech separation module, being configured to perform complex multiplication on the outputted masks and the spectrogram of the mixed speech to obtain the spectrogram of the clean speech, and perform inverse short-time Fourier transform (iSTFT) on the spectrogram of the clean speech to obtain a time-domain signal of the clean speech, thereby completing speech separation.
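The separation step of the speech separation module — complex multiplication of the mask with the mixture spectrogram, followed by the iSTFT — can be demonstrated end-to-end in Python. This is an illustration only: the "mask" here is the ideal cIRM computed from a known clean source (two sinusoids stand in for speakers), and the sample rate, frame length, and epsilon are assumed values.

```python
import numpy as np
from scipy.signal import stft, istft

fs, nperseg = 8000, 256
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)         # target "speaker"
noise = 0.5 * np.sin(2 * np.pi * 1000 * t)  # interfering source
mix = clean + noise

_, _, S_mix = stft(mix, fs=fs, nperseg=nperseg)
_, _, S_clean = stft(clean, fs=fs, nperseg=nperseg)

cirm = S_clean / (S_mix + 1e-10)   # ideal complex mask (training target)
S_est = cirm * S_mix               # complex multiplication with the mask
_, est = istft(S_est, fs=fs, nperseg=nperseg)
est = est[:mix.size]               # trim iSTFT padding to signal length
```

Because the mask is ideal, the inverted waveform matches the clean source up to numerical error; in the claimed system the masks are instead predicted by the network.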
8. A computing device, comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein when the processor executes the program, steps of the method according to any of claims 1 to 6 are performed.
9. A computer-readable storage medium storing a computer program therein, wherein when the program is executed by a processor, steps of the method according to any of claims 1 to 6 are performed.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2021102717038 | 2021-03-12 | | |
| CN202110271703.8A (CN113035227B) (en) | 2021-03-12 | 2021-03-12 | Multi-modal voice separation method and system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| AU2022200439A1 (en) | 2022-09-29 |
| AU2022200439B2 (en) | 2022-10-20 |
Family
ID=76470471
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| AU2022200439A (AU2022200439B2) (en), Active | Multi-modal speech separation method and system | 2021-03-12 | 2022-01-24 |
Country Status (2)
| Country | Link |
|---|---|
| CN | CN113035227B (en) |
| AU | AU2022200439B2 (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7242903B2 (en) * | 2019-05-14 | 2023-03-20 | ドルビー ラボラトリーズ ライセンシング コーポレイション | Method and Apparatus for Utterance Source Separation Based on Convolutional Neural Networks |
CN113593601A (en) * | 2021-07-27 | 2021-11-02 | 哈尔滨理工大学 | Audio-visual multi-modal voice separation method based on deep learning |
CN115691538A (en) * | 2021-07-29 | 2023-02-03 | 华为技术有限公司 | Video processing method and electronic equipment |
CN115938385A (en) * | 2021-08-17 | 2023-04-07 | 中移(苏州)软件技术有限公司 | Voice separation method and device and storage medium |
CN114245280B (en) * | 2021-12-20 | 2023-06-23 | 清华大学深圳国际研究生院 | Scene self-adaptive hearing aid audio enhancement system based on neural network |
CN114446316B (en) * | 2022-01-27 | 2024-03-12 | 腾讯科技(深圳)有限公司 | Audio separation method, training method, device and equipment of audio separation model |
CN114598914A (en) * | 2022-02-17 | 2022-06-07 | 厦门快商通科技股份有限公司 | Human voice separation method based on video, terminal equipment and storage medium |
CN114566189B (en) * | 2022-04-28 | 2022-10-04 | 之江实验室 | Speech emotion recognition method and system based on three-dimensional depth feature fusion |
CN114743561A (en) * | 2022-05-06 | 2022-07-12 | 广州思信电子科技有限公司 | Voice separation device and method, storage medium and computer equipment |
CN115035907B (en) * | 2022-05-30 | 2023-03-17 | 中国科学院自动化研究所 | Target speaker separation system, device and storage medium |
CN117238311B (en) * | 2023-11-10 | 2024-01-30 | 深圳市齐奥通信技术有限公司 | Speech separation enhancement method and system in multi-sound source and noise environment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019104229A1 (en) * | 2017-11-22 | 2019-05-31 | Google Llc | Audio-visual speech separation |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11373672B2 (en) * | 2016-06-14 | 2022-06-28 | The Trustees Of Columbia University In The City Of New York | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments |
CN108010514B (en) * | 2017-11-20 | 2021-09-10 | 四川大学 | Voice classification method based on deep neural network |
CN110931022B (en) * | 2019-11-19 | 2023-09-15 | 天津大学 | Voiceprint recognition method based on high-low frequency dynamic and static characteristics |
2021
- 2021-03-12: CN — application CN202110271703.8A, granted as patent CN113035227B (Active)

2022
- 2022-01-24: AU — application AU2022200439A, granted as patent AU2022200439B2 (Active)
Non-Patent Citations (1)
Title |
---|
EPHRAT, A. et al., ‘Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation’, 9 August 2018 [retrieved from the web on 7 February 2022] * |
Also Published As
Publication number | Publication date |
---|---|
CN113035227B (en) | 2022-02-11 |
CN113035227A (en) | 2021-06-25 |
AU2022200439A1 (en) | 2022-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2022200439B2 (en) | Multi-modal speech separation method and system | |
Rivet et al. | Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures | |
CN110600017A (en) | Training method of voice processing model, voice recognition method, system and device | |
Gogate et al. | DNN driven speaker independent audio-visual mask estimation for speech separation | |
Chen et al. | The first multimodal information based speech processing (misp) challenge: Data, tasks, baselines and results | |
CN108899044A (en) | Audio signal processing method and device | |
Pan et al. | Selective listening by synchronizing speech with lips | |
WO2022048239A1 (en) | Audio processing method and device | |
US20230335148A1 (en) | Speech Separation Method, Electronic Device, Chip, and Computer-Readable Storage Medium | |
Chen et al. | Multi-Modality Matters: A Performance Leap on VoxCeleb. | |
Yu et al. | Audio-visual multi-channel integration and recognition of overlapped speech | |
Wang et al. | A network model of speaker identification with new feature extraction methods and asymmetric BLSTM | |
CN116472579A (en) | Machine learning for microphone style transfer | |
Chen et al. | Sound localization by self-supervised time delay estimation | |
Li et al. | VCSE: Time-domain visual-contextual speaker extraction network | |
Ahmad et al. | Speech enhancement for multimodal speaker diarization system | |
Li et al. | Rethinking the visual cues in audio-visual speaker extraction | |
Wang et al. | The dku post-challenge audio-visual wake word spotting system for the 2021 misp challenge: Deep analysis | |
Pan et al. | ImagineNet: Target speaker extraction with intermittent visual cue through embedding inpainting | |
Suresh et al. | Computer-aided interpreter for hearing and speech impaired | |
Gul et al. | A survey of audio enhancement algorithms for music, speech, bioacoustics, biomedical, industrial and environmental sounds by image U-Net | |
Abel et al. | Cognitively inspired audiovisual speech filtering: towards an intelligent, fuzzy based, multimodal, two-stage speech enhancement system | |
CN115691539A (en) | Two-stage voice separation method and system based on visual guidance | |
Liu et al. | Multi-Modal Speech Separation Based on Two-Stage Feature Fusion | |
Zermini | Deep Learning for Speech Separation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGA | Letters patent sealed or granted (standard patent) |