CN114596876B - Sound source separation method and device - Google Patents

Sound source separation method and device

Info

Publication number
CN114596876B
Authority
CN
China
Prior art keywords
sound
source
spectrogram
layer
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210073239.6A
Other languages
Chinese (zh)
Other versions
CN114596876A (en)
Inventor
Zhaoxiang Zhang (张兆翔)
Tieniu Tan (谭铁牛)
Zengjie Song (宋增杰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210073239.6A priority Critical patent/CN114596876B/en
Publication of CN114596876A publication Critical patent/CN114596876A/en
Application granted granted Critical
Publication of CN114596876B publication Critical patent/CN114596876B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G10L 21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a sound source separation method and device. The method comprises: acquiring visual guidance features from video frame images; inputting a first aliased multi-sound-source spectrogram and the visual guidance features into a trained predictive coding cyclic convolutional neural network model to obtain a first mask map; and acquiring the separated sound signal according to the first aliased multi-sound-source spectrogram and the first mask map. The visual guidance features and the aliased multi-sound-source spectrogram are input into the trained predictive coding cyclic convolutional neural network model to predict the mask map of each sound component, and the separated sound signals are then acquired using the mask maps and the aliased spectrogram. The spectrogram and the visual guidance features are therefore processed within the same network model, so the network model is small in scale, the visual and sound features can be fused progressively and effectively, and the sound source separation accuracy is improved.

Description

Sound source separation method and device
Technical Field
The invention relates to the technical field of computer vision and audio signal separation, in particular to a sound source separation method and a sound source separation device.
Background
Visually guided sound source separation is an important and challenging classical visual-sound multi-modal task, with wide applications in fields such as speaker recognition in videos, speech enhancement, audio denoising and intelligent video editing.
Because images and sound signals are inconsistent in data format and data characteristics, the prior art designs different network models to process the image and the sound signal separately, and in addition uses a further model to fuse the features of the different modalities. As a result, the overall model is large in scale and difficult to deploy, the feature fusion acts only on the high-level layers of the network model, and the accuracy of the separated sound sources is not high.
Therefore, there is an urgent need for a sound source separation method with a small model scale and high sound source separation accuracy.
Disclosure of Invention
The invention provides a sound source separation method and a sound source separation device, which are used for solving the defects of large network model scale and low sound source separation precision in the prior art.
The invention provides a sound source separation method, which comprises the following steps:
acquiring visual guide features in a video frame image;
inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map;
and acquiring a separated sound signal according to the first aliasing multi-source sound spectrogram and the first mask map.
Optionally, the trained predictive coding cyclic convolutional neural network model is obtained by:
inputting the visual guide features corresponding to the second aliasing multi-sound-source sound spectrogram and the frame image of the single-sound-source video into a predictive coding cyclic convolution neural network model, and outputting a second mask image; the second mask map is a predicted mask map;
comparing corresponding elements in the first single sound source spectrogram and the second aliasing multi-sound source spectrogram to obtain a third mask map; the third mask image is a real mask image;
taking the binary cross entropy between the second mask map and the third mask map as a loss function;
and optimizing the value of the loss function to obtain a trained predictive coding cyclic convolution neural network model.
Optionally, the prediction coding cyclic convolutional neural network model comprises a prediction coding cyclic convolutional neural network, a single-layer transposed convolutional layer and an upsampling layer;
feedback connections are adopted between the convolutional layers in the predictive coding cyclic convolutional neural network, feedforward connections are adopted between the transposed convolutional layers, and cyclic connections are adopted between the convolutional layer and the transposed convolutional layer at the same level.
Optionally, inputting the first aliased multi-source spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map, where the method includes:
inputting the first aliasing multi-source spectrogram and the visual guidance features into the predictive coding cyclic convolution neural network to obtain visual-acoustic fusion features;
and inputting the visual and sound fusion features into the single-layer transposed convolution layer and the up-sampling layer to obtain a first mask image.
Optionally, inputting the first aliased multi-source spectrogram and the visual guidance feature into the predictive coding cyclic convolution neural network to obtain a visual-audio fusion feature, including:
inputting the first aliasing multi-source spectrogram into the top layer of the convolutional layer, sequentially acquiring a prediction signal of each convolutional layer, and acquiring a neuron response of each convolutional layer by using the prediction signal of each convolutional layer;
inputting the visual guide features into the bottom layer of the transposed convolutional layer, and sequentially acquiring the neuron response of each layer of the transposed convolutional layer according to the neuron response of the convolutional layer at the same layer;
in the last iteration of the loop, the neuron response of the top layer of the transposed convolutional layer is taken as a visual-acoustic fusion feature.
Optionally, acquiring a separated sound signal according to the first aliased multi-source sound spectrogram and the first mask map, comprises:
multiplying the first aliasing multi-sound-source sound spectrogram by corresponding elements of the first mask image to obtain a second single-sound-source sound spectrogram;
and transforming the second single-sound-source spectrogram to the time domain by using the inverse short-time Fourier transform to obtain the separated sound signals.
Optionally, the second aliased multi-source spectrogram is obtained by:
carrying out sound sampling on different single-sound-source video data to obtain different single-sound-source sound signals;
linearly overlapping the different single-sound-source sound signals to obtain aliasing multi-sound-source sound signals;
transforming the aliased multi-source sound signal into the second aliased multi-source spectrogram using a short-time Fourier transform.
The present invention also provides a sound source separating apparatus, comprising:
the first acquisition module is used for acquiring visual guide features in the video frame images;
the second acquisition module is used for inputting the first aliasing multi-sound source spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to acquire a first mask map;
a third obtaining module, configured to obtain the separated sound signal according to the first aliased multi-source sound spectrogram and the first mask map.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the sound source separation method as described in any one of the above when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the sound source separation method as described in any one of the above.
According to the sound source separation method and device provided by the invention, the visual guide characteristics and the aliasing multi-sound-source sound spectrogram are input into the trained predictive coding cyclic convolution neural network model to predict the mask map of each sound component, and then the mask map and the aliasing multi-sound-source sound spectrogram are used for acquiring separated sound signals, so that the sound spectrogram and the visual guide characteristics are processed in the same network model, the network model is small in scale, the visual characteristics and the sound characteristics can be progressively and effectively fused, and the sound source separation precision is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a sound source separation method provided by an embodiment of the present invention;
fig. 2 is an overall frame schematic diagram of a sound source separation method provided by an embodiment of the present invention;
FIG. 3 is a flow chart of the training and application of the predictive coding circular convolution neural network model provided by the embodiment of the invention;
FIG. 4 is a schematic structural diagram of a predictive coding cyclic convolutional neural network model provided by an embodiment of the present invention;
FIG. 5 is a flowchart of a computation of a predictive coding circular convolutional neural network model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a sound source separation apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a sound source separation method according to an embodiment of the present invention, and as shown in fig. 1, the present invention provides a sound source separation method, including:
step 101, acquiring visual guide features in a video frame image.
Specifically, fig. 2 is a schematic overall framework diagram of a sound source separation method according to an embodiment of the present invention. As shown in fig. 2, the video frame image is divided into a video frame image v_1 and a video frame image v_2, which are respectively input into the visual feature extraction model to obtain a visual guide feature f_1 and a visual guide feature f_2.
The video frame images may be obtained as follows: M_1 frames of video images are extracted from a given video, each frame image is cut into two images such that each image contains only the pixels of a single sounding object, and the images are normalized to obtain two groups of frame images v_1 and v_2 of size M_1 × H_1 × W_1 × 3, where H_1 denotes the image height, W_1 denotes the image width, and 3 denotes the red, green and blue color channels.
For example, 3 video frames are first extracted from a given duet video, each frame is then cut into two images, each containing only the pixels of a single sounding object, and the images are normalized to obtain two groups of frame images of size 3 × 224 × 224 × 3.
A pre-trained Residual Neural Network (ResNet-18) is taken as the visual feature extraction model to extract the apparent semantic features of the video frame images, and the output of the last convolutional layer of ResNet-18 is taken as the visual guide feature.
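As an illustrative sketch (not part of the patent text), this visual feature extraction step could be implemented with a pre-trained ResNet-18 in PyTorch as follows; the 224 × 224 preprocessing follows the example above, while the max-pooling over the M frames is an assumption of the sketch, since the patent only specifies that the output of the last convolutional layer is used:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    # Pre-trained ResNet-18; drop the global pooling and fully connected head so
    # that the output of the last convolutional stage is the visual guide feature.
    resnet18 = models.resnet18(pretrained=True)
    feature_extractor = torch.nn.Sequential(*list(resnet18.children())[:-2]).eval()

    preprocess = T.Compose([
        T.Resize((224, 224)),          # normalize each cropped single-object frame
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def visual_guide_feature(frames):
        """frames: list of M PIL images, each containing a single sounding object."""
        batch = torch.stack([preprocess(f) for f in frames])   # (M, 3, 224, 224)
        feats = feature_extractor(batch)                       # (M, 512, 7, 7)
        # Pooling the frame dimension into one feature map is an assumption here.
        return feats.max(dim=0).values                         # (512, 7, 7)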
Step 102, inputting the first aliased multi-sound-source spectrogram and the visual guidance features into the trained predictive coding cyclic convolutional neural network model to obtain a first mask map.
In particular, the visual guide feature f_1 and the aliased multi-sound-source spectrogram are input into the trained predictive coding cyclic convolutional neural network model to obtain a predicted mask map M̂_1. Similarly, the visual guide feature f_2 and the aliased multi-sound-source spectrogram are input into the trained predictive coding cyclic convolutional neural network model to obtain a predicted mask map M̂_2.
The aliased multi-source spectrogram may be obtained as follows: an aliased multi-source sound signal of fixed length G_1 seconds is extracted from a given video at a sampling rate K_1, and the aliased multi-source sound signal is then converted into an aliased multi-source spectrogram by the Short-Time Fourier Transform (STFT).
For example, a 6-second aliased multi-source sound signal is extracted from a given duet video at a sampling rate of 11 kHz, and the sampled aliased multi-source sound signal is converted into an aliased multi-source spectrogram by the STFT.
Before the first aliased multi-sound-source spectrogram and the visual guidance features are input into the trained predictive coding cyclic convolutional neural network model to obtain the first mask map, the trained predictive coding cyclic convolutional neural network model needs to be obtained.
Optionally, the trained predictive coding cyclic convolutional neural network model is obtained by:
inputting the visual guide features corresponding to the second aliasing multi-sound-source sound spectrogram and the frame image of the single-sound-source video into a predictive coding cyclic convolution neural network model, and outputting a second mask image; the second mask is a predicted mask;
comparing corresponding elements in the first single sound source spectrogram and the second aliasing multi-sound source spectrogram to obtain a third mask map; the third mask image is a real mask image;
taking the binary cross entropy between the second mask image and the third mask image as a loss function;
and optimizing the value of the loss function to obtain a trained predictive coding circular convolution neural network model.
Specifically, fig. 3 is a flowchart of the training and application of the predictive coding cyclic convolutional neural network model provided by an embodiment of the present invention. As shown in fig. 3, the process is divided into a model training stage and a model application stage.
In the model training stage, sampling is carried out from single sound source video data to obtain single sound source video frame images, and visual guide features are extracted from the single sound source video frame images by using a visual feature extraction model.
An aliased multi-source sound signal is constructed using the single-sound-source video data, and the single-source sound signals and the aliased multi-source sound signal are respectively converted into single-source spectrograms and an aliased multi-source spectrogram using the STFT (short-time Fourier transform).
Inputting the aliasing multi-sound source sound spectrogram and the visual guidance features into a predictive coding cyclic convolution neural network model, and outputting a predicted mask map.
And comparing corresponding elements in the single sound source spectrogram and the aliasing multi-sound source spectrogram to obtain a real mask image.
And (3) taking the binary cross entropy between the predicted mask image and the real mask image output by the model as a loss function, and optimizing the value of the loss function by using an algorithm to obtain a trained predictive coding cyclic convolution neural network model.
Image sampling is performed on N segments of different single-sound-source video data: M_2 frames of video images are extracted from each single-sound-source video at a sampling rate K_2 and normalized, yielding N groups of frame images of size M_2 × H_2 × W_2 × 3, where H_2 denotes the image height, W_2 denotes the image width, and 3 denotes the red, green and blue color channels. The N groups of frame images are denoted v_n, where n ranges over the natural numbers from 1 to N.
The frame images of the N groups of single-sound-source videos are respectively input into the pre-trained ResNet-18 model to acquire the visual guide features corresponding to the frame images of the N groups of single-sound-source videos, denoted f_n, where n ranges over the natural numbers from 1 to N.
Optionally, the second aliased multi-source spectrogram is obtained by:
carrying out sound sampling on different single-sound-source video data to obtain different single-sound-source sound signals;
carrying out linear superposition on different single-sound-source sound signals to obtain aliasing multi-sound-source sound signals;
the aliased multi-source sound signal is transformed into a second aliased multi-source sound spectrogram using a short-time fourier transform.
Specifically, sound sampling is performed on the N segments of different single-sound-source video data: a single-source sound signal of fixed length G_2 seconds is extracted from each at a sampling rate K_3, denoted a_n, where n ranges over the natural numbers from 1 to N.
The acquired single-sound-source sound signals are linearly superposed, exploiting the approximate linear additivity of sound signals, to obtain an aliased multi-sound-source sound signal.
The expression of the aliased multi-source sound signal is as follows:
a_mix = Σ_{n=1}^{N} a_n
where a_mix denotes the aliased multi-source sound signal, a_n denotes the n-th single-source sound signal, and N denotes the total number of single-source sound signal segments.
The single-source sound signals a_n and the aliased multi-source sound signal a_mix are converted, using the STFT, into the single-source spectrograms S_n and the aliased multi-source spectrogram S_mix, respectively. The STFT uses a Hanning window of width 1022 samples, and the interval between adjacent Hanning windows is 256 samples.
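As an illustrative sketch, this data-synthesis step could be realized as follows; librosa is an assumed library choice, and the 11 kHz rate, 6-second clips, 1022-sample Hanning window and 256-sample hop follow the example values given above:

    import numpy as np
    import librosa

    SR = 11000        # sampling rate (11 kHz in the example above)
    CLIP_SECONDS = 6  # fixed clip length (6 s in the example above)
    N_FFT = 1022      # Hanning window width used for the STFT
    HOP = 256         # interval between adjacent Hanning windows

    def magnitude_spectrogram(audio):
        return np.abs(librosa.stft(audio, n_fft=N_FFT, hop_length=HOP, window="hann"))

    def synthesize_training_pair(paths):
        """Load N single-source clips, superpose them linearly (a_mix = sum_n a_n),
        and return the single-source spectrograms S_n and the aliased spectrogram S_mix."""
        signals = [librosa.load(p, sr=SR, duration=CLIP_SECONDS, mono=True)[0]
                   for p in paths]
        a_mix = np.sum(signals, axis=0)                 # linear superposition
        S_n = [magnitude_spectrogram(a) for a in signals]
        S_mix = magnitude_spectrogram(a_mix)
        return S_n, S_mix, a_mix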
The aliased multi-sound-source sound signals are synthesized from single-sound-source video data and then converted into the training data of the model. This data synthesis strategy can make full use of the massive video data available on the Internet, obtains aliased multi-source sound signals whose sound components are known without manual annotation, and thus conveniently enables self-supervised learning.
After the visual guide features f_n and the aliased multi-source spectrogram S_mix are acquired, f_n and S_mix are input into the predictive coding cyclic convolutional neural network model.
Optionally, the prediction coding cyclic convolutional neural network model includes a prediction coding cyclic convolutional neural network, a single-layer transposed convolutional layer, and an upsampling layer;
the convolution layers in the prediction coding cyclic convolution neural network are connected in a feedback mode, the transposition convolution layers are connected in a feedforward mode, and the convolution layers in the same layer and the transposition convolution layers are connected in a cyclic mode.
Specifically, fig. 4 is a schematic structural diagram of a prediction coding cyclic convolutional neural network model provided in an embodiment of the present invention, and as shown in fig. 4, the prediction coding cyclic convolutional neural network model includes a prediction coding cyclic convolutional neural network, a single-layer transposed convolutional layer, and an upsampling layer.
The predictive coding cyclic convolutional neural network comprises a plurality of convolutional layers and a plurality of transposed convolutional layers. Feedback connections are adopted from the top convolutional layer down to the bottom convolutional layer and are used to transmit the prediction signals; feedforward connections are adopted from the bottom transposed convolutional layer up to the top transposed convolutional layer and are used to transmit the error signals between the prediction signals and the actual responses; cyclic connections are adopted between the convolutional layer and the transposed convolutional layer at the same level and are used to transmit the neuron responses between them.
By enabling the predictive coding cyclic convolution neural network to have feedforward, feedback and cyclic connection, the progressive effective fusion of visual characteristics and sound characteristics can be realized, and the precision of sound source separation is improved.
f_n and S_mix are input into the predictive coding cyclic convolutional neural network within the model. After several iteration steps, the neuron response of the top transposed convolutional layer is taken as the visual-sound fusion feature, which is then input into the single-layer transposed convolutional layer and the upsampling layer of the model to obtain the mask map M̂_n.
The mask map M̂_n is the predicted mask map output by the predictive coding cyclic convolutional neural network model.
The corresponding elements of the single-source spectrogram S_n and the aliased multi-source spectrogram S_mix are compared: if the value of an element in S_n is greater than the value of the corresponding element in S_mix, that position is assigned 1, and the remaining positions are assigned 0, yielding the mask map M_n. This mask map is the real mask map of the sound component.
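A minimal sketch of this element-wise comparison, assuming the spectrograms are stored as NumPy magnitude arrays of equal shape:

    import numpy as np

    def real_mask(S_n, S_mix):
        """Real (ground-truth) mask of one sound component: 1 where the single-source
        spectrogram element exceeds the corresponding aliased-spectrogram element."""
        return (S_n > S_mix).astype(np.float32)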
The Binary Cross Entropy (BCE) between the predicted mask map M̂_n and the real mask map M_n is taken as the loss function.
The expression of the loss function is as follows:
L = (1/N) Σ_{n=1}^{N} BCE(M_n, M̂_n)
where L denotes the value of the loss function, N denotes the total number of single-source sound signal segments, BCE(·) denotes the binary cross-entropy function, M_n denotes the real mask map of the n-th sound component, and M̂_n denotes the corresponding predicted mask map output by the predictive coding cyclic convolutional neural network model.
The value of the loss function is reduced by means of the error back-propagation (BP) algorithm and the stochastic gradient descent (SGD) algorithm to train the predictive coding cyclic convolutional neural network model; after repeated iterative training, the trained predictive coding cyclic convolutional neural network model is obtained.
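A minimal PyTorch training-step sketch under these definitions; here `model` stands for the predictive coding cyclic convolutional neural network model (a sketch of which follows the architecture description further below), the predicted and real mask maps are assumed to share the same shape, and the SGD hyper-parameters are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, S_mix, visual_feats, real_masks):
        """One training step. `visual_feats` and `real_masks` are lists with one
        entry per sound component of the aliased mixture S_mix."""
        optimizer.zero_grad()
        loss = 0.0
        for f_n, M_n in zip(visual_feats, real_masks):
            M_hat = model(S_mix, f_n)                      # predicted mask map
            loss = loss + F.binary_cross_entropy(M_hat, M_n)
        loss = loss / len(real_masks)                      # average BCE over the N components
        loss.backward()                                    # error back-propagation (BP)
        optimizer.step()                                   # stochastic gradient descent (SGD)
        return loss.item()

    # Example optimizer; learning rate and momentum are illustrative assumptions:
    # optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)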
The trained predictive coding cyclic convolutional neural network model is thus obtained by first acquiring the predicted mask map and the real mask map, then taking the binary cross entropy between them as the loss function and optimizing its value. In this way the spectrogram and the visual guide features are processed within the same network model, the network model is small in scale, and a foundation is laid for obtaining the mask map with the trained model.
In the model application stage, a trained predictive coding cyclic convolution neural network model is used for outputting a predicted mask map, then a predicted single sound source spectrogram is obtained according to the aliasing multi-sound source spectrogram and the predicted mask map, and finally the single sound source spectrogram is subjected to short-time inverse Fourier transform to obtain separated sound signals.
Optionally, inputting the first aliased multi-source spectrogram and the visual guidance feature into a trained predictive coding cyclic convolution neural network model to obtain a first mask map, where the method includes:
inputting the first aliasing multi-source spectrogram and the visual guidance features into a predictive coding cyclic convolution neural network to obtain visual-acoustic fusion features;
and inputting the visual and sound fusion characteristics into the single-layer transposed convolution layer and the upper sampling layer to obtain a first mask image.
Specifically, the visual guidance features and the aliasing multi-sound source spectrogram are input into a predictive coding circular convolution neural network in a trained predictive coding circular convolution neural network model to obtain visual sound fusion features.
Optionally, inputting the first aliased multi-source spectrogram and the visual guidance feature into a predictive coding cyclic convolution neural network to obtain a visual-audio fusion feature, including:
inputting the first aliasing multi-source spectrogram into the top layer of the convolution layer, sequentially acquiring a prediction signal of each convolution layer, and acquiring a neuron response of each convolution layer by using the prediction signal of each convolution layer;
inputting the visual guide characteristics into the bottom layer of the transposed convolutional layer, and sequentially acquiring the neuron response of each layer of the transposed convolutional layer according to the neuron response of the convolutional layer at the same layer;
in the last iteration of the loop, the neuron response of the top layer of the transposed convolutional layer is taken as a visual-acoustic fusion feature.
Specifically, fig. 5 is a computation flowchart of the predictive coding cyclic convolutional neural network model provided by an embodiment of the present invention. As shown in fig. 5, the aliased multi-sound-source spectrogram and the visual guidance features are input into the predictive coding cyclic convolutional neural network and the iteration starts. It is judged whether the iteration number t has reached the maximum iteration number T; if not, the feedback process and the feedforward process are executed in sequence, the iteration number is increased by 1, and the judgment is made again, the feedback and feedforward processes being repeated until the iteration number t reaches the maximum iteration number T. The iteration then stops, the neuron response of the top transposed convolutional layer is passed through the single-layer transposed convolutional layer and the upsampling layer, and the predicted mask map is output. The specific process is as follows:
The aliased multi-source spectrogram is input into the top layer r_L of the convolutional layers, the prediction signal of each convolutional layer is obtained in turn through the feedback process (over the multiple convolutional layers), and the neuron response of each convolutional layer is acquired using its prediction signal. The number of convolutional layers may be set to 7.
The expression of the prediction signal of each convolutional layer is as follows:
p_l(t) = (W_{l+1,l})^T r_{l+1}(t)
where p_l(t) denotes the prediction signal of the l-th convolutional layer at iteration t, l takes the natural numbers from 7 down to 1, W_{l+1,l} denotes the feedback connection from layer l+1 to layer l, and r_{l+1}(t) denotes the neuron response of the (l+1)-th convolutional layer at iteration t.
When l equals 7, r_{l+1}(t) is r_8(t), and the aliased multi-source spectrogram S_mix is used as the neuron response r_8(t).
In the first iteration (t equals 0), the neuron response of each convolutional layer can be acquired from the upper layer to the lower layer in turn according to the prediction signal of each convolutional layer.
In the case where t is equal to 0, the expression of the neuron response of each convolutional layer is as follows:
r_l(0) = LeakyReLU(p_l(0))
where r_l(0) denotes the neuron response of the l-th convolutional layer in the first iteration, l takes the natural numbers from 7 to 1, LeakyReLU(·) denotes the non-saturating activation function, and p_l(0) denotes the prediction signal of the l-th convolutional layer in the first iteration.
In subsequent iterations (t is greater than 0), the neuron response of each convolutional layer can be sequentially obtained from the upper layer to the lower layer according to the prediction signal of each convolutional layer and the neuron response of the transposed convolutional layer in the previous iteration.
In the case where t is greater than 0, the expression of the neuronal response for each convolutional layer is as follows:
r_l(t) = LeakyReLU((1 - b_l) q_l(t-1) + b_l p_l(t))
where r_l(t) denotes the neuron response of the l-th convolutional layer at iteration t, l takes the natural numbers from 7 to 1, LeakyReLU(·) denotes the non-saturating activation function, b_l denotes a learnable parameter used to balance the importance of the different terms, q_l(t-1) denotes the neuron response of the l-th transposed convolutional layer at iteration t-1, and p_l(t) denotes the prediction signal of the l-th convolutional layer at iteration t.
After the feedback process comes the feedforward process (over the multiple transposed convolutional layers). The number of transposed convolutional layers is the same as the number of convolutional layers and is likewise set to 7.
A prediction error is calculated between the neuron response of each transposed convolutional layer and the prediction signal of each convolutional layer.
The expression for the prediction error is as follows:
e_{l-1}(t) = q_{l-1}(t) - p_{l-1}(t)
where e_{l-1}(t) denotes the prediction error at layer l-1 at iteration t, q_{l-1}(t) denotes the neuron response of the (l-1)-th transposed convolutional layer at iteration t, p_{l-1}(t) denotes the prediction signal of the (l-1)-th convolutional layer at iteration t, and l takes the natural numbers from 1 to 7.
When l equals 1, q_{l-1}(t) is q_0(t), and the visual guidance feature f_n is used as the corresponding neuron response q_0(t); p_{l-1}(t) is p_0(t), which can be obtained from the neuron response r_1(t) of the 1st convolutional layer at iteration t.
The expression of p_0(t) is as follows:
p_0(t) = LeakyReLU((W_{1,0})^T r_1(t))
where p_0(t) denotes the prediction signal of the 0th layer at iteration t, LeakyReLU(·) denotes the non-saturating activation function, W_{1,0} denotes the feedback connection from layer 1 to layer 0, and r_1(t) denotes the neuron response of the 1st convolutional layer at iteration t.
According to the prediction error of the layer below and the neuron response of the convolutional layer at the same level, the neuron responses of the transposed convolutional layers are obtained in turn from the lower layers to the upper layers.
The expression of the neuron response of each transposed convolutional layer is as follows:
q_l(t) = LeakyReLU(r_l(t) + a_l (W_{l-1,l})^T e_{l-1}(t))
where q_l(t) denotes the neuron response of the l-th transposed convolutional layer at iteration t, LeakyReLU(·) denotes the non-saturating activation function, r_l(t) denotes the neuron response of the l-th convolutional layer at iteration t, a_l denotes a learnable balance parameter, W_{l-1,l} denotes the feedforward connection from layer l-1 to layer l, and e_{l-1}(t) denotes the prediction error of layer l-1 at iteration t.
The feedback process and the feedforward process are executed cyclically and alternately until the iteration number t reaches the maximum iteration number T, which is usually set to 5.
In the last iteration of the loop, the neuron response q_L(T) of the top transposed convolutional layer in the predictive coding cyclic convolutional neural network is taken as the visual-sound fusion feature.
By circularly and alternately executing the feedback process and the feedforward process, the visual characteristic and the sound characteristic are progressively and effectively fused, and the precision of sound source separation is improved.
The visual-sound fusion feature is input into the single-layer transposed convolutional layer and the upsampling layer, wherein the dimension of the single-layer transposed convolutional layer is 3 × 1 and the scale factor of the upsampling layer is 2, and the predicted mask map M̂_n is output.
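Putting the above equations together, a simplified PyTorch sketch of the predictive coding cyclic convolutional neural network model could look as follows. The channel width, the stride choices, the interpolation that aligns the visual guide feature with the internal resolution, the 1 × 1 projection of the 512-channel ResNet-18 feature, the use of untied feedback/feedforward weights, and the sigmoid on the mask head are all assumptions of this sketch; only the update equations themselves follow the text above:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PredictiveCodingRCNN(nn.Module):
        """Sketch: L convolutional layers (feedback path, top to bottom) and
        L transposed convolutional layers (feedforward path, bottom to top),
        followed by a single-layer transposed convolution and upsampling."""

        def __init__(self, n_layers=7, width=32, visual_dim=512, iterations=5):
            super().__init__()
            self.L, self.T = n_layers, iterations
            self.a = nn.Parameter(torch.full((n_layers + 1,), 0.5))  # balance a_l
            self.b = nn.Parameter(torch.full((n_layers + 1,), 0.5))  # balance b_l
            # feedback connections (W_{l+1,l})^T for l = 0..L; the topmost one reads
            # the 1-channel aliased spectrogram and halves its resolution (assumed).
            self.feedback = nn.ModuleList([
                nn.Conv2d(1 if l == n_layers else width, width, 3,
                          stride=2 if l == n_layers else 1, padding=1)
                for l in range(n_layers + 1)])
            # feedforward connections (W_{l-1,l})^T for l = 1..L.
            self.feedforward = nn.ModuleList(
                [nn.Conv2d(width, width, 3, padding=1) for _ in range(n_layers)])
            # project the ResNet-18 visual guide feature to the working width (assumed).
            self.visual_proj = nn.Conv2d(visual_dim, width, 1)
            # single-layer transposed convolution + upsampling head for the mask map.
            self.head = nn.Sequential(
                nn.ConvTranspose2d(width, 1, 3, padding=1),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Sigmoid())

        def forward(self, s_mix, f_n):
            """s_mix: (B, 1, F, T) aliased spectrogram; f_n: (B, 512, 7, 7) visual feature."""
            L, act = self.L, F.leaky_relu
            p = [None] * (L + 1)   # prediction signals p_0 .. p_L
            r = [None] * (L + 1)   # convolutional-layer responses r_1 .. r_L
            q = [None] * (L + 1)   # transposed-convolutional responses q_0 .. q_L
            for t in range(self.T):
                # feedback process (top -> bottom): p_l(t) = (W_{l+1,l})^T r_{l+1}(t)
                for l in range(L, 0, -1):
                    upper = s_mix if l == L else r[l + 1]
                    p[l] = self.feedback[l](upper)
                    if t == 0:
                        r[l] = act(p[l])                    # r_l(0) = LeakyReLU(p_l(0))
                    else:                                   # r_l(t) from q_l(t-1) and p_l(t)
                        r[l] = act((1 - self.b[l]) * q[l] + self.b[l] * p[l])
                p[0] = act(self.feedback[0](r[1]))          # p_0(t) = LeakyReLU((W_{1,0})^T r_1(t))
                if t == 0:                                  # visual guide feature fixed as q_0
                    q[0] = self.visual_proj(F.interpolate(f_n, size=p[0].shape[-2:]))
                # feedforward process (bottom -> top): e_{l-1} = q_{l-1} - p_{l-1}
                for l in range(1, L + 1):
                    e = q[l - 1] - p[l - 1]
                    q[l] = act(r[l] + self.a[l] * self.feedforward[l - 1](e))
            return self.head(q[L])                          # q_L(T) -> predicted mask map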
The aliased multi-sound-source spectrogram and the visual guidance features are input into the trained predictive coding cyclic convolutional neural network model to obtain the predicted mask map. The spectrogram and the visual guidance features are thus processed within the same network model, so the network model is small in scale, the visual features and the sound features can be fused progressively and effectively, the sound source separation accuracy is improved, and a foundation is laid for subsequently obtaining the separated sound signals.
Step 103, acquiring a separated sound signal according to the first aliased multi-sound-source spectrogram and the first mask map.
Specifically, the separated sound signals may be obtained according to the aliased multi-source spectrogram and a prediction mask map output by the trained prediction coding circular convolution neural network model.
Optionally, acquiring the separated acoustic signal according to the first aliased multi-source acoustic spectrogram and the first mask map comprises:
multiplying the corresponding elements of the first aliasing multi-sound-source sound spectrogram and the first mask map to obtain a second single-sound-source sound spectrogram;
and transforming the second single sound source spectrogram to a time domain by using short-time Fourier inversion transformation to obtain separated sound signals.
Specifically, the aliased multi-source spectrogram is multiplied element-wise with the predicted mask maps M̂_1 and M̂_2, respectively, to obtain the predicted single-source spectrograms Ŝ_1 and Ŝ_2.
Using the Inverse Short-Time Fourier Transform (iSTFT), the single-source spectrograms Ŝ_1 and Ŝ_2 are transformed into the time domain, and the separated sound signal 1 and the separated sound signal 2 are obtained.
The aliased multi-sound-source spectrogram and the predicted mask map are multiplied element by element and the inverse short-time Fourier transform is applied, so that the separated sound signals are obtained and sound source separation is achieved.
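A minimal sketch of this final step, assuming a magnitude mask and re-using the phase of the aliased mixture for the inverse transform (the phase handling is not specified above); the mask is assumed to have the same shape as the mixture spectrogram:

    import numpy as np
    import librosa

    def separate(a_mix, mask, n_fft=1022, hop=256):
        """Apply a predicted mask to the aliased spectrogram and return the
        separated time-domain waveform via the inverse STFT."""
        spec_mix = librosa.stft(a_mix, n_fft=n_fft, hop_length=hop, window="hann")
        spec_sep = np.abs(spec_mix) * mask * np.exp(1j * np.angle(spec_mix))
        return librosa.istft(spec_sep, hop_length=hop, window="hann", length=len(a_mix))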
According to the sound source separation method provided by the invention, the visual guide characteristics and the aliasing multi-sound-source sound spectrogram are input into the trained predictive coding cyclic convolution neural network model to predict the mask map of each sound component, and then the mask map and the aliasing multi-sound-source sound spectrogram are utilized to obtain the separated sound signal, so that the sound spectrogram and the visual guide characteristics are processed in the same network model, the network model is small in scale, the visual characteristics and the sound characteristics can be progressively and effectively fused, and the sound source separation precision is improved.
Fig. 6 is a schematic structural diagram of a sound source separation apparatus according to an embodiment of the present invention, and as shown in fig. 6, the present invention further provides a sound source separation apparatus, including: a first obtaining module 601, a second obtaining module 602, and a third obtaining module 603, wherein:
a first obtaining module 601, configured to obtain a visual guidance feature in a video frame image;
a second obtaining module 602, configured to input the first aliased multi-source spectrogram and the visual guidance feature into a trained predictive coding cyclic convolutional neural network model, so as to obtain a first mask map;
a third obtaining module 603, configured to obtain a separated sound signal according to the first aliased multi-source sound spectrogram and the first mask map.
Specifically, the sound source separation device provided in the embodiment of the present application can implement all the method steps implemented by the above method embodiment, and can achieve the same technical effect, and details of the same parts and beneficial effects as those of the method embodiment in this embodiment are not described herein again.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 7, the electronic device may include: a processor (processor) 710, a communication Interface (Communications Interface) 720, a memory (memory) 730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a sound source separation method comprising: acquiring visual guide features in a video frame image; inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map; and acquiring a separated sound signal according to the first aliasing multi-sound-source sound spectrogram and the first mask map.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the sound source separation method provided by the above methods, the method comprising: acquiring visual guide features in a video frame image; inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map; and acquiring a separated sound signal according to the first aliasing multi-source sound spectrogram and the first mask map.
In still another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, is implemented to perform the sound source separation method provided above, the method including: acquiring visual guide features in a video frame image; inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map; and acquiring a separated sound signal according to the first aliasing multi-source sound spectrogram and the first mask map.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The terms "first," "second," and the like in the embodiments of the present application are used for distinguishing between similar elements and not for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in other sequences than those illustrated or otherwise described herein, and that the terms "first" and "second" used herein generally refer to a class and do not limit the number of objects, for example, a first object can be one or more.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A sound source separation method, comprising:
acquiring visual guide features in a video frame image;
inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map;
acquiring a separated sound signal according to the first aliasing multi-source sound spectrogram and the first mask map;
the predictive coding cyclic convolution neural network model comprises a predictive coding cyclic convolution neural network, a single-layer transposition convolution layer and an up-sampling layer;
feedback connection is adopted among convolution layers in the prediction coding cyclic convolution neural network, feedforward connection is adopted among transposition convolution layers, and cyclic connection is adopted among convolution layers of the same layer and transposition convolution layers; the feedback connection is used to convey a prediction signal, the feedforward connection is used to convey an error signal between the prediction signal and an actual response, and the loop connection is used to convey a neuron response between a convolutional layer and a transposed convolutional layer of the same layer.
2. The sound source separation method according to claim 1, wherein the trained predictive coding circular convolutional neural network model is obtained by:
inputting visual guide features corresponding to the second aliasing multi-sound source spectrogram and the frame image of the single-sound source video into a predictive coding cyclic convolution neural network model, and outputting a second mask map; the second mask map is a predicted mask map;
comparing corresponding elements in the first single sound source spectrogram and the second aliasing multi-sound source spectrogram to obtain a third mask map; the third mask image is a real mask image;
taking the binary cross entropy between the second mask map and the third mask map as a loss function;
and optimizing the value of the loss function to obtain a trained predictive coding cyclic convolution neural network model.
3. The sound source separation method of claim 1, wherein inputting the first aliased multi-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolutional neural network model to obtain a first mask map comprises:
inputting the first aliasing multi-source spectrogram and the visual guidance features into the predictive coding cyclic convolution neural network to obtain visual-acoustic fusion features;
and inputting the visual and sound fusion features into the single-layer transposed convolution layer and the up-sampling layer to obtain a first mask image.
4. The sound source separation method according to claim 3, wherein inputting the first aliased multi-source sound spectrogram and the visual guidance features into the predictive coding cyclic convolution neural network to obtain visual-audio fusion features comprises:
inputting the first aliasing multi-source spectrogram into the top layer of the convolutional layer, sequentially acquiring a prediction signal of each convolutional layer, and acquiring a neuron response of each convolutional layer by using the prediction signal of each convolutional layer;
inputting the visual guide features into the bottom layer of the transposed convolutional layer, and sequentially acquiring the neuron response of each layer of the transposed convolutional layer according to the neuron response of the convolutional layer at the same layer;
in the last iteration of the loop, the neuron response of the top layer of the transposed convolutional layer is taken as a visual-acoustic fusion feature.
5. The sound source separation method according to claim 1, wherein obtaining a separated sound signal from the first aliased multi-source sound spectrogram and the first mask map comprises:
multiplying the first aliasing multi-sound-source sound spectrogram by corresponding elements of the first mask image to obtain a second single-sound-source sound spectrogram;
and transforming the second single-sound-source spectrogram into the time domain by using the inverse short-time Fourier transform to obtain the separated sound signals.
6. The sound source separation method according to claim 2, wherein the second aliased multi-source spectrogram is obtained by:
carrying out sound sampling on different single sound source video data to obtain different single sound source sound signals;
linearly overlapping the different single-sound-source sound signals to obtain aliasing multi-sound-source sound signals;
transforming the aliased multi-source sound signal into the second aliased multi-source spectrogram using a short-time Fourier transform.
7. A sound source separation apparatus, comprising:
the first acquisition module is used for acquiring visual guide features in the video frame images;
the second acquisition module is used for inputting the first aliasing multi-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to acquire a first mask map;
a third obtaining module, configured to obtain a separated sound signal according to the first aliased multi-source spectrogram and the first mask map;
the predictive coding cyclic convolution neural network model comprises a predictive coding cyclic convolution neural network, a single-layer transposition convolution layer and an upper sampling layer;
feedback connection is adopted among convolution layers in the prediction coding cyclic convolution neural network, feedforward connection is adopted among transposition convolution layers, and cyclic connection is adopted among convolution layers of the same layer and transposition convolution layers; the feedback connection is used to convey a prediction signal, the feedforward connection is used to convey an error signal between the prediction signal and an actual response, and the loop connection is used to convey a neuron response between a convolutional layer and a transposed convolutional layer of the same layer.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the sound source separation method according to any of claims 1 to 6 when executing the computer program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the sound source separation method according to any one of claims 1 to 6.
CN202210073239.6A 2022-01-21 2022-01-21 Sound source separation method and device Active CN114596876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210073239.6A CN114596876B (en) 2022-01-21 2022-01-21 Sound source separation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210073239.6A CN114596876B (en) 2022-01-21 2022-01-21 Sound source separation method and device

Publications (2)

Publication Number Publication Date
CN114596876A CN114596876A (en) 2022-06-07
CN114596876B (en) 2023-04-07

Family

ID=81806801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210073239.6A Active CN114596876B (en) 2022-01-21 2022-01-21 Sound source separation method and device

Country Status (1)

Country Link
CN (1) CN114596876B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118191734A (en) * 2024-05-16 2024-06-14 杭州爱华仪器有限公司 Multi-sound source positioning method, device, program, storage medium and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086773B (en) * 2018-08-29 2022-03-04 电子科技大学 Fault plane identification method based on full convolution neural network
US11501532B2 (en) * 2019-04-25 2022-11-15 International Business Machines Corporation Audiovisual source separation and localization using generative adversarial networks
CN110970056B (en) * 2019-11-18 2022-03-11 清华大学 Method for separating sound source from video
US11610599B2 (en) * 2019-12-06 2023-03-21 Meta Platforms Technologies, Llc Systems and methods for visually guided audio separation
CN111539449B (en) * 2020-03-23 2023-08-18 广东省智能制造研究所 Sound source separation and positioning method based on second-order fusion attention network model
CN112132158A (en) * 2020-09-04 2020-12-25 华东师范大学 Visual picture information embedding method based on self-coding network
CN112712819B (en) * 2020-12-23 2022-07-26 电子科技大学 Visual auxiliary cross-modal audio signal separation method
CN113255837A (en) * 2021-06-29 2021-08-13 南昌工程学院 Improved CenterNet network-based target detection method in industrial environment
CN113593601A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Audio-visual multi-modal voice separation method based on deep learning
CN113850246B (en) * 2021-11-30 2022-02-18 杭州一知智能科技有限公司 Method and system for sound source positioning and sound source separation based on dual coherent network

Also Published As

Publication number Publication date
CN114596876A (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN109891434B (en) Generating audio using neural networks
US11017761B2 (en) Parallel neural text-to-speech
Luo et al. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation
KR102392094B1 (en) Sequence processing using convolutional neural networks
CN110335584A (en) Neural network generates modeling to convert sound pronunciation and enhancing training data
Agrawal et al. Modulation filter learning using deep variational networks for robust speech recognition
KR101807961B1 (en) Method and apparatus for processing speech signal based on lstm and dnn
WO2020039571A1 (en) Voice separation device, voice separation method, voice separation program, and voice separation system
CN107452369A (en) Phonetic synthesis model generating method and device
CN112562634A (en) Multi-style audio synthesis method, device, equipment and storage medium
CN111341294B (en) Method for converting text into voice with specified style
Abouzid et al. Signal speech reconstruction and noise removal using convolutional denoising audioencoders with neural deep learning
CN111128211B (en) Voice separation method and device
CN113380262B (en) Sound separation method based on attention mechanism and disturbance perception
CN114596876B (en) Sound source separation method and device
JP6099032B2 (en) Signal processing apparatus, signal processing method, and computer program
CN116994564A (en) Voice data processing method and processing device
EP3507993B1 (en) Source separation for reverberant environment
CN116013343A (en) Speech enhancement method, electronic device and storage medium
CN115295002A (en) Single-channel speech enhancement method based on interactive time-frequency attention mechanism
JP6167063B2 (en) Utterance rhythm transformation matrix generation device, utterance rhythm transformation device, utterance rhythm transformation matrix generation method, and program thereof
CN111798859B (en) Data processing method, device, computer equipment and storage medium
CN113744753B (en) Multi-person voice separation method and training method of voice separation model
JP7472575B2 (en) Processing method, processing device, and program
WO2018044801A1 (en) Source separation for reverberant environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant