CN114596876B - Sound source separation method and device - Google Patents

Sound source separation method and device

Info

Publication number
CN114596876B
Authority
CN
China
Prior art keywords
sound
source
spectrogram
layer
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210073239.6A
Other languages
Chinese (zh)
Other versions
CN114596876A (en)
Inventor
Zhaoxiang Zhang (张兆翔)
Tieniu Tan (谭铁牛)
Zengjie Song (宋增杰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210073239.6A priority Critical patent/CN114596876B/en
Publication of CN114596876A publication Critical patent/CN114596876A/en
Application granted granted Critical
Publication of CN114596876B publication Critical patent/CN114596876B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G10L 21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a sound source separation method and device. The method comprises: acquiring visual guidance features from video frame images; inputting a first aliased multi-sound-source spectrogram and the visual guidance features into a trained predictive coding cyclic convolutional neural network model to obtain a first mask map; and acquiring the separated sound signal according to the first aliased multi-sound-source spectrogram and the first mask map. The visual guidance features and the aliased multi-sound-source spectrogram are input into the trained predictive coding cyclic convolutional neural network model to predict the mask map of each sound component, and the separated sound signals are then acquired using the mask maps and the aliased spectrogram. The spectrogram and the visual guidance features are therefore processed within the same network model, so the network model is small in scale, the visual and sound features can be fused progressively and effectively, and the sound source separation accuracy is improved.

Description

Sound source separation method and device
Technical Field
The invention relates to the technical field of computer vision and audio signal separation, in particular to a sound source separation method and a sound source separation device.
Background
Visually guided sound source separation is an important and challenging classical visual-sound multi-modal task, with wide applications in fields such as speaker recognition in videos, speech enhancement, audio denoising and intelligent video editing.
Because images and sound signals are inconsistent in data format and data characteristics, the prior art designs different network models to process the image and the sound signal separately, and in addition uses a further model to fuse the features of the different modalities. As a result, the overall model is large in scale and difficult to deploy, the feature fusion acts only on the high-level layers of the network model, and the accuracy of the separated sound sources is not high.
Therefore, there is an urgent need for a sound source separation method with a small model scale and high sound source separation accuracy.
Disclosure of Invention
The invention provides a sound source separation method and a sound source separation device, which are used for solving the defects of large network model scale and low sound source separation precision in the prior art.
The invention provides a sound source separation method, which comprises the following steps:
acquiring visual guide features in a video frame image;
inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map;
and acquiring a separated sound signal according to the first aliasing multi-source sound spectrogram and the first mask map.
Optionally, the trained predictive coding cyclic convolutional neural network model is obtained by:
inputting the visual guide features corresponding to the second aliasing multi-sound-source sound spectrogram and the frame image of the single-sound-source video into a predictive coding cyclic convolution neural network model, and outputting a second mask image; the second mask map is a predicted mask map;
comparing corresponding elements in the first single sound source spectrogram and the second aliasing multi-sound source spectrogram to obtain a third mask map; the third mask image is a real mask image;
taking the binary cross entropy between the second mask map and the third mask map as a loss function;
and optimizing the value of the loss function to obtain a trained predictive coding cyclic convolution neural network model.
Optionally, the prediction coding cyclic convolutional neural network model comprises a prediction coding cyclic convolutional neural network, a single-layer transposed convolutional layer and an upsampling layer;
feedback connections are adopted between the convolutional layers in the predictive coding cyclic convolutional neural network, feedforward connections are adopted between the transposed convolutional layers, and cyclic connections are adopted between the convolutional layer and the transposed convolutional layer at the same level.
Optionally, inputting the first aliased multi-source spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map, where the method includes:
inputting the first aliasing multi-source spectrogram and the visual guidance features into the predictive coding cyclic convolution neural network to obtain visual-acoustic fusion features;
and inputting the visual and sound fusion features into the single-layer transposed convolution layer and the up-sampling layer to obtain a first mask image.
Optionally, inputting the first aliased multi-source spectrogram and the visual guidance feature into the predictive coding cyclic convolution neural network to obtain a visual-audio fusion feature, including:
inputting the first aliasing multi-source spectrogram into the top layer of the convolutional layer, sequentially acquiring a prediction signal of each convolutional layer, and acquiring a neuron response of each convolutional layer by using the prediction signal of each convolutional layer;
inputting the visual guide features into the bottom layer of the transposed convolutional layer, and sequentially acquiring the neuron response of each layer of the transposed convolutional layer according to the neuron response of the convolutional layer at the same layer;
in the last iteration of the loop, the neuron response of the top layer of the transposed convolutional layer is taken as a visual-acoustic fusion feature.
Optionally, acquiring a separated sound signal according to the first aliased multi-source sound spectrogram and the first mask map, comprises:
multiplying the first aliasing multi-sound-source sound spectrogram by corresponding elements of the first mask image to obtain a second single-sound-source sound spectrogram;
and transforming the second single-sound-source spectrogram to the time domain by using the inverse short-time Fourier transform to obtain the separated sound signals.
Optionally, the second aliased multi-source spectrogram is obtained by:
carrying out sound sampling on different single-sound-source video data to obtain different single-sound-source sound signals;
linearly overlapping the different single-sound-source sound signals to obtain aliasing multi-sound-source sound signals;
transforming the aliased multi-source sound signal into the second aliased multi-source spectrogram using a short-time Fourier transform.
The present invention also provides a sound source separating apparatus, comprising:
the first acquisition module is used for acquiring visual guide features in the video frame images;
the second acquisition module is used for inputting the first aliasing multi-sound source spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to acquire a first mask map;
a third obtaining module, configured to obtain the separated sound signal according to the first aliased multi-source sound spectrogram and the first mask map.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the sound source separation method as described in any one of the above when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the sound source separation method as described in any one of the above.
According to the sound source separation method and device provided by the invention, the visual guide characteristics and the aliasing multi-sound-source sound spectrogram are input into the trained predictive coding cyclic convolution neural network model to predict the mask map of each sound component, and then the mask map and the aliasing multi-sound-source sound spectrogram are used for acquiring separated sound signals, so that the sound spectrogram and the visual guide characteristics are processed in the same network model, the network model is small in scale, the visual characteristics and the sound characteristics can be progressively and effectively fused, and the sound source separation precision is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a sound source separation method provided by an embodiment of the present invention;
fig. 2 is an overall frame schematic diagram of a sound source separation method provided by an embodiment of the present invention;
FIG. 3 is a flow chart of the training and application of the predictive coding circular convolution neural network model provided by the embodiment of the invention;
FIG. 4 is a schematic structural diagram of a predictive coding cyclic convolutional neural network model provided by an embodiment of the present invention;
FIG. 5 is a flowchart of a computation of a predictive coding circular convolutional neural network model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a sound source separation apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a sound source separation method according to an embodiment of the present invention, and as shown in fig. 1, the present invention provides a sound source separation method, including:
step 101, acquiring visual guide features in a video frame image.
Specifically, fig. 2 is a schematic overall framework diagram of a sound source separation method according to an embodiment of the present invention. As shown in fig. 2, the video frame image is divided into a video frame image v_1 and a video frame image v_2, which are respectively input into the visual feature extraction model to obtain a visual guide feature f_1 and a visual guide feature f_2.
The video frame images may be obtained as follows: M_1 frames of video images are extracted from a given video, each frame image is cut into two images such that each image contains only the pixels of a single sounding object, and the images are normalized to obtain two groups of frame images v_1 and v_2 of size M_1 × H_1 × W_1 × 3, where H_1 denotes the image height, W_1 denotes the image width, and 3 denotes the red, green and blue color channels.
For example, 3 video frames are first extracted from a given duet video, each frame is then cut into two images, each containing only the pixels of a single sounding object, and the images are normalized to obtain two groups of frame images of size 3 × 224 × 224 × 3.
A pre-trained Residual Neural Network (ResNet-18) is taken as the visual feature extraction model to extract the apparent semantic features of the video frame images, and the output of the last convolutional layer of ResNet-18 is taken as the visual guide feature.
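As an illustrative sketch (not part of the patent text), this visual feature extraction step could be implemented with a pre-trained ResNet-18 in PyTorch as follows; the 224 × 224 preprocessing follows the example above, while the max-pooling over the M frames is an assumption of the sketch, since the patent only specifies that the output of the last convolutional layer is used:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    # Pre-trained ResNet-18; drop the global pooling and fully connected head so
    # that the output of the last convolutional stage is the visual guide feature.
    resnet18 = models.resnet18(pretrained=True)
    feature_extractor = torch.nn.Sequential(*list(resnet18.children())[:-2]).eval()

    preprocess = T.Compose([
        T.Resize((224, 224)),          # normalize each cropped single-object frame
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def visual_guide_feature(frames):
        """frames: list of M PIL images, each containing a single sounding object."""
        batch = torch.stack([preprocess(f) for f in frames])   # (M, 3, 224, 224)
        feats = feature_extractor(batch)                       # (M, 512, 7, 7)
        # Pooling the frame dimension into one feature map is an assumption here.
        return feats.max(dim=0).values                         # (512, 7, 7)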
Step 102, inputting the first aliased multi-sound-source spectrogram and the visual guidance features into the trained predictive coding cyclic convolutional neural network model to obtain a first mask map.
In particular, the visual guide feature f_1 and the aliased multi-sound-source spectrogram are input into the trained predictive coding cyclic convolutional neural network model to obtain a predicted mask map M̂_1. Similarly, the visual guide feature f_2 and the aliased multi-sound-source spectrogram are input into the trained predictive coding cyclic convolutional neural network model to obtain a predicted mask map M̂_2.
The aliased multi-source spectrogram may be obtained as follows: an aliased multi-source sound signal of fixed length G_1 seconds is extracted from a given video at a sampling rate K_1, and the aliased multi-source sound signal is then converted into an aliased multi-source spectrogram by the Short-Time Fourier Transform (STFT).
For example, a 6-second aliased multi-source sound signal is extracted from a given duet video at a sampling rate of 11 kHz, and the sampled aliased multi-source sound signal is converted into an aliased multi-source spectrogram by the STFT.
Before the first aliased multi-sound-source spectrogram and the visual guidance features are input into the trained predictive coding cyclic convolutional neural network model to obtain the first mask map, the trained predictive coding cyclic convolutional neural network model needs to be obtained.
Optionally, the trained predictive coding cyclic convolutional neural network model is obtained by:
inputting the visual guide features corresponding to the second aliasing multi-sound-source sound spectrogram and the frame image of the single-sound-source video into a predictive coding cyclic convolution neural network model, and outputting a second mask image; the second mask is a predicted mask;
comparing corresponding elements in the first single sound source spectrogram and the second aliasing multi-sound source spectrogram to obtain a third mask map; the third mask image is a real mask image;
taking the binary cross entropy between the second mask image and the third mask image as a loss function;
and optimizing the value of the loss function to obtain a trained predictive coding circular convolution neural network model.
Specifically, fig. 3 is a flowchart of the training and application of the predictive coding cyclic convolutional neural network model provided by an embodiment of the present invention. As shown in fig. 3, the process is divided into a model training stage and a model application stage.
In the model training stage, sampling is carried out from single sound source video data to obtain single sound source video frame images, and visual guide features are extracted from the single sound source video frame images by using a visual feature extraction model.
An aliased multi-source sound signal is constructed using the single-sound-source video data, and the single-source sound signals and the aliased multi-source sound signal are respectively converted into single-source spectrograms and an aliased multi-source spectrogram using the STFT (short-time Fourier transform).
Inputting the aliasing multi-sound source sound spectrogram and the visual guidance features into a predictive coding cyclic convolution neural network model, and outputting a predicted mask map.
And comparing corresponding elements in the single sound source spectrogram and the aliasing multi-sound source spectrogram to obtain a real mask image.
And (3) taking the binary cross entropy between the predicted mask image and the real mask image output by the model as a loss function, and optimizing the value of the loss function by using an algorithm to obtain a trained predictive coding cyclic convolution neural network model.
Image sampling is performed on N segments of different single-sound-source video data: M_2 frames of video images are extracted from each single-sound-source video at a sampling rate K_2 and normalized, yielding N groups of frame images of size M_2 × H_2 × W_2 × 3, where H_2 denotes the image height, W_2 denotes the image width, and 3 denotes the red, green and blue color channels. The N groups of frame images are denoted v_n, where n ranges over the natural numbers from 1 to N.
The frame images of the N groups of single-sound-source videos are respectively input into the pre-trained ResNet-18 model to acquire the visual guide features corresponding to the frame images of the N groups of single-sound-source videos, denoted f_n, where n ranges over the natural numbers from 1 to N.
Optionally, the second aliased multi-source spectrogram is obtained by:
carrying out sound sampling on different single-sound-source video data to obtain different single-sound-source sound signals;
carrying out linear superposition on different single-sound-source sound signals to obtain aliasing multi-sound-source sound signals;
the aliased multi-source sound signal is transformed into a second aliased multi-source sound spectrogram using a short-time fourier transform.
Specifically, sound sampling is performed on the N segments of different single-sound-source video data: a single-source sound signal of fixed length G_2 seconds is extracted from each at a sampling rate K_3, denoted a_n, where n ranges over the natural numbers from 1 to N.
The acquired single-sound-source sound signals are linearly superposed, exploiting the approximate linear additivity of sound signals, to obtain an aliased multi-sound-source sound signal.
The expression of the aliased multi-source sound signal is as follows:
a_mix = Σ_{n=1}^{N} a_n
where a_mix denotes the aliased multi-source sound signal, a_n denotes the n-th single-source sound signal, and N denotes the total number of single-source sound signal segments.
The single-source sound signals a_n and the aliased multi-source sound signal a_mix are converted, using the STFT, into the single-source spectrograms S_n and the aliased multi-source spectrogram S_mix, respectively. The STFT uses a Hanning window of width 1022 samples, and the interval between adjacent Hanning windows is 256 samples.
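As an illustrative sketch, this data-synthesis step could be realized as follows; librosa is an assumed library choice, and the 11 kHz rate, 6-second clips, 1022-sample Hanning window and 256-sample hop follow the example values given above:

    import numpy as np
    import librosa

    SR = 11000        # sampling rate (11 kHz in the example above)
    CLIP_SECONDS = 6  # fixed clip length (6 s in the example above)
    N_FFT = 1022      # Hanning window width used for the STFT
    HOP = 256         # interval between adjacent Hanning windows

    def magnitude_spectrogram(audio):
        return np.abs(librosa.stft(audio, n_fft=N_FFT, hop_length=HOP, window="hann"))

    def synthesize_training_pair(paths):
        """Load N single-source clips, superpose them linearly (a_mix = sum_n a_n),
        and return the single-source spectrograms S_n and the aliased spectrogram S_mix."""
        signals = [librosa.load(p, sr=SR, duration=CLIP_SECONDS, mono=True)[0]
                   for p in paths]
        a_mix = np.sum(signals, axis=0)                 # linear superposition
        S_n = [magnitude_spectrogram(a) for a in signals]
        S_mix = magnitude_spectrogram(a_mix)
        return S_n, S_mix, a_mix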
The aliased multi-sound-source sound signals are synthesized from single-sound-source video data and then converted into the training data of the model. This data synthesis strategy can make full use of the massive video data available on the Internet, obtains aliased multi-source sound signals whose sound components are known without manual annotation, and thus conveniently enables self-supervised learning.
After the visual guide features f_n and the aliased multi-source spectrogram S_mix are acquired, f_n and S_mix are input into the predictive coding cyclic convolutional neural network model.
Optionally, the prediction coding cyclic convolutional neural network model includes a prediction coding cyclic convolutional neural network, a single-layer transposed convolutional layer, and an upsampling layer;
the convolution layers in the prediction coding cyclic convolution neural network are connected in a feedback mode, the transposition convolution layers are connected in a feedforward mode, and the convolution layers in the same layer and the transposition convolution layers are connected in a cyclic mode.
Specifically, fig. 4 is a schematic structural diagram of a prediction coding cyclic convolutional neural network model provided in an embodiment of the present invention, and as shown in fig. 4, the prediction coding cyclic convolutional neural network model includes a prediction coding cyclic convolutional neural network, a single-layer transposed convolutional layer, and an upsampling layer.
The predictive coding cyclic convolutional neural network comprises a plurality of convolutional layers and a plurality of transposed convolutional layers. Feedback connections are adopted from the top convolutional layer down to the bottom convolutional layer and are used to transmit the prediction signals; feedforward connections are adopted from the bottom transposed convolutional layer up to the top transposed convolutional layer and are used to transmit the error signals between the prediction signals and the actual responses; cyclic connections are adopted between the convolutional layer and the transposed convolutional layer at the same level and are used to transmit the neuron responses between them.
By enabling the predictive coding cyclic convolution neural network to have feedforward, feedback and cyclic connection, the progressive effective fusion of visual characteristics and sound characteristics can be realized, and the precision of sound source separation is improved.
f_n and S_mix are input into the predictive coding cyclic convolutional neural network within the model. After several iteration steps, the neuron response of the top transposed convolutional layer is taken as the visual-sound fusion feature, which is then input into the single-layer transposed convolutional layer and the upsampling layer of the model to obtain the mask map M̂_n.
The mask map M̂_n is the predicted mask map output by the predictive coding cyclic convolutional neural network model.
The corresponding elements of the single-source spectrogram S_n and the aliased multi-source spectrogram S_mix are compared: if the value of an element in S_n is greater than the value of the corresponding element in S_mix, that position is assigned 1, and the remaining positions are assigned 0, yielding the mask map M_n. This mask map is the real mask map of the sound component.
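A minimal sketch of this element-wise comparison, assuming the spectrograms are stored as NumPy magnitude arrays of equal shape:

    import numpy as np

    def real_mask(S_n, S_mix):
        """Real (ground-truth) mask of one sound component: 1 where the single-source
        spectrogram element exceeds the corresponding aliased-spectrogram element."""
        return (S_n > S_mix).astype(np.float32)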
The Binary Cross Entropy (BCE) between the predicted mask map M̂_n and the real mask map M_n is taken as the loss function.
The expression of the loss function is as follows:
L = (1/N) Σ_{n=1}^{N} BCE(M_n, M̂_n)
where L denotes the value of the loss function, N denotes the total number of single-source sound signal segments, BCE(·) denotes the binary cross-entropy function, M_n denotes the real mask map of the n-th sound component, and M̂_n denotes the corresponding predicted mask map output by the predictive coding cyclic convolutional neural network model.
The value of the loss function is reduced by means of the error back-propagation (BP) algorithm and the stochastic gradient descent (SGD) algorithm to train the predictive coding cyclic convolutional neural network model; after repeated iterative training, the trained predictive coding cyclic convolutional neural network model is obtained.
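A minimal PyTorch training-step sketch under these definitions; here `model` stands for the predictive coding cyclic convolutional neural network model (a sketch of which follows the architecture description further below), the predicted and real mask maps are assumed to share the same shape, and the SGD hyper-parameters are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, S_mix, visual_feats, real_masks):
        """One training step. `visual_feats` and `real_masks` are lists with one
        entry per sound component of the aliased mixture S_mix."""
        optimizer.zero_grad()
        loss = 0.0
        for f_n, M_n in zip(visual_feats, real_masks):
            M_hat = model(S_mix, f_n)                      # predicted mask map
            loss = loss + F.binary_cross_entropy(M_hat, M_n)
        loss = loss / len(real_masks)                      # average BCE over the N components
        loss.backward()                                    # error back-propagation (BP)
        optimizer.step()                                   # stochastic gradient descent (SGD)
        return loss.item()

    # Example optimizer; learning rate and momentum are illustrative assumptions:
    # optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)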
The trained predictive coding cyclic convolutional neural network model is thus obtained by first acquiring the predicted mask map and the real mask map, then taking the binary cross entropy between them as the loss function and optimizing its value. In this way the spectrogram and the visual guide features are processed within the same network model, the network model is small in scale, and a foundation is laid for obtaining the mask map with the trained model.
In the model application stage, a trained predictive coding cyclic convolution neural network model is used for outputting a predicted mask map, then a predicted single sound source spectrogram is obtained according to the aliasing multi-sound source spectrogram and the predicted mask map, and finally the single sound source spectrogram is subjected to short-time inverse Fourier transform to obtain separated sound signals.
Optionally, inputting the first aliased multi-source spectrogram and the visual guidance feature into a trained predictive coding cyclic convolution neural network model to obtain a first mask map, where the method includes:
inputting the first aliasing multi-source spectrogram and the visual guidance features into a predictive coding cyclic convolution neural network to obtain visual-acoustic fusion features;
and inputting the visual and sound fusion characteristics into the single-layer transposed convolution layer and the upper sampling layer to obtain a first mask image.
Specifically, the visual guidance features and the aliasing multi-sound source spectrogram are input into a predictive coding circular convolution neural network in a trained predictive coding circular convolution neural network model to obtain visual sound fusion features.
Optionally, inputting the first aliased multi-source spectrogram and the visual guidance feature into a predictive coding cyclic convolution neural network to obtain a visual-audio fusion feature, including:
inputting the first aliasing multi-source spectrogram into the top layer of the convolution layer, sequentially acquiring a prediction signal of each convolution layer, and acquiring a neuron response of each convolution layer by using the prediction signal of each convolution layer;
inputting the visual guide characteristics into the bottom layer of the transposed convolutional layer, and sequentially acquiring the neuron response of each layer of the transposed convolutional layer according to the neuron response of the convolutional layer at the same layer;
in the last iteration of the loop, the neuron response of the top layer of the transposed convolutional layer is taken as a visual-acoustic fusion feature.
Specifically, fig. 5 is a computation flowchart of the predictive coding cyclic convolutional neural network model provided by an embodiment of the present invention. As shown in fig. 5, the aliased multi-sound-source spectrogram and the visual guidance features are input into the predictive coding cyclic convolutional neural network and the iteration starts. It is judged whether the iteration number t has reached the maximum iteration number T; if not, the feedback process and the feedforward process are executed in sequence, the iteration number is increased by 1, and the judgment is made again, the feedback and feedforward processes being repeated until the iteration number t reaches the maximum iteration number T. The iteration then stops, the neuron response of the top transposed convolutional layer is passed through the single-layer transposed convolutional layer and the upsampling layer, and the predicted mask map is output. The specific process is as follows:
The aliased multi-source spectrogram is input into the top layer r_L of the convolutional layers, the prediction signal of each convolutional layer is obtained in turn through the feedback process (over the multiple convolutional layers), and the neuron response of each convolutional layer is acquired using its prediction signal. The number of convolutional layers may be set to 7.
The expression of the prediction signal of each convolutional layer is as follows:
p_l(t) = (W_{l+1,l})^T r_{l+1}(t)
where p_l(t) denotes the prediction signal of the l-th convolutional layer at iteration t, l takes the natural numbers from 7 down to 1, W_{l+1,l} denotes the feedback connection from layer l+1 to layer l, and r_{l+1}(t) denotes the neuron response of the (l+1)-th convolutional layer at iteration t.
When l equals 7, r_{l+1}(t) is r_8(t), and the aliased multi-source spectrogram S_mix is used as the neuron response r_8(t).
In the first iteration (t equals 0), the neuron response of each convolutional layer can be acquired from the upper layer to the lower layer in turn according to the prediction signal of each convolutional layer.
In the case where t is equal to 0, the expression of the neuron response of each convolutional layer is as follows:
r_l(0) = LeakyReLU(p_l(0))
where r_l(0) denotes the neuron response of the l-th convolutional layer in the first iteration, l takes the natural numbers from 7 to 1, LeakyReLU(·) denotes the non-saturating activation function, and p_l(0) denotes the prediction signal of the l-th convolutional layer in the first iteration.
In subsequent iterations (t is greater than 0), the neuron response of each convolutional layer can be sequentially obtained from the upper layer to the lower layer according to the prediction signal of each convolutional layer and the neuron response of the transposed convolutional layer in the previous iteration.
In the case where t is greater than 0, the expression of the neuronal response for each convolutional layer is as follows:
r_l(t) = LeakyReLU((1 - b_l) q_l(t-1) + b_l p_l(t))
where r_l(t) denotes the neuron response of the l-th convolutional layer at iteration t, l takes the natural numbers from 7 to 1, LeakyReLU(·) denotes the non-saturating activation function, b_l denotes a learnable parameter used to balance the importance of the different terms, q_l(t-1) denotes the neuron response of the l-th transposed convolutional layer at iteration t-1, and p_l(t) denotes the prediction signal of the l-th convolutional layer at iteration t.
After the feedback process comes the feedforward process (over the multiple transposed convolutional layers). The number of transposed convolutional layers is the same as the number of convolutional layers and is likewise set to 7.
A prediction error is calculated between the neuron response of each transposed convolutional layer and the prediction signal of each convolutional layer.
The expression for the prediction error is as follows:
e_{l-1}(t) = q_{l-1}(t) - p_{l-1}(t)
where e_{l-1}(t) denotes the prediction error at layer l-1 at iteration t, q_{l-1}(t) denotes the neuron response of the (l-1)-th transposed convolutional layer at iteration t, p_{l-1}(t) denotes the prediction signal of the (l-1)-th convolutional layer at iteration t, and l takes the natural numbers from 1 to 7.
When l equals 1, q_{l-1}(t) is q_0(t), and the visual guidance feature f_n is used as the corresponding neuron response q_0(t); p_{l-1}(t) is p_0(t), which can be obtained from the neuron response r_1(t) of the 1st convolutional layer at iteration t.
The expression of p_0(t) is as follows:
p_0(t) = LeakyReLU((W_{1,0})^T r_1(t))
where p_0(t) denotes the prediction signal of the 0th layer at iteration t, LeakyReLU(·) denotes the non-saturating activation function, W_{1,0} denotes the feedback connection from layer 1 to layer 0, and r_1(t) denotes the neuron response of the 1st convolutional layer at iteration t.
According to the prediction error of the layer below and the neuron response of the convolutional layer at the same level, the neuron responses of the transposed convolutional layers are obtained in turn from the lower layers to the upper layers.
The expression of the neuron response of each transposed convolutional layer is as follows:
q_l(t) = LeakyReLU(r_l(t) + a_l (W_{l-1,l})^T e_{l-1}(t))
where q_l(t) denotes the neuron response of the l-th transposed convolutional layer at iteration t, LeakyReLU(·) denotes the non-saturating activation function, r_l(t) denotes the neuron response of the l-th convolutional layer at iteration t, a_l denotes a learnable balance parameter, W_{l-1,l} denotes the feedforward connection from layer l-1 to layer l, and e_{l-1}(t) denotes the prediction error of layer l-1 at iteration t.
The feedback process and the feedforward process are executed cyclically and alternately until the iteration number t reaches the maximum iteration number T, which is usually set to 5.
In the last iteration of the loop, the neuron response q_L(T) of the top transposed convolutional layer in the predictive coding cyclic convolutional neural network is taken as the visual-sound fusion feature.
By circularly and alternately executing the feedback process and the feedforward process, the visual characteristic and the sound characteristic are progressively and effectively fused, and the precision of sound source separation is improved.
The visual-sound fusion feature is input into the single-layer transposed convolutional layer and the upsampling layer, wherein the dimension of the single-layer transposed convolutional layer is 3 × 1 and the scale factor of the upsampling layer is 2, and the predicted mask map M̂_n is output.
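Putting the above equations together, a simplified PyTorch sketch of the predictive coding cyclic convolutional neural network model could look as follows. The channel width, the stride choices, the interpolation that aligns the visual guide feature with the internal resolution, the 1 × 1 projection of the 512-channel ResNet-18 feature, the use of untied feedback/feedforward weights, and the sigmoid on the mask head are all assumptions of this sketch; only the update equations themselves follow the text above:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PredictiveCodingRCNN(nn.Module):
        """Sketch: L convolutional layers (feedback path, top to bottom) and
        L transposed convolutional layers (feedforward path, bottom to top),
        followed by a single-layer transposed convolution and upsampling."""

        def __init__(self, n_layers=7, width=32, visual_dim=512, iterations=5):
            super().__init__()
            self.L, self.T = n_layers, iterations
            self.a = nn.Parameter(torch.full((n_layers + 1,), 0.5))  # balance a_l
            self.b = nn.Parameter(torch.full((n_layers + 1,), 0.5))  # balance b_l
            # feedback connections (W_{l+1,l})^T for l = 0..L; the topmost one reads
            # the 1-channel aliased spectrogram and halves its resolution (assumed).
            self.feedback = nn.ModuleList([
                nn.Conv2d(1 if l == n_layers else width, width, 3,
                          stride=2 if l == n_layers else 1, padding=1)
                for l in range(n_layers + 1)])
            # feedforward connections (W_{l-1,l})^T for l = 1..L.
            self.feedforward = nn.ModuleList(
                [nn.Conv2d(width, width, 3, padding=1) for _ in range(n_layers)])
            # project the ResNet-18 visual guide feature to the working width (assumed).
            self.visual_proj = nn.Conv2d(visual_dim, width, 1)
            # single-layer transposed convolution + upsampling head for the mask map.
            self.head = nn.Sequential(
                nn.ConvTranspose2d(width, 1, 3, padding=1),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Sigmoid())

        def forward(self, s_mix, f_n):
            """s_mix: (B, 1, F, T) aliased spectrogram; f_n: (B, 512, 7, 7) visual feature."""
            L, act = self.L, F.leaky_relu
            p = [None] * (L + 1)   # prediction signals p_0 .. p_L
            r = [None] * (L + 1)   # convolutional-layer responses r_1 .. r_L
            q = [None] * (L + 1)   # transposed-convolutional responses q_0 .. q_L
            for t in range(self.T):
                # feedback process (top -> bottom): p_l(t) = (W_{l+1,l})^T r_{l+1}(t)
                for l in range(L, 0, -1):
                    upper = s_mix if l == L else r[l + 1]
                    p[l] = self.feedback[l](upper)
                    if t == 0:
                        r[l] = act(p[l])                    # r_l(0) = LeakyReLU(p_l(0))
                    else:                                   # r_l(t) from q_l(t-1) and p_l(t)
                        r[l] = act((1 - self.b[l]) * q[l] + self.b[l] * p[l])
                p[0] = act(self.feedback[0](r[1]))          # p_0(t) = LeakyReLU((W_{1,0})^T r_1(t))
                if t == 0:                                  # visual guide feature fixed as q_0
                    q[0] = self.visual_proj(F.interpolate(f_n, size=p[0].shape[-2:]))
                # feedforward process (bottom -> top): e_{l-1} = q_{l-1} - p_{l-1}
                for l in range(1, L + 1):
                    e = q[l - 1] - p[l - 1]
                    q[l] = act(r[l] + self.a[l] * self.feedforward[l - 1](e))
            return self.head(q[L])                          # q_L(T) -> predicted mask map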
The aliased multi-sound-source spectrogram and the visual guidance features are input into the trained predictive coding cyclic convolutional neural network model to obtain the predicted mask map. The spectrogram and the visual guidance features are thus processed within the same network model, so the network model is small in scale, the visual features and the sound features can be fused progressively and effectively, the sound source separation accuracy is improved, and a foundation is laid for subsequently obtaining the separated sound signals.
Step 103, acquiring a separated sound signal according to the first aliased multi-sound-source spectrogram and the first mask map.
Specifically, the separated sound signals may be obtained according to the aliased multi-source spectrogram and a prediction mask map output by the trained prediction coding circular convolution neural network model.
Optionally, acquiring the separated acoustic signal according to the first aliased multi-source acoustic spectrogram and the first mask map comprises:
multiplying the corresponding elements of the first aliasing multi-sound-source sound spectrogram and the first mask map to obtain a second single-sound-source sound spectrogram;
and transforming the second single sound source spectrogram to a time domain by using short-time Fourier inversion transformation to obtain separated sound signals.
Specifically, the aliased multi-source spectrogram is multiplied element-wise with the predicted mask maps M̂_1 and M̂_2, respectively, to obtain the predicted single-source spectrograms Ŝ_1 and Ŝ_2.
Using the Inverse Short-Time Fourier Transform (iSTFT), the single-source spectrograms Ŝ_1 and Ŝ_2 are transformed into the time domain, and the separated sound signal 1 and the separated sound signal 2 are obtained.
The aliased multi-sound-source spectrogram and the predicted mask map are multiplied element by element and the inverse short-time Fourier transform is applied, so that the separated sound signals are obtained and sound source separation is achieved.
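A minimal sketch of this final step, assuming a magnitude mask and re-using the phase of the aliased mixture for the inverse transform (the phase handling is not specified above); the mask is assumed to have the same shape as the mixture spectrogram:

    import numpy as np
    import librosa

    def separate(a_mix, mask, n_fft=1022, hop=256):
        """Apply a predicted mask to the aliased spectrogram and return the
        separated time-domain waveform via the inverse STFT."""
        spec_mix = librosa.stft(a_mix, n_fft=n_fft, hop_length=hop, window="hann")
        spec_sep = np.abs(spec_mix) * mask * np.exp(1j * np.angle(spec_mix))
        return librosa.istft(spec_sep, hop_length=hop, window="hann", length=len(a_mix))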
According to the sound source separation method provided by the invention, the visual guide characteristics and the aliasing multi-sound-source sound spectrogram are input into the trained predictive coding cyclic convolution neural network model to predict the mask map of each sound component, and then the mask map and the aliasing multi-sound-source sound spectrogram are utilized to obtain the separated sound signal, so that the sound spectrogram and the visual guide characteristics are processed in the same network model, the network model is small in scale, the visual characteristics and the sound characteristics can be progressively and effectively fused, and the sound source separation precision is improved.
Fig. 6 is a schematic structural diagram of a sound source separation apparatus according to an embodiment of the present invention, and as shown in fig. 6, the present invention further provides a sound source separation apparatus, including: a first obtaining module 601, a second obtaining module 602, and a third obtaining module 603, wherein:
a first obtaining module 601, configured to obtain a visual guidance feature in a video frame image;
a second obtaining module 602, configured to input the first aliased multi-source spectrogram and the visual guidance feature into a trained predictive coding cyclic convolutional neural network model, so as to obtain a first mask map;
a third obtaining module 603, configured to obtain a separated sound signal according to the first aliased multi-source sound spectrogram and the first mask map.
Specifically, the sound source separation device provided in the embodiment of the present application can implement all the method steps implemented by the above method embodiment, and can achieve the same technical effect, and details of the same parts and beneficial effects as those of the method embodiment in this embodiment are not described herein again.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 7, the electronic device may include: a processor (processor) 710, a communication Interface (Communications Interface) 720, a memory (memory) 730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a sound source separation method comprising: acquiring visual guide features in a video frame image; inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map; and acquiring a separated sound signal according to the first aliasing multi-sound-source sound spectrogram and the first mask map.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the sound source separation method provided by the above methods, the method comprising: acquiring visual guide features in a video frame image; inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map; and acquiring a separated sound signal according to the first aliasing multi-source sound spectrogram and the first mask map.
In still another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, is implemented to perform the sound source separation method provided above, the method including: acquiring visual guide features in a video frame image; inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map; and acquiring a separated sound signal according to the first aliasing multi-source sound spectrogram and the first mask map.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The terms "first," "second," and the like in the embodiments of the present application are used for distinguishing between similar elements and not for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in other sequences than those illustrated or otherwise described herein, and that the terms "first" and "second" used herein generally refer to a class and do not limit the number of objects, for example, a first object can be one or more.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A sound source separation method, comprising:
acquiring visual guide features in a video frame image;
inputting the first aliasing multi-sound-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to obtain a first mask map;
acquiring a separated sound signal according to the first aliasing multi-source sound spectrogram and the first mask map;
the predictive coding cyclic convolution neural network model comprises a predictive coding cyclic convolution neural network, a single-layer transposition convolution layer and an up-sampling layer;
feedback connection is adopted among convolution layers in the prediction coding cyclic convolution neural network, feedforward connection is adopted among transposition convolution layers, and cyclic connection is adopted among convolution layers of the same layer and transposition convolution layers; the feedback connection is used to convey a prediction signal, the feedforward connection is used to convey an error signal between the prediction signal and an actual response, and the loop connection is used to convey a neuron response between a convolutional layer and a transposed convolutional layer of the same layer.
2. The sound source separation method according to claim 1, wherein the trained predictive coding circular convolutional neural network model is obtained by:
inputting visual guide features corresponding to the second aliasing multi-sound source spectrogram and the frame image of the single-sound source video into a predictive coding cyclic convolution neural network model, and outputting a second mask map; the second mask map is a predicted mask map;
comparing corresponding elements in the first single sound source spectrogram and the second aliasing multi-sound source spectrogram to obtain a third mask map; the third mask image is a real mask image;
taking the binary cross entropy between the second mask map and the third mask map as a loss function;
and optimizing the value of the loss function to obtain a trained predictive coding cyclic convolution neural network model.
3. The sound source separation method of claim 1, wherein inputting the first aliased multi-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolutional neural network model to obtain a first mask map comprises:
inputting the first aliasing multi-source spectrogram and the visual guidance features into the predictive coding cyclic convolution neural network to obtain visual-acoustic fusion features;
and inputting the visual and sound fusion features into the single-layer transposed convolution layer and the up-sampling layer to obtain a first mask image.
4. The sound source separation method according to claim 3, wherein inputting the first aliased multi-source sound spectrogram and the visual guidance features into the predictive coding cyclic convolution neural network to obtain visual-audio fusion features comprises:
inputting the first aliasing multi-source spectrogram into the top layer of the convolutional layer, sequentially acquiring a prediction signal of each convolutional layer, and acquiring a neuron response of each convolutional layer by using the prediction signal of each convolutional layer;
inputting the visual guide features into the bottom layer of the transposed convolutional layer, and sequentially acquiring the neuron response of each layer of the transposed convolutional layer according to the neuron response of the convolutional layer at the same layer;
in the last iteration of the loop, the neuron response of the top layer of the transposed convolutional layer is taken as a visual-acoustic fusion feature.
5. The sound source separation method according to claim 1, wherein obtaining a separated sound signal from the first aliased multi-source sound spectrogram and the first mask map comprises:
multiplying the first aliasing multi-sound-source sound spectrogram by corresponding elements of the first mask image to obtain a second single-sound-source sound spectrogram;
and transforming the second single-sound-source spectrogram into the time domain by using the inverse short-time Fourier transform to obtain the separated sound signals.
6. The sound source separation method according to claim 2, wherein the second aliased multi-source spectrogram is obtained by:
carrying out sound sampling on different single sound source video data to obtain different single sound source sound signals;
linearly overlapping the different single-sound-source sound signals to obtain aliasing multi-sound-source sound signals;
transforming the aliased multi-source sound signal into the second aliased multi-source spectrogram using a short-time Fourier transform.
7. A sound source separation apparatus, comprising:
the first acquisition module is used for acquiring visual guide features in the video frame images;
the second acquisition module is used for inputting the first aliasing multi-source sound spectrogram and the visual guidance features into a trained predictive coding cyclic convolution neural network model to acquire a first mask map;
a third obtaining module, configured to obtain a separated sound signal according to the first aliased multi-source spectrogram and the first mask map;
the predictive coding cyclic convolution neural network model comprises a predictive coding cyclic convolution neural network, a single-layer transposition convolution layer and an upper sampling layer;
feedback connection is adopted among convolution layers in the prediction coding cyclic convolution neural network, feedforward connection is adopted among transposition convolution layers, and cyclic connection is adopted among convolution layers of the same layer and transposition convolution layers; the feedback connection is used to convey a prediction signal, the feedforward connection is used to convey an error signal between the prediction signal and an actual response, and the loop connection is used to convey a neuron response between a convolutional layer and a transposed convolutional layer of the same layer.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the sound source separation method according to any of claims 1 to 6 when executing the computer program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the sound source separation method according to any one of claims 1 to 6.
CN202210073239.6A 2022-01-21 2022-01-21 Sound source separation method and device Active CN114596876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210073239.6A CN114596876B (en) 2022-01-21 2022-01-21 Sound source separation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210073239.6A CN114596876B (en) 2022-01-21 2022-01-21 Sound source separation method and device

Publications (2)

Publication Number Publication Date
CN114596876A CN114596876A (en) 2022-06-07
CN114596876B (en) 2023-04-07

Family

ID=81806801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210073239.6A Active CN114596876B (en) 2022-01-21 2022-01-21 Sound source separation method and device

Country Status (1)

Country Link
CN (1) CN114596876B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118191734A (en) * 2024-05-16 2024-06-14 杭州爱华仪器有限公司 Multi-sound source positioning method, device, program, storage medium and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086773B (en) * 2018-08-29 2022-03-04 电子科技大学 Fault plane identification method based on full convolution neural network
US11501532B2 (en) * 2019-04-25 2022-11-15 International Business Machines Corporation Audiovisual source separation and localization using generative adversarial networks
CN110970056B (en) * 2019-11-18 2022-03-11 清华大学 Method for separating sound source from video
US11610599B2 (en) * 2019-12-06 2023-03-21 Meta Platforms Technologies, Llc Systems and methods for visually guided audio separation
CN111539449B (en) * 2020-03-23 2023-08-18 广东省智能制造研究所 Sound source separation and positioning method based on second-order fusion attention network model
CN112132158A (en) * 2020-09-04 2020-12-25 华东师范大学 Visual picture information embedding method based on self-coding network
CN112712819B (en) * 2020-12-23 2022-07-26 电子科技大学 Visual auxiliary cross-modal audio signal separation method
CN113255837A (en) * 2021-06-29 2021-08-13 南昌工程学院 Improved CenterNet network-based target detection method in industrial environment
CN113593601A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Audio-visual multi-modal voice separation method based on deep learning
CN113850246B (en) * 2021-11-30 2022-02-18 杭州一知智能科技有限公司 Method and system for sound source positioning and sound source separation based on dual coherent network

Also Published As

Publication number Publication date
CN114596876A (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN109891434B (en) Generating audio using neural networks
US11017761B2 (en) Parallel neural text-to-speech
Luo et al. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation
KR102392094B1 (en) Sequence processing using convolutional neural networks
CN110335584A (en) Neural network generates modeling to convert sound pronunciation and enhancing training data
Agrawal et al. Modulation filter learning using deep variational networks for robust speech recognition
KR101807961B1 (en) Method and apparatus for processing speech signal based on lstm and dnn
WO2020039571A1 (en) Voice separation device, voice separation method, voice separation program, and voice separation system
CN107452369A (en) Phonetic synthesis model generating method and device
CN112562634A (en) Multi-style audio synthesis method, device, equipment and storage medium
CN111341294B (en) Method for converting text into voice with specified style
Abouzid et al. Signal speech reconstruction and noise removal using convolutional denoising audioencoders with neural deep learning
CN111128211B (en) Voice separation method and device
CN113380262B (en) Sound separation method based on attention mechanism and disturbance perception
CN114596876B (en) Sound source separation method and device
JP6099032B2 (en) Signal processing apparatus, signal processing method, and computer program
CN116994564A (en) Voice data processing method and processing device
EP3507993B1 (en) Source separation for reverberant environment
CN116013343A (en) Speech enhancement method, electronic device and storage medium
CN115295002A (en) Single-channel speech enhancement method based on interactive time-frequency attention mechanism
JP6167063B2 (en) Utterance rhythm transformation matrix generation device, utterance rhythm transformation device, utterance rhythm transformation matrix generation method, and program thereof
CN111798859B (en) Data processing method, device, computer equipment and storage medium
CN113744753B (en) Multi-person voice separation method and training method of voice separation model
JP7472575B2 (en) Processing method, processing device, and program
WO2018044801A1 (en) Source separation for reverberant environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant