CN111370019B - Sound source separation method and device, and neural network model training method and device

Sound source separation method and device, and neural network model training method and device

Info

Publication number
CN111370019B
CN111370019B (application CN202010136342.1A)
Authority
CN
China
Prior art keywords
current training
training
audio
sound source
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010136342.1A
Other languages
Chinese (zh)
Other versions
CN111370019A (en)
Inventor
孔秋强 (Kong Qiuqiang)
王雨轩 (Wang Yuxuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ByteDance Inc
Original Assignee
ByteDance Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ByteDance Inc filed Critical ByteDance Inc
Priority to CN202010136342.1A
Publication of CN111370019A
Application granted
Publication of CN111370019B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A sound source separation method, a neural network model training method, a sound source separation device, a neural network model training device and a storage medium. The sound source separation method comprises the following steps: acquiring mixed audio; determining a sound source tag group corresponding to the mixed audio; determining a condition vector group according to the sound source tag group; and inputting the condition vector group and the mixed audio to a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group are in one-to-one correspondence with the condition vectors of the condition vector group.

Description

Sound source separation method and device, and neural network model training method and device
Technical Field
Embodiments of the present disclosure relate to a sound source separation method, a model training method of a neural network, a sound source separation apparatus, a model training apparatus of a neural network, and a storage medium.
Background
Sound source separation is a technique for separating sound sources in a sound recording. Sound source separation is the basis of computational auditory scene analysis (CASA) systems. Essentially, CASA systems aim to separate the sound sources in mixed audio in the same way as a human listener would. A CASA system may detect and separate mixed audio to obtain different sound sources. Because there are a large number of sound events in the world, many different sound events may occur simultaneously, resulting in the well-known cocktail party problem. Sound source separation may be performed using unsupervised methods that model an average harmonic structure, neural-network-based methods, and the like. Neural-network-based methods include fully connected neural networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and the like.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
At least one embodiment of the present disclosure provides a sound source separation method, including: acquiring mixed audio; determining a sound source tag group corresponding to the mixed audio; determining a condition vector group according to the sound source tag group; and inputting the condition vector group and the mixed audio to a first neural network for sound source separation processing to obtain a target sound source group, wherein target sound sources in the target sound source group are in one-to-one correspondence with condition vectors of the condition vector group.
At least one embodiment of the present disclosure further provides a model training method of a neural network, including: obtaining a training sample set, wherein the training sample set comprises a plurality of training data sets, each training data set comprises a training mixed audio, a plurality of training audio segments, and a plurality of first training condition vectors, the training mixed audio comprises the plurality of training audio segments, and the plurality of first training condition vectors are in one-to-one correspondence with the plurality of training audio segments; and training a first neural network to be trained by using the training sample set to obtain the first neural network, wherein the first neural network to be trained comprises a loss function, and training the first neural network to be trained by using the training sample set to obtain the first neural network comprises: acquiring a current training data set from the training sample set, wherein the current training data set comprises a current training mixed audio and a plurality of current training audio segments, and the current training mixed audio comprises the plurality of current training audio segments; determining a plurality of first current training condition vectors in one-to-one correspondence with the plurality of current training audio segments, wherein the current training data set further comprises the plurality of first current training condition vectors; inputting the current training mixed audio and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of first current training target sound sources; calculating a first loss value of the loss function of the first neural network to be trained according to the plurality of first current training target sound sources and the plurality of current training audio segments; and correcting parameters of the first neural network to be trained according to the first loss value, obtaining the trained first neural network when the loss function meets a preset condition, and continuing to input current training data sets to repeat the training process when the loss function does not meet the preset condition.
At least one embodiment of the present disclosure also provides a sound source separation apparatus including: a memory for non-transitory storage of computer readable instructions; and a processor for executing the computer readable instructions, which when executed by the processor, perform the sound source separation method according to any of the above embodiments.
At least one embodiment of the present disclosure also provides a model training apparatus, including: a memory for non-transitory storage of computer readable instructions; and a processor for executing the computer readable instructions, which when executed by the processor, perform the model training method according to any of the embodiments described above.
At least one embodiment of the present disclosure also provides a storage medium, non-transitory storing computer readable instructions, which when executed by a computer, may perform the sound source separation method according to any one of the above embodiments.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a schematic flow chart of a sound source separation method provided in at least one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a first neural network according to at least one embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of a method for model training of a neural network provided in accordance with at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a first current training audio clip, a first current training sound anchor vector corresponding to the first current training audio clip, a second current training audio clip, and a second current training sound anchor vector corresponding to the second current training audio clip according to at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of separating sound sources from a mixture of a first current training audio segment and a second current training audio segment provided by at least one embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of a sound source separation device provided in accordance with at least one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of a model training apparatus provided in accordance with at least one embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of a storage medium provided by at least one embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Most current sound source separation systems are designed to separate a particular type of sound source, such as speech or music. CASA systems, however, require a large number of sound sources to be separated, which is an open sound source separation problem. There may be hundreds of sound source types in the real world, which greatly increases the difficulty of separating all of these sound sources. Existing sound source separation systems require training on pairs consisting of a clean sound and a mixed sound that includes that clean sound. For example, to separate a human voice from music, it is necessary to train with a pair consisting of a mixed sound (a mixture of human voice and music) and the pure human voice. However, no dataset is currently available that provides clean sound for a large number of sound source types. For example, collecting clean natural sounds (e.g., thunder) is impractical because natural sounds are often mixed with other sounds.
At least one embodiment of the present disclosure provides a sound source separation method, a model training method of a neural network, a sound source separation apparatus, a model training apparatus of a neural network, and a storage medium. The sound source separation method comprises the following steps: acquiring mixed audio; determining a sound source tag group corresponding to the mixed audio; determining a condition vector group according to the sound source tag group; and inputting the condition vector group and the mixed audio to a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group are in one-to-one correspondence with the condition vectors of the condition vector group.
According to the sound source separation method, condition vectors corresponding to different sound sources are introduced, so that the various sound sources corresponding to those condition vectors can be separated from the same mixed audio, thereby addressing the problem of separating a large number of sound source types when training with weakly labeled data.
Embodiments of the present disclosure will be described in detail below with reference to the attached drawings, but the present disclosure is not limited to these specific embodiments. In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits a detailed description of some known functions and known components.
The regression-based sound source separation technique is briefly described below.
Neural-network-based regression methods can be used to solve problems such as sound source separation and speech enhancement. The sound source separation method provided by the embodiments of the present disclosure also belongs to the regression-based methods. Regression-based methods learn the mapping from a mixed sound source to the target sound source to be separated. For example, the individual sound sources can be expressed as s1, s2, …, sk, where k is the number of sound sources and is a positive integer, and each sound source (s1, s2, …, or sk) is represented as a time-domain signal (a time-domain signal describes a mathematical function or physical signal as a function of time; e.g., the time-domain waveform of a signal expresses how the signal changes over time). The mixed sound source is then expressed as the sum of the individual sound sources: x = s1 + s2 + … + sk.
Current sound source separation systems require that a regression mapping f(x) be built for each individual sound source sk: ŝk = f(x), where x is the mixed sound source and ŝk is the estimate of sk. For the speech enhancement task, sk is the target clean speech. For the sound source separation task, sk may be a music part or an accompaniment part. The regression mapping f(x) can be modeled in the waveform domain or in the time-frequency (T-F) domain. For example, f(x) may be constructed in the T-F domain.
For example, the mixed sound source x and each individual sound source sk can be converted into X and Sk, respectively, using a short-time Fourier transform (STFT), and the magnitude and phase of X are expressed as |X| and e^(j∠X), where |X| is called the spectrogram of the mixed sound source. A neural network may then be used to map the spectrogram |X| of the mixed sound source to an estimated spectrogram |Ŝk| of the individual sound source. The phase of the mixed sound source is then used to determine the estimated individual sound source: Ŝk = |Ŝk| · e^(j∠X). Finally, the inverse STFT is applied to Ŝk to obtain the separated individual sound source sk. However, current sound source separation systems require clean sound sources for training. Moreover, each sound source separation system can separate only one sound source type, so the number of sound source separation systems grows linearly with the number of sound source types, which is impractical for separating all sound source types in AudioSet (AudioSet is a large-scale weakly labeled audio dataset). Moreover, the audio clips in AudioSet come from real recorded videos (e.g., YouTube videos); each audio clip may include a plurality of different sound events, and there is no clean sound source in the AudioSet dataset. An audio clip selected by a sound event detection (SED) system only indicates the presence of sound events, and the same audio clip may contain a plurality of different sound events, which makes training a sound source separation system difficult.
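For illustration, a minimal sketch of this T-F-domain regression pipeline is given below; the spectrogram model f, the FFT size, and the hop length are assumptions chosen for the example, not values specified in this disclosure.

```python
# A minimal sketch of the T-F domain regression pipeline described above.
# The model `f`, FFT size, and hop length are illustrative assumptions.
import numpy as np
import librosa

def separate_tf_domain(x, f, n_fft=1024, hop_length=256):
    """Map a mixture waveform x to one estimated source via a spectrogram model f."""
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)  # complex STFT of the mixture
    mag, phase = np.abs(X), np.angle(X)                      # |X| and the mixture phase
    est_mag = f(mag)                                         # model predicts the source spectrogram
    S_hat = est_mag * np.exp(1j * phase)                     # reuse the mixture phase
    return librosa.istft(S_hat, hop_length=hop_length)       # inverse STFT back to a waveform
```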
Fig. 1 is a schematic flowchart of a sound source separation method according to at least one embodiment of the present disclosure, as shown in fig. 1, the sound source separation method includes steps S10-S13.
For example, as shown in fig. 1, a sound source separation method provided in an embodiment of the present disclosure includes:
step S10: acquiring mixed audio;
step S11: determining a sound source tag group corresponding to the mixed audio;
step S12: determining a condition vector group according to the sound source tag group;
step S13: and inputting the condition vector group and the mixed audio to a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group are in one-to-one correspondence with the condition vectors of the condition vector group.
In the sound source separation method provided by the embodiment of the disclosure, a plurality of different sound sources are separated by setting condition vectors corresponding to the different sound sources based on the regression method of the neural network.
For example, in step S10, the mixed audio may include a mixture of various sounds, which may include human speech, singing, natural sounds such as thunder and rain, the sounds of musical instruments being played, and the like. For example, the various sounds may be collected by a sound collection device and may be stored by means of a storage device or the like. For example, the mixed audio may also be derived from the AudioSet dataset.
For example, in some examples, the mixed audio may include at least two different types of sound.
For example, in some examples, step S10 may include: acquiring original mixed audio; and performing spectrum transformation processing on the original mixed audio to obtain mixed audio.
For example, the mixed audio may be represented as a spectrogram of sound.
For example, the original mixed audio may be audio directly captured by a sound capture device. The sound collection device may include various types of microphones, microphone arrays, or other devices capable of collecting sound. The microphone may be an electret capacitor microphone or a microelectromechanical system (MEMS) microphone or the like.
For example, in some examples, performing spectral transformation processing on the original mixed audio to obtain mixed audio includes: and performing short-time Fourier transform processing on the original mixed audio to obtain the mixed audio.
For example, a short-time Fourier transform (STFT) may convert the original mixed audio x into X, whose magnitude is represented as |X|; |X| is the short-time Fourier spectrum of the sound, i.e., the mixed audio.
For example, the pitch perceived by the human ear is not linearly related to the actual frequency, and the Mel frequency scale better matches the auditory characteristics of the human ear: the Mel scale is approximately linear below 1000 Hz and grows logarithmically above 1000 Hz. Thus, in other examples, performing spectral transformation processing on the original mixed audio to obtain the mixed audio includes: performing short-time Fourier transform processing on the original mixed audio to obtain intermediate mixed audio; and performing logarithmic Mel spectrum processing on the intermediate mixed audio to obtain the mixed audio.
At this time, the intermediate mixed audio represents a short-time fourier spectrum of the sound, and the mixed audio represents a logarithmic mel spectrum of the sound.
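As an illustration only, a short sketch of this two-stage transform (STFT followed by a logarithmic Mel projection) is given below; the sample rate, FFT size, hop length, and Mel-bin count are assumptions, not values taken from this disclosure.

```python
# Hedged sketch: STFT -> power spectrum -> Mel filter bank -> log scale.
# All numeric parameters here are illustrative assumptions.
import numpy as np
import librosa

def log_mel_spectrogram(x, sr=32000, n_fft=1024, hop_length=320, n_mels=64):
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)  # intermediate mixed audio (STFT)
    mel = librosa.feature.melspectrogram(S=np.abs(X) ** 2, sr=sr, n_fft=n_fft, n_mels=n_mels)
    return librosa.power_to_db(mel)                          # logarithmic Mel spectrum
```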
It should be noted that the original mixed audio may include various noises, so that, in addition to the spectrum transformation processing of the original mixed audio, processing such as speech enhancement and noise reduction may be performed on the original mixed audio, so as to eliminate irrelevant information or noise information in the original mixed audio, thereby obtaining mixed audio with better quality.
For example, step S11 may include: performing sound event detection on the mixed audio by using a second neural network to determine a sound event group included in the mixed audio; a set of sound source tags is determined from the set of sound events.
For example, the number of sound events in the sound event group is equal to or greater than the number of sound source tags in the sound source tag group.
For example, a sound event group may include a plurality of sound events, and a sound source tag group may include a plurality of sound source tags, the plurality of sound source tags being different from each other. In some examples, the plurality of sound events are different from each other and correspond to different sound source types; in this case, the number of sound source tags is equal to the number of sound events, and the sound source tags correspond one-to-one to the sound events. In other examples, the plurality of sound events are different from each other, but the sound source types corresponding to some of the sound events may be the same, so that those sound events correspond to the same sound source tag; in this case, the number of sound source tags is smaller than the number of sound events.
It should be noted that, in the present disclosure, if the sound source types corresponding to two sound events are the same, and the occurrence times corresponding to the two sound events are different, the two sound events are considered as two different sound events; if the sound source types corresponding to the two sound events are different and the occurrence times corresponding to the two sound events are the same, the two sound events are also considered to be two different sound events.
For example, each sound source tag represents the sound source type of the corresponding sound event or events. Sound source types may include: human speech, singing, natural thunder and rain, the playing of various musical instruments, various animal sounds, machine sounds, and so on. It should be noted that human speech may be treated as one sound source type; the sounds of different animals may be different sound source types, for example, tiger sounds and monkey sounds may be different sound source types; and the sounds played by different instruments may be different sound source types, for example, piano sound and violin sound are different sound source types.
It should be noted that, in some embodiments, the mixed audio provided by the present disclosure may also be a clean sound source, that is, only include one sound source, where the sound event group includes only one sound event, and accordingly, the sound source tag group includes only one sound source tag corresponding to the sound event.
For example, the second neural network may detect the mixed audio using a sound event detection (SED) technique. The second neural network may be any suitable network such as a convolutional neural network; for example, the second neural network may be a convolutional neural network comprising 13 convolutional layers, and each convolutional layer may comprise 3×3 convolution kernels. For example, in some examples, the second neural network may be AlexNet, VGGNet, ResNet, or the like.
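A hedged sketch of how the sound source tag group could be derived from the second neural network's output is shown below; the clip-level sigmoid output, the 0.5 threshold, and the class list are assumptions made for illustration.

```python
# Illustrative sketch: turn SED output probabilities into a sound source tag group.
# The model interface, threshold, and class list are assumptions.
import torch

@torch.no_grad()
def detect_sound_source_tags(sed_model, mixed_audio, class_names, threshold=0.5):
    logits = sed_model(mixed_audio.unsqueeze(0))   # (1, N) clip-level scores
    probs = torch.sigmoid(logits).squeeze(0)       # probability per sound source type
    # One tag per detected type; events of the same type collapse to a single tag.
    return [name for name, p in zip(class_names, probs.tolist()) if p > threshold]
```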
For example, in step S12, in some examples, the condition vector group includes at least one condition vector and the sound source tag group includes at least one sound source tag; the number of sound source tags equals the number of condition vectors, and the condition vectors correspond one-to-one to the sound source tags. Each condition vector includes N type probability values, that is, the sound source separation method provided by the embodiments of the present disclosure can identify and separate N sound source types, where N is a positive integer. Among the N type probability values of a condition vector, the value at the position of the sound source type indicated by that condition vector's sound source tag is the target type probability value; the target type probability value is 1, and all remaining type probability values are 0. By setting the condition vectors in this way, each target sound source can be separated out as a clean sound source.
For example, for an AudioSet dataset having a large-scale audio dataset of 527 sound source types, N is 527. That is, each condition vector may be a one-dimensional vector having 527 elements.
For example, in some embodiments, the N sound source types correspond one-to-one to N sound source tags, where the N sound source tags may include a first sound source tag, a second sound source tag, …, and an N-th sound source tag, and the N sound source types include a first sound source type corresponding to the first sound source tag, a second sound source type corresponding to the second sound source tag, …, and an N-th sound source type corresponding to the N-th sound source tag. The N type probability values in each condition vector may include a first type probability value, a second type probability value, …, an i-th type probability value, …, and an N-th type probability value, and each condition vector may be represented as P = {p1, p2, …, pi, …, pN}, where p1 represents the first type probability value, p2 represents the second type probability value, pi represents the i-th type probability value, and pN represents the N-th type probability value, i being a positive integer with 1 ≤ i ≤ N. The first type probability value represents the probability of the first sound source type corresponding to the first sound source tag, the second type probability value represents the probability of the second sound source type corresponding to the second sound source tag, and so on, and the N-th type probability value represents the probability of the N-th sound source type corresponding to the N-th sound source tag. In some examples, the condition vector group includes a first condition vector and a second condition vector. If the first condition vector corresponds to the first sound source tag, then in the first condition vector the first type probability value is the target type probability value, p1 = 1, and all other type probability values are 0, that is, p2 = 0, …, pi = 0, …, pN = 0; if the second condition vector corresponds to the N-th sound source tag, then in the second condition vector the N-th type probability value is the target type probability value, pN = 1, and all other type probability values are 0, that is, p1 = 0, p2 = 0, …, pi = 0, ….
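A minimal sketch of constructing such one-hot condition vectors is shown below, assuming N = 527 as in the AudioSet example; the tensor library and the index-based interface are assumptions.

```python
# Minimal sketch: one one-hot condition vector per sound source tag.
import torch

def build_condition_vectors(tag_indices, num_classes=527):
    """tag_indices: index of each sound source tag's type among the N sound source types."""
    vectors = []
    for idx in tag_indices:
        p = torch.zeros(num_classes)  # all N type probability values start at 0
        p[idx] = 1.0                  # target type probability value is 1
        vectors.append(p)
    return vectors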
For example, in some examples, the condition vector set includes a plurality of condition vectors, at which time step S13 may include: determining a plurality of input data sets based on the mixed audio and the plurality of condition vectors; and respectively carrying out sound source separation processing on the plurality of input data sets by using the first neural network so as to obtain a target sound source set.
For example, the plurality of input data sets are in one-to-one correspondence with a plurality of condition vectors, each of the plurality of input data sets including mixed audio and one of the plurality of condition vectors. That is, the first neural network includes two inputs, one for each of the mixed audio and one of the condition vectors. For example, in some examples, the plurality of input data sets includes a first input data set and a second input data set, the plurality of condition vectors includes a first condition vector and a second condition vector, the first input data set includes the first condition vector and the mixed audio, and the second input data set includes the second condition vector and the mixed audio.
For example, the target sound source group includes a plurality of target sound sources in one-to-one correspondence with a plurality of condition vectors, the plurality of input data groups in one-to-one correspondence with a plurality of target sound sources, each of the plurality of target sound sources corresponding to a condition vector in the input data group corresponding to each of the target sound sources. For example, the plurality of target sound sources include a first target sound source and a second target sound source, the first target sound source corresponds to the first input data set, the second target sound source corresponds to the second input data set, that is, the first target sound source corresponds to the first condition vector, the second target sound source corresponds to the second condition vector, and if the sound source type corresponding to the first condition vector is human sound, the sound source type corresponding to the second condition vector is piano sound, the first target sound source is human sound, and the second target sound source is piano sound.
For example, the plurality of condition vectors are different from each other, so that the plurality of target sound sources are different from each other.
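A hedged sketch of step S13 under these assumptions is shown below: the same mixed audio is paired with each condition vector to form an input data group, and the first neural network is run once per group; the exact calling convention of the first neural network is an assumption.

```python
# Illustrative sketch of step S13: one forward pass per (mixed audio, condition vector) pair.
import torch

@torch.no_grad()
def separate_sources(first_nn, mixed_audio, condition_vectors):
    target_sound_sources = []
    for cond in condition_vectors:                      # one input data group per condition vector
        source = first_nn(mixed_audio.unsqueeze(0), cond.unsqueeze(0))
        target_sound_sources.append(source.squeeze(0))  # target sound source for this condition
    return target_sound_sources                         # one-to-one with the condition vectors
```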
For example, each target sound source may comprise at least one audio clip. In some examples, the mixed audio includes a sound event group comprising M sound events, where M is a positive integer. For example, if the M sound events correspond to M different sound source types, then processing the mixed audio yields a target sound source group including M target sound sources, each of which includes one audio clip. For another example, if Q of the M sound events correspond to the same sound source type, where Q is a positive integer and Q ≤ M, then processing the mixed audio yields a target sound source group including (M-Q+1) target sound sources; the Q sound events correspond to one target sound source, and that target sound source may include Q audio clips.
For example, the time length of each audio clip in the target sound source may be set by the user. For example, the time length of each audio clip may be 2 seconds (s). If a target sound source includes a plurality of audio clips, the time lengths of the plurality of audio clips may be the same.
Fig. 2 is a schematic diagram of a first neural network according to at least one embodiment of the present disclosure.
For example, the first neural network is any suitable network such as a U-shaped neural network (U-net), a convolutional neural network, or the like.
For example, in some examples, the first neural network is a U-Net that includes four encoding layers and four decoding layers, each encoding layer having two convolutional layers and one pooling layer, and each decoding layer having two transposed convolutional layers and one anti-pooling layer. U-Net is a variant of the convolutional neural network and is composed of multiple encoding layers and multiple decoding layers built from convolutional layers. Each encoding layer halves the size of the feature map but doubles the number of channels, i.e., encodes the spectrogram into a smaller, deeper feature representation. Each decoding layer decodes the feature map back toward a spectrogram of the original size through its transposed convolutional layers. In U-Net, a skip connection is added between the encoding layer and the decoding layer at the same hierarchical level. The connection may be, for example, a merge: feature maps of the same size from the connected encoding and decoding layers are concatenated (so that the corresponding feature vectors are merged and the number of channels at that layer is doubled), which allows low-level information to flow directly from the high-resolution input to the high-resolution output, that is, to be combined with the high-level depth representation.
For example, as shown in fig. 2, four encoding layers are a first encoding layer 20, a second encoding layer 21, a third encoding layer 22, and a fourth encoding layer 23, respectively, and four decoding layers are a first decoding layer 30, a second decoding layer 31, a third decoding layer 32, and a fourth decoding layer 33, respectively. The first coding layer 20 includes a convolution layer CN11, a convolution layer CN12, and a pooling layer PL11, the second coding layer 21 includes a convolution layer CN21, a convolution layer CN22, and a pooling layer PL12, the third coding layer 22 includes a convolution layer CN31, a convolution layer CN32, and a pooling layer PL13, and the fourth coding layer 23 includes a convolution layer CN41, a convolution layer CN42, and a pooling layer PL14. The first decoding layer 30 includes a transposed convolutional layer TC11, a transposed convolutional layer TC12, and an anti-pooling layer UP11, the second decoding layer 31 includes a transposed convolutional layer TC21, a transposed convolutional layer TC22, and an anti-pooling layer UP12, the third decoding layer 32 includes a transposed convolutional layer TC31, a transposed convolutional layer TC32, and an anti-pooling layer UP13, and the fourth decoding layer 33 includes a transposed convolutional layer TC41, a transposed convolutional layer TC42, and an anti-pooling layer UP14.
For example, the first encoding layer 20 is correspondingly connected to the first decoding layer 30, the second encoding layer 21 is correspondingly connected to the second decoding layer 31, the third encoding layer 22 is correspondingly connected to the third decoding layer 32, and the fourth encoding layer 23 is correspondingly connected to the fourth decoding layer 33.
For example, in some examples, the number of channels of one input of the first neural network is 16, as shown in fig. 2, the convolution layer CN11 in the first coding layer 20 is used to extract features of the input to generate a feature map F11; the convolution layer CN12 in the first coding layer 20 is configured to perform a feature extraction operation on the feature map F11 to obtain a feature map F12; the pooling layer PL11 in the first encoding layer 20 is configured to perform a downsampling operation on the feature map F12 to obtain a feature map F13.
For example, the feature map F12 may be transmitted to the first decoding layer 30.
For example, the number of channels of the feature map F11, the number of channels of the feature map F12, and the number of channels of the feature map F13 are all the same, for example, all 64. For example, in some examples, the feature map F11 is the same size as the feature map F12, and both are larger than the feature map F13, e.g., the size of the feature map F11 is four times the size of the feature map F13; in other examples, the size of feature map F11 is greater than the size of feature map F12, the size of feature map F12 is greater than the size of feature map F13, and the size of feature map F12 is four times the size of feature map F13, e.g., feature map F11 has a size of 570 x 570, feature map F12 has a size of 568 x 568, and feature map F13 has a size of 284 x 284.
For example, as shown in fig. 2, the convolution layer CN21 in the second coding layer 21 is used to extract features of the feature map F13 to generate the feature map F21; the convolution layer CN22 in the second coding layer 21 is configured to perform a feature extraction operation on the feature map F21 to obtain a feature map F22; the pooling layer PL12 in the second encoding layer 21 is configured to perform a downsampling operation on the feature map F22 to obtain a feature map F23.
For example, the feature map F22 may be transmitted to the second decoding layer 31.
For example, the number of channels of the feature map F21, the number of channels of the feature map F22, and the number of channels of the feature map F23 are all the same, for example, 128. For example, in some examples, the dimensions of feature map F21 and feature map F22 are the same and are both larger than the dimensions of feature map F23. For example, the size of the feature map F21 is four times the size of the feature map F23; in other examples, the size of feature map F21 is greater than the size of feature map F22, and the size of feature map F22 is greater than the size of feature map F23, e.g., the size of feature map F22 is four times the size of feature map F23.
For example, as shown in fig. 2, the convolution layer CN31 in the third encoding layer 22 is used to extract features of the feature map F23 to generate the feature map F31; the convolution layer CN32 in the third coding layer 22 is configured to perform a feature extraction operation on the feature map F31 to obtain a feature map F32; the pooling layer PL13 in the third coding layer 22 is configured to perform a downsampling operation on the feature map F32 to obtain a feature map F33.
For example, the feature map F32 may be transmitted to the third decoding layer 32.
For example, the number of channels of the feature map F31, the number of channels of the feature map F32, and the number of channels of the feature map F33 are all the same, for example, 256. For example, in some examples, the dimensions of feature map F31 and the dimensions of feature map F32 are the same and both are greater than the dimensions of feature map F33. For example, the size of the feature map F31 is four times the size of the feature map F33; in other examples, the size of feature map F31 is greater than the size of feature map F32, and the size of feature map F32 is greater than the size of feature map F33, e.g., the size of feature map F32 is four times the size of feature map F33.
For example, as shown in fig. 2, the convolution layer CN41 in the fourth encoding layer 23 is used to extract features of the feature map F33 to generate the feature map F41; the convolution layer CN42 in the fourth coding layer 23 is configured to perform a feature extraction operation on the feature map F41 to obtain a feature map F42; the pooling layer PL14 in the fourth coding layer 23 is configured to perform a downsampling operation on the feature map F42 to obtain a feature map F43.
For example, the feature map F42 may be transmitted to the fourth decoding layer 33.
For example, the number of channels of the feature map F41, the number of channels of the feature map F42, and the number of channels of the feature map F43 are all the same, for example, 512. For example, in some examples, the dimensions of feature map F41 and feature map F42 are the same and are both larger than the dimensions of feature map F43. For example, the size of the feature map F41 is four times the size of the feature map F43; in other examples, the size of feature map F41 is greater than the size of feature map F42, and the size of feature map F42 is greater than the size of feature map F43, e.g., the size of feature map F42 is four times the size of feature map F43.
For example, as shown in fig. 2, the first neural network further includes an encoding output layer 25 and a decoding input layer 26, the encoding output layer 25 may be connected to the decoding input layer 26, the encoding output layer 25 may be further connected to the fourth encoding layer 23, and the decoding input layer 26 may be further connected to the fourth decoding layer 33. The encoded output layer 25 includes a convolutional layer CN51 and the decoded input layer 26 includes a convolutional layer CN52. The convolution layer CN51 in the encoding output layer 25 is used to perform a feature extraction operation on the feature map F43 to generate a feature map F51. The feature map F51 is output to the decoding input layer 26, and the convolution layer CN52 in the decoding input layer 26 is configured to perform a feature extraction operation on the feature map F51 to obtain a feature map F52.
For example, the number of channels of the feature map F51 may be 1024, and the number of channels of the feature map F52 may be 512. For example, in some examples, the size of feature map F43 is greater than the size of feature map F51, and the size of feature map F51 is greater than the size of feature map F52; for another example, in other examples, the dimensions of feature map F43, the dimensions of feature map F51, and the dimensions of feature map F52 are all the same.
For example, as shown in fig. 2, the inverse pooling UP14 of the fourth decoding layer 33 is used to perform an upsampling operation on the feature map F52 to obtain a feature map F53; the feature map F53 and the feature map F42 transmitted by the fourth encoding layer 23 may be combined, and then the transpose convolution layer TC41 of the fourth decoding layer 33 performs a deconvolution operation on the combined feature map F53 and feature map F42 to obtain a feature map F61; the transpose convolution layer TC42 of the fourth decoding layer 33 is configured to perform a deconvolution operation on the feature map F61 to obtain and output a feature map F62 to the third decoding layer 32.
For example, the number of channels of the feature map F42, the number of channels of the feature map F53, and the number of channels of the feature map F61 may be the same, for example, 512, and the number of channels of the feature map F62 may be 256. For example, in some examples, the dimensions of feature map F42, the dimensions of feature map F53, the dimensions of feature map F61, and the dimensions of feature map F62 may be the same; in other examples, feature map F42 has a size greater than feature map F53, feature map F53 has a size greater than feature map F61, feature map F61 has a size greater than feature map F62, e.g., feature map F42 has a size of 64 x 64, feature map F53 has a size of 56 x 56, feature map F61 has a size of 54 x 54, and feature map F62 has a size of 52 x 52.
For example, as shown in fig. 2, the inverse pooling UP13 of the third decoding layer 32 is used to perform an upsampling operation on the feature map F62 to obtain a feature map F63; the feature map F63 and the feature map F32 transmitted by the third encoding layer 22 may be combined, and then the transpose convolution layer TC31 of the third decoding layer 32 performs a deconvolution operation on the combined feature map F63 and feature map F32 to obtain a feature map F71; the transposed convolutional layer TC32 of the third decoding layer 32 is configured to perform a deconvolution operation on the feature map F71 to obtain and output a feature map F72 to the second decoding layer 31.
For example, the number of channels of the feature map F32, the number of channels of the feature map F63, and the number of channels of the feature map F71 may be the same, for example, 256, and the number of channels of the feature map F72 is 128. For example, in some examples, the dimensions of feature map F32, the dimensions of feature map F63, the dimensions of feature map F71, and the dimensions of feature map F72 may be the same; in other examples, feature map F32 has a size greater than feature map F63, feature map F63 has a size greater than feature map F71, and feature map F71 has a size greater than feature map F72.
For example, as shown in fig. 2, the inverse pooling UP12 of the second decoding layer 31 is used to perform an upsampling operation on the feature map F72 to obtain a feature map F73; the feature map F73 and the feature map F22 transmitted by the second encoding layer 21 may be combined, and then the transpose convolution layer TC21 of the second decoding layer 31 performs a deconvolution operation on the combined feature map F73 and feature map F22 to obtain a feature map F81; the transposed convolutional layer TC22 of the second decoding layer 31 is configured to perform a deconvolution operation on the feature map F81 to obtain and output a feature map F82 to the first decoding layer 30.
For example, the number of channels of the feature map F22, the number of channels of the feature map F73, and the number of channels of the feature map F81 may be the same, for example, 128, and the number of channels of the feature map F82 is 64. For example, in some examples, the dimensions of feature map F22, the dimensions of feature map F73, the dimensions of feature map F81, and the dimensions of feature map F82 may be the same; in other examples, feature map F22 has a size greater than feature map F73, feature map F73 has a size greater than feature map F81, and feature map F81 has a size greater than feature map F82.
For example, as shown in fig. 2, the inverse pooling UP11 of the first decoding layer 30 is used to perform an upsampling operation on the feature map F82 to obtain a feature map F83; the feature map F83 and the feature map F12 transmitted by the first encoding layer 20 may be combined, and then the transpose convolution layer TC11 of the first decoding layer 30 performs a deconvolution operation on the combined feature map F83 and feature map F12 to obtain a feature map F91; the transpose convolution layer TC12 of the first decoding layer 30 is configured to perform a deconvolution operation on the feature map F91 to obtain and output a feature map F92.
For example, the number of channels of the feature map F12, the number of channels of the feature map F83, and the number of channels of the feature map F91 may be the same, for example, all 64, and the number of channels of the feature map F92 is 32. For example, in some examples, the dimensions of feature map F12, the dimensions of feature map F83, the dimensions of feature map F91, and the dimensions of feature map F92 may be the same; in other examples, feature map F12 has a size greater than feature map F83, feature map F83 has a size greater than feature map F91, and feature map F91 has a size greater than feature map F92.
For example, as shown in fig. 2, in some examples, the first neural network further includes an output layer including a convolution layer CN6, and the convolution layer CN6 may include a 1×1 convolution kernel. The convolution layer CN6 is configured to perform a convolution operation on the feature map F92 to obtain the output of the first neural network, and the number of channels of the output of the first neural network may be 1.
It should be noted that, the detailed description of the structure and function of the U-net may refer to the related content of the U-net in the prior art, and will not be described in detail herein.
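The sketch below is a compact, hedged rendering of the conditioned U-Net described above (four encoding layers, a 1024-channel bottleneck, four decoding layers with merged skip connections, and a 1×1 output convolution). How the condition vector is injected is not spelled out in this passage; projecting it and adding it to the bottleneck features, as done here, is an illustrative assumption, as are the kernel sizes and activation functions.

```python
# Hedged sketch of the conditioned U-Net; channel counts follow the description
# above (64/128/256/512 encoders, 1024-channel bottleneck, 32-channel F92, 1-channel output).
# Kernel sizes, activations, and the condition-injection scheme are assumptions.
import torch
import torch.nn as nn

def enc_block(cin, cout):   # two convolutional layers of one encoding layer
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())

def dec_block(cin, cout):   # two transposed convolutional layers of one decoding layer
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.ConvTranspose2d(cout, cout, 3, padding=1), nn.ReLU())

class ConditionedUNet(nn.Module):
    def __init__(self, num_classes=527, base=64):
        super().__init__()
        chans = [base, base * 2, base * 4, base * 8]                       # 64, 128, 256, 512
        self.encoders = nn.ModuleList(
            enc_block(cin, c) for cin, c in zip([1] + chans[:-1], chans))
        self.pool = nn.MaxPool2d(2)                                        # pooling layers PL11-PL14
        self.enc_out = nn.Conv2d(chans[-1], chans[-1] * 2, 3, padding=1)   # CN51 -> 1024 channels
        self.dec_in = nn.Conv2d(chans[-1] * 2, chans[-1], 3, padding=1)    # CN52 -> 512 channels
        self.cond_proj = nn.Linear(num_classes, chans[-1])                 # condition injection (assumption)
        self.unpool = nn.Upsample(scale_factor=2)                          # anti-pooling layers UP11-UP14
        out_chans = [base * 4, base * 2, base, base // 2]                  # 256, 128, 64, 32
        self.decoders = nn.ModuleList(
            dec_block(c * 2, oc) for c, oc in zip(reversed(chans), out_chans))
        self.out = nn.Conv2d(base // 2, 1, 1)                              # 1x1 output convolution CN6

    def forward(self, spec, cond):
        # spec: (batch, 1, freq, time) spectrogram; height and width should be divisible by 16.
        skips, x = [], spec
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)              # feature map passed to the same-level decoding layer
            x = self.pool(x)
        x = torch.relu(self.dec_in(torch.relu(self.enc_out(x))))
        x = x + self.cond_proj(cond)[:, :, None, None]                     # condition the bottleneck
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = dec(torch.cat([self.unpool(x), skip], dim=1))              # anti-pool, merge skip, decode
        return self.out(x)                                                 # estimated source spectrogram
```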
Still other embodiments of the present disclosure provide a model training method for a neural network. Fig. 3 is a schematic flowchart of a method for model training of a neural network according to at least one embodiment of the present disclosure.
For example, as shown in fig. 3, a method for training a model of a neural network according to an embodiment of the present disclosure includes:
step S20: acquiring a training sample set;
step S21: and training the first neural network to be trained by using the training sample set to obtain the first neural network.
For example, the first neural network obtained by using the model training method provided by the embodiment of the present disclosure may be applied to the sound source separation method described in any of the above embodiments.
For example, in step S20, the training sample set includes a plurality of training data sets, each training data set includes a training mixed audio, a plurality of training audio segments, and a plurality of first training condition vectors, the training mixed audio includes a plurality of training audio segments, and the plurality of first training condition vectors are in one-to-one correspondence with the plurality of training audio segments.
For example, the plurality of training audio segments are different from each other, and the sound source types corresponding to the plurality of training audio segments are different from each other. The length of time for the plurality of training audio segments is the same.
For example, the training mixed audio is obtained by mixing a plurality of training audio segments, and the time length of the training mixed audio may be the sum of the time lengths of the plurality of training audio segments.
For example, the plurality of first training condition vectors are different.
For example, in step S21, the first neural network to be trained includes a loss function.
For example, in some embodiments, step S21 may include: acquiring a current training data set from the training sample set, wherein the current training data set comprises a current training mixed audio and a plurality of current training audio segments, and the current training mixed audio comprises the plurality of current training audio segments; determining a plurality of first current training condition vectors in one-to-one correspondence with the plurality of current training audio segments, wherein the current training data set further comprises the plurality of first current training condition vectors; inputting the current training mixed audio and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of first current training target sound sources; calculating a first loss value of the loss function of the first neural network to be trained according to the plurality of first current training target sound sources and the plurality of current training audio segments; and correcting parameters of the first neural network to be trained according to the first loss value, obtaining the trained first neural network when the loss function meets a preset condition, and continuing to input current training data sets to repeat the training process when the loss function does not meet the preset condition.
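A hedged sketch of one such training step is given below; the L1 regression loss and the optimizer usage are assumptions, since this disclosure only requires that the first neural network to be trained include a loss function.

```python
# Illustrative sketch of one training step for the first loss value.
# Loss choice (L1) and optimizer usage are assumptions.
import torch
import torch.nn.functional as F

def first_loss_step(first_nn, optimizer, mixed_audio, audio_segments, first_condition_vectors):
    optimizer.zero_grad()
    first_loss = 0.0
    for segment, cond in zip(audio_segments, first_condition_vectors):  # one-to-one correspondence
        estimate = first_nn(mixed_audio, cond)   # first current training target sound source
        first_loss = first_loss + F.l1_loss(estimate, segment)
    first_loss.backward()                        # correct the network parameters
    optimizer.step()
    return float(first_loss)
```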
For example, the current training mixed audio may be obtained by mixing a plurality of current training audio segments, and the time length of the current training mixed audio may be the sum of the time lengths of the plurality of current training audio segments.
For example, the time lengths of the plurality of current training audio segments are all the same. For example, the length of time for each current training audio segment may be 2 seconds, 3 seconds, etc. The length of time of each current training audio segment may be set by the user, and the present disclosure is not limited to the length of time of each current training audio segment.
For example, in some examples, the plurality of current training audio segments are different from each other, and the sound source types corresponding to the plurality of current training audio segments are different from each other, at which time the plurality of first current training condition vectors are also different from each other. It should be noted that each current training audio segment corresponds to only one sound source type.
For example, the plurality of first current training target sound sources are different from each other, and the plurality of first current training target sound sources are in one-to-one correspondence with the plurality of first current training condition vectors.
For example, each current training audio segment may include only one sound event, or may include a plurality of sound events (i.e., the current training audio segment is formed by mixing a plurality of sound events), and the sound source types corresponding to the plurality of sound events may be different.
For example, in still other embodiments, step S21 further includes: inputting the plurality of current training audio segments and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of second current training target sound sources, wherein the plurality of second current training target sound sources are in one-to-one correspondence with the plurality of current training audio segments; calculating a second loss value of the loss function of the first neural network to be trained according to the plurality of second current training target sound sources and the plurality of current training audio segments; and correcting the parameters of the first neural network to be trained according to the second loss value, obtaining the trained first neural network when the loss function satisfies the predetermined condition, and continuing to input current training data sets to repeat the above training process when the loss function does not satisfy the predetermined condition.
For example, the plurality of second current training target sound sources are different from each other.
For example, the time lengths of the plurality of second current training target sound sources are all the same.
For example, each training data set further includes a plurality of second training condition vectors corresponding one-to-one to the plurality of training audio segments. The current training data set further comprises a plurality of second current training condition vectors, the plurality of current training audio segments and the plurality of second current training condition vectors are in one-to-one correspondence, and the second current training condition vector corresponding to each current training audio segment is different from the first current training condition vector corresponding to each current training audio segment.
For example, in other embodiments, step S21 further includes: inputting the plurality of current training audio segments and the plurality of second current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of third current training target sound sources, wherein the plurality of third current training target sound sources are in one-to-one correspondence with the plurality of current training audio segments; calculating a third loss value of the loss function of the first neural network to be trained according to the plurality of third current training target sound sources and an all-zero vector; and correcting the parameters of the first neural network to be trained according to the third loss value, obtaining the trained first neural network when the loss function satisfies the predetermined condition, and continuing to input current training data sets to repeat the above training process when the loss function does not satisfy the predetermined condition.
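The three loss values described in the preceding paragraphs (separation against the mixed audio, the identity mapping, and the zero mapping) can be sketched together as follows. This is a non-authoritative example assuming a PyTorch-style model and an L1 loss; the equal weighting of the terms and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def combined_loss(model, mixture_spec, segment_specs, cond_vecs, other_cond_vecs):
    """First, second and third loss values for one current training data set.

    cond_vecs[i]       -- first current training condition vector of segment i
    other_cond_vecs[i] -- second current training condition vector of segment i
                          (different from the first one)
    """
    loss = 0.0
    for seg, c_j, c_not_j in zip(segment_specs, cond_vecs, other_cond_vecs):
        # (1) separate segment j from the mixed audio, conditioned on c_j
        loss = loss + F.l1_loss(model(mixture_spec, c_j), seg)
        # (2) identity mapping: the segment conditioned on its own vector
        loss = loss + F.l1_loss(model(seg, c_j), seg)
        # (3) zero mapping: a mismatched condition vector should yield silence
        loss = loss + F.l1_loss(model(seg, c_not_j), torch.zeros_like(seg))
    return loss
```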
For example, the plurality of third current training target sound sources are different from each other.
For example, the time lengths of the plurality of third current training target sound sources are all the same, and the time length of the all-zero vector may be the same as the time length of each third current training target sound source.
For example, in embodiments of the present disclosure, in some examples, the predetermined condition corresponds to a minimization of a loss function of the first neural network with a certain number of training data sets being input. In other examples, the predetermined condition is that the number of training times or training periods of the first neural network reaches a predetermined number, which may be millions, as long as the training data set is sufficiently large.
For example, in some embodiments, obtaining the current training mixed audio includes: respectively acquiring the plurality of current training audio segments; performing joint processing on the plurality of current training audio segments to obtain a first intermediate current training mixed audio; and performing spectral transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
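A rough sketch of this pipeline at the waveform level is given below. Because the disclosure describes the joint processing both as mixing (the s1 + s2 formulation used later) and as arranging the segments sequentially in time, both variants are shown; the function name and the assumption of equal-length segments are illustrative. The spectral transformation applied to the resulting first intermediate current training mixed audio is sketched further below.

```python
import numpy as np

def make_intermediate_mixed_audio(segments, mode="sum"):
    """Joint processing of the current training audio segments (waveforms).

    segments: list of 1-D numpy arrays, assumed to have equal length
              (e.g. 2-second segments at a 32 kHz sampling rate).
    mode:     "sum" adds the waveforms (the s1 + s2 formulation);
              "concat" arranges the segments sequentially in time.
    """
    if mode == "sum":
        return np.sum(np.stack(segments, axis=0), axis=0)
    return np.concatenate(segments, axis=0)
```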
For example, respectively acquiring the plurality of current training audio segments includes: processing a plurality of current training audio clips by using a second neural network to obtain a plurality of current training target sound anchors; and segmenting the plurality of current training audio clips based on the plurality of current training target sound anchors to obtain the plurality of current training audio segments in one-to-one correspondence with the plurality of current training target sound anchors.
For example, each current training audio segment includes a current training target sound anchor point corresponding to each current training audio segment.
For example, the second neural network is a neural network that has been trained. The second neural network may detect the current training audio clip using sound event detection techniques.
For example, a plurality of current training audio clips may be obtained from the AudioSet dataset.
For example, the length of time for a plurality of current training audio clips may be the same, e.g., 10 seconds each.
For example, processing the plurality of current training audio clips with the second neural network to obtain the plurality of current training target sound anchors includes: for each current training audio clip of the plurality of current training audio clips, processing the current training audio clip with the second neural network to obtain a current training sound anchor vector corresponding to the current training audio clip; determining at least one current training audio segment, among the plurality of current training audio segments, corresponding to the current training audio clip; and selecting, according to the at least one current training audio segment, at least one current training target sound anchor corresponding to the current training audio clip from the current training sound anchor vector, thereby obtaining the plurality of current training target sound anchors.
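The anchor-selection step can be sketched as follows, assuming the second neural network exposes its time-distributed probabilities as a (T, K) array; the function name and the hop parameter are illustrative assumptions.

```python
import numpy as np

def select_target_anchor(framewise_probs, target_class, hop_seconds):
    """Pick the current training target sound anchor for one audio clip.

    framewise_probs: (T, K) time-distributed probabilities from the second
                     neural network (sound event detection), K sound classes.
    target_class:    index of the sound source type of the desired segment.
    Returns the anchor time in seconds and its probability value.
    """
    frame = int(np.argmax(framewise_probs[:, target_class]))
    return frame * hop_seconds, float(framewise_probs[frame, target_class])
```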
For example, the second neural network may obtain a plurality of current training sound anchor vectors after processing the plurality of current training audio clips, where the plurality of current training sound anchor vectors are in one-to-one correspondence with the plurality of current training audio clips. The dimensions of the plurality of current training sound anchor vectors are the same.
For example, each current training anchor vector is a one-dimensional vector.
For example, each current training sound anchor vector may include R current training sound anchors, e.g., R may be 527 when multiple current training audio clips are each obtained from the AudioSet dataset.
For example, the plurality of current training audio segments are different from each other; the plurality of current training audio segments are in one-to-one correspondence with the plurality of current training audio clips, and are also in one-to-one correspondence with the plurality of current training target sound anchors. That is, only one current training audio segment is truncated from each current training audio clip.
For example, each current training target sound anchor may include a probability value corresponding to the current training target sound anchor and a point in time of the current training target sound anchor on the current training audio clip corresponding to the current training target sound anchor.
For example, the time length of each current training audio segment is less than the time length of the corresponding current training audio clip.
For example, the plurality of current training audio segments includes a first current training audio segment, and the first current training audio segment is truncated from a first current training audio clip among the plurality of current training audio clips. The sound source type corresponding to the first current training audio segment is piano sound. The plurality of current training sound anchor vectors includes a first current training sound anchor vector, and the first current training sound anchor vector corresponds to the first current training audio clip. The plurality of current training target sound anchors includes a first current training target sound anchor, and the first current training target sound anchor corresponds to the first current training audio segment. For example, the time length of the first current training audio segment may be denoted as t.
For the first current training audio clip, first, the first current training audio clip is processed with the second neural network to obtain the first current training sound anchor vector corresponding to the first current training audio clip; then, according to the first current training audio segment (i.e., according to its sound source type), the first current training target sound anchor corresponding to the first current training audio segment is selected from the first current training sound anchor vector, where the first current training target sound anchor represents the anchor with the highest probability of occurrence of piano sound in the first current training audio clip.
Then, based on the first current training target sound anchor, the first current training audio clip is segmented to obtain the first current training audio segment corresponding to the first current training target sound anchor. For example, the first current training audio segment may be obtained by intercepting a time length of t/2 on each side of the first current training target sound anchor within the first current training audio clip. It should be noted that, if the time length from the first current training target sound anchor to the first endpoint of the first current training audio clip and the time length from the first current training target sound anchor to the second endpoint of the first current training audio clip are both greater than or equal to t/2, the midpoint of the first current training audio segment is the first current training target sound anchor; if the time length from the first current training target sound anchor to the first endpoint of the first current training audio clip is t/3, and t/3 is smaller than t/2, then a length of t/3 is intercepted on the side of the first current training target sound anchor close to the first endpoint of the first current training audio clip and a length of 2t/3 is intercepted on the side close to the second endpoint, so as to obtain the first current training audio segment; in this case, the midpoint of the first current training audio segment lies on the side of the first current training target sound anchor close to the second endpoint of the first current training audio clip.
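A minimal sketch of this anchor-centered truncation, including the endpoint handling just described, might look as follows; the names and the use of sample indices are assumptions, and the clip is assumed to be at least t long.

```python
import numpy as np

def crop_around_anchor(waveform, sample_rate, anchor_seconds, segment_seconds):
    """Cut a current training audio segment of length t around the anchor.

    If the anchor is at least t/2 away from both clip endpoints, the anchor
    becomes the midpoint of the segment; otherwise the window is shifted so
    that the segment still has length t but stays inside the clip.
    """
    t = int(segment_seconds * sample_rate)      # segment length in samples
    anchor = int(anchor_seconds * sample_rate)
    start = anchor - t // 2
    # Shift the window when the anchor is too close to either endpoint.
    start = max(0, min(start, len(waveform) - t))
    return waveform[start:start + t]
```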
For example, "joint processing of multiple current training audio segments" may mean that the multiple current training audio segments are arranged sequentially in time to obtain a complete first intermediate current training mixed audio.
It should be noted that, when one current training audio segment includes a plurality of sound events and the plurality of sound events correspond to different sound source types, the plurality of sound events include a target sound event, and the sound source type corresponding to the target sound event is the sound source type corresponding to the current training audio segment; that is, if the sound source type corresponding to the current training audio segment is guitar sound, the target sound event is guitar sound. The current training audio segment is the segment, within the current training audio clip corresponding to it, that has the greatest probability of occurrence of the target sound event (i.e., guitar sound). When a current training audio segment includes only one sound event, that sound event is the target sound event.
For example, in some embodiments, determining the first plurality of current training condition vectors includes: and processing the plurality of current training audio clips by using a second neural network respectively to obtain a plurality of current training sound anchor vectors corresponding to the plurality of current training audio clips one by one, wherein the plurality of current training sound anchor vectors are used as a plurality of first current training condition vectors.
For example, the range of values for each element in each first current training condition vector may be [0,1].
It should be noted that, in other embodiments, each first current training condition vector may also be a vector consisting of 0s and 1s. That is, each first current training condition vector includes G current training type probability values; among the G current training type probability values, the one corresponding to the sound source type of the current training audio segment corresponding to that first current training condition vector is the target current training type probability value, the target current training type probability value is 1, and all the remaining current training type probability values are 0.
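For this 0/1 variant, the first current training condition vector can be built as a simple one-hot vector, as in the sketch below; the function name and dtype are illustrative.

```python
import numpy as np

def one_hot_condition_vector(target_type_index, num_types):
    """First current training condition vector made of 0s and a single 1.

    The element at the position of the sound source type of the corresponding
    current training audio segment (the target current training type
    probability value) is 1; the remaining G - 1 elements are 0.
    """
    vec = np.zeros(num_types, dtype=np.float32)
    vec[target_type_index] = 1.0
    return vec
```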
For example, the current training mixed audio is represented as a spectrogram.
For example, in some embodiments, performing spectral transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio includes: and performing short-time Fourier transform processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
For example, in other embodiments, performing spectral transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio includes: performing short-time Fourier transform processing on the first intermediate current training mixed audio to obtain a second intermediate current training mixed audio; and carrying out logarithmic Mel spectrum processing on the second intermediate current training mixed audio to obtain the current training mixed audio.
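A hedged sketch of both spectral-transformation variants is shown below, using librosa. The window size of 1024 and hop size of 320 samples at a 32 kHz sampling rate follow the experimental settings mentioned later in this document, while the choice of 64 mel bins, the magnitude spectrogram, and the small epsilon inside the logarithm are illustrative assumptions.

```python
import numpy as np
import librosa

def spectral_transform(waveform, sr=32000, use_log_mel=True):
    """Turn the first intermediate current training mixed audio into the
    current training mixed audio (a spectrogram)."""
    # Short-time Fourier transform processing.
    spec = np.abs(librosa.stft(waveform, n_fft=1024, hop_length=320)).T  # (frames, 513)
    if not use_log_mel:
        return spec
    # Logarithmic mel spectrum processing on the intermediate spectrogram.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=64)           # (64, 513)
    return np.log(spec @ mel_fb.T + 1e-10)                               # (frames, 64)
```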
Fig. 4 is a schematic diagram of a first current training audio clip, a first current training sound anchor vector corresponding to the first current training audio clip, a second current training audio clip, and a second current training sound anchor vector corresponding to the second current training audio clip according to at least one embodiment of the present disclosure. Fig. 5 is a schematic diagram of separating sound sources from a mixture of a first current training audio segment and a second current training audio segment in accordance with at least one embodiment of the present disclosure.
For example, in one particular embodiment, as shown in FIG. 4, the plurality of current training audio clips includes a first current training audio clip AU1 and a second current training audio clip AU2. The plurality of current training sound anchor vectors includes a first current training sound anchor vector Pro1 and a second current training sound anchor vector Pro2. The first current training sound anchor vector Pro1 corresponds to the first current training audio clip AU1, and the second current training sound anchor vector Pro2 corresponds to the second current training audio clip AU2. The plurality of current training audio segments includes a first current training audio segment s1 and a second current training audio segment s2; the first current training audio segment s1 is intercepted from the first current training audio clip AU1, and the second current training audio segment s2 is intercepted from the second current training audio clip AU2.
For example, as shown in fig. 4, the time length of the first current training audio clip AU1 and the time length of the second current training audio clip AU2 are each 10 seconds(s).
For example, the first current training audio clip AU1 may include a gunshot, and the sound source type corresponding to the first current training audio segment s1 is gunshot; that is, an audio segment related to the gunshot needs to be intercepted from the first current training audio clip AU1. The second current training audio clip AU2 may include a bell sound, and the sound source type corresponding to the second current training audio segment s2 is bell sound; that is, an audio segment related to the bell sound needs to be intercepted from the second current training audio clip AU2.
Note that, in fig. 4, the first current training sound anchor vector Pro1 only shows the probability distribution of the gunshot, and the second current training sound anchor vector Pro2 only shows the probability distribution of the bell sound.
For example, the first current training target sound anchor corresponding to the first current training audio segment s1 in the first current training sound anchor vector Pro1 is tt1, and the second current training target sound anchor corresponding to the second current training audio segment s2 in the second current training sound anchor vector Pro2 is tt2. Since the first current training target sound anchor is tt1, the first current training audio segment s1 can be obtained by intercepting the first current training audio clip AU1 around tt1 based on a preset time length (for example, 2 seconds); and since the second current training target sound anchor is tt2, the second current training audio segment s2 can be obtained by intercepting the second current training audio clip AU2 around tt2 based on the preset time length.
For example, as shown in fig. 5, during model training, the first neural network to be trained may perform sound source separation processing on the current training mixed audio (obtained by mixing the first current training audio segment s1 and the second current training audio segment s2) and a first current training condition vector cj, so as to obtain a first current training target sound source sj. In the example shown in fig. 5, the first current training condition vector cj may be the condition vector corresponding to the first current training audio segment s1. If the first neural network to be trained were already trained, the first current training target sound source sj and the first current training audio segment s1 should be the same; if it is not yet trained, a first loss value of the loss function of the first neural network to be trained may be calculated from the first current training target sound source sj and the first current training audio segment s1, and the parameters of the first neural network to be trained may then be corrected according to the first loss value.
The model training method of the present disclosure is described below by taking the examples shown in fig. 4 and 5 as examples.
For example, in embodiments of the present disclosure, a single sound source separation system may be constructed for separating all types of sound sources. First, audio clips corresponding to two sound source types (e.g., gunshot and bell sound), i.e., the first current training audio clip AU1 and the second current training audio clip AU2 shown in fig. 4, are randomly selected from the AudioSet dataset. For each current training audio clip (the first current training audio clip AU1 or the second current training audio clip AU2), a sound event detection system may be applied to detect the current training audio segment containing the sound event (i.e., the first current training audio segment s1 or the second current training audio segment s2 shown in fig. 4). Then, a first current training condition vector c1 may be set for the first current training audio segment s1, and a first current training condition vector c2 may be set for the second current training audio segment s2. During model training, the sound source separation system can be described as:
sj = f(s1 + s2, cj), j = 1, 2,
where f(s1 + s2, cj) denotes the sound source separation system, s1 + s2 is the current training mixed audio, and sj is the first current training target sound source corresponding to the first current training condition vector cj. The equation shows that the separation result sj depends on the input current training mixed audio and on the first current training condition vector cj. The first current training condition vector cj should therefore contain information about the sound source to be separated, i.e., the first current training target sound source.
Note that the AudioSet dataset is weakly labelled. That is, in the AudioSet dataset, each 10-second audio clip (such as the first current training audio clip AU1 and the second current training audio clip AU2) is only marked with the presence or absence of sound events, without their occurrence times. The sound source separation system, however, requires that the spectrogram of each audio segment used for training (i.e., the first current training audio segment s1 or the second current training audio segment s2 shown in fig. 4) actually contains the sound event, and the weakly labelled clip (the first current training audio clip AU1 or the second current training audio clip AU2) provides no time information for locating such a segment. To address this problem, a sound event detection system may be trained using the weakly labelled AudioSet dataset. For a given sound event, the sound event detection system detects the point in time at which the sound event occurs in a 10-second audio clip. The sound source separation system is then trained on the audio segments, containing the sound events, that are truncated based on these points in time.
To train the sound event detection system with data from the weakly labelled AudioSet dataset, a log mel spectrogram is used as the feature of each audio clip, and a neural network is applied to the log mel spectrogram. In order to predict the presence probability of sound events over time, a time-distributed fully connected layer is applied to the feature map of the last convolutional layer, so that a fixed number of sound classes is output at every time step, and a sigmoid function is then applied to obtain the time-distributed prediction probability of the sound events. The time-distributed prediction probability O(t) satisfies O(t) ∈ [0, 1]^K, t = 1, …, T, where T is the number of time steps of the time-distributed fully connected layer and K is the number of sound classes. In training, the clip-level probability prediction is obtained by pooling the time-distributed prediction probability O(t) over time; the pooling function may be a maximum pooling function, i.e., p_k = max_t O_k(t) for each sound class k. The audio segments selected with the maximum pooling function contain, with high accuracy, the sound events rather than irrelevant sounds. Furthermore, a two-class (binary) cross-entropy loss function, for example loss = −Σ_k [y_k·log p_k + (1 − y_k)·log(1 − p_k)] with weak labels y_k ∈ {0, 1}, is used to calculate the loss value of the sound event detection system.
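The time-distributed prediction, maximum pooling and binary cross-entropy described above can be sketched as follows in PyTorch; the convolutional feature extractor producing conv_features is omitted, the class count of 527 follows the AudioSet setting, and everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SedHead(nn.Module):
    """Time-distributed classification head for the sound event detection
    system: framewise probabilities O(t) and a max-pooled clip prediction."""

    def __init__(self, feature_dim, num_classes=527):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)  # time-distributed FC

    def forward(self, conv_features):
        # conv_features: (batch, T, feature_dim) from the last conv layer
        framewise = torch.sigmoid(self.fc(conv_features))  # O(t) in [0, 1]^K
        clipwise, _ = torch.max(framewise, dim=1)           # max pooling over time
        return framewise, clipwise

# Binary (two-class) cross entropy between the clip prediction and weak labels.
bce = nn.BCELoss()
# loss = bce(clipwise, weak_labels)   # weak_labels: (batch, 527) in {0, 1}
```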
For example, a 10-second audio clip in the AudioSet dataset is used as the input to the sound event detection system, thereby obtaining the time-distributed prediction O(t) of sound events (vector Pro1 or vector Pro2 shown in fig. 4). For a given sound category, the time step with the highest probability is selected as the anchor. Then, an anchor-centered audio segment (s1 or s2 above) is selected to train the sound source separation system.
For example, in embodiments of the present disclosure, training the sound source separation system does not necessarily require clean sound sources; by properly setting the condition vectors, the sound source separation system can be trained on the audio segments derived from the AudioSet dataset (i.e., the first current training audio segment s1 and the second current training audio segment s2 described above). For example, the first current training sound anchor vector Pro1 obtained by processing the first current training audio clip AU1 with the sound event detection system is used as the condition vector of the first current training audio segment s1 (i.e., the first current training condition vector c1 described above), and the second current training sound anchor vector Pro2 obtained by processing the second current training audio clip AU2 with the sound event detection system is used as the condition vector of the second current training audio segment s2 (i.e., the first current training condition vector c2 described above). The first current training sound anchor vector Pro1 may represent the sound events included in the first current training audio clip AU1 and their presence probabilities, and the second current training sound anchor vector Pro2 may represent the sound events included in the second current training audio clip AU2 and their presence probabilities. The first current training sound anchor vector Pro1 and the second current training sound anchor vector Pro2 may reflect the sound events in the first current training audio segment s1 and the second current training audio segment s2 better than the labels of the first current training audio clip AU1 or the second current training audio clip AU2. In training, the following regressions may be learned for the sound source separation system:
f(s1 + s2, cj) → sj,  (1)
f(sj, cj) → sj,  (2)
f(sj, c−j) → 0.  (3)
In the above equations (1) to (3), j is 1 or 2. Equation (1) represents learning to separate, from the current training mixed audio, the first current training target sound source sj conditioned on the first current training condition vector cj. Equation (2) represents learning an identity mapping, that is, the sound source separation system should learn to output a sound source when it is given that sound source as input and conditioned on its own condition vector, which reduces the distortion of the separated signal. Equation (3) represents a zero mapping, that is, if the system is conditioned on a second current training condition vector c−j that is different from the first current training condition vector cj, an all-zero vector 0 (i.e., no sound) should be output.
For example, given a training audio clip of arbitrary length, the audio tagging system is first used to predict whether sound events are present in the audio clip, yielding a candidate list of sound categories corresponding to the audio clip. For each sound k in the candidate list, a condition vector ck = {0, …, 0, 1, 0, …, 0} is set, where the k-th element of ck is 1 and the other elements are 0. Based on this condition vector ck, a clean target sound source can be separated even if the audio clip used during training includes a plurality of sound events (i.e., is an unclean audio clip).
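A sketch of this tag-then-separate procedure is given below; tagger and separator are hypothetical callables standing in for the audio tagging system and the (trained) first neural network, and the 0.5 presence threshold is an assumption.

```python
import numpy as np

def separate_all_detected_sources(separator, tagger, mixture_spec, threshold=0.5):
    """Separate one clean target sound source per detected sound category.

    tagger(mixture_spec) is assumed to return a vector of 527 presence
    probabilities; separator(mixture_spec, ck) returns one separated source.
    """
    probs = tagger(mixture_spec)
    outputs = {}
    for k in np.where(probs >= threshold)[0]:   # candidate list of sounds
        ck = np.zeros_like(probs)
        ck[k] = 1.0                             # condition vector ck = {0,...,1,...,0}
        outputs[int(k)] = separator(mixture_spec, ck)
    return outputs
```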
For example, in some embodiments, experiments were performed on the above-described sound source separation system based on the AudioSet dataset. The AudioSet dataset is a large-scale audio dataset with 527 sound categories. The training set in the AudioSet dataset includes 2,063,839 audio clips and a balanced subset of 22,050 audio clips; the evaluation set contains 20,371 audio clips. Most audio clips have a duration (i.e., time length) of 10 seconds. All audio clips are converted to mono at a sampling rate of 32 kHz. The sound event detection system is trained on the complete training set. The log mel spectrogram of an audio clip is extracted using an STFT window of 1024 samples and a hop size of 320 samples (i.e., a time resolution of 0.01 s). The number of mel frequency bins is set to less than 64. The convolutional neural network of the sound event detection system includes 13 layers, and a time-distributed fully connected layer is applied to the last convolutional layer to obtain the time-distributed probability of sound events. The convolutional neural network further includes 5 pooling layers, each with a kernel size of 2x2, which gives the sound event detection system a time resolution of 0.32 seconds. An Adam optimizer (learning rate α = 0.001, decay coefficients β1 = 0.9 and β2 = 0.999) is used to train the sound event detection system.
For an audio clip containing a certain sound category, an anchor is obtained by taking the maximum prediction probability of the sound event corresponding to that category from the sound event detection system; an audio segment is then determined based on the anchor (the audio segment covers 5 adjacent time frames, so that it has a time length of 1.6 seconds).
The sound source separation system may employ a U-net network. The inputs to the U-net network are the current training mixed audio (mixed from the first current training audio segment s1 and the second current training audio segment s2 and represented as a spectrogram, which can be obtained by applying an STFT to the sound waveform with a window size of 1024 and a hop size of 256 samples) and the first current training condition vector. The first current training condition vector is transformed by a fully connected layer and added as a bias to each convolutional layer of the U-net network; the first current training condition vector thus determines which sound source is to be separated. For example, an Adam optimizer (learning rate α = 0.001, decay coefficients β1 = 0.9 and β2 = 0.999) is used to train the sound source separation system.
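The conditioning mechanism described above (the condition vector transformed by a fully connected layer and added as a bias to each convolutional layer of the U-net) can be sketched as follows; the block structure, activation, and names are assumptions rather than the exact architecture of the disclosure.

```python
import torch
import torch.nn as nn

class ConditionedConvBlock(nn.Module):
    """One U-net convolution block whose bias is produced from the first
    current training condition vector through a fully connected layer."""

    def __init__(self, in_ch, out_ch, cond_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.to_bias = nn.Linear(cond_dim, out_ch)  # transforms the condition vector

    def forward(self, x, cond):
        bias = self.to_bias(cond)                   # (batch, out_ch)
        # Add the transformed condition vector as a per-channel offset.
        return torch.relu(self.conv(x) + bias[:, :, None, None])
```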
Some embodiments of the present disclosure further provide a sound source separation device. Fig. 6 is a schematic block diagram of a sound source separation device according to at least one embodiment of the present disclosure.
As shown in fig. 6, the sound source separation device 60 includes a memory 610 and a processor 620. Memory 610 is used to non-transitory store computer-readable instructions. The processor 620 is configured to execute computer readable instructions that when executed by the processor 620 perform the sound source separation method provided by any of the embodiments of the present disclosure.
For example, the memory 610 and the processor 620 may communicate with each other directly or indirectly. For example, the components of memory 610 and processor 620 may communicate over a network connection. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. The network may include a local area network, the internet, a telecommunications network, an internet of things (Internet of Things) based on the internet and/or telecommunications network, any combination of the above, and/or the like. The wired network may use twisted pair, coaxial cable or optical fiber transmission, and the wireless network may use 3G/4G/5G mobile communication network, bluetooth, zigbee or WiFi, for example. The present disclosure is not limited herein with respect to the type and functionality of the network.
For example, the processor 620 may control other components in the sound source separation device 60 to perform desired functions. The processor 620 may be a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), a Graphics Processing Unit (GPU), or another device having data processing and/or program execution capabilities. The Central Processing Unit (CPU) may have an X86 or ARM architecture. The GPU may be integrated on the motherboard separately or built into the north bridge chip of the motherboard.
For example, memory 610 may comprise any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, random Access Memory (RAM) and/or cache memory (cache) and the like. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer instructions may be stored on memory 610 and executed by processor 620 to perform various functions. Various applications and various data may also be stored in memory 610, such as training data sets, training reference tags, and various data used and/or generated by the applications, among others.
For example, in some embodiments, the sound source separation device 60 may also include a sound collection device, which may be, for example, a microphone or the like. The sound collection device is used for collecting original mixed audio.
For example, the detailed description of the processing procedure of the sound source separation method may refer to the related description in the above-described embodiment of the sound source separation method, and the repetition is not repeated.
It should be noted that the sound source separation device provided by the embodiments of the present disclosure is exemplary and not limited, and the sound source separation device may further include other conventional components or structures according to practical application requirements, for example, to implement the necessary functions of the sound source separation device, those skilled in the art may set other conventional components or structures according to specific application scenarios, and the embodiments of the present disclosure are not limited thereto.
Technical effects of the sound source separation device provided in the embodiments of the present disclosure may refer to corresponding descriptions about the sound source separation method in the above embodiments, and are not described herein.
Some embodiments of the present disclosure further provide a model training apparatus. Fig. 7 is a schematic block diagram of a model training apparatus provided in at least one embodiment of the present disclosure.
As shown in fig. 7, model training apparatus 70 includes a memory 710 and a processor 720. Memory 710 is used to non-transitory store computer readable instructions. Processor 720 is configured to execute computer readable instructions that when executed by processor 720 perform the model training method provided by any of the embodiments of the present disclosure.
For example, the processor 720 may control other components in the model training device 70 to perform desired functions. The processor 720 may be a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), a Graphics Processing Unit (GPU), or another device having data processing and/or program execution capabilities. The Central Processing Unit (CPU) may have an X86 or ARM architecture. The GPU may be integrated on the motherboard separately or built into the north bridge chip of the motherboard.
For example, the memory 710 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disks, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-readable instructions may be stored on the memory 710 and executed by the processor 720 to implement the various functions of the model training device 70.
It should be noted that, the detailed description of the process of performing the model training by the model training device 70 may refer to the related description in the embodiment of the model training method, and the repetition is omitted.
Some embodiments of the present disclosure also provide a storage medium. Fig. 8 is a schematic block diagram of a storage medium provided in at least one embodiment of the present disclosure. For example, as shown in FIG. 8, one or more computer-readable instructions 801 may be stored non-transitory on storage medium 800. For example, a portion of the computer readable instructions 801, when executed by a computer, may perform one or more steps in a sound source separation method according to the above; another portion of the computer readable instructions 801, when executed by a computer, may perform one or more steps in accordance with the model training method described above.
For example, the storage medium 800 may be applied to the above-described sound source separation device 60 and/or model training device 70; for example, it may be the memory 610 in the sound source separation device 60 and/or the memory 710 in the model training device 70.
For example, the description of the storage medium 800 may refer to the description of the memory in the embodiments of the sound source separation device 60 and/or the model training device 70, and the repetition is omitted.
Referring now to fig. 9, a schematic diagram of a configuration of an electronic device 600 suitable for use in implementing embodiments of the present disclosure (e.g., the electronic device may include the sound source separation device described in the above embodiments) is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 9 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 9, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, a magnetic tape, a hard disk, and the like; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. While fig. 9 shows an electronic device 600 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that in the context of this disclosure, a computer-readable medium can be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
According to one or more embodiments of the present disclosure, a sound source separation method includes: acquiring mixed audio; determining a sound source tag group corresponding to the mixed audio; determining a condition vector group according to the sound source tag group; and inputting the condition vector group and the mixed audio to a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group are in one-to-one correspondence with the condition vectors of the condition vector group.
According to one or more embodiments of the present disclosure, the condition vector group includes a plurality of condition vectors, and inputting the condition vector group and the mixed audio to the first neural network for sound source separation processing to obtain the target sound source group includes: determining a plurality of input data sets according to the mixed audio and the plurality of condition vectors, wherein the plurality of input data sets are in one-to-one correspondence with the plurality of condition vectors, and each of the plurality of input data sets comprises the mixed audio and one of the plurality of condition vectors; and respectively carrying out sound source separation processing on the plurality of input data sets by using the first neural network to obtain a target sound source set, wherein the target sound source set comprises a plurality of target sound sources which are in one-to-one correspondence with the plurality of condition vectors, the plurality of input data sets are in one-to-one correspondence with the plurality of target sound sources, and each target sound source in the plurality of target sound sources is in correspondence with the condition vector in the input data set corresponding to each target sound source.
According to one or more embodiments of the present disclosure, the plurality of condition vectors are different.
According to one or more embodiments of the present disclosure, obtaining mixed audio includes: acquiring original mixed audio; and performing spectrum transformation processing on the original mixed audio to obtain mixed audio.
In accordance with one or more embodiments of the present disclosure, performing spectral transformation processing on the original mixed audio to obtain mixed audio includes: and performing short-time Fourier transform processing on the original mixed audio to obtain the mixed audio.
In accordance with one or more embodiments of the present disclosure, performing spectral transformation processing on the original mixed audio to obtain mixed audio includes: performing short-time Fourier transform processing on the original mixed audio to obtain intermediate mixed audio; the intermediate mixed audio is subjected to logarithmic mel-frequency spectrum processing to obtain mixed audio.
In accordance with one or more embodiments of the present disclosure, determining a sound source tag group corresponding to mixed audio includes: performing sound event detection on the mixed audio by using a second neural network to determine a sound event group included in the mixed audio; a set of sound source tags is determined from the set of sound events.
According to one or more embodiments of the present disclosure, the condition vector group includes at least one condition vector, the sound source tag group includes at least one sound source tag, and the at least one condition vector is in one-to-one correspondence with the at least one sound source tag; each of the at least one condition vector includes N type probability values, among which the type probability value corresponding to the sound source type corresponding to the sound source tag corresponding to that condition vector is a target type probability value, the target type probability value is 1, and the remaining type probability values other than the target type probability value among the N type probability values are all 0, where N is a positive integer.
According to one or more embodiments of the present disclosure, the first neural network is a U-shaped neural network.
According to one or more embodiments of the present disclosure, a model training method of a neural network includes: acquiring a training sample set, wherein the training sample set includes a plurality of training data sets, each training data set includes a training mixed audio, a plurality of training audio segments, and a plurality of first training condition vectors, the training mixed audio includes the plurality of training audio segments, and the plurality of first training condition vectors are in one-to-one correspondence with the plurality of training audio segments; and training a first neural network to be trained by using the training sample set to obtain a first neural network, wherein the first neural network to be trained includes a loss function, and training the first neural network to be trained by using the training sample set to obtain the first neural network includes: acquiring a current training data set from the training sample set, wherein the current training data set includes a current training mixed audio and a plurality of current training audio segments, and the current training mixed audio includes the plurality of current training audio segments; determining a plurality of first current training condition vectors in one-to-one correspondence with the plurality of current training audio segments, wherein the current training data set further includes the plurality of first current training condition vectors; inputting the current training mixed audio and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of first current training target sound sources; calculating a first loss value of the loss function of the first neural network to be trained according to the plurality of first current training target sound sources and the plurality of current training audio segments; and correcting parameters of the first neural network to be trained according to the first loss value, obtaining the trained first neural network when the loss function satisfies a predetermined condition, and continuing to input current training data sets to repeat the training process when the loss function does not satisfy the predetermined condition.
According to one or more embodiments of the present disclosure, training the first neural network to be trained using the training sample set to obtain the first neural network further includes: inputting the plurality of current training audio segments and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of second current training target sound sources, wherein the plurality of second current training target sound sources are in one-to-one correspondence with the plurality of current training audio segments; calculating a second loss value of the loss function of the first neural network to be trained according to the plurality of second current training target sound sources and the plurality of current training audio segments; and correcting parameters of the first neural network to be trained according to the second loss value, obtaining the trained first neural network when the loss function satisfies the predetermined condition, and continuing to input current training data sets to repeat the training process when the loss function does not satisfy the predetermined condition.
According to one or more embodiments of the present disclosure, each training data set further includes a plurality of second training condition vectors in one-to-one correspondence with the plurality of training audio segments, the current training data set further includes a plurality of second current training condition vectors, the plurality of current training audio segments and the plurality of second current training condition vectors are in one-to-one correspondence, and the second current training condition vector corresponding to each current training audio segment is different from the first current training condition vector corresponding to that current training audio segment; training the first neural network to be trained using the training sample set to obtain the first neural network further includes: inputting the plurality of current training audio segments and the plurality of second current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of third current training target sound sources, wherein the plurality of third current training target sound sources are in one-to-one correspondence with the plurality of current training audio segments; calculating a third loss value of the loss function of the first neural network to be trained according to the plurality of third current training target sound sources and an all-zero vector; and correcting parameters of the first neural network to be trained according to the third loss value, obtaining the trained first neural network when the loss function satisfies the predetermined condition, and continuing to input current training data sets to repeat the training process when the loss function does not satisfy the predetermined condition.
In accordance with one or more embodiments of the present disclosure, obtaining the current training mixed audio includes: respectively acquiring the plurality of current training audio segments; performing joint processing on the plurality of current training audio segments to obtain a first intermediate current training mixed audio; and performing spectral transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
According to one or more embodiments of the present disclosure, separately acquiring the plurality of current training audio clips includes: processing a plurality of current training audios with a second neural network to obtain a plurality of current training target sound anchors; and segmenting the plurality of current training audios based on the plurality of current training target sound anchors to obtain the plurality of current training audio clips corresponding one-to-one to the plurality of current training target sound anchors, wherein each current training audio clip includes the current training target sound anchor corresponding to that current training audio clip.
In accordance with one or more embodiments of the present disclosure, processing the plurality of current training audios with the second neural network to obtain the plurality of current training target sound anchors includes: for each current training audio of the plurality of current training audios, processing that current training audio with the second neural network to obtain a current training sound anchor vector corresponding to that current training audio; determining at least one current training audio clip, of the plurality of current training audio clips, corresponding to that current training audio; and selecting, according to the at least one current training audio clip, at least one current training target sound anchor corresponding to that current training audio from the current training sound anchor vector, thereby obtaining the plurality of current training target sound anchors.
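One plausible reading of the anchor selection described above, sketched under the assumption that the second neural network outputs frame-wise sound class probabilities: the anchor is taken as the frame where the probability of the target class peaks, and a fixed-length current training audio clip is cut around it. The frame hop, sample rate and clip length are illustrative values only.

```python
import numpy as np

def select_anchor_and_clip(waveform, framewise_probs, class_index,
                           sr=16000, hop_length=512, clip_seconds=2.0):
    """Pick a current training target sound anchor and cut the clip around it (illustrative).

    waveform        : 1-D numpy array, one current training audio
    framewise_probs : (frames, classes) array, frame-wise output of the second neural network
    class_index     : index of the target sound class for this audio
    """
    # Anchor: frame with the highest predicted probability for the target class.
    anchor_frame = int(np.argmax(framewise_probs[:, class_index]))
    anchor_sample = anchor_frame * hop_length
    # Cut a fixed-length current training audio clip centred on the anchor.
    half = int(clip_seconds * sr / 2)
    start = max(0, anchor_sample - half)
    end = min(len(waveform), start + 2 * half)
    return anchor_sample, waveform[start:end]
```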
According to one or more embodiments of the present disclosure, the plurality of current training audios are different from each other, and the plurality of current training audios correspond one-to-one to the plurality of current training audio clips and also one-to-one to the plurality of current training target sound anchors.
In accordance with one or more embodiments of the present disclosure, determining the plurality of first current training condition vectors includes: processing the plurality of current training audio clips respectively with the second neural network to obtain a plurality of current training sound anchor vectors corresponding one-to-one to the plurality of current training audio clips, and using the plurality of current training sound anchor vectors as the plurality of first current training condition vectors.
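A non-limiting sketch of this step, under the same assumptions as the anchor-selection example above: the second network's frame-wise predictions for each current training audio clip are pooled over time and reused directly as that clip's first current training condition vector. The time-average pooling is an assumption for illustration; the disclosure does not fix how the sound anchor vector is derived.

```python
import numpy as np

def condition_vectors_from_clips(clip_framewise_probs):
    """Use the second network's predictions for each clip as its condition vector (illustrative).

    clip_framewise_probs : list of (frames, classes) arrays, one per current training audio clip.
    Returns an (N, classes) array of first current training condition vectors.
    """
    # Average the frame-wise class probabilities over time to obtain one
    # current training sound anchor vector per clip.
    return np.stack([probs.mean(axis=0) for probs in clip_framewise_probs], axis=0)
```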
In accordance with one or more embodiments of the present disclosure, performing spectral transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio includes: performing short-time Fourier transform processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
In accordance with one or more embodiments of the present disclosure, performing spectral transformation processing on a first intermediate current training mixed audio to obtain a current training mixed audio includes: performing short-time Fourier transform processing on the first intermediate current training mixed audio to obtain a second intermediate current training mixed audio; and carrying out logarithmic Mel spectrum processing on the second intermediate current training mixed audio to obtain the current training mixed audio.
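A minimal sketch of this spectral transformation using librosa; the FFT size, hop length and number of mel bands are assumptions for illustration and are not fixed by the disclosure.

```python
import numpy as np
import librosa

def log_mel_spectrogram(waveform, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    """Short-time Fourier transform followed by a logarithmic mel spectrum (illustrative)."""
    # Short-time Fourier transform -> second intermediate current training mixed audio.
    stft = librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)
    power = np.abs(stft) ** 2
    # Logarithmic mel-spectrum processing -> current training mixed audio.
    mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-10)
```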
According to one or more embodiments of the present disclosure, the time lengths of the plurality of current training audio clips are all the same.
According to one or more embodiments of the present disclosure, a sound source separation apparatus includes: a memory for non-transitory storage of computer readable instructions; and a processor for executing computer readable instructions which when executed by the processor perform the sound source separation method according to any of the embodiments described above.
According to one or more embodiments of the present disclosure, a model training apparatus includes: a memory for non-transitory storage of computer readable instructions; and a processor for executing computer readable instructions that when executed by the processor perform a model training method according to any of the embodiments described above.
According to one or more embodiments of the present disclosure, a storage medium non-transitory stores computer readable instructions, which when executed by a computer, may perform the sound source separation method according to any of the embodiments described above.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. Persons skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.
For the purposes of this disclosure, the following points are also noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) In the drawings used to describe the embodiments of the present disclosure, thicknesses and dimensions of layers or structures are exaggerated for clarity. It will be understood that when an element such as a layer, film, region or substrate is referred to as being "on" or "under" another element, it can be "directly on" or "under" the other element, or intervening elements may be present.
(3) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely a specific embodiment of the disclosure, but the scope of the disclosure is not limited thereto and should be determined by the scope of the claims.

Claims (22)

1. A sound source separation method comprising:
acquiring mixed audio;
determining a sound source tag group corresponding to the mixed audio, wherein the sound source tag group comprises at least one sound source tag;
determining a condition vector group according to the sound source tag group, wherein the condition vector group comprises at least one condition vector, each condition vector of the at least one condition vector comprises N type probability values for separating N sound source types, the N sound source types correspond one-to-one to N sound source tags, and the N type probability values respectively represent the probabilities of the sound source types corresponding to the N sound source tags;
inputting the condition vector group and the mixed audio to a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group are in one-to-one correspondence with the condition vectors of the condition vector group,
wherein the set of condition vectors comprises a plurality of condition vectors,
inputting the condition vector group and the mixed audio to a first neural network for sound source separation processing to obtain the target sound source group comprises:
determining a plurality of input data sets according to the mixed audio and the plurality of condition vectors, wherein the plurality of input data sets are in one-to-one correspondence with the plurality of condition vectors, and each input data set of the plurality of input data sets comprises the mixed audio and one condition vector of the plurality of condition vectors;
and respectively performing sound source separation processing on the plurality of input data sets by using the first neural network to obtain the target sound source group, wherein the target sound source group comprises a plurality of target sound sources in one-to-one correspondence with the plurality of condition vectors, the plurality of input data sets are in one-to-one correspondence with the plurality of target sound sources, and each target sound source of the plurality of target sound sources corresponds to the condition vector in the input data set corresponding to that target sound source.
2. The sound source separation method according to claim 1, wherein the plurality of condition vectors are different from each other.
3. The sound source separation method according to claim 1, wherein acquiring the mixed audio includes:
acquiring original mixed audio;
and performing spectral transformation processing on the original mixed audio to obtain the mixed audio.
4. The sound source separation method according to claim 3, wherein performing spectral transformation processing on the original mixed audio to obtain the mixed audio comprises:
performing short-time Fourier transform processing on the original mixed audio to obtain the mixed audio.
5. The sound source separation method according to claim 3, wherein performing spectral transformation processing on the original mixed audio to obtain the mixed audio comprises:
performing short-time Fourier transform processing on the original mixed audio to obtain intermediate mixed audio;
and carrying out logarithmic Mel spectrum processing on the intermediate mixed audio to obtain the mixed audio.
6. The sound source separation method of claim 1, wherein determining the sound source tag group corresponding to the mixed audio comprises:
performing sound event detection on the mixed audio by using a second neural network to determine a sound event group included in the mixed audio;
and determining the sound source tag group according to the sound event group.
7. The sound source separation method according to claim 1, wherein the at least one condition vector corresponds one-to-one to the at least one sound source tag,
among the N type probability values, the type probability value corresponding to the sound source type of the sound source tag corresponding to each condition vector is a target type probability value, the target type probability value is 1, and the remaining type probability values of the N type probability values other than the target type probability value are all 0, wherein N is a positive integer.
8. The sound source separation method according to any one of claims 1 to 7, wherein the first neural network is a U-shaped neural network.
9. A model training method of a neural network, comprising:
acquiring a training sample set, wherein the training sample set comprises a plurality of training data sets, each training data set comprises a training mixed audio, a plurality of training audio clips and a plurality of first training condition vectors, the training mixed audio comprises the plurality of training audio clips, the plurality of first training condition vectors correspond one-to-one to the plurality of training audio clips, each first training condition vector comprises N type probability values for separating N sound source types, the N sound source types correspond one-to-one to N training audio clips, and the N type probability values respectively represent the probabilities of the N sound source types;
training a first neural network to be trained by using the training sample set to obtain a first neural network, wherein the first neural network to be trained comprises a loss function,
wherein training the first neural network to be trained by using the training sample set to obtain the first neural network includes:
acquiring a current training data set from the training sample set, wherein the current training data set comprises a current training mixed audio and a plurality of current training audio clips, and the current training mixed audio comprises the plurality of current training audio clips;
determining a plurality of first current training condition vectors in one-to-one correspondence with the plurality of current training audio clips, wherein the current training data set further comprises the plurality of first current training condition vectors,
inputting the current training mixed audio and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of first current training target sound sources, wherein the plurality of first current training target sound sources are in one-to-one correspondence with the plurality of first current training condition vectors;
calculating a first loss value of the loss function of the first neural network to be trained according to the plurality of first current training target sound sources and the plurality of current training audio clips;
correcting the parameters of the first neural network to be trained according to the first loss value, obtaining the trained first neural network when the loss function meets a preset condition, and continuing to input current training data sets to repeat the above training process when the loss function does not meet the preset condition;
the method for obtaining the current training target sound source comprises the steps of:
and performing sound source separation processing on the current training mixed audio and a first current training condition vector corresponding to the current training mixed audio in the plurality of first current training condition vectors by using the first neural network to be trained to obtain a first current training target sound source corresponding to a first current training condition vector corresponding to the current training mixed audio in the plurality of first current training target sound sources.
10. The model training method of claim 9, wherein training the first neural network to be trained using the training sample set to obtain a first neural network further comprises:
inputting the plurality of current training audio clips and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of second current training target sound sources, wherein the plurality of second current training target sound sources correspond one-to-one to the plurality of current training audio clips;
calculating a second loss value of the loss function of the first neural network to be trained according to the plurality of second current training target sound sources and the plurality of current training audio clips;
and correcting the parameters of the first neural network to be trained according to the second loss value, obtaining the trained first neural network when the loss function meets the preset condition, and continuing to input current training data sets to repeat the above training process when the loss function does not meet the preset condition.
11. The model training method according to claim 9 or 10, wherein each training data set further includes a plurality of second training condition vectors corresponding one-to-one to the plurality of training audio clips,
the current training data set further includes a plurality of second current training condition vectors corresponding one-to-one to the plurality of current training audio clips, and the second current training condition vector corresponding to each current training audio clip is different from the first current training condition vector corresponding to that current training audio clip,
and training the first neural network to be trained by using the training sample set to obtain the first neural network further comprises:
inputting the plurality of current training audio clips and the plurality of second current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of third current training target sound sources, wherein the plurality of third current training target sound sources correspond one-to-one to the plurality of current training audio clips;
calculating a third loss value of the loss function of the first neural network to be trained according to the plurality of third current training target sound sources and all-zero vectors;
and correcting the parameters of the first neural network to be trained according to the third loss value, obtaining the trained first neural network when the loss function meets the preset condition, and continuing to input current training data sets to repeat the above training process when the loss function does not meet the preset condition.
12. The model training method of claim 9, wherein obtaining the current training mixed audio comprises:
respectively acquiring the plurality of current training audio clips;
performing joint processing on the plurality of current training audio clips to obtain a first intermediate current training mixed audio;
and performing spectral transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
13. The model training method of claim 12, wherein separately obtaining the plurality of current training audio clips comprises:
processing a plurality of current training audios by using a second neural network to obtain a plurality of current training target sound anchors;
and segmenting the plurality of current training audios based on the plurality of current training target sound anchors to obtain the plurality of current training audio clips corresponding one-to-one to the plurality of current training target sound anchors, wherein each current training audio clip comprises the current training target sound anchor corresponding to that current training audio clip.
14. The model training method of claim 13, wherein processing the plurality of current training audios by using the second neural network to obtain the plurality of current training target sound anchors comprises:
for each current training audio of the plurality of current training audios, processing that current training audio by using the second neural network to obtain a current training sound anchor vector corresponding to that current training audio;
determining at least one current training audio clip, of the plurality of current training audio clips, corresponding to that current training audio;
and selecting, according to the at least one current training audio clip, at least one current training target sound anchor corresponding to that current training audio from the current training sound anchor vector, thereby obtaining the plurality of current training target sound anchors.
15. The model training method of claim 13, wherein the plurality of current training audios are different from each other,
and the plurality of current training audios correspond one-to-one to the plurality of current training audio clips and also correspond one-to-one to the plurality of current training target sound anchors.
16. The model training method of claim 15, wherein determining the plurality of first current training condition vectors comprises:
processing the plurality of current training audio clips respectively by using the second neural network to obtain a plurality of current training sound anchor vectors corresponding one-to-one to the plurality of current training audio clips, wherein the plurality of current training sound anchor vectors are used as the plurality of first current training condition vectors.
17. The model training method of claim 12, wherein performing spectral transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio comprises:
performing short-time Fourier transform processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
18. The model training method of claim 12, wherein performing spectral transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio comprises:
performing short-time Fourier transform processing on the first intermediate current training mixed audio to obtain a second intermediate current training mixed audio;
and carrying out logarithmic Mel spectrum processing on the second intermediate current training mixed audio to obtain the current training mixed audio.
19. The model training method of claim 12, wherein the time lengths of the plurality of current training audio clips are all the same.
20. A sound source separation device comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions, which when executed by the processor perform the sound source separation method according to any one of claims 1-8.
21. A model training apparatus comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions, which when executed by the processor perform the model training method according to any of claims 9-19.
22. A storage medium non-transitory storing computer readable instructions which, when executed by a computer, can perform the sound source separation method according to any one of claims 1-8.
CN202010136342.1A 2020-03-02 2020-03-02 Sound source separation method and device, and neural network model training method and device Active CN111370019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010136342.1A CN111370019B (en) 2020-03-02 2020-03-02 Sound source separation method and device, and neural network model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010136342.1A CN111370019B (en) 2020-03-02 2020-03-02 Sound source separation method and device, and neural network model training method and device

Publications (2)

Publication Number Publication Date
CN111370019A CN111370019A (en) 2020-07-03
CN111370019B true CN111370019B (en) 2023-08-29

Family

ID=71208552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010136342.1A Active CN111370019B (en) 2020-03-02 2020-03-02 Sound source separation method and device, and neural network model training method and device

Country Status (1)

Country Link
CN (1) CN111370019B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724807B (en) * 2020-08-05 2023-08-11 字节跳动有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN112071330B (en) * 2020-09-16 2022-09-20 腾讯科技(深圳)有限公司 Audio data processing method and device and computer readable storage medium
CN112165591B (en) * 2020-09-30 2022-05-31 联想(北京)有限公司 Audio data processing method and device and electronic equipment
CN113593600B (en) * 2021-01-26 2024-03-15 腾讯科技(深圳)有限公司 Mixed voice separation method and device, storage medium and electronic equipment
CN113113040B (en) * 2021-03-22 2023-05-09 北京小米移动软件有限公司 Audio processing method and device, terminal and storage medium
CN113129919A (en) * 2021-04-17 2021-07-16 上海麦图信息科技有限公司 Air control voice noise reduction method based on deep learning
CN113241091B (en) * 2021-05-28 2022-07-12 思必驰科技股份有限公司 Sound separation enhancement method and system
CN113689837B (en) * 2021-08-24 2023-08-29 北京百度网讯科技有限公司 Audio data processing method, device, equipment and storage medium
CN115132183B (en) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 Training method, device, equipment, medium and program product of audio recognition model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10014002B2 (en) * 2016-02-16 2018-07-03 Red Pill VR, Inc. Real-time audio source separation using deep neural networks

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007044377A2 (en) * 2005-10-06 2007-04-19 Dts, Inc. Neural network classifier for seperating audio sources from a monophonic audio signal
CN101246690A (en) * 2007-02-15 2008-08-20 索尼株式会社 Sound processing apparatus, sound processing method and program
JP2011069948A (en) * 2009-09-25 2011-04-07 Nec Corp Device, method and program for separating sound source signal
KR20120040637A (en) * 2010-10-19 2012-04-27 한국전자통신연구원 Apparatus and method for separating sound source
GB201114737D0 (en) * 2011-08-26 2011-10-12 Univ Belfast Method and apparatus for acoustic source separation
JP2017520784A (en) * 2014-05-15 2017-07-27 トムソン ライセンシングThomson Licensing On-the-fly sound source separation method and system
CN105070304A (en) * 2015-08-11 2015-11-18 小米科技有限责任公司 Method, device and electronic equipment for realizing recording of object audio
EP3264802A1 (en) * 2016-06-30 2018-01-03 Nokia Technologies Oy Spatial audio processing for moving sound sources
CN107731220A (en) * 2017-10-18 2018-02-23 北京达佳互联信息技术有限公司 Audio identification methods, device and server
WO2019079749A1 (en) * 2017-10-19 2019-04-25 Syntiant Systems and methods for customizing neural networks
CN107942290A (en) * 2017-11-16 2018-04-20 东南大学 Binaural sound sources localization method based on BP neural network
CN108766440A (en) * 2018-05-28 2018-11-06 平安科技(深圳)有限公司 Speaker's disjunctive model training method, two speaker's separation methods and relevant device
CN110047512A (en) * 2019-04-25 2019-07-23 广东工业大学 A kind of ambient sound classification method, system and relevant apparatus
CN110335622A (en) * 2019-06-13 2019-10-15 平安科技(深圳)有限公司 Voice frequency tone color separation method, apparatus, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Audio monitoring and fault diagnosis system for power equipment in unattended substations; Yi Lin; Shen Qi; Wang Rui; Wang Ke; Peng Xiangyang; Computer Measurement & Control (Issue 11); full text *

Also Published As

Publication number Publication date
CN111370019A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN111370019B (en) Sound source separation method and device, and neural network model training method and device
US10014002B2 (en) Real-time audio source separation using deep neural networks
US20210089967A1 (en) Data training in multi-sensor setups
CN111724807B (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN108962231B (en) Voice classification method, device, server and storage medium
Sarroff Complex neural networks for audio
CN113257283B (en) Audio signal processing method and device, electronic equipment and storage medium
CN113555032B (en) Multi-speaker scene recognition and network training method and device
CN112153460A (en) Video dubbing method and device, electronic equipment and storage medium
CN111883107A (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN111444379B (en) Audio feature vector generation method and audio fragment representation model training method
Feng et al. SSLNet: A network for cross-modal sound source localization in visual scenes
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
Cui et al. Research on audio recognition based on the deep neural network in music teaching
Wang et al. Time-domain adaptive attention network for single-channel speech separation
Bouchakour et al. Noise-robust speech recognition in mobile network based on convolution neural networks
Xu et al. Meta learning based audio tagging.
CN114446316B (en) Audio separation method, training method, device and equipment of audio separation model
CN117316160B (en) Silent speech recognition method, silent speech recognition apparatus, electronic device, and computer-readable medium
CN117649846B (en) Speech recognition model generation method, speech recognition method, device and medium
CN116959422B (en) Many-to-many real-time voice sound changing method, equipment and storage medium
JP7376895B2 (en) Learning device, learning method, learning program, generation device, generation method, and generation program
JP7376896B2 (en) Learning device, learning method, learning program, generation device, generation method, and generation program
WO2022082607A1 (en) Vocal track removal by convolutional neural network embedded voice finger printing on standard arm embedded platform
Jiang et al. A Complex Neural Network Adaptive Beamforming for Multi-channel Speech Enhancement in Time Domain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant