CN115116448B - Voice extraction method, neural network model training method, device and storage medium - Google Patents
Voice extraction method, neural network model training method, device and storage medium
- Publication number
- CN115116448B (application number CN202211037918.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- target speaker
- speaker
- neural network
- aliasing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The invention discloses a voice extraction method, a neural network model training method, a device, and a storage medium. The method comprises the following steps: acquiring the multi-speaker aliased speech data to be processed and the voiceprint enrollment speech data of a target speaker; inputting the multi-speaker aliased speech data into a speech coding network to obtain a time-series representation of the aliased speech; inputting the voiceprint enrollment speech data of the target speaker into a speaker coding network to obtain the voiceprint features of the target speaker; inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network, which extracts the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data; and inputting the extracted time-series representation of the target speaker's speech into a speech decoding network, which reconstructs the time-domain speech signal of the target speaker. The method can accurately and effectively extract the target speaker's voice from multi-speaker aliased speech.
Description
Technical Field
The invention relates to the technical field of speech separation, and in particular to a voice extraction method, a neural network model training method, a device, and a storage medium.
Background
The cocktail party problem was first posed in 1953 by the British cognitive scientist Colin Cherry while studying the mechanism of selective attention. It seeks to explain how humans understand the speech of a target speaker under interference from other speakers or noise, with the goal of building an intelligent machine able to filter out the target speaker's signal. Colloquially, the cocktail party problem concerns the human capacity for auditory selection in complex acoustic environments: a person can easily focus on a sound stimulus of interest and ignore other background sounds, whereas computational auditory models are heavily affected by noise. How to design an auditory model that adapts flexibly to the cocktail party environment is an important open problem in computational hearing, with significant research and application value for tasks such as speech recognition, speaker recognition, and speech separation.
With the rapid development of artificial intelligence, speech separation, epitomized by the cocktail party problem, has made tremendous progress thanks to deep learning. In most practical scenarios, however, current speech separation technology is limited by the number of speakers, noise interference, and poor model generalization, and its performance remains unsatisfactory. Target speaker voice extraction, by contrast, uses an additional voiceprint cue to directionally extract the speech of a specified target speaker. It is not limited by the number of speakers, generalizes well, is robust to noisy environments, and suits application scenarios such as homes and meetings where an enrollment recording of the target speaker is available.
Early target speaker voice extraction techniques used speaker adaptation: an auxiliary network converts the magnitude-spectrum features of the target speaker's voiceprint enrollment speech into weight parameters of an adaptation layer, and the adaptation layer's output is obtained by weighting the outputs of its sublayers, so that the speech model adapts to the speaker. For example, CN 112331181A provides a method for extracting a target speaker's voice under multi-speaker conditions, which obtains adaptive parameters to dynamically adjust the output and thereby extract the target speaker's voice.
Deep-learning-based target speaker voice extraction is the current mainstream. Most schemes process features in the frequency domain and then reconstruct the time-domain speech signal; for example, CN 113990344A provides a method, device, and medium for separating multi-speaker speech based on voiceprint features, which uses the short-time Fourier transform to extract spectral features of speech.
In target speaker voice extraction, the modal fusion between the target speaker's voiceprint feature vector and the speech representation is a key problem. Because the two modalities have inconsistent feature forms, the common fusion approach is to expand the voiceprint feature vector, through a specific transformation, into the same form as the speech representation, and then fuse the features with simple operations such as concatenation. For example, CN 105489226A provides a method for separating a specified speaker's speech based on a dual-path self-attention mechanism, which uses concatenation to fuse speaker coding features and speech features.
Current target speaker voice extraction methods have the following problems:
1) Frequency-domain methods suffer from unstable estimation of the spectral phase, which degrades the quality of the extracted target speaker's voice.
2) The mainstream fusion of the voiceprint feature vector with the speech representation relies on simple operations such as concatenation; the correlation between the two modalities is not fully exploited, and the modality-specific information of each is partly lost during fusion.
Disclosure of Invention
The invention provides a voice extraction method, a neural network model training method, a device, and a storage medium, to address the poor performance of frequency-domain target speaker voice extraction methods and the insufficient fusion of the voiceprint feature vector with the speech representation in the related art.
The technical scheme adopted by the invention is as follows:
according to a first aspect of the present disclosure, there is provided a speech extraction method, including:
acquiring multi-speaker aliased speech data to be processed and voiceprint enrollment speech data of a target speaker; the multi-speaker aliased speech data comprises the voice of the target speaker;
inputting the multi-speaker aliased speech data into a speech coding network in a trained preset neural network model to obtain a time-series representation of the aliased speech;
inputting the voiceprint enrollment speech data of the target speaker into a speaker coding network in the trained preset neural network model to obtain the voiceprint features of the target speaker;
inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network in the trained preset neural network model, and extracting the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data;
and inputting the extracted time-series representation of the target speaker's speech into a speech decoding network in the trained preset neural network model, and reconstructing the time-domain speech signal of the target speaker.
Further, the speech coding network extracts the time-series representation using a one-dimensional convolutional encoder or a self-supervised pre-trained model.
Further, the method for constructing the speaker coding network comprises the following steps:
obtaining a time-series representation of the voiceprint enrollment speech data of the target speaker using the speech coding network;
modeling the time dependence of the time-series representation using a convolutional or recurrent neural network;
and extracting the voiceprint feature vector of the target speaker from the modeled time-series representation using a pooling layer based on a self-attention mechanism.
Further, the method for constructing the speaker extraction network comprises the following steps:
performing feature fusion between the voiceprint feature vector of the target speaker and the input speech time-series representation using a gated convolution fusion method;
modeling the time dependence of the fused time-series representation, and outputting the modeled time-series representation;
in the speaker extraction network, feature fusion and time-dependence modeling are connected in series as one stage, and several stages are repeated; only the first stage's feature fusion takes the time-series representation of the aliased speech as input, while each subsequent stage's feature fusion takes as input the speech time-series representation output by the previous stage's time-dependence modeling;
and converting the speech time-series representation output by the final stage into a mask, and multiplying the mask point by point with the time-series representation of the aliased speech to extract the time-series representation of the target speaker's speech.
Further, the last stage may optionally skip the feature fusion step and perform only the time-dependence modeling.
Further, the method for gated convolution fusion includes:
performing a one-dimensional convolution, with zero bias, between the input time-series representation and the convolution kernel of the information branch to obtain the output signal of the information branch;
convolving the input time-series representation with the convolution kernel of the gating branch, adding a bias term obtained from the target speaker's voiceprint feature vector through a linear layer, and then applying normalization and an activation function to obtain the output signal of the gating branch;
and multiplying the output signals of the gating branch and the information branch point by point, then combining the product with the input time-series representation through a residual connection to obtain the fused time-series representation.
Further, the sequence modeling of the time-series representation is performed by a temporal convolutional network, a dual-path recurrent neural network, or Transformers.
Further, the speech decoding network is implemented by a one-dimensional deconvolution layer or a fully connected linear layer.
According to a second aspect of the disclosure, there is provided a neural network model training method, including:
acquiring multi-speaker aliased speech training samples and voiceprint enrollment speech training samples of a target speaker; the multi-speaker aliased speech training samples are obtained by mixing the target speaker's voice, which serves as the ground-truth label, with randomly selected non-target speakers' voices;
inputting the multi-speaker aliased speech training samples into a speech coding network in a preset neural network model to obtain a time-series representation of the aliased speech;
inputting the voiceprint enrollment speech training samples of the target speaker into a speaker coding network in the preset neural network model to obtain the voiceprint features of the target speaker;
inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network in the preset neural network model, and extracting the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data;
inputting the extracted time-series representation of the target speaker's speech into a speech decoding network in the preset neural network model, and reconstructing the time-domain speech signal of the target speaker;
calculating a loss function between the target speaker's voice extracted by the preset neural network model and the target speaker's voice serving as the ground-truth label, updating the parameters of the preset neural network model by gradient back-propagation of the loss function, and ending the training process once the loss function has fully converged, at which point the model is taken as the trained preset neural network model.
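The patent does not name a specific loss function. A minimal NumPy sketch of scale-invariant SDR (SI-SDR), a common training objective for time-domain speech extraction, is shown here only as an assumed example of how the extracted voice could be compared against the ground-truth label:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is
    better). Negative SI-SDR is often minimized during training; this
    is an assumption -- the patent does not specify its loss."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref   # projection onto ref
    e_noise = est - s_target                           # residual distortion
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

rng = np.random.default_rng(6)
ref = rng.standard_normal(16000)                # ground-truth target speech
noisy = ref + 0.1 * rng.standard_normal(16000)  # an imperfect estimate
print(si_sdr(ref, ref) > si_sdr(noisy, ref))    # cleaner estimate scores higher
```

During training, `-si_sdr(estimate, label)` would be averaged over a batch and back-propagated as described above.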
According to a third aspect of the present disclosure, there is provided a speech extraction device including:
the acquisition module, used for acquiring multi-speaker aliased speech data to be processed and voiceprint enrollment speech data of a target speaker; the multi-speaker aliased speech data comprises the voice of the target speaker;
the speech coding network module, used for inputting the multi-speaker aliased speech data into a speech coding network in a trained preset neural network model and obtaining a time-series representation of the aliased speech;
the speaker coding network module, used for inputting the voiceprint enrollment speech data of the target speaker into a speaker coding network in the trained preset neural network model and obtaining the voiceprint features of the target speaker;
the speaker extraction network module, used for inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network in the trained preset neural network model, and extracting the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data;
and the speech decoding network module, used for inputting the extracted time-series representation of the target speaker's speech into a speech decoding network in the trained preset neural network model and reconstructing the time-domain speech signal of the target speaker.
According to a fourth aspect of the present disclosure, there is provided a neural network model training apparatus, including:
the acquisition module, used for acquiring multi-speaker aliased speech training samples and voiceprint enrollment speech training samples of a target speaker; the multi-speaker aliased speech training samples are obtained by mixing the target speaker's voice, which serves as the ground-truth label, with randomly selected non-target speakers' voices;
the speech coding network module, used for inputting the multi-speaker aliased speech training samples into a speech coding network in a preset neural network model and obtaining a time-series representation of the aliased speech;
the speaker coding network module, used for inputting the voiceprint enrollment speech training samples of the target speaker into a speaker coding network in the preset neural network model and obtaining the voiceprint features of the target speaker;
the speaker extraction network module, used for inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network in the preset neural network model, and extracting the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data;
the speech decoding network module, used for inputting the extracted time-series representation of the target speaker's speech into a speech decoding network in the preset neural network model and reconstructing the time-domain speech signal of the target speaker;
and the loss function calculation module, used for calculating a loss function between the target speaker's voice extracted by the preset neural network model and the target speaker's voice serving as the ground-truth label, updating the parameters of the preset neural network model by gradient back-propagation of the loss function, and ending the training process once the loss function has fully converged, at which point the model is taken as the trained preset neural network model.
According to a fifth aspect of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon; a processor executes the computer program to implement the voice extraction method according to the first aspect.
The invention has the beneficial effects that:
1) Speech features are extracted and encoded in the time domain, avoiding the potential impact of problems such as unstable spectral phase estimation in frequency-domain methods.
2) The target speaker's voiceprint features and the speech representation are fused by a gated convolution fusion technique; through global conditional modeling and a gating mechanism, the features of the two modalities are fully fused while the modality-specific information of each is effectively retained, improving the quality of the extracted target speaker's voice.
3) Through this novel feature fusion scheme, the invention makes full use of the target speaker's voiceprint cue and can accurately and effectively extract the target speaker's voice from multi-speaker aliased speech.
Drawings
FIG. 1 is a flow chart illustrating the steps of a speech extraction method disclosed in the present invention;
FIG. 2 is a block diagram of a speech extraction method according to the present disclosure;
FIG. 3 is a block diagram of a speaker extraction network according to the present disclosure;
FIG. 4 is a flowchart illustrating the steps of a gated convolution fusion method according to the present disclosure;
FIG. 5 is a block diagram of a gated convolution fusion method according to the present disclosure;
FIG. 6 is a flow chart illustrating the steps of a neural network model training method disclosed in the present invention;
FIG. 7 is a block diagram of a speech extracting apparatus according to the present disclosure;
fig. 8 is a block diagram of a neural network model training apparatus according to the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings, but embodiments of the present invention are not limited thereto.
Example 1:
as shown in fig. 1 and fig. 2, the speech extraction method provided in this embodiment includes the following steps:
s1.1, acquiring aliasing voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the multi-speaker aliasing voice data comprises the target speaker voice.
Specifically, taking a sampling rate of 16kHz as an example, a voice segment to be extracted and a voiceprint registration voice segment of a specified target speaker with arbitrary length are collected. The voiceprint enrollment voice of the target speaker means: the clean voice of the targeted speaker is used for voiceprint enrollment.
S1.2, inputting the multi-speaker aliased speech data into the speech coding network in the trained preset neural network model to obtain the time-series representation of the aliased speech.
Specifically, the time-series representation can be extracted by at least a one-dimensional convolutional encoder or a self-supervised pre-trained model. The one-dimensional convolutional encoder can be implemented as a one-dimensional convolutional layer (1-D CNN) followed by a ReLU activation, with kernel size L, stride L/2, 1 input channel, and D output channels. For the self-supervised option, open-source pre-trained models such as Wav2vec 2.0 or HuBERT in their standard configurations can be used to extract the time-series representation.
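As an illustrative sketch (not the patent's implementation; the filter values and dimensions below are made up), the one-dimensional convolutional encoder can be viewed as framing plus projection, here with kernel size L = 16, stride L/2 = 8, and D = 64 output channels:

```python
import numpy as np

def conv1d_encoder(wave, basis):
    """1-D conv encoder: frame the waveform with 50% overlap
    (stride = L // 2), project each L-sample frame onto the
    D learned filters in `basis`, then apply ReLU."""
    D, L = basis.shape
    stride = L // 2
    n_frames = (len(wave) - L) // stride + 1
    frames = np.stack([wave[i * stride : i * stride + L]
                       for i in range(n_frames)])   # (n_frames, L)
    return np.maximum(frames @ basis.T, 0.0)        # (n_frames, D)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)          # 1 s of 16 kHz audio
basis = rng.standard_normal((64, 16))      # D = 64 filters of length L = 16
rep = conv1d_encoder(wave, basis)
print(rep.shape)                           # (1999, 64)
```

In the trained model the filters are learned; random values are used here only to show the shapes.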
S1.3, inputting the voiceprint enrollment speech data of the target speaker into the speaker coding network in the preset neural network model to obtain the voiceprint features of the target speaker.
S1.3.1, obtaining a time-series representation of the voiceprint enrollment speech data of the target speaker using the speech coding network.
Specifically, the speech coding network from S1.2 can be reused directly to extract the time-series representation of the target speaker's voiceprint enrollment speech data.
S1.3.2, modeling the time dependence of the time-series representation using a convolutional or recurrent neural network.
Specifically, the time dependence of the time-series representation can be modeled by stacking multiple convolutional layers with residual connections (CNN) or by bidirectional long short-term memory networks (BiLSTM). In the first case, a convolutional network with n ≥ 5 layers can be used, where the input/output channel numbers are (D, O) for the first layer, (O, O) for the middle layers, (O, P) for the third-from-last layer, (P, P) for the second-from-last layer, and (P, H) for the last layer. All convolutional layers except the first and last use residual connections, and a layer normalization is applied before the first layer. In the second case, an n-layer BiLSTM with input dimension D and hidden dimension H can be used, followed by a ReLU activation and a fully connected layer with input dimension H.
S1.3.3, extracting the voiceprint feature vector of the target speaker from the modeled time-series representation using a pooling layer based on a self-attention mechanism.
Specifically, the self-attention pooling layer consists of a feed-forward network and a pooling network. The feed-forward network consists of two fully connected layers with input/output channels (H, H) and (H, 1), respectively. The pooling network first computes attention coefficients using a masking scheme, then obtains probability weights over all time steps with a softmax function and pools by weighted averaging; finally, the voiceprint feature vector of the target speaker is obtained after a fully connected layer and a tanh activation.
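A minimal NumPy sketch of this self-attention pooling (random weights stand in for trained parameters; the tanh inside the scoring network is an assumption, as the patent does not name the feed-forward activation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(rep, w1, w2, w_out):
    """Self-attention pooling over a (T, H) time series: a two-layer
    feed-forward net with channels (H, H) and (H, 1) scores each time
    step, softmax turns scores into weights over time, a weighted
    average pools the sequence, and a final linear layer + tanh
    yields the voiceprint embedding."""
    h = np.tanh(rep @ w1)            # (T, H); hidden activation assumed
    scores = (h @ w2).ravel()        # (T,) one score per time step
    alpha = softmax(scores)          # attention weights over time
    pooled = alpha @ rep             # (H,) weighted average
    return np.tanh(pooled @ w_out)   # target speaker voiceprint vector

rng = np.random.default_rng(1)
T, H = 50, 32
emb = attention_pool(rng.standard_normal((T, H)),
                     rng.standard_normal((H, H)),
                     rng.standard_normal((H, 1)),
                     rng.standard_normal((H, H)))
print(emb.shape)   # (32,)
```

The masking of padded time steps mentioned in the text is omitted here for brevity.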
As shown in fig. 3, S1.4, the time-series representation of the aliased speech and the voiceprint features of the target speaker are input together into the speaker extraction network in the trained preset neural network model, and the time-series representation of the speech belonging to the target speaker is extracted from the multi-speaker aliased speech data.
S1.4.1, performing feature fusion between the voiceprint feature vector of the target speaker and the input speech time-series representation using the gated convolution fusion method.
S1.4.2, modeling the time dependence of the fused time-series representation, and outputting the modeled time-series representation.
Specifically, the time dependence of the time-series representation can be modeled at least by a temporal convolutional network (TCN), a dual-path recurrent neural network (dual-path RNN), or Transformers. For example, a temporal convolutional network typically consists of 8 stacked time-domain convolutional layers. Each layer performs a feature-dimension transformation through a 1×1 convolution, then a one-dimensional convolution along the time dimension with kernel size K = 3 and stride S = 1, the dilation factor of the x-th layer being 2^(x−1). Before each convolution, layer normalization and a parametric rectified linear unit (PReLU) activation are applied; finally, the feature dimension is restored by a 1×1 convolution and a mask over the time-series representation is output, which is multiplied point by point with the input to obtain the modeled time-series representation.
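The dilation schedule is the key property of this TCN stack. A simplified NumPy sketch of one 'same'-length dilated layer and the 8-layer schedule (per-channel kernels only; the real TCN also mixes channels through the 1×1 convolutions and uses normalization and PReLU, omitted here):

```python
import numpy as np

def dilated_conv_same(x, k, d):
    """'Same'-length dilated 1-D conv, kernel size 3, stride 1.
    x: (T, D); k: (3, D) per-channel weights; d: dilation factor."""
    xp = np.pad(x, ((d, d), (0, 0)))
    return xp[:-2 * d] * k[0] + xp[d:-d] * k[1] + xp[2 * d:] * k[2]

# dilation factor of the x-th layer is 2**(x - 1)
dilations = [2 ** (x - 1) for x in range(1, 9)]   # [1, 2, 4, ..., 128]
# receptive field of the 8-layer stack with kernel 3: 1 + 2 * sum(dilations)
receptive_field = 1 + 2 * sum(dilations)

rng = np.random.default_rng(2)
x = rng.standard_normal((100, 16))
for d in dilations:
    x = dilated_conv_same(x, 0.1 * rng.standard_normal((3, 16)), d)
print(x.shape, receptive_field)   # (100, 16) 511
```

The exponentially growing dilation is what lets 8 thin layers cover a 511-frame context.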
S1.4.3, repeating the above two steps (S1.4.1 and S1.4.2) over multiple stages. Only the first stage's feature fusion takes the time-series representation of the aliased speech as input; each subsequent stage's feature fusion takes as input the speech time-series representation output by the previous stage's time-dependence modeling. That is, in the speaker extraction network, feature fusion and time-dependence modeling are connected in series as one stage, and several such stages are repeated.
Specifically, M = 4 stages of feature fusion and time-dependence modeling are performed. Because the fused features need to be fully expressed to obtain an accurate time-series representation, the last stage may skip the feature fusion step and perform only time-dependence modeling; this strengthens the expressive capacity of the modeling in the last two stages and improves system performance.
S1.4.4, converting the speech time series representation output by the final stage into a mask, multiplying the mask point by point with the time series representation of the aliased speech, and thereby extracting the time series representation of the target speaker's speech.
Specifically, the final-stage speech time series representation can be converted into a mask estimate by a 1×1 convolution followed by a ReLU activation function.
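A brief sketch of this masking step (the feature dimension D is an assumption, not specified by the document):

```python
import torch
import torch.nn as nn

D = 64                                 # assumed feature dimension
mask_head = nn.Sequential(nn.Conv1d(D, D, kernel_size=1), nn.ReLU())

mix_repr = torch.randn(2, D, 100)      # time series representation of aliased speech
final_repr = torch.randn(2, D, 100)    # final-stage output of the extraction network

mask = mask_head(final_repr)           # non-negative mask via 1x1 conv + ReLU
target_repr = mask * mix_repr          # point-by-point multiplication (S1.4.4)
```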
And S1.5, inputting the extracted time series representation of the target speaker's speech into the speech decoding network in the trained preset neural network model, and restoring the time-domain speech signal of the target speaker.
In particular, the speech decoding network may be implemented by a one-dimensional deconvolution layer or a fully connected linear layer. The one-dimensional deconvolution layer usually adopts a deconvolution operation with an input channel count D and an output channel count L.
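Assuming the one-dimensional deconvolution variant and interpreting L as the synthesis window length of an overlap-add decoder (the exact channel layout, window and hop sizes are assumptions), a sketch might be:

```python
import torch
import torch.nn as nn

D, L, hop = 64, 32, 16                 # assumed feature dim, window length, hop
decoder = nn.ConvTranspose1d(in_channels=D, out_channels=1,
                             kernel_size=L, stride=hop)

target_repr = torch.randn(2, D, 100)   # extracted target-speaker representation
waveform = decoder(target_repr)        # (batch, 1, samples): time-domain signal
```

With T input frames, the transposed convolution produces (T − 1) · hop + L output samples, i.e. overlapping windows summed back into a waveform.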
As shown in fig. 4 and fig. 5, the embodiment elaborates the gated convolution fusion method, which includes the following steps:
And S2.1, performing a bias-free (zero-bias) one-dimensional convolution operation between the time series representation input and the convolution kernel of the information branch to obtain the output signal of the information branch.
Specifically, the one-dimensional convolution uses a kernel of size 3 with a padding length of 1; the output signal of the information branch is obtained after the convolution operation is completed.
S2.2, convolving the time series representation input with the convolution kernel of the gating branch, adding a bias term obtained from the target speaker's voiceprint feature vector through a linear layer, and obtaining the output signal of the gating branch through normalization and an activation function;
specifically, the gating branch adopts a one-dimensional convolution operation with a bias term, where the convolution configuration is the same as that of the information branch. The bias term is generated by mapping the target speaker's voiceprint feature vector through a fully connected linear layer, after which layer normalization (LayerNorm) and a sigmoid activation function are applied to obtain the output signal of the gating branch.
And S2.3, multiplying the output signals of the gating branch and the information branch point by point, then connecting the result with the time series representation input through a residual connection to obtain the time series representation after feature fusion.
Specifically, feature fusion of the two modalities is performed by letting the output signal of the gating branch control the transmission of the speech information stream in the information branch. On one hand, the speech information stream in the information branch preserves the complete content information of the target speaker's speech; on the other hand, the output signal of the gating branch folds voiceprint cues into a control signal and steers the extraction of target-speaker information in the information branch through gating, rather than through a direct operation such as simple concatenation. Finally, a residual connection is adopted to improve the convergence of the fusion module within the deep neural network.
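The three steps S2.1–S2.3 can be sketched as a single module (channel and embedding sizes are assumptions; single-group GroupNorm is used here as a channel-wise layer norm):

```python
import torch
import torch.nn as nn

class GatedConvFusion(nn.Module):
    """Gated convolution fusion as described above: a bias-free 1-D conv
    (K=3, padding=1) forms the information branch; the gating branch uses
    the same conv plus a bias produced from the speaker embedding by a
    linear layer, followed by layer norm and a sigmoid; the gated product
    is added back to the input through a residual connection."""

    def __init__(self, channels: int, emb_dim: int):
        super().__init__()
        self.info = nn.Conv1d(channels, channels, 3, padding=1, bias=False)
        self.gate = nn.Conv1d(channels, channels, 3, padding=1, bias=False)
        self.emb2bias = nn.Linear(emb_dim, channels)
        self.norm = nn.GroupNorm(1, channels)    # channel-wise layer norm

    def forward(self, x, emb):                   # x: (B, C, T), emb: (B, E)
        info = self.info(x)                      # S2.1: information branch
        bias = self.emb2bias(emb).unsqueeze(-1)  # (B, C, 1), broadcast over time
        gate = torch.sigmoid(self.norm(self.gate(x) + bias))   # S2.2
        return x + gate * info                   # S2.3: residual connection

fusion = GatedConvFusion(channels=64, emb_dim=128)
y = fusion(torch.randn(2, 64, 100), torch.randn(2, 128))
```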
Example 2
As shown in fig. 6, the neural network model training method provided in this embodiment includes the following steps:
and S3.1, obtaining aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of the target speaker. The multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice.
Specifically, all speech training sample data are resampled at a sampling rate of Fs=16 kHz. Each target-speaker speech sample serving as a real label and each non-target-speaker speech sample are first divided into segments of 4 s duration. For each 4 s target-speaker segment, one 4 s segment is randomly selected from one or more non-target speakers to pair with it; the segments are mixed by amplitude transformation and superposition at a randomly sampled signal-to-noise ratio within a ±5 dB range, generating a 4 s multi-speaker aliased segment. Finally, the multi-speaker aliased segments and the corresponding target-speaker segments are used as inputs and real labels respectively, and are divided into training, validation, and test sets at a common ratio. It should be noted that in this embodiment both the target-speaker speech used to build the aliased speech and the target speaker's voiceprint registration speech come from the target-speaker speech training sample data serving as real labels, but the voiceprint registration speech used in training must be a different segment from the one used in the aliased speech.
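The amplitude-scaling-and-superposition mixing described above can be sketched as follows (the symmetric ±5 dB range and the white-noise stand-in signals are assumptions made for illustration):

```python
import numpy as np

def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float):
    """Scale the interferer so that the target/interferer power ratio
    equals snr_db, then superimpose the two signals (amplitude
    transformation + superposition, as described above)."""
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2)
    scale = np.sqrt(p_t / (p_i * 10 ** (snr_db / 10)))
    return target + scale * interferer

fs = 16000
seg = 4 * fs                                  # 4-second segments at 16 kHz
rng = np.random.default_rng(0)
target = rng.standard_normal(seg)             # stand-in for a target segment
interferer = rng.standard_normal(seg)         # stand-in for a non-target segment
snr_db = rng.uniform(-5, 5)                   # assumed symmetric +/-5 dB range
mixture = mix_at_snr(target, interferer, snr_db)
```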
S3.2, inputting aliasing voice training sample data of multiple speakers into a voice coding network in a preset neural network model, and acquiring time sequence representation of aliasing voice;
s3.3, inputting voiceprint registration voice training sample data of the target speaker into a speaker coding network in a preset neural network model to obtain the voiceprint characteristics of the target speaker;
s3.4, simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a preset neural network model, and extracting the speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
s3.5, inputting the extracted representation of the target speaker voice time sequence into a voice decoder network in a preset neural network model, and restoring a time domain voice signal of the target speaker;
and S3.6, calculating a loss function between the target speaker voice extracted by the preset neural network model and the target speaker voice serving as a real label, updating and training parameters of the preset neural network model based on gradient back propagation of the loss function, and ending the training process after the loss function is completely converged. And determining the trained preset neural network model as the trained preset neural network model.
Specifically, the preset neural network model may be trained with SI-SDR (scale-invariant signal-to-distortion ratio) as the loss function. The training strategy uses the Adam optimizer with an initial learning rate of 1e-3 and a maximum of 100 training epochs; when the validation loss does not decrease for 3 consecutive epochs (does not fall below the previous minimum), the learning rate is halved, and when it does not decrease for 10 consecutive epochs, training is stopped early. The model parameters are saved after training ends.
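A sketch of SI-SDR used as a training loss (this is the standard scale-invariant SDR definition; the zero-mean step is conventional rather than specified by this document):

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (higher is
    better); training minimizes its negative."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # project the estimate onto the reference to get the target component
    s_target = (np.dot(estimate, reference)
                / np.dot(reference, reference)) * reference
    e_noise = estimate - s_target
    return 10 * np.log10(np.sum(s_target ** 2) / np.sum(e_noise ** 2))

ref = np.sin(np.linspace(0, 100, 16000))
est = ref + 0.01 * np.random.default_rng(0).standard_normal(16000)
loss = -si_sdr(est, ref)          # negative SI-SDR as the training loss
```

Because of the projection, rescaling the estimate leaves the value unchanged, which is what makes the metric scale-invariant.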
Example 3
Referring to fig. 7, the present embodiment provides a speech extraction apparatus 100, including:
an acquisition module 110, configured to acquire aliased voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the aliasing voice data of the multiple speakers comprises the voice of the target speaker;
the voice coding network module 120 is configured to input the aliased voice data of the multiple speakers to be extracted into a voice coding network in the trained preset neural network model, and obtain a time sequence representation of the aliased voice;
a speaker coding network module 130, configured to input voiceprint registration voice data of the target speaker into a speaker coding network in the trained preset neural network model, and obtain a voiceprint feature of the target speaker;
the speaker extraction network module 140 is configured to simultaneously input the time series representation of the aliased speech and the voiceprint features of the target speaker into the speaker extraction network in the trained preset neural network model, and extract the speech time series characterization belonging to the target speaker from the multi-speaker aliased speech data;
the speech decoding network module 150 is configured to input the extracted representation of the time sequence of the target speaker speech into a speech decoding network in a trained preset neural network model, and restore a time-domain speech signal of the target speaker.
Example 4
Referring to fig. 8, the present embodiment provides a neural network model training apparatus 200, including:
the acquisition module 210 is configured to acquire aliased voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
the voice coding network module 220 is configured to input aliased voice training sample data of multiple speakers to a voice coding network in a preset neural network model, and acquire a time sequence representation of the aliased voice;
a speaker encoder network module 230, configured to input voiceprint registration voice training sample data of a target speaker into a speaker encoder network in a preset neural network model, and obtain a voiceprint feature of the target speaker;
the speaker extraction network module 240 is configured to simultaneously input the time series representation of the aliased speech and the voiceprint features of the target speaker into the speaker extraction network in the preset neural network model, and extract a speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
the voice decoding network module 250 is configured to input the extracted time series representation of the target speaker to a voice decoding network in a preset neural network model, and restore a time-domain voice signal of the target speaker;
and the loss function calculation module 260 is configured to calculate a loss function between the target speaker speech extracted by the preset neural network model and the target speaker speech serving as a real label, update and train the parameters of the preset neural network model based on gradient back-propagation of the loss function, end the training process after the loss function has fully converged, and take the resulting model as the trained preset neural network model.
Example 5
The present embodiment provides a computer-readable storage medium, in which a computer program is stored, and a processor executes the computer program to implement the speech extraction method according to embodiment 1.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (11)
1. A method of speech extraction, comprising:
acquiring aliasing voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the multi-speaker aliasing voice data comprises the voice of a target speaker;
inputting the aliasing voice data of a plurality of speakers to be extracted into a voice coding network in a trained preset neural network model, and acquiring the time sequence representation of the aliasing voice;
inputting the voiceprint registration voice data of the target speaker into a speaker coding network in a trained preset neural network model to obtain the voiceprint characteristics of the target speaker;
simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a trained preset neural network model, and extracting the speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
inputting the extracted representation of the target speaker voice time sequence into a voice decoding network in a trained preset neural network model, and restoring a time domain voice signal of the target speaker;
the method for constructing the speaker coding network comprises the following steps:
acquiring time series representation of voiceprint registration voice data of a target speaker by adopting the voice coding network;
modeling the time dependence of the time sequence representation by adopting a convolution or circulation neural network;
and extracting the voiceprint feature vector of the target speaker from the modeled time series representation by using a pooling layer based on a self-attention mechanism.
2. The speech extraction method of claim 1, wherein the method of constructing the speech coding network comprises: and extracting the time sequence representation by adopting a one-dimensional convolution coder or a self-supervision pre-training model.
3. The speech extraction method of claim 1, wherein the method of constructing the speaker extraction network comprises:
performing feature fusion on the voiceprint feature vector of the target speaker and the corresponding input speech time series characterization by using a gated convolution fusion method;
modeling the time dependence relationship of the time series representation obtained after feature fusion, and outputting the time series representation after modeling;
in the speaker extraction network, the feature fusion and the time dependency modeling are connected in series to serve as one stage, the processing of a plurality of stages is repeated, only the time sequence representation of aliasing voice is input in the feature fusion of the first stage, and then the voice time sequence representation required by the feature fusion of each stage is input into the voice time sequence representation output of the previous stage after the time dependency modeling processing;
and converting the voice time series representation output of the final stage into a mask, and multiplying the mask and the time series representation of the aliasing voice point by point to extract the time series representation of the voice of the target speaker.
4. The speech extraction method according to claim 3, wherein the step of not performing feature fusion is selected in a final stage, and only the process of modeling the time dependency is performed.
5. The method of speech extraction of claim 3 wherein said gated convolution fusion method comprises:
performing zero-offset one-dimensional convolution operation on the time sequence characterization input and the convolution kernel of the information branch to obtain an output signal of the information branch;
convolving the time series representation input with the convolution kernel of the gating branch, adding a bias term obtained from the target speaker's voiceprint feature vector through a linear layer, and obtaining the output signal of the gating branch through normalization and an activation function;
and multiplying the output signals of the gating branch and the information branch point by point, and then connecting the multiplied output signals with the time sequence representation input in a residual error connection mode to obtain the time sequence representation after feature fusion.
6. The method of speech extraction according to claim 3, wherein the sequence modeling of the time series characterization is performed by a time convolutional network, a dual-path recurrent neural network, or a Transformer.
7. The speech extraction method of claim 1, wherein the speech decoding network is implemented by a one-dimensional deconvolution layer or a fully-connected linear layer.
8. A neural network model training method is characterized by comprising the following steps:
acquiring aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
inputting multi-speaker aliasing voice training sample data into a voice coding network in a preset neural network model, and acquiring time sequence representation of aliasing voice;
inputting voiceprint registration voice training sample data of a target speaker into a speaker coding network in a preset neural network model to obtain voiceprint characteristics of the target speaker;
simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a preset neural network model, and extracting the speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a preset neural network model, and restoring a time domain voice signal of the target speaker;
calculating a loss function between the target speaker speech extracted by the preset neural network model and the target speaker speech serving as a real label, updating and training the parameters of the preset neural network model based on gradient back-propagation of the loss function, finishing the training process after the loss function has fully converged, and taking the resulting model as the trained preset neural network model;
the method for constructing the speaker coding network comprises the following steps:
acquiring time series representation of voiceprint registration voice data of a target speaker by adopting the voice coding network;
modeling the time dependence of the time sequence representation by adopting a convolution or circulation neural network;
and extracting the voiceprint characteristic vector of the target speaker from the time series representation after the modeling processing by adopting a pooling layer based on a self-attention mechanism.
9. A speech extraction device, comprising:
the acquisition module is used for acquiring the aliasing voice data of multiple speakers to be extracted and the voiceprint registration voice data of a target speaker; the aliasing voice data of the multiple speakers comprises the voice of the target speaker;
the voice coding network module is used for inputting the aliased voice data of the multiple speakers to be extracted into a voice coding network in a trained preset neural network model and acquiring the time sequence representation of the aliased voice;
the speaker coding network module is used for inputting the voiceprint registration voice data of the target speaker into the speaker coding network in the trained preset neural network model to acquire the voiceprint characteristics of the target speaker;
the speaker extraction network module is used for simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a trained preset neural network model and extracting the speech time series characterization belonging to the target speaker from the multi-speaker aliased speech data;
the voice decoding network module is used for inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a trained preset neural network model and restoring a time domain voice signal of the target speaker;
the method for constructing the speaker coding network comprises the following steps:
acquiring time series representation of voiceprint registration voice data of a target speaker by adopting the voice coding network;
modeling the time dependence of the time sequence representation by adopting a convolution or circulation neural network;
and extracting the voiceprint feature vector of the target speaker from the modeled time series representation by using a pooling layer based on a self-attention mechanism.
10. A neural network model training device, comprising:
the acquisition module is used for acquiring aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
the voice coding network module is used for inputting aliasing voice training sample data of multiple speakers into a voice coding network in a preset neural network model and acquiring a time sequence representation of aliasing voices;
the speaker coding network module is used for inputting the voiceprint registration voice training sample data of the target speaker into a speaker coding network in a preset neural network model to acquire the voiceprint characteristics of the target speaker;
the speaker extraction network module is used for simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a preset neural network model and extracting the speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
the voice decoding network module is used for inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a preset neural network model and restoring a time domain voice signal of the target speaker;
the loss function calculation module is used for calculating a loss function between the target speaker speech extracted by the preset neural network model and the target speaker speech serving as a real label, updating and training the parameters of the preset neural network model based on gradient back-propagation of the loss function, finishing the training process after the loss function has fully converged, and taking the resulting model as the trained preset neural network model;
the method for constructing the speaker coding network comprises the following steps:
acquiring time series representation of voiceprint registration voice data of a target speaker by adopting the voice coding network;
modeling the time dependence of the time sequence representation by adopting a convolution or circulation neural network;
and extracting the voiceprint characteristic vector of the target speaker from the time series representation after the modeling processing by adopting a pooling layer based on a self-attention mechanism.
11. A computer-readable storage medium, having a computer program stored thereon, wherein a processor executes the computer program to implement the speech extraction method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211037918.4A CN115116448B (en) | 2022-08-29 | 2022-08-29 | Voice extraction method, neural network model training method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115116448A CN115116448A (en) | 2022-09-27 |
CN115116448B (en) | 2022-11-15
Family
ID=83336384
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211037918.4A Active CN115116448B (en) | 2022-08-29 | 2022-08-29 | Voice extraction method, neural network model training method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115116448B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117711420A (en) * | 2023-07-17 | 2024-03-15 | 荣耀终端有限公司 | Target voice extraction method, electronic equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11242499A (en) * | 1997-08-29 | 1999-09-07 | Toshiba Corp | Voice encoding and decoding method and component separating method for voice signal |
CN110287320A (en) * | 2019-06-25 | 2019-09-27 | Beijing University of Technology | A multi-class sentiment analysis model combining deep learning with an attention mechanism
CN111243579A (en) * | 2020-01-19 | 2020-06-05 | 清华大学 | Time domain single-channel multi-speaker voice recognition method and system |
CN111653288A (en) * | 2020-06-18 | 2020-09-11 | 南京大学 | Target person voice enhancement method based on conditional variation self-encoder |
CN112071325A (en) * | 2020-09-04 | 2020-12-11 | 中山大学 | Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling |
CN113053407A (en) * | 2021-02-06 | 2021-06-29 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for multiple speakers |
CN113571074A (en) * | 2021-08-09 | 2021-10-29 | 四川启睿克科技有限公司 | Voice enhancement method and device based on multi-band structure time domain audio separation network |
CN114329036A (en) * | 2022-03-16 | 2022-04-12 | 中山大学 | Cross-modal characteristic fusion system based on attention mechanism |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10811000B2 (en) * | 2018-04-13 | 2020-10-20 | Mitsubishi Electric Research Laboratories, Inc. | Methods and systems for recognizing simultaneous speech by multiple speakers |
US20210272573A1 (en) * | 2020-02-29 | 2021-09-02 | Robert Bosch Gmbh | System for end-to-end speech separation using squeeze and excitation dilated convolutional neural networks |
CN114333896A (en) * | 2020-09-25 | 2022-04-12 | 华为技术有限公司 | Voice separation method, electronic device, chip and computer readable storage medium |
CN114495973A (en) * | 2022-01-25 | 2022-05-13 | 中山大学 | Special person voice separation method based on double-path self-attention mechanism |
Non-Patent Citations (2)
Title |
---|
SpEx: Multi-Scale Time Domain Speaker;Chenglin Xu;《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》;20200414;第1370-1384页 * |
Time-domain speech separation algorithms based on deep neural networks; Ding Hui; 《China Master's Theses Full-text Database》; 20220315 (No. 3); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program | |
CN111243620B (en) | Voice separation model training method and device, storage medium and computer equipment | |
Ravanelli et al. | Multi-task self-supervised learning for robust speech recognition | |
WO2021143327A1 (en) | Voice recognition method, device, and computer-readable storage medium | |
CN107680611B (en) | Single-channel sound separation method based on convolutional neural network | |
Feng et al. | Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition | |
KR101807948B1 (en) | Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same | |
CN110767244B (en) | Speech enhancement method | |
CN112071329A (en) | Multi-person voice separation method and device, electronic equipment and storage medium | |
KR20160032536A (en) | Signal process algorithm integrated deep neural network based speech recognition apparatus and optimization learning method thereof | |
KR101807961B1 (en) | Method and apparatus for processing speech signal based on lstm and dnn | |
JP2008152262A (en) | Method and apparatus for transforming speech feature vector | |
CN110189761B (en) | Single-channel speech dereverberation method based on greedy depth dictionary learning | |
CN111899757A (en) | Single-channel voice separation method and system for target speaker extraction | |
CN112037809A (en) | Residual echo suppression method based on multi-feature flow structure deep neural network | |
CN115116448B (en) | Voice extraction method, neural network model training method, device and storage medium | |
Kothapally et al. | Skipconvgan: Monaural speech dereverberation using generative adversarial networks via complex time-frequency masking | |
Lin et al. | Speech enhancement using forked generative adversarial networks with spectral subtraction | |
CN117174105A (en) | Speech noise reduction and dereverberation method based on improved deep convolutional network | |
CN113823273A (en) | Audio signal processing method, audio signal processing device, electronic equipment and storage medium | |
Nakagome et al. | Mentoring-Reverse Mentoring for Unsupervised Multi-Channel Speech Source Separation. | |
CN112037813B (en) | Voice extraction method for high-power target signal | |
CN113870893A (en) | Multi-channel double-speaker separation method and system | |
CN113241092A (en) | Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network | |
Tamura et al. | Improvements to the noise reduction neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||