CN115116448B - Voice extraction method, neural network model training method, device and storage medium - Google Patents
Voice extraction method, neural network model training method, device and storage medium
- Publication number
- CN115116448B (application number CN202211037918.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- target speaker
- speaker
- neural network
- aliasing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The invention discloses a voice extraction method, a neural network model training method, a device, and a storage medium. The method comprises the following steps: acquiring the multi-speaker aliased speech data to be processed and the voiceprint enrollment speech data of a target speaker; inputting the multi-speaker aliased speech data into a speech coding network to obtain a time-series representation of the aliased speech; inputting the voiceprint enrollment speech data of the target speaker into a speaker coding network to obtain the voiceprint features of the target speaker; inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network, which extracts the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data; and inputting the extracted time-series representation of the target speaker's speech into a speech decoding network, which reconstructs the time-domain speech signal of the target speaker. The method can accurately and effectively extract the target speaker's voice from multi-speaker aliased speech.
Description
Technical Field
The invention relates to the technical field of speech separation, and in particular to a voice extraction method, a neural network model training method, a device, and a storage medium.
Background
The cocktail party problem was first posed in 1953 by the British cognitive scientist Colin Cherry while studying the mechanism of selective attention. It seeks to explain how humans understand the speech of a target speaker under interference from other speakers or noise, with the goal of building an intelligent machine able to filter out the target speaker's signal. Colloquially, the cocktail party problem concerns the human capacity for auditory selection in complex acoustic environments: a person can easily focus on a sound stimulus of interest and ignore other background sounds, whereas computational auditory models are heavily affected by noise. How to design an auditory model that adapts flexibly to the cocktail party environment is an important open problem in computational hearing, with significant research and application value for tasks such as speech recognition, speaker recognition, and speech separation.
With the rapid development of artificial intelligence, speech separation, epitomized by the cocktail party problem, has made tremendous progress thanks to deep learning. In most practical scenarios, however, current speech separation technology is limited by the number of speakers, noise interference, and poor model generalization, and its performance remains unsatisfactory. Target speaker voice extraction, by contrast, uses an additional voiceprint cue to directionally extract the speech of a specified target speaker. It is not limited by the number of speakers, generalizes well, is robust to noisy environments, and suits application scenarios such as homes and meetings where an enrollment recording of the target speaker is available.
Early target speaker voice extraction techniques used speaker adaptation: an auxiliary network converts the magnitude-spectrum features of the target speaker's voiceprint enrollment speech into weight parameters of an adaptation layer, and the adaptation layer's output is obtained by weighting the outputs of its sublayers, so that the speech model adapts to the speaker. For example, CN 112331181A provides a method for extracting a target speaker's voice under multi-speaker conditions, which obtains adaptive parameters to dynamically adjust the output and thereby extract the target speaker's voice.
Deep-learning-based target speaker voice extraction is the current mainstream. Most schemes process features in the frequency domain and then reconstruct the time-domain speech signal; for example, CN 113990344A provides a method, device, and medium for separating multi-speaker speech based on voiceprint features, which uses the short-time Fourier transform to extract spectral features of speech.
In target speaker voice extraction, the modal fusion between the target speaker's voiceprint feature vector and the speech representation is a key problem. Because the two modalities have inconsistent feature forms, the common fusion approach is to expand the voiceprint feature vector, through a specific transformation, into the same form as the speech representation, and then fuse the features with simple operations such as concatenation. For example, CN 105489226A provides a method for separating a specified speaker's speech based on a dual-path self-attention mechanism, which uses concatenation to fuse speaker coding features and speech features.
Current target speaker voice extraction methods have the following problems:
1) Frequency-domain methods suffer from unstable estimation of the spectral phase, which degrades the quality of the extracted target speaker's voice.
2) The mainstream fusion of the voiceprint feature vector with the speech representation relies on simple operations such as concatenation; the correlation between the two modalities is not fully exploited, and the modality-specific information of each is partly lost during fusion.
Disclosure of Invention
The invention provides a voice extraction method, a neural network model training method, a device, and a storage medium, to address the poor performance of frequency-domain target speaker voice extraction methods and the insufficient fusion of the voiceprint feature vector with the speech representation in the related art.
The technical scheme adopted by the invention is as follows:
according to a first aspect of the present disclosure, there is provided a speech extraction method, including:
acquiring multi-speaker aliased speech data to be processed and voiceprint enrollment speech data of a target speaker; the multi-speaker aliased speech data comprises the voice of the target speaker;
inputting the multi-speaker aliased speech data into a speech coding network in a trained preset neural network model to obtain a time-series representation of the aliased speech;
inputting the voiceprint enrollment speech data of the target speaker into a speaker coding network in the trained preset neural network model to obtain the voiceprint features of the target speaker;
inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network in the trained preset neural network model, and extracting the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data;
and inputting the extracted time-series representation of the target speaker's speech into a speech decoding network in the trained preset neural network model, and reconstructing the time-domain speech signal of the target speaker.
Further, the speech coding network extracts the time-series representation using a one-dimensional convolutional encoder or a self-supervised pre-trained model.
Further, the method for constructing the speaker coding network comprises the following steps:
obtaining a time-series representation of the voiceprint enrollment speech data of the target speaker using the speech coding network;
modeling the time dependence of the time-series representation using a convolutional or recurrent neural network;
and extracting the voiceprint feature vector of the target speaker from the modeled time-series representation using a pooling layer based on a self-attention mechanism.
Further, the method for constructing the speaker extraction network comprises the following steps:
performing feature fusion between the voiceprint feature vector of the target speaker and the input speech time-series representation using a gated convolution fusion method;
modeling the time dependence of the fused time-series representation, and outputting the modeled time-series representation;
in the speaker extraction network, feature fusion and time-dependence modeling are connected in series as one stage, and several stages are repeated; only the first stage's feature fusion takes the time-series representation of the aliased speech as input, while each subsequent stage's feature fusion takes as input the speech time-series representation output by the previous stage's time-dependence modeling;
and converting the speech time-series representation output by the final stage into a mask, and multiplying the mask point by point with the time-series representation of the aliased speech to extract the time-series representation of the target speaker's speech.
Further, the last stage may optionally skip the feature fusion step and perform only the time-dependence modeling.
Further, the method for gated convolution fusion includes:
performing a one-dimensional convolution, with zero bias, between the input time-series representation and the convolution kernel of the information branch to obtain the output signal of the information branch;
convolving the input time-series representation with the convolution kernel of the gating branch, adding a bias term obtained from the target speaker's voiceprint feature vector through a linear layer, and then applying normalization and an activation function to obtain the output signal of the gating branch;
and multiplying the output signals of the gating branch and the information branch point by point, then combining the product with the input time-series representation through a residual connection to obtain the fused time-series representation.
Further, the sequence modeling of the time-series representation is performed by a temporal convolutional network, a dual-path recurrent neural network, or Transformers.
Further, the speech decoding network is implemented by a one-dimensional deconvolution layer or a fully connected linear layer.
According to a second aspect of the disclosure, there is provided a neural network model training method, including:
acquiring multi-speaker aliased speech training samples and voiceprint enrollment speech training samples of a target speaker; the multi-speaker aliased speech training samples are obtained by mixing the target speaker's voice, which serves as the ground-truth label, with randomly selected non-target speakers' voices;
inputting the multi-speaker aliased speech training samples into a speech coding network in a preset neural network model to obtain a time-series representation of the aliased speech;
inputting the voiceprint enrollment speech training samples of the target speaker into a speaker coding network in the preset neural network model to obtain the voiceprint features of the target speaker;
inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network in the preset neural network model, and extracting the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data;
inputting the extracted time-series representation of the target speaker's speech into a speech decoding network in the preset neural network model, and reconstructing the time-domain speech signal of the target speaker;
calculating a loss function between the target speaker's voice extracted by the preset neural network model and the target speaker's voice serving as the ground-truth label, updating the parameters of the preset neural network model by gradient back-propagation of the loss function, and ending the training process once the loss function has fully converged, at which point the model is taken as the trained preset neural network model.
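The patent does not name a specific loss function. A minimal NumPy sketch of scale-invariant SDR (SI-SDR), a common training objective for time-domain speech extraction, is shown here only as an assumed example of how the extracted voice could be compared against the ground-truth label:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is
    better). Negative SI-SDR is often minimized during training; this
    is an assumption -- the patent does not specify its loss."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref   # projection onto ref
    e_noise = est - s_target                           # residual distortion
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

rng = np.random.default_rng(6)
ref = rng.standard_normal(16000)                # ground-truth target speech
noisy = ref + 0.1 * rng.standard_normal(16000)  # an imperfect estimate
print(si_sdr(ref, ref) > si_sdr(noisy, ref))    # cleaner estimate scores higher
```

During training, `-si_sdr(estimate, label)` would be averaged over a batch and back-propagated as described above.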
According to a third aspect of the present disclosure, there is provided a speech extraction device including:
the acquisition module, used for acquiring multi-speaker aliased speech data to be processed and voiceprint enrollment speech data of a target speaker; the multi-speaker aliased speech data comprises the voice of the target speaker;
the speech coding network module, used for inputting the multi-speaker aliased speech data into a speech coding network in a trained preset neural network model and obtaining a time-series representation of the aliased speech;
the speaker coding network module, used for inputting the voiceprint enrollment speech data of the target speaker into a speaker coding network in the trained preset neural network model and obtaining the voiceprint features of the target speaker;
the speaker extraction network module, used for inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network in the trained preset neural network model, and extracting the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data;
and the speech decoding network module, used for inputting the extracted time-series representation of the target speaker's speech into a speech decoding network in the trained preset neural network model and reconstructing the time-domain speech signal of the target speaker.
According to a fourth aspect of the present disclosure, there is provided a neural network model training apparatus, including:
the acquisition module, used for acquiring multi-speaker aliased speech training samples and voiceprint enrollment speech training samples of a target speaker; the multi-speaker aliased speech training samples are obtained by mixing the target speaker's voice, which serves as the ground-truth label, with randomly selected non-target speakers' voices;
the speech coding network module, used for inputting the multi-speaker aliased speech training samples into a speech coding network in a preset neural network model and obtaining a time-series representation of the aliased speech;
the speaker coding network module, used for inputting the voiceprint enrollment speech training samples of the target speaker into a speaker coding network in the preset neural network model and obtaining the voiceprint features of the target speaker;
the speaker extraction network module, used for inputting the time-series representation of the aliased speech and the voiceprint features of the target speaker together into a speaker extraction network in the preset neural network model, and extracting the time-series representation of the speech belonging to the target speaker from the multi-speaker aliased speech data;
the speech decoding network module, used for inputting the extracted time-series representation of the target speaker's speech into a speech decoding network in the preset neural network model and reconstructing the time-domain speech signal of the target speaker;
and the loss function calculation module, used for calculating a loss function between the target speaker's voice extracted by the preset neural network model and the target speaker's voice serving as the ground-truth label, updating the parameters of the preset neural network model by gradient back-propagation of the loss function, and ending the training process once the loss function has fully converged, at which point the model is taken as the trained preset neural network model.
According to a fifth aspect of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon; a processor executes the computer program to implement the voice extraction method according to the first aspect.
The invention has the beneficial effects that:
1) Speech features are extracted and encoded in the time domain, avoiding the potential impact of problems such as unstable spectral phase estimation in frequency-domain methods.
2) The target speaker's voiceprint features and the speech representation are fused by a gated convolution fusion technique; through global conditional modeling and a gating mechanism, the features of the two modalities are fully fused while the modality-specific information of each is effectively retained, improving the quality of the extracted target speaker's voice.
3) Through this novel feature fusion scheme, the invention makes full use of the target speaker's voiceprint cue and can accurately and effectively extract the target speaker's voice from multi-speaker aliased speech.
Drawings
FIG. 1 is a flow chart illustrating the steps of a speech extraction method disclosed in the present invention;
FIG. 2 is a block diagram of a speech extraction method according to the present disclosure;
FIG. 3 is a block diagram of a speaker extraction network according to the present disclosure;
FIG. 4 is a flowchart illustrating the steps of a gated convolution fusion method according to the present disclosure;
FIG. 5 is a block diagram of a gated convolution fusion method according to the present disclosure;
FIG. 6 is a flow chart illustrating the steps of a neural network model training method disclosed in the present invention;
FIG. 7 is a block diagram of a speech extracting apparatus according to the present disclosure;
fig. 8 is a block diagram of a neural network model training apparatus according to the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings, but embodiments of the present invention are not limited thereto.
Example 1:
as shown in fig. 1 and fig. 2, the speech extraction method provided in this embodiment includes the following steps:
s1.1, acquiring aliasing voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the multi-speaker aliasing voice data comprises the target speaker voice.
Specifically, taking a sampling rate of 16kHz as an example, a voice segment to be extracted and a voiceprint registration voice segment of a specified target speaker with arbitrary length are collected. The voiceprint enrollment voice of the target speaker means: the clean voice of the targeted speaker is used for voiceprint enrollment.
S1.2, inputting the multi-speaker aliased speech data into the speech coding network in the trained preset neural network model to obtain the time-series representation of the aliased speech.
Specifically, the time-series representation can be extracted by at least a one-dimensional convolutional encoder or a self-supervised pre-trained model. The one-dimensional convolutional encoder can be implemented as a one-dimensional convolutional layer (1-D CNN) followed by a ReLU activation, with kernel size L, stride L/2, 1 input channel, and D output channels. For the self-supervised option, open-source pre-trained models such as Wav2vec 2.0 or HuBERT in their standard configurations can be used to extract the time-series representation.
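As an illustrative sketch (not the patent's implementation; the filter values and dimensions below are made up), the one-dimensional convolutional encoder can be viewed as framing plus projection, here with kernel size L = 16, stride L/2 = 8, and D = 64 output channels:

```python
import numpy as np

def conv1d_encoder(wave, basis):
    """1-D conv encoder: frame the waveform with 50% overlap
    (stride = L // 2), project each L-sample frame onto the
    D learned filters in `basis`, then apply ReLU."""
    D, L = basis.shape
    stride = L // 2
    n_frames = (len(wave) - L) // stride + 1
    frames = np.stack([wave[i * stride : i * stride + L]
                       for i in range(n_frames)])   # (n_frames, L)
    return np.maximum(frames @ basis.T, 0.0)        # (n_frames, D)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)          # 1 s of 16 kHz audio
basis = rng.standard_normal((64, 16))      # D = 64 filters of length L = 16
rep = conv1d_encoder(wave, basis)
print(rep.shape)                           # (1999, 64)
```

In the trained model the filters are learned; random values are used here only to show the shapes.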
S1.3, inputting the voiceprint enrollment speech data of the target speaker into the speaker coding network in the preset neural network model to obtain the voiceprint features of the target speaker.
S1.3.1, obtaining a time-series representation of the voiceprint enrollment speech data of the target speaker using the speech coding network.
Specifically, the speech coding network from S1.2 can be reused directly to extract the time-series representation of the target speaker's voiceprint enrollment speech data.
S1.3.2, modeling the time dependence of the time-series representation using a convolutional or recurrent neural network.
Specifically, the time dependence of the time-series representation can be modeled by stacking multiple convolutional layers with residual connections (CNN) or by bidirectional long short-term memory networks (BiLSTM). In the first case, a convolutional network with n ≥ 5 layers can be used, where the input/output channel numbers are (D, O) for the first layer, (O, O) for the middle layers, (O, P) for the third-from-last layer, (P, P) for the second-from-last layer, and (P, H) for the last layer. All convolutional layers except the first and last use residual connections, and a layer normalization is applied before the first layer. In the second case, an n-layer BiLSTM with input dimension D and hidden dimension H can be used, followed by a ReLU activation and a fully connected layer with input dimension H.
S1.3.3, extracting the voiceprint feature vector of the target speaker from the modeled time-series representation using a pooling layer based on a self-attention mechanism.
Specifically, the self-attention pooling layer consists of a feed-forward network and a pooling network. The feed-forward network consists of two fully connected layers with input/output channels (H, H) and (H, 1), respectively. The pooling network first computes attention coefficients using a masking scheme, then obtains probability weights over all time steps with a softmax function and pools by weighted averaging; finally, the voiceprint feature vector of the target speaker is obtained after a fully connected layer and a tanh activation.
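A minimal NumPy sketch of this self-attention pooling (random weights stand in for trained parameters; the tanh inside the scoring network is an assumption, as the patent does not name the feed-forward activation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(rep, w1, w2, w_out):
    """Self-attention pooling over a (T, H) time series: a two-layer
    feed-forward net with channels (H, H) and (H, 1) scores each time
    step, softmax turns scores into weights over time, a weighted
    average pools the sequence, and a final linear layer + tanh
    yields the voiceprint embedding."""
    h = np.tanh(rep @ w1)            # (T, H); hidden activation assumed
    scores = (h @ w2).ravel()        # (T,) one score per time step
    alpha = softmax(scores)          # attention weights over time
    pooled = alpha @ rep             # (H,) weighted average
    return np.tanh(pooled @ w_out)   # target speaker voiceprint vector

rng = np.random.default_rng(1)
T, H = 50, 32
emb = attention_pool(rng.standard_normal((T, H)),
                     rng.standard_normal((H, H)),
                     rng.standard_normal((H, 1)),
                     rng.standard_normal((H, H)))
print(emb.shape)   # (32,)
```

The masking of padded time steps mentioned in the text is omitted here for brevity.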
As shown in fig. 3, S1.4, the time-series representation of the aliased speech and the voiceprint features of the target speaker are input together into the speaker extraction network in the trained preset neural network model, and the time-series representation of the speech belonging to the target speaker is extracted from the multi-speaker aliased speech data.
S1.4.1, performing feature fusion between the voiceprint feature vector of the target speaker and the input speech time-series representation using the gated convolution fusion method.
S1.4.2, modeling the time dependence of the fused time-series representation, and outputting the modeled time-series representation.
Specifically, the time dependence of the time-series representation can be modeled at least by a temporal convolutional network (TCN), a dual-path recurrent neural network (dual-path RNN), or Transformers. For example, a temporal convolutional network typically consists of 8 stacked time-domain convolutional layers. Each layer performs a feature-dimension transformation through a 1×1 convolution, then a one-dimensional convolution along the time dimension with kernel size K = 3 and stride S = 1, the dilation factor of the x-th layer being 2^(x−1). Before each convolution, layer normalization and a parametric rectified linear unit (PReLU) activation are applied; finally, the feature dimension is restored by a 1×1 convolution and a mask over the time-series representation is output, which is multiplied point by point with the input to obtain the modeled time-series representation.
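The dilation schedule is the key property of this TCN stack. A simplified NumPy sketch of one 'same'-length dilated layer and the 8-layer schedule (per-channel kernels only; the real TCN also mixes channels through the 1×1 convolutions and uses normalization and PReLU, omitted here):

```python
import numpy as np

def dilated_conv_same(x, k, d):
    """'Same'-length dilated 1-D conv, kernel size 3, stride 1.
    x: (T, D); k: (3, D) per-channel weights; d: dilation factor."""
    xp = np.pad(x, ((d, d), (0, 0)))
    return xp[:-2 * d] * k[0] + xp[d:-d] * k[1] + xp[2 * d:] * k[2]

# dilation factor of the x-th layer is 2**(x - 1)
dilations = [2 ** (x - 1) for x in range(1, 9)]   # [1, 2, 4, ..., 128]
# receptive field of the 8-layer stack with kernel 3: 1 + 2 * sum(dilations)
receptive_field = 1 + 2 * sum(dilations)

rng = np.random.default_rng(2)
x = rng.standard_normal((100, 16))
for d in dilations:
    x = dilated_conv_same(x, 0.1 * rng.standard_normal((3, 16)), d)
print(x.shape, receptive_field)   # (100, 16) 511
```

The exponentially growing dilation is what lets 8 thin layers cover a 511-frame context.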
S1.4.3, repeating the above two steps (S1.4.1 and S1.4.2) over multiple stages. Only the first stage's feature fusion takes the time-series representation of the aliased speech as input; each subsequent stage's feature fusion takes as input the speech time-series representation output by the previous stage's time-dependence modeling. That is, in the speaker extraction network, feature fusion and time-dependence modeling are connected in series as one stage, and several such stages are repeated.
Specifically, M = 4 stages of feature fusion and time-dependence modeling are performed. Because the fused features need to be fully expressed to obtain an accurate time-series representation, the last stage may skip the feature fusion step and perform only time-dependence modeling; this strengthens the expressive capacity of the modeling in the last two stages and improves system performance.
S1.4.4, converting the speech time series representation output by the final stage into a mask, multiplying the mask point by point with the time series representation of the aliased speech, and thereby extracting the time series representation of the target speaker's speech.
Specifically, the final-stage speech time series representation can be converted into a mask estimate by a 1×1 convolution followed by a ReLU activation function.
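A brief sketch of this masking step (the feature dimension D is an assumption, not specified by the document):

```python
import torch
import torch.nn as nn

D = 64                                 # assumed feature dimension
mask_head = nn.Sequential(nn.Conv1d(D, D, kernel_size=1), nn.ReLU())

mix_repr = torch.randn(2, D, 100)      # time series representation of aliased speech
final_repr = torch.randn(2, D, 100)    # final-stage output of the extraction network

mask = mask_head(final_repr)           # non-negative mask via 1x1 conv + ReLU
target_repr = mask * mix_repr          # point-by-point multiplication (S1.4.4)
```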
And S1.5, inputting the extracted time series representation of the target speaker's speech into the speech decoding network in the trained preset neural network model, and restoring the time-domain speech signal of the target speaker.
In particular, the speech decoding network may be implemented by a one-dimensional deconvolution layer or a fully connected linear layer. The one-dimensional deconvolution layer usually adopts a deconvolution operation with an input channel count D and an output channel count L.
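Assuming the one-dimensional deconvolution variant and interpreting L as the synthesis window length of an overlap-add decoder (the exact channel layout, window and hop sizes are assumptions), a sketch might be:

```python
import torch
import torch.nn as nn

D, L, hop = 64, 32, 16                 # assumed feature dim, window length, hop
decoder = nn.ConvTranspose1d(in_channels=D, out_channels=1,
                             kernel_size=L, stride=hop)

target_repr = torch.randn(2, D, 100)   # extracted target-speaker representation
waveform = decoder(target_repr)        # (batch, 1, samples): time-domain signal
```

With T input frames, the transposed convolution produces (T − 1) · hop + L output samples, i.e. overlapping windows summed back into a waveform.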
As shown in fig. 4 and fig. 5, the embodiment elaborates the gated convolution fusion method, which includes the following steps:
And S2.1, performing a bias-free (zero-bias) one-dimensional convolution operation between the time series representation input and the convolution kernel of the information branch to obtain the output signal of the information branch.
Specifically, the one-dimensional convolution uses a kernel of size 3 with a padding length of 1; the output signal of the information branch is obtained after the convolution operation is completed.
S2.2, convolving the time series representation input with the convolution kernel of the gating branch, adding a bias term obtained from the target speaker's voiceprint feature vector through a linear layer, and obtaining the output signal of the gating branch through normalization and an activation function;
specifically, the gating branch adopts a one-dimensional convolution operation with a bias term, where the convolution configuration is the same as that of the information branch. The bias term is generated by mapping the target speaker's voiceprint feature vector through a fully connected linear layer, after which layer normalization (LayerNorm) and a sigmoid activation function are applied to obtain the output signal of the gating branch.
And S2.3, multiplying the output signals of the gating branch and the information branch point by point, then connecting the result with the time series representation input through a residual connection to obtain the time series representation after feature fusion.
Specifically, feature fusion of the two modalities is performed by letting the output signal of the gating branch control the transmission of the speech information stream in the information branch. On one hand, the speech information stream in the information branch preserves the complete content information of the target speaker's speech; on the other hand, the output signal of the gating branch folds voiceprint cues into a control signal and steers the extraction of target-speaker information in the information branch through gating, rather than through a direct operation such as simple concatenation. Finally, a residual connection is adopted to improve the convergence of the fusion module within the deep neural network.
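The three steps S2.1–S2.3 can be sketched as a single module (channel and embedding sizes are assumptions; single-group GroupNorm is used here as a channel-wise layer norm):

```python
import torch
import torch.nn as nn

class GatedConvFusion(nn.Module):
    """Gated convolution fusion as described above: a bias-free 1-D conv
    (K=3, padding=1) forms the information branch; the gating branch uses
    the same conv plus a bias produced from the speaker embedding by a
    linear layer, followed by layer norm and a sigmoid; the gated product
    is added back to the input through a residual connection."""

    def __init__(self, channels: int, emb_dim: int):
        super().__init__()
        self.info = nn.Conv1d(channels, channels, 3, padding=1, bias=False)
        self.gate = nn.Conv1d(channels, channels, 3, padding=1, bias=False)
        self.emb2bias = nn.Linear(emb_dim, channels)
        self.norm = nn.GroupNorm(1, channels)    # channel-wise layer norm

    def forward(self, x, emb):                   # x: (B, C, T), emb: (B, E)
        info = self.info(x)                      # S2.1: information branch
        bias = self.emb2bias(emb).unsqueeze(-1)  # (B, C, 1), broadcast over time
        gate = torch.sigmoid(self.norm(self.gate(x) + bias))   # S2.2
        return x + gate * info                   # S2.3: residual connection

fusion = GatedConvFusion(channels=64, emb_dim=128)
y = fusion(torch.randn(2, 64, 100), torch.randn(2, 128))
```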
Example 2
As shown in fig. 6, the neural network model training method provided in this embodiment includes the following steps:
and S3.1, obtaining aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of the target speaker. The multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice.
Specifically, all speech training sample data are resampled at a sampling rate of Fs=16 kHz. Each target-speaker speech sample serving as a real label and each non-target-speaker speech sample are first divided into segments of 4 s duration. For each 4 s target-speaker segment, one 4 s segment is randomly selected from one or more non-target speakers to pair with it; the segments are mixed by amplitude transformation and superposition at a randomly sampled signal-to-noise ratio within a ±5 dB range, generating a 4 s multi-speaker aliased segment. Finally, the multi-speaker aliased segments and the corresponding target-speaker segments are used as inputs and real labels respectively, and are divided into training, validation, and test sets at a common ratio. It should be noted that in this embodiment both the target-speaker speech used to build the aliased speech and the target speaker's voiceprint registration speech come from the target-speaker speech training sample data serving as real labels, but the voiceprint registration speech used in training must be a different segment from the one used in the aliased speech.
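The amplitude-scaling-and-superposition mixing described above can be sketched as follows (the symmetric ±5 dB range and the white-noise stand-in signals are assumptions made for illustration):

```python
import numpy as np

def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float):
    """Scale the interferer so that the target/interferer power ratio
    equals snr_db, then superimpose the two signals (amplitude
    transformation + superposition, as described above)."""
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2)
    scale = np.sqrt(p_t / (p_i * 10 ** (snr_db / 10)))
    return target + scale * interferer

fs = 16000
seg = 4 * fs                                  # 4-second segments at 16 kHz
rng = np.random.default_rng(0)
target = rng.standard_normal(seg)             # stand-in for a target segment
interferer = rng.standard_normal(seg)         # stand-in for a non-target segment
snr_db = rng.uniform(-5, 5)                   # assumed symmetric +/-5 dB range
mixture = mix_at_snr(target, interferer, snr_db)
```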
S3.2, inputting aliasing voice training sample data of multiple speakers into a voice coding network in a preset neural network model, and acquiring time sequence representation of aliasing voice;
s3.3, inputting voiceprint registration voice training sample data of the target speaker into a speaker coding network in a preset neural network model to obtain the voiceprint characteristics of the target speaker;
s3.4, simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a preset neural network model, and extracting the speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
s3.5, inputting the extracted representation of the target speaker voice time sequence into a voice decoder network in a preset neural network model, and restoring a time domain voice signal of the target speaker;
and S3.6, calculating a loss function between the target speaker voice extracted by the preset neural network model and the target speaker voice serving as a real label, updating and training parameters of the preset neural network model based on gradient back propagation of the loss function, and ending the training process after the loss function is completely converged. And determining the trained preset neural network model as the trained preset neural network model.
Specifically, the preset neural network model may be trained with SI-SDR (scale-invariant signal-to-distortion ratio) as the loss function. The training strategy uses the Adam optimizer with an initial learning rate of 1e-3 and a maximum of 100 training epochs; when the validation loss does not decrease for 3 consecutive epochs (does not fall below the previous minimum), the learning rate is halved, and when it does not decrease for 10 consecutive epochs, training is stopped early. The model parameters are saved after training ends.
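A sketch of SI-SDR used as a training loss (this is the standard scale-invariant SDR definition; the zero-mean step is conventional rather than specified by this document):

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (higher is
    better); training minimizes its negative."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # project the estimate onto the reference to get the target component
    s_target = (np.dot(estimate, reference)
                / np.dot(reference, reference)) * reference
    e_noise = estimate - s_target
    return 10 * np.log10(np.sum(s_target ** 2) / np.sum(e_noise ** 2))

ref = np.sin(np.linspace(0, 100, 16000))
est = ref + 0.01 * np.random.default_rng(0).standard_normal(16000)
loss = -si_sdr(est, ref)          # negative SI-SDR as the training loss
```

Because of the projection, rescaling the estimate leaves the value unchanged, which is what makes the metric scale-invariant.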
Example 3
Referring to fig. 7, the present embodiment provides a speech extraction apparatus 100, including:
an acquisition module 110, configured to acquire aliased voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the aliasing voice data of the multiple speakers comprises the voice of the target speaker;
the voice coding network module 120 is configured to input the aliased voice data of the multiple speakers to be extracted into a voice coding network in the trained preset neural network model, and obtain a time sequence representation of the aliased voice;
a speaker coding network module 130, configured to input voiceprint registration voice data of the target speaker into a speaker coding network in the trained preset neural network model, and obtain a voiceprint feature of the target speaker;
the speaker extraction network module 140 is configured to simultaneously input the time series representation of the aliased speech and the voiceprint features of the target speaker into the speaker extraction network in the trained preset neural network model, and extract the speech time series characterization belonging to the target speaker from the multi-speaker aliased speech data;
the speech decoding network module 150 is configured to input the extracted representation of the time sequence of the target speaker speech into a speech decoding network in a trained preset neural network model, and restore a time-domain speech signal of the target speaker.
Example 4
Referring to fig. 8, the present embodiment provides a neural network model training apparatus 200, including:
the acquisition module 210 is configured to acquire aliased voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
the voice coding network module 220 is configured to input aliased voice training sample data of multiple speakers to a voice coding network in a preset neural network model, and acquire a time sequence representation of the aliased voice;
a speaker encoder network module 230, configured to input voiceprint registration voice training sample data of a target speaker into a speaker encoder network in a preset neural network model, and obtain a voiceprint feature of the target speaker;
the speaker extraction network module 240 is configured to simultaneously input the time series representation of the aliased speech and the voiceprint features of the target speaker into the speaker extraction network in the preset neural network model, and extract a speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
the voice decoding network module 250 is configured to input the extracted time series representation of the target speaker to a voice decoding network in a preset neural network model, and restore a time-domain voice signal of the target speaker;
and the loss function calculation module 260 is configured to calculate a loss function between the target speaker speech extracted by the preset neural network model and the target speaker speech serving as a real label, update and train the parameters of the preset neural network model based on gradient back-propagation of the loss function, end the training process after the loss function has fully converged, and take the resulting model as the trained preset neural network model.
Example 5
The present embodiment provides a computer-readable storage medium, in which a computer program is stored, and a processor executes the computer program to implement the speech extraction method according to embodiment 1.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (11)
1. A method of speech extraction, comprising:
acquiring aliasing voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the multi-speaker aliasing voice data comprises the voice of a target speaker;
inputting the aliasing voice data of a plurality of speakers to be extracted into a voice coding network in a trained preset neural network model, and acquiring the time sequence representation of the aliasing voice;
inputting the voiceprint registration voice data of the target speaker into a speaker coding network in a trained preset neural network model to obtain the voiceprint characteristics of the target speaker;
simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a trained preset neural network model, and extracting the speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
inputting the extracted representation of the target speaker voice time sequence into a voice decoding network in a trained preset neural network model, and restoring a time domain voice signal of the target speaker;
the method for constructing the speaker coding network comprises the following steps:
acquiring time series representation of voiceprint registration voice data of a target speaker by adopting the voice coding network;
modeling the time dependence of the time sequence representation by adopting a convolution or circulation neural network;
and extracting the voiceprint feature vector of the target speaker from the modeled time series representation by using a pooling layer based on a self-attention mechanism.
2. The speech extraction method of claim 1, wherein the method of constructing the speech coding network comprises: and extracting the time sequence representation by adopting a one-dimensional convolution coder or a self-supervision pre-training model.
3. The speech extraction method of claim 1, wherein the method of constructing the speaker extraction network comprises:
performing feature fusion on the voiceprint feature vector of the target speaker and the corresponding input speech time series characterization by using a gated convolution fusion method;
modeling the time dependence relationship of the time series representation obtained after feature fusion, and outputting the time series representation after modeling;
in the speaker extraction network, the feature fusion and the time dependency modeling are connected in series to serve as one stage, the processing of a plurality of stages is repeated, only the time sequence representation of aliasing voice is input in the feature fusion of the first stage, and then the voice time sequence representation required by the feature fusion of each stage is input into the voice time sequence representation output of the previous stage after the time dependency modeling processing;
and converting the voice time series representation output of the final stage into a mask, and multiplying the mask and the time series representation of the aliasing voice point by point to extract the time series representation of the voice of the target speaker.
4. The speech extraction method according to claim 3, wherein the step of not performing feature fusion is selected in a final stage, and only the process of modeling the time dependency is performed.
5. The method of speech extraction of claim 3 wherein said gated convolution fusion method comprises:
performing zero-offset one-dimensional convolution operation on the time sequence characterization input and the convolution kernel of the information branch to obtain an output signal of the information branch;
convolving the time series representation input with the convolution kernel of the gating branch, adding a bias term obtained from the target speaker's voiceprint feature vector through a linear layer, and obtaining the output signal of the gating branch through normalization and an activation function;
and multiplying the output signals of the gating branch and the information branch point by point, and then connecting the multiplied output signals with the time sequence representation input in a residual error connection mode to obtain the time sequence representation after feature fusion.
6. The method of speech extraction according to claim 3, wherein the sequence modeling of the time series characterization is performed by a time convolutional network, a dual-path recurrent neural network, or a Transformer.
7. The speech extraction method of claim 1, wherein the speech decoding network is implemented by a one-dimensional deconvolution layer or a fully-connected linear layer.
8. A neural network model training method is characterized by comprising the following steps:
acquiring aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
inputting multi-speaker aliasing voice training sample data into a voice coding network in a preset neural network model, and acquiring time sequence representation of aliasing voice;
inputting voiceprint registration voice training sample data of a target speaker into a speaker coding network in a preset neural network model to obtain voiceprint characteristics of the target speaker;
simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a preset neural network model, and extracting the speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a preset neural network model, and restoring a time domain voice signal of the target speaker;
calculating a loss function between the target speaker speech extracted by the preset neural network model and the target speaker speech serving as a real label, updating and training the parameters of the preset neural network model based on gradient back-propagation of the loss function, finishing the training process after the loss function has fully converged, and taking the resulting model as the trained preset neural network model;
the method for constructing the speaker coding network comprises the following steps:
acquiring time series representation of voiceprint registration voice data of a target speaker by adopting the voice coding network;
modeling the time dependence of the time sequence representation by adopting a convolution or circulation neural network;
and extracting the voiceprint characteristic vector of the target speaker from the time series representation after the modeling processing by adopting a pooling layer based on a self-attention mechanism.
9. A speech extraction device, comprising:
the acquisition module is used for acquiring the aliasing voice data of multiple speakers to be extracted and the voiceprint registration voice data of a target speaker; the aliasing voice data of the multiple speakers comprises the voice of the target speaker;
the voice coding network module is used for inputting the aliased voice data of the multiple speakers to be extracted into a voice coding network in a trained preset neural network model and acquiring the time sequence representation of the aliased voice;
the speaker coding network module is used for inputting the voiceprint registration voice data of the target speaker into the speaker coding network in the trained preset neural network model to acquire the voiceprint characteristics of the target speaker;
the speaker extraction network module is used for simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a trained preset neural network model and extracting the speech time series characterization belonging to the target speaker from the multi-speaker aliased speech data;
the voice decoding network module is used for inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a trained preset neural network model and restoring a time domain voice signal of the target speaker;
the method for constructing the speaker coding network comprises the following steps:
acquiring time series representation of voiceprint registration voice data of a target speaker by adopting the voice coding network;
modeling the time dependence of the time sequence representation by adopting a convolution or circulation neural network;
and extracting the voiceprint feature vector of the target speaker from the modeled time series representation by using a pooling layer based on a self-attention mechanism.
10. A neural network model training device, comprising:
the acquisition module is used for acquiring aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
the voice coding network module is used for inputting aliasing voice training sample data of multiple speakers into a voice coding network in a preset neural network model and acquiring a time sequence representation of aliasing voices;
the speaker coding network module is used for inputting the voiceprint registration voice training sample data of the target speaker into a speaker coding network in a preset neural network model to acquire the voiceprint characteristics of the target speaker;
the speaker extraction network module is used for simultaneously inputting the time series representation of the aliased speech and the voiceprint features of the target speaker into a speaker extraction network in a preset neural network model and extracting the speech time series representation belonging to the target speaker from the multi-speaker aliased speech data;
the voice decoding network module is used for inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a preset neural network model and restoring a time domain voice signal of the target speaker;
the loss function calculation module is used for calculating a loss function between the target speaker speech extracted by the preset neural network model and the target speaker speech serving as a real label, updating and training the parameters of the preset neural network model based on gradient back-propagation of the loss function, finishing the training process after the loss function has fully converged, and taking the resulting model as the trained preset neural network model;
the method for constructing the speaker coding network comprises the following steps:
acquiring time series representation of voiceprint registration voice data of a target speaker by adopting the voice coding network;
modeling the time dependence of the time sequence representation by adopting a convolution or circulation neural network;
and extracting the voiceprint characteristic vector of the target speaker from the time series representation after the modeling processing by adopting a pooling layer based on a self-attention mechanism.
11. A computer-readable storage medium, having a computer program stored thereon, wherein a processor executes the computer program to implement the speech extraction method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211037918.4A CN115116448B (en) | 2022-08-29 | 2022-08-29 | Voice extraction method, neural network model training method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115116448A CN115116448A (en) | 2022-09-27 |
CN115116448B (en) | 2022-11-15
Family
ID=83336384
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211037918.4A Active CN115116448B (en) | 2022-08-29 | 2022-08-29 | Voice extraction method, neural network model training method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115116448B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117711420A (en) * | 2023-07-17 | 2024-03-15 | 荣耀终端有限公司 | Target voice extraction method, electronic equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11242499A (en) * | 1997-08-29 | 1999-09-07 | Toshiba Corp | Voice encoding and decoding method and component separating method for voice signal |
CN110287320A (en) * | 2019-06-25 | 2019-09-27 | Beijing University of Technology | A multi-class sentiment analysis model combining deep learning with an attention mechanism
CN111243579A (en) * | 2020-01-19 | 2020-06-05 | 清华大学 | Time domain single-channel multi-speaker voice recognition method and system |
CN111653288A (en) * | 2020-06-18 | 2020-09-11 | 南京大学 | Target person voice enhancement method based on conditional variation self-encoder |
CN112071325A (en) * | 2020-09-04 | 2020-12-11 | 中山大学 | Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling |
CN113053407A (en) * | 2021-02-06 | 2021-06-29 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for multiple speakers |
CN113571074A (en) * | 2021-08-09 | 2021-10-29 | 四川启睿克科技有限公司 | Voice enhancement method and device based on multi-band structure time domain audio separation network |
CN114329036A (en) * | 2022-03-16 | 2022-04-12 | 中山大学 | Cross-modal characteristic fusion system based on attention mechanism |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10811000B2 (en) * | 2018-04-13 | 2020-10-20 | Mitsubishi Electric Research Laboratories, Inc. | Methods and systems for recognizing simultaneous speech by multiple speakers |
US20210272573A1 (en) * | 2020-02-29 | 2021-09-02 | Robert Bosch Gmbh | System for end-to-end speech separation using squeeze and excitation dilated convolutional neural networks |
CN114333896A (en) * | 2020-09-25 | 2022-04-12 | 华为技术有限公司 | Voice separation method, electronic device, chip and computer readable storage medium |
CN114495973A (en) * | 2022-01-25 | 2022-05-13 | 中山大学 | Special person voice separation method based on double-path self-attention mechanism |
Non-Patent Citations (2)
Title |
---|
SpEx: Multi-Scale Time Domain Speaker;Chenglin Xu;《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》;20200414;第1370-1384页 * |
Time-domain speech separation algorithms based on deep neural networks; Ding Hui; 《China Master's Theses Full-text Database》; 20220315 (No. 3); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program | |
CN111243620B (en) | Voice separation model training method and device, storage medium and computer equipment | |
Ravanelli et al. | Multi-task self-supervised learning for robust speech recognition | |
WO2021143327A1 (en) | Voice recognition method, device, and computer-readable storage medium | |
CN107680611B (en) | Single-channel sound separation method based on convolutional neural network | |
Feng et al. | Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition | |
KR101807948B1 (en) | Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same | |
CN110767244B (en) | Speech enhancement method | |
CN112071329A (en) | Multi-person voice separation method and device, electronic equipment and storage medium | |
KR20160032536A (en) | Signal process algorithm integrated deep neural network based speech recognition apparatus and optimization learning method thereof | |
KR101807961B1 (en) | Method and apparatus for processing speech signal based on lstm and dnn | |
JP2008152262A (en) | Method and apparatus for transforming speech feature vector | |
CN110189761B (en) | Single-channel speech dereverberation method based on greedy depth dictionary learning | |
CN111899757A (en) | Single-channel voice separation method and system for target speaker extraction | |
CN112037809A (en) | Residual echo suppression method based on multi-feature flow structure deep neural network | |
CN115116448B (en) | Voice extraction method, neural network model training method, device and storage medium | |
Kothapally et al. | Skipconvgan: Monaural speech dereverberation using generative adversarial networks via complex time-frequency masking | |
Lin et al. | Speech enhancement using forked generative adversarial networks with spectral subtraction | |
CN117174105A (en) | Speech noise reduction and dereverberation method based on improved deep convolutional network | |
CN113823273A (en) | Audio signal processing method, audio signal processing device, electronic equipment and storage medium | |
Nakagome et al. | Mentoring-Reverse Mentoring for Unsupervised Multi-Channel Speech Source Separation. | |
CN112037813B (en) | Voice extraction method for high-power target signal | |
CN113870893A (en) | Multi-channel double-speaker separation method and system | |
CN113241092A (en) | Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network | |
Tamura et al. | Improvements to the noise reduction neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||