CN112634875B - Voice separation method, voice separation device, electronic device and storage medium - Google Patents

Voice separation method, voice separation device, electronic device and storage medium

Info

Publication number
CN112634875B
Authority
CN
China
Prior art keywords
training
sequence
embedded
voice
speech
Prior art date
Legal status
Active
Application number
CN202110237579.3A
Other languages
Chinese (zh)
Other versions
CN112634875A (en)
Inventor
史王雷
王秋明
Current Assignee
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yuanjian Information Technology Co Ltd
Priority to CN202110237579.3A
Publication of CN112634875A
Application granted
Publication of CN112634875B
Status: Active

Classifications

    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G06F16/35 Information retrieval of unstructured textual data; Clustering; Classification
    • G06F18/214 Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/045 Neural networks; Combinations of networks
    • G06N3/084 Neural network learning methods; Backpropagation, e.g. using gradient descent
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/26 Speech to text systems

Abstract

The application provides a voice separation method, a voice separation device, an electronic device and a storage medium. The voice separation method comprises the following steps: acquiring original audio, and extracting a spectrogram feature sequence from the original audio in a time-window sliding manner; inputting the spectrogram feature sequence into a pre-trained voice segmentation model, and acquiring an embedded feature sequence through the voice segmentation model; inputting the embedded feature sequence into a pre-trained voice clustering model, and obtaining a prediction label sequence corresponding to the embedded feature sequence through the voice clustering model; and restoring single-speaker speech according to the prediction label sequence to generate separated voice. The voice separation method, the voice separation device, the electronic device and the storage medium solve the problem of unsatisfactory voice separation: the voice segments belonging to a single speaker can be separated from a short-duration audio file in which multiple people speak alternately, and the number of speakers can be accurately estimated by using context information.

Description

Voice separation method, voice separation device, electronic device and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech separation method, a speech separation apparatus, an electronic device, and a storage medium.
Background
The cocktail party problem is a classic problem in the field of computer speech processing: when a single speaker is speaking, speech recognition technology can usually recognize the spoken content accurately, but when the scene contains multiple speakers, the accuracy of speech recognition drops greatly.
Generally, the audio to be separated contains a large number of short utterances, and these short utterances carry little information and are not very distinctive, so voice separation is difficult; moreover, the number of speakers cannot be accurately estimated from the context information, so the separation effect is not ideal.
Disclosure of Invention
In view of the problems that existing speech recognition technology has difficulty separating short-duration speech and cannot accurately estimate the number of speakers from context information, the present application provides a voice separation method, a voice separation device, an electronic device and a storage medium.
According to a first aspect of the present application, there is provided a speech separation method comprising: acquiring original audio, and extracting a spectrogram feature sequence from the original audio in a time window sliding mode; inputting the spectrogram feature sequence into a pre-trained voice segmentation model, and acquiring an embedded feature sequence through the voice segmentation model; inputting the embedded characteristic sequence into a pre-trained voice clustering model, and obtaining a prediction label sequence corresponding to the embedded characteristic sequence through the voice clustering model; and carrying out voice reduction of a single speaker according to the prediction tag sequence to generate separated voice.
Alternatively, the speech clustering model may be trained by: acquiring a plurality of groups of original audio samples, wherein each group of original audio samples comprises a plurality of single speaker original audio samples belonging to a plurality of speakers respectively; acquiring a training embedded feature sample sequence from each group of original audio samples in the multiple groups of original audio samples; training the voice clustering model by using a plurality of training embedded feature sample sequences of the plurality of groups of original audio samples, wherein the training embedded feature sample sequences are obtained by the following method: extracting a spectrogram characteristic sample of each speaker from each single speaker original audio sample of each group of original audio samples in a time window sliding mode; inputting the spectrogram characteristic sample of each speaker into a pre-trained voice segmentation model to obtain a training embedded characteristic sample of each speaker; and randomly splicing the training embedded feature samples of the multiple speakers to obtain a training embedded feature sample sequence comprising the multiple training embedded feature samples.
Optionally, training the speech clustering model by using a plurality of training embedded feature sample sequences of the plurality of sets of original audio samples may include: labeling the identity label of the speaker for each training embedded characteristic sample sequence in the training embedded characteristic sample sequences to determine a test label sample sequence; based on each training embedded characteristic sample sequence in the training embedded characteristic sample sequences, acquiring a training prediction label sequence by using a voice clustering model; and training the voice clustering model according to prior probability based on the training prediction label sequence and the test label sample sequence, wherein the prior probability refers to the probability that the next predicted training prediction label is changed according to the predicted training prediction label.
Optionally, the prior probability includes a speaker label assignment sequence probability, which can be determined by: determining the speaker label assignment sequence probability according to the number of speaker changes among the predicted training prediction labels and the total number of predicted training prediction labels.
Optionally, determining the speaker label assignment sequence probability according to the number of speaker changes among the predicted training prediction labels and the total number of predicted training prediction labels includes: determining a statistical parameter of the speaker label assignment sequence probability according to the number of speaker changes among the predicted training prediction labels and the total number of predicted training prediction labels; and determining the speaker label assignment sequence probability according to the statistical parameter, wherein the statistical parameter can be expressed as:

$$\hat{\alpha}=\frac{\sum_{m=1}^{|D|}\sum_{i=1}^{N-1}\mathbb{1}\left(y_{m,i}\neq y_{m,i+1}\right)}{\sum_{m=1}^{|D|}\left|Y_{m}\right|}$$

wherein $\hat{\alpha}$ represents the statistical parameter; $|D|$ represents the total number of training embedded feature sample sequences in the plurality of training embedded feature sample sequences; $m$ represents the $m$-th training embedded feature sample sequence, $m=1,\ldots,|D|$; $Y_{m}=\{y_{m,1},\ldots,y_{m,i},y_{m,i+1},\ldots,y_{m,N}\}$ represents the training prediction label sequence corresponding to the $m$-th training embedded feature sample sequence; $|Y_{m}|$ represents the total number of training prediction label values in that sequence; $y_{m,i}$ represents the training prediction label value of the $i$-th embedded feature sample in the $m$-th training prediction label sequence; $N$ represents the total number of training prediction labels in the $m$-th training prediction label sequence; and $i=1,\ldots,N-1$.
Optionally, the plurality of training embedded feature sample sequences may include L training embedded feature sample sequences, and training the speech clustering model with the plurality of training embedded feature sample sequences of the plurality of groups of original audio samples may include: determining a loss function based on the mean vector of the 1st through (i-1)-th training embedded feature sample sequences among the L training embedded feature sample sequences and the mean vector of the i-th through L-th training embedded feature sample sequences among the L training embedded feature sample sequences, wherein i and L are integers satisfying 1 < i < L; and training the speech clustering model based on the loss function.
Alternatively, the speech segmentation model may be trained by: acquiring an original audio sample and a test embedded feature sample, and extracting a spectrogram feature sample from the original audio sample in a time-window sliding manner; acquiring a training prediction embedded feature from the spectrogram feature sample according to a feature aggregation formula; and training the speech segmentation model based on the test embedded feature sample and the training prediction embedded feature, wherein the feature aggregation formula is expressed as:

$$V_{j,k}=\sum_{i=1}^{N}\frac{e^{a_{k}^{T}\left(X_{i}-c_{k}\right)}}{\sum_{k'=1}^{K+G}e^{a_{k'}^{T}\left(X_{i}-c_{k'}\right)}}\left(X_{i,j}-c_{k,j}\right)$$

wherein $V_{j,k}$ is the feature aggregation quantity; $X_{i,j}$ represents the $j$-th feature value of the $i$-th local descriptor of the features obtained from the spectrogram feature sample through the convolution kernel operation, $j=1,\ldots,J$, with $J$ the total number of feature values contained in a local descriptor; $c_{k,j}$ represents the $j$-th feature value of the $k$-th cluster center among the valid speech cluster centers of the features obtained from the spectrogram feature sample through the convolution kernel operation; $K$ represents the number of valid speech cluster centers, $k=1,\ldots,K$; $G$ represents the number of noise point cluster centers; $T$ represents the vector transpose operation; $N$ represents the total number of local descriptors of the features obtained from the spectrogram feature sample through the convolution kernel operation, $i=1,\ldots,N$; $a_{k}$ represents the weight coefficient of the cluster center; and $c_{k',j}$ represents the $j$-th feature value of the $k'$-th cluster center among the valid speech cluster centers and the noise point cluster centers, $k'=1,\ldots,K+G$.
According to a second aspect of the present application, there is provided a speech separation apparatus comprising: the extraction unit is used for acquiring original audio and extracting a spectrogram characteristic sequence from the original audio in a time window sliding mode; the segmentation unit is used for inputting the spectrogram feature sequence into a pre-trained voice segmentation model and acquiring an embedded feature sequence through the voice segmentation model; the clustering unit is used for inputting the embedded characteristic sequence into a pre-trained voice clustering model and obtaining a prediction label sequence corresponding to the embedded characteristic sequence through the voice clustering model; and the separation unit is used for carrying out single speaker voice reduction according to the prediction tag sequence to generate separated voice.
According to a third aspect of the present application, there is provided an electronic device comprising: a processor; a memory storing a computer program which, when executed by the processor, implements the speech separation method according to the first aspect.
According to a fourth aspect of the present application there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech separation method according to the first aspect.
According to the voice separation method, the voice separation device, the electronic device and the storage medium, the voice segments belonging to a single speaker can be separated from a short-duration audio file in which multiple people speak alternately, and the number of speakers can be accurately estimated by using context information.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 shows a schematic flow diagram of a speech separation method according to an embodiment of the application;
FIG. 2 shows a flow diagram of each step in a speech separation method according to an embodiment of the application;
FIG. 3 is a diagram illustrating the determination of a loss function in the step of training a speech clustering model in a speech separation method according to an embodiment of the present application;
fig. 4 shows a schematic block diagram of a speech separation apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
One aspect of the present application relates to a method of speech separation. The voice separation method can generate separated voice by firstly dividing and then clustering spectrogram characteristics extracted from the voice by a preset time window and window shift, thereby separating voice segments belonging to a single speaker from a short-time voice audio file in which a plurality of persons speak alternately.
It is noted that, before the filing of the present application, existing speech recognition technology could not accurately separate short-duration speech. The speech separation method provided by the present application takes temporal continuity into account by applying a predetermined time window and window shift, so that even for short-duration speech the speaker can be well distinguished by combining context information, and the number of speakers and the speaker change time points can be accurately estimated.
Fig. 1 shows a flow chart of a speech separation method according to an embodiment of the present application, and fig. 2 shows a flow chart of each step in the speech separation method according to an embodiment of the present application.
As shown in fig. 1 and 2, a speech separation method according to an embodiment of the present application may include the steps of:
and step S10, acquiring the original audio, and extracting a spectrogram feature sequence from the original audio in a time window sliding mode.
In this step, the original audio may be an audio file to be voice-separated, and the original audio may include a voice scene in which a plurality of speakers alternately speak. The speech separation method of embodiments of the present application can identify speech segments for each of a plurality of speakers from raw audio and can generate speech segments for a single speaker.
The spectrogram feature may be a visual feature obtained by performing spectral analysis on the original audio.
As an example, before the method partitions the original audio, a spectrogram feature sequence may be extracted from the original audio with a predetermined time window and a predetermined time window shift.
Specifically, a time window for speech extraction may be set in advance, and then a plurality of spectrogram features may be extracted from the original audio in a sliding window manner from a start time to an end time of the original audio with a predetermined time window shift, thereby forming a spectrogram feature sequence.
The length of the time window may be arbitrarily selected, for example, it may have a fixed window length of 1.2 seconds. As an example, the time window shift may be half the length of the time window, e.g., when the length of the time window is 1.2 seconds, the time window shift may be 0.6 seconds.
In the present application, spectrogram features are extracted from the original audio in a sliding-window manner, so that temporal continuity can be taken into account: the overlapping portion of adjacent windows is shared by the segmentation of both windows, which establishes a correlation between the spectrogram features of neighbouring segments. Thus, even for short-duration speech, the speaker can be well distinguished by combining context information, and the time point of a speaker change can be accurately estimated.
Optionally, the original audio may be preprocessed before the spectrogram features are extracted. As an example, as shown in fig. 2, silence detection may be performed on the acquired original audio and silence segments may be removed to obtain a plurality of separated effective speech segments; the effective speech segments may then be spliced to form the preprocessed original audio, and the spectrogram features may be extracted from this preprocessed audio. Removing the silence segments both speeds up speech separation and removes the interference of silence on the separation result. In other words, an effective audio formed by splicing the effective speech segments is obtained by removing the silence segments from the original audio, and a plurality of spectrogram features are then extracted from this effective audio in a time-window sliding manner to obtain the spectrogram feature sequence.
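As a concrete illustration of this step, the following is a minimal sketch in Python, assuming 16 kHz audio, librosa for the STFT, the 1.2 s window and 0.6 s shift mentioned above, and a simple energy threshold standing in for the unspecified silence detector; the file name is hypothetical.

```python
# Minimal sketch of step S10: energy-based silence removal (an assumption, the patent
# does not fix the detector) followed by sliding-window log-spectrogram extraction.
import numpy as np
import librosa


def remove_silence(wav, sr, frame_s=0.025, hop_s=0.010, floor_db=-40.0):
    """Drop low-energy frames and splice the remaining effective speech together."""
    frame, hop = int(frame_s * sr), int(hop_s * sr)
    rms = librosa.feature.rms(y=wav, frame_length=frame, hop_length=hop)[0]
    keep = librosa.amplitude_to_db(rms, ref=np.max(rms) + 1e-9) > floor_db
    mask = np.zeros(len(wav), dtype=bool)
    for i, k in enumerate(keep):
        if k:
            mask[i * hop: i * hop + frame] = True
    return wav[mask]


def spectrogram_sequence(wav, sr, win_s=1.2, shift_s=0.6, n_fft=512, hop=160):
    """Slide a fixed time window over the audio; one log spectrogram per window."""
    win, shift = int(win_s * sr), int(shift_s * sr)
    feats = []
    for start in range(0, max(len(wav) - win, 0) + 1, shift):
        chunk = wav[start:start + win]
        spec = np.abs(librosa.stft(chunk, n_fft=n_fft, hop_length=hop))
        feats.append(np.log(spec + 1e-6))
    return np.stack(feats)


wav, sr = librosa.load("original_audio.wav", sr=16000)   # hypothetical input file
spec_seq = spectrogram_sequence(remove_silence(wav, sr), sr)
```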
And step S20, inputting the spectrogram feature sequence into a pre-trained voice segmentation model, and acquiring an embedded feature sequence through the voice segmentation model.
In this step, the pre-trained speech segmentation model may segment the spectrogram feature sequence and output an embedded feature sequence comprising embedding features. Here, the spectrogram feature can be regarded as a low-level speech feature, while the embedded feature is a high-level speech feature: a speech feature obtained from the spectrogram feature whose degree of discrimination between features is more pronounced than that of the spectrogram feature itself.
Here, the embedded feature may represent a single speech segment, and as shown in fig. 2, the embedded feature sequence includes embedded features 1 through n, where n represents the total number of embedded features in the embedded feature sequence.
As an example, the speech segmentation model of the embodiment of the present application may be modeled on a Convolutional Neural Network (CNN). However, in an ordinary convolutional neural network, as the number of convolutional layers grows, some effective features learned in the shallow convolutions may be lost, and in the aggregation from the frame level to the utterance level the resulting utterance-level embedding features are not distinctive enough, so the speech segmentation effect is poor.
In this regard, in one example, the speech segmentation model of the embodiment of the present application is based on the ResNet architecture of the convolutional neural network and introduces the GhostVLAD algorithm in the feature aggregation stage, producing utterance-level embedded features based on an attention mechanism.
As an example, the speech segmentation model may be trained by:
and step S21, acquiring an original audio sample and a test embedded feature sample, and extracting a spectrogram feature sample from the original audio sample in a time window sliding window mode.
In this step, an existing original audio sample and a test embedded feature sample corresponding to the original audio sample may be obtained for training the speech segmentation model. The test embedded feature samples may represent actual embedded feature samples corresponding to the original audio samples, which may be obtained in any manner, e.g., by other pre-trained speech segmentation models.
Specifically, spectrogram feature samples may be extracted from the original audio sample with a predetermined time window, which may have a fixed window length, for example 2.5 seconds, i.e., a single utterance randomly intercepted from the original audio sample has a length of 2.5 seconds; accordingly, the dimension of the extracted spectrogram feature sample may be (257, 250, 1). However, the present application is not limited thereto, and spectrogram features may be extracted from an original audio sample with a time window of any length. Further, as an example, the batch size of the training may be set to 96, i.e., each training step takes 96 original audio samples as input.
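As a sanity check on the quoted dimensions, the short sketch below shows how a (257, 250, 1) sample arises under commonly used STFT settings (16 kHz sampling, 512-point FFT, 10 ms hop); these settings are assumptions, since the text does not state them.

```python
# Sanity check of the (257, 250, 1) sample shape quoted above, assuming 16 kHz audio,
# a 512-point FFT and a 10 ms hop; these settings are not stated in the text.
sr, n_fft, hop_s, clip_s = 16000, 512, 0.010, 2.5

freq_bins = n_fft // 2 + 1            # 257 one-sided frequency bins
frames = int(clip_s / hop_s)          # 250 frames in a 2.5 s clip
channels = 1                          # a single magnitude channel

print((freq_bins, frames, channels))  # -> (257, 250, 1)
```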
And step S22, obtaining a prediction embedded feature sample according to a feature aggregation formula based on the spectrogram feature sample.
As an example, in the improved convolutional neural network structure of the embodiment of the present application, the convolutional layers may be divided into a plurality of residual block portions, and each residual block portion may include one or more convolution layers. As an example, there may be 9 residual block portions, each including two convolution layers, i.e., the convolutional part contains 18 convolution layers in total.
In the aggregation stage of the network structure, the GhostVLAD algorithm can be applied one or more times, and the feature aggregation quantities obtained by these GhostVLAD operations can be fused into the final utterance-level embedded feature through a skip connection of the feature vectors.
As an example, the aggregation stage may apply the GhostVLAD algorithm twice: the output vector of the layer-9 convolution may undergo the first GhostVLAD feature aggregation, the output vector of the layer-18 convolution may undergo the second GhostVLAD feature aggregation, and the features aggregated by the two GhostVLAD operations may then be fused into the final utterance-level embedded feature through a skip connection of the feature vectors.
Here, the GhostVLAD algorithm may be regarded as an attention mechanism model built on a statistical method: it accumulates the residuals of the different feature vectors of different speech segment samples in a soft-assignment manner, and assigns different weights to the features through a softmax function according to the distance from each feature of the different speech segment samples to the corresponding cluster center, thereby introducing an attention mechanism into the feature aggregation process.
In addition, the GhostVLAD algorithm can reduce noise while aggregating features. Specifically, the different features within the same speech segment contribute to the residual sum vectors according to their distances to the corresponding valid cluster centers; likewise, codebook (ghost) centers are provided for clustering the noise points in the spectrogram features, but in the calculation stage no weight is given to the noise centers, so the noise features do not participate in the generation of the residual sum vectors, and noise reduction is thereby achieved.
As an example, the feature aggregation formula of the GhostVLAD algorithm may be expressed as:

$$V_{j,k}=\sum_{i=1}^{N}\frac{e^{a_{k}^{T}\left(X_{i}-c_{k}\right)}}{\sum_{k'=1}^{K+G}e^{a_{k'}^{T}\left(X_{i}-c_{k'}\right)}}\left(X_{i,j}-c_{k,j}\right)$$

wherein $V_{j,k}$ is the feature aggregation quantity; $X_{i,j}$ represents the $j$-th feature value of the $i$-th local descriptor of the spectrogram features, $j=1,\ldots,J$, with $J$ the total number of feature values contained in a local descriptor, which may be determined by the size of the convolution kernel; $c_{k,j}$ represents the $j$-th feature value of the $k$-th cluster center among the valid speech cluster centers of the features obtained from the spectrogram feature sample through the convolution kernel operation; $K$ represents the number of valid speech cluster centers, $k=1,\ldots,K$; $G$ represents the number of noise point cluster centers; $T$ represents the vector transpose operation; $N$ represents the total number of local descriptors of the features obtained from the spectrogram feature sample through the convolution kernel operation, $i=1,\ldots,N$; $a_{k}$ represents the weight coefficient of the cluster center; and $c_{k',j}$ represents the $j$-th feature value of the $k'$-th cluster center among the valid speech cluster centers and the noise point cluster centers, $k'=1,\ldots,K+G$.
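For illustration, the following is a PyTorch sketch of a GhostVLAD-style aggregation layer. It follows the published GhostVLAD formulation with a convolution-based soft assignment and learned cluster centers; this parameterization, the cluster counts and the feature dimension are assumptions and may differ in detail from the exact formula above.

```python
# A PyTorch sketch of a GhostVLAD-style aggregation layer: soft assignment over
# K valid + G ghost clusters, residual aggregation over the K valid clusters only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GhostVLAD(nn.Module):
    def __init__(self, feat_dim, num_clusters=8, num_ghost=2):
        super().__init__()
        self.K, self.G = num_clusters, num_ghost
        # Soft-assignment weights over K valid + G ghost clusters.
        self.assign = nn.Conv1d(feat_dim, num_clusters + num_ghost, kernel_size=1)
        # Centers are kept only for the K valid clusters.
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim) * 0.1)

    def forward(self, x):
        # x: (batch, feat_dim, N) local descriptors from the convolutional backbone.
        a = F.softmax(self.assign(x), dim=1)           # (B, K+G, N) soft assignment
        a = a[:, : self.K, :]                          # ghost clusters get no weight
        # Residuals to each valid center: (B, K, feat_dim, N)
        resid = x.unsqueeze(1) - self.centers.unsqueeze(0).unsqueeze(-1)
        v = (a.unsqueeze(2) * resid).sum(dim=-1)       # (B, K, feat_dim)
        v = F.normalize(v, p=2, dim=-1)                # intra-normalization
        return F.normalize(v.flatten(1), p=2, dim=-1)  # utterance-level embedding


# e.g. descriptors taken from the layer-9 or layer-18 convolution output:
pooled = GhostVLAD(feat_dim=256)(torch.randn(4, 256, 100))
print(pooled.shape)                                    # torch.Size([4, 2048])
```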
As an example, in training a speech segmentation model, predictive embedded features may be output from a fully-connected layer to compose a sequence of predictive embedded features.
In the application, an attention mechanism is introduced into feature aggregation, so that the comprehensive characterization capability of the speech-level embedded features output by a trained speech segmentation model is stronger, and the distinctiveness is better.
And step S23, training a voice segmentation model based on the test embedded characteristic sample and the prediction embedded characteristic sample.
In this step, the speech segmentation model may be trained using an AM-softmax loss function based on the test embedded feature samples and the predicted embedded feature samples, as an example.
Specifically, the AM-softmax loss function enlarges the optimization space of the model by increasing the inter-class distance, so that the differences between the feature vectors extracted for different speakers are amplified, which improves the discriminative ability of the model and yields a better classification effect.
As an example, the AM-softmax loss function can be expressed as:

$$L_{am}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\left(\cos\theta_{y_{i}}-m\right)}}{e^{s\left(\cos\theta_{y_{i}}-m\right)}+\sum_{j\neq y_{i}}e^{s\cos\theta_{j}}}$$

wherein $L_{am}$ represents the AM-softmax loss; $N$ represents the overall total number of speakers, $i=1,\ldots,N$; $y_{i}$ represents the $i$-th prediction vector (e.g., a prediction embedded feature sample); $\theta_{y_{i}}$ represents the angle between the $i$-th prediction vector and its corresponding true vector (e.g., the test embedded feature sample); $\cos\theta_{j}$ represents the cosine of the angle between any other prediction vector $j$ and the corresponding true vector; $m$ reduces the cosine of the angle between the prediction vector and the true vector (i.e., introducing $m$ decreases that cosine, which is equivalent to increasing the angle between the prediction vector and the true vector); $e^{s\left(\cos\theta_{y_{i}}-m\right)}$ is the numerator of the loss term of the predicted $y_{i}$ vector; and $s$ is a scaling factor.
In the present application, using the AM-softmax loss function as the loss function for training the speech segmentation model can increase the distance between the target vector and the real vector without affecting the back-propagation efficiency of the model.
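A sketch of the AM-softmax loss in PyTorch is given below; the scale s = 30 and margin m = 0.35 are common defaults and are assumptions here, since the text does not fix their values.

```python
# Additive-margin softmax: subtract the margin m from the true-class cosine only,
# then scale all cosines by s before the usual cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AMSoftmaxLoss(nn.Module):
    def __init__(self, emb_dim, num_speakers, s=30.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))

    def forward(self, emb, labels):
        # Cosine between each embedding and each speaker weight vector.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # (B, num_speakers)
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (cos - self.m * onehot)
        return F.cross_entropy(logits, labels)


loss_fn = AMSoftmaxLoss(emb_dim=512, num_speakers=1000)
loss = loss_fn(torch.randn(96, 512), torch.randint(0, 1000, (96,)))  # batch size 96
```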
In addition, when an attention mechanism is introduced in the feature polymerization process, parameters of the model can be optimized by adopting an AM-softmax loss function as a loss function of the model.
The speech segmentation model trained by the embodiment of the application can generate the speech-level embedded features based on an attention mechanism so as to be used for speech clustering. Compared with a Gaussian mixture model or other frame-level features, the speech-level embedding feature can make full use of time domain and frequency domain information and contain more speech information, and meanwhile, compared with the traditional speech-level embedding feature, the speech-level embedding feature has the advantages that the introduction of an attention mechanism enables the speech-level embedding feature to extract speech features with stronger representation capability, so that the whole speech separation model has good performance on short-time speech segments.
Returning to fig. 1, step S30 is to input the embedded feature sequence into a pre-trained speech clustering model, and obtain a predicted tag sequence corresponding to the embedded feature sequence through the speech clustering model.
In this step, as shown in fig. 2, the embedded features output from the speech segmentation model may be used as input quantities of a pre-trained speech clustering model, and the pre-trained speech clustering model may generate corresponding prediction labels based on the embedded features, thereby forming a prediction label sequence. Here, the predictive tag may represent an identity of a speaker for each embedded feature, and a plurality of predictive tags corresponding to a plurality of embedded features in the sequence of embedded features form a sequence of predictive tags. In an embodiment of the present application, the identity of a speaker can be represented by a positive integer, such as speaker 1, speaker 2, etc., and thus, the tag value can be a positive integer and the tag sequence can be a set of positive integers, such as {1,2,1,3,3,4 }.
As an example, the speech clustering model may be trained by: step S31, obtaining a plurality of groups of original audio samples, wherein each group of original audio samples comprises a plurality of single speaker original audio samples respectively belonging to a plurality of speakers; step S32, obtaining training embedded characteristic sample sequences from each group of original audio samples in the groups of original audio samples; and step S33, training the voice clustering model by using a plurality of training embedded characteristic sample sequences of a plurality of groups of original audio samples.
In step S31, each set of original audio samples may include a plurality of single-speaker original audio samples respectively belonging to a plurality of speakers, that is, each original audio sample corresponds to one speaker, and the plurality of original audio samples respectively belong to the plurality of speakers.
In step S32, the training embedded feature sample sequence may be obtained by:
step S321, extracting spectrogram feature samples of each speaker from each single speaker original audio sample of each group of original audio samples in a time window sliding mode.
As an example, the original audio sample may be a single speaker audio file with speaker identity labels, and spectrogram feature samples may be extracted in a sliding window manner from the original audio sample.
Specifically, a time window for speech extraction may be preset, and then a plurality of spectrogram feature samples may be extracted from the original audio sample from a start time to an end time of the original audio sample in a sliding window manner with a predetermined time window shift, thereby forming a spectrogram feature sample sequence.
Here, two or more of the time windows of the plurality of original audio samples for each set of original audio samples may have different lengths. For example, the plurality of original audio samples may include 10000 original audio samples, the 10000 original audio samples may be divided into 5 batches (batch) of original audio samples, 5 time windows with different lengths may be respectively preset for the 5 batches of original audio samples, and accordingly, 5 different time window shifts may be respectively preset. As an example, the time window shift may be half the length of the corresponding time window.
As an example, the length of the time window may be determined from the original audio sample, in particular from the speech scene of the original audio sample, which allows different time windows to be set for different speech scenes. For example, the length of the time window may be positively correlated with how long a single speaker speaks continuously in the original audio sample: for a speech scene with short utterances, such as a telephone recording, the speaker changes frequently, so a shorter time window and time-window shift may be set; for a speech scene in which the speaker does not change for a relatively long time, such as a conference speech, a longer time window and time-window shift may be set.
Here, the length of the time window and the time-window shift may be variable. In particular, considering that the voice separation task contains a large number of frequently alternating short utterances (speech duration < 1 s), the length of the time window may be preset in the range of 0.5 s to 1.6 s. As an example, the time-window shift may be half the length of the corresponding variable-length time window, e.g., in the range of 0.25 s to 0.8 s when the length of the time window is in the range of 0.5 s to 1.6 s.
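A small sketch of this choice follows; the concrete scene-to-window mapping is an assumption used only to illustrate how the window length and shift could be tied to the speech scene.

```python
# Illustrative scene-dependent window configuration: window length chosen from the
# 0.5-1.6 s range per scene (mapping values are assumptions), shift = half the window.
SCENE_WINDOWS_S = {
    "telephone": 0.6,    # frequent speaker changes -> short window
    "interview": 1.0,
    "conference": 1.6,   # long single-speaker stretches -> long window
}


def window_config(scene: str) -> dict:
    win = SCENE_WINDOWS_S.get(scene, 1.0)
    return {"window_s": win, "shift_s": win / 2}


print(window_config("telephone"))   # {'window_s': 0.6, 'shift_s': 0.3}
```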
In the training of the speech clustering model, the speech scenes of the plurality of original audio samples are distinguished, and the original audio samples of different scenes are extracted with variable-length time windows and time-window shifts, so that the trained speech clustering model is suitable for voice separation both in long-utterance speech scenes and in short-utterance speech scenes. Optionally, the original audio samples may be preprocessed before the spectrogram feature samples are extracted. As an example, silence detection may be performed on the acquired original audio, silence segments may be removed to obtain a plurality of separated effective speech segments, the effective speech segments may then be spliced to form the preprocessed original audio sample, and the spectrogram feature samples may be extracted from the preprocessed original audio sample.
Step S322, inputting the spectrogram characteristic sample of each speaker into a pre-trained voice segmentation model to obtain a training embedded characteristic sample of each speaker.
In this step, for example, the spectrogram feature sample of each speaker may be input into the above-described pre-trained speech segmentation model, so as to obtain embedded features as high-level features of speech based on the spectrogram features of each speaker, so that the degree of distinction between the speech features is higher.
And step S323, randomly splicing the training embedded characteristic samples of the multiple speakers to obtain a training embedded characteristic sample sequence comprising the multiple training embedded characteristic samples.
In this step, training embedded feature samples of multiple speakers can be randomly spliced to obtain a spliced training embedded feature sample sequence, so that the training embedded feature samples in the training embedded feature sample sequence belong to the multiple speakers, and the occurrence sequence of different speakers is random.
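The sketch below illustrates step S323 under the assumption that each speaker contributes an array of utterance-level embedded samples; the shapes and the uniform shuffling are illustrative only.

```python
# Randomly interleave per-speaker embedded feature samples into one training sequence,
# keeping the speaker labels aligned with the spliced features.
import numpy as np


def splice_training_sequence(per_speaker_embeddings, seed=0):
    rng = np.random.default_rng(seed)
    segments = [(spk, vec) for spk, emb in per_speaker_embeddings.items() for vec in emb]
    order = rng.permutation(len(segments))            # random speaker order
    labels = np.array([segments[i][0] for i in order])
    features = np.stack([segments[i][1] for i in order])
    return features, labels


features, labels = splice_training_sequence({
    1: np.random.randn(3, 512),   # speaker 1: three training embedded feature samples
    2: np.random.randn(2, 512),   # speaker 2: two training embedded feature samples
})
print(features.shape, labels)     # (5, 512) and a random label order such as [2 1 1 2 1]
```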
In step S33, the training the speech clustering model using the training embedded feature sample sequences of the original audio samples may include:
and step S331, carrying out speaker identity label labeling on each training embedded characteristic sample sequence in the plurality of training embedded characteristic sample sequences, and determining a test label sample sequence.
Specifically, for the training embedded feature sample sequence including a plurality of training embedded feature samples acquired in step S323, the corresponding training embedded feature samples may be combined with the test tag samples, thereby determining the test tag sample sequence. Here, the test tag sample may be known at the time the original audio sample was taken, e.g., the number of speakers and the speaking time of the speakers are known.
For example, given that a group of original audio samples contains two speakers, i.e., speaker 1 and speaker 2, the training embedded feature sample sequence may be {embedded feature sample of speaker 1, embedded feature sample of speaker 2, embedded feature sample of speaker 2, embedded feature sample of speaker 1, embedded feature sample of speaker 2}, and accordingly the test label sample sequence may be represented as {1, 2, 2, 1, 2}.
And S332, acquiring a training prediction label sequence by utilizing a voice clustering model based on each training embedded characteristic sample sequence in the plurality of training embedded characteristic sample sequences.
In this step, each training embedded feature sample sequence in the plurality of training embedded feature sample sequences may be input into the speech clustering model, and a corresponding training prediction tag sequence may be output by using the speech clustering model. Here, a plurality of training embedded feature sample sequences may be input into the speech clustering model in batches.
And S333, training the voice clustering model according to the prior probability based on the training prediction label sequence and the test label sample sequence, wherein the prior probability refers to the probability that the next predicted training prediction label determined according to the predicted training prediction label changes.
In this step, since a plurality of training embedded feature sample sequences are input to the speech clustering model in batches during the training process, the training prediction label value obtained in the previous training batch can be compared with the test label sample value to obtain the prior probability, and then the prior probability and the next training embedded feature sample sequence are used together as the input quantity of the model calculation (for example, the prior probability can be used as the hyperparameter of the operation node of the speech clustering model) to participate in the next calculation to optimize the model.
As an example, in step S333, the training predicted label sequence and the test label sample sequence may be input into a clustering network model for training, wherein the clustering network model may generate the training predicted labels based on prior probabilities.
In the training process of the voice clustering model, the prediction precision of the subsequent model can be improved by continuously optimizing the prior probability. As an example, the maximum likelihood estimation method may be used to estimate the parameters of the prior probability, so as to determine the prior probability.
In one example, the prior probability may include a speaker transition signal sequence probability $P_{2}$.
In this case, $S_{i}$ obeys the speaker transition signal sequence probability $P_{2}$ and represents the speaker transition signal of a speech segment: when the adjacent speaker labels in the embedded feature sample sequence change, $S_{i}$ takes the value 1; when the adjacent speaker labels do not change, $S_{i}$ takes the value 0.
The speaker transition signal sequence probability $P_{2}$ is determined through its parameter estimate $\hat{p}$.
Specifically, the parameter estimate $\hat{p}$ can be expressed as:

$$\hat{p}=\frac{\sum_{n=1}^{N}\sum_{i=2}^{I_{n}}\mathbb{1}\left(y_{n,i}=y_{n,i-1}\right)}{\sum_{n=1}^{N}\left(I_{n}-1\right)}$$

wherein $N$ represents the total number of training embedded feature sample sequences, $n=1,\ldots,N$; $I_{n}$ represents the total number of training embedded feature samples contained in the $n$-th training embedded feature sample sequence (i.e., the sequence length); $i$ indexes the $i$-th training embedded feature sample; and $y_{n,i}$ represents the training prediction label of the $i$-th training embedded feature sample in the $n$-th sequence. The indicator $\mathbb{1}\left(y_{n,i}=y_{n,i-1}\right)$ outputs 1 when $y_{n,i}=y_{n,i-1}$ and 0 when $y_{n,i}\neq y_{n,i-1}$.
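The following sketch computes this parameter estimate as the fraction of adjacent label pairs whose speaker does not change, which is one reading of the estimator consistent with the indicator defined above; the exact normalization used in the patent's original formula is an assumption.

```python
# Fraction of adjacent label pairs with no speaker change, over all training sequences.
def estimate_no_change_prob(label_sequences):
    same, pairs = 0, 0
    for seq in label_sequences:
        for prev, cur in zip(seq, seq[1:]):
            pairs += 1
            same += int(cur == prev)           # indicator 1(y_{n,i} == y_{n,i-1})
    return same / pairs if pairs else 0.0


p_hat = estimate_no_change_prob([[1, 1, 2, 2, 2, 1], [3, 3, 3, 1]])
print(p_hat)                                   # 5 same-speaker pairs out of 8 -> 0.625
```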
In another example, the prior probabilities can include speaker tag assigned sequence probabilities, which can be determined by: and determining the probability of the speaker label distribution sequence according to the number of times of speaker change in the predicted training prediction labels and the total number of the predicted training prediction labels.
Here, the speaker label assignment sequence probability $P_{1}$ indicates the probability that, when the speaker changes, the next speech segment jumps to an existing speaker or to a new speaker.
According to the properties of the probability distribution, when the speaker changes, if the next input speech segment belongs to a known speaker $k$, the probability of the true label of that segment is proportional to $N_{k,i-1}$. Here, $N_{k,i-1}$ denotes the number of speech blocks belonging to a specific speaker $k$ among the 1st through $(i-1)$-th training embedded feature samples, where a speech block is a run of consecutive speaker labels in the speech segment belonging to the same speaker. For example, the sequence $y_{[6]}=(1,1,2,2,3,1)$ contains four speech blocks $(1,1)$, $(2,2)$, $(3)$, $(1)$, where $N_{1,5}=2$, $N_{2,5}=1$, $N_{3,5}=1$.
Thus, for a known speaker, the speaker label assignment sequence probability can be expressed as:

$$p\left(y_{i}=k\mid S_{i}=1,\,y_{[i-1]}\right)\propto N_{k,i-1},\qquad k\le K_{i-1}$$

wherein $y_{i}$ represents the value of the current training prediction label, $S_{i}$ indicates whether the speaker label changes, and $S_{i}=1$ indicates that the speaker label changes.
When the speaker changes, if the next input speech segment belongs to an unknown (new) speaker, the probability of the true label of that segment is proportional to the statistical parameter $\hat{\alpha}$, and the speaker label assignment sequence probability can be expressed as:

$$p\left(y_{i}=K_{i-1}+1\mid S_{i}=1,\,y_{[i-1]}\right)\propto\hat{\alpha}$$

wherein $K_{i-1}$ represents the total number of distinct speakers contained in the 1st through $(i-1)$-th training embedded feature samples; $S_{i}$ indicates whether the speaker label changes, $S_{i}=1$ indicating a change and $S_{i}=0$ indicating no change; $y_{i}$ represents the value of the $i$-th training prediction label; and $y_{i-1}$ represents the training prediction label value of the $(i-1)$-th training embedded feature sample in the $n$-th sequence.
As an example, the statistical parameter $\hat{\alpha}$ may be determined from the number of speaker changes among the predicted training prediction labels and the total number of predicted training prediction labels. For example, the statistical parameter can be expressed as:

$$\hat{\alpha}=\frac{\sum_{m=1}^{|D|}\sum_{i=1}^{N-1}\mathbb{1}\left(y_{m,i}\neq y_{m,i+1}\right)}{\sum_{m=1}^{|D|}\left|Y_{m}\right|}$$

wherein $\hat{\alpha}$ represents the statistical parameter; $|D|$ represents the total number of training embedded feature sample sequences in the plurality of training embedded feature sample sequences; $m$ represents the $m$-th training embedded feature sample sequence, $m=1,\ldots,|D|$; $Y_{m}=\{y_{m,1},\ldots,y_{m,i},y_{m,i+1},\ldots,y_{m,N}\}$ represents the training prediction label sequence corresponding to the $m$-th training embedded feature sample sequence; $|Y_{m}|$ represents the total number of training prediction label values in that sequence; $y_{m,i}$ represents the training prediction label value of the $i$-th embedded feature sample in the $m$-th training prediction label sequence; $N$ represents the total number of training prediction labels in the $m$-th training prediction label sequence; and $i=1,\ldots,N-1$. The indicator $\mathbb{1}\left(y_{m,i}\neq y_{m,i+1}\right)$ outputs 1 when $y_{m,i}\neq y_{m,i+1}$ and 0 when $y_{m,i}=y_{m,i+1}$.
In the above expression, the denominator represents the total number of predicted training prediction labels, and the numerator represents the number of speaker changes occurring among the predicted training prediction labels.
In the process of training the speech clustering model, the statistical parameter $\hat{\alpha}$ is continuously optimized, thereby realizing the training of the model.
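The sketch below illustrates the assignment probabilities above: speech-block counts per speaker are computed from the labels seen so far, an existing speaker is weighted by its block count, and a new speaker is weighted by the statistical parameter; the value of alpha is illustrative.

```python
# Block counts per speaker and the resulting assignment probabilities described above.
from collections import Counter


def block_counts(labels):
    """Count speech blocks (maximal runs of one speaker) per speaker."""
    counts = Counter()
    for i, spk in enumerate(labels):
        if i == 0 or labels[i - 1] != spk:     # a new block starts here
            counts[spk] += 1
    return counts


def assignment_probs(labels, alpha):
    counts = block_counts(labels)
    total = sum(counts.values()) + alpha
    probs = {spk: n / total for spk, n in counts.items()}
    probs["new"] = alpha / total               # probability of jumping to a new speaker
    return probs


print(block_counts([1, 1, 2, 2, 3, 1]))        # Counter({1: 2, 2: 1, 3: 1})
print(assignment_probs([1, 1, 2, 2, 3, 1], alpha=0.5))
```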
In another example, the prior probability may include a speaker embedded feature generation sequence probability $P_{0}$, which is determined through the parameter estimates $\hat{\mu}$ and $\hat{\sigma}^{2}$ of the speaker embedded feature generation sequence probability.
Specifically, the parameter estimates may be obtained by a mini-batch gradient-ascent update of the form:

$$\left(\hat{\mu},\hat{\sigma}^{2}\right)\leftarrow\left(\hat{\mu},\hat{\sigma}^{2}\right)+P_{I}\,\nabla\!\left(\frac{N}{b}\sum_{x_{n}\in B_{I}}\log p\left(x_{n}\mid\text{known prior information}\right)\right)$$

wherein $I$ denotes the $I$-th training batch in the case where the plurality of embedded feature sample sequences is divided into a plurality of training batches; $N$ denotes the total number of embedded feature sample sequences serving as training input in the $I$-th training batch; $P_{I}$ denotes the step value of the gradient-ascent optimization strategy in the model training phase of the $I$-th training batch; $B_{I}$ denotes a subset of embedded feature samples formed by embedded feature samples selected from the embedded feature samples included in the sequences serving as training input in the $I$-th training batch; $b$ denotes the total number of embedded feature samples included in the subset; $x_{n}\in B_{I}$ indicates that the $n$-th embedded feature sample belongs to the subset $B_{I}$; and $p\left(x_{n}\mid\cdot\right)$ denotes the probability of obtaining the label $x_{n}$ given the known prior information. In particular, these probabilities may be obtained as the product of the probabilities mentioned above.
In yet another example, the prior probability may include all three of the probabilities mentioned above: the speaker embedded feature generation sequence probability $P_{0}$, the speaker label assignment sequence probability $P_{1}$ and the speaker transition signal sequence probability $P_{2}$. In this case, the prior probability can be expressed as their product:

$$P=P_{0}\cdot P_{1}\cdot P_{2}$$

In this example, for the input speech segment data, the value probability of the speaker transition signal is calculated, the probability of the current speech segment label given the previous speech segment label and the current value of the transition signal is calculated, and the accuracy of the embedded feature observation obtained under the previous state weight parameters and the current speech segment label is calculated, so as to determine the prior probability and realize the training of the speech clustering model. The parameters $\hat{p}$, $\hat{\alpha}$, $\hat{\mu}$ and $\hat{\sigma}^{2}$ are the hyperparameters of the operation nodes of the speech clustering model, and training the speech clustering model in this process is in fact an iterative optimization of at least one of these parameters.
FIG. 3 is a diagram illustrating the determination of the loss function in the step of training the speech clustering model in the speech separation method according to the embodiment of the present application.
As an example, the plurality of training embedded feature sample sequences may include L training embedded feature sample sequences.
In this case, the speech clustering model may be trained based on the loss function.
Specifically, as shown in FIG. 3, the loss function may be determined by: embedding the feature sample sequence based on the first training of the L training embedded feature sample sequences to the second trainingi-1 mean vector of the training embedded feature sample sequence and the second of the L training embedded feature sample sequencesiDetermining a loss function from the training embedded feature sample sequence to the mean vector of the L-th training embedded feature sample sequence, wherein,iand L is an integer and satisfies 1<i<And L. The determined loss function can be substituted back into the speech clustering model to ensure the accuracy and stability of the model.
The speech clustering model of the embodiment of the present application may be, for example, a UIS-RNN model. The loss function may be, for example, an MSE loss function; specifically, the MSE loss may be computed between the mean vector of the sequence from the i-th (i.e., i-th period) spectrogram feature sample to the final L-th (i.e., L-th period) spectrogram feature sample in the temporal feature sequence and the mean vector of the sequence from the 1st (i.e., 1st period) spectrogram feature sample to the (i-1)-th (i.e., (i-1)-th period) spectrogram feature sample in the temporal feature sequence.
Compared with a loss function determined based on the mean vector of the first to (L-1)-th training embedded feature sample sequences among the L training embedded feature sample sequences and the L-th training embedded feature sample sequence, the loss function determined according to the method of the present application can better ensure the accuracy and stability of the model. Here, the calculation of the loss function from the mean vectors can be implemented by any means known in the art and is not described again.
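As an illustrative sketch of the split-mean MSE loss described above (the function name and data layout are assumptions, not the patented implementation):

```python
import numpy as np

def split_mean_mse_loss(embedding_sequences, i):
    """MSE between the mean vector of sequences 1..i-1 and that of sequences i..L.

    embedding_sequences: list of L arrays, each of shape (num_samples, dim);
    i: split index with 1 < i < L (1-based, as in the description above).
    """
    # Mean vector over the first (i-1) training embedded feature sample sequences.
    head = np.concatenate(embedding_sequences[: i - 1], axis=0)
    # Mean vector over the i-th to L-th training embedded feature sample sequences.
    tail = np.concatenate(embedding_sequences[i - 1 :], axis=0)
    mean_head = head.mean(axis=0)
    mean_tail = tail.mean(axis=0)
    # Mean squared error between the two mean vectors.
    return float(np.mean((mean_head - mean_tail) ** 2))
```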
Returning to fig. 1, step S40 is to perform single speaker speech restoration according to the predicted tag sequence to generate separated speech.
In this step, as shown in fig. 2, the speech segments belonging to each speaker may be extracted according to the predicted tag sequence obtained in step S30, and the speech segments belonging to the same speaker are spliced, so as to restore the individual speech of each speaker in the original audio, and obtain the final separated speech.
Alternatively, when the original audio is pre-processed in step S10, the predicted tag sequence may be post-processed accordingly in this step S40. For example, when silence is removed from the original audio in step S10, the obtained predicted tag sequence may be time-aligned with reference to the times of the valid speech segments generated in step S10, so that the correspondence between the predicted tag sequence and the real time is restored.
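A minimal sketch of the single-speaker restoration step, assuming the valid speech segments and their original start/end times from step S10 are available (names and data layout are illustrative):

```python
from collections import defaultdict
import numpy as np

def restore_single_speaker_audio(segments, predicted_labels, segment_times):
    """Splice the speech segments assigned to each predicted speaker label.

    segments:         list of 1-D waveform arrays, one per valid speech segment;
    predicted_labels: predicted tag sequence, one label per segment (step S30);
    segment_times:    (start_sec, end_sec) of each segment in the original audio,
                      used here only to keep the original temporal order.
    Returns a dict mapping speaker label -> concatenated waveform.
    """
    per_speaker = defaultdict(list)
    for idx in np.argsort([start for start, _ in segment_times]):
        per_speaker[predicted_labels[idx]].append(segments[idx])
    return {label: np.concatenate(chunks) for label, chunks in per_speaker.items()}
```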
Further, the post-processing performed on the predicted tag sequence may further include tag smoothing processing, for example, it may be determined whether an abnormal value exists in a plurality of consecutive predicted tags in the predicted tag sequence, and the abnormal value may be smoothed and the final separated speech may be generated based on the smoothed predicted tag sequence.
For example, when the speakers include speaker 1 and speaker 2 and the predicted tag sequence is {2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1}, it can be seen that a predicted tag of speaker 2 appears among a plurality of consecutive predicted tags of speaker 1; this predicted tag of speaker 2 can therefore be regarded as an abnormal value and smoothed, i.e., modified into a predicted tag of speaker 1, giving the smoothed predicted tag sequence {2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1}; the final separated speech can then be generated based on the smoothed predicted tag sequence.
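A simple sketch of the label smoothing described above, using a sliding-window majority vote (the window size and voting rule are illustrative assumptions; the description only requires that isolated abnormal values be smoothed):

```python
def smooth_predicted_tags(tags, window=3):
    """Replace isolated outlier labels with the locally dominant label.

    A simple majority vote in a sliding window over the original tags;
    the window size of 3 is an illustrative choice.
    """
    smoothed = list(tags)
    half = window // 2
    for idx in range(len(tags)):
        lo, hi = max(0, idx - half), min(len(tags), idx + half + 1)
        neighborhood = tags[lo:hi]
        majority = max(set(neighborhood), key=neighborhood.count)
        if tags[idx] != majority:
            smoothed[idx] = majority
    return smoothed

# [2,2,2,2,1,1,1,2,1,1,1,1] -> [2,2,2,2,1,1,1,1,1,1,1,1]
```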
The voice separation method of the embodiment of the application can be used for voice separation of multiple speakers and can be applied to various voice interaction products. For example, in meeting and news scenarios, it can be used to estimate the number of participants and to separate the speech of important speakers; in telephone conversation scenarios, it can be used to separate the recording of a telephone conversation for voiceprint enrollment and comparison; and voice separation can also improve speech recognition performance in multi-speaker environments.
In the voice separation method of the embodiment of the application, a ResNet convolutional neural network with a deep stack of convolutional layers is adopted in the voice segmentation stage, the embedded features are extracted in a sliding-window manner, and an attention mechanism is introduced in the feature aggregation stage.
The voice separation method provided by the embodiment of the application adopts the supervised clustering algorithm in the voice clustering stage, can effectively utilize prior information, dynamically adjusts the model parameters according to the clustering result, meets the requirement of a large-data-volume environment on the model clustering effect, and is higher in clustering accuracy. In addition, the speech separation method of the embodiment of the application optimizes the loss function in the speech clustering stage, adopts a statistical method to replace network training iteration, and improves the parameter calculation mode of the prior probability of the generated prediction label.
The method in the present application may be implemented by a voice separation apparatus in an electronic device, or may be implemented entirely by a computer program; for example, the method may be executed by a voice separation application installed in the electronic device, or by a functional program implemented in the operating system of the electronic device. By way of example, the electronic device may be a personal computer, a server, a tablet computer, a smart phone, or another electronic device with artificial intelligence computing capability.
Another aspect of the present application relates to a voice separation apparatus. Fig. 4 shows a schematic block diagram of a speech separation apparatus according to an exemplary embodiment of the present application.
As shown in fig. 4, the voice separating apparatus according to the exemplary embodiment of the present application includes an extracting unit 100, a dividing unit 200, a clustering unit 300, and a separating unit 400.
The extraction unit 100 acquires an original audio, and extracts a spectrogram feature sequence from the original audio in a time window sliding manner.
The segmentation unit 200 inputs the spectrogram feature sequence into a pre-trained speech segmentation model, and obtains an embedded feature sequence through the speech segmentation model.
The clustering unit 300 inputs the embedded feature sequence into a pre-trained speech clustering model, and obtains a prediction tag sequence corresponding to the embedded feature sequence through the speech clustering model.
The separation unit 400 performs single-speaker speech restoration according to the predicted tag sequence to generate separated speech.
The extracting unit 100, the segmenting unit 200, the clustering unit 300 and the separating unit 400 may perform the corresponding steps of the speech separation method in the method embodiments shown in FIG. 1 to FIG. 3, for example through machine-readable instructions executable by these units; for their specific implementations, reference may be made to the method embodiments described above, which are not repeated here.
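A compact sketch of how the four units might be chained (module boundaries, window sizes and model interfaces are assumptions for illustration, not the patented implementation):

```python
import numpy as np

def sliding_window_features(audio, sample_rate, win_sec=1.0, hop_sec=0.5):
    """Extraction unit: cut the waveform into overlapping time windows; in
    practice each window would then be converted into a spectrogram feature."""
    win, hop = int(win_sec * sample_rate), int(hop_sec * sample_rate)
    return [audio[s:s + win] for s in range(0, max(1, len(audio) - win + 1), hop)]

def separate(audio, sample_rate, segmentation_model, clustering_model):
    """End-to-end flow mirroring the extraction / segmentation / clustering /
    separation units described above; the two models are assumed callables."""
    windows = sliding_window_features(audio, sample_rate)        # extraction unit
    embeddings = [segmentation_model(w) for w in windows]        # segmentation unit
    predicted_tags = clustering_model(np.stack(embeddings))      # clustering unit
    speakers = {}                                                # separation unit
    for window, tag in zip(windows, predicted_tags):
        speakers.setdefault(int(tag), []).append(window)
    return {tag: np.concatenate(chunks) for tag, chunks in speakers.items()}
```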
An embodiment of the present application further provides an electronic device, which includes a processor and a memory storing a computer program. When the computer program is executed by the processor, the electronic device may perform the corresponding steps of the speech separation method in the method embodiments shown in FIG. 1 to FIG. 3; for its specific implementation, reference may be made to the method embodiments described above, which are not repeated here.
The embodiment of the present application further provides a computer-readable storage medium storing a computer program, which, when executed (for example, by a processor), can perform the steps of the voice separation method in the method embodiments shown in fig. 1 to fig. 3.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment scheme of the application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
According to the voice separation method, the voice separation device, the electronic equipment and the storage medium of the embodiments of the application, separated voice can be generated by first segmenting and then clustering the spectrogram features extracted from the voice in a time-window sliding manner, so that voice fragments belonging to a single speaker can be separated from a short-duration audio file in which multiple persons speak alternately.
In addition, according to the speech separation method, the speech separation device, the electronic equipment and the storage medium, the speech spectrogram feature sequence can be extracted from the original audio in a time window sliding mode, the continuity of the extracted speech spectrogram features is ensured, and the number of speakers can be accurately estimated in connection with the context information.
In addition, according to the voice separation method, the voice separation device, the electronic equipment and the storage medium of the application, a fully supervised clustering algorithm is adopted in the training of the voice clustering model, so that the prior information in the separation process can be fully utilized and the model parameters can be dynamically adjusted; this effectively solves the problem that the performance of unsupervised clustering degrades when the voice data volume is large and the number of speakers is large, and better meets the requirement of a large-data environment on the model separation effect.
In addition, according to the speech separation method, the speech separation device, the electronic equipment and the storage medium of the application, a feature aggregation formula is used to introduce an attention mechanism into the model during training, so that short-time speech features with stronger representational power can be extracted.
In addition, according to the speech separation method, the speech separation device, the electronic device and the storage medium of the application, the plurality of training embedded feature sample sequences are divided into two groups, and the loss function of the speech clustering model is determined based on the mean vectors of the two groups of training embedded feature sample sequences, so that the loss function can be optimized, and the accuracy and the stability of the speech clustering model are ensured.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A speech separation method, characterized in that the speech separation method comprises:
acquiring original audio, and extracting a spectrogram feature sequence from the original audio in a time window sliding mode;
inputting the spectrogram feature sequence into a pre-trained voice segmentation model, and acquiring an embedded feature sequence through the voice segmentation model;
inputting the embedded characteristic sequence into a pre-trained voice clustering model, and obtaining a prediction label sequence corresponding to the embedded characteristic sequence through the voice clustering model;
performing single speaker voice restoration according to the prediction tag sequence to generate separated voice,
wherein the speech clustering model is trained by:
acquiring a plurality of groups of original audio samples, wherein each group of original audio samples comprises a plurality of single speaker original audio samples belonging to a plurality of speakers respectively;
acquiring a training embedded feature sample sequence from each group of original audio samples in the multiple groups of original audio samples;
training the voice clustering model by utilizing a plurality of training embedded characteristic sample sequences of the plurality of groups of original audio samples according to prior probability, wherein the prior probability refers to the probability that the next predicted training prediction label determined according to the predicted training prediction label changes, and the prior probability comprises the speaker label distribution sequence probability,
wherein the speaker tag assigned sequence probability is determined by:
determining statistical parameters of the speaker label distribution sequence probability according to the number of times of speaker change in the predicted training prediction labels and the total number of the predicted training prediction labels;
determining the speaker tag distribution sequence probability according to the statistical parameters,
wherein the statistical parameter is represented by an expression (given as an image in the original publication) in which |D| represents the total number of training embedded feature sample sequences in the plurality of training embedded feature sample sequences, m represents the m-th training embedded feature sample sequence among the plurality of training embedded feature sample sequences, m = 1, …, |D|, Y_m = {y_m,1, …, y_m,i, y_m,i+1, …, y_m,N}, Y_m represents the training prediction label sequence corresponding to the m-th training embedded feature sample sequence, |Y_m| represents the total number of training prediction label values in the training prediction label sequence corresponding to the m-th training embedded feature sample sequence, y_m,i represents the training prediction label value of the i-th embedded feature sample in the m-th training prediction label sequence, N represents the total number of training prediction labels in the m-th training prediction label sequence, and i = 1, …, N-1.
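As a hedged illustration only (the exact expression of the statistical parameter is given as an image in the original publication), the ratio of speaker changes among the predicted labels to the total number of adjacent label pairs might be computed along these lines:

```python
def speaker_change_statistic(training_prediction_label_sequences):
    """Estimate the speaker-change statistic from predicted label sequences.

    training_prediction_label_sequences: iterable of label sequences Y_m, one per
    training embedded feature sample sequence. The ratio below (changes over
    adjacent-label pairs) is an assumption consistent with the textual
    description, not the patent's exact formula.
    """
    changes, pairs = 0, 0
    for labels in training_prediction_label_sequences:
        for current, following in zip(labels, labels[1:]):
            pairs += 1
            if current != following:
                changes += 1
    return changes / pairs if pairs else 0.0

# Example: two label sequences with one and two speaker changes respectively.
print(speaker_change_statistic([[1, 1, 2, 2], [1, 2, 2, 1]]))  # 3 / 6 = 0.5
```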
2. The speech separation method of claim 1, wherein the training embedded feature sample sequence is obtained by:
extracting a spectrogram characteristic sample of each speaker from each single speaker original audio sample of each group of original audio samples in a time window sliding mode;
inputting the spectrogram characteristic sample of each speaker into a pre-trained voice segmentation model to obtain a training embedded characteristic sample of each speaker;
and randomly splicing the training embedded feature samples of the multiple speakers to obtain a training embedded feature sample sequence comprising the multiple training embedded feature samples.
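A sketch of the random splicing step under stated assumptions (splicing whole per-speaker blocks in a random order is an illustrative choice; the claim only requires random splicing of the training embedded feature samples):

```python
import random

def build_training_embedded_sequence(per_speaker_embeddings, seed=None):
    """Randomly splice the training embedded feature samples of multiple speakers
    into a single training embedded feature sample sequence.

    per_speaker_embeddings: dict mapping speaker id -> list of embedded feature
    samples from the pre-trained speech segmentation model.
    Returns the spliced sequence and the parallel speaker labels.
    """
    rng = random.Random(seed)
    speakers = list(per_speaker_embeddings)
    rng.shuffle(speakers)  # random splicing order across speakers
    sequence, labels = [], []
    for speaker in speakers:
        samples = per_speaker_embeddings[speaker]
        sequence.extend(samples)
        labels.extend([speaker] * len(samples))
    return sequence, labels
```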
3. The method of claim 2, wherein the training the speech clustering model according to the prior probability using a plurality of training embedded feature sample sequences of the plurality of sets of original audio samples comprises:
labeling the identity label of the speaker for each training embedded characteristic sample sequence in the training embedded characteristic sample sequences to determine a test label sample sequence;
based on each training embedded characteristic sample sequence in the training embedded characteristic sample sequences, acquiring a training prediction label sequence by using a voice clustering model;
and training the voice clustering model according to the prior probability based on the training prediction label sequence and the test label sample sequence.
4. The speech separation method of claim 2 wherein the plurality of training embedded feature sample sequences comprises L training embedded feature sample sequences,
wherein training the speech clustering model using a plurality of training embedded feature sample sequences of the plurality of sets of original audio samples comprises:
determining a loss function based on the mean vector of the first to (i-1)-th training embedded feature sample sequences among the L training embedded feature sample sequences and the mean vector of the i-th to L-th training embedded feature sample sequences among the L training embedded feature sample sequences, wherein i and L are integers and satisfy 1 < i < L;
Training the speech clustering model based on the loss function.
5. The speech separation method of claim 1, wherein the speech segmentation model is trained by:
acquiring an original audio sample and a test embedded feature sample, and extracting a spectrogram feature sample from the original audio sample in a time window sliding mode;
acquiring training prediction embedding characteristics according to a characteristic aggregation formula based on the spectrogram characteristic sample;
training a speech segmentation model based on the test embedded feature samples and the training predicted embedded features,
wherein the feature aggregation formula is represented by an expression (given as an image in the original publication) in which V_j,k is the aggregated feature quantity; X_i,j represents the j-th feature value of the i-th local descriptor of the features obtained by performing the convolution kernel operation on the spectrogram feature sample, j = 1, …, J, where J is the total number of feature values contained in a local descriptor; c_k,j represents the j-th feature value of the k-th cluster center among the valid speech cluster centers of the features obtained by performing the convolution kernel operation on the spectrogram feature sample; K represents the number of valid speech cluster centers, k = 1, …, K; G represents the number of noise point cluster centers; T represents a vector transpose operation; N represents the total number of local descriptors of the features obtained by performing the convolution kernel operation on the spectrogram feature sample, i = 1, …, N; and the remaining symbols (also given as images in the original publication) represent a weight coefficient of the cluster centers and the j-th feature value of a cluster center among the valid speech cluster centers and the noise point cluster centers of the features obtained by performing the convolution kernel operation on the spectrogram feature sample.
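A NetVLAD/GhostVLAD-style aggregation consistent with the symbol definitions above, offered only as a hedged sketch since the exact formula is given as an image in the original publication; the softmax soft-assignment weights and the parameter shapes are assumptions:

```python
import numpy as np

def aggregate_features(X, centers, W, b, K):
    """Feature aggregation sketch over valid speech and noise-point cluster centers.

    X:       (N, J) local descriptors from the convolutional features;
    centers: (K + G, J) cluster centers, the first K being valid speech centers
             and the remaining G being noise-point centers;
    W, b:    soft-assignment parameters of shapes (K + G, J) and (K + G,);
    Returns V of shape (K, J): only the K valid speech centers are aggregated.
    """
    logits = X @ W.T + b                      # (N, K + G) soft-assignment scores
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    V = np.zeros((K, X.shape[1]))
    for k in range(K):                        # noise-point centers are discarded
        residual = X - centers[k]             # residuals X_i,j - c_k,j
        V[k] = (weights[:, k:k + 1] * residual).sum(axis=0)
    return V
```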
6. A speech separation apparatus, characterized in that the speech separation apparatus comprises:
the extraction unit is used for acquiring original audio and extracting a spectrogram characteristic sequence from the original audio in a time window sliding mode;
the segmentation unit is used for inputting the spectrogram feature sequence into a pre-trained voice segmentation model and acquiring an embedded feature sequence through the voice segmentation model;
the clustering unit is used for inputting the embedded characteristic sequence into a pre-trained voice clustering model and obtaining a prediction label sequence corresponding to the embedded characteristic sequence through the voice clustering model;
a separation unit for performing single speaker voice restoration according to the prediction tag sequence to generate separated voice,
wherein the clustering unit trains the speech clustering model by:
acquiring a plurality of groups of original audio samples, wherein each group of original audio samples comprises a plurality of single speaker original audio samples belonging to a plurality of speakers respectively;
acquiring a training embedded feature sample sequence from each group of original audio samples in the multiple groups of original audio samples;
training the voice clustering model by utilizing a plurality of training embedded characteristic sample sequences of the plurality of groups of original audio samples according to prior probability, wherein the prior probability refers to the probability that the next predicted training prediction label determined according to the predicted training prediction label changes, and the prior probability comprises the speaker label distribution sequence probability,
wherein the speaker tag assigned sequence probability is determined by:
determining statistical parameters of the speaker label distribution sequence probability according to the number of times of speaker change in the predicted training prediction labels and the total number of the predicted training prediction labels;
determining the speaker tag distribution sequence probability according to the statistical parameters,
wherein the statistical parameter is represented by an expression (given as an image in the original publication) in which |D| represents the total number of training embedded feature sample sequences in the plurality of training embedded feature sample sequences, m represents the m-th training embedded feature sample sequence among the plurality of training embedded feature sample sequences, m = 1, …, |D|, Y_m = {y_m,1, …, y_m,i, y_m,i+1, …, y_m,N}, Y_m represents the training prediction label sequence corresponding to the m-th training embedded feature sample sequence, |Y_m| represents the total number of training prediction label values in the training prediction label sequence corresponding to the m-th training embedded feature sample sequence, y_m,i represents the training prediction label value of the i-th embedded feature sample in the m-th training prediction label sequence, N represents the total number of training prediction labels in the m-th training prediction label sequence, and i = 1, …, N-1.
7. an electronic device, characterized in that the electronic device comprises:
a processor;
memory storing a computer program which, when executed by a processor, implements a speech separation method according to any one of claims 1 to 5.
8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech separation method according to any of claims 1 to 5.
CN202110237579.3A 2021-03-04 2021-03-04 Voice separation method, voice separation device, electronic device and storage medium Active CN112634875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110237579.3A CN112634875B (en) 2021-03-04 2021-03-04 Voice separation method, voice separation device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110237579.3A CN112634875B (en) 2021-03-04 2021-03-04 Voice separation method, voice separation device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN112634875A CN112634875A (en) 2021-04-09
CN112634875B true CN112634875B (en) 2021-06-08

Family

ID=75295567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110237579.3A Active CN112634875B (en) 2021-03-04 2021-03-04 Voice separation method, voice separation device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN112634875B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113178196B (en) * 2021-04-20 2023-02-07 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113205815B (en) * 2021-04-28 2023-02-28 维沃移动通信有限公司 Voice processing method and electronic equipment
CN113470695B (en) * 2021-06-30 2024-02-09 平安科技(深圳)有限公司 Voice abnormality detection method, device, computer equipment and storage medium
CN115938385A (en) * 2021-08-17 2023-04-07 中移(苏州)软件技术有限公司 Voice separation method and device and storage medium
CN114139063B (en) * 2022-01-30 2022-05-17 北京淇瑀信息科技有限公司 User tag extraction method and device based on embedded vector and electronic equipment
CN117198272B (en) * 2023-11-07 2024-01-30 浙江同花顺智能科技有限公司 Voice processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019199554A1 (en) * 2018-04-11 2019-10-17 Microsoft Technology Licensing, Llc Multi-microphone speech separation
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110619887A (en) * 2019-09-25 2019-12-27 电子科技大学 Multi-speaker voice separation method based on convolutional neural network
CN110718228A (en) * 2019-10-22 2020-01-21 中信银行股份有限公司 Voice separation method and device, electronic equipment and computer readable storage medium
CN110970053A (en) * 2019-12-04 2020-04-07 西北工业大学深圳研究院 Multichannel speaker-independent voice separation method based on deep clustering

Also Published As

Publication number Publication date
CN112634875A (en) 2021-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant