CN112634875B - Voice separation method, voice separation device, electronic device and storage medium - Google Patents

Voice separation method, voice separation device, electronic device and storage medium

Info

Publication number
CN112634875B
Authority
CN
China
Prior art keywords
training
sequence
embedded
voice
speech
Prior art date
Legal status
Active
Application number
CN202110237579.3A
Other languages
Chinese (zh)
Other versions
CN112634875A (en)
Inventor
史王雷
王秋明
Current Assignee
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yuanjian Information Technology Co Ltd
Priority to CN202110237579.3A
Publication of CN112634875A
Application granted
Publication of CN112634875B
Status: Active

Classifications

    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G06F16/35 Information retrieval of unstructured textual data; Clustering; Classification
    • G06F18/214 Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/045 Neural networks; Combinations of networks
    • G06N3/084 Neural network learning methods; Backpropagation, e.g. using gradient descent
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/26 Speech to text systems

Abstract

The application provides a voice separation method, a voice separation device, an electronic device and a storage medium. The voice separation method comprises the following steps: acquiring original audio, and extracting a spectrogram feature sequence from the original audio in a time-window sliding manner; inputting the spectrogram feature sequence into a pre-trained voice segmentation model, and acquiring an embedded feature sequence through the voice segmentation model; inputting the embedded feature sequence into a pre-trained voice clustering model, and obtaining a prediction label sequence corresponding to the embedded feature sequence through the voice clustering model; and restoring single-speaker speech according to the prediction label sequence to generate separated voice. The voice separation method, the voice separation device, the electronic device and the storage medium solve the problem of unsatisfactory voice separation: the voice segments belonging to a single speaker can be separated from a short-duration audio file in which multiple people speak alternately, and the number of speakers can be accurately estimated by using context information.

Description

Voice separation method, voice separation device, electronic device and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech separation method, a speech separation apparatus, an electronic device, and a storage medium.
Background
The cocktail party problem is a classic problem in the field of computer speech processing: when a single speaker is speaking, speech recognition technology can usually recognize the spoken content accurately, but when the scene contains multiple speakers, the accuracy of speech recognition drops greatly.
Generally, the audio to be separated contains a large number of short utterances, and these short utterances carry little information and are not very distinctive, so voice separation is difficult; moreover, the number of speakers cannot be accurately estimated from the context information, so the separation effect is not ideal.
Disclosure of Invention
In view of the problems that existing speech recognition technology has difficulty separating short-duration speech and cannot accurately estimate the number of speakers from context information, the present application provides a voice separation method, a voice separation device, an electronic device and a storage medium.
According to a first aspect of the present application, there is provided a speech separation method comprising: acquiring original audio, and extracting a spectrogram feature sequence from the original audio in a time window sliding mode; inputting the spectrogram feature sequence into a pre-trained voice segmentation model, and acquiring an embedded feature sequence through the voice segmentation model; inputting the embedded characteristic sequence into a pre-trained voice clustering model, and obtaining a prediction label sequence corresponding to the embedded characteristic sequence through the voice clustering model; and carrying out voice reduction of a single speaker according to the prediction tag sequence to generate separated voice.
Alternatively, the speech clustering model may be trained by: acquiring a plurality of groups of original audio samples, wherein each group of original audio samples comprises a plurality of single speaker original audio samples belonging to a plurality of speakers respectively; acquiring a training embedded feature sample sequence from each group of original audio samples in the multiple groups of original audio samples; training the voice clustering model by using a plurality of training embedded feature sample sequences of the plurality of groups of original audio samples, wherein the training embedded feature sample sequences are obtained by the following method: extracting a spectrogram characteristic sample of each speaker from each single speaker original audio sample of each group of original audio samples in a time window sliding mode; inputting the spectrogram characteristic sample of each speaker into a pre-trained voice segmentation model to obtain a training embedded characteristic sample of each speaker; and randomly splicing the training embedded feature samples of the multiple speakers to obtain a training embedded feature sample sequence comprising the multiple training embedded feature samples.
Optionally, training the speech clustering model by using a plurality of training embedded feature sample sequences of the plurality of sets of original audio samples may include: labeling the identity label of the speaker for each training embedded characteristic sample sequence in the training embedded characteristic sample sequences to determine a test label sample sequence; based on each training embedded characteristic sample sequence in the training embedded characteristic sample sequences, acquiring a training prediction label sequence by using a voice clustering model; and training the voice clustering model according to prior probability based on the training prediction label sequence and the test label sample sequence, wherein the prior probability refers to the probability that the next predicted training prediction label is changed according to the predicted training prediction label.
Optionally, the prior probability includes a speaker label assignment sequence probability, which can be determined by: determining the speaker label assignment sequence probability according to the number of speaker changes among the predicted training prediction labels and the total number of predicted training prediction labels.
Optionally, determining the speaker label assignment sequence probability according to the number of speaker changes among the predicted training prediction labels and the total number of predicted training prediction labels includes: determining a statistical parameter of the speaker label assignment sequence probability according to the number of speaker changes among the predicted training prediction labels and the total number of predicted training prediction labels; and determining the speaker label assignment sequence probability according to the statistical parameter, wherein the statistical parameter can be expressed as:

$$\hat{\alpha}=\frac{\sum_{m=1}^{|D|}\sum_{i=1}^{N-1}\mathbb{1}\left(y_{m,i}\neq y_{m,i+1}\right)}{\sum_{m=1}^{|D|}\left|Y_{m}\right|}$$

wherein $\hat{\alpha}$ represents the statistical parameter; $|D|$ represents the total number of training embedded feature sample sequences in the plurality of training embedded feature sample sequences; $m$ represents the $m$-th training embedded feature sample sequence, $m=1,\ldots,|D|$; $Y_{m}=\{y_{m,1},\ldots,y_{m,i},y_{m,i+1},\ldots,y_{m,N}\}$ represents the training prediction label sequence corresponding to the $m$-th training embedded feature sample sequence; $|Y_{m}|$ represents the total number of training prediction label values in that sequence; $y_{m,i}$ represents the training prediction label value of the $i$-th embedded feature sample in the $m$-th training prediction label sequence; $N$ represents the total number of training prediction labels in the $m$-th training prediction label sequence; and $i=1,\ldots,N-1$.
Optionally, the plurality of training embedded feature sample sequences may include L training embedded feature sample sequences, and training the speech clustering model with the plurality of training embedded feature sample sequences of the plurality of groups of original audio samples may include: determining a loss function based on the mean vector of the 1st through (i-1)-th training embedded feature sample sequences among the L training embedded feature sample sequences and the mean vector of the i-th through L-th training embedded feature sample sequences among the L training embedded feature sample sequences, wherein i and L are integers satisfying 1 < i < L; and training the speech clustering model based on the loss function.
Alternatively, the speech segmentation model may be trained by: acquiring an original audio sample and a test embedded feature sample, and extracting a spectrogram feature sample from the original audio sample in a time-window sliding manner; acquiring a training prediction embedded feature from the spectrogram feature sample according to a feature aggregation formula; and training the speech segmentation model based on the test embedded feature sample and the training prediction embedded feature, wherein the feature aggregation formula is expressed as:

$$V_{j,k}=\sum_{i=1}^{N}\frac{e^{a_{k}^{T}\left(X_{i}-c_{k}\right)}}{\sum_{k'=1}^{K+G}e^{a_{k'}^{T}\left(X_{i}-c_{k'}\right)}}\left(X_{i,j}-c_{k,j}\right)$$

wherein $V_{j,k}$ is the feature aggregation quantity; $X_{i,j}$ represents the $j$-th feature value of the $i$-th local descriptor of the features obtained from the spectrogram feature sample through the convolution kernel operation, $j=1,\ldots,J$, with $J$ the total number of feature values contained in a local descriptor; $c_{k,j}$ represents the $j$-th feature value of the $k$-th cluster center among the valid speech cluster centers of the features obtained from the spectrogram feature sample through the convolution kernel operation; $K$ represents the number of valid speech cluster centers, $k=1,\ldots,K$; $G$ represents the number of noise point cluster centers; $T$ represents the vector transpose operation; $N$ represents the total number of local descriptors of the features obtained from the spectrogram feature sample through the convolution kernel operation, $i=1,\ldots,N$; $a_{k}$ represents the weight coefficient of the cluster center; and $c_{k',j}$ represents the $j$-th feature value of the $k'$-th cluster center among the valid speech cluster centers and the noise point cluster centers, $k'=1,\ldots,K+G$.
According to a second aspect of the present application, there is provided a speech separation apparatus comprising: the extraction unit is used for acquiring original audio and extracting a spectrogram characteristic sequence from the original audio in a time window sliding mode; the segmentation unit is used for inputting the spectrogram feature sequence into a pre-trained voice segmentation model and acquiring an embedded feature sequence through the voice segmentation model; the clustering unit is used for inputting the embedded characteristic sequence into a pre-trained voice clustering model and obtaining a prediction label sequence corresponding to the embedded characteristic sequence through the voice clustering model; and the separation unit is used for carrying out single speaker voice reduction according to the prediction tag sequence to generate separated voice.
According to a third aspect of the present application, there is provided an electronic device comprising: a processor; a memory storing a computer program which, when executed by the processor, implements the speech separation method according to the first aspect.
According to a fourth aspect of the present application there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech separation method according to the first aspect.
According to the voice separation method, the voice separation device, the electronic device and the storage medium, the voice segments belonging to a single speaker can be separated from a short-duration audio file in which multiple people speak alternately, and the number of speakers can be accurately estimated by using context information.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 shows a schematic flow diagram of a speech separation method according to an embodiment of the application;
FIG. 2 shows a flow diagram of each step in a speech separation method according to an embodiment of the application;
FIG. 3 is a diagram illustrating the determination of a loss function in the step of training a speech clustering model in a speech separation method according to an embodiment of the present application;
fig. 4 shows a schematic block diagram of a speech separation apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
One aspect of the present application relates to a method of speech separation. The voice separation method can generate separated voice by firstly dividing and then clustering spectrogram characteristics extracted from the voice by a preset time window and window shift, thereby separating voice segments belonging to a single speaker from a short-time voice audio file in which a plurality of persons speak alternately.
It is noted that, before the filing of the present application, existing speech recognition technology could not accurately separate short-duration speech. The speech separation method provided by the present application takes temporal continuity into account by applying a predetermined time window and window shift, so that even for short-duration speech the speaker can be well distinguished by combining context information, and the number of speakers and the speaker change time points can be accurately estimated.
Fig. 1 shows a flow chart of a speech separation method according to an embodiment of the present application, and fig. 2 shows a flow chart of each step in the speech separation method according to an embodiment of the present application.
As shown in fig. 1 and 2, a speech separation method according to an embodiment of the present application may include the steps of:
and step S10, acquiring the original audio, and extracting a spectrogram feature sequence from the original audio in a time window sliding mode.
In this step, the original audio may be an audio file to be voice-separated, and the original audio may include a voice scene in which a plurality of speakers alternately speak. The speech separation method of embodiments of the present application can identify speech segments for each of a plurality of speakers from raw audio and can generate speech segments for a single speaker.
The spectrogram feature may be a visual feature obtained by performing spectral analysis on the original audio.
As an example, before the method partitions the original audio, a spectrogram feature sequence may be extracted from the original audio with a predetermined time window and a predetermined time window shift.
Specifically, a time window for speech extraction may be set in advance, and then a plurality of spectrogram features may be extracted from the original audio in a sliding window manner from a start time to an end time of the original audio with a predetermined time window shift, thereby forming a spectrogram feature sequence.
The length of the time window may be arbitrarily selected, for example, it may have a fixed window length of 1.2 seconds. As an example, the time window shift may be half the length of the time window, e.g., when the length of the time window is 1.2 seconds, the time window shift may be 0.6 seconds.
In the present application, spectrogram features are extracted from the original audio in a sliding-window manner, so that temporal continuity can be taken into account: the overlapping portion of adjacent windows is shared by the segmentation of both windows, which establishes a correlation between the spectrogram features of neighbouring segments. Thus, even for short-duration speech, the speaker can be well distinguished by combining context information, and the time point of a speaker change can be accurately estimated.
Optionally, the original audio may be preprocessed before the spectrogram features are extracted. As an example, as shown in fig. 2, silence detection may be performed on the acquired original audio and silence segments may be removed to obtain a plurality of separated effective speech segments; the effective speech segments may then be spliced to form the preprocessed original audio, and the spectrogram features may be extracted from this preprocessed audio. Removing the silence segments both speeds up speech separation and removes the interference of silence on the separation result. In other words, an effective audio formed by splicing the effective speech segments is obtained by removing the silence segments from the original audio, and a plurality of spectrogram features are then extracted from this effective audio in a time-window sliding manner to obtain the spectrogram feature sequence.
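As a concrete illustration of this step, the following is a minimal sketch in Python, assuming 16 kHz audio, librosa for the STFT, the 1.2 s window and 0.6 s shift mentioned above, and a simple energy threshold standing in for the unspecified silence detector; the file name is hypothetical.

```python
# Minimal sketch of step S10: energy-based silence removal (an assumption, the patent
# does not fix the detector) followed by sliding-window log-spectrogram extraction.
import numpy as np
import librosa


def remove_silence(wav, sr, frame_s=0.025, hop_s=0.010, floor_db=-40.0):
    """Drop low-energy frames and splice the remaining effective speech together."""
    frame, hop = int(frame_s * sr), int(hop_s * sr)
    rms = librosa.feature.rms(y=wav, frame_length=frame, hop_length=hop)[0]
    keep = librosa.amplitude_to_db(rms, ref=np.max(rms) + 1e-9) > floor_db
    mask = np.zeros(len(wav), dtype=bool)
    for i, k in enumerate(keep):
        if k:
            mask[i * hop: i * hop + frame] = True
    return wav[mask]


def spectrogram_sequence(wav, sr, win_s=1.2, shift_s=0.6, n_fft=512, hop=160):
    """Slide a fixed time window over the audio; one log spectrogram per window."""
    win, shift = int(win_s * sr), int(shift_s * sr)
    feats = []
    for start in range(0, max(len(wav) - win, 0) + 1, shift):
        chunk = wav[start:start + win]
        spec = np.abs(librosa.stft(chunk, n_fft=n_fft, hop_length=hop))
        feats.append(np.log(spec + 1e-6))
    return np.stack(feats)


wav, sr = librosa.load("original_audio.wav", sr=16000)   # hypothetical input file
spec_seq = spectrogram_sequence(remove_silence(wav, sr), sr)
```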
And step S20, inputting the spectrogram feature sequence into a pre-trained voice segmentation model, and acquiring an embedded feature sequence through the voice segmentation model.
In this step, the pre-trained speech segmentation model may segment the spectrogram feature sequence and output an embedded feature sequence comprising embedding features. Here, the spectrogram feature can be regarded as a low-level speech feature, while the embedded feature is a high-level speech feature: a speech feature obtained from the spectrogram feature whose degree of discrimination between features is more pronounced than that of the spectrogram feature itself.
Here, the embedded feature may represent a single speech segment, and as shown in fig. 2, the embedded feature sequence includes embedded features 1 through n, where n represents the total number of embedded features in the embedded feature sequence.
As an example, the speech segmentation model of the embodiment of the present application may be modeled on a Convolutional Neural Network (CNN). However, in an ordinary convolutional neural network, as the number of convolutional layers grows, some effective features learned in the shallow convolutions may be lost, and in the aggregation from the frame level to the utterance level the resulting utterance-level embedding features are not distinctive enough, so the speech segmentation effect is poor.
In this regard, in one example, the speech segmentation model of the embodiment of the present application is based on the ResNet architecture of the convolutional neural network and introduces the GhostVLAD algorithm in the feature aggregation stage, producing utterance-level embedded features based on an attention mechanism.
As an example, the speech segmentation model may be trained by:
and step S21, acquiring an original audio sample and a test embedded feature sample, and extracting a spectrogram feature sample from the original audio sample in a time window sliding window mode.
In this step, an existing original audio sample and a test embedded feature sample corresponding to the original audio sample may be obtained for training the speech segmentation model. The test embedded feature samples may represent actual embedded feature samples corresponding to the original audio samples, which may be obtained in any manner, e.g., by other pre-trained speech segmentation models.
Specifically, spectrogram feature samples may be extracted from the original audio sample with a predetermined time window, which may have a fixed window length, for example 2.5 seconds, i.e., a single utterance randomly intercepted from the original audio sample has a length of 2.5 seconds; accordingly, the dimension of the extracted spectrogram feature sample may be (257, 250, 1). However, the present application is not limited thereto, and spectrogram features may be extracted from an original audio sample with a time window of any length. Further, as an example, the batch size of the training may be set to 96, i.e., each training step takes 96 original audio samples as input.
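As a sanity check on the quoted dimensions, the short sketch below shows how a (257, 250, 1) sample arises under commonly used STFT settings (16 kHz sampling, 512-point FFT, 10 ms hop); these settings are assumptions, since the text does not state them.

```python
# Sanity check of the (257, 250, 1) sample shape quoted above, assuming 16 kHz audio,
# a 512-point FFT and a 10 ms hop; these settings are not stated in the text.
sr, n_fft, hop_s, clip_s = 16000, 512, 0.010, 2.5

freq_bins = n_fft // 2 + 1            # 257 one-sided frequency bins
frames = int(clip_s / hop_s)          # 250 frames in a 2.5 s clip
channels = 1                          # a single magnitude channel

print((freq_bins, frames, channels))  # -> (257, 250, 1)
```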
And step S22, obtaining a prediction embedded feature sample according to a feature aggregation formula based on the spectrogram feature sample.
As an example, in the improved convolutional neural network structure of the embodiment of the present application, the convolutional layers may be divided into a plurality of residual block portions, and each residual block portion may include one or more convolution layers. As an example, there may be 9 residual block portions, each including two convolution layers, i.e., the convolutional part contains 18 convolution layers in total.
In the aggregation stage of the network structure, the GhostVLAD algorithm can be applied one or more times, and the feature aggregation quantities obtained by these GhostVLAD operations can be fused into the final utterance-level embedded feature through a skip connection of the feature vectors.
As an example, the aggregation stage may apply the GhostVLAD algorithm twice: the output vector of the layer-9 convolution may undergo the first GhostVLAD feature aggregation, the output vector of the layer-18 convolution may undergo the second GhostVLAD feature aggregation, and the features aggregated by the two GhostVLAD operations may then be fused into the final utterance-level embedded feature through a skip connection of the feature vectors.
Here, the GhostVLAD algorithm may be regarded as an attention mechanism model built on a statistical method: it accumulates the residuals of the different feature vectors of different speech segment samples in a soft-assignment manner, and assigns different weights to the features through a softmax function according to the distance from each feature of the different speech segment samples to the corresponding cluster center, thereby introducing an attention mechanism into the feature aggregation process.
In addition, the GhostVLAD algorithm can reduce noise while aggregating features. Specifically, the different features within the same speech segment contribute to the residual sum vectors according to their distances to the corresponding valid cluster centers; likewise, codebook (ghost) centers are provided for clustering the noise points in the spectrogram features, but in the calculation stage no weight is given to the noise centers, so the noise features do not participate in the generation of the residual sum vectors, and noise reduction is thereby achieved.
As an example, the feature aggregation formula of the GhostVLAD algorithm may be expressed as:

$$V_{j,k}=\sum_{i=1}^{N}\frac{e^{a_{k}^{T}\left(X_{i}-c_{k}\right)}}{\sum_{k'=1}^{K+G}e^{a_{k'}^{T}\left(X_{i}-c_{k'}\right)}}\left(X_{i,j}-c_{k,j}\right)$$

wherein $V_{j,k}$ is the feature aggregation quantity; $X_{i,j}$ represents the $j$-th feature value of the $i$-th local descriptor of the spectrogram features, $j=1,\ldots,J$, with $J$ the total number of feature values contained in a local descriptor, which may be determined by the size of the convolution kernel; $c_{k,j}$ represents the $j$-th feature value of the $k$-th cluster center among the valid speech cluster centers of the features obtained from the spectrogram feature sample through the convolution kernel operation; $K$ represents the number of valid speech cluster centers, $k=1,\ldots,K$; $G$ represents the number of noise point cluster centers; $T$ represents the vector transpose operation; $N$ represents the total number of local descriptors of the features obtained from the spectrogram feature sample through the convolution kernel operation, $i=1,\ldots,N$; $a_{k}$ represents the weight coefficient of the cluster center; and $c_{k',j}$ represents the $j$-th feature value of the $k'$-th cluster center among the valid speech cluster centers and the noise point cluster centers, $k'=1,\ldots,K+G$.
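For illustration, the following is a PyTorch sketch of a GhostVLAD-style aggregation layer. It follows the published GhostVLAD formulation with a convolution-based soft assignment and learned cluster centers; this parameterization, the cluster counts and the feature dimension are assumptions and may differ in detail from the exact formula above.

```python
# A PyTorch sketch of a GhostVLAD-style aggregation layer: soft assignment over
# K valid + G ghost clusters, residual aggregation over the K valid clusters only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GhostVLAD(nn.Module):
    def __init__(self, feat_dim, num_clusters=8, num_ghost=2):
        super().__init__()
        self.K, self.G = num_clusters, num_ghost
        # Soft-assignment weights over K valid + G ghost clusters.
        self.assign = nn.Conv1d(feat_dim, num_clusters + num_ghost, kernel_size=1)
        # Centers are kept only for the K valid clusters.
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim) * 0.1)

    def forward(self, x):
        # x: (batch, feat_dim, N) local descriptors from the convolutional backbone.
        a = F.softmax(self.assign(x), dim=1)           # (B, K+G, N) soft assignment
        a = a[:, : self.K, :]                          # ghost clusters get no weight
        # Residuals to each valid center: (B, K, feat_dim, N)
        resid = x.unsqueeze(1) - self.centers.unsqueeze(0).unsqueeze(-1)
        v = (a.unsqueeze(2) * resid).sum(dim=-1)       # (B, K, feat_dim)
        v = F.normalize(v, p=2, dim=-1)                # intra-normalization
        return F.normalize(v.flatten(1), p=2, dim=-1)  # utterance-level embedding


# e.g. descriptors taken from the layer-9 or layer-18 convolution output:
pooled = GhostVLAD(feat_dim=256)(torch.randn(4, 256, 100))
print(pooled.shape)                                    # torch.Size([4, 2048])
```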
As an example, in training a speech segmentation model, predictive embedded features may be output from a fully-connected layer to compose a sequence of predictive embedded features.
In the application, an attention mechanism is introduced into feature aggregation, so that the comprehensive characterization capability of the speech-level embedded features output by a trained speech segmentation model is stronger, and the distinctiveness is better.
And step S23, training a voice segmentation model based on the test embedded characteristic sample and the prediction embedded characteristic sample.
In this step, the speech segmentation model may be trained using an AM-softmax loss function based on the test embedded feature samples and the predicted embedded feature samples, as an example.
Specifically, the AM-softmax loss function enlarges the optimization space of the model by increasing the inter-class distance, so that the differences between the feature vectors extracted for different speakers are amplified, which improves the discriminative ability of the model and yields a better classification effect.
As an example, the AM-softmax loss function can be expressed as:

$$L_{am}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\left(\cos\theta_{y_{i}}-m\right)}}{e^{s\left(\cos\theta_{y_{i}}-m\right)}+\sum_{j\neq y_{i}}e^{s\cos\theta_{j}}}$$

wherein $L_{am}$ represents the AM-softmax loss; $N$ represents the overall total number of speakers, $i=1,\ldots,N$; $y_{i}$ represents the $i$-th prediction vector (e.g., a prediction embedded feature sample); $\theta_{y_{i}}$ represents the angle between the $i$-th prediction vector and its corresponding true vector (e.g., the test embedded feature sample); $\cos\theta_{j}$ represents the cosine of the angle between any other prediction vector $j$ and the corresponding true vector; $m$ reduces the cosine of the angle between the prediction vector and the true vector (i.e., introducing $m$ decreases that cosine, which is equivalent to increasing the angle between the prediction vector and the true vector); $e^{s\left(\cos\theta_{y_{i}}-m\right)}$ is the numerator of the loss term of the predicted $y_{i}$ vector; and $s$ is a scaling factor.
In the present application, using the AM-softmax loss function as the loss function for training the speech segmentation model can increase the distance between the target vector and the real vector without affecting the back-propagation efficiency of the model.
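A sketch of the AM-softmax loss in PyTorch is given below; the scale s = 30 and margin m = 0.35 are common defaults and are assumptions here, since the text does not fix their values.

```python
# Additive-margin softmax: subtract the margin m from the true-class cosine only,
# then scale all cosines by s before the usual cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AMSoftmaxLoss(nn.Module):
    def __init__(self, emb_dim, num_speakers, s=30.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))

    def forward(self, emb, labels):
        # Cosine between each embedding and each speaker weight vector.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # (B, num_speakers)
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (cos - self.m * onehot)
        return F.cross_entropy(logits, labels)


loss_fn = AMSoftmaxLoss(emb_dim=512, num_speakers=1000)
loss = loss_fn(torch.randn(96, 512), torch.randint(0, 1000, (96,)))  # batch size 96
```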
In addition, when an attention mechanism is introduced in the feature polymerization process, parameters of the model can be optimized by adopting an AM-softmax loss function as a loss function of the model.
The speech segmentation model trained by the embodiment of the application can generate the speech-level embedded features based on an attention mechanism so as to be used for speech clustering. Compared with a Gaussian mixture model or other frame-level features, the speech-level embedding feature can make full use of time domain and frequency domain information and contain more speech information, and meanwhile, compared with the traditional speech-level embedding feature, the speech-level embedding feature has the advantages that the introduction of an attention mechanism enables the speech-level embedding feature to extract speech features with stronger representation capability, so that the whole speech separation model has good performance on short-time speech segments.
Returning to fig. 1, step S30 is to input the embedded feature sequence into a pre-trained speech clustering model, and obtain a predicted tag sequence corresponding to the embedded feature sequence through the speech clustering model.
In this step, as shown in fig. 2, the embedded features output from the speech segmentation model may be used as input quantities of a pre-trained speech clustering model, and the pre-trained speech clustering model may generate corresponding prediction labels based on the embedded features, thereby forming a prediction label sequence. Here, the predictive tag may represent an identity of a speaker for each embedded feature, and a plurality of predictive tags corresponding to a plurality of embedded features in the sequence of embedded features form a sequence of predictive tags. In an embodiment of the present application, the identity of a speaker can be represented by a positive integer, such as speaker 1, speaker 2, etc., and thus, the tag value can be a positive integer and the tag sequence can be a set of positive integers, such as {1,2,1,3,3,4 }.
As an example, the speech clustering model may be trained by: step S31, obtaining a plurality of groups of original audio samples, wherein each group of original audio samples comprises a plurality of single speaker original audio samples respectively belonging to a plurality of speakers; step S32, obtaining training embedded characteristic sample sequences from each group of original audio samples in the groups of original audio samples; and step S33, training the voice clustering model by using a plurality of training embedded characteristic sample sequences of a plurality of groups of original audio samples.
In step S31, each set of original audio samples may include a plurality of single-speaker original audio samples respectively belonging to a plurality of speakers, that is, each original audio sample corresponds to one speaker, and the plurality of original audio samples respectively belong to the plurality of speakers.
In step S32, the training embedded feature sample sequence may be obtained by:
step S321, extracting spectrogram feature samples of each speaker from each single speaker original audio sample of each group of original audio samples in a time window sliding mode.
As an example, the original audio sample may be a single speaker audio file with speaker identity labels, and spectrogram feature samples may be extracted in a sliding window manner from the original audio sample.
Specifically, a time window for speech extraction may be preset, and then a plurality of spectrogram feature samples may be extracted from the original audio sample from a start time to an end time of the original audio sample in a sliding window manner with a predetermined time window shift, thereby forming a spectrogram feature sample sequence.
Here, two or more of the time windows of the plurality of original audio samples for each set of original audio samples may have different lengths. For example, the plurality of original audio samples may include 10000 original audio samples, the 10000 original audio samples may be divided into 5 batches (batch) of original audio samples, 5 time windows with different lengths may be respectively preset for the 5 batches of original audio samples, and accordingly, 5 different time window shifts may be respectively preset. As an example, the time window shift may be half the length of the corresponding time window.
As an example, the length of the time window may be determined from the original audio sample, in particular from the speech scene of the original audio sample, which allows different time windows to be set for different speech scenes. For example, the length of the time window may be positively correlated with how long a single speaker speaks continuously in the original audio sample: for a speech scene with short utterances, such as a telephone recording, the speaker changes frequently, so a shorter time window and time-window shift may be set; for a speech scene in which the speaker does not change for a relatively long time, such as a conference speech, a longer time window and time-window shift may be set.
Here, the length of the time window and the time-window shift may be variable. In particular, considering that the voice separation task contains a large number of frequently alternating short utterances (speech duration < 1 s), the length of the time window may be preset in the range of 0.5 s to 1.6 s. As an example, the time-window shift may be half the length of the corresponding variable-length time window, e.g., in the range of 0.25 s to 0.8 s when the length of the time window is in the range of 0.5 s to 1.6 s.
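A small sketch of this choice follows; the concrete scene-to-window mapping is an assumption used only to illustrate how the window length and shift could be tied to the speech scene.

```python
# Illustrative scene-dependent window configuration: window length chosen from the
# 0.5-1.6 s range per scene (mapping values are assumptions), shift = half the window.
SCENE_WINDOWS_S = {
    "telephone": 0.6,    # frequent speaker changes -> short window
    "interview": 1.0,
    "conference": 1.6,   # long single-speaker stretches -> long window
}


def window_config(scene: str) -> dict:
    win = SCENE_WINDOWS_S.get(scene, 1.0)
    return {"window_s": win, "shift_s": win / 2}


print(window_config("telephone"))   # {'window_s': 0.6, 'shift_s': 0.3}
```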
In the training of the speech clustering model, the speech scenes of the plurality of original audio samples are distinguished, and the original audio samples of different scenes are extracted with variable-length time windows and time-window shifts, so that the trained speech clustering model is suitable for voice separation both in long-utterance speech scenes and in short-utterance speech scenes. Optionally, the original audio samples may be preprocessed before the spectrogram feature samples are extracted. As an example, silence detection may be performed on the acquired original audio, silence segments may be removed to obtain a plurality of separated effective speech segments, the effective speech segments may then be spliced to form the preprocessed original audio sample, and the spectrogram feature samples may be extracted from the preprocessed original audio sample.
Step S322, inputting the spectrogram characteristic sample of each speaker into a pre-trained voice segmentation model to obtain a training embedded characteristic sample of each speaker.
In this step, for example, the spectrogram feature sample of each speaker may be input into the above-described pre-trained speech segmentation model, so as to obtain embedded features as high-level features of speech based on the spectrogram features of each speaker, so that the degree of distinction between the speech features is higher.
And step S323, randomly splicing the training embedded characteristic samples of the multiple speakers to obtain a training embedded characteristic sample sequence comprising the multiple training embedded characteristic samples.
In this step, training embedded feature samples of multiple speakers can be randomly spliced to obtain a spliced training embedded feature sample sequence, so that the training embedded feature samples in the training embedded feature sample sequence belong to the multiple speakers, and the occurrence sequence of different speakers is random.
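The sketch below illustrates step S323 under the assumption that each speaker contributes an array of utterance-level embedded samples; the shapes and the uniform shuffling are illustrative only.

```python
# Randomly interleave per-speaker embedded feature samples into one training sequence,
# keeping the speaker labels aligned with the spliced features.
import numpy as np


def splice_training_sequence(per_speaker_embeddings, seed=0):
    rng = np.random.default_rng(seed)
    segments = [(spk, vec) for spk, emb in per_speaker_embeddings.items() for vec in emb]
    order = rng.permutation(len(segments))            # random speaker order
    labels = np.array([segments[i][0] for i in order])
    features = np.stack([segments[i][1] for i in order])
    return features, labels


features, labels = splice_training_sequence({
    1: np.random.randn(3, 512),   # speaker 1: three training embedded feature samples
    2: np.random.randn(2, 512),   # speaker 2: two training embedded feature samples
})
print(features.shape, labels)     # (5, 512) and a random label order such as [2 1 1 2 1]
```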
In step S33, the training the speech clustering model using the training embedded feature sample sequences of the original audio samples may include:
and step S331, carrying out speaker identity label labeling on each training embedded characteristic sample sequence in the plurality of training embedded characteristic sample sequences, and determining a test label sample sequence.
Specifically, for the training embedded feature sample sequence including a plurality of training embedded feature samples acquired in step S323, the corresponding training embedded feature samples may be combined with the test tag samples, thereby determining the test tag sample sequence. Here, the test tag sample may be known at the time the original audio sample was taken, e.g., the number of speakers and the speaking time of the speakers are known.
For example, given that a group of original audio samples contains two speakers, i.e., speaker 1 and speaker 2, the training embedded feature sample sequence may be {embedded feature sample of speaker 1, embedded feature sample of speaker 2, embedded feature sample of speaker 2, embedded feature sample of speaker 1, embedded feature sample of speaker 2}, and accordingly the test label sample sequence may be represented as {1, 2, 2, 1, 2}.
And S332, acquiring a training prediction label sequence by utilizing a voice clustering model based on each training embedded characteristic sample sequence in the plurality of training embedded characteristic sample sequences.
In this step, each training embedded feature sample sequence in the plurality of training embedded feature sample sequences may be input into the speech clustering model, and a corresponding training prediction tag sequence may be output by using the speech clustering model. Here, a plurality of training embedded feature sample sequences may be input into the speech clustering model in batches.
And S333, training the voice clustering model according to the prior probability based on the training prediction label sequence and the test label sample sequence, wherein the prior probability refers to the probability that the next predicted training prediction label determined according to the predicted training prediction label changes.
In this step, since a plurality of training embedded feature sample sequences are input to the speech clustering model in batches during the training process, the training prediction label value obtained in the previous training batch can be compared with the test label sample value to obtain the prior probability, and then the prior probability and the next training embedded feature sample sequence are used together as the input quantity of the model calculation (for example, the prior probability can be used as the hyperparameter of the operation node of the speech clustering model) to participate in the next calculation to optimize the model.
As an example, in step S333, the training predicted label sequence and the test label sample sequence may be input into a clustering network model for training, wherein the clustering network model may generate the training predicted labels based on prior probabilities.
In the training process of the voice clustering model, the prediction precision of the subsequent model can be improved by continuously optimizing the prior probability. As an example, the maximum likelihood estimation method may be used to estimate the parameters of the prior probability, so as to determine the prior probability.
In one example, the prior probability may include a speaker transition signal sequence probability $P_{2}$.
In this case, $S_{i}$ obeys the speaker transition signal sequence probability $P_{2}$ and represents the speaker transition signal of a speech segment: when the adjacent speaker labels in the embedded feature sample sequence change, $S_{i}$ takes the value 1; when the adjacent speaker labels do not change, $S_{i}$ takes the value 0.
The speaker transition signal sequence probability $P_{2}$ is determined through its parameter estimate $\hat{p}$.
Specifically, the parameter estimate $\hat{p}$ can be expressed as:

$$\hat{p}=\frac{\sum_{n=1}^{N}\sum_{i=2}^{I_{n}}\mathbb{1}\left(y_{n,i}=y_{n,i-1}\right)}{\sum_{n=1}^{N}\left(I_{n}-1\right)}$$

wherein $N$ represents the total number of training embedded feature sample sequences, $n=1,\ldots,N$; $I_{n}$ represents the total number of training embedded feature samples contained in the $n$-th training embedded feature sample sequence (i.e., the sequence length); $i$ indexes the $i$-th training embedded feature sample; and $y_{n,i}$ represents the training prediction label of the $i$-th training embedded feature sample in the $n$-th sequence. The indicator $\mathbb{1}\left(y_{n,i}=y_{n,i-1}\right)$ outputs 1 when $y_{n,i}=y_{n,i-1}$ and 0 when $y_{n,i}\neq y_{n,i-1}$.
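The following sketch computes this parameter estimate as the fraction of adjacent label pairs whose speaker does not change, which is one reading of the estimator consistent with the indicator defined above; the exact normalization used in the patent's original formula is an assumption.

```python
# Fraction of adjacent label pairs with no speaker change, over all training sequences.
def estimate_no_change_prob(label_sequences):
    same, pairs = 0, 0
    for seq in label_sequences:
        for prev, cur in zip(seq, seq[1:]):
            pairs += 1
            same += int(cur == prev)           # indicator 1(y_{n,i} == y_{n,i-1})
    return same / pairs if pairs else 0.0


p_hat = estimate_no_change_prob([[1, 1, 2, 2, 2, 1], [3, 3, 3, 1]])
print(p_hat)                                   # 5 same-speaker pairs out of 8 -> 0.625
```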
In another example, the prior probabilities can include speaker tag assigned sequence probabilities, which can be determined by: and determining the probability of the speaker label distribution sequence according to the number of times of speaker change in the predicted training prediction labels and the total number of the predicted training prediction labels.
Here, the speaker label assignment sequence probability $P_{1}$ indicates the probability that, when the speaker changes, the next speech segment jumps to an existing speaker or to a new speaker.
According to the properties of the probability distribution, when the speaker changes, if the next input speech segment belongs to a known speaker $k$, the probability of the true label of that segment is proportional to $N_{k,i-1}$. Here, $N_{k,i-1}$ denotes the number of speech blocks belonging to a specific speaker $k$ among the 1st through $(i-1)$-th training embedded feature samples, where a speech block is a run of consecutive speaker labels in the speech segment belonging to the same speaker. For example, the sequence $y_{[6]}=(1,1,2,2,3,1)$ contains four speech blocks $(1,1)$, $(2,2)$, $(3)$, $(1)$, where $N_{1,5}=2$, $N_{2,5}=1$, $N_{3,5}=1$.
Thus, for a known speaker, the speaker label assignment sequence probability can be expressed as:

$$p\left(y_{i}=k\mid S_{i}=1,\,y_{[i-1]}\right)\propto N_{k,i-1},\qquad k\le K_{i-1}$$

wherein $y_{i}$ represents the value of the current training prediction label, $S_{i}$ indicates whether the speaker label changes, and $S_{i}=1$ indicates that the speaker label changes.
When the speaker changes, if the next input speech segment belongs to an unknown (new) speaker, the probability of the true label of that segment is proportional to the statistical parameter $\hat{\alpha}$, and the speaker label assignment sequence probability can be expressed as:

$$p\left(y_{i}=K_{i-1}+1\mid S_{i}=1,\,y_{[i-1]}\right)\propto\hat{\alpha}$$

wherein $K_{i-1}$ represents the total number of distinct speakers contained in the 1st through $(i-1)$-th training embedded feature samples; $S_{i}$ indicates whether the speaker label changes, $S_{i}=1$ indicating a change and $S_{i}=0$ indicating no change; $y_{i}$ represents the value of the $i$-th training prediction label; and $y_{i-1}$ represents the training prediction label value of the $(i-1)$-th training embedded feature sample in the $n$-th sequence.
As an example, the statistical parameter $\hat{\alpha}$ may be determined from the number of speaker changes among the predicted training prediction labels and the total number of predicted training prediction labels. For example, the statistical parameter can be expressed as:

$$\hat{\alpha}=\frac{\sum_{m=1}^{|D|}\sum_{i=1}^{N-1}\mathbb{1}\left(y_{m,i}\neq y_{m,i+1}\right)}{\sum_{m=1}^{|D|}\left|Y_{m}\right|}$$

wherein $\hat{\alpha}$ represents the statistical parameter; $|D|$ represents the total number of training embedded feature sample sequences in the plurality of training embedded feature sample sequences; $m$ represents the $m$-th training embedded feature sample sequence, $m=1,\ldots,|D|$; $Y_{m}=\{y_{m,1},\ldots,y_{m,i},y_{m,i+1},\ldots,y_{m,N}\}$ represents the training prediction label sequence corresponding to the $m$-th training embedded feature sample sequence; $|Y_{m}|$ represents the total number of training prediction label values in that sequence; $y_{m,i}$ represents the training prediction label value of the $i$-th embedded feature sample in the $m$-th training prediction label sequence; $N$ represents the total number of training prediction labels in the $m$-th training prediction label sequence; and $i=1,\ldots,N-1$. The indicator $\mathbb{1}\left(y_{m,i}\neq y_{m,i+1}\right)$ outputs 1 when $y_{m,i}\neq y_{m,i+1}$ and 0 when $y_{m,i}=y_{m,i+1}$.
In the above expression, the denominator represents the total number of predicted training prediction labels, and the numerator represents the number of speaker changes occurring among the predicted training prediction labels.
In the process of training the speech clustering model, the statistical parameter $\hat{\alpha}$ is continuously optimized, thereby realizing the training of the model.
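The sketch below illustrates the assignment probabilities above: speech-block counts per speaker are computed from the labels seen so far, an existing speaker is weighted by its block count, and a new speaker is weighted by the statistical parameter; the value of alpha is illustrative.

```python
# Block counts per speaker and the resulting assignment probabilities described above.
from collections import Counter


def block_counts(labels):
    """Count speech blocks (maximal runs of one speaker) per speaker."""
    counts = Counter()
    for i, spk in enumerate(labels):
        if i == 0 or labels[i - 1] != spk:     # a new block starts here
            counts[spk] += 1
    return counts


def assignment_probs(labels, alpha):
    counts = block_counts(labels)
    total = sum(counts.values()) + alpha
    probs = {spk: n / total for spk, n in counts.items()}
    probs["new"] = alpha / total               # probability of jumping to a new speaker
    return probs


print(block_counts([1, 1, 2, 2, 3, 1]))        # Counter({1: 2, 2: 1, 3: 1})
print(assignment_probs([1, 1, 2, 2, 3, 1], alpha=0.5))
```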
In another example, the prior probability may include a speaker embedded feature generation sequence probability $P_{0}$, which is determined through the parameter estimates $\hat{\mu}$ and $\hat{\sigma}^{2}$ of the speaker embedded feature generation sequence probability.
Specifically, the parameter estimates may be obtained by a mini-batch gradient-ascent update of the form:

$$\left(\hat{\mu},\hat{\sigma}^{2}\right)\leftarrow\left(\hat{\mu},\hat{\sigma}^{2}\right)+P_{I}\,\nabla\!\left(\frac{N}{b}\sum_{x_{n}\in B_{I}}\log p\left(x_{n}\mid\text{known prior information}\right)\right)$$

wherein $I$ denotes the $I$-th training batch in the case where the plurality of embedded feature sample sequences is divided into a plurality of training batches; $N$ denotes the total number of embedded feature sample sequences serving as training input in the $I$-th training batch; $P_{I}$ denotes the step value of the gradient-ascent optimization strategy in the model training phase of the $I$-th training batch; $B_{I}$ denotes a subset of embedded feature samples formed by embedded feature samples selected from the embedded feature samples included in the sequences serving as training input in the $I$-th training batch; $b$ denotes the total number of embedded feature samples included in the subset; $x_{n}\in B_{I}$ indicates that the $n$-th embedded feature sample belongs to the subset $B_{I}$; and $p\left(x_{n}\mid\cdot\right)$ denotes the probability of obtaining the label $x_{n}$ given the known prior information. In particular, these probabilities may be obtained as the product of the probabilities mentioned above.
In yet another example, the prior probability may include all three of the probabilities mentioned above: the speaker embedded feature generation sequence probability $P_{0}$, the speaker label assignment sequence probability $P_{1}$ and the speaker transition signal sequence probability $P_{2}$. In this case, the prior probability can be expressed as their product:

$$P=P_{0}\cdot P_{1}\cdot P_{2}$$

In this example, for the input speech segment data, the value probability of the speaker transition signal is calculated, the probability of the current speech segment label given the previous speech segment label and the current value of the transition signal is calculated, and the accuracy of the embedded feature observation obtained under the previous state weight parameters and the current speech segment label is calculated, so as to determine the prior probability and realize the training of the speech clustering model. The parameters $\hat{p}$, $\hat{\alpha}$, $\hat{\mu}$ and $\hat{\sigma}^{2}$ are the hyperparameters of the operation nodes of the speech clustering model, and training the speech clustering model in this process is in fact an iterative optimization of at least one of these parameters.
FIG. 3 is a diagram illustrating the determination of the loss function in the step of training the speech clustering model in the speech separation method according to the embodiment of the present application.
As an example, the plurality of training embedded feature sample sequences may include L training embedded feature sample sequences.
In this case, the speech clustering model may be trained based on the loss function.
Specifically, as shown in FIG. 3, the loss function may be determined by: embedding the feature sample sequence based on the first training of the L training embedded feature sample sequences to the second trainingi-1 mean vector of the training embedded feature sample sequence and the second of the L training embedded feature sample sequencesiDetermining a loss function from the training embedded feature sample sequence to the mean vector of the L-th training embedded feature sample sequence, wherein,iand L is an integer and satisfies 1<i<And L. The determined loss function can be substituted back into the speech clustering model to ensure the accuracy and stability of the model.
The speech clustering model of the embodiment of the present application may be, for example, a UIS-RNN model. The loss function may be, for example, an MSE loss function; specifically, the MSE loss may be computed between the mean vector of the sequence from the i-th (i.e., i-th period) spectrogram feature sample to the final L-th (i.e., L-th period) spectrogram feature sample in the temporal feature sequence and the mean vector of the sequence from the 1st (i.e., 1st period) spectrogram feature sample to the (i-1)-th (i.e., (i-1)-th period) spectrogram feature sample in the temporal feature sequence.
Compared with a loss function determined based on the mean vector of the first to (L-1)-th training embedded feature sample sequences among the L training embedded feature sample sequences and the L-th training embedded feature sample sequence, the loss function determined according to the method of the present application can better ensure the accuracy and stability of the model. Here, the calculation of the loss function from the mean vectors can be implemented by any means known in the art and is not described again.
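As an illustrative sketch of the split-mean MSE loss described above (the function name and data layout are assumptions, not the patented implementation):

```python
import numpy as np

def split_mean_mse_loss(embedding_sequences, i):
    """MSE between the mean vector of sequences 1..i-1 and that of sequences i..L.

    embedding_sequences: list of L arrays, each of shape (num_samples, dim);
    i: split index with 1 < i < L (1-based, as in the description above).
    """
    # Mean vector over the first (i-1) training embedded feature sample sequences.
    head = np.concatenate(embedding_sequences[: i - 1], axis=0)
    # Mean vector over the i-th to L-th training embedded feature sample sequences.
    tail = np.concatenate(embedding_sequences[i - 1 :], axis=0)
    mean_head = head.mean(axis=0)
    mean_tail = tail.mean(axis=0)
    # Mean squared error between the two mean vectors.
    return float(np.mean((mean_head - mean_tail) ** 2))
```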
Returning to fig. 1, step S40 is to perform single speaker speech restoration according to the predicted tag sequence to generate separated speech.
In this step, as shown in fig. 2, the speech segments belonging to each speaker may be extracted according to the predicted tag sequence obtained in step S30, and the speech segments belonging to the same speaker are spliced, so as to restore the individual speech of each speaker in the original audio, and obtain the final separated speech.
Alternatively, when the original audio is pre-processed in step S10, the predicted tag sequence may be post-processed accordingly in this step S40. For example, when silence is removed from the original audio in step S10, the obtained predicted tag sequence may be time-aligned with reference to the times of the valid speech segments generated in step S10, so that the correspondence between the predicted tag sequence and the real time is restored.
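A minimal sketch of the single-speaker restoration step, assuming the valid speech segments and their original start/end times from step S10 are available (names and data layout are illustrative):

```python
from collections import defaultdict
import numpy as np

def restore_single_speaker_audio(segments, predicted_labels, segment_times):
    """Splice the speech segments assigned to each predicted speaker label.

    segments:         list of 1-D waveform arrays, one per valid speech segment;
    predicted_labels: predicted tag sequence, one label per segment (step S30);
    segment_times:    (start_sec, end_sec) of each segment in the original audio,
                      used here only to keep the original temporal order.
    Returns a dict mapping speaker label -> concatenated waveform.
    """
    per_speaker = defaultdict(list)
    for idx in np.argsort([start for start, _ in segment_times]):
        per_speaker[predicted_labels[idx]].append(segments[idx])
    return {label: np.concatenate(chunks) for label, chunks in per_speaker.items()}
```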
Further, the post-processing performed on the predicted tag sequence may further include tag smoothing processing, for example, it may be determined whether an abnormal value exists in a plurality of consecutive predicted tags in the predicted tag sequence, and the abnormal value may be smoothed and the final separated speech may be generated based on the smoothed predicted tag sequence.
For example, when the speakers include speaker 1 and speaker 2 and the predicted tag sequence is {2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1}, it can be seen that a predicted tag of speaker 2 appears among a plurality of consecutive predicted tags of speaker 1; this predicted tag of speaker 2 can therefore be regarded as an abnormal value and smoothed, i.e., modified into a predicted tag of speaker 1, giving the smoothed predicted tag sequence {2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1}; the final separated speech can then be generated based on the smoothed predicted tag sequence.
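A simple sketch of the label smoothing described above, using a sliding-window majority vote (the window size and voting rule are illustrative assumptions; the description only requires that isolated abnormal values be smoothed):

```python
def smooth_predicted_tags(tags, window=3):
    """Replace isolated outlier labels with the locally dominant label.

    A simple majority vote in a sliding window over the original tags;
    the window size of 3 is an illustrative choice.
    """
    smoothed = list(tags)
    half = window // 2
    for idx in range(len(tags)):
        lo, hi = max(0, idx - half), min(len(tags), idx + half + 1)
        neighborhood = tags[lo:hi]
        majority = max(set(neighborhood), key=neighborhood.count)
        if tags[idx] != majority:
            smoothed[idx] = majority
    return smoothed

# [2,2,2,2,1,1,1,2,1,1,1,1] -> [2,2,2,2,1,1,1,1,1,1,1,1]
```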
The voice separation method of the embodiment of the application can be used for voice separation of multiple speakers and can be applied to various voice interaction products. For example, in meeting and news scenarios, it can be used to estimate the number of participants and to separate the speech of important speakers; in telephone conversation scenarios, it can be used to separate the recording of a telephone conversation for voiceprint enrollment and comparison; and voice separation can also improve speech recognition performance in multi-speaker environments.
In the voice separation method of the embodiment of the application, a ResNet convolutional neural network with a deep stack of convolutional layers is adopted in the voice segmentation stage, the embedded features are extracted in a sliding-window manner, and an attention mechanism is introduced in the feature aggregation stage.
The voice separation method provided by the embodiment of the application adopts the supervised clustering algorithm in the voice clustering stage, can effectively utilize prior information, dynamically adjusts the model parameters according to the clustering result, meets the requirement of a large-data-volume environment on the model clustering effect, and is higher in clustering accuracy. In addition, the speech separation method of the embodiment of the application optimizes the loss function in the speech clustering stage, adopts a statistical method to replace network training iteration, and improves the parameter calculation mode of the prior probability of the generated prediction label.
The method in the present application may be implemented by a voice separation apparatus in an electronic device, or may be implemented entirely by a computer program; for example, the method may be executed by a voice separation application installed in the electronic device, or by a functional program implemented in the operating system of the electronic device. By way of example, the electronic device may be a personal computer, a server, a tablet computer, a smart phone, or another electronic device with artificial intelligence computing capability.
Another aspect of the present application relates to a voice separation apparatus. Fig. 4 shows a schematic block diagram of a speech separation apparatus according to an exemplary embodiment of the present application.
As shown in fig. 4, the voice separating apparatus according to the exemplary embodiment of the present application includes an extracting unit 100, a dividing unit 200, a clustering unit 300, and a separating unit 400.
The extraction unit 100 acquires an original audio, and extracts a spectrogram feature sequence from the original audio in a time window sliding manner.
The segmentation unit 200 inputs the spectrogram feature sequence into a pre-trained speech segmentation model, and obtains an embedded feature sequence through the speech segmentation model.
The clustering unit 300 inputs the embedded feature sequence into a pre-trained speech clustering model, and obtains a prediction tag sequence corresponding to the embedded feature sequence through the speech clustering model.
The separation unit 400 performs single-speaker speech restoration according to the predicted tag sequence to generate separated speech.
The extracting unit 100, the segmenting unit 200, the clustering unit 300 and the separating unit 400 may perform the corresponding steps of the speech separation method in the method embodiments shown in FIG. 1 to FIG. 3, for example through machine-readable instructions executable by these units; for their specific implementations, reference may be made to the method embodiments described above, which are not repeated here.
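A compact sketch of how the four units might be chained (module boundaries, window sizes and model interfaces are assumptions for illustration, not the patented implementation):

```python
import numpy as np

def sliding_window_features(audio, sample_rate, win_sec=1.0, hop_sec=0.5):
    """Extraction unit: cut the waveform into overlapping time windows; in
    practice each window would then be converted into a spectrogram feature."""
    win, hop = int(win_sec * sample_rate), int(hop_sec * sample_rate)
    return [audio[s:s + win] for s in range(0, max(1, len(audio) - win + 1), hop)]

def separate(audio, sample_rate, segmentation_model, clustering_model):
    """End-to-end flow mirroring the extraction / segmentation / clustering /
    separation units described above; the two models are assumed callables."""
    windows = sliding_window_features(audio, sample_rate)        # extraction unit
    embeddings = [segmentation_model(w) for w in windows]        # segmentation unit
    predicted_tags = clustering_model(np.stack(embeddings))      # clustering unit
    speakers = {}                                                # separation unit
    for window, tag in zip(windows, predicted_tags):
        speakers.setdefault(int(tag), []).append(window)
    return {tag: np.concatenate(chunks) for tag, chunks in speakers.items()}
```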
An embodiment of the present application further provides an electronic device, which includes a processor and a memory storing a computer program. When the computer program is executed by the processor, the electronic device may perform the corresponding steps of the speech separation method in the method embodiments shown in FIG. 1 to FIG. 3; for its specific implementation, reference may be made to the method embodiments described above, which are not repeated here.
The embodiment of the present application further provides a computer-readable storage medium storing a computer program, which, when executed (for example, by a processor), can perform the steps of the voice separation method in the method embodiments shown in fig. 1 to fig. 3.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment scheme of the application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
According to the voice separation method, the voice separation device, the electronic equipment and the storage medium of the embodiments of the application, separated voice can be generated by first segmenting and then clustering the spectrogram features extracted from the voice in a time-window sliding manner, so that voice fragments belonging to a single speaker can be separated from a short-duration audio file in which multiple persons speak alternately.
In addition, according to the speech separation method, the speech separation device, the electronic equipment and the storage medium, the speech spectrogram feature sequence can be extracted from the original audio in a time window sliding mode, the continuity of the extracted speech spectrogram features is ensured, and the number of speakers can be accurately estimated in connection with the context information.
In addition, according to the voice separation method, the voice separation device, the electronic equipment and the storage medium of the application, a fully supervised clustering algorithm is adopted in the training of the voice clustering model, so that the prior information in the separation process can be fully utilized and the model parameters can be dynamically adjusted; this effectively solves the problem that the performance of unsupervised clustering degrades when the voice data volume is large and the number of speakers is large, and better meets the requirement of a large-data environment on the model separation effect.
In addition, according to the speech separation method, the speech separation device, the electronic equipment and the storage medium of the application, a feature aggregation formula is used to introduce an attention mechanism into the model during training, so that short-time speech features with stronger representational power can be extracted.
In addition, according to the speech separation method, the speech separation device, the electronic device and the storage medium of the application, the plurality of training embedded feature sample sequences are divided into two groups, and the loss function of the speech clustering model is determined based on the mean vectors of the two groups of training embedded feature sample sequences, so that the loss function can be optimized, and the accuracy and the stability of the speech clustering model are ensured.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A speech separation method, characterized in that the speech separation method comprises:
acquiring original audio, and extracting a spectrogram feature sequence from the original audio in a time window sliding mode;
inputting the spectrogram feature sequence into a pre-trained voice segmentation model, and acquiring an embedded feature sequence through the voice segmentation model;
inputting the embedded characteristic sequence into a pre-trained voice clustering model, and obtaining a prediction label sequence corresponding to the embedded characteristic sequence through the voice clustering model;
performing single speaker voice restoration according to the prediction tag sequence to generate separated voice,
wherein the speech clustering model is trained by:
acquiring a plurality of groups of original audio samples, wherein each group of original audio samples comprises a plurality of single speaker original audio samples belonging to a plurality of speakers respectively;
acquiring a training embedded feature sample sequence from each group of original audio samples in the multiple groups of original audio samples;
training the voice clustering model by utilizing a plurality of training embedded characteristic sample sequences of the plurality of groups of original audio samples according to prior probability, wherein the prior probability refers to the probability that the next predicted training prediction label determined according to the predicted training prediction label changes, and the prior probability comprises the speaker label distribution sequence probability,
wherein the speaker tag assigned sequence probability is determined by:
determining statistical parameters of the speaker label distribution sequence probability according to the number of times of speaker change in the predicted training prediction labels and the total number of the predicted training prediction labels;
determining the speaker tag distribution sequence probability according to the statistical parameters,
wherein the statistical parameter is represented by an expression (given as an image in the original publication) in which |D| represents the total number of training embedded feature sample sequences in the plurality of training embedded feature sample sequences, m represents the m-th training embedded feature sample sequence among the plurality of training embedded feature sample sequences, m = 1, …, |D|, Y_m = {y_m,1, …, y_m,i, y_m,i+1, …, y_m,N}, Y_m represents the training prediction label sequence corresponding to the m-th training embedded feature sample sequence, |Y_m| represents the total number of training prediction label values in the training prediction label sequence corresponding to the m-th training embedded feature sample sequence, y_m,i represents the training prediction label value of the i-th embedded feature sample in the m-th training prediction label sequence, N represents the total number of training prediction labels in the m-th training prediction label sequence, and i = 1, …, N-1.
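As a hedged illustration only (the exact expression of the statistical parameter is given as an image in the original publication), the ratio of speaker changes among the predicted labels to the total number of adjacent label pairs might be computed along these lines:

```python
def speaker_change_statistic(training_prediction_label_sequences):
    """Estimate the speaker-change statistic from predicted label sequences.

    training_prediction_label_sequences: iterable of label sequences Y_m, one per
    training embedded feature sample sequence. The ratio below (changes over
    adjacent-label pairs) is an assumption consistent with the textual
    description, not the patent's exact formula.
    """
    changes, pairs = 0, 0
    for labels in training_prediction_label_sequences:
        for current, following in zip(labels, labels[1:]):
            pairs += 1
            if current != following:
                changes += 1
    return changes / pairs if pairs else 0.0

# Example: two label sequences with one and two speaker changes respectively.
print(speaker_change_statistic([[1, 1, 2, 2], [1, 2, 2, 1]]))  # 3 / 6 = 0.5
```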
2. The speech separation method of claim 1, wherein the training embedded feature sample sequence is obtained by:
extracting a spectrogram characteristic sample of each speaker from each single speaker original audio sample of each group of original audio samples in a time window sliding mode;
inputting the spectrogram characteristic sample of each speaker into a pre-trained voice segmentation model to obtain a training embedded characteristic sample of each speaker;
and randomly splicing the training embedded feature samples of the multiple speakers to obtain a training embedded feature sample sequence comprising the multiple training embedded feature samples.
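A sketch of the random splicing step under stated assumptions (splicing whole per-speaker blocks in a random order is an illustrative choice; the claim only requires random splicing of the training embedded feature samples):

```python
import random

def build_training_embedded_sequence(per_speaker_embeddings, seed=None):
    """Randomly splice the training embedded feature samples of multiple speakers
    into a single training embedded feature sample sequence.

    per_speaker_embeddings: dict mapping speaker id -> list of embedded feature
    samples from the pre-trained speech segmentation model.
    Returns the spliced sequence and the parallel speaker labels.
    """
    rng = random.Random(seed)
    speakers = list(per_speaker_embeddings)
    rng.shuffle(speakers)  # random splicing order across speakers
    sequence, labels = [], []
    for speaker in speakers:
        samples = per_speaker_embeddings[speaker]
        sequence.extend(samples)
        labels.extend([speaker] * len(samples))
    return sequence, labels
```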
3. The method of claim 2, wherein the training the speech clustering model according to the prior probability using a plurality of training embedded feature sample sequences of the plurality of sets of original audio samples comprises:
labeling the identity label of the speaker for each training embedded characteristic sample sequence in the training embedded characteristic sample sequences to determine a test label sample sequence;
based on each training embedded characteristic sample sequence in the training embedded characteristic sample sequences, acquiring a training prediction label sequence by using a voice clustering model;
and training the voice clustering model according to the prior probability based on the training prediction label sequence and the test label sample sequence.
4. The speech separation method of claim 2 wherein the plurality of training embedded feature sample sequences comprises L training embedded feature sample sequences,
wherein training the speech clustering model using a plurality of training embedded feature sample sequences of the plurality of sets of original audio samples comprises:
determining a loss function based on the mean vector of the first to (i-1)-th training embedded feature sample sequences among the L training embedded feature sample sequences and the mean vector of the i-th to L-th training embedded feature sample sequences among the L training embedded feature sample sequences, wherein i and L are integers and satisfy 1 < i < L;
Training the speech clustering model based on the loss function.
5. The speech separation method of claim 1, wherein the speech segmentation model is trained by:
acquiring an original audio sample and a test embedded feature sample, and extracting a spectrogram feature sample from the original audio sample in a time window sliding mode;
acquiring training prediction embedding characteristics according to a characteristic aggregation formula based on the spectrogram characteristic sample;
training a speech segmentation model based on the test embedded feature samples and the training predicted embedded features,
wherein the feature aggregation formula is represented by an expression (given as an image in the original publication) in which V_j,k is the aggregated feature quantity; X_i,j represents the j-th feature value of the i-th local descriptor of the features obtained by performing the convolution kernel operation on the spectrogram feature sample, j = 1, …, J, where J is the total number of feature values contained in a local descriptor; c_k,j represents the j-th feature value of the k-th cluster center among the valid speech cluster centers of the features obtained by performing the convolution kernel operation on the spectrogram feature sample; K represents the number of valid speech cluster centers, k = 1, …, K; G represents the number of noise point cluster centers; T represents a vector transpose operation; N represents the total number of local descriptors of the features obtained by performing the convolution kernel operation on the spectrogram feature sample, i = 1, …, N; and the remaining symbols (also given as images in the original publication) represent a weight coefficient of the cluster centers and the j-th feature value of a cluster center among the valid speech cluster centers and the noise point cluster centers of the features obtained by performing the convolution kernel operation on the spectrogram feature sample.
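A NetVLAD/GhostVLAD-style aggregation consistent with the symbol definitions above, offered only as a hedged sketch since the exact formula is given as an image in the original publication; the softmax soft-assignment weights and the parameter shapes are assumptions:

```python
import numpy as np

def aggregate_features(X, centers, W, b, K):
    """Feature aggregation sketch over valid speech and noise-point cluster centers.

    X:       (N, J) local descriptors from the convolutional features;
    centers: (K + G, J) cluster centers, the first K being valid speech centers
             and the remaining G being noise-point centers;
    W, b:    soft-assignment parameters of shapes (K + G, J) and (K + G,);
    Returns V of shape (K, J): only the K valid speech centers are aggregated.
    """
    logits = X @ W.T + b                      # (N, K + G) soft-assignment scores
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    V = np.zeros((K, X.shape[1]))
    for k in range(K):                        # noise-point centers are discarded
        residual = X - centers[k]             # residuals X_i,j - c_k,j
        V[k] = (weights[:, k:k + 1] * residual).sum(axis=0)
    return V
```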
6. A speech separation apparatus, characterized in that the speech separation apparatus comprises:
the extraction unit is used for acquiring original audio and extracting a spectrogram characteristic sequence from the original audio in a time window sliding mode;
the segmentation unit is used for inputting the spectrogram feature sequence into a pre-trained voice segmentation model and acquiring an embedded feature sequence through the voice segmentation model;
the clustering unit is used for inputting the embedded characteristic sequence into a pre-trained voice clustering model and obtaining a prediction label sequence corresponding to the embedded characteristic sequence through the voice clustering model;
a separation unit for performing single speaker voice restoration according to the prediction tag sequence to generate separated voice,
wherein the clustering unit trains the speech clustering model by:
acquiring a plurality of groups of original audio samples, wherein each group of original audio samples comprises a plurality of single speaker original audio samples belonging to a plurality of speakers respectively;
acquiring a training embedded feature sample sequence from each group of original audio samples in the multiple groups of original audio samples;
training the voice clustering model by utilizing a plurality of training embedded characteristic sample sequences of the plurality of groups of original audio samples according to prior probability, wherein the prior probability refers to the probability that the next predicted training prediction label determined according to the predicted training prediction label changes, and the prior probability comprises the speaker label distribution sequence probability,
wherein the speaker tag assigned sequence probability is determined by:
determining statistical parameters of the speaker label distribution sequence probability according to the number of times of speaker change in the predicted training prediction labels and the total number of the predicted training prediction labels;
determining the speaker tag distribution sequence probability according to the statistical parameters,
wherein the statistical parameter is represented by an expression (given as an image in the original publication) in which |D| represents the total number of training embedded feature sample sequences in the plurality of training embedded feature sample sequences, m represents the m-th training embedded feature sample sequence among the plurality of training embedded feature sample sequences, m = 1, …, |D|, Y_m = {y_m,1, …, y_m,i, y_m,i+1, …, y_m,N}, Y_m represents the training prediction label sequence corresponding to the m-th training embedded feature sample sequence, |Y_m| represents the total number of training prediction label values in the training prediction label sequence corresponding to the m-th training embedded feature sample sequence, y_m,i represents the training prediction label value of the i-th embedded feature sample in the m-th training prediction label sequence, N represents the total number of training prediction labels in the m-th training prediction label sequence, and i = 1, …, N-1.
7. an electronic device, characterized in that the electronic device comprises:
a processor;
memory storing a computer program which, when executed by a processor, implements a speech separation method according to any one of claims 1 to 5.
8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech separation method according to any of claims 1 to 5.
CN202110237579.3A 2021-03-04 2021-03-04 Voice separation method, voice separation device, electronic device and storage medium Active CN112634875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110237579.3A CN112634875B (en) 2021-03-04 2021-03-04 Voice separation method, voice separation device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110237579.3A CN112634875B (en) 2021-03-04 2021-03-04 Voice separation method, voice separation device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN112634875A CN112634875A (en) 2021-04-09
CN112634875B true CN112634875B (en) 2021-06-08

Family

ID=75295567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110237579.3A Active CN112634875B (en) 2021-03-04 2021-03-04 Voice separation method, voice separation device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN112634875B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113178196B (en) * 2021-04-20 2023-02-07 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113205815B (en) * 2021-04-28 2023-02-28 维沃移动通信有限公司 Voice processing method and electronic equipment
CN113470695B (en) * 2021-06-30 2024-02-09 平安科技(深圳)有限公司 Voice abnormality detection method, device, computer equipment and storage medium
CN115938385A (en) * 2021-08-17 2023-04-07 中移(苏州)软件技术有限公司 Voice separation method and device and storage medium
CN114139063B (en) * 2022-01-30 2022-05-17 北京淇瑀信息科技有限公司 User tag extraction method and device based on embedded vector and electronic equipment
CN117198272B (en) * 2023-11-07 2024-01-30 浙江同花顺智能科技有限公司 Voice processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019199554A1 (en) * 2018-04-11 2019-10-17 Microsoft Technology Licensing, Llc Multi-microphone speech separation
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110619887A (en) * 2019-09-25 2019-12-27 电子科技大学 Multi-speaker voice separation method based on convolutional neural network
CN110718228A (en) * 2019-10-22 2020-01-21 中信银行股份有限公司 Voice separation method and device, electronic equipment and computer readable storage medium
CN110970053A (en) * 2019-12-04 2020-04-07 西北工业大学深圳研究院 Multichannel speaker-independent voice separation method based on deep clustering

Also Published As

Publication number Publication date
CN112634875A (en) 2021-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant