CN116246639A - Self-supervision speaker verification model training method, electronic device and storage medium - Google Patents


Info

Publication number
CN116246639A
Authority
CN
China
Prior art keywords
model
training
speaker
student
teacher
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310085281.4A
Other languages
Chinese (zh)
Inventor
钱彦旻
韩冰
黄文�
陈正阳
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202310085281.4A priority Critical patent/CN116246639A/en
Publication of CN116246639A publication Critical patent/CN116246639A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a self-supervised speaker verification model training method, an electronic device and a storage medium. The training method comprises a first-stage training and a second-stage training, wherein the first-stage training comprises the following steps: randomly extracting a plurality of short segments and a plurality of long segments from the training corpus; inputting the short segments and the long segments into a student model to obtain the output distribution of the student model; inputting the long segments into a teacher model to obtain the output distribution of the teacher model; and encouraging short-to-long correspondence by minimizing the cross-entropy loss between the output distribution of the student model and the output distribution of the teacher model. The teacher model and the student model have the same structure but different parameters, because their update methods differ: the student model is updated by gradient descent, while the teacher model is updated by an exponential moving average of the student model parameters.

Description

Self-supervision speaker verification model training method, electronic device and storage medium
Technical Field
The invention belongs to the technical field of self-supervised speaker verification model training, and particularly relates to a self-supervised speaker verification model training method, an electronic device and a storage medium.
Background
In the related art, contrastive-learning-based methods assume that segments cut from different utterances come from different speakers (negative sample pairs), while segments cut from the same utterance come from the same speaker (positive sample pairs). Contrastive learning then increases the distance between different speakers and decreases the distance within the same speaker.
The inventors found, in the course of implementing the present application, that different utterances may nevertheless come from the same speaker, so some erroneous negative sample pairs may be introduced, resulting in performance degradation.
Disclosure of Invention
The embodiments of the invention provide a self-supervised speaker verification model training method, an electronic device and a storage medium, for solving at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for training a self-supervised speaker verification model, including a first stage training and a second stage training, where the first stage training includes:
randomly extracting a plurality of short segments and a plurality of long segments from the training corpus; inputting the short segments and the long segments into a student model to obtain the output distribution of the student model; inputting the long segments into a teacher model to obtain the output distribution of the teacher model; and encouraging short-to-long correspondence by minimizing the cross-entropy loss between the output distribution of the student model and the output distribution of the teacher model; wherein the teacher model and the student model have the same structure but different parameters, because their update methods differ: the student model is updated by gradient descent, while the teacher model is updated by an exponential moving average of the student model parameters.
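The asymmetric update described above (gradient descent for the student, exponential moving average for the teacher, with a cross-entropy objective between the two output distributions) can be sketched end to end. The following is a toy numpy illustration, not the patented implementation: the "encoders" are single linear maps, and all dimensions and hyperparameters are invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, K = 8, 5         # toy feature dimension / output-distribution size
LR, LAM = 0.1, 0.996  # student learning rate / teacher EMA momentum

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p_teacher, p_student):
    # H(P_t, P_s) = -sum P_t * log P_s
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

# Toy "encoders": one linear map each, identical structure and initialization.
W_student = 0.1 * rng.normal(size=(K, DIM))
W_teacher = W_student.copy()

long_seg = rng.normal(size=DIM)   # stand-in features of a long segment
short_seg = rng.normal(size=DIM)  # stand-in features of a short segment

losses = []
for _ in range(50):
    p_t = softmax(W_teacher @ long_seg)   # teacher sees only the long segment
    p_s = softmax(W_student @ short_seg)  # student sees the short segment
    losses.append(cross_entropy(p_t, p_s))
    # For linear logits, d(CE)/dW = (p_s - p_t) outer x: plain gradient descent.
    W_student -= LR * np.outer(p_s - p_t, short_seg)
    # Teacher: exponential moving average of the student parameters.
    W_teacher = LAM * W_teacher + (1 - LAM) * W_student
```

Because the teacher moves slowly, the student's distribution is pulled toward a nearly stable target and the loss decreases over the iterations.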
In a second aspect, there is provided an electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the self-supervising speaker verification model training method of any one of the embodiments of the present invention.
In a third aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the self-supervised speaker verification model training method of any of the embodiments of the present invention.
In this embodiment, by using a self-supervised training framework without negative pairs, the dependence on a labeled training corpus can be reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for training a self-supervising speaker verification model according to an embodiment of the present invention;
FIG. 2 is a framework diagram of a self-supervised speaker verification model training system according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the distinction between conventional DINO and cluster-aware training DINO provided by an embodiment of the present invention;
FIG. 4 is a performance comparison of CA-DINO and other self-supervised speaker verification methods according to an embodiment of the invention;
FIG. 5 is a performance comparison of cluster-aware training with different cluster numbers according to an embodiment of the present invention;
FIG. 6 is an EER (%) comparison of fine-tuning a pre-trained self-supervised model with varying amounts of labeled data on VoxCeleb1, according to an embodiment of the present invention;
fig. 7 is a comparison of EER (%) and minDCF (p=0.01) of fine-tuning the self-supervised model on CN-Celeb1 according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Please refer to fig. 1, which illustrates a self-supervised speaker verification model training method of the present application, comprising a first-stage training and a second-stage training, wherein the first-stage training comprises the following steps. The teacher model and the student model have the same structure but different parameters, because their update methods differ: the student model is updated by gradient descent, and the teacher model is updated by an exponential moving average of the student model parameters.
As shown in fig. 1, in step 101, a plurality of short segments and a plurality of long segments are randomly extracted from a training corpus;
in step 102, inputting the plurality of short segments and the plurality of long segments into a student model, and obtaining output distribution of the student model;
in step 103, inputting the long segments into a teacher model, and obtaining the output distribution of the teacher model;
encouraging short-to-long correspondence by minimizing cross entropy loss between the output distribution of the student model and the output distribution of the teacher model in step 104;
in this embodiment, by using a self-supervised training framework without negative pairs, the dependence on a labeled training corpus can be reduced.
In some alternative embodiments, the second-stage training is entered when the speaker verification model is capable of extracting a discriminative speaker representation, the second-stage training comprising: clustering the extracted speaker embeddings, wherein the utterances in the same cluster are assumed to belong to the same speaker; and extracting positive sample pairs from the same cluster as new inputs for subsequent speaker verification model training, so as to iterate the speaker verification model in a loop, wherein, during the iteration, progressive clustering is adopted to gradually reduce the number of clusters as the speaker verification model converges, so as to reduce the intra-class distance.
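The second-stage loop can be sketched as follows. This is a hedged toy example: the embeddings are synthetic 2-D points for three made-up "speakers", the k-means routine is a minimal stand-in for a production clustering library, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic embeddings: 3 well-separated "speakers", 20 utterances each.
centers = np.array([[5.0, 0.0], [-5.0, 4.0], [0.0, -6.0]])
emb = np.vstack([c + 0.3 * rng.normal(size=(20, 2)) for c in centers])

def kmeans(x, k, iters=20):
    """Minimal k-means, a stand-in for a production clustering library."""
    cents = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = np.linalg.norm(x[:, None, :] - cents[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                cents[j] = x[labels == j].mean(axis=0)
    return labels

labels = kmeans(emb, k=3)

def sample_positive_pair(labels):
    """Two distinct utterance indices drawn from one cluster (assumed same speaker)."""
    while True:
        members = np.flatnonzero(labels == rng.choice(labels.max() + 1))
        if len(members) >= 2:
            i, j = rng.choice(members, size=2, replace=False)
            return int(i), int(j)

pairs = [sample_positive_pair(labels) for _ in range(10)]
```

Each sampled pair comes from one cluster, so the two crops share a (putative) speaker identity but differ in content, mirroring the cluster-aware positive pairs described above.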
In a further alternative embodiment, the progressive clustering includes linearly declining progressive clustering and logarithmically declining progressive clustering.
In some alternative embodiments, a cosine-based consistency loss is added to the cross-entropy loss to maximize the cosine similarity between embeddings extracted from the same speaker.
In some alternative embodiments, prior to inputting the plurality of short segments and the plurality of long segments into the student model, the method further comprises: applying different types of data enhancement to the short and long segments, by adding noise or reverberation, to obtain stable performance.
In some alternative embodiments, the obtaining the output profile of the student model comprises:
normalizing the speaker embedding output by the student model using a softmax function to obtain the output distribution of the student model, wherein a temperature parameter is introduced to control the sharpness of the output distribution of the student model.
In a further alternative embodiment, the obtaining the output distribution of the teacher model includes: normalizing the speaker embedding output by the teacher model using a softmax function to obtain the output distribution of the teacher model, wherein the average of each batch of output distributions of the student model is used for centering the output distribution of the teacher model.
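Temperature sharpening and batch-mean centering can be illustrated with a small numpy sketch. This is not the patented code: the batch values, the momentum `m`, and the function names are all invented for demonstration.

```python
import numpy as np

def sharpened_softmax(logits, tau):
    """Softmax with temperature tau; a smaller tau gives a sharper distribution."""
    z = (logits - logits.max(axis=-1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def update_center(center, batch_outputs, m=0.9):
    """Running (EMA) estimate of the batch-mean output used for centering."""
    return m * center + (1 - m) * batch_outputs.mean(axis=0)

# Toy batch of K=3-dimensional network outputs.
outputs = np.array([[2.0, 1.0, 0.0],
                    [0.5, 0.5, 2.5]])
center = np.zeros(3)

p_student = sharpened_softmax(outputs, tau=0.1)            # sharpened only
p_teacher = sharpened_softmax(outputs - center, tau=0.04)  # centered + sharpened
center = update_center(center, outputs)
```

Sharpening pushes one output to dominate while centering removes the batch-mean bias; applying both counteracts collapse onto a single trivial output.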
It should be noted that the above method steps are not limited to the order of execution of the steps, and in fact, some steps may be executed simultaneously or in reverse order of the steps, which is not limited by the present application.
The following description is given to better understand the aspects of the present application by describing some of the problems encountered by the inventor in carrying out the present invention and one specific embodiment of the finally-determined aspects.
The inventors have found that the above-mentioned drawbacks are mainly caused by the following reason: different utterances may also come from the same speaker, so some erroneous negative sample pairs may be introduced, resulting in performance degradation.
To solve the above-mentioned drawbacks in the related art, a person skilled in the art would generally use contrastive learning, which is simple and easy to implement.
In the embodiments of the present application, the DINO framework is introduced, which requires no negative sample pairs and thus avoids the influence of erroneous negative samples. On this basis, a cluster-aware strategy is proposed: positive sample pairs are cut from utterances of the same cluster obtained by clustering, which reduces the overlap of segments and further improves performance.
Referring to fig. 2, a block diagram of a self-supervised speaker verification model training system is shown, according to an embodiment of the present application.
The DINO framework of this embodiment of the present application assumes that each utterance contains a single speaker, and samples 4 short segments and 2 long segments from the same utterance. These segments are passed through the student and teacher models, respectively, and a cross-entropy loss is used to pull the outputs of the student and teacher models together. The student model is updated using gradient descent, and the teacher model is updated using EMA.
Because of the limited utterance duration, the sampled segments have a high overlap rate. The cluster-aware strategy gathers utterances of the same class through clustering and then samples segments from the same cluster rather than from a single utterance. Progressive clustering continuously reduces the number of clusters, so that the clusters become increasingly accurate.
Compared with previous contrastive-learning-based methods, the cluster-aware DINO self-supervised learning method provided by the embodiments of the present application achieves a large improvement on the VoxCeleb speaker dataset.
DINO: self-distillation with no labels
EMA: exponential moving average
The following verifies the beneficial effects of the embodiments of the present application over the prior art through specific experiments and experimental data.
Since self-supervised learning does not rely on labeled data, it has recently become a promising approach to the speaker verification task. The DINO-based self-supervised framework trains without negative pairs and performs excellently on speaker verification. However, because utterance length is limited, the many cropped positive segments overlap heavily, which may mislead the model into focusing on irrelevant information. To solve this problem, we propose a cluster-aware (CA) training strategy that lets the model crop positive segments from several utterances in the same cluster instead of from a single utterance. Furthermore, in the clustering stage, we study both fixed-number clustering and progressive clustering strategies. With these strategies, our CA-DINO achieves state-of-the-art results on the Vox-O test set. Finally, we explore the effect of fine-tuning CA-DINO with a small amount of labeled data. On the VoxCeleb1 dataset, using only 10% of the labeled data for fine-tuning outperforms a supervised system trained on all the data.
1. Introduction to the invention
Speaker verification (SV) is the task of verifying a person's identity based on his or her voice characteristics. In recent years, deep-learning-based methods have developed vigorously and achieved excellent performance on the speaker verification task. To achieve better performance and robustness, researchers have designed various model architectures, training strategies and pooling methods for speaker verification. However, these deep-learning-based methods are typically fully supervised, which requires a large amount of well-labeled data. Collecting well-labeled data at scale is difficult and expensive, while unlabeled data is relatively easy to collect in large quantities.
In this case, in order to make full use of unlabeled data and reduce reliance on labeled data, many researchers have turned their attention to self-supervised learning, which obtains supervisory signals from the data itself and designs pretext tasks to help the model learn representations. First, with the help of text-to-speech (TTS) tasks, a generative method was proposed that separates speaker representations based on phoneme information; although it uses no speaker annotations, its performance is not ideal. Subsequently, inspired by the success of frame-level pre-trained models such as the Wav2Vec series and HuBERT, some researchers explored fine-tuning them directly for the speaker verification task, but this brings a huge number of parameters compared with traditional models. Next, by observing the data structure, researchers proposed the assumption that speech segments cropped from the same utterance belong to the same speaker, and speech segments cropped from different utterances belong to different speakers. Based on this assumption, many works employ contrastive learning to obtain a discriminative speaker representation by maximizing the similarity of positive pairs and minimizing that of negative pairs. Then, to solve the false-negative problem caused by this inaccurate assumption, the non-contrastive framework DINO (distillation with no labels) was introduced into speaker verification and brought a large performance improvement. In conventional DINO, the cross entropy between the two distributions of positive segments is minimized, where the positive segments are formed by sampling several segments from one utterance. Since each utterance has a short duration, the segments overlap substantially, which can mislead the model into focusing on irrelevant information (content, channel noise, etc.) and ignoring the speaker information in the audio.
To address this problem, we propose several new strategies for self-supervised learning in the speaker verification task. First, we use the conventional DINO framework as the initial model in the first pre-training stage. Next, we propose a cluster-aware (CA) training strategy for DINO that extracts positive segments from the same cluster generated by a clustering algorithm. This strategy minimizes channel and background effects and increases the diversity of the data. Furthermore, we explore a progressive cluster-aware strategy in the clustering stage, which adapts to network convergence and prevents pseudo-label pollution. With these strategies, our progressive CA-DINO achieves state-of-the-art performance on the VoxCeleb evaluation set. In addition, we perform fine-tuning experiments to verify the proposed CA-DINO with only a small amount of labeled data. Compared with another self-supervised model, SimCLR, and a fully supervised model, it performs better with only 10% of the labeled data.
2. Method
2.1. DINO-based self-supervised speaker verification
In this section we describe DINO; the whole framework is shown in fig. 2. DINO follows an architecture similar to knowledge distillation, comprising not only a student encoder but also a teacher encoder. Both encoders are trained in parallel, and the output of the teacher encoder is used as the target distribution to optimize the student encoder.
Fig. 2 shows the framework of the label-free distillation method (DINO) for self-supervised speaker representation learning. The Chinese-English correspondences in the figure are as follows: short segment; long segment; encoder; projection head; stop gradient; EMA (exponential moving average); centering softmax.
Different views of each utterance are constructed with a multi-crop strategy. More precisely, from a given utterance we randomly extract 4 short segments {x_1^s, x_2^s, x_3^s, x_4^s} and 2 long segments {x_1^l, x_2^l}. These segments should overlap as little as possible. We follow the assumption that segments cut from the same utterance belong to the same speaker. We then perform different types of data enhancement on these segments, adding noise or reverberation, to achieve stable performance. After enhancement, all segments pass through the student model, while only the long segments pass through the teacher model.
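The multi-crop sampling could look like the following sketch. It assumes 16 kHz audio (the sample rate is not stated here), and the uniform-random starts are purely illustrative: they do not enforce the stated minimal-overlap preference.

```python
import random

random.seed(0)
SAMPLE_RATE = 16000  # assumed sample rate, not stated in the text

def sample_segments(n_samples, n_long=2, n_short=4,
                    long_s=3.0, short_s=2.0, sr=SAMPLE_RATE):
    """Random (start, end) crops for long and short segments of one utterance."""
    longs, shorts = [], []
    for dur, count, out in [(long_s, n_long, longs), (short_s, n_short, shorts)]:
        span = int(dur * sr)
        for _ in range(count):
            start = random.randrange(0, n_samples - span + 1)
            out.append((start, start + span))
    return longs, shorts

longs, shorts = sample_segments(6 * SAMPLE_RATE)  # a 6-second utterance
```

The long-segment and short-segment durations (3 s and 2 s) follow the experimental setup described later in this document.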
The teacher model and the student model have the same structure, but their parameters differ because the update methods differ. The student model is updated by gradient descent, and the teacher model is updated by an exponential moving average (EMA) of the student model parameters. The EMA update rule is θ_t ← λθ_t + (1 − λ)θ_s, where λ is adjusted from 0.996 to 1 by a cosine scheduler during training. Speaker embeddings are extracted by the encoder and then fed into a projection head comprising a 3-layer perceptron with hidden dimension 2048, followed by l2 normalization and a weight-normalized fully connected layer with K dimensions.
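The EMA rule and its cosine momentum ramp are simple to write down. A minimal stdlib sketch, with function names of our own choosing:

```python
import math

def momentum_schedule(step, total_steps, base=0.996, final=1.0):
    """Cosine ramp of the EMA momentum lambda from base (step 0) to final (last step)."""
    cos = (1 + math.cos(math.pi * step / total_steps)) / 2  # goes from 1 down to 0
    return final - (final - base) * cos

def ema_update(theta_t, theta_s, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, applied element-wise."""
    return [lam * t + (1 - lam) * s for t, s in zip(theta_t, theta_s)]
```

As λ approaches 1 late in training, the teacher parameters effectively freeze, stabilizing the distillation targets.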
We encourage short-to-long correspondence by minimizing the cross-entropy loss H(·) between the two distributions; formula (1) is as follows:

L_DINO = Σ_{x ∈ {x_1^l, x_2^l}} Σ_{x' ∈ X, x' ≠ x} H(P_t(x), P_s(x'))    (1)

where X denotes the set of all sampled segments and H(a, b) = −a log b.
Here the output distributions of the momentum teacher network f_{θ_t} and the student network f_{θ_s} are denoted P_t and P_s, respectively. P can be computed by normalizing the output using a softmax function:

P_s(x)^(i) = exp(f_{θ_s}(x)^(i) / τ_s) / Σ_{k=1}^{K} exp(f_{θ_s}(x)^(k) / τ_s)    (2)

where τ_s > 0 is a temperature parameter that controls the sharpness of the output distribution. Likewise, an analogous formula with temperature τ_t > 0 holds for P_t. In addition, the average calculated over the batch is used to center the teacher model's output distribution. Both sharpening and centering are applied during training to avoid trivial solutions.
Furthermore, we add a cosine-based consistency loss to ensure that the speaker embeddings are encoded into a cosine space, which is better suited for later scoring and clustering. It maximizes the cosine similarity between embeddings extracted from the same speaker. Finally, the total loss is combined with a coefficient α:

L = L_DINO + α · Σ (1 − cos(e_i, e_j))    (3)

where e denotes a speaker embedding extracted by the encoder and the sum runs over positive embedding pairs (e_i, e_j).
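The combined objective can be sketched in a few lines. This assumes the 1 − cos form of the consistency term (a reconstruction consistent with the description above, not verbatim from the patent), and α = 1.0 as in the experimental setup later:

```python
import numpy as np

def cosine_consistency_loss(e1, e2):
    """1 - cos(e1, e2): zero when two same-speaker embeddings align perfectly."""
    cos = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    return 1.0 - cos

def total_loss(dino_ce_loss, e1, e2, alpha=1.0):
    """Total objective: DINO cross-entropy term plus the alpha-weighted cosine term."""
    return dino_ce_loss + alpha * cosine_consistency_loss(e1, e2)
```

Identical embeddings contribute zero cosine loss, so the total reduces to the cross-entropy term alone.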
2.2. Progressive cluster perception training strategy
For conventional DINO, positive sample pairs are formed from segments sampled within the same utterance. As described above, in the optimization of DINO, the cross entropy between the two distributions of positive segments is minimized to encourage short-to-long correspondence. Although we try to make these segments from the same utterance overlap as little as possible when sampling, in practice they often overlap to a large extent because of the limited duration of the utterance. Under the influence of these overlapping portions, the model may attend more to the content of the overlap, the channel and other irrelevant information, while ignoring the speaker information in the audio. Even though we later apply different types of data enhancement, the data still lacks diversity, which may lead model optimization in the wrong direction.
Fig. 3 shows the distinction between conventional DINO and cluster-aware training DINO. Fig. 3 (a), conventional DINO: the long and short segments are sampled from the same utterance to form the positive segments. Fig. 3 (b), cluster-aware training DINO: through a simple clustering algorithm, we consider utterances in the same cluster to share the same speaker identity, and segments are cut from the corresponding cluster.
To reduce the overlap of speech segments and increase the diversity of data while keeping the original assumption as far as possible, we propose a cluster-aware (CA) DINO training strategy, hereinafter named CA-DINO. Model training is divided into two stages. In the first stage, we optimize the model according to the conventional DINO training pattern. Then, when the model is able to extract a discriminative speaker representation, training proceeds to the next stage: the extracted speaker embeddings are clustered using a clustering algorithm such as k-means. Utterances in the same cluster are assumed to belong to the same speaker, and the clustered utterances can be used to generate crops with less overlap and stronger diversity. As shown in fig. 3, unlike the conventional DINO strategy, a positive sample pair is now sampled from several different utterances in the same cluster rather than from a single utterance. These positive sample pairs come from the same speaker but have different content and channel information, which greatly enhances the diversity of the data and reduces overlap, so that the model can focus more on speaker information than on unrelated information. These positive sample pairs are used as new inputs for subsequent model training. Considering the resource consumption of extracting speaker embeddings, the clustering process takes place only after every several training epochs.
Furthermore, we introduce a progressive clustering (PC) method in the clustering step. In the early stage of representation learning, setting a small number of clusters may cause inconsistent sample types within some clusters, polluting the pseudo labels and hindering the growth of the model's representation capability. As the network converges, we can gradually reduce the number of clusters to reduce the intra-class distance, making the feature space more compact and class-consistent. Specifically, we adopt two strategies to reduce the number of clusters, a linear decrease and a logarithmic decrease, hereinafter referred to as PC-Linear and PC-Log respectively. To describe the PC-Log strategy, denote the initial cluster number by N_i, the final fixed cluster number by N_f, and the cluster number at the t-th epoch by N_t; the formula is as follows:

N_t = N_i · (N_f / N_i)^(t / t_0),  t ≤ t_0    (4)

where t_0 ≤ T and T denotes the total number of training epochs. As shown in formula (4), N_t drops fast in early epochs and slowly in later epochs. When t = 0, N_t equals N_i; when t > t_0, we fix N_t as N_f.
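The PC-Log schedule is a one-liner. Note the geometric-decay form below is a reconstruction consistent with the stated boundary conditions (N_i at t = 0, fixed N_f beyond t_0, fast early decrease), not the patent's exact expression, and the cluster counts in the usage are invented:

```python
def num_clusters(t, t0, n_init, n_final):
    """Log-style schedule: n_init at t=0, geometric decay, fixed at n_final for t >= t0."""
    if t >= t0:
        return n_final
    return round(n_init * (n_final / n_init) ** (t / t0))
```

Because each step multiplies the count by a fixed ratio below 1, the per-epoch decrement is proportional to the current count: large drops early, small drops late.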
3. Experimental setup
3.1. Cluster-aware DINO
3.1.1.DINO
For DINO, considering the available resources, we employ ECAPA-TDNN with 512 channels as the audio encoder to learn a discriminative speaker representation. It is a backbone based on the time-delay neural network (TDNN) that emphasizes channel attention, propagation and aggregation, employing a channel- and context-dependent attention mechanism, multi-layer feature aggregation (MFA), and Squeeze-Excitation (SE) and residual blocks.
The VoxCeleb2 development set is used to train the network without using any speaker labels. The training set comprises 1,091,251 utterances from 5,994 speakers collected from YouTube. For each utterance, two long (3-second) and four short (2-second) segments are randomly cropped and treated as positive samples. The extracted speech segments are augmented with noise from the MUSAN and RIR datasets. They are then encoded by the encoder as 192-dimensional speaker embeddings. K in the DINO projection head is set to 65,536. The temperatures of the teacher model τ_t and the student model τ_s are 0.04 and 0.1, respectively. Furthermore, we set the cosine loss weight α to 1.0 to balance the two losses. The whole training process lasts 150 epochs. Model parameters are updated using the stochastic gradient descent (SGD) algorithm with weight decay 5e-5. The learning rate increases linearly from 0 to 0.2 over the first 20 epochs and then decays to 1e-5 with a cosine scheduler. In addition, the momentum also follows a cosine schedule, from 0.996 to 1.0.
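The learning-rate schedule described above (linear warm-up to 0.2 over 20 epochs, then cosine decay to 1e-5 by epoch 150) can be written as a small sketch; the function name is our own:

```python
import math

def learning_rate(epoch, warmup=20, total=150, peak=0.2, floor=1e-5):
    """Linear warm-up to peak over `warmup` epochs, then cosine decay to floor."""
    if epoch < warmup:
        return peak * epoch / warmup
    frac = (epoch - warmup) / (total - warmup)
    return floor + (peak - floor) * (1 + math.cos(math.pi * frac)) / 2
```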
3.1.2. Cluster awareness training
In the cluster-aware training strategy, we train the model as described for DINO for the first 90 epochs. Thereafter, a k-means-based clustering algorithm, supported by the faiss library, is applied over the entire training set every 5 epochs.
Our model is evaluated on 3 trial lists: Vox-O is the original test set of VoxCeleb1, containing 37,720 trials from 40 speakers. Vox-E (using the entire dataset) contains 581,480 trials from 1,251 speakers. Vox-H is a hard evaluation list, comprising 552,536 trial pairs drawn from 1,190 speakers of VoxCeleb1, all with the same nationality and gender.
3.2. Fine-tuning the pre-trained self-supervised model
Fine-tuning experiments are performed on in-domain VoxCeleb1 and out-of-domain CN-Celeb1 to demonstrate the robustness of our model. The VoxCeleb1 dataset consists of 148,642 utterances from 1,211 speakers, while CN-Celeb1 contains 53,288 utterances from 800 speakers (we concatenate short utterances of the same genre and the same speaker to make them longer, because many utterances are shorter than 2 s).
In the fine-tuning stage, the model is either randomly initialized, pre-trained with the SimCLR framework, or pre-trained with the CA-DINO framework. We use 2-second training segments. The model is optimized using the additive angular margin (AAM) loss, with the margin and scale of AAM set to 0.2 and 32.0, respectively. The fine-tuning process lasts 100 epochs, with the learning rate declining exponentially from 0.01 at the start to 1e-5 at the end.
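The AAM loss penalizes the target class by adding the margin m inside the cosine before scaling. A minimal per-logit sketch under the stated settings (m = 0.2, s = 32.0); the function name is illustrative:

```python
import math

def aam_logit(cos_theta, margin=0.2, scale=32.0, target=True):
    """AAM-softmax logit: s*cos(theta + m) for the target class, s*cos(theta) otherwise."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp for numerical safety
    return scale * math.cos(theta + margin if target else theta)
```

Because cos is decreasing on [0, π], adding the margin lowers the target-class logit, forcing the network to push same-speaker embeddings closer to their class center than plain softmax would.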
4. Results
4.1. Comparison with other self-supervised models
Figure 4 shows a performance comparison of the proposed CA-DINO with other self-supervised speaker verification methods, where SSL refers to self-supervised learning. EER (%) and minDCF (p=0.01) were evaluated on the Vox-O test set.
Figure 5 shows a performance comparison of cluster-aware training with different numbers of clusters, where #Cluster denotes the number of clusters. EER (%) was evaluated on Vox-O and Vox-E. Progressive clustering outperformed the systems with fixed numbers of clusters, indicating that progressive clustering can improve performance to some extent.
Figure 4 reports the speaker verification performance of our proposed method and previous self-supervised models. All methods were trained on Voxceleb2 without any speaker labels and evaluated on the Vox-O test set. From the results we can see that DINO, which requires no negative pairs, outperforms all previous conventional and contrastive methods, demonstrating that negative pairs are indeed a bottleneck for performance improvement. Furthermore, we also provide a DINO ablation study at the bottom of fig. 4: DINO without the exponential moving average (EMA) achieves much worse results, which reveals that EMA is essential to prevent model collapse. After applying the progressive cluster-aware (CA) strategy when training DINO, performance improves further. On the Vox-O test set, it surpasses the previous best system by a relative 23.74%, a significant performance leap.
We also provide experiments exploring the impact of different numbers of clusters and different decay strategies on performance; the results are reported in fig. 5. Compared to the baseline system (1080 k), we find that the proposed cluster-aware strategy is not sensitive to the number of clusters, as it brings a significant and stable improvement for all fixed cluster numbers. Meanwhile, the minDCF of CA-DINO with progressive clustering (PC) is given with p=0.05.
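The fixed-number and progressive settings compared above amount to a schedule for the cluster count across clustering rounds. A minimal sketch, with the linear and exponential descent variants mentioned later in the claims; the start/end values (8000 → 2000) and function name are illustrative assumptions, not taken from the text.

```python
def cluster_schedule(round_idx, n_rounds, start=8000, end=2000, mode="linear"):
    """Number of clusters used at each clustering round under progressive
    clustering; a fixed-number system would just return `start` every round."""
    frac = round_idx / max(1, n_rounds - 1)
    if mode == "linear":
        # Linear descent: interpolate the cluster count arithmetically.
        return int(round(start + (end - start) * frac))
    # Exponential descent: interpolate geometrically between start and end.
    return int(round(start * (end / start) ** frac))
```

Shrinking the cluster count as the model converges gradually merges clusters, so positive pairs are drawn from progressively broader pseudo-speaker groups.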
4.2. Fine-tuning CA-DINO with small amounts of labeled data
To better illustrate the superior performance of CA-DINO, we performed self-supervised learning experiments under the pre-training/fine-tuning framework. We fine-tune the self-supervised model with different amounts of labeled data on the downstream speaker verification task. We subsample the Voxceleb1 dataset at ratios of 10%/20%/50%/100%, either by "number of speakers" or by "number of utterances per speaker".
Fig. 6 shows EER (%) comparisons when fine-tuning pre-trained self-supervised models with different amounts of labeled data from Voxceleb1. The results were evaluated on Vox-O, the test set of Voxceleb1, where "Initial" denotes the parameter initialization.
As shown in fig. 6, the frame-level pre-trained model Wav2Vec achieves only unsatisfactory results, which is reasonable because it was designed for speech recognition rather than speaker tasks. We then observe that the self-supervised models SimCLR and our proposed CA-DINO improve significantly over the model trained from scratch, which shows that a well-initialized pre-trained model is important under low-resource conditions. Furthermore, the proposed CA-DINO performs significantly better than SimCLR and still performs well with only a small amount of labeled data; reducing the labeled data does not cause significant performance degradation. More encouragingly, fine-tuning the pre-trained CA-DINO with only 10% of the labeled corpus achieves even better results than a fully supervised system trained on all labeled data, i.e., 2.510% versus 2.755%.
Furthermore, fig. 6 shows that, at the same sampling ratio, fine-tuning with a few utterances from each speaker performs better than using all utterances from a few speakers. This finding also suggests a new idea for collecting data under limited resources: collect data from as many different speakers as possible, rather than as many utterances as possible from each speaker, which can save a large amount of manual annotation.
Fig. 7 shows EER (%) and minDCF (p=0.01) comparisons of fine-tuning the self-supervised models on CN-Celeb1. The results were evaluated on the evaluation set of CN-Celeb1.
Finally, fig. 7 provides the results of fine-tuning CA-DINO on the out-of-domain CN-Celeb1. From these results, we find that even when fine-tuned in a different domain, the proposed CA-DINO achieves better performance, which demonstrates its robustness and versatility.
5. Conclusion
In the present embodiment, we propose a cluster-aware (CA) training strategy to alleviate the overlap problem when training the conventional DINO. With this strategy, the model can use positive sample pairs sampled from several different utterances within the same cluster, rather than from a single utterance, which increases data diversity and achieves state-of-the-art performance. Furthermore, in the clustering stage, we studied both fixed-number and progressive clustering strategies. Finally, we discussed the effect of fine-tuning different self-supervised speaker verification models with a small amount of labeled data: with only 10% of the annotation data, the proposed CA-DINO exceeds the fully supervised system trained on all the annotation data.
In other embodiments, the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can execute the self-supervised speaker verification model training method in any of the above method embodiments, for a self-supervised speaker verification model training system, where the training method includes a first-stage training and a second-stage training.
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
randomly extracting a plurality of short segments and a plurality of long segments from the training corpus;
inputting the short segments and the long segments into a student model to obtain the output distribution of the student model;
inputting the long segments into a teacher model to obtain the output distribution of the teacher model;
encouraging short-to-long correspondence by minimizing the cross-entropy loss between the output distribution of the student model and the output distribution of the teacher model;
wherein the teacher model and the student model have the same structure but different update methods and different parameters: the student model is updated by gradient descent, while the teacher model is updated by an exponential moving average of the student model parameters.
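The teacher-student steps above can be sketched numerically: the teacher's output is centered and sharpened, the student is trained to match it via cross-entropy, and the teacher's weights track an exponential moving average of the student's. The temperatures below follow the values quoted in the experiments (0.04 teacher, 0.1 student); all function names are our own illustration, not the patent's code.

```python
import numpy as np

def log_softmax(x, temp):
    """Temperature-scaled log-softmax; a lower temp gives a sharper distribution."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def teacher_student_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution (no gradient flows through the teacher)."""
    p_t = np.exp(log_softmax(teacher_out - center, t_t))  # teacher: center, sharpen
    return -(p_t * log_softmax(student_out, t_s)).sum(axis=-1).mean()

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights track an exponential moving average of the student's."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

Matching student and teacher outputs yields a near-zero loss, while a mismatch is heavily penalized, which drives the short segments toward the long-segment (teacher) distribution.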
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created from use of the self-supervising speaker verification model training system, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes a memory remotely located with respect to the processor, the remote memory being connectable to the self-supervising speaker verification model training system via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above-described self-supervised speaker verification model training methods.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 8, the device includes one or more processors 810 and a memory 820; one processor 810 is taken as an example in fig. 8. The device for the self-supervised speaker verification model training method and system may further comprise an input device 830 and an output device 840. The processor 810, memory 820, input device 830, and output device 840 may be connected by a bus or other means; fig. 8 takes the bus connection as an example. The memory 820 is the non-volatile computer-readable storage medium described above. The processor 810 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 820, i.e., implements the self-supervised speaker verification model training method and system of the method embodiments described above. The input device 830 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the device. The output device 840 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present invention.
As an implementation manner, the electronic device is applied to a self-supervision speaker verification model training system, the training method includes a first stage training and a second stage training, and the system includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to:
randomly extracting a plurality of short segments and a plurality of long segments from the training corpus;
inputting the short segments and the long segments into a student model to obtain the output distribution of the student model;
inputting the long segments into a teacher model to obtain the output distribution of the teacher model;
encouraging short-to-long correspondence by minimizing the cross-entropy loss between the output distribution of the student model and the output distribution of the teacher model;
wherein the teacher model and the student model have the same structure but different update methods and different parameters: the student model is updated by gradient descent, while the teacher model is updated by an exponential moving average of the student model parameters.
The electronic device of the embodiments of the present application exist in a variety of forms including, but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice, data communications. Such terminals include: smart phones, multimedia phones, functional phones, low-end phones, etc.
(2) Ultra mobile personal computer device: such devices are in the category of personal computers, having computing and processing functions, and generally also having mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc.
(3) Portable entertainment device: such devices may display and play multimedia content. The device comprises: audio, video players, palm game players, electronic books, and smart toys and portable car navigation devices.
(4) A server: the configuration of a server includes a processor, a hard disk, memory, a system bus, and the like; it is similar to a general computer architecture, but because highly reliable services must be provided, it has high requirements for processing capacity, stability, reliability, security, scalability, and manageability.
(5) Other electronic devices with data interaction function.
The apparatus embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, and they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus the necessary general hardware platform, or of course by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially, or in the part contributing to the prior art, in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A self-supervising speaker verification model training method, comprising a first stage of training and a second stage of training, wherein the first stage of training comprises:
randomly extracting a plurality of short segments and a plurality of long segments from the training corpus;
inputting the short segments and the long segments into a student model to obtain the output distribution of the student model;
inputting the long segments into a teacher model to obtain the output distribution of the teacher model;
encouraging short-to-long correspondence by minimizing the cross-entropy loss between the output distribution of the student model and the output distribution of the teacher model;
wherein the teacher model and the student model have the same structure but different update methods and different parameters: the student model is updated by gradient descent, while the teacher model is updated by an exponential moving average of the student model parameters.
2. The method of claim 1, wherein the second stage of training is entered when the speaker verification model is capable of extracting a reliable speaker representation, the second stage of training comprising:
clustering the extracted speaker embeddings, wherein the corpora in the same cluster belong to the same speaker;
extracting positive sample pairs from the same cluster and using them as new inputs for subsequent training, so as to iterate the speaker verification model in a loop, wherein during the iteration, progressive clustering is adopted to gradually reduce the number of clusters as the speaker verification model converges, so as to reduce the inter-class distance.
3. The method of claim 2, wherein the progressive clustering comprises linear descent progressive clustering and exponential descent progressive clustering.
4. The method of claim 1, wherein a cosine-based consistency loss is added to the cross-entropy loss to maximize the cosine similarity between embeddings extracted from the same speaker.
5. The method of claim 1, wherein prior to said inputting the plurality of short segments and the plurality of long segments into the student model, the method further comprises:
the short and long segments are subjected to different types of data enhancement by adding noise or reverberation to obtain stable performance.
6. The method of claim 1, wherein the obtaining the output profile of the student model comprises:
normalizing the speaker embeddings output by the student model using a softmax function to obtain the output distribution of the student model, wherein a temperature parameter is introduced to control the sharpness of the output distribution of the student model.
7. The method of claim 6, wherein the obtaining the output profile of the teacher model comprises:
normalizing the speaker embeddings output by the teacher model using a softmax function to obtain the output distribution of the teacher model, wherein the mean of each batch of output distributions of the student model is used for centering the output distribution of the teacher model.
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 7.
9. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1 to 7.
CN202310085281.4A 2023-02-06 2023-02-06 Self-supervision speaker verification model training method, electronic device and storage medium Pending CN116246639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310085281.4A CN116246639A (en) 2023-02-06 2023-02-06 Self-supervision speaker verification model training method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310085281.4A CN116246639A (en) 2023-02-06 2023-02-06 Self-supervision speaker verification model training method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN116246639A true CN116246639A (en) 2023-06-09

Family

ID=86629038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310085281.4A Pending CN116246639A (en) 2023-02-06 2023-02-06 Self-supervision speaker verification model training method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN116246639A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117524252A (en) * 2023-11-13 2024-02-06 North China University of Technology Light-weight acoustic scene perception method based on drunken model
CN117524252B (en) * 2023-11-13 2024-04-05 North China University of Technology Light-weight acoustic scene perception method based on drunken model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination