CN113823274B - Voice keyword sample screening method based on detection error weighted editing distance - Google Patents

Voice keyword sample screening method based on detection error weighted editing distance

Info

Publication number
CN113823274B
CN113823274B (Application CN202110938700.5A)
Authority
CN
China
Prior art keywords
samples
training
sample
voice keyword
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110938700.5A
Other languages
Chinese (zh)
Other versions
CN113823274A (en)
Inventor
贺前华
严海康
兰小添
郑若伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110938700.5A priority Critical patent/CN113823274B/en
Publication of CN113823274A publication Critical patent/CN113823274A/en
Application granted granted Critical
Publication of CN113823274B publication Critical patent/CN113823274B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/26: Speech to text systems
    • G10L 2015/0631: Creating reference templates; Clustering
    • G10L 2015/0635: Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice keyword sample screening method based on a detection-error-weighted edit distance. It uses the output information produced during training of a voice keyword recognition model, weighting the detection errors of sample keywords to revise the edit distance between the decoded sequence and the label sequence, so that important samples receive greater attention and unqualified voice keyword samples can be screened out. The invention greatly reduces the workload of manually auditing all samples and improves screening efficiency. It provides an effective scheme for cleaning a corpus and constructing a high-quality voice data set, reduces the difficulty of building low-resource minority-language corpora, supplies higher-quality voice keyword samples to deep neural networks, and promotes research and development of speech technology for low-resource languages.

Description

Voice keyword sample screening method based on detection error weighted editing distance
Technical Field
The invention relates to the technical field of data processing, and in particular to a voice keyword sample screening method based on a detection-error-weighted edit distance.
Background
In recent years, with the rapid development of deep learning, the performance of tasks such as speech recognition and voice keyword detection has improved greatly. Neural-network-based methods such as recurrent neural networks (Recurrent Neural Network, RNN), convolutional neural networks (Convolutional Neural Network, CNN) and Transformers are currently the mainstream of research. However, deep neural networks place high demands on both the size and the quality of the data set: they show remarkable performance only when the amount of training data is sufficient and the sample quality is high. Large-scale speech data sets are generally recorded by organized recording staff or collected from the Internet, and owing to various objective factors a corpus often contains inaccurate text labels, i.e. the semantic content of a sample's speech and its text differ to some extent, yet it is not known which samples are labeled inaccurately. When the corpus is large, it is difficult to perform manual auditing and label correction on all samples, which incurs huge time and labor costs while remaining inefficient. Such training data may cause the model to learn a wrong mapping, thereby degrading its performance. It is therefore necessary to clean the corpus and screen out voice keyword samples with unqualified labels, so as to construct a high-quality voice data set applicable to different tasks.
Voice keyword detection is an effective intelligent speech processing technique. Current neural-network-based voice keyword detection methods require a large number of voice keyword samples for iterative training. During training the model learns the commonalities among samples and forms different mapping relationships. The model outputs much useful information during training, and one heuristic is to use this intermediate information from the keyword model training process to help screen out unqualified labeled samples. The main basis of this heuristic is that most samples are reliable; under this assumption the trained model is also basically reliable, i.e. mislabeled samples are a minority and qualified samples a majority, so the correct mapping relationships learned by the model dominate. Across different iterations, the model's predictions tend to be consistent for a qualified sample, while for an unqualified sample they may fluctuate significantly. Abnormal samples can be eliminated by fusing the model's output information over the whole training period. How to use the model's output information during training thus becomes the key to screening.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a voice keyword sample screening method based on a detection-error-weighted edit distance, which uses the output information produced during training of a voice keyword recognition model and weights the detection errors of sample keywords to revise the edit distance between the decoded sequence and the label sequence, so that important samples receive greater attention and unqualified voice keyword samples are screened out.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows:
a voice keyword sample screening method based on detection error weighted editing distance comprises the following steps:
S1, using an original voice data set as a training sample set Z = {(X_n, Y_n), n = 1, ..., N}, where X_n is a voice keyword sample, Y_n is the corresponding recording text, and N is the total number of training samples; then transcribing each recording text Y_n in the training sample set into a sequence of toned syllables. The toned syllables of all keywords are represented by the numbers 0 to K-2 respectively, and every other toned syllable is represented by the number K-1, where K-1 is the total number of keyword toned syllables; a label sequence is thus constructed for each voice keyword sample, yielding the training sample set Z_t;
S2, iteratively training a voice keyword recognition model on the obtained training sample set Z_t until the model converges; after each iteration, recognizing all voice keyword samples and recording the decoded sequence of each voice keyword sample X_n;
S3, respectively decoding each sequenceCorresponding tag sequence->Comparing to calculate the edit distance, the number of missed keyword detection and the number of false alarms, and checking the missed keyword detectionThe number and the false alarm number are weighted respectively to revise the editing distance, and the revised editing distance is obtained;
S4, fusing all revised edit distances obtained for each voice keyword sample during training of the voice keyword recognition model to obtain an error value for each training sample;
and S5, screening the voice keyword samples according to the error value of each training sample until the proportion of the qualified samples meets the requirement.
Further, in step S3, the numbers of missed detections and false alarms are weighted respectively to revise the edit distance, using the following revision formula:
D = D_e + n_FR · D_FR + n_FA · D_FA   (1)
In formula (1), D_e is the edit distance between the decoded sequence and the corresponding label sequence, n_FR is the number of missed keyword detections, D_FR is the keyword miss cost, n_FA is the number of keyword false alarms, and D_FA is the keyword false-alarm cost; D_FR and D_FA are empirical constants satisfying D_FR ≥ 0 and D_FA ≥ 0.
Further, when step S4 fuses all revised edit distances obtained during model training for each voice keyword sample, specifically the revised edit distances from the second iteration to the m-th iteration are fused using the following fusion formula:
ε = (1 / (m - 1)) · Σ_{i=2}^{m} D_i   (2)

In formula (2), ε represents the error value of the training sample, m represents the number of iterations of keyword recognition model training on the training sample set Z_t, and D_i represents the revised edit distance calculated at the i-th iteration.
Further, the edit distance is the number of insertion, deletion and substitution operations required to convert the decoded sequence into the label sequence.
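The insertion/deletion/substitution count described here is the standard Levenshtein distance over syllable-number sequences. A minimal sketch (illustrative only, not part of the patent text) could look like:

```python
def edit_distance(decoded, label):
    """Minimum number of insertion, deletion and substitution operations
    needed to convert `decoded` into `label` (classic Levenshtein DP)."""
    m, n = len(decoded), len(label)
    prev = list(range(n + 1))              # distances against empty prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if decoded[i - 1] == label[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[n]
```

Only two rows of the dynamic-programming table are kept at a time, which is enough because each cell depends only on the previous row and the current row's left neighbor.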
Further, the step S5 of screening the voice keyword samples according to the error value of each training sample specifically includes:
Firstly, all training samples are sorted by error value from large to small and a threshold is set; if a training sample's error value is smaller than or equal to the threshold, it is put into the qualified sample set, and if its error value is greater than the threshold, it is put into the candidate sample set. Then the training samples in the candidate sample set are manually audited, and those found qualified are moved into the qualified sample set. Finally, the resulting qualified sample set is taken as the new training sample set Z_t and steps S2-S5 are repeated until the proportion of qualified samples meets the requirement.
Further, the set threshold is selected by the following process:
1) Dividing the sorted error-value sequence into different continuous intervals according to the numerical range of the training samples' error values;
2) Starting from the interval with the largest values, randomly extracting k training samples from each interval in turn for manual auditing; if a training sample's recording text is consistent with the semantic content of its speech, it is regarded as a qualified sample, otherwise as an unqualified sample, and the audited training samples are recorded;
3) Counting the proportion of unqualified samples among the k training samples extracted from the current interval; if this proportion is smaller than a coefficient α, the manual auditing is stopped and the maximum error value of that interval is taken as the set threshold, where k and α are empirical parameters.
Further, the screening is repeated until the proportion of unqualified samples in every continuous interval is smaller than the coefficient α, i.e. until the proportion of qualified samples meets the requirement.
In this technical scheme, voice keyword samples refer to all samples used to train a voice keyword detector, including positive samples that contain keywords and negative samples that do not. Both positive and negative samples are indispensable in training and in evaluating a voice keyword detector, and negative samples generally outnumber positive samples. The literature commonly refers to both as voice keyword samples.
Compared with the prior art, the principle of the technical scheme is as follows:
The technical scheme uses the output information produced during training of the voice keyword recognition model and weights the detection errors of sample keywords to revise the edit distance between the decoded sequence and the label sequence, so that important samples receive greater attention and unqualified voice keyword samples are screened out.
Compared with the prior art, the technical scheme has the following advantages:
The technical scheme greatly reduces the workload of manually auditing all samples and improves screening efficiency. It provides an effective scheme for cleaning a corpus and constructing a high-quality voice data set, reduces the difficulty of building low-resource minority-language corpora, supplies higher-quality voice keyword samples to deep neural networks, and promotes research and development of speech technology for low-resource languages.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings required in the embodiments or in the description of the prior art will be briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of the voice keyword sample screening method based on the detection-error-weighted edit distance in an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
As shown in Fig. 1, the voice keyword sample screening method based on the detection-error-weighted edit distance of this embodiment comprises the following steps:
S1, using an original voice data set as a training sample set Z = {(X_n, Y_n), n = 1, ..., N}, where X_n is a voice keyword sample, Y_n is the corresponding recording text, and N is the total number of training samples; then transcribing each recording text Y_n in the training sample set into a sequence of toned syllables. The toned syllables of all keywords are represented by the numbers 0 to K-2 respectively, and every other toned syllable is represented by the number K-1, where K-1 is the total number of keyword toned syllables; the label sequence of each voice keyword sample is thus constructed, yielding the training sample set Z_t.
In this embodiment, 392.67 hours of recorded Hakka speech data from Ganzhou, Jiangxi, which have not undergone manual review, are used as the training sample set, containing 761 different speakers in total; in addition, 15.51 hours of manually audited Hakka speech data are used as the verification sample set, which serves as the criterion for judging model convergence and has no intersection with the training sample set. The number of keywords to be identified is predefined as 50. First, all keywords are represented by toned syllables; after the keywords' toned syllables are sorted, they are represented by the numbers 0 to K-2 respectively, giving a mapping table. For each voice keyword sample, its text sequence is converted into a sequence of toned syllables, which is traversed according to the mapping table: if the current toned syllable is in the mapping table it is represented by the corresponding number, otherwise by the number K-1, yielding the label sequence of each voice keyword sample.
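The mapping step above can be sketched in a few lines. The table contents below are hypothetical placeholders (the patent does not publish its 50-keyword list), used only to show the keyword-id / shared non-keyword-id scheme:

```python
def build_label_sequence(toned_syllables, keyword_table):
    """Map each toned syllable to its keyword number (0..K-2) via the
    mapping table, or to the shared non-keyword number K-1 otherwise."""
    k_minus_1 = len(keyword_table)   # total number of keyword toned syllables
    return [keyword_table.get(s, k_minus_1) for s in toned_syllables]

# Hypothetical mapping table with three keyword toned syllables (ids 0..2),
# so every other toned syllable maps to K-1 = 3.
table = {"ni3": 0, "hao3": 1, "ma5": 2}
label_seq = build_label_sequence(["ni3", "hao3", "shi4", "jie4"], table)  # [0, 1, 3, 3]
```

Collapsing all non-keyword syllables to one shared id keeps the output alphabet small (K symbols) while still letting the decoder localize keyword syllables exactly.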
S2, iteratively training a keyword recognition model on the obtained training sample set Z_t until the model converges; after each iteration, recognizing all voice keyword samples and recording the decoded sequence of each voice keyword sample X_n.
In this embodiment, 80-dimensional logarithmic Mel-spectrogram features are first extracted from each voice keyword sample in the training and verification sample sets, with a frame length of 25 ms and a frame shift of 10 ms, to obtain the sample's speech feature representation. The keyword recognition model is a convolutional recurrent neural network, specifically comprising 3 convolutional layers, 2 bidirectional gated recurrent unit layers and 2 fully connected layers; the convolution kernel size is 3 × 3, the loss function is connectionist temporal classification (CTC), the optimizer is Adam, and the initial learning rate is 0.001. During model training, all training samples participate once in each iteration, and the criterion for model convergence is that the keyword recognition performance on the verification sample set no longer improves. After each round of iterative training, all training samples are recognized once with the current model; decoding uses a beam search algorithm with a beam width of 20, and the decoded sequence of each voice keyword sample is obtained and recorded.
S3, respectively decoding each sequenceAnd tag sequence->Comparing to calculate the editing distance, the number of missed detection of the keywords and the number of false alarms, and respectively weighting the number of missed detection and the number of false alarms to revise the editing distance to obtain revised editing distance;
In this embodiment, the keyword recognition performance on the verification sample set is optimal at round 16. Each training sample therefore has 16 decoded sequences, corresponding to the outputs of the models of rounds 1 to 16; each decoded sequence is compared with the label sequence to calculate an edit distance D_e. In the keyword detection task, two types of detection errors deserve particular attention: keyword misses and keyword false alarms. Although the edit distance measures the deviation of the decoded sequence from the label sequence, it assigns the same cost to keyword misses and false alarms as to errors on non-keyword syllables, so the edit distance must be revised so that the sample quality index highlights samples with keyword detection errors. Therefore, based on the keyword sequence, the numbers of false alarms and missed detections of each sample are counted and weighted respectively to revise the edit distance D_e, using the following formula:
D = D_e + n_FR · D_FR + n_FA · D_FA   (1)
In formula (1), D_e is the edit distance between the decoded sequence and the corresponding label sequence, n_FR is the number of missed keyword detections of the voice keyword sample, D_FR is the keyword miss cost, n_FA is the number of keyword false alarms of the voice keyword sample, and D_FA is the keyword false-alarm cost; D_FR and D_FA are empirical constants satisfying D_FR ≥ 0 and D_FA ≥ 0. In this embodiment, D_FR = 3 and D_FA = 3.
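A self-contained sketch of formula (1) follows. The patent does not spell out how n_FR and n_FA are derived from the two sequences; here they are counted by comparing per-keyword occurrence counts in the label and decoded sequences, which is one plausible reading of step S3, not necessarily the authors' exact procedure:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: the D_e term of formula (1)."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1]

def count_occurrences(seq, keyword):
    """Non-overlapping occurrences of a keyword's syllable-id tuple in seq."""
    i = count = 0
    while i + len(keyword) <= len(seq):
        if tuple(seq[i:i + len(keyword)]) == tuple(keyword):
            count += 1
            i += len(keyword)
        else:
            i += 1
    return count

def revised_edit_distance(decoded, label, keywords, d_fr=3.0, d_fa=3.0):
    """D = D_e + n_FR * D_FR + n_FA * D_FA, with D_FR = D_FA = 3 as in
    this embodiment.  Misses and false alarms are counted by comparing
    per-keyword occurrence counts (an assumed reading of step S3)."""
    n_fr = n_fa = 0
    for kw in keywords:
        diff = count_occurrences(label, kw) - count_occurrences(decoded, kw)
        if diff > 0:
            n_fr += diff      # keyword in label but missing from the decode
        else:
            n_fa += -diff     # keyword decoded but absent from the label
    return edit_distance(decoded, label) + n_fr * d_fr + n_fa * d_fa
```

With d_fr = d_fa = 3, a single missed or falsely detected keyword adds three times the penalty of an ordinary syllable error, which is exactly the emphasis the revision is meant to achieve.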
S4, fusing all revised editing distances obtained in the keyword recognition model training process of each voice keyword sample to obtain an error value of each training sample;
Specifically, the revised edit distances from the second iteration to the m-th iteration are fused using the following fusion formula:
ε = (1 / (m - 1)) · Σ_{i=2}^{m} D_i   (2)

In formula (2), ε represents the error value of the training sample, m represents the number of iterations of keyword recognition model training on the training sample set Z_t, and D_i represents the revised edit distance calculated at the i-th iteration. The keyword recognition model starts training from random initialization, so the round-1 model output may contain large errors; hence only the revised edit distances from the second to the m-th iteration are fused. In this embodiment, m is 16, and an error value ε is calculated for each training sample, with values in the range [0, +∞).
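The fusion step is then a one-liner per sample. Since the published formula image is unavailable, a simple mean over rounds 2..m is assumed here (consistent with the stated [0, +∞) range and the decision to drop round 1):

```python
def error_value(revised_distances):
    """Fuse one sample's revised edit distances over training.
    `revised_distances[0]` is the round-1 distance and is dropped, since
    the randomly initialized model's output is unreliable; the remaining
    rounds 2..m are averaged (an assumption, as the original formula
    image is not available)."""
    later_rounds = revised_distances[1:]
    return sum(later_rounds) / len(later_rounds)
```

A sample whose decodes stay close to its label from round 2 onward gets a small error value, while a sample the model never fits well keeps a large one, which is the consistency signal the screening relies on.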
S5, screening the voice keyword samples according to the error value of each training sample to obtain qualified voice keyword samples, wherein the specific process is as follows:
The error values of the training samples are sorted from large to small and a threshold is set, dividing the training sample set into two subsets: if a sample's error value is greater than the threshold it is put into the candidate sample set, otherwise into the qualified sample set. The threshold is selected as follows:
Dividing the numerical range of the samples' error values into different continuous intervals;
Starting from the interval with the largest values, randomly extracting k training samples from each interval in turn for manual auditing; if a training sample's recording text is consistent with the semantic content of its speech, it is regarded as a qualified sample, otherwise as an unqualified sample, and the audited training samples are recorded;
Counting the proportion of unqualified samples among the k training samples extracted from the current interval; if this proportion is smaller than a coefficient α, the manual auditing is stopped and the maximum error value of that interval is taken as the set threshold, where k and α are empirical parameters.
In this embodiment, the divided intervals are [0, 2), [2, 4), [4, 6), [6, 8), [8, 10), [10, 12), [12, 14), [14, 16) and [16, +∞), and the empirical parameters are k = 100 and α = 0.1.
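The interval-auditing threshold search can be sketched as below. The `audit` callable stands in for the manual review of one sample (an assumption: any function returning qualified/unqualified works), and the error values in the usage example are hypothetical:

```python
import random

def select_threshold(errors, audit, k=100, alpha=0.1, width=2.0, top=16.0):
    """Threshold selection of step S5: bucket samples into the intervals
    [0, w), [w, 2w), ..., [top, +inf); audit up to k random samples per
    interval, walking from the largest values downward; stop at the first
    interval whose unqualified ratio drops below alpha, and return that
    interval's maximum error value as the threshold."""
    buckets = {}
    for idx, e in enumerate(errors):
        lo = top if e >= top else width * int(e // width)
        buckets.setdefault(lo, []).append(idx)
    for lo in sorted(buckets, reverse=True):      # largest interval first
        members = buckets[lo]
        sampled = random.sample(members, min(k, len(members)))
        fail_ratio = sum(not audit(i) for i in sampled) / len(sampled)
        if fail_ratio < alpha:
            return max(errors[i] for i in members)
    return 0.0  # every interval exceeded alpha: no samples pass yet

# Hypothetical error values; the stand-in audit accepts errors below 10.
errors = [1.0, 1.5, 3.0, 17.0]
threshold = select_threshold(errors, lambda i: errors[i] < 10.0)
```

Walking from the worst interval downward means manual effort is spent only until the first "mostly clean" interval is found, which is the efficiency gain claimed over auditing everything.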
The samples in the candidate sample set are the screened-out suspect samples; to avoid mistakenly discarding correct samples and to allow their reuse, they can be manually audited, and samples that pass the audit, or whose labels are revised so that they qualify, are returned to the qualified sample set.
The qualified sample set is then taken as the new training sample set Z_t and steps S2-S5 are repeated until the proportion of unqualified samples in every continuous interval is smaller than the coefficient α; the finally obtained qualified sample set is the cleaned corpus.
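The overall repeat-until-clean loop can be summarized as the driver below. All three callables are placeholders: `train_and_score` stands in for S2-S4 (train, decode, fuse) and returns one error value per sample, `pick_threshold` for the interval-auditing threshold choice, and `audit` for the manual review of a candidate; the stopping rule (stop once auditing rejects nothing) is an illustrative simplification of the proportion criterion:

```python
def screen_corpus(samples, train_and_score, audit, pick_threshold):
    """Illustrative driver for the iterative screening loop of S2-S5."""
    current = list(samples)
    while True:
        errors = train_and_score(current)               # S2-S4 placeholder
        thr = pick_threshold(errors)                    # S5 threshold choice
        qualified = [s for s, e in zip(current, errors) if e <= thr]
        candidates = [s for s, e in zip(current, errors) if e > thr]
        recovered = [s for s in candidates if audit(s)]  # audited as qualified
        current = qualified + recovered                  # new training set Z_t
        if len(recovered) == len(candidates):            # nothing rejected
            return current
```

Samples rejected by the audit are dropped before retraining, so each pass trains on a cleaner set and the candidate set shrinks until the loop terminates.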
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its scope of protection; any variation made according to the shape and principle of the present invention shall be covered by its scope of protection.

Claims (7)

1. The voice keyword sample screening method based on the detection error weighted editing distance is characterized by comprising the following steps of:
S1, using an original voice data set as a training sample set Z = {(X_n, Y_n), n = 1, ..., N}, where X_n is a voice keyword sample, Y_n is the corresponding recording text, and N is the total number of training samples; then transcribing each recording text Y_n in the training sample set into a sequence of toned syllables. The toned syllables of all keywords are represented by the numbers 0 to K-2 respectively, and every other toned syllable is represented by the number K-1, where K-1 is the total number of keyword toned syllables; a label sequence is thus constructed for each voice keyword sample, yielding the training sample set Z_t;
S2, iteratively training a voice keyword recognition model on the obtained training sample set Z_t until the model converges; after each iteration, recognizing all voice keyword samples and recording the decoded sequence of each voice keyword sample X_n;
S3, respectively decoding each sequenceCorresponding tag sequence->Comparing to calculate the editing distance, the number of missed detection of the keywords and the number of false alarms, and respectively weighting the number of missed detection and the number of false alarms to revise the editing distance to obtain revised editing distance;
S4, fusing the revised edit distances obtained for each voice keyword sample during training of the voice keyword recognition model to obtain an error value for each training sample;
and S5, screening the voice keyword samples according to the error value of each training sample until the proportion of the qualified samples meets the requirement.
2. The method for screening voice keyword samples based on the detection-error-weighted edit distance according to claim 1, wherein in step S3 the numbers of missed detections and false alarms are weighted respectively to revise the edit distance, using the following revision formula:
D = D_e + n_FR · D_FR + n_FA · D_FA   (1)
In formula (1), D_e is the edit distance between the decoded sequence and the corresponding label sequence, n_FR is the number of missed keyword detections, D_FR is the keyword miss cost, n_FA is the number of keyword false alarms, and D_FA is the keyword false-alarm cost; D_FR and D_FA are empirical constants satisfying D_FR ≥ 0 and D_FA ≥ 0.
3. The method for screening voice keyword samples based on the detection-error-weighted edit distance according to claim 1, wherein when the revised edit distances obtained during model training for each voice keyword sample are fused in step S4, specifically the revised edit distances from the second iteration to the m-th iteration are fused using the following fusion formula:
ε = (1 / (m - 1)) · Σ_{i=2}^{m} D_i   (2)

In formula (2), ε represents the error value of the training sample, m represents the number of iterations of keyword recognition model training on the training sample set Z_t, and D_i represents the revised edit distance calculated at the i-th iteration.
4. A method for screening voice keyword samples based on the detection-error-weighted edit distance according to any of claims 1-3, wherein the edit distance is the number of insertion, deletion and substitution operations required to convert the decoded sequence into the label sequence.
5. The method for screening voice keyword samples based on the detection error weighted editing distance according to claim 1, wherein the step S5 is to screen the voice keyword samples according to the error value of each training sample, and specifically comprises:
Firstly, all training samples are sorted by error value from large to small and a threshold is set; if a training sample's error value is smaller than or equal to the threshold, it is put into the qualified sample set, and if its error value is greater than the threshold, it is put into the candidate sample set. Then the training samples in the candidate sample set are manually audited, and those found qualified are moved into the qualified sample set. Finally, the resulting qualified sample set is taken as the new training sample set Z_t and steps S2-S5 are repeated until the proportion of qualified samples meets the requirement.
6. The method for screening a voice keyword sample based on detecting an error weighted edit distance according to claim 5, wherein the set threshold is selected by:
1) Dividing the sorted error-value sequence into different continuous intervals according to the numerical range of the training samples' error values;
2) Starting from the interval with the largest values, randomly extracting k training samples from each interval in turn for manual auditing; if a training sample's recording text is consistent with the semantic content of its speech, it is regarded as a qualified sample, otherwise as an unqualified sample, and the audited training samples are recorded;
3) Counting the proportion of unqualified samples among the k training samples extracted from the current interval; if this proportion is smaller than a coefficient α, the manual auditing is stopped and the maximum error value of that interval is taken as the set threshold, where k and α are empirical parameters.
7. The method for screening voice keyword samples based on the detection-error-weighted edit distance according to claim 6, wherein steps S2-S5 are repeated until the proportion of unqualified samples in all continuous intervals is smaller than the coefficient α, i.e. until the proportion of qualified samples meets the requirement.
CN202110938700.5A 2021-08-16 2021-08-16 Voice keyword sample screening method based on detection error weighted editing distance Active CN113823274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110938700.5A CN113823274B (en) 2021-08-16 2021-08-16 Voice keyword sample screening method based on detection error weighted editing distance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110938700.5A CN113823274B (en) 2021-08-16 2021-08-16 Voice keyword sample screening method based on detection error weighted editing distance

Publications (2)

Publication Number Publication Date
CN113823274A CN113823274A (en) 2021-12-21
CN113823274B true CN113823274B (en) 2023-10-27

Family

ID=78923084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110938700.5A Active CN113823274B (en) 2021-08-16 2021-08-16 Voice keyword sample screening method based on detection error weighted editing distance

Country Status (1)

Country Link
CN (1) CN113823274B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559881A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Language-independent keyword recognition method and system
WO2019153996A1 (en) * 2018-02-09 2019-08-15 叶伟 Text error correction method and apparatus for voice recognition
WO2019214149A1 (en) * 2018-05-11 2019-11-14 平安科技(深圳)有限公司 Text key information identification method, electronic device, and readable storage medium
CN110827806A (en) * 2019-10-17 2020-02-21 清华大学深圳国际研究生院 Voice keyword detection method and system
CN111128128A (en) * 2019-12-26 2020-05-08 华南理工大学 Voice keyword detection method based on complementary model scoring fusion
WO2021057038A1 (en) * 2019-09-24 2021-04-01 上海依图信息技术有限公司 Apparatus and method for speech recognition and keyword detection based on multi-task model

Similar Documents

Publication Publication Date Title
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN110930993A (en) Specific field language model generation method and voice data labeling system
CN116127952A (en) Multi-granularity Chinese text error correction method and device
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium
CN113672732B (en) Method and device for classifying service data
CN116542256B (en) Natural language understanding method and device integrating dialogue context information
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model
CN111831902A (en) Recommendation reason screening method and device and electronic equipment
CN112417132A (en) New intent recognition method for screening negative samples using predicate-object information
CN114661872A (en) Beginner-oriented API self-adaptive recommendation method and system
CN116502646A (en) Semantic drift detection method and device, electronic equipment and storage medium
CN115983274A (en) Noise event extraction method based on two-stage label correction
CN113239694B (en) Argument role identification method based on argument phrase
CN111862963A (en) Voice wake-up method, device and equipment
CN113823274B (en) Voice keyword sample screening method based on detection error weighted editing distance
CN117634615A (en) Multi-task code retrieval method based on mode irrelevant comparison learning
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN116050419A (en) Unsupervised identification method and system oriented to scientific literature knowledge entity
CN113707131B (en) Speech recognition method, device, equipment and storage medium
CN114418111A (en) Label prediction model training and sample screening method, device and storage medium
CN114333828A (en) Quick voice recognition system for digital product
CN113495964A (en) Method, device and equipment for screening triples and readable storage medium
CN111444708A (en) SQL statement intelligent completion method based on usage scenarios
CN117350288B (en) Case matching-based network security operation auxiliary decision-making method, system and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant