CN113257236A - Model score optimization method based on core frame screening - Google Patents

Info

Publication number
CN113257236A
Authority
CN
China
Prior art keywords
voice
frame
training
core
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110514259.8A
Other languages
Chinese (zh)
Other versions
CN113257236B (en)
Inventor
杨莹春
魏含玉
吴朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Publication of CN113257236A publication Critical patent/CN113257236A/en
Application granted granted Critical
Publication of CN113257236B publication Critical patent/CN113257236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a model score optimization method based on core frame screening, comprising the following steps: S1, training with the training data to obtain model parameters; S2, calculating the importance weight of each frame of an utterance; S3, selecting the core frames of each utterance according to the importance-weight ranking; S4, training with the core-frame data to obtain model parameters; S5, selecting the core frames of a test utterance by calculating importance weights; and S6, scoring the core frames of the test utterance to obtain the utterance score and make a decision. The score optimization method selects high-quality core frames of an utterance as the scoring basis to improve detection performance, and is applicable to speech classification scenarios such as speech recognition, speaker recognition, and fake speech detection.

Description

Model score optimization method based on core frame screening
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a model score optimization method based on core frame screening.
Background
As a biometric authentication method, voiceprint authentication offers low collection cost, easy acquisition, and convenient remote authentication, and is widely applied in fields such as access control, financial transactions, and judicial forensics. The rapid development of speech synthesis technology has, on the one hand, brought people more convenient services and a better user experience, such as lifelike intelligent customer service, intelligent voice navigation, audiobook reading, and intelligent voice calls; on the other hand, it poses a serious challenge to the security of voiceprint authentication systems, since attacks with synthesized speech significantly degrade their performance. Research on synthesized speech detection is therefore of great significance.
Synthesized speech detection aims to distinguish synthesized speech from genuine speech. In the mainstream GMM detection system, the test stage first extracts the feature sequence of the test utterance, computes the score of each frame with the trained GMM, averages the frame scores to obtain the utterance score, and then makes a decision. In fact, when human listeners judge whether speech is genuine, they do not attend equally to the information in every frame; they pay more attention to particular frames, reflected in cues such as pause continuity, pronunciation accuracy of polyphonic characters, and naturalness of sentence breaks. The GMM mean-scoring method is therefore not well suited to synthesized speech detection, and optimizing the scoring method becomes a topic worth attention.
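For reference, below is a minimal sketch of this baseline mean-scoring test stage in Python, using scikit-learn's GaussianMixture as a stand-in for the trained GMMs; the function and variable names are illustrative assumptions, not the patent's implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def baseline_utterance_score(features: np.ndarray,
                             gmm_genuine: GaussianMixture,
                             gmm_spoof: GaussianMixture) -> float:
    """Baseline GMM mean scoring: average the per-frame log-likelihood
    ratio over ALL frames of the utterance (features: [T, D])."""
    llr = (gmm_genuine.score_samples(features)   # [T] log p(x_t | genuine)
           - gmm_spoof.score_samples(features))  # [T] log p(x_t | spoof)
    return float(np.mean(llr))  # utterance score, compared to a threshold
```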
Disclosure of Invention
To address the score optimization problem, the invention provides a model score optimization method based on core frame screening, which selects high-quality core frames of an utterance as the scoring basis so as to improve detection performance.
A model score optimization method based on core frame screening comprises the following steps:
S1, training original models with the training speech;
S2, calculating the importance weight of each frame of the training speech using the original models;
S3, selecting the core frames of each training utterance according to the importance-weight ranking;
S4, training core models with the core frames of the training speech;
S5, calculating the importance weight of each frame of the test speech using the original models;
S6, selecting the core frames of each test utterance according to the importance-weight ranking;
S7, inputting the core frames of the test utterance into the core models to calculate a matching score, which is the optimized model score.
Further, step S1 is specifically implemented as follows: for an N-class speech recognition task, all training utterances are divided into N sets according to their classes; features are extracted from the training utterances in each set, and an original model is trained for each class, giving N original models for subsequent likelihood scoring, where N is a natural number greater than 1, i.e., the preset number of speech classes.
Further, step S2 is specifically implemented as follows: for any training utterance, the likelihood score of each frame under the original model of the corresponding class is calculated, and the likelihood scores of the frames are then normalized and used as the frame importance weights.
Further, step S3 is specifically implemented as follows: with the importance weights obtained in step S2, the frames of the training utterance are sorted by importance weight from largest to smallest, and a top-ranked fraction of the frames is selected as the core frames of the training utterance (as sketched below).
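A minimal sketch of this top-ranked selection for a single utterance, assuming the importance weights have already been computed; the names and the default fraction are illustrative assumptions:

```python
import numpy as np

def select_core_frames(features: np.ndarray, weights: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Keep the alpha*T frames with the largest importance weights.

    features: [T, D] frame-level features of one utterance
    weights:  [T] importance weight of each frame
    """
    T = features.shape[0]
    n_core = max(1, int(alpha * T))               # keep at least one frame
    top_idx = np.argsort(weights)[::-1][:n_core]  # sort largest -> smallest
    return features[top_idx]
```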
Further, step S4 is specifically implemented as follows: for the core frames obtained in step S3, features are extracted and the core frames are trained by class to obtain a core model for each speech class, used later to calculate the optimized model score.
Further, step S5 is specifically implemented as follows: for any test utterance, the likelihood score of each frame under the original model of the corresponding class is calculated, and the likelihood scores of the frames are then normalized and used as the frame importance weights.
Further, step S6 is specifically implemented as follows: with the importance weights obtained in step S5, the frames of the test utterance are sorted by importance weight from largest to smallest, and a top-ranked fraction of the frames is selected as the core frames of the test utterance.
Further, in the method, steps S1-S4 constitute the training stage and steps S5-S7 the testing stage.
With the score optimization method provided by the invention, the final utterance score is not the mean of the scores of all speech frames but the mean of the scores of the core frames; the score is thus biased toward the speech frames of higher importance, which can improve the classification performance of the model.
Drawings
FIG. 1 is a schematic flow chart of the training phase of the model score optimization method of the present invention.
FIG. 2 is a schematic flow chart of the testing phase of the model score optimization method of the present invention.
Detailed Description
The invention is applicable to speech classification scenarios such as speech recognition, speaker recognition, and forged speech detection. For a further understanding of the invention, the following describes in detail only a specific embodiment that applies core-frame-based model score optimization to synthesized speech detection; it should be understood that this description merely further illustrates the features and advantages of the invention and does not limit the invention as claimed.
The experimental data in this embodiment uses the Logical Access database of the 2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2019-LA), the 2015 edition of the same challenge (ASVspoof 2015), and a real-scene synthesized speech detection dataset (RS-SSD).
The ASVspoof challenge is jointly organized by several world-leading research institutions, including the University of Edinburgh (UK), EURECOM (France), NEC (Japan), and the University of Eastern Finland. The genuine speech of ASVspoof 2019 comes from 107 speakers (61 female, 46 male). The dataset is divided into three parts: a training set (Train), a development set (Dev), and an evaluation set (Eval); the recording environment is quiet, with no obvious channel or environmental noise. The spoofed utterances of the training and development sets are generated from the genuine speech by various algorithms. The training set contains 20 speakers (12 female, 8 male), with 2,580 genuine and 22,800 spoofed utterances; the development set contains 20 speakers (12 female, 8 male), with 2,548 genuine and 22,296 spoofed utterances; the evaluation set contains 67 speakers (37 female, 30 male), with 7,355 genuine and 63,882 spoofed utterances, and is about 4 GB in size.
The genuine speech of ASVspoof 2015 comes from 106 speakers (61 female, 45 male) and is likewise split into training, development, and evaluation sets recorded in a quiet environment without obvious channel or environmental noise. The spoofed utterances of the training and development sets are generated from the genuine speech by various algorithms. The training set contains 25 speakers (15 female, 10 male), with 3,750 genuine and 12,625 spoofed utterances; the development set contains 35 speakers (20 female, 15 male), with 2,497 genuine and 49,875 spoofed utterances; the evaluation set contains 46 speakers (26 female, 20 male) and about 200,000 test utterances, with a size of about 20 GB.
The Real-scene Synthesized Speech Detection dataset (RS-SSD) contains synthesized speech from ***, Tencent, and Baidu, as well as synthesized speech of the Xinhua News Agency's artificial intelligence (AI) news anchor, 4.12 hours in total, together with genuine speech of the same duration drawn from online media videos, Xinhua News Agency news videos, and the Mandarin Affective Speech Corpus (MASC) released by the CCNT Laboratory of Zhejiang University, with part of the material taken from the open-source Mandarin speech database AISHELL-1 provided by AIShell (Beijing Shell Shell Technology). The speech content is diverse, covering scenarios such as news reports, smart homes, autonomous driving, and industrial production.
As shown in FIG. 1 and FIG. 2, the model score optimization method based on core frame screening of the present invention comprises the following steps:
S1, training with the training data to obtain model parameters;
S2, calculating the importance weight of each frame of an utterance;
S3, selecting the core frames of each utterance according to the importance-weight ranking;
S4, training with the core-frame data to obtain model parameters;
S5, selecting the core frames of a test utterance by calculating importance weights;
S6, scoring the core frames of the test utterance to obtain the utterance score for decision making.
The specific implementation of step S1 is as follows: first, for synthesized speech detection, denote the genuine-speech training corpus as $X^{genuine}=\{\chi_1,\chi_2,\dots\}$, the spoofed-speech training corpus as $X^{spoof}$, and the test corpus as $Q=\{q_1,q_2,\dots\}$, where utterance $i$ contains $T_i$ frames; the ratio of the core frames of an utterance to its total frames is denoted $\alpha$.
Features are extracted from the corpora, for example 32-dimensional LFCC with first-order and second-order delta features appended (96 dimensions in total), and GMMs are trained on the genuine speech and on the synthesized speech respectively, yielding the model parameters $\mathrm{GMM1}_{genuine}$ and $\mathrm{GMM1}_{spoof}$; these two GMMs are later used to calculate the likelihood scores of the speech frames.
GMM training is a supervised optimization process that typically uses the maximum likelihood criterion. The whole process consists of parameter initialization and parameter optimization: initialization usually uses the LBG algorithm, and optimization usually uses the EM algorithm. Since GMM training and speech feature extraction are standard components of existing synthesized speech detection systems, they are not described in further detail here. The GMM order K is typically chosen as a power of 2, such as 64, 128, 512, or 1024; experiments found that a 512-order GMM performs better for the 96-dimensional LFCC features used.
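A hedged sketch of this training step using scikit-learn, which initializes with k-means rather than LBG and runs EM internally; the LFCC extraction is assumed to happen elsewhere, and the 512-order, diagonal-covariance choice follows the experiment described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(frames: np.ndarray, n_components: int = 512) -> GaussianMixture:
    """Train one class-conditional GMM on pooled frame features.

    frames: [N, 96] stacked 96-dim LFCC(+delta+delta-delta) frames
    from all training utterances of one class.
    """
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',  # common choice for speech
                          max_iter=200, reg_covar=1e-6, random_state=0)
    gmm.fit(frames)  # EM optimization of means, variances, mixture weights
    return gmm

# e.g. gmm1_genuine = train_gmm(real_frames); gmm1_spoof = train_gmm(spoof_frames)
```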
The specific implementation of step S2 is as follows: for each training utterance, the log-likelihood of every frame is computed under the original model of its own class, i.e. $s_t=\log p(x_t\mid \mathrm{GMM1}_{genuine})$ for frames of genuine speech and $s_t=\log p(x_t\mid \mathrm{GMM1}_{spoof})$ for frames of synthesized speech; the log-likelihood scores of the frames of an utterance are then normalized, and the normalized scores are used as the importance weights of the frames.
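A sketch of the weight computation for one training utterance; the patent's exact normalization equation is not recoverable from the extracted text, so min-max normalization is used here purely as an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_importance_weights(features: np.ndarray,
                             own_class_gmm: GaussianMixture) -> np.ndarray:
    """Per-frame log-likelihood under the utterance's own class model
    (GMM1_genuine for genuine speech, GMM1_spoof for synthesized speech),
    normalized to [0, 1]. NOTE: min-max normalization is an assumption."""
    ll = own_class_gmm.score_samples(features)   # [T] frame log-likelihoods
    rng = ll.max() - ll.min()
    return (ll - ll.min()) / rng if rng > 0 else np.ones_like(ll)
```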
The specific implementation of step S3 is as follows: with the importance weights obtained in step S2, the frames within each utterance are ranked by importance (separately for each speech class), and for each utterance the top-ranked $\alpha\times T_i$ frames (a fraction $\alpha$) are selected as its core frames $\chi\_core_i$.
The specific implementation of step S4 is as follows: with the per-utterance core frames obtained in step S3, the genuine-speech core frames and the synthesized-speech core frames are trained separately to obtain the GMM parameters $\mathrm{GMM2}_{genuine}$ and $\mathrm{GMM2}_{spoof}$; the GMM training procedure is the same as in step S1.
The specific implementation of step S5 is as follows: for a test utterance, its acoustic features are extracted and, for each frame $q_t$, the log-likelihoods under the two original models are computed, giving the per-frame log-likelihood ratio $\Lambda_t=\log p(q_t\mid \mathrm{GMM1}_{genuine})-\log p(q_t\mid \mathrm{GMM1}_{spoof})$. The frame likelihood scores are normalized to obtain the frame importance weights, and the top-ranked $\alpha\times T_i$ frames (a fraction $\alpha$ of the utterance) are selected as its core frames $q\_core_i$.
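A sketch of this test-stage selection under the same assumptions, ranking frames by their log-likelihood ratio between the two original models:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def test_core_frames(features: np.ndarray,
                     gmm1_genuine: GaussianMixture,
                     gmm1_spoof: GaussianMixture,
                     alpha: float = 0.5) -> np.ndarray:
    """Rank the frames of a test utterance by the log-likelihood ratio
    between the two original models and keep the top alpha*T frames."""
    llr = (gmm1_genuine.score_samples(features)
           - gmm1_spoof.score_samples(features))      # [T] per-frame LLR
    n_core = max(1, int(alpha * features.shape[0]))
    return features[np.argsort(llr)[::-1][:n_core]]   # top-ranked frames
```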
The specific implementation of step S6 is as follows: for the core frames $q\_core_i$ of the test utterance obtained in step S5, the log-likelihood of each core frame is computed and averaged under each core model, and the utterance score is the resulting log-likelihood ratio $llk_i=\frac{1}{|q\_core_i|}\sum_{t\in q\_core_i}\log p(q_t\mid \mathrm{GMM2}_{genuine})-\frac{1}{|q\_core_i|}\sum_{t\in q\_core_i}\log p(q_t\mid \mathrm{GMM2}_{spoof})$. This score is compared with a model threshold to obtain the decision class of the utterance: if $llk_i>threshold$ the utterance is judged genuine speech; if $llk_i<threshold$ it is judged synthesized speech.
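A sketch of this final scoring step; how the threshold is chosen is not stated in the extracted text, so the tunable `threshold` parameter here is an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def core_model_decision(core_frames: np.ndarray,
                        gmm2_genuine: GaussianMixture,
                        gmm2_spoof: GaussianMixture,
                        threshold: float = 0.0) -> tuple[float, str]:
    """Mean log-likelihood ratio of the core frames under the core
    models; scores above the threshold are judged genuine."""
    llk = (np.mean(gmm2_genuine.score_samples(core_frames))
           - np.mean(gmm2_spoof.score_samples(core_frames)))
    return float(llk), ('genuine' if llk > threshold else 'synthesized')
```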
All utterances in the evaluation sets are tested in this manner. The experiments are based on a GMM system, and the equal error rate (EER) of the original mean scoring method is compared with that of the proposed method, as shown in Table 1:
TABLE 1 (EER comparison; the table is rendered as an image in the original and is not reproduced)
As can be seen from Table 1, the method improves the recognition performance of the system to a certain extent: compared with the original mean scoring method, the EER rises by 0.32% on the ASVspoof 2015 Eval set, falls by 1.34% on the ASVspoof 2019 Eval set, and falls by 2.34% on the RS-SSD set, an overall performance improvement.
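For completeness, a minimal sketch of how an equal error rate of the kind reported in Table 1 can be computed from utterance scores; this helper is illustrative and not part of the patent:

```python
import numpy as np

def equal_error_rate(genuine_scores: np.ndarray,
                     spoof_scores: np.ndarray) -> float:
    """Sweep candidate thresholds and return the point where the
    false-rejection and false-acceptance rates are closest."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(frr - far)))
    return float((frr[i] + far[i]) / 2)
```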
The foregoing description of the embodiments is provided to enable those of ordinary skill in the art to understand and apply the invention. Those skilled in the art can readily make various modifications to these embodiments and apply the general principles described herein to other embodiments without inventive effort. Therefore, the invention is not limited to the embodiments above; improvements and modifications made by those skilled in the art according to this disclosure shall fall within the protection scope of the invention.

Claims (8)

1. A model score optimization method based on core frame screening comprises the following steps:
S1, training original models with the training speech;
S2, calculating the importance weight of each frame of the training speech using the original models;
S3, selecting the core frames of each training utterance according to the importance-weight ranking;
S4, training core models with the core frames of the training speech;
S5, calculating the importance weight of each frame of the test speech using the original models;
S6, selecting the core frames of each test utterance according to the importance-weight ranking;
S7, inputting the core frames of the test utterance into the core models to calculate a matching score, which is the optimized model score.
2. The model score optimization method of claim 1, wherein step S1 is specifically implemented as follows: for an N-class speech recognition task, all training utterances are divided into N sets according to their classes; features are extracted from the training utterances in each set, and an original model is trained for each class, giving N original models for subsequent likelihood scoring, where N is a natural number greater than 1, i.e., the preset number of speech classes.
3. The model score optimization method of claim 1, wherein step S2 is specifically implemented as follows: for any training utterance, the likelihood score of each frame under the original model of the corresponding class is calculated, and the likelihood scores of the frames are then normalized and used as the frame importance weights.
4. The model score optimization method of claim 1, wherein step S3 is specifically implemented as follows: with the importance weights obtained in step S2, the frames of the training utterance are sorted by importance weight from largest to smallest, and a top-ranked fraction of the frames is selected as the core frames of the training utterance.
5. The model score optimization method of claim 1, wherein step S4 is specifically implemented as follows: for the core frames obtained in step S3, features are extracted and the core frames are trained by class to obtain a core model for each speech class, used later to calculate the optimized model score.
6. The model score optimization method of claim 1, wherein step S5 is specifically implemented as follows: for any test utterance, the likelihood score of each frame under the original model of the corresponding class is calculated, and the likelihood scores of the frames are then normalized and used as the frame importance weights.
7. The model score optimization method of claim 1, wherein step S6 is specifically implemented as follows: with the importance weights obtained in step S5, the frames of the test utterance are sorted by importance weight from largest to smallest, and a top-ranked fraction of the frames is selected as the core frames of the test utterance.
8. The model score optimization method of claim 1, wherein steps S1-S4 constitute the training stage and steps S5-S7 the testing stage.
CN202110514259.8A 2020-04-30 2021-04-30 Model score optimization method based on core frame screening Active CN113257236B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020103613811 2020-04-30
CN202010361381 2020-04-30

Publications (2)

Publication Number Publication Date
CN113257236A true CN113257236A (en) 2021-08-13
CN113257236B CN113257236B (en) 2022-03-29

Family

ID=77222896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110514259.8A Active CN113257236B (en) 2020-04-30 2021-04-30 Model score optimization method based on core frame screening

Country Status (1)

Country Link
CN (1) CN113257236B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070043768A1 (en) * 2005-08-19 2007-02-22 Samsung Electronics Co., Ltd. Apparatus, medium, and method clustering audio files
CN1808567A (en) * 2006-01-26 2006-07-26 覃文华 Voice-print authentication device and method of authenticating people presence
US20080162139A1 (en) * 2006-11-30 2008-07-03 Samsung Electronics Co., Ltd. Apparatus and method for outputting voice
CN104240706A (en) * 2014-09-12 2014-12-24 浙江大学 Speaker recognition method based on GMM Token matching similarity correction scores
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110085236A (en) * 2019-05-06 2019-08-02 中国人民解放军陆军工程大学 A kind of method for distinguishing speek person based on the weighting of adaptive voice frame

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LOU WENHUA: "Text-Independent Speaker Recognition Using GMM Non-Linear Transformation", Chinese Journal of Electron Devices *
WEI Xing: "Dynamic vehicle behavior recognition network based on long short-term memory", Journal of Computer Applications *
ZHANG Ge et al.: "Speech recognition decoding acceleration method based on heterogeneous computing", Network New Media Technology *
SANG Lifeng et al.: "Re-optimization of GMM-based speech frame scores", Journal of Guangxi Normal University (Natural Science Edition) *

Also Published As

Publication number Publication date
CN113257236B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
Dennis Sound event recognition in unstructured environments using spectrogram image processing
CN103531198B (en) A kind of speech emotion feature normalization method based on pseudo-speaker clustering
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN104882144A (en) Animal voice identification method based on double sound spectrogram characteristics
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
Ghai et al. Emotion recognition on speech signals using machine learning
CN112329438B (en) Automatic lie detection method and system based on domain countermeasure training
Yücesoy et al. A new approach with score-level fusion for the classification of a speaker age and gender
Joshi et al. A Study of speech emotion recognition methods
CN103578481A (en) Method for recognizing cross-linguistic voice emotion
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
Wang et al. A network model of speaker identification with new feature extraction methods and asymmetric BLSTM
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
Mishra et al. Gender differentiated convolutional neural networks for speech emotion recognition
Prachi et al. Deep learning based speaker recognition system with cnn and lstm techniques
Konangi et al. Emotion recognition through speech: A review
CN113257236B (en) Model score optimization method based on core frame screening
Duduka et al. A neural network approach to accent classification
Saputri et al. Identifying Indonesian local languages on spontaneous speech data
Sharma et al. Speech Emotion Recognition System using SVD algorithm with HMM Model
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant