CN113257236A - Model score optimization method based on core frame screening - Google Patents

Info

Publication number
CN113257236A
Authority
CN
China
Prior art keywords
voice
frame
training
core
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110514259.8A
Other languages
Chinese (zh)
Other versions
CN113257236B (en)
Inventor
杨莹春
魏含玉
吴朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Publication of CN113257236A publication Critical patent/CN113257236A/en
Application granted granted Critical
Publication of CN113257236B publication Critical patent/CN113257236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a model score optimization method based on core frame screening, comprising the following steps: S1, training with the training data to obtain model parameters; S2, calculating the importance weight of each frame of an utterance; S3, selecting the core frames of each utterance according to the importance-weight ranking; S4, training with the core-frame data to obtain model parameters; S5, selecting the core frames of a test utterance by calculating importance weights; and S6, scoring the core frames of the test utterance to obtain the utterance score and make a decision. The score optimization method selects high-quality core frames of an utterance as the scoring basis to improve detection performance, and is applicable to speech classification scenarios such as speech recognition, speaker recognition, and fake speech detection.

Description

Model score optimization method based on core frame screening
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a model score optimization method based on core frame screening.
Background
As a biometric authentication method, voiceprint authentication offers low collection cost, easy acquisition, and convenient remote authentication, and is widely applied in fields such as access control, financial transactions, and judicial forensics. The rapid development of speech synthesis technology has, on the one hand, brought people more convenient services and a better user experience, such as lifelike intelligent customer service, intelligent voice navigation, audiobook reading, and intelligent voice calls; on the other hand, it poses a serious challenge to the security of voiceprint authentication systems, since attacks with synthesized speech significantly degrade their performance. Research on synthesized speech detection is therefore of great significance.
Synthesized speech detection aims to distinguish synthesized speech from genuine speech. In the mainstream GMM detection system, the test stage first extracts the feature sequence of the test utterance, computes the score of each frame with the trained GMM, averages the frame scores to obtain the utterance score, and then makes a decision. In fact, when human listeners judge whether speech is genuine, they do not attend equally to the information in every frame; they pay more attention to particular frames, reflected in cues such as pause continuity, pronunciation accuracy of polyphonic characters, and naturalness of sentence breaks. The GMM mean-scoring method is therefore not well suited to synthesized speech detection, and optimizing the scoring method becomes a topic worth attention.
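For reference, below is a minimal sketch of this baseline mean-scoring test stage in Python, using scikit-learn's GaussianMixture as a stand-in for the trained GMMs; the function and variable names are illustrative assumptions, not the patent's implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def baseline_utterance_score(features: np.ndarray,
                             gmm_genuine: GaussianMixture,
                             gmm_spoof: GaussianMixture) -> float:
    """Baseline GMM mean scoring: average the per-frame log-likelihood
    ratio over ALL frames of the utterance (features: [T, D])."""
    llr = (gmm_genuine.score_samples(features)   # [T] log p(x_t | genuine)
           - gmm_spoof.score_samples(features))  # [T] log p(x_t | spoof)
    return float(np.mean(llr))  # utterance score, compared to a threshold
```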
Disclosure of Invention
To address the score optimization problem, the invention provides a model score optimization method based on core frame screening, which selects high-quality core frames of an utterance as the scoring basis so as to improve detection performance.
A model score optimization method based on core frame screening comprises the following steps:
S1, training original models with the training speech;
S2, calculating the importance weight of each frame of the training speech using the original models;
S3, selecting the core frames of each training utterance according to the importance-weight ranking;
S4, training core models with the core frames of the training speech;
S5, calculating the importance weight of each frame of the test speech using the original models;
S6, selecting the core frames of each test utterance according to the importance-weight ranking;
S7, inputting the core frames of the test utterance into the core models to calculate a matching score, which is the optimized model score.
Further, step S1 is specifically implemented as follows: for an N-class speech recognition task, all training utterances are divided into N sets according to their classes; features are extracted from the training utterances in each set, and an original model is trained for each class, giving N original models for subsequent likelihood scoring, where N is a natural number greater than 1, i.e., the preset number of speech classes.
Further, step S2 is specifically implemented as follows: for any training utterance, the likelihood score of each frame under the original model of the corresponding class is calculated, and the likelihood scores of the frames are then normalized and used as the frame importance weights.
Further, step S3 is specifically implemented as follows: with the importance weights obtained in step S2, the frames of the training utterance are sorted by importance weight from largest to smallest, and a top-ranked fraction of the frames is selected as the core frames of the training utterance (as sketched below).
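A minimal sketch of this top-ranked selection for a single utterance, assuming the importance weights have already been computed; the names and the default fraction are illustrative assumptions:

```python
import numpy as np

def select_core_frames(features: np.ndarray, weights: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Keep the alpha*T frames with the largest importance weights.

    features: [T, D] frame-level features of one utterance
    weights:  [T] importance weight of each frame
    """
    T = features.shape[0]
    n_core = max(1, int(alpha * T))               # keep at least one frame
    top_idx = np.argsort(weights)[::-1][:n_core]  # sort largest -> smallest
    return features[top_idx]
```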
Further, step S4 is specifically implemented as follows: for the core frames obtained in step S3, features are extracted and the core frames are trained by class to obtain a core model for each speech class, used later to calculate the optimized model score.
Further, step S5 is specifically implemented as follows: for any test utterance, the likelihood score of each frame under the original model of the corresponding class is calculated, and the likelihood scores of the frames are then normalized and used as the frame importance weights.
Further, step S6 is specifically implemented as follows: with the importance weights obtained in step S5, the frames of the test utterance are sorted by importance weight from largest to smallest, and a top-ranked fraction of the frames is selected as the core frames of the test utterance.
Further, in the method, steps S1-S4 constitute the training stage and steps S5-S7 the testing stage.
With the score optimization method provided by the invention, the final utterance score is not the mean of the scores of all speech frames but the mean of the scores of the core frames; the score is thus biased toward the speech frames of higher importance, which can improve the classification performance of the model.
Drawings
FIG. 1 is a schematic flow chart of the training phase of the model score optimization method of the present invention.
FIG. 2 is a schematic flow chart of the testing phase of the model score optimization method of the present invention.
Detailed Description
The invention is applicable to speech classification scenarios such as speech recognition, speaker recognition, and forged speech detection. For a further understanding of the invention, the following describes in detail only a specific embodiment that applies core-frame-based model score optimization to synthesized speech detection; it should be understood that this description merely further illustrates the features and advantages of the invention and does not limit the invention as claimed.
The experimental data in this embodiment uses the Logical Access database of the 2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2019-LA), the 2015 edition of the same challenge (ASVspoof 2015), and a real-scene synthesized speech detection dataset (RS-SSD).
The ASVspoof challenge is jointly organized by several world-leading research institutions, including the University of Edinburgh (UK), EURECOM (France), NEC (Japan), and the University of Eastern Finland. The genuine speech of ASVspoof 2019 comes from 107 speakers (61 female, 46 male). The dataset is divided into three parts: a training set (Train), a development set (Dev), and an evaluation set (Eval); the recording environment is quiet, with no obvious channel or environmental noise. The spoofed utterances of the training and development sets are generated from the genuine speech by various algorithms. The training set contains 20 speakers (12 female, 8 male), with 2,580 genuine and 22,800 spoofed utterances; the development set contains 20 speakers (12 female, 8 male), with 2,548 genuine and 22,296 spoofed utterances; the evaluation set contains 67 speakers (37 female, 30 male), with 7,355 genuine and 63,882 spoofed utterances, and is about 4 GB in size.
The genuine speech of ASVspoof 2015 comes from 106 speakers (61 female, 45 male) and is likewise split into training, development, and evaluation sets recorded in a quiet environment without obvious channel or environmental noise. The spoofed utterances of the training and development sets are generated from the genuine speech by various algorithms. The training set contains 25 speakers (15 female, 10 male), with 3,750 genuine and 12,625 spoofed utterances; the development set contains 35 speakers (20 female, 15 male), with 2,497 genuine and 49,875 spoofed utterances; the evaluation set contains 46 speakers (26 female, 20 male) and about 200,000 test utterances, with a size of about 20 GB.
The Real-scene Synthesized Speech Detection dataset (RS-SSD) contains synthesized speech from ***, Tencent, and Baidu, as well as synthesized speech of the Xinhua News Agency's artificial intelligence (AI) news anchor, 4.12 hours in total, together with genuine speech of the same duration drawn from online media videos, Xinhua News Agency news videos, and the Mandarin Affective Speech Corpus (MASC) released by the CCNT Laboratory of Zhejiang University, with part of the material taken from the open-source Mandarin speech database AISHELL-1 provided by AIShell (Beijing Shell Shell Technology). The speech content is diverse, covering scenarios such as news reports, smart homes, autonomous driving, and industrial production.
As shown in FIG. 1 and FIG. 2, the model score optimization method based on core frame screening of the present invention comprises the following steps:
S1, training with the training data to obtain model parameters;
S2, calculating the importance weight of each frame of an utterance;
S3, selecting the core frames of each utterance according to the importance-weight ranking;
S4, training with the core-frame data to obtain model parameters;
S5, selecting the core frames of a test utterance by calculating importance weights;
S6, scoring the core frames of the test utterance to obtain the utterance score for decision making.
The specific implementation of step S1 is as follows: first, for synthesized speech detection, denote the genuine-speech training corpus as $X^{genuine}=\{\chi_1,\chi_2,\dots\}$, the spoofed-speech training corpus as $X^{spoof}$, and the test corpus as $Q=\{q_1,q_2,\dots\}$, where utterance $i$ contains $T_i$ frames; the ratio of the core frames of an utterance to its total frames is denoted $\alpha$.
Features are extracted from the corpora, for example 32-dimensional LFCC with first-order and second-order delta features appended (96 dimensions in total), and GMMs are trained on the genuine speech and on the synthesized speech respectively, yielding the model parameters $\mathrm{GMM1}_{genuine}$ and $\mathrm{GMM1}_{spoof}$; these two GMMs are later used to calculate the likelihood scores of the speech frames.
GMM training is a supervised optimization process that typically uses the maximum likelihood criterion. The whole process consists of parameter initialization and parameter optimization: initialization usually uses the LBG algorithm, and optimization usually uses the EM algorithm. Since GMM training and speech feature extraction are standard components of existing synthesized speech detection systems, they are not described in further detail here. The GMM order K is typically chosen as a power of 2, such as 64, 128, 512, or 1024; experiments found that a 512-order GMM performs better for the 96-dimensional LFCC features used.
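A hedged sketch of this training step using scikit-learn, which initializes with k-means rather than LBG and runs EM internally; the LFCC extraction is assumed to happen elsewhere, and the 512-order, diagonal-covariance choice follows the experiment described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(frames: np.ndarray, n_components: int = 512) -> GaussianMixture:
    """Train one class-conditional GMM on pooled frame features.

    frames: [N, 96] stacked 96-dim LFCC(+delta+delta-delta) frames
    from all training utterances of one class.
    """
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',  # common choice for speech
                          max_iter=200, reg_covar=1e-6, random_state=0)
    gmm.fit(frames)  # EM optimization of means, variances, mixture weights
    return gmm

# e.g. gmm1_genuine = train_gmm(real_frames); gmm1_spoof = train_gmm(spoof_frames)
```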
The specific implementation of step S2 is as follows: for each training utterance, the log-likelihood of every frame is computed under the original model of its own class, i.e. $s_t=\log p(x_t\mid \mathrm{GMM1}_{genuine})$ for frames of genuine speech and $s_t=\log p(x_t\mid \mathrm{GMM1}_{spoof})$ for frames of synthesized speech; the log-likelihood scores of the frames of an utterance are then normalized, and the normalized scores are used as the importance weights of the frames.
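A sketch of the weight computation for one training utterance; the patent's exact normalization equation is not recoverable from the extracted text, so min-max normalization is used here purely as an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_importance_weights(features: np.ndarray,
                             own_class_gmm: GaussianMixture) -> np.ndarray:
    """Per-frame log-likelihood under the utterance's own class model
    (GMM1_genuine for genuine speech, GMM1_spoof for synthesized speech),
    normalized to [0, 1]. NOTE: min-max normalization is an assumption."""
    ll = own_class_gmm.score_samples(features)   # [T] frame log-likelihoods
    rng = ll.max() - ll.min()
    return (ll - ll.min()) / rng if rng > 0 else np.ones_like(ll)
```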
The specific implementation of step S3 is as follows: with the importance weights obtained in step S2, the frames within each utterance are ranked by importance (separately for each speech class), and for each utterance the top-ranked $\alpha\times T_i$ frames (a fraction $\alpha$) are selected as its core frames $\chi\_core_i$.
The specific implementation of step S4 is as follows: with the per-utterance core frames obtained in step S3, the genuine-speech core frames and the synthesized-speech core frames are trained separately to obtain the GMM parameters $\mathrm{GMM2}_{genuine}$ and $\mathrm{GMM2}_{spoof}$; the GMM training procedure is the same as in step S1.
The specific implementation of step S5 is as follows: for a test utterance, its acoustic features are extracted and, for each frame $q_t$, the log-likelihoods under the two original models are computed, giving the per-frame log-likelihood ratio $\Lambda_t=\log p(q_t\mid \mathrm{GMM1}_{genuine})-\log p(q_t\mid \mathrm{GMM1}_{spoof})$. The frame likelihood scores are normalized to obtain the frame importance weights, and the top-ranked $\alpha\times T_i$ frames (a fraction $\alpha$ of the utterance) are selected as its core frames $q\_core_i$.
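A sketch of this test-stage selection under the same assumptions, ranking frames by their log-likelihood ratio between the two original models:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def test_core_frames(features: np.ndarray,
                     gmm1_genuine: GaussianMixture,
                     gmm1_spoof: GaussianMixture,
                     alpha: float = 0.5) -> np.ndarray:
    """Rank the frames of a test utterance by the log-likelihood ratio
    between the two original models and keep the top alpha*T frames."""
    llr = (gmm1_genuine.score_samples(features)
           - gmm1_spoof.score_samples(features))      # [T] per-frame LLR
    n_core = max(1, int(alpha * features.shape[0]))
    return features[np.argsort(llr)[::-1][:n_core]]   # top-ranked frames
```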
The specific implementation of step S6 is as follows: for the core frames $q\_core_i$ of the test utterance obtained in step S5, the log-likelihood of each core frame is computed and averaged under each core model, and the utterance score is the resulting log-likelihood ratio $llk_i=\frac{1}{|q\_core_i|}\sum_{t\in q\_core_i}\log p(q_t\mid \mathrm{GMM2}_{genuine})-\frac{1}{|q\_core_i|}\sum_{t\in q\_core_i}\log p(q_t\mid \mathrm{GMM2}_{spoof})$. This score is compared with a model threshold to obtain the decision class of the utterance: if $llk_i>threshold$ the utterance is judged genuine speech; if $llk_i<threshold$ it is judged synthesized speech.
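A sketch of this final scoring step; how the threshold is chosen is not stated in the extracted text, so the tunable `threshold` parameter here is an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def core_model_decision(core_frames: np.ndarray,
                        gmm2_genuine: GaussianMixture,
                        gmm2_spoof: GaussianMixture,
                        threshold: float = 0.0) -> tuple[float, str]:
    """Mean log-likelihood ratio of the core frames under the core
    models; scores above the threshold are judged genuine."""
    llk = (np.mean(gmm2_genuine.score_samples(core_frames))
           - np.mean(gmm2_spoof.score_samples(core_frames)))
    return float(llk), ('genuine' if llk > threshold else 'synthesized')
```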
All utterances in the evaluation sets are tested in this manner. The experiments are based on a GMM system, and the equal error rate (EER) of the original mean scoring method is compared with that of the proposed method, as shown in Table 1:
TABLE 1 (EER comparison; the table is rendered as an image in the original and is not reproduced)
As can be seen from Table 1, the method improves the recognition performance of the system to a certain extent: compared with the original mean scoring method, the EER rises by 0.32% on the ASVspoof 2015 Eval set, falls by 1.34% on the ASVspoof 2019 Eval set, and falls by 2.34% on the RS-SSD set, an overall performance improvement.
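For completeness, a minimal sketch of how an equal error rate of the kind reported in Table 1 can be computed from utterance scores; this helper is illustrative and not part of the patent:

```python
import numpy as np

def equal_error_rate(genuine_scores: np.ndarray,
                     spoof_scores: np.ndarray) -> float:
    """Sweep candidate thresholds and return the point where the
    false-rejection and false-acceptance rates are closest."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(frr - far)))
    return float((frr[i] + far[i]) / 2)
```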
The foregoing description of the embodiments is provided to enable those of ordinary skill in the art to understand and apply the invention. Those skilled in the art can readily make various modifications to these embodiments and apply the general principles described herein to other embodiments without inventive effort. Therefore, the invention is not limited to the embodiments above; improvements and modifications made by those skilled in the art according to this disclosure shall fall within the protection scope of the invention.

Claims (8)

1. A model score optimization method based on core frame screening comprises the following steps:
S1, training original models with the training speech;
S2, calculating the importance weight of each frame of the training speech using the original models;
S3, selecting the core frames of each training utterance according to the importance-weight ranking;
S4, training core models with the core frames of the training speech;
S5, calculating the importance weight of each frame of the test speech using the original models;
S6, selecting the core frames of each test utterance according to the importance-weight ranking;
S7, inputting the core frames of the test utterance into the core models to calculate a matching score, which is the optimized model score.
2. The model score optimization method of claim 1, wherein step S1 is specifically implemented as follows: for an N-class speech recognition task, all training utterances are divided into N sets according to their classes; features are extracted from the training utterances in each set, and an original model is trained for each class, giving N original models for subsequent likelihood scoring, where N is a natural number greater than 1, i.e., the preset number of speech classes.
3. The model score optimization method of claim 1, wherein step S2 is specifically implemented as follows: for any training utterance, the likelihood score of each frame under the original model of the corresponding class is calculated, and the likelihood scores of the frames are then normalized and used as the frame importance weights.
4. The model score optimization method of claim 1, wherein step S3 is specifically implemented as follows: with the importance weights obtained in step S2, the frames of the training utterance are sorted by importance weight from largest to smallest, and a top-ranked fraction of the frames is selected as the core frames of the training utterance.
5. The model score optimization method of claim 1, wherein step S4 is specifically implemented as follows: for the core frames obtained in step S3, features are extracted and the core frames are trained by class to obtain a core model for each speech class, used later to calculate the optimized model score.
6. The model score optimization method of claim 1, wherein step S5 is specifically implemented as follows: for any test utterance, the likelihood score of each frame under the original model of the corresponding class is calculated, and the likelihood scores of the frames are then normalized and used as the frame importance weights.
7. The model score optimization method of claim 1, wherein step S6 is specifically implemented as follows: with the importance weights obtained in step S5, the frames of the test utterance are sorted by importance weight from largest to smallest, and a top-ranked fraction of the frames is selected as the core frames of the test utterance.
8. The model score optimization method of claim 1, wherein steps S1-S4 constitute the training stage and steps S5-S7 the testing stage.
CN202110514259.8A 2020-04-30 2021-04-30 Model score optimization method based on core frame screening Active CN113257236B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020103613811 2020-04-30
CN202010361381 2020-04-30

Publications (2)

Publication Number Publication Date
CN113257236A true CN113257236A (en) 2021-08-13
CN113257236B CN113257236B (en) 2022-03-29

Family

ID=77222896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110514259.8A Active CN113257236B (en) 2020-04-30 2021-04-30 Model score optimization method based on core frame screening

Country Status (1)

Country Link
CN (1) CN113257236B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070043768A1 (en) * 2005-08-19 2007-02-22 Samsung Electronics Co., Ltd. Apparatus, medium, and method clustering audio files
CN1808567A (en) * 2006-01-26 2006-07-26 覃文华 Voice-print authentication device and method of authenticating people presence
US20080162139A1 (en) * 2006-11-30 2008-07-03 Samsung Electronics Co., Ltd. Apparatus and method for outputting voice
CN104240706A (en) * 2014-09-12 2014-12-24 浙江大学 Speaker recognition method based on GMM Token matching similarity correction scores
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110085236A (en) * 2019-05-06 2019-08-02 中国人民解放军陆军工程大学 A kind of method for distinguishing speek person based on the weighting of adaptive voice frame

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LOU WENHUA: "Text-Independent Speaker Recognition Using GMM Non-Linear Transformation", Chinese Journal of Electron Devices *
WEI Xing: "Dynamic vehicle behavior recognition network based on long short-term memory", Journal of Computer Applications *
ZHANG Ge et al.: "Speech recognition decoding acceleration method based on heterogeneous computing", Network New Media Technology *
SANG Lifeng et al.: "Re-optimization of GMM-based speech frame scores", Journal of Guangxi Normal University (Natural Science Edition) *

Also Published As

Publication number Publication date
CN113257236B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
Dennis Sound event recognition in unstructured environments using spectrogram image processing
CN103531198B (en) A kind of speech emotion feature normalization method based on pseudo-speaker clustering
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN104882144A (en) Animal voice identification method based on double sound spectrogram characteristics
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
Ghai et al. Emotion recognition on speech signals using machine learning
CN112329438B (en) Automatic lie detection method and system based on domain countermeasure training
Yücesoy et al. A new approach with score-level fusion for the classification of a speaker age and gender
Joshi et al. A Study of speech emotion recognition methods
CN103578481A (en) Method for recognizing cross-linguistic voice emotion
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
Wang et al. A network model of speaker identification with new feature extraction methods and asymmetric BLSTM
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
Mishra et al. Gender differentiated convolutional neural networks for speech emotion recognition
Prachi et al. Deep learning based speaker recognition system with cnn and lstm techniques
Konangi et al. Emotion recognition through speech: A review
CN113257236B (en) Model score optimization method based on core frame screening
Duduka et al. A neural network approach to accent classification
Saputri et al. Identifying Indonesian local languages on spontaneous speech data
Sharma et al. Speech Emotion Recognition System using SVD algorithm with HMM Model
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant