CN109712609A - Method for solving imbalance of keyword recognition samples - Google Patents
- Publication number: CN109712609A (application CN201910014005.2A)
- Authority: CN (China)
- Prior art keywords: keyword, voice, training, different, keywords
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Electrically Operated Instructional Devices (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for solving sample imbalance in keyword recognition, comprising: 1) changing the fundamental frequency (pitch) of speech while keeping its semantics unchanged: speech containing a keyword is converted using voice-conversion techniques to obtain multiple speech samples of the same semantics from speakers of different genders and ages; 2) applying adaptive weighting to the loss function of the neural network model: when weighted cross entropy is used, the accuracies on keyword material and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of round k is automatically adjusted according to their difference; 3) adaptive frame count: when a DNN is used as the training model, different keywords are given different detection frame counts L_i according to keyword length. The present invention effectively alleviates poor training results, or failure to train at all, caused by imbalanced or insufficient data, while also accelerating training and improving training results to some extent.
Description
Technical field
The present invention relates to the field of continuous-speech keyword recognition, and in particular to a method for solving sample imbalance in keyword recognition.
Background technique
Among the most important technological advances of recent years, speech recognition undoubtedly ranks first, and it also has a very large market in practical applications. Keyword recognition, also commonly called keyword spotting or word verification, has in recent years become a key and highly regarded area of speech recognition research. Put simply, it is the task of recognizing sensitive or given keywords in a speaker's continuous speech.
Because the vocabulary of any language is extremely large, it is almost impossible to design a speech recognition system that covers all words; yet for some applications it is important only to recognize a small number of keywords in the speech signal, and this has great application prospects and a large market. Continuous speech recognition, by contrast, converts a continuous speech stream into a series of continuous text, namely exactly what the speaker said. Keyword recognition is more flexible: it is accurate, computationally light, and fairly elastic. It does not require the entire speech stream to be recognized; the speaker may talk freely, and the system only has to detect certain keywords while ignoring other words, generally independently of the speaker. Even when the speaker is uncooperative or works in a noisy environment, a keyword spotting system can still achieve good results, whereas continuous speech recognition places certain demands on the speaker's attitude. Keyword spotting therefore has very important application prospects in surveillance.
Countries around the world pay increasing attention to information security, especially to the monitoring of the internet and communication devices; driven by counter-terrorism needs, the United States represents the world's highest level in voice surveillance. The task is to recognize given keywords, typically sensitive words that may occur repeatedly, in conversations between two or more people. Because of the randomness of speakers' timing and the sheer amount of recorded data, traditional manual monitoring is clearly inadequate, and replacing manual labor with machines is necessary.
Summary of the invention
In order to overcome the sample-imbalance problem present in neural-network-based voice keyword recognition in the prior art, the present invention provides a method for solving sample imbalance in keyword recognition.
The present invention adopts the following technical scheme:
A method for solving sample imbalance in keyword recognition, comprising:
1) Change the fundamental frequency of the speech while keeping its semantics unchanged: the speech containing a keyword is converted using voice-conversion techniques to obtain multiple speech samples of the same semantics from speakers of different genders and ages;
2) Apply adaptive weighting to the loss function in the neural network model: when weighted cross entropy is used, the accuracies on keyword material and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of round k is automatically adjusted according to their difference;
3) Adaptive frame count: when a DNN is used as the training model, different keywords are given different detection frame counts L_i according to keyword length.
Further, in 1), during speech augmentation of the speech containing keywords, the semantic information of the speech is kept unchanged while other information in the speech, such as gender and age, is changed to increase its diversity. Gender, age and similar information in speech is mainly related to the fundamental frequency. The soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies. The average fundamental frequency of normal speech lies in the range (136, 332) Hz; this range is divided into N segments, the segment containing the current audio's average fundamental frequency is computed, and voice conversion is used to convert the speech into the other fundamental-frequency segments, thereby obtaining N utterances with mutually different average fundamental frequencies. This changes non-semantic information such as gender and age in the speech and increases its diversity.
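The N-way split of the (136, 332) Hz range described above can be sketched as follows; the function names and the equal-width layout of the segments are illustrative assumptions (the description does not fix how the segments are spaced within the range):

```python
def f0_segment(avg_f0, f0_min=136.0, f0_max=332.0, n_segments=10):
    """Index (0-based) of the segment of the normal-voice average F0
    range (136, 332) Hz that a given average fundamental frequency
    falls into; out-of-range values are clamped to the range edges."""
    avg_f0 = min(max(avg_f0, f0_min), f0_max)
    width = (f0_max - f0_min) / n_segments
    return min(int((avg_f0 - f0_min) / width), n_segments - 1)

def segment_center(idx, f0_min=136.0, f0_max=332.0, n_segments=10):
    """Midpoint of segment `idx`, usable as a target average F0 when
    converting an utterance into that segment."""
    width = (f0_max - f0_min) / n_segments
    return f0_min + (idx + 0.5) * width
```

Converting one utterance into every other segment then yields the N same-semantics, different-F0 samples the augmentation aims for.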
SoundTouch is an open-source audio processing library written in C++. It can change the tempo, pitch (Pitch) and playback rate (Playback Rates) of an audio file or a real-time audio stream, and it also supports estimating a track's stable beat rate (BPM rate). Its three effects are mutually independent and can also be used together; they are implemented by combining sample-rate conversion and time stretching. SoundTouch processes PCM (Pulse Code Modulation) data, which is the main format in .wav files, so SoundTouch examples all process wav audio; formats such as mp3 are compressed and must first be converted to PCM before SoundTouch can process them. When the soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies, the track's stable beat rate must be estimated, and the audio's tempo and pitch are adjusted according to it so as to change the fundamental frequency and related information of the voice, thereby generating synonymous speech carrying different gender and age information.
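As a hedged sketch of driving such a conversion through SoundTouch's command-line front end `soundstretch` (whose `-pitch` switch shifts pitch in semitones while leaving tempo, and hence duration and semantics, unchanged): the file names, the 220 Hz source F0 and the segment-midpoint targets are illustrative assumptions, and the commands are only constructed here, not executed.

```python
import math

def pitch_shift_semitones(src_f0, dst_f0):
    """Semitone shift that moves an average F0 of src_f0 to dst_f0."""
    return 12.0 * math.log2(dst_f0 / src_f0)

def soundstretch_cmd(in_wav, out_wav, semitones):
    """Command line for SoundTouch's soundstretch tool; -pitch shifts
    the pitch in semitones without changing the tempo."""
    return ["soundstretch", in_wav, out_wav, f"-pitch={semitones:.2f}"]

# One command per target segment midpoint of a 10-way split of 136-332 Hz.
targets = [136.0 + (i + 0.5) * (332.0 - 136.0) / 10 for i in range(10)]
cmds = [soundstretch_cmd("kw.wav", f"kw_seg{i}.wav",
                         pitch_shift_semitones(220.0, t))
        for i, t in enumerate(targets)]
```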
Further, in 2), the deep neural network model is a standard feed-forward fully connected neural network with k hidden layers of n hidden nodes each; each node computes a nonlinear function of the weighted sum of the previous layer's outputs. The last layer has a softmax which outputs a posterior estimate for each output label. The hidden layers use the ReLU activation function, which gives better results on the development set while reducing computation. The size of the network also depends on the number of output labels; our labels can be entire keywords or sub-word units of keywords.
Assume p_ij is the posterior probability in the neural network of the i-th label for the j-th frame x_j, where i takes values in 0, 1, ..., n-1, n is the total number of labels, and 0 is the non-keyword label. The weights and biases θ of the deep neural network are estimated by maximizing the cross-entropy training criterion over the labeled training data {x_j, i_j}_j.
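The per-frame posterior p_ij and the cross-entropy criterion can be written out in a minimal pure-Python sketch (illustrative only; the actual training runs in TensorFlow):

```python
import math

def softmax(logits):
    """Posterior estimate over the n labels for one frame (label 0 is
    the non-keyword filler label, 1..n-1 the keyword labels)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(frames_logits, frame_labels):
    """Average negative log posterior of the correct label; training
    maximizes the log posterior, i.e. minimizes this quantity."""
    total = 0.0
    for logits, label in zip(frames_logits, frame_labels):
        total -= math.log(softmax(logits)[label])
    return total / len(frame_labels)
```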
Optimization is performed with TensorFlow, a software framework supporting distributed computation of deep neural networks across multiple CPUs, using asynchronous stochastic gradient descent with exponential decay of the learning rate to maximize training speed. Transfer learning refers to initializing (part of) the network parameters from an existing network rather than training from scratch. Here, a deep neural network is used for speech recognition, and the hidden layers of the network are initialized with a suitable topology; all layers are updated during training. Transfer learning has a potential advantage: by exploiting a larger amount of data and avoiding poor local optima, the hidden layers can learn better, richer feature representations. That is the case in this experiment.
Adaptive weighting in the neural network model: when weighted cross entropy (weight cross entropy) is used, the initial weighting coefficient W_c is initialized according to the ratio of keywords to non-keywords in the corpus. In each subsequent training round, the accuracies A_1 and A_2 on the keyword material and non-keyword material are computed separately, and the weighting coefficient W_k for round k is automatically adjusted according to their difference C = A_1 - A_2, where α is the adjustment coefficient:
W_k = W_c * (1 + C * α)
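A minimal sketch of this per-round update; the sample numbers in the usage note below are hypothetical:

```python
def adaptive_weight(w_c, a1, a2, alpha=0.5):
    """Per-round weighting coefficient for the weighted cross entropy:
    W_k = W_c * (1 + C * alpha), where C = A1 - A2 is the gap between
    the keyword-corpus accuracy A1 and the non-keyword accuracy A2."""
    return w_c * (1.0 + (a1 - a2) * alpha)
```

For instance, with a hypothetical initial W_c = 5 (non-keyword frames five times as frequent as keyword frames), accuracies A_1 = 0.6 and A_2 = 0.8 give C = -0.2 and W_k = 4.5 at α = 0.5.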
Further, in 3), a simple but effective method is proposed: the DNN produces frame-level label posteriors, which are combined into a keyword confidence; if the confidence exceeds some predefined threshold, a detection decision can be made. Confidence computation is first described for a single keyword; it can easily be modified to detect multiple keywords simultaneously. When a DNN is used as the training model, keywords differ in length and therefore in speech duration, so the minimum detection frame count L_i of the i-th keyword can be determined from the minimum duration of that keyword's speech in the training set:
L_i = T_i^min / S
where T_i^min denotes the minimum duration of the i-th keyword's speech in the training set and S denotes the frame shift used when extracting MFCC speech features.
Beneficial effects of the present invention:
Compared with conventional deep neural network methods, the present invention adapts well to the data and uses it at a higher rate, effectively alleviating poor training results, or failure to train, caused by imbalanced or insufficient data, while also accelerating training and improving training results to some extent.
Brief description of the drawings
Fig. 1 is a flow chart of the weighting applied to the loss function in the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the examples and drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Figure 1, the present invention uses the AiShell-1 corpus as the training set and real speech recorded by students of this school as the test set; keyword spotting is performed on four keywords: China, company, reporter, investment. The present invention provides a method for solving sample imbalance in keyword recognition, specifically as follows:
1) Change the fundamental frequency of the speech while keeping its semantics unchanged: the speech containing a keyword is converted using voice-conversion techniques to obtain multiple speech samples of the same semantics from speakers of different genders and ages;
2) Apply adaptive weighting to the loss function in the neural network model: when weighted cross entropy is used, the accuracies on keyword material and non-keyword material are computed separately in each training round, and the weighting coefficient W_k of round k is automatically adjusted according to their difference;
3) Adaptive frame count: when a DNN is used as the training model, different keywords are given different detection frame counts L_i according to keyword length.
Further, in 1), during speech augmentation of the speech containing keywords, the semantic information of the speech is kept unchanged while other information in the speech, such as gender and age, is changed to increase its diversity. Such information is mainly related to the fundamental frequency of the speech. The soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies. The average fundamental frequency of normal speech lies in the range (136, 332) Hz; this range is divided into N segments, the segment containing the current audio's fundamental frequency is computed, and voice conversion is used to convert the speech into the other fundamental-frequency segments, thereby obtaining N utterances with mutually different fundamental frequencies. This changes non-semantic information such as gender and age in the speech and increases its diversity. In this experiment the average fundamental frequency range is divided into 10 segments.
SoundTouch is an open-source audio processing library written in C++. It can change the tempo (Tempo), pitch (Pitch) and playback rate (Playback Rates) of an audio file or a real-time audio stream, and it also supports estimating a track's stable beat rate (BPM rate). Its three effects are mutually independent and can also be used together; they are implemented by combining sample-rate conversion and time stretching. SoundTouch processes PCM (Pulse Code Modulation) data, which is the main format in wav files, so SoundTouch examples all process wav audio; formats such as mp3 are compressed and must first be converted to PCM before SoundTouch can process them. When the soundtouch library is used to perform voice conversion on speech H_m of different fundamental frequencies, the track's stable beat rate must be estimated, and the audio's tempo and pitch are adjusted according to it so as to change the fundamental frequency and related information of the voice, thereby generating synonymous speech carrying different gender and age information.
In the present invention, 10 utterances with mutually different fundamental frequencies are generated within the normal-voice fundamental-frequency range; SoundTouch's C++ source code is used and rewritten so that speech in different fundamental-frequency segments can be mass-produced.
Further, in 2), the deep neural network model is a standard feed-forward fully connected neural network with 3 hidden layers of 128 hidden nodes each; each node computes a nonlinear function of the weighted sum of the previous layer's outputs. The last layer has a softmax which outputs a posterior estimate for each output label. The hidden layers use the ReLU activation function, which gives better results on the development set while reducing computation. The size of the network also depends on the number of output labels; our labels can be entire keywords or sub-word units of keywords.
Assume p_ij is the posterior probability in the neural network of the i-th label for the j-th frame x_j, where i takes values in 0, 1, ..., n-1, n is the total number of labels, and 0 is the non-keyword label. The weights and biases θ of the deep neural network are estimated by maximizing the cross-entropy training criterion over the labeled training data {x_j, i_j}_j.
Optimization is performed with TensorFlow, a software framework supporting distributed computation of deep neural networks across multiple CPUs, using asynchronous stochastic gradient descent with exponential decay of the learning rate to maximize training speed. Transfer learning refers to initializing (part of) the network parameters from an existing network rather than training from scratch. The present invention uses a deep neural network for speech recognition, and the hidden layers of the network are initialized with a suitable topology; all layers are updated during training. Transfer learning has a potential advantage: by exploiting a larger amount of data and avoiding poor local optima, the hidden layers can learn better, richer feature representations. That is the case in this experiment. In the network of this embodiment, we use a DNN with 3 hidden layers and 128 hidden nodes per layer, and the loss function uses weighted cross entropy (weight cross entropy) to suit the problem of imbalanced positive and negative samples. Labels use entire keywords; for example, the speech frames of the entire word "market" form one label.
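A small sketch of this whole-word labeling scheme, assuming label 0 is the filler class as in the description and ids 1-4 cover the embodiment's four keywords (the exact id assignment is an illustrative choice):

```python
# Label inventory for whole-keyword labels: 0 is the non-keyword
# (filler) label; ids 1..4 cover the four keywords of the embodiment.
KEYWORDS = ["China", "company", "reporter", "investment"]
LABELS = {"<filler>": 0, **{kw: i + 1 for i, kw in enumerate(KEYWORDS)}}

def frame_labels(word, n_frames):
    """All frames of one whole-word utterance share the word's label;
    words outside the keyword set fall back to the filler label 0."""
    return [LABELS.get(word, 0)] * n_frames
```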
Adaptive weighting in the neural network model: when weighted cross entropy (weight cross entropy) is used, the initial weighting coefficient W_c is initialized according to the ratio of keywords to non-keywords in the corpus. In each subsequent training round, the accuracies A_1 and A_2 on the keyword material and non-keyword material are computed separately, and the weighting coefficient W_k for round k is automatically adjusted according to their difference C = A_1 - A_2, where α is the adjustment coefficient, taken as 0.5 in this experiment:
W_k = W_c * (1 + C * α)
Further, in 3), a simple but effective method is proposed: the DNN produces frame-level label posteriors, which are combined into a keyword confidence; if the confidence exceeds some predefined threshold, a detection decision can be made. Confidence computation is first described for a single keyword; it can easily be modified to detect multiple keywords simultaneously. When a DNN is used as the training model, keywords differ in length and therefore in speech duration, so the minimum detection frame count L_i of the i-th keyword can be determined from the minimum duration of that keyword's speech in the training set:
L_i = T_i^min / S
where T_i^min denotes the minimum duration of the i-th keyword's speech in the training set and S denotes the frame shift used when extracting MFCC speech features.
The experiments of the present invention use the AiShell-1 corpus as the training set and real speech recorded by students of this school as the test set, with the keywords: China, company, reporter, investment. Without the method of the present invention, the average recall of the four keywords under the DNN neural network model is 78.6%; after applying the present method for solving keyword-recognition sample imbalance, the average recall on the same dataset is 81.92%, an improvement of 3.32 percentage points.
Specific experimental results are as follows (keyword indices: 0: China, 1: company, 2: reporter, 3: investment):

| | 0 | 1 | 2 | 3 | Average false detection rate |
|---|---|---|---|---|---|
| Normal DNN false detection rate (%) | 7.56 | 6.92 | 8.33 | 8.78 | 7.89 |
| Present method false detection rate (%) | 5.31 | 4.97 | 6.12 | 5.16 | 5.39 |

| | 0 | 1 | 2 | 3 | Average miss rate |
|---|---|---|---|---|---|
| Normal DNN miss rate (%) | 21.83 | 22.32 | 19.64 | 21.81 | 21.40 |
| Present method miss rate (%) | 18.84 | 17.12 | 16.62 | 19.75 | 18.08 |
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited by it; any other changes, modifications, substitutions, combinations and simplifications made without departing from the spirit and principles of the present invention shall be equivalent substitutions and are included within the protection scope of the present invention.
Claims (4)
1. A method for solving sample imbalance in keyword recognition, characterized by comprising:
changing the fundamental frequency of speech containing a keyword while keeping its semantics unchanged, converting the speech containing the keyword using voice-conversion techniques, and obtaining multiple speech samples of the same semantics from speakers of different genders and ages;
performing adaptive weighting of the loss function in the neural network model according to the multiple speech samples: when weighted cross entropy is used, computing separately in each training round the accuracy on keyword material and non-keyword material, and automatically adjusting the weighting coefficient W_k of round k according to their difference;
adaptive frame count: when a DNN is used as the training model, using different detection frame counts L_i for different keywords according to keyword length.
2. The method according to claim 1, characterized in that the change of fundamental frequency uses the soundtouch library to perform voice conversion on speech H_m of different fundamental frequencies; the average fundamental frequency range of normal speech is (136, 332) Hz, this range is divided into N segments, the segment containing the current audio's average fundamental frequency is computed, and voice conversion is used to convert the speech into the other fundamental-frequency segments, thereby obtaining N utterances with mutually different average fundamental frequencies.
3. The method according to claim 1, characterized in that adaptive weighting of the neural network loss function is performed as follows: when weighted cross entropy is used, the initial weighting coefficient W_c is initialized according to the ratio of keywords to non-keywords in the corpus; in each subsequent training round, the recognition accuracies A_1 and A_2 of keyword material and non-keyword material are computed separately, and the weighting coefficient W_k of round k is automatically adjusted during training according to their difference C = A_1 - A_2 by the following formula, where α is the adjustment coefficient:
W_k = W_c * (1 + C * α).
4. The method according to claim 1, characterized in that when a DNN is used as the training model, keywords differ in length and hence in speech duration; the minimum detection frame count L_i of a keyword can be determined from the minimum duration of the i-th keyword's speech in the training set:
L_i = T_i^min / S
where T_i^min denotes the minimum duration of the i-th keyword's speech in the training set and S denotes the frame shift used when extracting MFCC speech features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910014005.2A CN109712609B (en) | 2019-01-08 | 2019-01-08 | Method for solving imbalance of keyword recognition samples |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109712609A true CN109712609A (en) | 2019-05-03 |
CN109712609B CN109712609B (en) | 2021-03-30 |
Family
ID=66260895
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910014005.2A Active CN109712609B (en) | 2019-01-08 | 2019-01-08 | Method for solving imbalance of keyword recognition samples |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109712609B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188200A (en) * | 2019-05-27 | 2019-08-30 | 哈尔滨工程大学 | A kind of depth microblog emotional analysis method using social context feature |
CN110827791A (en) * | 2019-09-09 | 2020-02-21 | 西北大学 | Edge-device-oriented speech recognition-synthesis combined modeling method |
CN111508475A (en) * | 2020-04-16 | 2020-08-07 | 五邑大学 | Robot awakening voice keyword recognition method and device and storage medium |
CN111554273A (en) * | 2020-04-28 | 2020-08-18 | 华南理工大学 | Method for selecting amplified corpora in voice keyword recognition |
CN113345426A (en) * | 2021-06-02 | 2021-09-03 | 云知声智能科技股份有限公司 | Voice intention recognition method and device and readable storage medium |
CN113823326A (en) * | 2021-08-16 | 2021-12-21 | 华南理工大学 | Method for using training sample of efficient voice keyword detector |
CN114818685A (en) * | 2022-04-21 | 2022-07-29 | 平安科技(深圳)有限公司 | Keyword extraction method and device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060026147A1 (en) * | 2004-07-30 | 2006-02-02 | Cone Julian M | Adaptive search engine |
US9646634B2 (en) * | 2014-09-30 | 2017-05-09 | Google Inc. | Low-rank hidden input layer for speech recognition neural network |
CN108538285A (en) * | 2018-03-05 | 2018-09-14 | 清华大学 | A kind of various keyword detection method based on multitask neural network |
CN108735202A (en) * | 2017-03-13 | 2018-11-02 | 百度(美国)有限责任公司 | Convolution recurrent neural network for small occupancy resource keyword retrieval |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |