CN110246487A

CN110246487A - Optimization method and system for single pass speech recognition modeling

Info

Publication number: CN110246487A
Application number: CN201910511791.7A
Authority: CN
Inventors: 钱彦旻; 张王优; 常煊恺
Original assignee: Shanghai Jiaotong University; AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2019-06-13
Filing date: 2019-06-13
Publication date: 2019-09-17
Anticipated expiration: 2039-06-13
Also published as: CN110246487B

Abstract

The embodiment of the present invention provides a kind of optimization method for single pass speech recognition modeling.This method comprises: receiving the single voice for respectively having true tag vector, the phonetic feature extracted from each single voice is input to target tutor model, obtains the corresponding soft label vector of target of each single voice by more people's mixing voices；More people's mixing voices are input to end-to-end student model, determine output arrangement；Everyone output label vector, determines knowledge distillation loss and direct losses in the more people's mixing voices arranged according to determining output；When the joint error determined according to knowledge distillation loss and direct losses is not converged, end-to-end student model is optimized according to joint error.The embodiment of the present invention also provides a kind of optimization system for single pass speech recognition modeling.The embodiment of the present invention can be easier to learn parameter good out, while model is more simplified, and the student model that better parameter trains it has better performance.

Description

Optimization method and system for single pass speech recognition modeling

Technical field

The present invention relates to field of speech recognition more particularly to a kind of optimization methods for single pass speech recognition modeling And system.

Background technique

With the development of intelligent sound, more and more equipment all have the function of speech recognition, but due to consideration that not With the usage scenario of equipment, in device fabrication, some equipment only carries single microphone, and some equipment is equipped with multiple Mikes Wind, that is, so-called single channel, multichannel.Due to there was only single microphone, this kind of equipment is similar to dinner party class in processing Type, have multiple people and meanwhile speech mix voice dialogue when, the performance of identification is poor.It would generally use thus: base In the knowledge distillating method of the more speaker's speech recognitions of single channel of two-way shot and long term memory network-Recognition with Recurrent Neural Network, or end It is trained to the end more speaker's speech recognition systems of single channel.

In realizing process of the present invention, at least there are the following problems in the related technology for inventor's discovery:

The knowledge of the more speaker's speech recognitions of single channel based on two-way shot and long term memory network-Recognition with Recurrent Neural Network is distilled Method: the model of use belongs to conventional method, increasingly complex compared to end to end model, and training process is cumbersome；And end-to-end list The more speaker's speech recognition systems in channel: the voice signal spoken due to existing simultaneously more people, the model are only capable of utilizing creolized language The information of sound lacks the voice messaging of single speaker in training, it is more difficult to which the effect trained is known with words human speech sound is concentrated the talk of Other system is larger compared to performance gap.

Summary of the invention

Complex in order at least solve conventional model in the prior art, training process is cumbersome, and training effect is bad, performance Poor problem.

In a first aspect, the embodiment of the present invention provides a kind of optimization method for single pass speech recognition modeling, comprising:

Receive the single voice for respectively having true tag vector, and more people's creolized languages by each single speech synthesis The phonetic feature extracted from each single voice is input to target tutor model respectively, obtains each single voice by sound The corresponding soft label vector of target；

More people's mixing voices are input to end-to-end student model, export in more people's mixing voices everyone Output label vector, by permutation invariance method (PIT) by everyone output label vector in more people's mixing voices It is matched with the true tag vector of each single voice, determines in more people's mixing voices everyone output label The output of vector arranges；

According to everyone output label vector in more people's mixing voices by determining output arrangement after pairing, determine with The knowledge distillation loss of the soft label vector of each target, and the direct damage with each single voice true tag vector It loses；

When the joint error determined according to the knowledge distillation loss and direct losses is not converged, missed according to the joint Difference carries out backpropagation to the end-to-end student model, to update the end-to-end student model, until the joint error Convergence is used for single pass speech recognition student model after determining optimization.

Second aspect, the embodiment of the present invention provide a kind of optimization system for single pass speech recognition modeling, comprising:

The soft label of target determines program module, for receiving the single voice for respectively having true tag vector, and by institute More people's mixing voices of each single speech synthesis are stated, the phonetic feature extracted from each single voice is input to target respectively Tutor model obtains the corresponding soft label vector of target of each single voice；

Output, which arranges, determines program module, for more people's mixing voices to be input to end-to-end student model, exports Everyone output label vector in more people's mixing voices is mixed more people by permutation invariance method (PIT) Everyone output label vector and the true tag vector of each single voice match in voice, determine more people The output arrangement of everyone output label vector in mixing voice；

Lose determine program module, for according to by pairing after determine output arrangement more people's mixing voices in everyone Output label vector, the determining knowledge distillation loss with each soft label vector of target, and with each single voice The direct losses of true tag vector；

Optimize program module, for when not converged according to the determining joint error of the knowledge distillation loss and direct losses When, backpropagation is carried out to the end-to-end student model according to the joint error, to update the end-to-end student model, Until the joint error restrains, single pass speech recognition student model is used for after determining optimization.

The third aspect provides a kind of electronic equipment comprising: at least one processor, and with described at least one Manage the memory of device communication connection, wherein the memory is stored with the instruction that can be executed by least one described processor, institute It states instruction to be executed by least one described processor, so that at least one described processor is able to carry out any embodiment of the present invention The optimization method for single pass speech recognition modeling the step of.

Fourth aspect, the embodiment of the present invention provide a kind of storage medium, are stored thereon with computer program, and feature exists In the optimization side for single pass speech recognition modeling of realization any embodiment of the present invention when the program is executed by processor The step of method.

The beneficial effect of the embodiment of the present invention is: speaking the defeated of trained tutor model on corpus using at one Label is trained as target out, and incorporates the voice messaging of single speaker in training, this soft label is capable of providing more Information, enable student model to be easier to learn parameter good out, while model is more simplified, better parameter makes its instruction Experienced student model has better performance.The strategy for additionally using course learning, by by training data according to speaking The signal-to-noise ratio (SNR) of people is ranked up, and is made better use of the information in data, is further improved model performance.

Detailed description of the invention

In order to more clearly explain the embodiment of the invention or the technical proposal in the existing technology, to embodiment or will show below There is attached drawing needed in technical description to be briefly described, it should be apparent that, the accompanying drawings in the following description is this hair Bright some embodiments for those of ordinary skill in the art without creative efforts, can be with root Other attached drawings are obtained according to these attached drawings.

Fig. 1 is a kind of process for optimization method for single pass speech recognition modeling that one embodiment of the invention provides Figure；

Fig. 2 be one embodiment of the invention provide a kind of optimization method for single pass speech recognition modeling based on The more speaker's speech recognition modeling configuration diagrams of end-to-end single channel of knowledge distillation；

Fig. 3 is that a kind of end for optimization method for single pass speech recognition modeling that one embodiment of the invention provides is arrived More speakers are held to combine the comparison list datagram of coder-decoder system of the CTC/ based on attention；

Fig. 4 is a kind of difference for optimization method for single pass speech recognition modeling that one embodiment of the invention provides Performance (average CER&WER) the table data figure of course learning strategy in 2 people's mixed type WSJ0 corpus test data sets；

Fig. 5 is a kind of structure for optimization system for single pass speech recognition modeling that one embodiment of the invention provides Schematic diagram.

Specific embodiment

In order to make the object, technical scheme and advantages of the embodiment of the invention clearer, below in conjunction with the embodiment of the present invention In attached drawing, technical scheme in the embodiment of the invention is clearly and completely described, it is clear that described embodiment is A part of the embodiment of the present invention, instead of all the embodiments.Based on the embodiments of the present invention, those of ordinary skill in the art Every other embodiment obtained without creative efforts, shall fall within the protection scope of the present invention.

A kind of optimization method for single pass speech recognition modeling provided as shown in Figure 1 for one embodiment of the invention Flow chart, include the following steps:

S11: the single voice for respectively having true tag vector is received, and is mixed by more people of each single speech synthesis Close voice, target tutor model will be input to from the phonetic feature that each single voice extracts respectively, obtain it is described it is each one The soft label vector of the corresponding target of voice；

S12: being input to end-to-end student model for more people's mixing voices, exports each in more people's mixing voices The output label vector of people, by permutation invariance method (PIT) by everyone output label in more people's mixing voices The true tag vector of vector and each single voice is matched, and determines in more people's mixing voices everyone output The output of label vector arranges；

S13: according to everyone output label vector in more people's mixing voices by determining output arrangement after pairing, really The fixed knowledge distillation loss with each soft label vector of target, and it is direct with each single voice true tag vector Loss；

S14: when the joint error determined according to the knowledge distillation loss and direct losses is not converged, according to described It closes error and backpropagation is carried out to the end-to-end student model, to update the end-to-end student model, until the joint Error convergence is used for single pass speech recognition student model after determining optimization.

In the present embodiment, existing method is often used without tutor model, usually only uses student model in training Output label vector and true tag vector carry out error calculation；And tutor model is introduced in this method.Tutor model is logical It is often used in knowledge distillation, in use, usually that ability is the powerful, knowledge migration of tutor model that does well is to more Compact student model, although ability cannot be with tutor model phase when with monitor mode directly trained student model Match, but distilled by knowledge, the predictive ability of student model can be closer to the predictive ability of tutor model.

For step S11, in order to optimize the recognition effect of student's speech recognition modeling, it is first determined the target to be learnt religion Teacher's model, wherein the target tutor model can be preparatory trained tutor model.In training, it is also necessary to certain instruction Practice data, comprising: more people's creolized languages of some single voices with true tag vector and these single speech syntheses Sound.Wherein, label can be understood as the corresponding text of voice, but map by dictionary, facilitate computer disposal.Respectively will Single voice is input in target tutor model, in turn, obtains the corresponding soft label vector of target, wherein soft label vector packet The supplemental information of the audio hiding containing overlapping, and the understanding to single speaker model.

For step S12, the more people's mixing voices determined in step S11 are input to the end-to-end student's mould for needing to learn Type exports in more people's mixing voices everyone output label vector, is mixed more people by permutation invariance method (PIT) It closes everyone output label vector and the true tag vector of each single voice in voice to match, wherein displacement Invariance method (PIT) is a kind of algorithm for being used to solve multiple prediction labels (output label) and the pairing of multiple true tags, In this example, model can export label corresponding to multiple speaker's voices when handling mixing voice respectively, but in training When be required to calculate the error of each output label with corresponding true tag, such as 2 speaker's voices of output are corresponding Label, and be not aware that 2 output labels of model it is actually corresponding be which speaker (such as two labels predicted Vector is P1 and P2 respectively, and true tag is Y1 and Y2, it is not known that should P1-Y1, P2-Y2 or P1-Y2, P2-Y1), therefore It helps to match using permutation invariance method.In turn, everyone output label vector in more people's mixing voices is determined Output arrangement.

It is each in more people's mixing voices by the output arrangement determined after matching in step s 12 for step S13 The output label vector of people, it is determining true with the knowledge distillation loss of the soft label vector of each target and each single voice respectively The direct losses of label vector not only consider in the prior art by using true tag vector institute in optimization process The direct losses of generation also further consider the knowledge distillation loss that the tutor model of knowledge distillation can generate, multidimensional The loss that can be generated with respect to various aspects of degree.

For step S14, the joint error that determining knowledge distillation loss and direct losses determine in step s 13 is not received It holds back, after calculating joint error, is fed back error back propagation by back-propagation algorithm (general algorithm in machine learning) Out in each layer network of front, for updating network parameter, the process of this undated parameter is exactly to train, and then determine optimization Afterwards be used for single pass speech recognition student model.

It can be seen that by the embodiment and make using in one output of trained tutor model on corpus of speaking For target training label, this soft label is capable of providing more information, enable student model be easier to learn it is good out Parameter, while model is more simplified, the student model that better parameter trains it has better performance.

As an implementation, in the present embodiment, described that more people's mixing voices are input to end-to-end student Model includes:

It is special by voice of the neural network being trained in the end-to-end student model to more people's mixing voices Sign carries out Projection Character, marks off in more people's mixing voices everyone phonetic feature；

Based on everyone phonetic feature, determined by the encoder in the end-to-end student model corresponding Connect timing classification (CTC) score；

It is by the decoder in the end-to-end student model, the minimum score of connection timing classification (CTC) is corresponding Feature permutation, be converted to corresponding output label vector；Wherein, the label vector is mapped by dictionary, available right The text sequence answered.

In the present embodiment, it in the optimization training stage, based on everyone phonetic feature, is arrived by the end Encoder in the student model of end determines that feature permutation corresponding with the tutor model combines, and then determines each feature row Column combine corresponding connection timing classification (CTC) score set, then determine each feature permutation by permutation invariance method The smallest feature permutation of total score in combination.By the decoder in end-to-end student model, the connection timing is classified (CTC) the corresponding feature permutation of minimum total score, is converted to corresponding output label vector.

And in the recognition stage, tutor model is not needed, does not need to carry out permutation invariance training yet, directly according to decoding The result sequence that device determines arranges, and determines corresponding decoding result by calculated score.

It can be seen that by the embodiment and the corresponding feature row of minimum score determined by permutation invariance training Column, the error in identification can be preferably minimized, and improve recognition effect.

As an implementation, in the present embodiment, mark off in more people's mixing voices everyone described After phonetic feature, the method also includes:

By the attention power module in the end-to-end student model, to everyone voice in more people's mixing voices Corresponding attention score is determined in the further feature extraction of feature so that more people's mixing voices with described one is defeated Outgoing label vector time is to trimly.

In the present embodiment, the calculating of attention score be permutation invariance after training, first, in accordance with displacement not The output arrangement that denaturation training obtains, accordingly resets everyone intermediate representation of encoder output, is then calculated Attention score between the intermediate representation of corresponding each speaker and the intermediate representation of corresponding tutor model.

It can be seen that by the embodiment and pay attention to power module to solve the text of the output in end-to-end speech identifying system The problem of this is misaligned in time with input audio, is further extracted phonetic feature, to improve speech recognition modeling Recognition effect.

As an implementation, in the present embodiment, it is described determine corresponding attention score after, the side Method further include:

By by the corresponding connection timing of each feature permutation classify (CTC) score with the attention score according to pre- If recognition mode weighting, determine the Combination score of each feature permutation；

By the feature permutation of score minimum in Combination score, corresponding output label vector is converted to.

In the present embodiment, different recognition modes has different training directions, and corresponding to different directions Different weighting specific gravity.Different weighting accountings is determined according to the pre-set identification profiles determined, in turn, by described CTC score and attention score is weighted in weighting accounting.For example, the corresponding CTC score of a certain feature permutation is 2.34, attention score is 3.22, when weighting accounting respectively accounts for half, determines that this feature arrangement is scored at 2.78.It is each determining After the corresponding final score of feature permutation, it is the smallest to choose score, that is, the smallest feature permutation of error, is converted into pair The output label vector answered.

The various need that meet optimization speech recognition modeling are can be seen that by the embodiment by adjusting weighting accounting It asks, further improves the recognition effect of speech recognition modeling.

As an implementation, in the present embodiment, described by being trained in the end-to-end student model Neural network Projection Character is carried out to the phonetic features of more people's mixing voices after, the method also includes:

According to the end-to-end newly added speaker adaptation module of student model, obtain in more people's mixing voices Everyone relevant information, after everyone phonetic feature described in determination, everyone context described in additional determination becomes Amount.

In the present embodiment, to increase sequence before the corresponding encoder of each speaker in end-to-end student model general Want network, input be mixing voice encoder output, output arrives the identical dimension with input by projective transformation, then Multiply in original input, as new feature, be then connected in everyone corresponding encoder below, specifically as shown in Fig. 2, example Such as, the volume in the more speaker's speech recognition modeling configuration diagrams of end-to-end single channel of knowledge based shown in Fig. 2 distillation Code device 1 and encoder 2, are separately added into a mutually isostructural sequence summary network.

By the embodiment can be seen that due to training process with it is previously completely the same, and the module being newly added is being instructed Information relevant to everyone is arrived in study during practicing, so as to export the context variable comprising everyone information, in turn More data improve the recognition effect of speech recognition modeling.

As an implementation, in the present embodiment, the single voice of true tag vector is respectively had in the reception Later, the method also includes:

By the voice and ambient noise of people in each single voice, the signal-to-noise ratio of each single voice is determined；

More people's mixing voice data are ranked up according to the size of the signal-to-noise ratio, so that in speech recognition mould Reach incremental in the optimization of type.

In the present embodiment, signal-to-noise ratio is the logarithm of the power ratio of the voice and ambient noise of people in voice signal Value, size represent the relative intensity of voice, and it is relatively weaker to be worth bigger expression noise, thus are easier to identify voice therein.

It can be seen that the noise ordering by training data according to speaker by the embodiment, be in order in training In the process simulate people learning process gradually increase difficulty that is, since simple sample, achieve the effect that it is incremental, from And realize better training effect.

As an implementation, the joint error packet determined according to the knowledge distillation loss and direct losses It includes:

According to knowledge distillation loss described in preset pattern pair and direct losses weighted sum, determine that joint is missed Difference.

In order to meet different identification demands, in the training process, can be arranged according to different use environments different Training mode.And then by different weighting ratios, training meets the speech recognition modeling of different demands.

It can be seen that by the embodiment by the way that different training modes is arranged, during training, according to difference Weighting ratio knowledge distillation loss and direct losses are carried out with the determination of joint error, meet the environment-identification of different demands. And then improve the recognition effect of speech recognition modeling.

Further specific implementation is carried out to above-mentioned steps, the end-to-end speech identification model that this method uses is joint base In the coder-decoder of CTC/ attention.The advantages of model, is that it uses CTC as secondary task and is based on note to enhance The alignment ability of the coder-decoder for power of anticipating.Later, by introducing separation phase in the encoder, have modified the model with Adapt to more speaker's scenes.Input voice mixing is clearly divided into multiple vector sequences, each vector sequence in the encoder first Column represent a source of people of speaking.These sequences are admitted to decoder with design conditions probability.

O indicates the input voice mixing of S speaker.Encoder is made of three phases:

Encoder-Mix (Encoder_Mix), encoder-SD (Encoder_SD) and encoder-Rec (Encoder_Rec)。

Encoder-Mix: O is encoded to intermediate representation H by hybrid coder, secondly, indicating H by S independent speakers Different (SD) coder processes；

There is encoder-SD S to export H^s(s=1 ..., S), each corresponds to the expression of a speaker.Last Stage, for each stream s (s=1 ..., S)；

Encoder-Rec is by characteristic sequence H^sBe converted to advanced expression G^S。

Encoder can be write according to the following steps:

H=EncOder_Mix(O)

H^s=Encoder_SD ^s(H), s=1 ..., S

G^s=Encoder_Rec(H^s), s=1 ..., S

CTC objective function connects after encoder, and there are two fold for advantage.The first is training sequence to sequence mould The encoder of type is as nonproductive task.Second is in the case where more speakers, and CTC objective function is for executing such as following public affairs It is trained shown in formula without displacement, also referred to as PIT (permutation invariant training, replace constant training).

Wherein Y^SIt is from expression G^SThe output sequence variable of calculating, π (s) are s-th yuan in the displacement π of { 1 ..., S } Element, R are the reference labels of S speaker.Later, the displacement with minimum CTC lossFor the decoder based on attention In reference marker, to reduce calculating cost.

Decoder network based on attention is for decoding each stream G^SAnd generate corresponding output label sequence Y^S.For Each pair of expression and reference label indexDecoding process is described as following equation:

WhereinIndicate the vector of context,It is the hidden state of decoder,It is ginseng Examine the nth elements in sequence label.During the training period, the reference label in RIt is used as the history that teacher forces, and It is not the y in above-mentioned formula_n-1.Patt equation defines the target labels sequence of the prediction of the coder-decoder based on attention Y={ y₁..., y_NProbability, wherein the y of n-th of time step_nProbability depends on previous sequences y₁: n-1.Final loss function Is defined as:

Wherein λ is interpolation factor, 0≤λ≤1.

The modification of decoder based on attention, referred to as speaker pay attention to parallel.Its motivation is to compensate the separation of encoder Ability improves the separating property of model.The idea is to pay attention to module using the individual for not cocurrent flow, is come by selectivity characrerisitic Filter noise information:

Compared with the hard goal used in the cross entropy criterion, claim that soft object can provide additional useful information, thus Obtain better performance.In more voice voice recognition tasks, this method can be used also to improve the solution based on attention The accuracy of code device network.In order to obtain soft label vector, the voice of parallel single speaker is by only including a speaker Voice training model.Soft label vector includes by the supplemental information of overlapping audio hiding and with more preferable modeling ability The insight of single speaker model.

Model framework is as shown in Figure 2.Mixing voice and corresponding individual voice are expressed as O and O^S(s=1 ..., S).Therefore, end-to-end tutor model is by source voice O^SAs the input for calculating teacher's log of each step in target sequence.And And corresponding output is expressed asIt is considered as the target distribution of student model.Therefore, teachers and students learn Loss function can be expressed as follows:

It calculates For the cross entropy between student model and the prediction of tutor model,It is determining most by CTC loss

Wherein, θ^TParameter in corresponding tutor model；θ corresponds to the learning parameter in student model；Q () and P () Respectively represent the distribution of each spokesman in Faculty and Students' model.

In method, the decoder based on attention is had modifiedLoss function.New model is based on cross entropy (CE) weighted sum of primary loss and the item based on KL- divergence loss, it may be assumed that

Wherein η is weight coefficient.

In pervious method, more speaker's ASR systems receive training end to end, ignore the similitude and difference of data It is anisotropic.However, claiming that the sequence of data has an impact to training process, referred to as course learning strategy in some researchs.Therefore, Wish to find one mode from data herein, so that training process is more stable and improves performance.One observation is that Signal-to-noise ratio (SNR) between overlapping voice has a significant impact to separating property.In the language with small SNR, different theorys are come from The voice of words person is distorted by similar energy.On the contrary, big SNR means voice in the uneven item with a leading voice It is distorted under part.

In the method, the SNR of concern overlapping voice is horizontal, is defined as between the source voice from two spokesmans Energy ratio.Also other factors can be used, but method is identical.When generating mixing voice, randomly choose energy ratio with Simulate physical condition.When SNR is larger, high-energy voice is apparent, but the lower phonetic representation of energy is bad.On the contrary, working as SNR When smaller, each language in mixing voice can be identified with similar performance, therefore model can learn from each speaker Knowledge.Training data is rearranged, specifically, when training starts, according to the ascending order repetitive exercise of the voice SNR of speaker The small-sized group concentrated, later, training are restored to the training set of random sequence.

In order to verify the effect of this method, the tool issued using MERL1 is manually generated to be based on Wall Street Journal (WSJ0) The double speaker's mixed signals of the monophonic of speech corpus.Training, exploitation and assessment data respectively from WSJ0 SI-84, The duration of Dev93 and Eval92, each data set are as follows: training 88.2 hours is developed 1.1 hours, is assessed 0.9 hour.

Input feature vector is 80 dimension log-Mel filter bank coefficients, and each frame has tonality feature, with their delta and The connection of delta coefficient.All features are extracted using Kaldi kit, and are normalized to zero-mean and unit variance.

In the method, different neural network models depth having the same and similar size, therefore their property It can be comparable.The CNN that encoder is driven by two VGG (Visual Geometry Group, visual geometric group) (Convolutional Neural Network, product neural network) module and three BLSTMP with projection (bidirectional long-short term memory recurrent neural networks with Projection, two-way shot and long term remember recurrent neural network) composition, and only there is decoder network one to have 300 lists The unidirectional shot and long term of member remembers (LSTM) layer.All-network is all based on ESPnet framework establishment.AdaDelta optimizer, ρ =0.95 and ∈=1e-8 is for training.During the training period, factor lambda is arranged to 0.2.

For teachers and students' training, the training of end-to-end tutor model is carried out to the original clean speech training dataset of WSJ0 first Instruction.In the method, WER (Word Error Rate, the Word Error Rate) difference of the tutor model of WSJ0 Dev93 and Eval92 For 8.0% and 2.1%.Then mixing voice data and corresponding personal voice data are input to teachers and students' module by us simultaneously In.When weight coefficient η is set as 0.5, optimum performance is obtained.

In decoding stage, joint CTC/ is noticed to the score of the word level RNN language model (RNNLM) of score and pre-training It combines, which has 1 layer of LSTM and 1000 cell, and transcription of the training in relation to WSJ0SI-84, with shallow-layer fusion Mode.Beam angle is set as 30, and the weight that the interpolation factor λ used during decoding is 0.3, RNNLM is 1.0.

Teachers and students' training is tested with course learning, and the baseline for having firstly evaluated mixing voice test data set in WSJ0 is end-to-end The performance that method and this method propose.As a result end-to-end more speakers shown in Fig. 3 combine coding of the CTC/ based on attention It is provided in the comparison list of device-decoder system.First method is the connection based on CTC/ concern for more speaker's voices Encoder-decoder network is closed, wherein concern-decoder module is shared between the expression of each speaker.Second method Single attention is expanded into speaker and pays attention to module parallel.Both methods is considered as baseline system.

Then gradually using teachers and students' study and course learning.By teachers and students' training, it can be observed that two baseline systems Performance is all improved in exploitation and assessment data set.Even bigger performance may be implemented in the parallel attention method of speaker It is promoted, it is opposite to the average WER of dev and eval data set respectively to reduce 7% and 6%.This demonstrate that the speaker side of concern parallel Method has the stronger ability for eliminating current individual speaker irrelevant information, and can use the concern output point of tutor model The knowledge of cloth preferably learns.Next, course learning strategy is applied to teacher learning frame, to further increase performance. Just as seen in Figure 3, combination teachers and students' training of proposition, spokesman pays attention to parallel and the end-to-end method of course learning is aobvious Improve the performance of bilingual mixing voice identification, the phase of WER and CER (Character Error Rate, word error rate) It is more than 15% to improvement rate.

For the influence that research course learning strategy shows model, this method explores different strategies.Use teachers and students Training and spokesman pay close attention to parallel is tested end to end model, uses two different strategies: according to the ascending order of SNR Training data is ranked up with the descending of SNR.Experimental result such as Fig. 4 difference course learning strategy is in 2 people's mixed type WSJ0 languages Expect shown in performance (average CER&WER) list on library test data set.

When descending sort of the training data according to SNR, the performance of model is poorer than the model with reverse order training, even It is more worse than the model with randomly ordered data training, this demonstrate that the supposition of this method.When SNR very little, two speakers Between capacity volume variance be delicate, model learning separating capacity.Later, the data with bigger SNR improve accuracy.

The distillation of sequence-level knowledge and course learning technology are applied to the coder-decoder based on joint CTC/ attention More speaker's end-to-end speech identifying systems of frame.Tutor model is identified using single language end-to-end speech to calculate soft label Vector calculates final loss function as target distribution.In order to make full use of training data, further according to the ascending order weight of SNR New arrangement data.

A kind of optimization system for single pass speech recognition modeling of one embodiment of the invention offer is provided Structural schematic diagram, the system can be performed described in above-mentioned any embodiment be used for single pass speech recognition modeling optimization side Method, and configure in the terminal.

A kind of optimization system for single pass speech recognition modeling provided in this embodiment includes: that the soft label of target is true Determine program module 11, output, which arranges, determines program module 12, loses and determines program module 13 and optimization program module 14.

Wherein, the soft label of target determines that program module 11 is used to receive the single voice for respectively having true tag vector, with And more people's mixing voices by each single speech synthesis, the phonetic feature extracted from each single voice is inputted respectively To target tutor model, the corresponding soft label vector of target of each single voice is obtained；Output, which arranges, determines program module 12 For more people's mixing voices to be input to end-to-end student model, everyone output is exported in more people's mixing voices Label vector, by permutation invariance method (PIT) by everyone output label vector and institute in more people's mixing voices The true tag vector for stating each single voice is matched, and determines in more people's mixing voices everyone output label vector Output arrangement；It loses and determines that program module 13 is used for according to every in more people's mixing voices by determining output arrangement after pairing Personal output label vector, the determining knowledge distillation loss with each soft label vector of target, and with it is described it is each one The direct losses of voice true tag vector；Optimize program module 14 to be used for when according to the knowledge distillation loss and direct losses When determining joint error is not converged, backpropagation is carried out to the end-to-end student model according to the joint error, with more The new end-to-end student model is used for single pass speech recognition after determining optimization until the joint error restrains Raw model.

Further, the output, which arranges, determines that program module is used for:

Further, the output, which arranges, determines that program module is also used to:

Further, the soft label of the target determines that program module is used for:

The embodiment of the invention also provides a kind of nonvolatile computer storage media, computer storage medium is stored with meter Calculation machine executable instruction, the computer executable instructions can be performed in above-mentioned any means embodiment and are used for single pass voice The optimization method of identification model；

As an implementation, nonvolatile computer storage media of the invention is stored with the executable finger of computer It enables, computer executable instructions setting are as follows:

As a kind of non-volatile computer readable storage medium storing program for executing, it can be used for storing non-volatile software program, non-volatile Property computer executable program and module, such as the corresponding program instruction/mould of the method for the test software in the embodiment of the present invention Block.One or more program instruction is stored in non-volatile computer readable storage medium storing program for executing, when being executed by a processor, is held The optimization method for single pass speech recognition modeling in the above-mentioned any means embodiment of row.

Non-volatile computer readable storage medium storing program for executing may include storing program area and storage data area, wherein storage journey It sequence area can application program required for storage program area, at least one function；Storage data area can be stored according to test software Device use created data etc..In addition, non-volatile computer readable storage medium storing program for executing may include that high speed is deposited at random Access to memory, can also include nonvolatile memory, a for example, at least disk memory, flush memory device or other are non- Volatile solid-state part.In some embodiments, it includes relative to place that non-volatile computer readable storage medium storing program for executing is optional The remotely located memory of device is managed, these remote memories can be by being connected to the network to the device of test software.Above-mentioned network Example include but is not limited to internet, intranet, local area network, mobile radio communication and combinations thereof.

The embodiment of the present invention also provides a kind of electronic equipment comprising: at least one processor, and with described at least one The memory of a processor communication connection, wherein the memory is stored with the finger that can be executed by least one described processor Enable, described instruction executed by least one described processor so that at least one described processor be able to carry out it is of the invention any The step of optimization method for single pass speech recognition modeling of embodiment.

The client of the embodiment of the present application exists in a variety of forms, including but not limited to:

(1) mobile communication equipment: the characteristics of this kind of equipment is that have mobile communication function, and to provide speech, data Communication is main target.This Terminal Type includes: smart phone, multimedia handset, functional mobile phone and low-end mobile phone etc..

(2) super mobile personal computer equipment: this kind of equipment belongs to the scope of personal computer, there is calculating and processing function Can, generally also have mobile Internet access characteristic.This Terminal Type includes: PDA, MID and UMPC equipment etc., such as tablet computer.

(3) portable entertainment device: this kind of equipment can show and play multimedia content.Such equipment include: audio, Video player, handheld device, e-book and intelligent toy and portable car-mounted navigation equipment.

(4) other electronic devices with speech identifying function.

Herein, relational terms such as first and second and the like be used merely to by an entity or operation with it is another One entity or operation distinguish, and without necessarily requiring or implying between these entities or operation, there are any this reality Relationship or sequence.Moreover, the terms "include", "comprise", include not only those elements, but also including being not explicitly listed Other element, or further include for elements inherent to such a process, method, article, or device.Do not limiting more In the case where system, the element that is limited by sentence " including ... ", it is not excluded that including process, method, the article of the element Or there is also other identical elements in equipment.

The apparatus embodiments described above are merely exemplary, wherein described, unit can as illustrated by the separation member It is physically separated with being or may not be, component shown as a unit may or may not be physics list Member, it can it is in one place, or may be distributed over multiple network units.It can be selected according to the actual needs In some or all of the modules achieve the purpose of the solution of this embodiment.Those of ordinary skill in the art are not paying creativeness Labour in the case where, it can understand and implement.

Through the above description of the embodiments, those skilled in the art can be understood that each embodiment can It realizes by means of software and necessary general hardware platform, naturally it is also possible to pass through hardware.Based on this understanding, on Stating technical solution, substantially the part that contributes to existing technology can be embodied in the form of software products in other words, should Computer software product may be stored in a computer readable storage medium, such as ROM/RAM, magnetic disk, CD, including several fingers It enables and using so that a computer equipment (can be personal computer, server or the network equipment etc.) executes each implementation Method described in certain parts of example or embodiment.

Finally, it should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention, rather than its limitations；Although Present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that: it still may be used To modify the technical solutions described in the foregoing embodiments or equivalent replacement of some of the technical features； And these are modified or replaceed, technical solution of various embodiments of the present invention that it does not separate the essence of the corresponding technical solution spirit and Range.

Claims

1. a kind of optimization method for single pass speech recognition modeling, comprising:

The single voice for respectively having true tag vector is received, and by more people's mixing voices of each single speech synthesis, The phonetic feature extracted from each single voice is input to target tutor model respectively, it is corresponding to obtain each single voice The soft label vector of target；

More people's mixing voices are input to end-to-end student model, export in more people's mixing voices everyone output Label vector, by permutation invariance method (PIT) by everyone output label vector and institute in more people's mixing voices The true tag vector for stating each single voice is matched, and determines in more people's mixing voices everyone output label vector Output arrangement；

According to everyone output label vector in more people's mixing voices by determining output arrangement after pairing, it is determining with it is described The knowledge distillation loss of each soft label vector of target, and the direct losses with each single voice true tag vector；

When the joint error determined according to the knowledge distillation loss and direct losses is not converged, according to the joint error pair The end-to-end student model carries out backpropagation, to update the end-to-end student model, until the joint error restrains, Single pass speech recognition student model is used for after determining optimization.

2. described that more people's mixing voices are input to end-to-end student model according to the method described in claim 1, wherein Include:

By the neural network being trained in the end-to-end student model to the phonetic features of more people's mixing voices into Row Projection Character marks off in more people's mixing voices everyone phonetic feature；

Based on everyone phonetic feature, corresponding connection is determined by the encoder in the end-to-end student model Timing classification (CTC) score；

By the decoder in the end-to-end student model, by the corresponding spy of the minimum score of connection timing classification (CTC) Sign arrangement, is converted to corresponding output label vector；Wherein, the label vector is mapped by dictionary, available corresponding Text sequence.

3. according to the method described in claim 2, wherein, in the voice for marking off in more people's mixing voices everyone After feature, the method also includes:

By the attention power module in the end-to-end student model, to everyone phonetic feature in more people's mixing voices Corresponding attention score is determined in further feature extraction, so that more people's mixing voices and the single output are marked Sign vector time alignment.

4. according to the method described in claim 3, wherein, it is described determine corresponding attention score after, the method Further include:

By by the corresponding connection timing of each feature permutation classify (CTC) score with the attention score according to preset Recognition mode weighting, determines the Combination score of each feature permutation；

5. according to the method described in claim 2, wherein, in the mind being trained to by the end-to-end student model After carrying out Projection Character through phonetic feature of the network to more people's mixing voices, the method also includes:

According to the end-to-end newly added speaker adaptation module of student model, obtain each in more people's mixing voices The relevant information of people, after everyone phonetic feature described in determination, everyone context variable described in additional determination.

6. according to the method described in claim 1, wherein, the reception respectively the single voice with true tag vector it Afterwards, the method also includes:

More people's mixing voice data are ranked up according to the size of the signal-to-noise ratio, so that in speech recognition modeling Reach in optimization incremental.

7. according to the method described in claim 1, wherein, the connection determined according to the knowledge distillation loss and direct losses Closing error includes:

According to knowledge distillation loss described in preset pattern pair and direct losses weighted sum, joint error is determined.

8. a kind of optimization system for single pass speech recognition modeling, comprising:

The soft label of target determines program module, for receiving the single voice for respectively having true tag vector, and by described each The phonetic feature extracted from each single voice is input to target teacher respectively by more people's mixing voices of single speech synthesis Model obtains the corresponding soft label vector of target of each single voice；

Output, which arranges, determines program module, for more people's mixing voices to be input to end-to-end student model, described in output Everyone output label vector in more people's mixing voices, by permutation invariance method (PIT) by more people's mixing voices It is interior everyone output label vector and the true tag vector of each single voice match, determine the more people mixing The output arrangement of everyone output label vector in voice；

It loses and determines program module, for according to everyone defeated in more people's mixing voices by determining output arrangement after pairing Outgoing label vector, the determining knowledge distillation loss with each soft label vector of target, and it is true with each single voice The direct losses of label vector；

Optimize program module, for when the joint error determined according to the knowledge distillation loss and direct losses is not converged, Backpropagation is carried out to the end-to-end student model according to the joint error, to update the end-to-end student model, directly It is restrained to the joint error, is used for single pass speech recognition student model after determining optimization.

9. a kind of electronic equipment comprising: at least one processor, and deposited with what at least one described processor communication was connect Reservoir, wherein the memory be stored with can by least one described processor execute instruction, described instruction by it is described at least One processor executes, so that at least one described processor is able to carry out the step of any one of claim 1-7 the method Suddenly.

10. a kind of storage medium, is stored thereon with computer program, which is characterized in that the realization when program is executed by processor The step of any one of claim 1-7 the method.