CN106782560A

CN106782560A - Method and device for determining target recognition text

Info

Publication number: CN106782560A
Application number: CN201710127503.9A
Authority: CN
Inventors: 陈仲帅; 马宏
Original assignee: Hisense Group Co Ltd
Current assignee: Hisense Group Co Ltd
Priority date: 2017-03-06
Filing date: 2017-03-06
Publication date: 2017-05-31
Anticipated expiration: 2037-03-06
Also published as: CN106782560B

Abstract

The application provides a method and device for determining target recognition text. The method includes: determining the determined recognition text and the to-be-determined recognition text in at least two candidate recognition texts corresponding to the speech data to be recognized, where the determined recognition text is the part that is identical across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs among them; calculating the similarity between each to-be-determined recognition text and the text at the corresponding position of a target comparison text, where the target comparison text is a text in a preset text library whose sentence pattern is consistent with that of the candidate recognition texts and which includes the determined recognition text; and configuring, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text with the highest similarity. The target recognition text is thereby further screened from the candidate recognition texts, which improves its accuracy.

Description

Method and device for determining target recognition text

Technical field

The application relates to speech recognition technology, and in particular to a method and device for determining target recognition text.

Background

With the development of voice control technology, more and more smart devices now have a speech recognition function, for example smart TVs, smart refrigerators, and smart air conditioners with voice control, as well as smartphones and smart computers with voice input.

Current speech recognition mainly comprises speech preprocessing, acoustic model decoding, pronunciation dictionary parsing, and language model decoding. Speech preprocessing applies simple processing to the received speech signal to obtain the feature file of the speech. Acoustic model decoding takes the feature file as input and, through the acoustic model, obtains the phoneme file with the highest probability. The pronunciation dictionary is then queried to convert the phoneme information into possible word combinations, and the contextual association information of the language model is used to select the word combinations with higher probability as candidate recognition results. Because the corpus behind the language model comes from a wide range of sources, the candidate recognition results cannot guarantee accuracy, so some method is needed to screen an accurate recognition result out of them.

However, the prior art lacks a suitable screening method.

Summary of the application

The application provides a method and device for determining target recognition text, used to screen an accurate recognition result out of the candidate recognition results of the speech data to be recognized.

A first aspect of the application provides a method for determining target recognition text from at least two candidate recognition texts, including:

determining the determined recognition text and the to-be-determined recognition text in at least two candidate recognition texts corresponding to the speech data to be recognized, where the determined recognition text is the part that is identical across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs among the at least two candidate recognition texts;

calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, where the target comparison text is a text in a preset text library whose sentence pattern is consistent with that of the candidate recognition texts, and the target comparison text includes the determined recognition text;

configuring, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity.

A second aspect of the application provides a device for determining target recognition text from candidate recognition texts, including:

a first determining module, configured to determine the determined recognition text and the to-be-determined recognition text in at least two candidate recognition texts corresponding to the speech data to be recognized, where the determined recognition text is the part that is identical across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs among the at least two candidate recognition texts;

a computing module, configured to calculate the similarity between the to-be-determined recognition text and the text at the corresponding position of a target comparison text, where the target comparison text is a text in a preset text library whose sentence pattern is consistent with that of the candidate recognition texts, and the target comparison text includes the determined recognition text;

a second determining module, configured to configure, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity.

The beneficial effects of the application are as follows:

In the method for determining target recognition text provided by the application, the determined recognition text and the to-be-determined recognition text are first identified in the at least two candidate recognition texts corresponding to the speech data to be recognized. For each to-be-determined recognition text, the similarity to the text at the corresponding position of the target comparison text is then calculated, and the to-be-determined recognition text with the maximum similarity is taken as the correct result for the speech data to be recognized. The candidate recognition text composed of that to-be-determined recognition text and the determined recognition text is configured as the target recognition text and fed back to the user. In other words, when several candidate recognition texts with close probabilities are obtained, the differing parts among them are further selected by reference to a target comparison text with the same sentence pattern, so that the result closest to the speech data input by the user is chosen. This improves the accuracy of recognizing the speech data and improves the user experience of speech recognition.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a method for determining target recognition text from at least two candidate recognition texts according to an embodiment of the application;

Fig. 2 is a schematic flowchart of a method for determining target recognition text from at least two candidate recognition texts according to another embodiment of the application;

Fig. 3 is a schematic structural diagram of a device for determining target recognition text from at least two candidate recognition texts according to an embodiment of the application;

Fig. 4 is a schematic structural diagram of a device for determining target recognition text from at least two candidate recognition texts according to another embodiment of the application.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Before the embodiments of the present invention are explained in detail, their application environment is introduced. The display method for displaying voice input control instructions provided by the embodiments of the present invention is applied to a terminal. For example, the terminal may be a smart TV, smartphone, or tablet computer running the Android or iOS operating system, or a computer or PDA (Personal Digital Assistant) running the Windows or iOS operating system. The embodiments of the present invention do not specifically limit this.

The application provides a method for determining target recognition text from at least two candidate recognition texts. After speech recognition has produced multiple recognition results, the method further analyzes those results and selects the final speech recognition text from among them, so as to improve the accuracy of speech recognition.

Fig. 1 is a schematic flowchart of a method for determining target recognition text from at least two candidate recognition texts according to an embodiment of the application. As shown in Fig. 1, the method includes:

S101. Determine the determined recognition text and the to-be-determined recognition text in at least two candidate recognition texts corresponding to the speech data to be recognized.

In a specific implementation, after the user inputs the speech data to be recognized, multiple speech recognition texts may be recognized because of similar pronunciations, limited recognition accuracy, and similar reasons.

For example, if the user says "I want to listen to Gao Shengmei's songs", the recognizer may produce several speech recognition texts that differ only in the near-homophone name, such as "I want to listen to Gaosheng Mei's songs", "I want to listen to Gao Xingmei's songs", and "I want to listen to Gao Shengmei's songs".

The candidate recognition texts are first determined from these speech recognition texts, and an accurate recognition result is then selected from them.

A candidate recognition text consists of determined recognition text and to-be-determined recognition text. The determined recognition text is the part that is identical across the at least two candidate recognition texts, and the to-be-determined recognition text is the part that differs among them. For example, in "I want to listen to Gao Xingmei's songs" and "I want to listen to Gao Shengmei's songs", "I want to listen to" and "songs" are determined recognition text, while "Gao Shengmei" and "Gao Xingmei" are to-be-determined recognition text.

That is, the part that is identical across the candidate recognition texts can be regarded as an accurate result, while the part that differs is the to-be-determined recognition text, which needs further recognition to obtain a more accurate result.
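The split described in S101 can be sketched as follows. This is a minimal illustration, assuming the candidates have already been segmented into the same number of aligned slots; the function name `split_candidates` and the segmentation shown are hypothetical and not part of the application.

```python
from typing import List, Tuple


def split_candidates(candidates: List[List[str]]) -> Tuple[List[int], List[int]]:
    """Return (shared_positions, differing_positions) for aligned candidates.

    Assumption: every candidate is segmented into the same number of slots,
    so position i refers to the same slot in each sentence.
    """
    length = len(candidates[0])
    if any(len(c) != length for c in candidates):
        raise ValueError("candidates must be aligned to the same length")
    shared, differing = [], []
    for i in range(length):
        tokens_at_i = {c[i] for c in candidates}
        # A slot is "determined" when every candidate agrees on it.
        (shared if len(tokens_at_i) == 1 else differing).append(i)
    return shared, differing


candidates = [
    ["I want to listen to", "Gao Shengmei", "'s songs"],
    ["I want to listen to", "Gao Xingmei", "'s songs"],
]
shared, differing = split_candidates(candidates)
```

Here the shared positions correspond to the determined recognition text and the differing positions to the to-be-determined recognition text. Real recognizer output would need an alignment step first, which this sketch leaves out.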

S102. Calculate the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text.

The target comparison text is a text in the preset text library whose sentence pattern is consistent with that of the candidate recognition texts, and the target comparison text includes the determined recognition text described above.

The preset text library may include a large number of pre-stored sentences and other word combinations. A target comparison text whose sentence pattern is consistent with that of the candidate recognition text can be matched in the preset text library by word meaning, part of speech (noun, verb), and so on. For example, "I want to listen to Gao Xingmei's songs" may match the target comparison text "I want to listen to Zhou Jielun's songs", and "Please give me another cup of coffee" may match the target comparison text "Please give me a glass of milk".

As an example of the target comparison text including the determined recognition text, "I want to listen to Zhou Jielun's songs" includes the determined recognition text "I want to listen to" and "songs".
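One simple way to picture the lookup in the preset text library is sketched below. It is a deliberate simplification: it treats "same sentence pattern" as "same number of slots and identical text in the determined slots", whereas the application also allows matching by word meaning and part of speech. The name `find_comparison_texts` and the tiny library are hypothetical.

```python
from typing import List, Sequence


def find_comparison_texts(
    candidate: Sequence[str],
    shared_positions: Sequence[int],
    library: Sequence[Sequence[str]],
) -> List[Sequence[str]]:
    """Return library entries that contain the determined text in place.

    Assumption: library entries are segmented into slots the same way as
    the candidate, so slots can be compared position by position.
    """
    matches = []
    for text in library:
        same_shape = len(text) == len(candidate)
        if same_shape and all(text[i] == candidate[i] for i in shared_positions):
            matches.append(text)
    return matches


library = [
    ["I want to listen to", "Zhou Jielun", "'s songs"],
    ["Please give me", "a glass of milk"],
]
candidate = ["I want to listen to", "Gao Xingmei", "'s songs"]
matches = find_comparison_texts(candidate, [0, 2], library)
```

The slot that is not in `shared_positions` (here the singer's name) is the "text at the corresponding position" that later gets compared with each to-be-determined recognition text.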

S103. Configure, as the target recognition text, the candidate recognition text composed of the determined recognition text and the to-be-determined recognition text corresponding to the maximum similarity.

Optionally, the similarity between each to-be-determined recognition text and the text at the corresponding position of the target comparison text is calculated separately. For example, the similarity between "Gaosheng Mei" and "Zhou Jielun" and the similarity between "Gao Shengmei" and "Zhou Jielun" are determined separately.

If the similarity between "Gao Shengmei" and "Zhou Jielun" is the largest, "I want to listen to Gao Shengmei's songs" is configured as the target recognition text.

The similarity may be semantic similarity, similarity of the type to which the text belongs, part-of-speech similarity, and so on, which is not limited here.
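The selection in S103 amounts to an argmax over the to-be-determined options. The sketch below assumes a similarity function is supplied from elsewhere; `pick_target_text`, `toy_similarity`, and the scores in `TOY_SCORES` are hypothetical placeholders for illustration and are not values from the application.

```python
from typing import Callable, Sequence


def pick_target_text(
    prefix: str,
    suffix: str,
    options: Sequence[str],
    reference: str,
    similarity: Callable[[str, str], float],
) -> str:
    """Rebuild the sentence around the option most similar to the reference."""
    best = max(options, key=lambda option: similarity(option, reference))
    return f"{prefix}{best}{suffix}"


# Placeholder scores for illustration only; not taken from the source.
TOY_SCORES = {
    ("Gao Shengmei", "Zhou Jielun"): 0.8,
    ("Gao Xingmei", "Zhou Jielun"): 0.1,
}


def toy_similarity(a: str, b: str) -> float:
    return TOY_SCORES.get((a, b), 0.0)


target = pick_target_text(
    "I want to listen to ", "'s songs",
    ["Gao Xingmei", "Gao Shengmei"], "Zhou Jielun", toy_similarity,
)
```

Passing the similarity in as a function keeps the selection step independent of whether semantic, type, or part-of-speech similarity is used.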

In this embodiment, the determined recognition text and the to-be-determined recognition text are first identified in the at least two candidate recognition texts corresponding to the speech data to be recognized. For each to-be-determined recognition text, the similarity to the text at the corresponding position of the target comparison text is then calculated, and the to-be-determined recognition text with the maximum similarity is taken as the correct result for the speech data to be recognized. The candidate recognition text composed of that to-be-determined recognition text and the determined recognition text is configured as the target recognition text and fed back to the user. In other words, when several candidate recognition texts with close probabilities are obtained, the differing parts among them are further selected by reference to a target comparison text with the same sentence pattern, which improves the accuracy of recognizing the speech data and improves the user experience of speech recognition.

Fig. 2 is a schematic flowchart of a method for determining target recognition text from candidate recognition texts according to another embodiment of the application. As shown in Fig. 2, on the basis of Fig. 1, the method further includes the following before S101:

S201. Obtain multiple speech recognition texts corresponding to the speech data to be recognized.

After the user inputs a segment of speech, the terminal can obtain multiple results from a preset speech recognition decoder. Typically, the preset speech recognition decoder includes one or more models for speech recognition, which recognize the speech data to be recognized. Because some pronunciations in the speech are unclear, or because many words are homophones or near-homophones, multiple speech recognition texts may be recognized.

Specifically, after the speech data to be recognized is obtained, it may first undergo preprocessing such as front-end signal processing and endpoint detection. Speech features are then extracted frame by frame and sent to the preset speech recognition decoder. The preset speech recognition decoder may include decoding models such as an acoustic model, a language model, and a pronunciation dictionary, which are combined in the decoder to obtain multiple speech recognition texts.

The acoustic model mainly describes the likelihood of the features under a pronunciation model, and may be a hidden Markov model (HMM). The language model mainly describes the probability of words occurring in succession, and may be an n-gram model. For Chinese, it is called a Chinese language model (CLM). It may contain a large corpus of sentences, words, and so on, and can constrain the text search results according to the statistical co-occurrence probability of adjacent words. The pronunciation dictionary mainly performs the conversion between words and sounds. In the conversion, acoustic model decoding searches the acoustic model with the feature file of the speech signal and produces the optimal phoneme recognition result, where phonemes may be denoted by letters. The phoneme recognition result is converted into words by querying the pronunciation dictionary. Finally, language model decoding selects the most probable word combinations from those obtained by querying the pronunciation dictionary, and these are used as the speech recognition texts.

It should be noted that the operation of recognizing the speech data to obtain the corresponding speech recognition texts may refer to the related art, and is not described in detail in the embodiments of the present invention.

For example, the operation of recognizing the speech data to obtain the corresponding speech recognition text may be implemented with the following formulas in sequence.

$$W_1 = \arg\max_{W} P(W \mid X) \tag{1}$$

In formula (1), W denotes any word sequence stored in a database, where a word sequence consists of characters or words and the database may be a corpus used for speech recognition; X denotes the speech data input by the user; W₁ denotes the word sequence, among the stored word sequences, that can be matched with the speech data to be recognized; and P(W | X) denotes the probability that the speech data to be recognized corresponds to that text. In formula (2), W₂ denotes the degree of matching between the speech data to be recognized and the word sequence, P(X | W) denotes the probability of the word sequence being pronounced as that speech, P(W) denotes the probability that the word sequence is a valid sequence of characters or words, and P(X) denotes the probability that the speech data to be recognized is the given audio information.
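The body of formula (2) is not shown in the text. Given the quantities it is said to contain, P(X | W), P(W), and P(X), it presumably applies Bayes' rule to formula (1). The following is a reconstruction under that assumption rather than the original formula:

$$W_2 = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} \tag{2}$$

Because P(X) does not depend on W, maximizing this expression is equivalent to maximizing P(X | W) P(W), which is why the acoustic term and the language model term can be estimated separately.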

It should be noted that in the recognition process above, P(W) can be determined by the language model and P(X | W) by the acoustic model, thereby completing the speech recognition of the speech data to be recognized and obtaining its corresponding speech recognition text. The language model and the acoustic model are briefly introduced below.

Language model

The language model generally uses the chain rule to decompose the probability of a word sequence into the product of the probabilities of its individual characters or words. That is, W is decomposed into w₁, w₂, w₃, ..., w_{n-1}, w_n, and P(W) is determined by the following formula (3).

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \tag{3}$$

In formula (3), each factor of P(W) is the probability of the current character or word given that all preceding characters or words are known.

When P(W) is determined by formula (3), overly long conditions make the computation inefficient, which affects the subsequent speech recognition. To improve efficiency, P(W) is usually determined with an n-gram language model. In the case shown here, the probability of the n-th word depends only on the (n-1)-th word immediately before it (that is, a bigram model), and P(W) can be determined by the following formula (4).

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1}) \tag{4}$$
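Formula (4) can be illustrated with a small count-based estimate. The sketch below is an assumption-laden toy: the function names, the tiny corpus, and the add-one smoothing are all illustrative choices and are not described in the application. It works in log space to avoid underflow on long sequences.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def train_counts(sentences: Sequence[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Count single words and adjacent word pairs in a toy corpus."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_log_prob(words: List[str], unigrams: Counter, bigrams: Counter) -> float:
    """log P(W) = log P(w1) + sum of log P(w_i | w_{i-1}), as in formula (4).

    Add-one smoothing (an assumption of this sketch) keeps unseen pairs
    from producing log(0).
    """
    total = sum(unigrams.values())
    vocab = len(unigrams)
    logp = math.log((unigrams[words[0]] + 1) / (total + vocab))
    for prev, cur in zip(words, words[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return logp


corpus = [["I", "want", "songs"], ["I", "want", "milk"]]
unigrams, bigrams = train_counts(corpus)
seen_lp = bigram_log_prob(["I", "want", "songs"], unigrams, bigrams)
scrambled_lp = bigram_log_prob(["songs", "want", "I"], unigrams, bigrams)
```

A word order that appears in the corpus scores higher than a scrambled one, which is the property the decoder relies on when ranking word combinations.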

Acoustic model

Determining each word also requires determining its pronunciation, which is done through a dictionary. The dictionary is a model parallel to the acoustic model and the language model, and it converts a single word into a phoneme string. Through the dictionary, the acoustic model can determine which sounds the words in the user's speech data produce in sequence, and a dynamic programming (DP) algorithm such as the Viterbi algorithm is used to find the boundaries of each phoneme. This determines the start and end time of each phoneme, and in turn the degree of matching between the user's speech data and the phoneme string, that is, P(X | W).

In general, the distribution of the feature vectors of each phoneme can be estimated by a classifier such as a Gaussian mixture model. In the recognition stage, the probability P(x_t | s_i) that the feature vector x_t of each frame of the user's speech data is produced by the corresponding phoneme s_i is determined, and the per-frame probabilities are multiplied together to obtain P(X | W).
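The per-frame product can be sketched as below. For brevity this uses a single one-dimensional Gaussian per phoneme instead of a Gaussian mixture over MFCC vectors, and it assumes the frame-to-phoneme alignment is already known; both are simplifications of this sketch, and the names `gaussian_pdf`, `log_likelihood`, and `phoneme_models` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian; stands in for a full mixture model."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(
    frames: Sequence[float],
    alignment: Sequence[str],
    phoneme_models: Dict[str, Tuple[float, float]],
) -> float:
    """log P(X | W) as the sum over frames of log P(x_t | s_i).

    Summing logs is equivalent to multiplying the per-frame probabilities
    and avoids numerical underflow.
    """
    total = 0.0
    for x, phoneme in zip(frames, alignment):
        mean, var = phoneme_models[phoneme]
        total += math.log(gaussian_pdf(x, mean, var))
    return total


# Toy models: phoneme "a" centred at 0.0, phoneme "b" centred at 1.0.
phoneme_models = {"a": (0.0, 1.0), "b": (1.0, 1.0)}
frames = [0.1, 0.9]
good = log_likelihood(frames, ["a", "b"], phoneme_models)
bad = log_likelihood(frames, ["b", "a"], phoneme_models)
```

The alignment whose phoneme models sit closer to the observed frames receives the higher log-likelihood, which is the matching degree referred to above.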

The classifier can be trained in advance. Specifically, a large number of feature vectors are extracted from the training data as Mel-frequency cepstral coefficients (MFCC), together with the phoneme corresponding to each feature vector, and a classifier from features to phonemes is trained on them.

It should be noted that in practical applications P(X | W) can also be determined in other ways. For example, a neural network can directly output P(s_i | x_t), which can be converted into P(x_t | s_i) with Bayes' formula and then multiplied across frames to obtain P(X | W). This is only an example, and the embodiments of the present invention are not limited to it.
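The conversion mentioned above can be written out explicitly. In this sketch, P(s_i) denotes the prior probability of phoneme s_i and P(x_t) the probability of the frame; neither symbol is named in the text, so both are assumptions introduced for the derivation:

$$P(x_t \mid s_i) = \frac{P(s_i \mid x_t)\,P(x_t)}{P(s_i)}, \qquad P(X \mid W) = \prod_{t} P(x_t \mid s_i)$$

Since P(x_t) is the same for every competing word sequence, it is commonly dropped when hypotheses are only being ranked, leaving the network output divided by the phoneme prior.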

S202. Determine the maximum probability value and the second-largest probability value among the probability values corresponding to the multiple speech recognition texts.

The recognition probability of each speech recognition text can be calculated with a preset algorithm according to the word combination of that speech recognition text.

Optionally, the probability value P_rec of each speech recognition text may be calculated from three quantities: the decoding rate of the acoustic model, the decoding rate of the pronunciation dictionary, and the decoding rate of the language model. These are expressed in terms of the feature file of the speech data to be recognized, the recognized word combination, and the phoneme sequence.

It can be seen that substituting the word combination and phoneme sequence of each speech recognition text, together with the feature file of the speech data to be recognized, yields the three decoding rates for that speech recognition text, and from them the probability value corresponding to each speech recognition text.

Assume there are N speech recognition texts in total, and denote the probability value of each speech recognition text as P_n, where n = 1, 2, ..., N. The maximum probability value P_max and the second-largest probability value P_2max can then be selected.

S203. Determine whether the difference between the maximum probability value and the second-largest probability value is greater than a preset probability threshold.

Further, the difference between the maximum probability value and the second-largest probability value can be obtained. If the difference is greater than or equal to the preset probability threshold, the speech recognition text corresponding to the maximum probability value is already sufficiently accurate, and it can be directly determined as the target recognition text.

In a specific implementation, the differences between the maximum probability value P_max and each of the other probability values P_n can be calculated in turn. Optionally, the mean of the absolute values of these differences is taken as the acoustic probability difference E_P. E_P reflects the distribution of the speech recognition texts and measures the gap between the best speech recognition text and the remaining ones. When E_P is greater than a preset threshold, the speech recognition text corresponding to the maximum probability value P_max can be directly determined as the target recognition text, without further semantic analysis.
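The quantity E_P can be sketched as a mean absolute gap. The exact formula is not shown in the text, so the normalization used here (dividing by the number of other texts, N - 1) is an assumption, as is the function name `mean_abs_gap`.

```python
from typing import Sequence


def mean_abs_gap(probs: Sequence[float]) -> float:
    """Mean of |P_max - P_n| over all texts other than the best one.

    Assumption: the average is taken over the N - 1 remaining texts.
    """
    if len(probs) < 2:
        raise ValueError("need at least two probability values")
    p_max = max(probs)
    others = list(probs)
    others.remove(p_max)  # drop one occurrence of the maximum
    return sum(abs(p_max - p) for p in others) / len(others)


e_p = mean_abs_gap([0.5, 0.3, 0.2])
```

A large E_P means the best text stands well clear of the rest, which is the case where the similarity analysis can be skipped.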

Further, when the difference between the maximum probability value and the second-largest probability value is less than the preset probability threshold, at least two candidate recognition texts are determined from the multiple speech recognition texts.

Optionally, determining at least two candidate recognition texts from the multiple speech recognition texts may be: obtaining, from the multiple speech recognition texts, the first speech recognition texts whose probability values differ from the maximum probability value by less than the preset probability threshold, and determining the first speech recognition texts and the speech recognition text corresponding to the maximum probability value as the at least two candidate recognition texts.

That is, the maximum probability value is compared with each of the other probability values, and when the difference is less than the preset probability threshold, the speech recognition text corresponding to the compared probability value is taken as a candidate recognition text. If the difference is greater than or equal to the preset probability threshold, the speech recognition text corresponding to the compared probability value is very unlikely to be the target recognition text and is not analyzed further.
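The threshold-based selection just described can be sketched as follows. The function name `select_candidates` and the example probabilities are hypothetical; the rule itself (keep every text whose gap to the maximum is below the threshold) follows the description above.

```python
from typing import List, Sequence, Tuple


def select_candidates(scored: Sequence[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep the best text plus every text within `threshold` of it.

    The best text is always kept because its gap to itself is zero, which
    is below any positive threshold.
    """
    best_p = max(p for _, p in scored)
    return [text for text, p in scored if best_p - p < threshold]


# Illustrative probabilities only.
scored = [("text A", 0.40), ("text B", 0.38), ("text C", 0.10)]
kept = select_candidates(scored, threshold=0.05)
```

If only one text survives, it can be used directly as the target recognition text; otherwise the surviving texts go on to the similarity comparison.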

Optionally, the probability values of the multiple speech recognition texts can be sorted, and a preset number of speech recognition texts with the highest probability values selected as candidate recognition texts. Candidate recognition texts can also be selected in turn from high to low according to the difference between the probability values of adjacent speech recognition texts. For example, if the difference between the maximum probability value and the second-highest probability value is greater than the preset threshold, the comparison stops and the speech recognition text with the highest probability value is directly taken as the target recognition text. Otherwise, the speech recognition texts with the highest and second-highest probability values are both taken as candidate recognition texts, the difference between the second-highest probability value and the next probability value is determined, and further candidate recognition texts are determined in the same way until some difference is greater than the preset threshold, at which point the comparison stops. The selection is not limited to these approaches; candidate recognition texts can be determined flexibly as needed, and may also be obtained by a formula or an algorithm.

If only one candidate recognition text is determined, it can be directly configured as the target recognition text. If there are multiple candidate recognition texts, the result that best matches the actual situation is further determined as the target recognition text.

Optionally, calculating the similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text may include: determining, with a preset word vector model, the semantic similarity between the to-be-determined recognition text and the text at the corresponding position of the target comparison text.

The preset word vector model represents the semantic similarity between words by the distance between their word vectors.

The preset word vector model can be obtained by word vector training, which converts each word into a low-dimensional real-valued vector of finite length; 50 and 100 dimensions are common. The distance between vectors can be measured by the conventional Euclidean distance or by the cosine of the angle between them, which is not limited here. The vector distance reflects the semantic distance between words, that is, the semantic similarity between words can be expressed by vector distance. Word vector training can be carried out with existing word vector training tools: a training corpus that comprehensively covers the basic words of Chinese is first obtained and preprocessed, and the training tool is then called to generate the vector representations, for example a 50-dimensional vector for each word in the corpus, which is not limited here. The larger the vector distance, the farther apart the words are semantically; conversely, the smaller the distance, the closer they are.
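The two distance measures named above can be sketched with plain Python lists standing in for word vectors. The function names are hypothetical, and real word vectors would come from a trained model rather than being written by hand.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance; smaller means semantically closer."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; larger means closer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```

Note the opposite orientation of the two measures: a smaller Euclidean distance and a larger cosine value both indicate higher semantic similarity.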

Specifically, the to-be-determined recognition text of a candidate recognition text and the text at the corresponding position of the target comparison text appear in the same sentence pattern and at the same position, so they are very likely to be the same kind of thing. The similarity is then further determined according to the word vector distance.

Table 1 is taken as an example:

Table 1

It can be seen that, " Gao Shengmei " is closest with the term vector of " Zhou Jielun ", then match somebody with somebody " I wants to listen the song of Gao Shengmei " Target identification text is set to, and target identification text output is shown to user, if the voice messaging of control instruction class, can Related instruction is performed with according to target identification text, is not repeated one by one herein.

Alternatively, using default term vector model, the correspondence position of identification text to be determined and targeted contrast text is determined Text between semantic similarity, Ke Yiwei：When identification text to be determined includes at least two vocabulary, using default word Vector model, determines in identification text to be determined in each vocabulary and targeted contrast text between the vocabulary of correspondence position respectively Semantic similarity.

That is, the words at different positions are compared separately. For example, to compare the to-be-determined identification text in "eating fruit for breakfast is good for health" with the text at the corresponding positions of the target contrast text "eating coarse grains for dinner is good for health", the semantic similarity between "breakfast" and "dinner" and the semantic similarity between "fruit" and "coarse grains" can be determined separately.
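A position-by-position comparison of this kind might look like the sketch below. It assumes the differing words have already been aligned by position and uses cosine similarity with hypothetical two-dimensional toy vectors; how the per-position scores are combined into one score is not specified in the text, so the sketch simply returns them.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def per_position_similarity(undetermined_words, contrast_words, vectors):
    """Compare words pairwise by position and return (word, word, score) triples."""
    if len(undetermined_words) != len(contrast_words):
        raise ValueError("this sketch assumes the two word lists are aligned")
    return [
        (w, c, cosine_similarity(vectors[w], vectors[c]))
        for w, c in zip(undetermined_words, contrast_words)
    ]


# Hypothetical toy vectors, chosen only so the example runs.
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "dinner": [0.8, 0.3],
    "fruit": [0.2, 0.9],
    "coarse grains": [0.3, 0.8],
}
scores = per_position_similarity(
    ["breakfast", "fruit"], ["dinner", "coarse grains"], toy_vectors
)
```

Each triple pairs one differing word with the word at the same position in the target contrast text, so "breakfast" is scored against "dinner" and "fruit" against "coarse grains".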

Fig. 3 is a schematic structural diagram of a device for determining target identification text from at least two candidate identification texts according to an embodiment of the present application. As shown in Fig. 3, the device includes a first determining module 301, a computing module 302, and a second determining module 303, wherein:

The first determining module 301 is configured to determine the determined identification text and the to-be-determined identification text in at least two candidate identification texts corresponding to the speech data to be recognized.

The determined identification text is the part that is identical across the at least two candidate identification texts, and the to-be-determined identification text is the part that differs across the at least two candidate identification texts.

The computing module 302 is configured to calculate the similarity between the to-be-determined identification text and the text at the corresponding position of the target contrast text.

The target contrast text is a text in a preset text library whose sentence pattern structure is consistent with that of the candidate identification texts, and the target contrast text includes the determined identification text.

The second determining module 303 is configured to configure, as the target identification text, the candidate identification text composed of the determined identification text and the to-be-determined identification text corresponding to the maximum similarity.

In this embodiment, the first determining module 301 first determines the determined identification text and the to-be-determined identification text in the at least two candidate identification texts corresponding to the speech data to be recognized. The computing module 302 then calculates the similarity between each to-be-determined identification text and the text at the corresponding position of the target contrast text, and the to-be-determined identification text corresponding to the maximum similarity is taken as the correct result for the speech data to be recognized. The second determining module 303 then configures the candidate identification text composed of that to-be-determined identification text and the determined identification text as the target identification text. In this way, when multiple candidate identification texts with close probabilities are obtained, the to-be-determined identification text closest to the speech data input by the user is determined with reference to a target contrast text whose sentence pattern structure is consistent with the candidates, based on the similarity between the to-be-determined identification text and the text at the corresponding position in the target contrast text; that text and the determined identification text together form the target identification text, which is fed back to the user. In other words, by referring to the target contrast text, the differing parts of multiple candidate identification texts with close probabilities are further screened, which improves the accuracy of recognizing the speech data and the user experience of speech recognition.
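The split performed by the first determining module can be illustrated with a short sketch. It assumes the candidates have already been segmented into words and contain the same number of words, so that identical and differing parts can be found by comparing positions; the function name and the placeholder tokens are hypothetical.

```python
def split_candidates(candidates):
    """Split tokenized candidates into identical and differing parts.

    candidates -- list of token lists, one per candidate identification text
    Returns (determined, to_be_determined, positions):
      determined       -- tokens identical across all candidates
      to_be_determined -- for each candidate, its tokens at the differing positions
      positions        -- indices of the differing positions
    """
    if len(candidates) < 2:
        raise ValueError("at least two candidates are required")
    length = len(candidates[0])
    if any(len(c) != length for c in candidates):
        raise ValueError("this sketch assumes candidates of equal token length")

    first = candidates[0]
    positions = [
        i for i in range(length) if any(c[i] != first[i] for c in candidates)
    ]
    differing = set(positions)
    determined = [first[i] for i in range(length) if i not in differing]
    to_be_determined = [[c[i] for i in positions] for c in candidates]
    return determined, to_be_determined, positions


# Placeholder tokens; NAME_A and NAME_B stand for two similar-sounding names.
candidate_tokens = [
    ["I", "want", "to", "hear", "songs", "by", "NAME_A"],
    ["I", "want", "to", "hear", "songs", "by", "NAME_B"],
]
determined, to_be_determined, positions = split_candidates(candidate_tokens)
```

The returned positions tell the computing module which words of the target contrast text to compare against.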

Fig. 4 is a schematic structural diagram of a device for determining target identification text from at least two candidate identification texts according to another embodiment of the present application. As shown in Fig. 4, on the basis of Fig. 3, the device further includes a third determining module 401, wherein:

The third determining module 401 is configured to determine, before the first determining module 301 determines the determined identification text and the to-be-determined identification text in the at least two candidate identification texts corresponding to the speech data to be recognized, the maximum probability value and the second-largest probability value among the multiple speech recognition texts corresponding to the speech data to be recognized.

In this embodiment, the first determining module 301 is configured to determine at least two candidate identification texts from the multiple speech recognition texts when the difference between the maximum probability value and the second-largest probability value is less than a preset probability threshold.

Optionally, the first determining module 301 is specifically configured to obtain, from the multiple speech recognition texts, first speech recognition texts whose probability values differ from the maximum probability value by less than the preset probability threshold, and to determine the first speech recognition texts and the speech recognition text corresponding to the maximum probability value as the at least two candidate identification texts.
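The threshold-based candidate selection described for the third and first determining modules can be sketched as follows. The sketch assumes the recognizer returns (text, probability) pairs; the function name, the sample texts, and the threshold value are hypothetical.

```python
def select_candidates(hypotheses, threshold):
    """Pick candidate texts whose probability is close to the best one.

    hypotheses -- list of (text, probability) pairs from the recognizer
    threshold  -- preset probability threshold
    Returns only the best text when the runner-up is not close enough,
    otherwise the best text plus every text within the threshold of it.
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    if len(ranked) < 2:
        return [text for text, _ in ranked]

    best_text, best_prob = ranked[0]
    second_prob = ranked[1][1]
    if best_prob - second_prob >= threshold:
        # The top result is clearly ahead; no further screening is needed.
        return [best_text]

    close = [text for text, prob in ranked[1:] if best_prob - prob < threshold]
    return [best_text] + close


# Hypothetical recognizer output.
hypotheses = [("text_a", 0.42), ("text_b", 0.40), ("text_c", 0.10)]
candidates = select_candidates(hypotheses, threshold=0.05)
```

With these sample values, only the two texts whose probabilities are within the threshold of each other are passed on for comparison against the target contrast text.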

Further, the computing module 302 is specifically configured to use a preset word vector model to determine the semantic similarity between the to-be-determined identification text and the text at the corresponding position in the target contrast text, wherein the preset word vector model is used to indicate the semantic similarity between words by word vector distance.

Optionally, the computing module 302 is specifically configured to, when the to-be-determined identification text includes at least two words, use the preset word vector model to determine, for each word in the to-be-determined identification text, the semantic similarity between that word and the word at the corresponding position in the target contrast text.

It should be noted that when the device for determining target identification text provided in the above embodiments determines the target identification text from at least two candidate identification texts, the division into the functional modules described above is used only as an example. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device for determining target identification text provided in the above embodiments and the method embodiments for determining target identification text belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.

In the several embodiments provided in the present application, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of the units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or in other forms.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for determining target identification text from at least two candidate identification texts, characterized by comprising:

determining a determined identification text and a to-be-determined identification text in at least two candidate identification texts corresponding to speech data to be recognized, wherein the determined identification text is the part that is identical across the at least two candidate identification texts, and the to-be-determined identification text is the part that differs across the at least two candidate identification texts;

calculating the similarity between the to-be-determined identification text and the text at the corresponding position of a target contrast text, wherein the target contrast text is a text in a preset text library whose sentence pattern structure is consistent with that of the candidate identification texts, and the target contrast text includes the determined identification text; and

configuring, as the target identification text, the candidate identification text composed of the determined identification text and the to-be-determined identification text corresponding to the maximum similarity.

2. The method according to claim 1, characterized in that, before the determining of the determined identification text and the to-be-determined identification text in the at least two candidate identification texts corresponding to the speech data to be recognized, the method further comprises:

determining the maximum probability value and the second-largest probability value among multiple speech recognition texts corresponding to the speech data to be recognized; and

when the difference between the maximum probability value and the second-largest probability value is less than a preset probability threshold, determining the at least two candidate identification texts from the multiple speech recognition texts.

3. method according to claim 1 and 2, it is characterised in that described to determine from the multiple speech recognition text At least two candidates recognize text, including：

Probable value is less than default probability threshold value with the difference of the most probable value in obtaining the multiple speech recognition text The first speech recognition text；

The first speech recognition text and the corresponding speech recognition text of the most probable value are defined as described at least two Individual candidate recognizes text.

4. The method according to claim 1, characterized in that the calculating of the similarity between the to-be-determined identification text and the text at the corresponding position of the target contrast text is specifically:

using a preset word vector model to determine the semantic similarity between the to-be-determined identification text and the text at the corresponding position of the target contrast text, wherein the preset word vector model is used to indicate the semantic similarity between words by word vector distance.

5. The method according to claim 4, characterized in that the using of the preset word vector model to determine the semantic similarity between the to-be-determined identification text and the text at the corresponding position in the target contrast text is specifically:

when the to-be-determined identification text includes at least two words, using the preset word vector model to determine, for each word in the to-be-determined identification text, the semantic similarity between that word and the word at the corresponding position in the target contrast text.

6. A device for determining target identification text from at least two candidate identification texts, characterized by comprising:

a first determining module, configured to determine a determined identification text and a to-be-determined identification text in at least two candidate identification texts corresponding to speech data to be recognized, wherein the determined identification text is the part that is identical across the at least two candidate identification texts, and the to-be-determined identification text is the part that differs across the at least two candidate identification texts;

a computing module, configured to calculate the similarity between the to-be-determined identification text and the text at the corresponding position of a target contrast text, wherein the target contrast text is a text in a preset text library whose sentence pattern structure is consistent with that of the candidate identification texts, and the target contrast text includes the determined identification text; and

a second determining module, configured to configure, as the target identification text, the candidate identification text composed of the determined identification text and the to-be-determined identification text corresponding to the maximum similarity.

7. The device according to claim 6, characterized in that the device further comprises a third determining module;

the third determining module is configured to determine, before the first determining module determines the determined identification text and the to-be-determined identification text in the at least two candidate identification texts corresponding to the speech data to be recognized, the maximum probability value and the second-largest probability value among multiple speech recognition texts corresponding to the speech data to be recognized; and

the first determining module is specifically configured to determine the at least two candidate identification texts from the multiple speech recognition texts when the difference between the maximum probability value and the second-largest probability value is less than a preset probability threshold.

8. The device according to claim 6 or 7, characterized in that the first determining module is specifically configured to obtain, from the multiple speech recognition texts, first speech recognition texts whose probability values differ from the maximum probability value by less than the preset probability threshold, and to determine the first speech recognition texts and the speech recognition text corresponding to the maximum probability value as the at least two candidate identification texts.

9. The device according to claim 6, characterized in that the computing module is specifically configured to use a preset word vector model to determine the semantic similarity between the to-be-determined identification text and the text at the corresponding position in the target contrast text, wherein the preset word vector model is used to indicate the semantic similarity between words by word vector distance.

10. The device according to claim 9, characterized in that the computing module is specifically configured to, when the to-be-determined identification text includes at least two words, use the preset word vector model to determine, for each word in the to-be-determined identification text, the semantic similarity between that word and the word at the corresponding position in the target contrast text.