CN110021295A - Learning transcription errors of a speech recognition task - Google Patents

Learning transcription errors of a speech recognition task

Info

Publication number
CN110021295A
Authority
CN
China
Prior art keywords
language
mistake
class
transcription
evidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910000917.4A
Other languages
Chinese (zh)
Other versions
CN110021295B (en)
Inventor
A. Aaron
Shangqing Guo
J. Lenchner
M. Mukherjee
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 15/863,937 (US10593320B2)
Priority claimed from US 15/863,938 (US10607596B2)
Application filed by International Business Machines Corp
Publication of CN110021295A
Application granted
Publication of CN110021295B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/16 Automatic learning of transformation rules, e.g. from examples
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A method, apparatus, and computer program product are provided for identifying mistranscriptions generated by a speech recognition system. A set of known utterances is provided for use by the speech recognition system, each utterance consisting of a respective plurality of words. A received utterance is matched to a first one of the known utterances. The first utterance is the closest-matching utterance and has a first plurality of words. The matching operation matches fewer than all of the first plurality of words in the received utterance, and the received utterance differs in a first particular way when compared with a first word in a first slot of the first utterance. The received utterance is sent to a mistranscription analyzer component, which increments the evidence that the received utterance contains a mistranscription. Once the accumulated evidence for the mistranscription exceeds a threshold, future received utterances containing the mistranscription are processed as if the first word had been recognized.

Description

Learning transcription errors of a speech recognition task
Technical field
The present disclosure relates generally to machine learning. More specifically, the disclosure relates to teaching a machine learning system to detect transcription errors in speech recognition tasks.
Background
Speech recognition is a computer technology that allows users to perform a variety of interactive computer tasks as an alternative to communicating through conventional input devices such as a mouse and keyboard. Some of these tasks include sending commands to the computer to perform a selected function, or transcribing speech into a written transcript for a computer application such as a spreadsheet or word processing application. Unfortunately, the speech recognition process is not error-free, and an important problem is correcting mistakes in the transcription, or "mistranscriptions". A mistranscription occurs when the speech recognition component of the computer incorrectly transcribes the acoustic signal of a spoken utterance. In an automatic speech recognition task, when a selected word is mistranscribed, a command may not be executed correctly or the speech may not be transcribed correctly. Mistranscriptions may be caused by one or more factors, for example because the user is a non-native speaker, because the user speaks carelessly, or because of background noise on the speech recognition system's channel.
One type of mistranscription is a substitution error, in which the speech recognition system replaces the spoken word with an incorrect word. Another type is an insertion error, in which the system recognizes "garbage" speech, for example breathing, background noise, or "uh", or interprets one word as two words. Yet another type of transcription error is a deletion error, in which a word that was spoken does not appear in the transcription. In some cases the deletion occurs because the speech recognition system, based on its dictionary, rejects the recognized phonemes as a word that does not exist. Alternatively, a deletion can result from incorrectly merging two words. For example, a user may say "nine trees" and the system recognizes the utterance as "ninety".
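The three error types can be illustrated with a simple word-level alignment between a reference utterance and its transcription. The following sketch is an illustration only, not part of the described system; it uses Python's standard difflib to align the two word sequences and label each difference as a substitution, insertion, or deletion.

```python
# Illustrative sketch: classify word-level transcription errors by aligning
# a reference utterance with its transcription (difflib is a crude stand-in
# for whatever alignment a real recognizer would use).
from difflib import SequenceMatcher

def classify_errors(reference: str, transcription: str):
    ref, hyp = reference.lower().split(), transcription.lower().split()
    errors = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "replace":
            errors.append(("substitution", ref[i1:i2], hyp[j1:j2]))
        elif op == "delete":
            errors.append(("deletion", ref[i1:i2], []))
        elif op == "insert":
            errors.append(("insertion", [], hyp[j1:j2]))
    return errors

# A merge such as "nine trees" -> "ninety" shows up as two reference words
# replaced by one, which the text above treats as a deletion case.
print(classify_errors("which way to the restroom", "which wake to the restroom"))
# [('substitution', ['way'], ['wake'])]
```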
Conventional approaches to handling mistranscriptions include manually reviewing the transcription to find the errors and correcting them with an input device such as a keyboard, or having the transcription system identify candidate errors and correct them through a dialogue with the user intended to correct them. For example, the system can ask the user over a speaker, "Did you say 'chicken'?" If the user says "no", the system can record the candidate mistranscription as an error. The number of mistranscriptions can also be reduced by improving the speech model for a specific user. As the system receives a larger number of speech samples from a specific user, either by having the user read from known transcripts or simply through the user's continued use of the system, the default acoustic model of the speech recognition system can adapt better to that user.
There remains a need to further improve computer-assisted speech recognition.
Summary of the invention
According to the disclosure, a method, apparatus, and computer program product are provided for identifying mistranscriptions generated by a speech recognition system. A set of known utterances is provided for use by the speech recognition system, each utterance consisting of a respective plurality of words. A received utterance is matched to a first one of the known utterances. The first utterance is the closest-matching utterance and has a first plurality of words. The matching operation matches fewer than all of the first plurality of words in the received utterance, and the received utterance differs in a first particular way when compared with a first word in a first slot of the first utterance. The received utterance is sent to a mistranscription analyzer component, which increments the evidence that the received utterance contains a mistranscription. Once the accumulated evidence for the mistranscription exceeds a threshold, future received utterances containing the mistranscription are treated as if the first word had been recognized.
The foregoing has outlined some of the more pertinent features of the disclosed subject matter. These features should be construed as merely illustrative. Many other beneficial results can be attained by applying the disclosed subject matter in a different manner or by modifying the invention as will be described.
Brief description of the drawings
For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 depicts an exemplary block diagram of a distributed data processing environment in which exemplary aspects of the illustrative embodiments may be implemented;
Fig. 2 is an exemplary block diagram of a data processing system in which exemplary aspects of the illustrative embodiments may be implemented;
Fig. 3 shows an architecture diagram of the components of a speech recognition system according to an embodiment of the invention;
Fig. 4 shows a flowchart of operating a speech recognition system according to an embodiment of the invention;
Fig. 5 is a flowchart of adding a class member based on user responses according to an embodiment of the invention;
Fig. 6 is a flowchart of adding evidence that a class member should be added to multiple classes according to an embodiment of the invention;
Fig. 7 is a diagram for detecting whether a new transcription error is a substitution, deletion, or insertion error according to an embodiment of the invention;
Fig. 8 is a flowchart showing the incrementing of evidence that a mistranscribed word from one class member is a legitimate substitution for the same word in another member of the same class;
Fig. 9 is a flowchart showing the incrementing of evidence that a mistranscribed word from one class is a legitimate substitution for the same word in a member of a different class; and
Fig. 10 is a flowchart of an embodiment of the invention in which the speech recognition system is used to obtain additional evidence.
Detailed description
At a high level, preferred embodiments of the invention provide a system, method, and computer program product for machine learning that appropriately recognizes and handles mistranscriptions from a speech recognition system. The invention uses a set of one or more known utterances to which the system responds when they are recognized by the speech recognition system. In a preferred embodiment, the utterances are arranged into groups or "classes"; when one of the utterances of a class is recognized, the class's system response action is performed. When the recognized utterances are members of a class, they are referred to as "class members". Each utterance is typically composed of multiple words, and the number of words can vary with the particular utterance. When a transcription matches some but not all of the words of a member utterance, for example word Y is recognized in place of word X in a given slot of the class member, the transcription is taken as some evidence of a mistranscription of word X as word Y in the member utterance. Once the evidence exceeds a threshold, future recognized utterances containing the mistranscription are treated as if the original word had been recognized. In some embodiments, a machine learning algorithm is used to determine the confidence that the recognized word Y in the recognized utterance is equivalent to word X.
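By way of a hedged illustration, the single-slot matching and evidence-threshold behavior described above might be organized as in the Python sketch below. The class name MistranscriptionLearner, the count-based threshold of three observations, and the equal-length word-for-word comparison are assumptions made for the sketch, not details taken from the embodiments.

```python
# Minimal sketch of single-slot mismatch detection and evidence accumulation.
from collections import defaultdict

class MistranscriptionLearner:
    def __init__(self, classes, threshold=3):
        self.classes = classes            # {class name: [member word lists]}
        self.threshold = threshold        # evidence needed before substituting
        self.evidence = defaultdict(int)  # (expected X, heard Y) -> count
        self.accepted = {}                # heard Y -> expected X

    def observe(self, transcription):
        # Apply already-accepted substitutions, then compare slot by slot.
        words = [self.accepted.get(w, w) for w in transcription.lower().split()]
        for name, members in self.classes.items():
            for member in members:
                if len(member) != len(words):
                    continue
                diffs = [(slot, x, y) for slot, (x, y)
                         in enumerate(zip(member, words)) if x != y]
                if not diffs:
                    return name                       # exact (or repaired) match
                if len(diffs) == 1:                   # single-slot mismatch
                    _, expected, heard = diffs[0]
                    self.evidence[(expected, heard)] += 1
                    if self.evidence[(expected, heard)] >= self.threshold:
                        self.accepted[heard] = expected
        return None

classes = {"restroom_directions": [["which", "way", "to", "the", "restroom"]]}
learner = MistranscriptionLearner(classes)
for _ in range(3):
    learner.observe("which wake to the restroom")
print(learner.observe("which wake to the restroom"))  # restroom_directions
```

After three observations of the same mismatch, the sketch treats "wake" as "way" and the fourth utterance resolves to the class, mirroring the threshold behavior described above.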
In embodiments of the invention, the mistranscription analyzer uses various rules to determine how much confidence should be added by an additional example of the same mistranscription. In some embodiments, the mistranscription analyzer uses a machine learning algorithm, since the amount of evidence a particular transcription provides depends on many factors, as described below. For example, the more often the system sees utterances in which an expected X is transcribed as Y, in this or other member utterances, the greater the evidence for the X-to-Y mistranscription. In addition, the greater the number of matched words in a particular utterance, for example a long utterance with a single suspect mistranscribed word, the more evidence of a mistranscription the analyzer assumes. As the evidence for a particular mistranscription becomes nearly certain, the classification system can process recognized utterances containing the mistranscription as if the original word had been recognized. One way embodiments of the invention accomplish this is to add a new utterance, in which one or more mistranscriptions replace the original words of an existing class member, as a class member of one or more utterance classes. Another way, used in other embodiments, is to record the mistranscription as a valid substitution for the original word, so that whenever it is recognized the system proceeds as if the original word had been recognized.
In the following description, the process of determining whether a new mistranscription or new utterance should be added for the system's use is generally described as incrementing evidence. Those skilled in the art will understand that in some embodiments the incremented evidence can be used in a confidence calculation, for example as part of a machine learning system. Thus, when the incremented evidence exceeds a threshold, the threshold can be an accumulated evidence threshold or a confidence threshold computed from the accumulated evidence. In an evidence-threshold or confidence-threshold calculation, each piece of evidence collected from a different mistranscription example can have a different weight or influence in the threshold calculation. In a preferred embodiment, the evidence for a mistranscription is incremented for each member utterance according to how closely the attributes of the received utterance match the attributes of the member utterance. In other embodiments, the evidence for a mistranscription (for example, recognizing word Y instead of word X) is incremented in a single location, so that once the threshold is exceeded for that mistranscription, the system processes any future received utterance containing the mistranscription as if the original word had been recognized.
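A hedged way to summarize this weighted accumulation is sketched below; the weight values and the threshold of 2.0 are arbitrary placeholders chosen for illustration, not quantities specified by the embodiments.

```python
# Hedged sketch: per-observation weights feeding a single threshold test.
def exceeds_threshold(weights, threshold=2.0):
    """weights: evidence weights w_k from individual observations of X -> Y."""
    return sum(weights) > threshold

# Three observations of the same candidate mistranscription, each weighted by
# how closely its received utterance matched the member utterance.
print(exceeds_threshold([0.9, 0.6, 0.7]))   # True: accumulated evidence 2.2 > 2.0
```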
Referring now to the drawings, and in particular to Figs. 1-2, exemplary diagrams of data processing environments are provided in which illustrative embodiments of the disclosure may be implemented. It should be appreciated that Figs. 1-2 are only exemplary and are not intended to assert or imply any limitation with regard to the environments in which aspects or embodiments of the disclosed subject matter may be implemented. Many modifications to the depicted environments may be made without departing from the spirit and scope of the present invention.
Referring now to the drawings, Fig. 1 depicts a pictorial representation of an exemplary distributed data processing system in which aspects of the illustrative embodiments may be implemented. Distributed data processing system 100 may include a network of computers in which aspects of the illustrative embodiments may be implemented. Distributed data processing system 100 contains at least one network 102, which is the medium used to provide communication links between the various devices and computers connected together within distributed data processing system 100. Network 102 may include connections such as wired or wireless communication links or fiber optic cables.
In the depicted example, server 104 and server 106 are connected to network 102 along with network storage unit 108. In addition, clients 110, 112, and 114 are also connected to network 102. These clients 110, 112, and 114 may be, for example, smartphones, tablet computers, personal computers, network computers, and the like. In the depicted example, server 104 provides data such as boot files, operating system images, and applications to clients 110, 112, and 114. In the depicted example, clients 110, 112, and 114 are clients of server 104. Distributed data processing system 100 may include additional servers, clients, and other devices not shown. One or more of the server computers may be a mainframe computer connected to network 102. The mainframe computer may be, for example, an IBM System z mainframe running the IBM z/OS operating system. Connected to the mainframe may be mainframe storage units and workstations (not shown). The workstations can be either personal computers connected directly to the mainframe through a bus, or console terminals connected directly to the mainframe via a display port.
In the depicted example, distributed data processing system 100 is the Internet, with network 102 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages. Of course, distributed data processing system 100 may also be implemented to include a number of different types of networks, such as an intranet, a local area network (LAN), a wide area network (WAN), and the like. As stated above, Fig. 1 is intended as an example and not as an architectural limitation on the different embodiments of the disclosed subject matter, and therefore the particular elements shown in Fig. 1 should not be considered limiting with regard to the environments in which the illustrative embodiments of the present invention may be implemented.
With reference now to Fig. 2, a block diagram of an exemplary data processing system is shown in which aspects of the illustrative embodiments may be implemented. Data processing system 200 is an example of a computer, such as client 114 in Fig. 1, in which computer-usable code or instructions implementing the processes of the illustrative embodiments of the disclosure may be located.
With reference now to Fig. 2, a block diagram of a data processing system is shown in which illustrative embodiments may be implemented. Data processing system 200 is an example of a computer, such as server 104 or client 110 in Fig. 1, in which computer-usable program code or instructions implementing the processes may be located for the illustrative embodiments. In this illustrative example, data processing system 200 includes communications fabric 202, which provides communications between processor unit 204, memory 206, persistent storage 208, communications unit 210, input/output (I/O) unit 212, and display 214.
Processor unit 204 serves to execute instructions for software that may be loaded into memory 206. Processor unit 204 may be a set of one or more processors or may be a multi-processor core, depending on the particular implementation. Further, processor unit 204 may be implemented using one or more heterogeneous processor systems, in which a main processor is present with secondary processors on a single chip. As another illustrative example, processor unit 204 may be a symmetric multi-processor (SMP) system containing multiple processors of the same type.
Memory 206 and persistent storage 208 are examples of storage devices. A storage device is any piece of hardware that is capable of storing information either on a temporary and/or a permanent basis. Memory 206, in these examples, may be, for example, a random access memory or any other suitable volatile or non-volatile storage device. Persistent storage 208 may take various forms depending on the particular implementation. For example, persistent storage 208 may contain one or more components or devices. For example, persistent storage 208 may be a hard drive, a flash memory, a rewritable optical disk, a rewritable magnetic tape, or some combination of the above. The media used by persistent storage 208 also may be removable. For example, a removable hard drive may be used for persistent storage 208.
Communications unit 210, in these examples, provides for communications with other data processing systems or devices. In these examples, communications unit 210 is a network interface card. Communications unit 210 may provide communications through the use of either or both physical and wireless communication links.
Input/output unit 212 allows for input and output of data with other devices that may be connected to data processing system 200. For example, input/output unit 212 may provide a connection for user input through a keyboard and mouse. Further, input/output unit 212 may send output to a printer. Further still, the input/output unit may provide connections to a microphone for audio input from a user and to a speaker to provide audio output from the computer. Display 214 provides a mechanism to display information to a user.
Instructions for the operating system and applications or programs are located on persistent storage 208. These instructions may be loaded into memory 206 for execution by processor unit 204. The processes of the different embodiments may be performed by processor unit 204 using computer-implemented instructions, which may be located in a memory such as memory 206. These instructions are referred to as program code, computer-usable program code, or computer-readable program code that may be read and executed by a processor in processor unit 204. The program code in the different embodiments may be embodied on different physical or tangible computer-readable media, such as memory 206 or persistent storage 208.
Program code 216 is located in a functional form on computer-readable media 218 that is selectively removable and may be loaded onto or transferred to data processing system 200 for execution by processor unit 204. Program code 216 and computer-readable media 218 form computer program product 220 in these examples. In one example, computer-readable media 218 may be in a tangible form, such as an optical or magnetic disc that is inserted or placed into a drive or other device that is part of persistent storage 208 for transfer onto a storage device, such as a hard drive that is part of persistent storage 208. In a tangible form, computer-readable media 218 also may take the form of persistent storage, such as a hard drive, a thumb drive, or a flash memory that is connected to data processing system 200. The tangible form of computer-readable media 218 is also referred to as computer recordable storage media. In some instances, computer recordable media 218 may not be removable.
Alternatively, program code 216 may be transferred to data processing system 200 from computer-readable media 218 through a communications link to communications unit 210 and/or through a connection to input/output unit 212. The communications link and/or the connection may be physical or wireless in the illustrative examples. The computer-readable media also may take the form of non-tangible media, such as communications links or wireless transmissions containing the program code. The different components illustrated for data processing system 200 are not meant to provide architectural limitations on the manner in which different embodiments may be implemented. The different illustrative embodiments may be implemented in a data processing system including components in addition to, or in place of, those illustrated for data processing system 200. Other components shown in Fig. 2 can be varied from the illustrative examples shown. As one example, a storage device in data processing system 200 is any hardware apparatus that may store data. Memory 206, persistent storage 208, and computer-readable media 218 are examples of storage devices in a tangible form.
In another example, a bus system may be used to implement communications fabric 202 and may be comprised of one or more buses, such as a system bus or an input/output bus. Of course, the bus system may be implemented using any suitable type of architecture that provides for a transfer of data between the different components or devices attached to the bus system. Additionally, a communications unit may include one or more devices used to transmit and receive data, such as a modem or a network adapter. Further, a memory may be, for example, memory 206 or a cache such as found in an interface and memory controller hub that may be present in communications fabric 202.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java™, Smalltalk, C++, C#, and Objective-C, and conventional procedural programming languages such as Python or C. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Those of ordinary skill in the art will appreciate that the hardware in Figs. 1-2 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent non-volatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in Figs. 1-2. Also, the processes of the illustrative embodiments may be applied to a multiprocessor data processing system other than the SMP system mentioned previously, without departing from the spirit and scope of the disclosed subject matter.
The techniques described herein may operate in conjunction with a standard client-server paradigm such as illustrated in Fig. 1, in which client machines communicate with an Internet-accessible Web-based portal executing on a set of one or more machines. End users operate Internet-connectable devices (for example, desktop computers, notebook computers, or Internet-enabled mobile devices) that are capable of accessing and interacting with the portal. Typically, each client or server machine is a data processing system such as illustrated in Fig. 2, comprising hardware and software, and these entities communicate with one another over a network, such as the Internet, an intranet, an extranet, a private network, or any other communications medium or link. A data processing system typically includes one or more processors, an operating system, one or more applications, and one or more utilities.
Although people often fail to correctly understand every word in a conversation, humans use the context of the conversation to help piece together what a misunderstood word should have been. Speech recognition mechanisms do not have the tools that humans use to make this kind of judgment. However, by observing the same repeated transcription errors (sometimes in combination with user behavior), a machine learning system can learn with confidence which word a mistranscription must refer to. Embodiments of the invention allow the system to learn based on individual users and environments as well as on user categories and environment types.
An environment in which embodiments of the invention may be implemented is shown in Fig. 3. The speech recognition system 303 receives speech samples 301 to be converted into computer-usable text or tokens. Part of the speech recognition system 303 is a classifier 304, such as the IBM Watson Natural Language Classifier or the Stanford Classifier. This component recognizes classes (alternatively, "utterances") 309 of queries, questions, or assertions that say the same thing. For example, there might be a class for a query asking directions to the nearest restroom. Such a request can take many forms, i.e., "Which way to the restroom?", "Where is the restroom?", "Which way to the lavatory?", and so on. In this case, "Which way to the restroom?", "Where is the restroom?", and "Which way to the lavatory?" are referred to as class examples 309a...309c. All of the examples have the same "intent". The classifier takes an utterance and attempts to match it to any of the known classes and class examples. The classifier 304 returns the highest-confidence class and its confidence 311. If the returned confidence exceeds a threshold set by the system, the system responds 313 with the specified response associated with the given class. Meanwhile, the mistranscription analyzer 312 iterates over each of the class members of the highest-matching class to find the closest-matching class member. If there is no exact match, then after enough evidence has accumulated, a new class member is added to the class in step 310. If a close class example is found, the system attempts to infer, in a process described in more detail below, which word (or words) may have been mistranscribed for a known word (or words). The word suspected of being mistranscribed is stored together with the mistranscribed word in data store 325. The data store 325 holds the misheard word (or words) and the correct word. In an alternative embodiment of the invention, the utterances are not organized into classes; instead, each utterance is associated with its own system response. Most of the description below describes embodiments in which the utterances are organized into classes as class examples or members. In some embodiments of the invention, the mistranscription analyzer 312 uses a machine learning algorithm to increment the amount of evidence for a particular mistranscription. As discussed below, depending on the rules used by the mistranscription analyzer 312, a particular mistranscription example can increment different amounts of evidence for different class members.
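The data flow among the Fig. 3 components can be sketched roughly as follows. This is an illustrative outline only: the function names, the lambda stand-ins, and the 0.7 threshold are assumptions, and the classifier is a placeholder rather than the Watson or Stanford classifiers named above.

```python
# Illustrative wiring of the Fig. 3 components (all names are assumptions).
def handle_speech(sample, recognize, classify, analyze, responses, threshold=0.7):
    """recognize: audio -> text; classify: text -> (class_name, confidence);
    analyze: forwards the recognized text to the mistranscription analyzer (312)."""
    text = recognize(sample)                       # speech recognition system 303
    top_class, confidence = classify(text)         # classifier 304 -> 311
    analyze(text, top_class, confidence)           # analyzer 312 / data store 325
    if confidence > threshold:                     # system-set threshold
        return responses.get(top_class)            # system response 313
    return None                                    # defer to clarification handling

# Tiny stand-ins just to show the call pattern.
resp = handle_speech(
    b"...audio...",
    recognize=lambda audio: "which way to the restroom",
    classify=lambda text: ("restroom_directions", 0.92),
    analyze=lambda text, cls, conf: None,
    responses={"restroom_directions": "Down the hall, on the left."},
)
print(resp)  # Down the hall, on the left.
```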
In embodiments of the invention, all of the components may reside in a single system. In other embodiments, some of the components may be distributed across different systems. For example, the speech sample may be acquired by a client system such as a smartphone; the speech recognition system 303, classifier 304, and class store 309 may reside on a server; and the system response 313 may be a voice response played back at the client, or a response performed at another system in the distributed network.
In a first phase of operation, the classifier 304 recognizes each class member and, if a class member is recognized, produces the appropriate system response 313. In many cases the system response will be speech generated by the system, such as an answer to a user's question. The system response 313 can also be a non-speech response, for example, retrieval and visual display of a window or web page requested by the user in a graphical user interface.
In embodiments of the invention, feedback on the system response is collected from the user. The feedback can take the form of additional speech samples, for example other similar questions, a negative response such as "that is not what I meant", or an additional response that implicitly indicates acceptance through the absence of any indication that the response was not accepted. Other user input can indicate acceptance or rejection of the response. For example, if the user asks the speech recognition system about an encyclopedic topic or about a web page being displayed by the system, and the user continues to interact with the system or continues to view the web page in an unsurprised manner, the system can interpret this action as acceptance of the response. When the classifier 304 cannot recognize the initial speech sample as a class member, in embodiments of the invention the speech recognizer 303 can generate a clarifying question to prompt the user to provide additional information and/or speech samples.
In the first phase of operation, the classifier 304 also sends the mistranscription analyzer 312 a message with the recognized speech 305 that did not match a recognized class member. In embodiments of the invention, the mistranscription analyzer 312 attempts to swap in different possible mistranscriptions obtained from the mistranscribed-word data store 325 and resubmits the text to the classifier 304, for example as a candidate class member.
In a second phase of operation, the mistranscription analyzer 312 adds class members to the existing set of classes for use by the classifier 304. The mistranscription analyzer 312 stores occurrences of a candidate class member, including the candidate mistranscription(s), in the candidate class to which the recognized speech is calculated to most likely belong. The more occurrences of the same candidate class member and the same candidate mistranscription(s) that are stored, the higher the confidence that the candidate class member belongs in the class and that the candidate mistranscription is an alternative form of a word in an existing class member. When a threshold is reached, the candidate class member is added as a recognized class member 311 of the class so that the classifier 304 can use it to produce the system response 313 to the user. In an alternative embodiment of the invention, the class store 309 is shared between the mistranscription analyzer 312 and the classifier 304. Once a candidate class member is added to a class as a new class member by the mistranscription analyzer 312, the classifier 304 will simply begin to use it.
As part of the second phase of operation, embodiments of the invention also include "expected class members" or "candidate class members" that the classifier 304 uses to recognize utterances. The mistranscription analyzer 312 has increased confidence in the candidate class member and the candidate mistranscription, and the expected class member is placed in the class to accelerate evidence accumulation. The mistranscription analyzer 312 computes an intermediate confidence that exceeds a first intermediate threshold but is below the second threshold required for the mistranscription to be added to the candidate class member as a recognized member of the class. The classifier 304 uses an expected or candidate class member that exceeds the intermediate threshold to produce the system response 313, as if it were a recognized class member, or to enter an interactive dialogue with the user, such as "I think you want to do X. Is that correct?", where X is the correct system response for that class. If the answer is affirmative, the user response can add evidence that the candidate class member and candidate mistranscription should be added to the class.
As a configuration step for operating the system described above, sets or "classes" of recognized utterances that the system can recognize and potentially respond to are created and stored in the class store 309. These can be considered "classes" in the sense of being integrated into a text classifier, such as those used in the Watson Natural Language Classifier or similar classifiers. A class is composed of a group of members that constitute various ways of phrasing essentially the same utterance. For example, a class with the template question "How do I get to the bathroom?" can have alternative examples: "Where is the bathroom?", "Which way to the bathroom?", "Which way to the toilet?", and so on. In some embodiments, manual creation of the classes is used, but as discussed further below, some embodiments provide a set of automated techniques that can extend the manually created classes. Also, as described herein, the mistranscription analyzer 312 provides new class members based on repeatedly occurring mistranscriptions.
When the speech recognition system translates a spoken utterance into words, it may mistranscribe one or more of the words in the utterance. As described above, this is referred to as a mistranscription or transcription error. In a preferred embodiment of the invention, if N-1 of the N words in the utterance "match" one of the class members, i.e., only one word fails to match the class member, the mistranscription analyzer treats this as evidence that the unmatched word is a mistranscription. A rule used by the mistranscription analyzer to increment evidence is that, for a given N-1 (the number of matching words), the larger N (the number of words in the member), the more evidence there is of a mistranscription. In embodiments of the invention, another rule is that the closer the phonetic similarity between the word in the class member and the candidate mistranscription, the more evidence there is of a mistranscription. A typical mistranscription will be a word that sounds similar to the expected word at the same position in the class member. In many cases, a candidate word or phrase that does not sound like the word or phrase in the class member would not otherwise be produced by the speech recognition system as a mistranscription.
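Under these two rules, a per-observation evidence weight might be computed roughly as in the sketch below. This is an assumption-laden illustration: the scaling is arbitrary, and a real system would use a phonetic metric (for example, a metaphone or phoneme-edit distance) where this sketch substitutes a crude character-level similarity.

```python
# Illustrative evidence weight: longer utterances with a single mismatched
# slot, and more similar word pairs, contribute more evidence. Character-level
# similarity stands in here for a true phonetic similarity measure.
from difflib import SequenceMatcher

def evidence_weight(n_words: int, expected: str, heard: str) -> float:
    length_factor = 1.0 - 1.0 / max(n_words, 2)                  # larger N -> closer to 1
    similarity = SequenceMatcher(None, expected, heard).ratio()  # crude proxy
    return length_factor * similarity

print(round(evidence_weight(5, "way", "wake"), 3))    # longer utterance, similar words
print(round(evidence_weight(3, "way", "office"), 3))  # short utterance, dissimilar words
```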
Thus, in these cases, suppose the speech recognition system transcribes "Which wake to the restroom?" instead of "Which way to the restroom?". The mistranscription analyzer regards this as some evidence that "wake" may be a mistranscription of "way". The more often examples of the same mistranscription occur, the more evidence is collected and the greater the mistranscription analyzer's confidence in the mistranscription. At some point, the mistranscription analyzer's confidence that this is a mistranscription exceeds the threshold. In embodiments of the invention that use utterance classes, the candidate class member is then added to the class, and the system performs the action, for example speaks the answer, just as if it had actually recognized "Which way to the restroom?".
In embodiments of the invention, at lower confidence levels the system can perform a second action, such as asking for clarification, for example saying "I didn't hear you; are you asking for directions to the restroom?". In other embodiments of the invention, there can be a first, lower, intermediate confidence threshold at which the candidate class member is added to the class as a "probationary" member. The system will collect user responses when it performs the appropriate action for the class member, and these responses are fed back to the mistranscription analyzer. Thus, user responses indicating acceptance of the system response increase the confidence in the mistranscription, and user responses indicating rejection of the system response decrease the confidence that the mistranscription belongs to the class member. As the confidence increases, because users continue to accept the system response, the confidence exceeds the second, higher level, and the candidate class member is converted from probationary status to permanent status as a class member of the class.
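The two-threshold, probationary behavior just described could be organized roughly as follows; the threshold values, status names, and feedback adjustments are assumptions made for the sketch.

```python
# Illustrative two-threshold handling of a candidate class member (assumed values).
THRESH_PROBATION = 0.5    # first, lower intermediate confidence threshold
THRESH_PERMANENT = 0.9    # second, higher threshold

def update_candidate(candidate, confidence, user_accepted=None):
    """candidate: dict with a 'status' field; user feedback adjusts confidence."""
    if user_accepted is True:
        confidence = min(1.0, confidence + 0.1)   # acceptance adds evidence
    elif user_accepted is False:
        confidence = max(0.0, confidence - 0.2)   # rejection removes evidence
    if confidence >= THRESH_PERMANENT:
        candidate["status"] = "permanent"
    elif confidence >= THRESH_PROBATION:
        candidate["status"] = "probationary"      # respond, but keep collecting feedback
    else:
        candidate["status"] = "deferred"          # ask a clarifying question instead
    return candidate, confidence

state, conf = {"status": "new"}, 0.55
for accepted in (True, True, True, True):
    state, conf = update_candidate(state, conf, accepted)
print(state["status"], round(conf, 2))  # permanent 0.95
```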
Embodiments of the invention use machine learning to identify mistranscriptions in the transcribed text produced by a speech recognition system. In embodiments of the invention, classes of utterances with similar meaning are used to interact with the user. A class contains a group of member utterances, each member utterance U_i consisting of a respective number N_i of words. When a transcription matches some but not all of these words (e.g., N_i - 1 of them) and a word Y replaces a word X in a given slot (e.g., the j-th slot) of the member utterance, the transcription is adopted as some evidence of a mistranscription of word X as word Y.
The more often the system sees an expected utterance in which X is transcribed as Y, in this or other known utterances, the greater the evidence of the mistranscription. As described above, one rule is that the larger the number of words N_i in a particular utterance with a single suspect mistranscribed word, the more evidence of a mistranscription is assumed.
Embodiments of the invention allow mistranscription confidence to be assisted by knowledge of the same or similar speakers. The same speaker, or similar speakers, are more likely to mispronounce words, or to use words, in the same or similar ways. One similarity measure that can be used is detecting that speakers have the same L1 language, i.e., are native speakers of the same first language. Another similarity measure is that the users share the same environment, such as a workplace or organization, and will tend to use the same vocabulary. In embodiments of the invention, different classes of member utterances are stored for different users or different user categories. Embodiments of the invention use user-based rules to add evidence of a mistranscription.
Embodiments of the invention allow mistranscription confidence to be assisted by knowledge of the same or a similar environment. Although there is some overlap with the users of the same workplace or organization described above, within that category the same user will use different words in different environments. The words used in a home environment, as opposed to a work environment, will often differ. In addition, certain types of mistranscription are more common in certain types of environments, for example insertion errors in a noisy environment. In embodiments of the invention, different classes of member utterances are stored for a specific environment or environment type. Embodiments of the invention use environment-based rules to add evidence of a mistranscription.
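One simple way to realize the per-user and per-environment storage just described is to key the evidence counters by user category and environment type, as in the hedged sketch below; the store layout and the category labels are invented examples.

```python
# Illustrative keying of mistranscription evidence by user category and
# environment type; the keys and categories shown are invented examples.
from collections import defaultdict

class ScopedEvidenceStore:
    def __init__(self):
        # (user_category, environment_type) -> {(expected, heard): count}
        self._store = defaultdict(lambda: defaultdict(int))

    def add(self, user_category, environment_type, expected, heard, weight=1):
        self._store[(user_category, environment_type)][(expected, heard)] += weight

    def evidence(self, user_category, environment_type, expected, heard):
        return self._store[(user_category, environment_type)][(expected, heard)]

store = ScopedEvidenceStore()
store.add("non-native-L1-spanish", "noisy-office", "way", "wake")
store.add("non-native-L1-spanish", "noisy-office", "way", "wake")
print(store.evidence("non-native-L1-spanish", "noisy-office", "way", "wake"))  # 2
print(store.evidence("native", "home", "way", "wake"))                         # 0
```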
Other embodiments of the invention allow the mistranscription confidence to be assisted by knowledge of whether the word and the suspected mistranscribed word have some degree of phonetic similarity.
In embodiments of the invention, a mistranscription of a given word in one class member is treated as evidence of a mistranscription of that word in other class members that use the same word. In these embodiments, evidence is accumulated in anticipation, before the mistranscription is actually encountered in those class members. For example, in a first class member the word "thorough" might be mistranscribed as "row", or vice versa. In embodiments of the invention, the system will accumulate evidence, preferably a smaller amount, for other class members that share the mistranscribed word(s). Embodiments of the invention can also accumulate evidence for these words in other classes. A rule in some embodiments directs that less evidence is accumulated in other classes than in companion member utterances of the same class.
Other embodiments of the invention allow, by using regular expressions, candidate class members whose word order varies from the potentially matching existing class member. Allowing a different word order in the candidate class member would mean a lower signal strength, i.e., a lower confidence that the candidate mistranscription is a real mistranscription.
In embodiments of the invention, the mistranscription analyzer also considers environment, geographic proximity, and situational awareness when interpreting confusable words. For example, an expression such as "Which way to the restroom?" can easily be confused with the expression "Which way to the restaurant?". If a person utters the expression while driving, a first environment or environment type, then the second expression is more likely to be the correct one, i.e., the person is looking for a restaurant. If the expression is uttered in a workplace, then the first expression is more likely to be correct.
Likewise, another pair of confusable statements is "let's move up" and "let's go to a movie". The two sentences can be distinguished based on context, such as to whom the corresponding sentence is said. An office manager is more likely to say the first to his or her staff, and the second is more likely to be said between two friends.
A flowchart of an embodiment of the invention is shown in Fig. 4. In step 401, the minimum number of transcription errors is set. Thus, in one embodiment, MISTRANSCRIP_MIN_SEEN is set to the minimum number of examples of the same mistranscription that must be observed before a candidate mistranscription or candidate class member is "recognized", i.e., before the system will take action on it. In step 403, a threshold for the confidence that a transcription error exists is set. Thus, MISTRANSCRIP_THRESH is the probability/confidence above which the system assumes that the recognized word is a mistranscription. Two thresholds are set because the amount of evidence collected for each candidate mistranscription example differs; each example of a mistranscription may have a different context and a different number of matching and non-matching words between the candidate class member and the existing class member.
Other values are set before an example is considered a candidate mistranscription, such as the number of transcription errors allowed per candidate class member. For example, in a preferred embodiment, MAX_FRACTION_MISTRANSCRIBED is set to the largest fraction of the words in an utterance that are allowed to be mistranscribed. If there are too many candidate mistranscriptions in a single candidate utterance, it is less likely that there is sufficient evidence that the recognized utterance is a class member. In alternative embodiments, different thresholds are set. A small configuration sketch combining these settings is shown below.
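The sketch below gathers the settings of steps 401-403 and the per-utterance limit into one place; the numeric values are placeholders chosen for illustration, since the text leaves them as implementation choices.

```python
# Placeholder configuration for the thresholds of Fig. 4 (values assumed).
MISTRANSCRIP_MIN_SEEN = 3              # minimum repeated observations before acting
MISTRANSCRIP_THRESH = 0.85             # confidence above which a word is treated as mistranscribed
MAX_FRACTION_MISTRANSCRIBED = 0.34     # largest fraction of an utterance allowed to differ

def is_actionable(times_seen, confidence, mismatched, total_words):
    return (times_seen >= MISTRANSCRIP_MIN_SEEN
            and confidence >= MISTRANSCRIP_THRESH
            and mismatched / total_words <= MAX_FRACTION_MISTRANSCRIBED)

print(is_actionable(times_seen=4, confidence=0.9, mismatched=1, total_words=5))  # True
print(is_actionable(times_seen=4, confidence=0.9, mismatched=3, total_words=5))  # False
```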
In step 405, the natural language classifier is initialized with the set of classes to which the system can respond. In embodiments of the invention, a set of synonym phrases is also initialized in the classifier. A synonym phrase is a set of equivalent words or phrases that can be substituted into the class members of a class. In this way, class members can be extended without listing every possible variant of a class member as a separate class member. Each class in the set of classes is associated with a so-called intent, and each intent is mapped to the response the system takes when the intent is recognized.
When an utterance is submitted to the system, step 407, in one embodiment the speech is recognized, and if the classifier determines that the utterance matches a class member, step 409, the appropriate response for that class is returned to the user, step 411. In other embodiments of the invention, rather than requiring an exact match, a confidence measure is used to determine whether the response should be returned. For example, the utterance is evaluated by the classifier, which provides the top T classes and their associated confidences CONF_i; CONF_0 is the class with the highest confidence. When CONF_0 exceeds the threshold THRESH, the classifier system responds with the system response that it knows to be associated with the intent of the associated class, step 411. For example, if the intent is "restroom directions", the system response will give directions to the restroom. In embodiments of the invention, if there is no exact match, or if the confidence does not exceed the threshold level, the system enters an inquiry mode in which it obtains more information from the user, for example by asking the user clarifying questions and analyzing the user utterances made in response, step 410.
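The dispatch of steps 407-411, including the fallback to inquiry mode in step 410, might look roughly like the following; the classifier interface and the THRESH value are assumptions for this sketch, and the clarifying question is only an example.

```python
# Illustrative dispatch over the top-T classifier results (steps 407-411).
THRESH = 0.75

def respond_or_inquire(ranked_classes, responses):
    """ranked_classes: list of (class_name, confidence), highest confidence first."""
    top_class, conf_0 = ranked_classes[0]
    if conf_0 > THRESH:
        return responses[top_class]                     # step 411: respond with the intent
    # Step 410: inquiry mode, ask a clarifying question about the best guess
    # instead of answering outright.
    return f"Did you mean something about '{top_class.replace('_', ' ')}'?"

responses = {"restroom_directions": "Down the hall, on the left."}
print(respond_or_inquire([("restroom_directions", 0.92)], responses))
print(respond_or_inquire([("restroom_directions", 0.41),
                          ("restaurant_directions", 0.38)], responses))
```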
Next, it is determined whether there is a candidate transcription error, step 412. In some embodiments, this step is performed by the classifier and the result is passed to the mistranscription analyzer. In other embodiments, all recognized utterances are passed to the mistranscription analyzer, which determines whether a transcription error occurred. The process of determining a mistranscription is discussed more fully below. If there is a transcription error, the transcription error and its position in the class member are stored, step 413. If not, the system returns to monitoring for other user utterances.
As the mistranscription analyzer receives new examples in which the mistranscription reappears in user utterances, it accumulates more and more evidence that the recognized word is a mistranscription. Each example may provide a different amount of evidence. If all words except one match the class member, in embodiments of the invention this counts for more evidence than an example in which several words of the utterance fail to match the class member. As the evidence accumulates, the confidence will meet the mistranscription threshold, step 415. Once the confidence exceeds the threshold, the class member containing the mistranscription is stored as an alternative form of the class member. In embodiments of the invention that use synonym phrases, the mistranscription can be stored as part of a synonym phrase for the class. Other embodiments use other means to store the mistranscription as a valid substitution for the original word(s).
Fig. 5 shows a flowchart of an embodiment of the invention for adding a new class member. In the illustrated embodiment, an intermediate confidence level for the candidate class member is used to send a system response to the user and to increment the evidence for the mistranscription. The classifier sends the mistranscription analyzer (not shown) a message containing the candidate mistranscription, the candidate class member, and the user response to the system response to the candidate mistranscription.
In step 501, a mistranscription is received from the classifier by the mistranscription analyzer. A user response is received in step 503. In step 505, the position of the mistranscription relative to the class member is identified. For example, whenever the classifier detects a class member and the word-for-word transcription matches N-k of the N words of the most closely matching class example, with N words also used in the transcription of the utterance, the non-matching word pairs can be denoted by (w_{i_j}, a_{i_j}), where there are k indexes {i_j}. In embodiments of the invention, (w_{i_j}, a_{i_j}) is stored in a hash of potential mistranscriptions. For example, a_{i_j} is a potential mistranscription of the word w_{i_j}, where word w_{i_j} appears in the class example. In some embodiments, the position of the mistranscription is part of the packet received from the classifier; in other embodiments, the determination is performed by the mistranscription analyzer.
In addition to the word pairs, in embodiments of the invention the mistranscription analyzer stores three additional values. In step 507, the system stores the number of times the classifier responded under the assumption of the mistranscription and the answer provided appeared to be accepted by the user. In step 509, the system stores the number of times the classifier responded under the assumption of the mistranscription, but the response provided appeared to be rejected by the user. In step 511, the system stores the number of times the mistranscription was detected, i.e., there was a direct correspondence between the word w_{i_j} -> a_{i_j} in the top class example and the substituted word in the transcription, but the system's confidence in the top class did not exceed THRESH (the intermediate threshold), so the system did not provide a response.
In one example, the system stores (w_{i_j}, a_{i_j}, 5, 2, 4), meaning that five assumed mistranscriptions led to responses that the user accepted, two assumed mistranscriptions led to responses that the user rejected, and four times a_{i_j} appeared to have been heard in place of word w_{i_j} but the classifier confidence in the top class did not exceed THRESH_1, so no system response was provided. In this illustrative embodiment, a general entry in the mistranscription hash is given by (w, a, CO, IN, NO), where w = the correct word, a = the potentially mistranscribed word, CO = the correct (accepted) count, IN = the incorrect (rejected) count, and NO = the no-response count.
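The mistranscription hash just described, with general entries (w, a, CO, IN, NO), can be modeled directly as in the sketch below. The dictionary layout and helper function are assumptions made for illustration; only the five fields and the example counts come from the description above.

```python
# Sketch of the mistranscription hash: (w, a) -> [CO, IN, NO], where
# w = correct word, a = potentially mistranscribed word, CO = responses the
# user accepted, IN = responses the user rejected, NO = detections with no response.
mistranscription_hash = {}

def record(w, a, outcome):
    counts = mistranscription_hash.setdefault((w, a), [0, 0, 0])
    counts[{"accepted": 0, "rejected": 1, "no_response": 2}[outcome]] += 1

for outcome in ["accepted"] * 5 + ["rejected"] * 2 + ["no_response"] * 4:
    record("way", "wake", outcome)

co, in_, no = mistranscription_hash[("way", "wake")]
print(co, in_, no)                    # 5 2 4, matching the example above
print(round(co / (co + in_), 2))      # correction confidence CO/(CO+IN) = 0.71
```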
The process continues until the top class confidence for the utterance exceeds the higher threshold level THRESH_2, step 513, meaning that the machine learning system has enough confidence to store the mistranscription as a substitute class member, or until there are no remaining word pairs (w_i, a_i) representing candidate mistranscriptions still to be substituted.
Note that in an utterance with 5-tuples (w_i, a_i, CO_i, IN_i, NO_i) there may be several words that are candidate mistranscriptions a_i. In this case, an iterative process is used in which the substitutions a_i -> w_i are performed in descending order of confidence in the correction, i.e., in descending order of CO_i / (CO_i + IN_i). The process continues as long as the number of mistranscribed words M and the total number of words N do not make M/N > MAX_FRACTION_MISTRANSCRIBED.
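The iterative replacement just described, substituting a_i -> w_i in descending order of CO_i/(CO_i+IN_i) while the fraction of replaced words stays within MAX_FRACTION_MISTRANSCRIBED, might be sketched as follows. It is an illustration under assumed data and a placeholder limit, not the claimed implementation.

```python
# Illustrative iterative correction ordered by CO_i/(CO_i+IN_i).
MAX_FRACTION_MISTRANSCRIBED = 0.34     # placeholder value

def apply_corrections(words, entries):
    """entries: {heard a_i: (correct w_i, CO_i, IN_i)}; returns corrected word list."""
    candidates = [(co / (co + in_), a, w) for a, (w, co, in_) in entries.items()
                  if a in words and (co + in_) > 0]
    corrected, replaced = list(words), 0
    for _, a, w in sorted(candidates, reverse=True):          # descending confidence
        if (replaced + 1) / len(words) > MAX_FRACTION_MISTRANSCRIBED:
            break                                              # M/N limit reached
        corrected = [w if token == a else token for token in corrected]
        replaced += 1
    return corrected

entries = {"wake": ("way", 5, 2), "rest": ("restroom", 3, 3)}
print(apply_corrections(["which", "wake", "to", "the", "rest"], entries))
# ['which', 'way', 'to', 'the', 'rest']: the second, lower-confidence
# substitution is skipped because it would exceed the allowed fraction.
```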
In different embodiments of the invention, class members can be stored specific to each user, to a user category, to a specific environment (such as a location), or to an environment type. The tradeoff of training class members for a specific user is that, with fewer samples of recognized speech from the single user, the training will be more accurate for the particular types of mistranscription that the user makes, but the machine learning may take longer to train than it would when training across multiple users. Training per user category has the advantage of more recognizable speech samples and therefore faster machine learning, but carries the risk that a given user's mistranscriptions may be wrongly attributed to other members of the user category, or that mistranscriptions specific to a particular user are handled incorrectly.
Training class members according to a specific environment or environment type, from which many more speech samples can be obtained than from a single user, is also useful. An environment type may include, for example, a noisy environment as opposed to a quiet environment. Alternatively, an environment type can be an environment in which certain activities occur, such as a car, home, work, or school. The system must classify the environment by type, and the environment type may require user input, for example confirmation of the environment type. Alternatively, the system can classify the environment using geolocation input and mapping data, or client data such as whether the speech utterance comes from a company desktop or a personally owned smartphone, whether the client device is moving, and the ambient background noise accompanying the speech sample. Class members can also be trained according to a specific context or location (such as the headquarters of company XYZ, or Joe's home).
Fig. 6 shows a flowchart of an embodiment of the invention in which class members are trained according to user and environment characteristics. In this embodiment of the invention, different classes, i.e., sets of class members, are trained and stored for respective users, and different classes are stored for different environments. In other embodiments, classes are stored for a specific user in a specific environment. The same existing class members are present in the different user and environment classes; as evidence accumulates, a new candidate class member (based on an existing class member) will have different amounts of evidence in the different classes. Therefore, when the respective thresholds are exceeded, the new candidate class member will become a class member in some classes but not in others. The figure also serves to show how context is used to compute the amount of evidence accumulated for a candidate mistranscription as a class member.
In step 601, new candidate class member data is received, that is, similar to the examples above, a candidate class member belonging to a respective class has been identified together with a candidate mistranscription. Steps 603-613 receive the data used to determine to which classes the new candidate member and candidate mistranscription belong, and to determine the context of the new candidate member and the new candidate mistranscription. In step 603, user information is received. The user information can take a variety of forms. In an embodiment of the invention, registration information identifies the user. As part of the registration process, the user has entered personal information such as name, gender, and ethnicity. In other embodiments of the invention, the user information is biometric data used to identify or classify the user. During speech recognition, the system can make assumptions from speech characteristics (such as voice quality) that an accent fits an ethnic group. Finally, the system can conduct an interactive dialog during the training phase to ask questions about identity, ethnicity, work role, and so on. In embodiments in which classes are trained and stored per user, the user information is used in step 605 to determine the user identity. In embodiments of the invention in which classes are trained and stored per user class, the user information is used in step 607 to determine the user class. A user class is a group, a group of members of an organization, or another group of users who are likely to use similar words (leading to similar mistranscriptions).
In step 609, environment information is received. In an embodiment of the invention, the environment information is geolocation information, optionally augmented by map information. In other embodiments, the environment information includes the ambient noise accompanying the speech capture, indicating a quiet or noisy environment, or movement information captured by GPS or an accelerometer, indicating a moving environment such as a vehicle. In some embodiments of the invention, the environment information is used in step 611 to uniquely determine an environment label. In other embodiments, the environment information is used in step 613 to determine an environment type. In certain unique environments, such as a workplace or a school, specific terminology is used, and therefore the same mistranscriptions can occur for different users. Environments of the same environment type, for example noisy environments, will be prone to the same mistranscriptions, for example ambient noise being incorrectly recognized as speech. In embodiments of the invention, the environment information can also be used to determine the user class, for example where a location is associated with a user class.
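As one possible reading of steps 609-613 (the thresholds, labels, and the map_lookup helper are illustrative assumptions rather than anything specified in the patent), the environment label and environment type might be derived from geolocation, ambient noise, and motion data roughly as follows:
```python
def classify_environment(geolocation=None, noise_db=None, speed_mps=None,
                         map_lookup=None, noisy_db=65.0, moving_mps=2.0):
    """Return (environment_label, environment_type) from the available signals."""
    # Step 611: a unique environment label, e.g. from geolocation plus map data.
    label = map_lookup(geolocation) if (map_lookup and geolocation) else None
    # Step 613: a coarser environment type from motion and ambient noise.
    if speed_mps is not None and speed_mps > moving_mps:
        env_type = "vehicle"   # movement reported by GPS or an accelerometer
    elif noise_db is not None and noise_db > noisy_db:
        env_type = "noisy"     # ambient noise captured with the speech
    else:
        env_type = "quiet"
    return label, env_type
```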
Although not shown, as mentioned in other embodiments above, question information can also be received, which is useful for determining the context in which the new candidate member was uttered. By comparing recent utterances with the current utterance, the system can determine the probability that the candidate mistranscription is a true mistranscription. Other data is received in other embodiments of the invention.
Once the system has determined to which classes the new candidate class member and candidate mistranscription belong, the system computes how strong the evidence is for each particular class, step 614. For example, if the candidate class member and candidate mistranscription were uttered by a specific user in a specific environment, then in an embodiment of the invention the evidence will be greater for the classes trained and stored for that specific user or that specific environment than for the class members of the user class and environment type to which the user and environment respectively belong. Being able to train class members simultaneously according to user, user class, environment, and environment type allows the system to have more samples and to be trained more quickly. It also allows the system to provide class members specially trained for a specific user in a specific environment, which will be most accurate for detecting mistranscriptions. That is, in embodiments of the invention, classes are trained for specific combinations of user and environment characteristics. The context of the candidate class member and candidate mistranscription, such as location and question information, is also used in embodiments of the invention to determine the amount of evidence accumulated for the class member in each class.
Next, in step 615, it is determined whether sufficient evidence has been collected for the mistranscription for the specific user. If so, in step 617, a new class member, having the mistranscription as a replacement of the original word in the class member, is added to the class. If not, in step 619, the accumulated evidence for the mistranscription for that user is incremented. The dotted line from step 617 indicates that the accumulated evidence for the user can still be incremented even when the evidence exceeds the threshold for the user.
Next, in step 621, it is determined whether sufficient evidence has been collected for the mistranscription for the specific environment. If so, in step 623, a new class member, having the mistranscription as a replacement of the original word in the class member, is added to the class. If not, in step 625, the accumulated evidence for the mistranscription for that environment is incremented. The dotted line from step 623 indicates that the accumulated evidence can still be incremented even when the evidence exceeds the threshold for the environment.
In the figure, for ease of description, only the decisions for the specific user and the specific environment are shown. However, in alternative embodiments, similar decisions are made for the user class to which the user belongs, for the environment type to which each environment belongs, and for the class for the specific user/environment combination, as illustrated in the sketch below.
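One way to picture the per-class bookkeeping of steps 614-625 is sketched below; the specificity weights are invented for illustration, since the patent only says that more specific classes (a particular user or environment) accumulate more evidence from the same example than broader classes (a user class or environment type):
```python
# Illustrative specificity weights: more specific classes gain evidence faster.
EVIDENCE_WEIGHTS = {
    "user_environment": 1.0,  # class for this specific user in this environment
    "user": 0.8,
    "environment": 0.8,
    "user_class": 0.5,
    "environment_type": 0.5,
}

def accumulate(evidence, context_classes, mistranscription, threshold):
    """evidence: dict mapping (class_kind, class_id, mistranscription) -> float.
    context_classes: dict mapping class_kind -> class_id for this utterance.
    Returns the classes whose accumulated evidence now exceeds the threshold."""
    promoted = []
    for kind, class_id in context_classes.items():
        key = (kind, class_id, mistranscription)
        evidence[key] = evidence.get(key, 0.0) + EVIDENCE_WEIGHTS.get(kind, 0.5)
        if evidence[key] >= threshold:
            promoted.append((kind, class_id))  # add the new class member to this class
    return promoted
```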
In embodiments of the invention, all classes are loaded for training. However, when the classifier is used to recognize whether a class member has been identified, in embodiments of the invention that, for example, identify the user and the environment, the classifier will use only the selected group of classes for the specific user and/or the specific environment. In a distributed environment, in which clients are used to collect speech samples from, and interact with, multiple individual users, only the classes trained from all of the users that are of most interest to the machine learning for a given user and environment may be used, allowing faster training and better discrimination.
In other embodiments of the invention, once a class has been trained for a specific environment/user combination, the mistranscription analyzer stops loading other classes for training. For example, once a user/environment combination reaches a desired confidence level, not necessarily a confidence level high enough to add the class member to the class, the other classes stop being trained in response to candidate mistranscriptions from that specific user/environment combination.
In alternative embodiments, one or more of the listed steps may not be performed. For example, where class members are stored according to user information only, the environment-related steps are not performed. Where class members are stored only for a single user, the user class steps are not performed.
Fig. 7 shows a process for storing a new candidate mistranscription. As described above, a mistranscription can be a substitution error, in which the speech recognition system replaces an uttered word with an incorrect word; an insertion error, for example where the system recognizes a "garbage" utterance such as a breath, ambient noise, or an "uh"; or a deletion error, in which one of the uttered words does not appear in the transcription. Each of these types of error may indicate a different mistranscription. Further, in embodiments of the invention, each type of mistranscription (substitution, insertion, deletion) is stored differently.
In step 700, a new candidate mistranscription is detected. The system first determines, in step 701, whether the mistranscription is a substitution error. If the mistranscription is a substitution error, the candidate class member will have the same number of words as the class member for which it is a potential substitute. If not, the system determines in step 703 whether the mistranscription is a deletion error. If the transcription is a deletion error, the candidate class member is missing one or more words relative to the existing class member. If the transcription error is not a deletion error, the system determines in step 705 whether it is an insertion error. For simplicity of explanation, only the tests for pure substitution errors, pure deletion errors, and pure insertion errors are shown. However, in alternative embodiments of the invention, other tests for other types of transcription error are performed. For example, there may be multiple transcription errors of the same or different types in a candidate class member, for example two substitution errors, or a substitution error and an insertion error.
Once the system determines which type of candidate mistranscription is present in the candidate class member, a transcription error notation of the appropriate type is used to track the evidence. In step 707, the substitution notation is used for substitution errors. The notation is discussed above in connection with Fig. 5. The positions of the mistranscriptions in the class member are identified as {i_j}, and the non-matching words are represented as (w_{i_j}, a_{i_j}) word pairs, together with the number of times the system responded and the response was accepted, the number of times the system responded and the response was rejected, and the number of times the mistranscription was detected but the system provided no response. In the illustrative embodiment, the general entry in the mistranscription hash is given by (w, a, CO, IN, NO), where w = the correct word, a = the potentially mistranscribed word, CO = the correct (accepted) count, IN = the error (rejected) count, and NO = the no-response count.
In step 709, the deletion notation is used. Since in this case the word is not present in the candidate class member, the word pair is designated as (w_{i_j}, 0_{i_j}) to indicate that there is no word in the candidate class member corresponding to the word w. The mistranscription hash entry in this case is given by (w, 0, CO, IN, NO).
Similarly, in step 711, if an insertion error is detected, the insertion notation is used. An example notation indicating an insertion is (0_{i_j}, w_{i_j}), and the associated mistranscription hash entry is given by (0, w, CO, IN, NO).
Once the transcription error evidence has been incremented into the evidence accumulated for that transcription error for the class member, step 713, the process ends, step 715.
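A simplified sketch of the classification in Fig. 7, covering only the pure single-error cases and using the string "0" as the placeholder from the (w, 0) and (0, w) notation, might look as follows (the alignment here is deliberately naive and is an assumption, not the patent's method):
```python
def error_notation(class_words, candidate_words):
    """Return the (w, a) pairs to record for a candidate mistranscription:
    (w, a) for a substitution, (w, "0") for a deletion, ("0", a) for an insertion."""
    n, m = len(class_words), len(candidate_words)
    if n == m:  # step 701: substitution error, same number of words
        return [(w, a) for w, a in zip(class_words, candidate_words) if w != a]
    if m < n:   # step 703: deletion error, words missing from the candidate
        missing = [w for w in class_words if w not in candidate_words]
        return [(w, "0") for w in missing]
    # step 705: insertion error, extra words present in the candidate
    inserted = [a for a in candidate_words if a not in class_words]
    return [("0", a) for a in inserted]
```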
Compared with the first utterance member, the mistranscription is a substitution, deletion, or insertion of a word in the received utterance; compared with the first slot in the first utterance member, the received utterance is changed in a first manner. As the evidence that the received utterance is a mistranscription made in the first manner at the first slot is incremented by the mistranscription analyzer, it will exceed a threshold, and a second utterance member is added to the group of utterance members for use by the speech recognition system. Compared with the first utterance member, the second utterance member uses the change at the first slot in the first utterance member that has so far been identified as a "mistranscription". Note that when the transcription error is an insertion or deletion error, the total number of slots in the resulting second utterance member can differ slightly, with an insertion error having more slots and a deletion error having fewer slots, but the change is still considered to be at the first slot of the first utterance member.
Fig. 8 is a flowchart showing the incrementing of evidence, that is, evidence that the mistranscribed word in one class member is a legal substitution for the same word in another class member of the same class. A single mistranscription found in a long class member in a single instance, i.e., where all of the other words match, is considered strong evidence of the mistranscription in a preferred embodiment of the invention. However, in embodiments of the invention, the evidence of the mistranscription in one class member is also some evidence, preferably less evidence, that the mistranscription is a valid substitution in other class members.
The process begins at step 800, where a candidate mistranscription is identified for a first class member in a class. Next, in step 801, the system determines whether the mistranscribed word is shared by another class member in the same class. If so, a series of decisions are made to determine how strong the evidence incremented for the mistranscribed word in the other class members should be. For example, in step 803, the system determines whether the mistranscription came from the same user. If so, it will be stronger evidence than a mistranscription from another user. As another example, if the mistranscription was received in the same environment, step 805, it will be stronger evidence than a mistranscription received in another environment. In addition, if mistranscriptions are received from two users who are considered to have the same first language (i.e., L1 language), that is regarded as stronger evidence than mistranscriptions received from individuals with different L1 languages. As described above, phonetic similarity between the word in the class member and the mistranscription is also evidence. Moreover, as described above, the number of correctly recognized words compared with the number of candidate mistranscriptions can be a factor in the amount of evidence to be added, but because this is a "second-hand" factor, it counts for less than the evidence of a mistranscription in the class member itself. Other decisions and tests, such as whether the users are in the same user class or whether the environments are of the same environment type, can be included in embodiments of the invention.
In step 807, the determined amount of evidence is incremented, that is, evidence that the mistranscribed word is a legal substitution in the other class member. If there are other class members, step 809, the process is repeated until there are no other class members for which to accumulate evidence. The process ends at step 811.
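The weighting of these factors is not quantified in the text, so the multipliers below are purely illustrative; the sketch only shows the direction of the rules in Fig. 8 (same user, same environment, same L1, and phonetic similarity all increase the amount, while second-hand evidence counts for less):
```python
def propagation_evidence(base, same_user, same_environment, same_l1,
                         phonetic_similarity, matched_words, candidate_errors):
    """Evidence to add to another class member that shares the mistranscribed word."""
    amount = base * 0.5                    # second-hand evidence counts for less
    if same_user:
        amount *= 1.5                      # step 803: same user, stronger evidence
    if same_environment:
        amount *= 1.25                     # step 805: same environment
    if same_l1:
        amount *= 1.25                     # same first (L1) language
    amount *= 0.5 + phonetic_similarity    # similarity score assumed in [0, 1]
    # More correctly recognized words relative to candidate errors adds weight.
    amount *= matched_words / max(matched_words + candidate_errors, 1)
    return amount
```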
Fig. 9 is a flowchart showing the incrementing of evidence, that is, evidence that a mistranscribed word from one class is a legal substitution for the same word in class members of a different class. The process starts at step 901, where the mistranscribed word from one class is determined, and the system determines whether it should be regarded as evidence of a mistranscription for class members in other classes. In one embodiment of the invention, a mistranscription from a class member of a different class is considered to be less evidence than a mistranscription from the same class. Nonetheless, it is still some evidence, because a specific user is likely to utter the same word in the same way regardless of the class in which the word occurs. Therefore, the system performs determinations similar to those described above. In step 903, the system determines whether the mistranscribed word appears in a class member of a new class, the new class being a class different from the class in which the mistranscription was detected. In step 905, the system determines whether the same user uttered the mistranscription. In step 907, the system determines whether the mistranscription was uttered in the same environment. Another decision is whether the same mistranscription has been uttered a threshold number of times. All of these factors are used to determine the amount of evidence that should be incremented, i.e., evidence that the mistranscribed word is a valid substitution in the other class members of the different class. Other tests can be used to determine whether to add evidence to a class member.
Next, in step 903, it is determined whether there is another class member to be checked. If so, the process returns to step 903. Next, in step 905, it is determined whether there is another class to be checked; if not, the process ends, step 917.
Fig. 10 shows a flowchart of an embodiment of the invention in which the speech recognition system is used to obtain additional evidence. In one embodiment of the invention, when a transcription matches N_i - k of the words and the remaining k words are known to be possible mistranscriptions, but some subset of the associated single-word mistranscriptions has never been seen, the system synthetically creates, via a text-to-speech subsystem, an audio stream of the utterance with only the incorrectly pronounced words, and feeds the stream to the speech recognition engine to check whether the mistranscription of the words is corrected. If it is corrected, additional evidence for the mistranscription can be accumulated; otherwise, no further evidence is accumulated. This embodiment of the invention exploits a property of speech recognition systems that use a sliding N-gram window. In such speech recognition systems, by means of the sliding N-gram window over their hidden Markov models, the engine automatically corrects some words that would otherwise be mistranscriptions if the words were transcribed one at a time. On the other hand, some speech recognition engines provide verbatim transcription, which is less accurate but faster than outputting a transcription of an utterance phrase using a sliding N-gram or other correction means. Verbatim transcription is typically used by systems that must respond immediately upon hearing a word and/or phrase and cannot wait for the pause that marks the completion of an utterance. Therefore, by pairing the verbatim speech recognition engine used for rapid system response with a sliding N-gram speech recognition engine, evidence can be accumulated for new class members for use by the verbatim engine.
Comparing the verbatim transcription with the transcription of an utterance provides examples of many possible mistranscriptions. For example, suppose there are two or more suspected mistranscriptions in one utterance, e.g., the sentence is of the form AA......XX......YY......BB, and the suspected correct version is AA......QQ......RR......BB, where QQ and RR are considered likely to be corrections of XX and YY, respectively. Assume, however, that the speech recognition engine has never recognized ....QQ......YY...... or ...XX......RR......, and all previously recognized utterances have only a single substitution, so the evidence is indirect. In this case, the system generates synthesized speech (using the text-to-speech system) and feeds AA......QQ......YY......BB and AA......XX......RR....BB into the speech recognition system that uses an N-gram window or other correction mechanism, to check whether the utterances are recognized. This would be evidence for the single and dual substitutions. If they are so recognized, there is additional evidence supporting the dual transcription; otherwise it is not supported.
Referring to Fig. 10, in step 1001, an utterance with multiple mistranscriptions is received. The group of mistranscriptions is listed. In step 1003, the next mistranscription is selected. In step 1005, the system generates a new synthetic utterance containing the next mistranscription for the specific class member. In a preferred embodiment, only a single mistranscription from the group of mistranscriptions is used in the new synthetic utterance. In step 1007, the newly generated synthetic utterance is sent to the speech recognition system that uses an N-gram window or other correction mechanism, to check whether the utterance is recognized, that is, corrected. A determination is made whether the speech recognition recognized the synthetic utterance as the class member, step 1009. If not, the method continues to check whether there is another mistranscription. If so, in step 1011, evidence is accumulated that the mistranscription exists for the class member and that the new class member should therefore be added. In step 1013, the system determines whether another candidate mistranscription exists in the utterance. If so, the process returns to step 1003. If not, the process ends.
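A sketch of the verification loop of Fig. 10 is shown below; text_to_speech, ngram_recognizer, and evidence_store stand for the text-to-speech subsystem, the sliding N-gram recognition engine, and the evidence bookkeeping, and are assumed interfaces rather than real library calls:
```python
def verify_with_synthesis(class_member_words, candidate_errors,
                          text_to_speech, ngram_recognizer, evidence_store):
    """For each candidate mistranscription, synthesize an utterance containing only
    that single error and check whether the N-gram engine corrects it back."""
    for index, wrong_word in candidate_errors:            # steps 1003 and 1013
        synthetic_words = list(class_member_words)
        synthetic_words[index] = wrong_word               # only one error per utterance
        audio = text_to_speech(" ".join(synthetic_words)) # step 1005
        transcript = ngram_recognizer(audio)              # step 1007
        if transcript.split() == list(class_member_words):
            # steps 1009/1011: the engine corrected the error, so accumulate evidence
            evidence_store.add(tuple(class_member_words), index, wrong_word)
```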
In other embodiments, the process is extended to other class members in the class and to other utterances containing words with the same possible mistranscriptions. In some of those embodiments, rules are used to accumulate less evidence for the other class members and other utterances than for the class member originally identified with the group of multiple mistranscriptions.
In embodiments of the invention, the system adds new class members to a class by identifying new phrases and then entering an interactive question mode with the user to determine that the new phrase belongs to one of the existing classes.
In embodiments of the invention, a system administrator will define a set of class members for a given class. Then, in addition to new class members added because of mistranscriptions, the system will add new class members to the class using synonymous phrases or the interactive question mode.
While a preferred operating environment and use cases have been described, the techniques herein may be used in any other operating environment in which it is desired to deploy the services.
As noted above, the above-described functionality may be implemented as a standalone approach, e.g., one or more software-based functions executed by one or more hardware processors, or it may be available as a managed service (including as a web service via a SOAP/XML or RESTful interface). The particular hardware and software implementation details described herein are merely for illustrative purposes and are not meant to limit the scope of the described subject matter.
More generally, the computing devices within the context of the disclosed subject matter are each data processing systems comprising hardware and software, and these entities communicate with one another over a network, such as the Internet, an intranet, an extranet, a private network, or any other communications medium or link. The applications on the data processing system provide native support for Web and other known services and protocols including, without limitation, support for HTTP, FTP, SMTP, SOAP, XML, WSDL, UDDI, and WSFL, among others. Information regarding SOAP, WSDL, UDDI, and WSFL is available from the World Wide Web Consortium (W3C), which is responsible for developing and maintaining these standards; further information regarding HTTP, FTP, SMTP, and XML is available from the Internet Engineering Task Force (IETF).
In addition to the cloud-based environment, the techniques described herein may be implemented in or in conjunction with various server-side architectures including simple n-tier architectures, web portals, federated systems, and the like.
Still more generally, the subject matter described herein can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. In a preferred embodiment, the module functions are implemented in software, which includes but is not limited to firmware, resident software, microcode, and the like. Furthermore, the interfaces and functionality can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain or store the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device). Examples of a computer-readable medium include a semiconductor or solid-state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W), and DVD. The computer-readable medium is a tangible, non-transitory item.
The computer program product may be a product having program instructions (or program code) to implement one or more of the described functions. Those instructions or code may be stored in a computer-readable storage medium in a data processing system after being downloaded over a network from a remote data processing system. Alternatively, those instructions or code may be stored in a computer-readable storage medium in a server data processing system and adapted to be downloaded over a network to a remote data processing system for use in a computer-readable storage medium within the remote system.
In a representative embodiment, the techniques are implemented in a special-purpose computing platform, preferably in software executed by one or more processors. The software is maintained in one or more data stores or memories associated with the one or more processors, and the software may be implemented as one or more computer programs. Collectively, this special-purpose hardware and software comprises the functionality described above.
In a preferred embodiment, the functionality provided herein is implemented as an adjunct or extension to an existing cloud computing deployment management solution.
While the foregoing describes a particular order of operations performed by certain embodiments of the invention, it should be understood that such order is exemplary, as alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, or the like. References in the specification to a given embodiment indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include that particular feature, structure, or characteristic.
Finally, while given components of the system have been described separately, one of ordinary skill will appreciate that some of the functions may be combined or shared in given instructions, program sequences, code portions, and the like.
Having described our invention, what we claim is as follows.

Claims (43)

1. A method for identifying mistranscriptions produced by a speech recognition system, comprising:
providing a set of known utterance members for use by the speech recognition system, each utterance member composed of a respective plurality of words;
matching a received utterance to a first utterance member of the set of known utterance members, the first utterance member being the closest matching utterance member having a first plurality of words, wherein fewer than the first plurality of words in the received utterance match the first plurality of words in the first utterance member, and the received utterance is changed in a first particular manner as compared to a first word in a first slot in the first utterance member;
sending the received utterance to a mistranscription analyzer component;
incrementing, by the mistranscription analyzer, evidence that the received utterance is evidence of a mistranscription; and
in response to the incremented evidence for the mistranscription exceeding a threshold, processing future received utterances containing the mistranscription as recognizing the first word.
2. The method according to claim 1, wherein the received utterance uses a first word in place of a second word used in the first slot in the first utterance member;
wherein the evidence incremented by the mistranscription analyzer is evidence that the received utterance is a mistranscription of the first word for the second word.
3. The method according to claim 1, further comprising:
in response to matching a second received utterance to the first utterance member, sending the second received utterance to the mistranscription analyzer, wherein the match is to the first plurality of words, and a second plurality of remaining words in the received utterance are candidate mistranscriptions;
generating, via a text-to-speech subsystem, a first synthetic utterance as an audio stream based on a first continuous group of words in which hypothesized correct replacements replace the second plurality of remaining words hypothesized to be mistranscriptions in the first utterance member;
sending the first synthetic utterance to a speech recognition engine having the correction feature described above; and
in response to the synthetic utterance being corrected to the first utterance member, accumulating evidence that the first continuous group of words is a mistranscription of the hypothesized correct replacements.
4. The method according to claim 1, wherein the mistranscription analyzer matches the received utterance to a respective utterance member having a different number of words and having a single first candidate mistranscription that results in greater evidence for the first candidate mistranscription, the first candidate mistranscription comprising one or more continuous words that do not completely match one or more continuous words in the respective utterance member.
5. The method according to claim 2, wherein the mistranscription analyzer uses a rule that, based on the received utterance matching the first utterance member, increments evidence for the mistranscription of the second word for the first word in a second utterance member that also includes the first word, wherein the amount of evidence incremented for the mistranscription in the second utterance member is less than the amount of evidence incremented for the mistranscription in the first utterance member.
6. The method according to claim 1, wherein the mistranscription analyzer increments evidence of the mistranscription in the first manner at the first slot based on a plurality of received utterances from a first user having the mistranscription in the first manner at the first slot.
7. The method according to claim 2, further comprising: incrementing, by the mistranscription analyzer, each time a received utterance is matched to the first utterance member, evidence that the received utterance is evidence of a mistranscription of the second word for the first word, so that more evidence is accumulated for the mistranscription with each received utterance in which the second word is transcribed in place of the first word.
8. The method according to claim 1, wherein the mistranscription analyzer uses a phonetically based rule in which a greater degree of phonetic similarity between the second word in the received utterance and the first word at the first slot in the first utterance member causes a greater amount of evidence to be incremented for each received utterance instance than if such phonetic similarity were not detected.
9. The method according to claim 1, wherein the mistranscription analyzer increments evidence of the mistranscription in the first manner at the first slot based on a plurality of received utterances from a first environment having the mistranscription in the first manner at the first slot.
10. An apparatus, comprising:
a processor;
computer memory holding computer program instructions executed by the processor for identifying mistranscriptions produced by a speech recognition system, the computer program instructions comprising:
program code operative to provide a set of known utterance members for use by the speech recognition system, each utterance member composed of a respective plurality of words;
program code operative to match a received utterance to a first utterance member of the set of known utterance members, the first utterance member being the closest matching utterance member having a first plurality of words, wherein fewer than the first plurality of words in the received utterance match the first plurality of words in the first utterance member, and the received utterance is changed in a first particular manner as compared to a first word in a first slot in the first utterance member;
program code operative to send the received utterance to a mistranscription analyzer component;
program code operative to increment, by the mistranscription analyzer, evidence that the received utterance is evidence of a mistranscription; and
program code operative, in response to the incremented evidence for the mistranscription exceeding a threshold, to process future received utterances containing the mistranscription as recognizing the first word.
11. The apparatus according to claim 10, wherein the received utterance uses a first word in place of a second word used in the first slot in the first utterance member;
wherein the evidence incremented by the mistranscription analyzer is evidence that the received utterance is a mistranscription of the first word for the second word.
12. The apparatus according to claim 10, further comprising:
program code operative, in response to matching a second received utterance to the first utterance member, to send the second received utterance to the mistranscription analyzer, wherein the match is to the first plurality of words, and a second plurality of remaining words in the received utterance are candidate mistranscriptions;
program code operative to generate, via a text-to-speech subsystem, a first synthetic utterance as an audio stream based on a first continuous group of words in which hypothesized correct replacements replace the second plurality of remaining words hypothesized to be mistranscriptions in the first utterance member;
program code operative to send the first synthetic utterance to a speech recognition engine having the correction feature described above; and
program code operative, in response to the synthetic utterance being corrected to the first utterance member, to accumulate evidence that the first continuous group of words is a mistranscription of the hypothesized correct replacements.
13. The apparatus according to claim 11, wherein the mistranscription analyzer increments evidence for the mistranscription of the second word for the first word, wherein the evidence for mistranscriptions from a first user who uttered the received utterance is greater than the evidence for mistranscriptions from other users of the apparatus.
14. The apparatus according to claim 11, wherein the mistranscription analyzer increments evidence for the mistranscription of the second word for the first word in the first utterance, wherein the evidence for mistranscriptions from a first environment in which the received utterance was received is greater than the evidence for mistranscriptions from other environments from which the apparatus receives utterances.
15. A computer program product in a non-transitory computer-readable medium for use in a data processing system, the computer program product holding computer program instructions executed by the data processing system for identifying mistranscriptions produced by a speech recognition system, the computer program instructions comprising:
program code operative to provide a set of known utterance members for use by the speech recognition system, each utterance member composed of a respective plurality of words;
program code operative to match a received utterance to a first utterance member of the set of known utterance members, the first utterance member being the closest matching utterance member having a first plurality of words, wherein fewer than the first plurality of words in the received utterance match the first plurality of words in the first utterance member, and the received utterance is changed in a first particular manner as compared to a first word in a first slot in the first utterance member;
program code operative to send the received utterance to a mistranscription analyzer component;
program code operative to increment, by the mistranscription analyzer, evidence that the received utterance is evidence of a mistranscription; and
program code operative, in response to the incremented evidence for the mistranscription exceeding a threshold, to process future received utterances containing the mistranscription as recognizing the first word.
16. The computer program product according to claim 15, wherein the received utterance uses a first word in place of a second word used in the first slot in the first utterance member;
wherein the evidence incremented by the mistranscription analyzer is evidence that the received utterance is a mistranscription of the first word for the second word.
17. The computer program product according to claim 15, further comprising:
program code operative, in response to matching a second received utterance to the first utterance member, to send the second received utterance to the mistranscription analyzer, wherein the match is to the first plurality of words, and a second plurality of remaining words in the received utterance are candidate mistranscriptions;
program code operative to generate, via a text-to-speech subsystem, a first synthetic utterance as an audio stream based on a first continuous group of words in which hypothesized correct replacements replace the second plurality of remaining words hypothesized to be mistranscriptions in the first utterance member;
program code operative to send the first synthetic utterance to a speech recognition engine having the correction feature described above; and
program code operative, in response to the synthetic utterance being corrected to the first utterance member, to accumulate evidence that the first continuous group of words is a mistranscription of the hypothesized correct replacements.
18. The computer program product according to claim 15, further comprising:
program code operative, in response to the incremented evidence of the mistranscription in the first manner at the first slot of the first utterance member exceeding an intermediate threshold lower than the first threshold, to add a second utterance member as a provisional member of the set of utterance members for use by the speech recognition system;
program code operative to increment, by the mistranscription analyzer, based on user acceptance of the system response for the first utterance member, evidence that the received utterance is evidence of the mistranscription in the first manner at the first slot.
19. The computer program product according to claim 16, wherein the mistranscription analyzer increments evidence for the mistranscription of the second word for the first word, wherein the evidence for mistranscriptions from users in a first user class of the user who uttered the received utterance is greater than the evidence for mistranscriptions from users in other user classes.
20. The computer program product according to claim 16, wherein the mistranscription analyzer increments evidence for the mistranscription of the second word for the first word in the first utterance, wherein the evidence for mistranscriptions from environments of a first environment type in which the received utterance was received is greater than the evidence for mistranscriptions from environments of other environment types.
21. A system for identifying mistranscriptions produced by a speech recognition system, comprising means for implementing the steps of any one of claims 1-9.
22. A method for identifying mistranscriptions produced by a speech recognition system, comprising:
providing a first class of utterance members for use by the speech recognition system, each utterance class member composed of a respective number of words, wherein the first class is defined by a first common meaning and a first common system response in the event that a class member of the first class is recognized;
in response to the speech recognition system matching a received utterance to a first class member of the first class, sending the received utterance to a mistranscription analyzer, wherein the received utterance includes a mistranscription as compared to the first class member;
incrementing, by the mistranscription analyzer, evidence that the received utterance is evidence of a mistranscription of the first class member;
in response to the incremented evidence of the mistranscription for the first class member exceeding a first threshold, adding a second class member based on the mistranscription of the first class member to the first class of utterance members; and
in response to recognizing a second received utterance matching the second class member, performing the common system response.
23. The method according to claim 22, further comprising: providing a plurality of classes of utterance members for use by the speech recognition system, each utterance class member composed of a respective number of words, wherein each respective class is defined by a respective common meaning and a respective common system response in the event that a class member of the respective class is recognized.
24. The method according to claim 22, wherein the mistranscription of the first class member is a mistranscribed word, the method further comprising: incrementing evidence for all class members in the class that contain the mistranscribed word, according to a rule that increments less evidence for class members other than the first class member.
25. The method according to claim 22, wherein the mistranscription of the first class member is a mistranscribed word, the method further comprising: incrementing evidence for all class members, other than the first class member, that contain the mistranscribed word, according to a rule that increments less evidence for class members that are not the first class member.
26. The method according to claim 22, further comprising:
providing a first plurality of classes, each class comprising a set of utterance members for use by the speech recognition system, each class of the first plurality of classes being for a respective user, wherein each class of the first plurality of classes is defined by the first common meaning and the first common system response in the event that a class member of the first plurality of classes is recognized;
training the class members of each respective class of the first plurality of classes according to the user from whom received utterances are received.
27. The method according to claim 22, further comprising:
providing a second plurality of classes, each class comprising a set of utterance members for use by the speech recognition system, each class of the second plurality of classes being for a respective environment, wherein each class of the second plurality of classes is defined by the first common meaning and the first common system response in the event that a class member of the second plurality of classes is recognized;
training the class members of each respective class of the second plurality of classes according to the environment from which received utterances are received.
28. The method according to claim 26, further comprising:
providing the same set of initial class members to each class of the first plurality of classes;
incrementing different amounts of evidence for the class members of the classes of respective users based on the same respective mistranscription example; and
in response to the incremented evidence for the mistranscription for a third class member for a third user exceeding the first threshold, while the incremented evidence for other class members for other users does not exceed the first threshold, adding the third class member to the utterance class members for the third user.
29. The method according to claim 27, further comprising:
providing the same set of initial class members to each class of the second plurality of classes;
incrementing different amounts of evidence for the class members of the classes of respective environments based on the same respective mistranscription example; and
in response to the incremented evidence for the mistranscription for a fourth class member for a first environment exceeding the first threshold, while the incremented evidence for other class members for other environments does not exceed the first threshold, adding the fourth class member to the utterance member class for the first environment.
30. The method according to claim 22, further comprising:
providing a third plurality of classes, each class comprising a set of utterance members for use by the speech recognition system, each class of the third plurality of classes being for a respective user class, wherein each class of the third plurality of classes is defined by a third common meaning and a third common system response in the event that a class member of the third plurality of classes is recognized;
training the class members of each respective class of the third plurality of classes according to the user class from which received utterances are received, wherein the training increments different amounts of evidence for the class members of the classes of respective user classes based on the same respective mistranscription example; and
in response to the incremented evidence for the mistranscription in a fifth class member for a first user class exceeding the first threshold, while the incremented evidence for other class members for other user classes does not exceed the first threshold, adding the fifth class member to the utterance class members for the first user class.
31. An apparatus, comprising:
a processor;
computer memory holding computer program instructions executed by the processor for identifying mistranscriptions produced by a speech recognition system, the computer program instructions comprising:
program code operative to provide a first class of utterance members for use by the speech recognition system, each utterance class member composed of a respective number of words, wherein the first class is defined by a first common meaning and a first common system response in the event that a class member of the first class is recognized;
program code operative, in response to the speech recognition system matching a received utterance to a first class member of the first class, to send the received utterance to a mistranscription analyzer, wherein the received utterance includes a mistranscription as compared to the first class member;
program code operative to increment, by the mistranscription analyzer, evidence that the received utterance is evidence of a mistranscription of the first class member;
program code operative, in response to the incremented evidence of the mistranscription for the first class member exceeding a first threshold, to add a second class member based on the mistranscription of the first class member to the first class of utterance members; and
program code operative, in response to recognizing a second received utterance matching the second class member, to perform the common system response.
32. The apparatus according to claim 31, further comprising:
program code operative to provide a plurality of classes of utterance members for use by the speech recognition system, each utterance class member composed of a respective number of words, wherein each respective class is defined by a respective common meaning and a respective common system response in the event that a class member of the respective class is recognized.
33. The apparatus according to claim 31, wherein the mistranscription of the first class member is a mistranscribed word, further comprising: incrementing evidence for all class members in the class that contain the mistranscribed word, according to a rule that increments less evidence for class members other than the first class member.
34. The apparatus according to claim 31, further comprising: computer code operative to provide a third plurality of classes, each class comprising a set of utterance members for use by the speech recognition system, each class of the third plurality of classes being for a respective user class, wherein different classes of the third plurality of classes increment different amounts of evidence based on the same mistranscription example.
35. The apparatus according to claim 33, further comprising: computer code operative to provide a fourth plurality of classes, each class comprising a set of utterance members for use by the speech recognition system, each class of the fourth plurality of classes being for a respective environment type, wherein different classes of the fourth plurality of classes increment different amounts of evidence based on the same mistranscription example.
36. The apparatus according to claim 33, further comprising: program code operative to increment the evidence for all of the classes for a candidate mistranscription, wherein different amounts of evidence are incremented for the respective classes for a particular mistranscription example according to the user and the environment.
37. A computer program product in a non-transitory computer-readable medium for use in a data processing system, the computer program product holding computer program instructions, executed by the data processing system, for identifying erroneous transcriptions generated by a speech recognition system, the computer program instructions comprising:
program code operative to provide a first utterance class member for use by the speech recognition system, each utterance class member being composed of a respective number of words, wherein the first class is defined by a first common meaning and by a first common system response in the event a member of the first class is recognized;
program code operative, in response to the speech recognition system matching a received utterance to the first class member of the first class, to send the received utterance to an erroneous transcription analyzer, wherein the received utterance includes an erroneous transcription as compared with the first class member;
program code operative to increment, by the erroneous transcription analyzer, evidence that the received utterance is an erroneous transcription of the first class member;
program code operative, in response to the incremented evidence for the erroneous transcription of the first class member exceeding a first threshold, to add a second class member to the utterance members of the first class based on the erroneous transcription of the first class member; and
program code operative, in response to recognizing a second received utterance that matches the second class member, to perform the common system response.
38. The computer program product of claim 37, further comprising:
program code operative to provide a plurality of utterance class members for use by the speech recognition system, each utterance class member being composed of a respective number of words, wherein each respective class is defined by a corresponding common meaning and by a corresponding common system response in the event a member of that respective class is recognized.
39. The computer program product of claim 37, wherein the erroneous transcription of the first class member is an erroneously transcribed word, further comprising: incrementing evidence for all class members in the class containing the erroneously transcribed word, according to a rule that less evidence is incremented for the class members other than the first class member.
40. The computer program product of claim 37, further comprising:
program code operative to identify a user or an environment of the received utterance; and
program code operative to select an appropriately trained class based on the identified user or environment of the speech recognition system.
41. The computer program product of claim 37, wherein the first utterance class member is for a specific user/environment combination.
42. The computer program product of claim 37, wherein evidence is accumulated according to the following rule: if the erroneous transcription is received from two users in a first user class, this is treated as stronger evidence of the erroneous transcription for the first user class than receiving a first erroneous transcription from a user in the first user class and a second erroneous transcription from a user in a second user class.
43. A system for identifying erroneous transcriptions generated by a speech recognition system, comprising means for implementing the steps of any one of claims 22-30.
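Claims 37-42 recite the core learning loop: evidence that a received utterance is an erroneous transcription of a known class member is accumulated, weighted by user class (claim 42), and once the evidence exceeds a threshold the erroneous transcription is promoted to a new class member that thereafter triggers the same common system response. The Python sketch below is an illustration only and does not appear in the patent; the threshold value, the weighting, and every identifier are assumptions.

# Illustrative sketch of the evidence-accumulation loop of claims 37-42.
# Not from the patent; all names, weights, and thresholds are hypothetical.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class UtteranceClass:
    """A class of utterances sharing one common meaning and one common
    system response (claim 37)."""
    meaning: str
    response: str
    members: set = field(default_factory=set)
    evidence: dict = field(default_factory=lambda: defaultdict(float))

THRESHOLD = 3.0  # stands in for the "first threshold" of claim 37

def weight(same_user_class: bool) -> float:
    """Claim 42: repeated sightings from users in the same user class count as
    stronger evidence than sightings spread across different user classes."""
    return 1.0 if same_user_class else 0.5

def handle_utterance(text: str, cls: UtteranceClass,
                     same_user_class: bool = True) -> None:
    if text in cls.members:
        # A recognized member (including members learned from past errors)
        # triggers the common system response.
        print(cls.response)
        return
    # The recognizer matched the utterance to this class but the text differs
    # from the stored members: treat it as a candidate erroneous transcription
    # and increment its evidence.
    cls.evidence[text] += weight(same_user_class)
    if cls.evidence[text] > THRESHOLD:
        # Enough evidence: promote the erroneous transcription to a second
        # class member so future occurrences receive the same response.
        cls.members.add(text)

# Hypothetical usage: "wreck a nice beach" repeatedly mis-transcribed for the
# stored member "recognize speech".
speech_cls = UtteranceClass(meaning="dictation request",
                            response="starting dictation",
                            members={"recognize speech"})
for _ in range(4):
    handle_utterance("wreck a nice beach", speech_cls)
handle_utterance("wreck a nice beach", speech_cls)  # now prints the response

In this sketch the fourth occurrence pushes the evidence past the threshold, and the fifth occurrence is handled as a recognized member of the class.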
CN201910000917.4A 2018-01-07 2019-01-02 Method and system for identifying erroneous transcription generated by a speech recognition system Active CN110021295B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US15/863937 2018-01-07
US15/863938 2018-01-07
US15/863,937 US10593320B2 (en) 2018-01-07 2018-01-07 Learning transcription errors in speech recognition tasks
US15/863,938 US10607596B2 (en) 2018-01-07 2018-01-07 Class based learning for transcription errors in speech recognition tasks

Publications (2)

Publication Number Publication Date
CN110021295A true CN110021295A (en) 2019-07-16
CN110021295B CN110021295B (en) 2023-12-08

Family

ID=67188728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910000917.4A Active CN110021295B (en) 2018-01-07 2019-01-02 Method and system for identifying erroneous transcription generated by a speech recognition system

Country Status (1)

Country Link
CN (1) CN110021295B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11557288B2 (en) 2020-04-10 2023-01-17 International Business Machines Corporation Hindrance speech portion detection using time stamps

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6208964B1 (en) * 1998-08-31 2001-03-27 Nortel Networks Limited Method and apparatus for providing unsupervised adaptation of transcriptions
CN1841498A (en) * 2005-03-30 2006-10-04 国际商业机器公司 Method for validating speech input using a spoken utterance
CN101031913A (en) * 2004-09-30 2007-09-05 皇家飞利浦电子股份有限公司 Automatic text correction
CN102915733A (en) * 2011-11-17 2013-02-06 微软公司 Interactive speech recognition
CN103035240A (en) * 2011-09-28 2013-04-10 苹果公司 Speech recognition repair using contextual information
US20160239259A1 (en) * 2015-02-16 2016-08-18 International Business Machines Corporation Learning intended user actions

Also Published As

Publication number Publication date
CN110021295B (en) 2023-12-08

Similar Documents

Publication Publication Date Title
US11341335B1 (en) Dialog session override policies for assistant systems
US11983674B2 (en) Automatically determining and presenting personalized action items from an event
JP6538779B2 (en) Speech dialogue system, speech dialogue method and method for adapting a speech dialogue system
US11037553B2 (en) Learning-type interactive device
WO2021242399A1 (en) Automated meeting minutes generator
US8209182B2 (en) Emotion recognition system
US11211046B2 (en) Learning transcription errors in speech recognition tasks
US11861315B2 (en) Continuous learning for natural-language understanding models for assistant systems
KR20160089152A (en) Method and computer system of analyzing communication situation based on dialogue act information
KR101615848B1 (en) Method and computer program of recommending dialogue sticker based on similar situation detection
AU2019219717B2 (en) System and method for analyzing partial utterances
JP2006039575A (en) Method and apparatus for natural language call routing using confidence score
US20070294122A1 (en) System and method for interacting in a multimodal environment
Schmitt et al. Towards adaptive spoken dialog systems
Bohus Error awareness and recovery in conversational spoken language interfaces
Nuha et al. Firefly algorithm for log-likelihood optimization problem on speech recognition
Oneață et al. Multimodal speech recognition for unmanned aerial vehicles
CN110021295A (en) Learn the transcription error of voice recognition tasks
Jain et al. Comparative analysis and development of voice-based chatbot system for differently-abled
KR20110075632A (en) Method of statistical dialog management policy for multi-goal domains
JP5281527B2 (en) Interactive personality feature determination apparatus and method, and program
Schmitt et al. A statistical approach for estimating user satisfaction in spoken human-machine interaction
EP3754655A1 (en) Video analysis
Griol et al. Modeling users emotional state for an enhanced human-machine interaction
US10607596B2 (en) Class based learning for transcription errors in speech recognition tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant