CN110148416A - Speech recognition method, apparatus, device and storage medium - Google Patents
Speech recognition method, apparatus, device and storage medium
- Publication number
- CN110148416A CN110148416A CN201910327337.6A CN201910327337A CN110148416A CN 110148416 A CN110148416 A CN 110148416A CN 201910327337 A CN201910327337 A CN 201910327337A CN 110148416 A CN110148416 A CN 110148416A
- Authority
- CN
- China
- Prior art keywords
- recognition result
- speech recognition
- target voice
- result
- error correction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
Abstract
This application provides a speech recognition method, apparatus, device, and storage medium. The method includes: obtaining speech data to be recognized; sending the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1; and selecting a target speech recognition result from the n results according to the feature information of the n results. In the technical solution provided by this application, the results of multiple speech recognition engines are used as references and the best result is chosen from among them, which improves the accuracy of the recognition result.
Description
Technical field
The embodiments of this application relate to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, device, and storage medium.
Background
Speech recognition technology converts human speech into text and is widely used in all kinds of artificial intelligence products, such as intelligent dialogue robots, smart speakers, and intelligent translation devices.
The general process of speech recognition is as follows: a speech recognition device obtains the speech data input by a user and sends the speech data to a speech recognition engine; the engine recognizes the speech data and feeds the speech recognition result back to the device, which then outputs the result.
At present, speech recognition relies on a single speech recognition engine. Because that engine is a general-purpose platform, its recognition performance in certain specific domains is relatively poor, making the recognition result inaccurate.
Summary of the invention
The embodiments of this application provide a speech recognition method, apparatus, device, and storage medium, which can be used to solve the problem in the related art that a speech recognition engine performs relatively poorly in certain specific domains, making the recognition result inaccurate. The technical solution is as follows:
In one aspect, an embodiment of this application provides a speech recognition method, the method comprising:
obtaining speech data to be recognized;
sending the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1; and
selecting a target speech recognition result from the n speech recognition results according to the feature information of the n results, wherein the feature information of a speech recognition result indicates both the degree of adaptation between the engine that output the result and the speech data to be recognized, and the credibility of the words the result contains.
In another aspect, an embodiment of this application provides a speech recognition apparatus, the apparatus comprising:
a data acquisition module, configured to obtain speech data to be recognized;
a data transmission module, configured to send the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1; and
a result selection module, configured to select a target speech recognition result from the n speech recognition results according to the feature information of the n results, wherein the feature information of a speech recognition result indicates both the degree of adaptation between the engine that output the result and the speech data to be recognized, and the credibility of the words the result contains.
In another aspect, an embodiment of this application provides a computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the speech recognition method described in the above aspect.
In another aspect, an embodiment of this application provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the speech recognition method described in the above aspect.
In another aspect, an embodiment of this application provides a computer program product which, when executed by a processor, implements the speech recognition method described above.
The technical solutions provided by the embodiments of this application may include the following beneficial effects:
Speech data to be recognized is sent to multiple speech recognition engines for recognition, multiple speech recognition results are obtained, and one of them is selected as the target speech recognition result according to the feature information of the results. In the related art, speech recognition relies on a single general-purpose engine whose recognition performance in certain specific domains is relatively poor; in the technical solution provided by this application, the results of multiple engines are used as references and the best result is chosen from among them, which improves the accuracy of the recognition result.
Brief description of the drawings
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of this application;
Fig. 2 is a schematic diagram of a complete speech recognition process of this application;
Fig. 3 is a flowchart of a speech recognition method provided by an embodiment of this application;
Fig. 4 is a flowchart of a speech recognition method provided by another embodiment of this application;
Fig. 5 is a block diagram illustrating an error-correction rewriting system;
Fig. 6 is a block diagram illustrating another error-correction rewriting system;
Fig. 7 is a schematic diagram illustrating a second rewriting layer and its rewriting process;
Fig. 8 is a block diagram of a speech recognition apparatus provided by an embodiment of this application;
Fig. 9 is a block diagram of a speech recognition apparatus provided by an embodiment of this application;
Fig. 10 is a schematic structural diagram of a computer device provided by an embodiment of this application.
Detailed description of embodiments
To make the purposes, technical solutions, and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, it shows a schematic diagram of an implementation environment provided by an embodiment of this application. The implementation environment may include: a terminal 10, a background server 20, and a speech recognition server 30.
In the embodiments of this application, the terminal 10 may be equipped with a voice acquisition device, such as a microphone, a microphone array, or a transmitter, for obtaining the speech data input by a user. The background server 20 provides background services for the terminal 10.
Optionally, applications with a voice input function may be installed on the terminal 10, such as an instant messaging application, a voice input method application, or a voice assistant. At runtime, these applications call the voice acquisition device to collect the speech data input by the user.
Optionally, the background server 20 may also be the background server of the application with the voice input function.
In one possible embodiment, the application with the voice input function can not only collect the speech data input by the user but also recognize it.
In another possible embodiment, the application with the voice input function can only collect the speech data input by the user, and the collected speech data needs to be sent, via the background server 20, to the speech recognition server 30 for recognition; after recognition is completed, the speech recognition server 30 feeds the speech recognition result back to the terminal 10, via the background server 20, for output. The embodiments of this application are described using the example in which the speech recognition server 30 recognizes the speech data input by the user.
The terminal 10 may be an electronic device capable of voice interaction with the user, such as a smartphone, a tablet computer, a PC (Personal Computer), an intelligent robot, a smart TV, or a smart speaker.
In the embodiments of this application, a speech recognition engine runs on the speech recognition server 30. Optionally, different speech recognition servers may run different speech recognition engines; one speech recognition server may run a single speech recognition engine or multiple different speech recognition engines, which is not limited in the embodiments of this application.
The background server 20 and the speech recognition server 30 may each be one server, a server cluster composed of multiple servers, or a cloud computing service center.
The terminal 10 communicates with the background server 20 through a network, and the background server 20 communicates with the speech recognition server 30 through a network.
It should be noted that in some embodiments the number of speech recognition servers 30 is one, and that server runs multiple different speech recognition engines; in other embodiments there are multiple speech recognition servers 30, each running one speech recognition engine, with the engines differing from server to server.
In addition, the technical solutions provided by the embodiments of this application are applicable to speech recognition in many different languages, such as Chinese, English, French, German, Japanese, and Korean. The embodiments of this application are mainly described with respect to Chinese, but this does not limit the technical solutions of this application.
It should be noted that the technical solutions provided by the embodiments of this application can be applied to all kinds of artificial intelligence products, with application scenarios including but not limited to household, in-vehicle, and gaming scenarios.
Referring to Fig. 2, it illustrates a schematic diagram of a complete speech recognition process of this application. The user inputs speech data to be recognized through the voice acquisition device configured on the terminal 10 (such as an intelligent robot, smart TV, or smart speaker). The terminal 10 then sends the speech data, via the background server 20, to multiple speech recognition engines, such as speech recognition engines A, B, and C, and obtains the result of each engine, such as speech recognition results A, B, and C. These speech recognition engines run on the speech recognition server 30. The background server 20 can then select a target speech recognition result from the multiple results, and further perform error-correction rewriting on the target result to obtain a corrected target speech recognition result. During both target selection and error-correction rewriting, the history interaction log, the domain knowledge graph, user characteristics, and other content are taken into account. Finally, the background server 20 feeds the corrected target speech recognition result back to the terminal 10, so that the terminal 10 can respond to the user through its interactive system based on the corrected result.
In the following, the technical solutions of this application are introduced and explained through several embodiments.
Referring to Fig. 3, it shows a flowchart of a speech recognition method provided by an embodiment of this application. In the embodiments of this application, the method is mainly described as being applied to the background server in the implementation environment shown in Fig. 1. The method may include the following steps:
Step 301: obtain speech data to be recognized.
The speech data to be recognized refers to speech data input by a user. When the user wants to interact with the terminal by voice, the user can speak directly into the voice acquisition device on the terminal; correspondingly, the terminal obtains what the user says through the voice acquisition device as the speech data to be recognized.
Optionally, the terminal may collect the speech data to be recognized upon receiving a speech recognition instruction. The instruction may be triggered by the user through a specified operation, which may include a click operation, a slide operation, and the like; the embodiments of this application do not limit this.
For example, a speech recognition option may be provided on the terminal. When the user wants to interact with the terminal by voice, the user can click the option to trigger the speech recognition instruction. After receiving the instruction, the terminal obtains the speech data to be recognized.
Step 302: send the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1.
After obtaining the speech data to be recognized, the terminal can send it to multiple speech recognition engines for recognition. Correspondingly, after receiving the speech data to be recognized, each speech recognition engine converts it into text, that text being the speech recognition result.
The n speech recognition engines may run on one server or on multiple servers. The n engines have different corresponding characteristics, which may include domain characteristics, history characteristics, gender characteristics, regional characteristics, and the like.
In the embodiments of this application, the n speech recognition engines are general-purpose speech recognition engines.
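The fan-out of step 302 might be sketched as follows. The engine names and the `recognize_with_engine` stub are illustrative assumptions, not part of the patent; a real system would call each engine's own API over the network.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_with_engine(engine_name: str, audio: bytes) -> str:
    """Stand-in for a real engine call (e.g. an HTTP request to the engine)."""
    # A real implementation would send `audio` to the engine's endpoint
    # and return the decoded text hypothesis.
    return f"hypothesis from {engine_name}"

def fan_out(audio: bytes, engines: list[str]) -> dict[str, str]:
    """Send the same audio to every engine in parallel and collect n results."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(recognize_with_engine, name, audio)
                   for name in engines}
        return {name: f.result() for name, f in futures.items()}

results = fan_out(b"\x00\x01", ["engine_A", "engine_B", "engine_C"])
```

Issuing the n requests concurrently keeps the end-to-end latency close to that of the slowest single engine rather than the sum of all of them.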
Step 303: select a target speech recognition result from the n speech recognition results according to the feature information of the n results.
After the n speech recognition results output by the n engines are obtained, because the n engines have different confidence for different recognition scenarios, the accuracy of the n results may differ; therefore one of the n results is selected as the target speech recognition result.
The feature information of a speech recognition result indicates both the degree of adaptation between the engine that output the result and the speech data to be recognized, and the credibility of the words the result contains. The degree of adaptation reflects how accurately an engine recognizes the speech data to be recognized: the higher the adaptation, the more accurately the engine that output the result recognizes that data. For example, when the speech data to be recognized is "the battlefield is dominated by my people", if the adaptation between speech recognition engine A and the data is lower than the adaptation between engine B and the data, this indicates that engine B recognizes the speech data to be recognized more accurately.
The credibility reflects the degree of overlap between the words a speech recognition result contains and the words in a predefined dictionary: the higher the credibility, the greater the overlap, indicating that the words in the result are more accurate. For example, speech recognition result A is "你不和吕布谁厉害" ("Ni-bu and Lü Bu, who is stronger") and result B is "李白和吕布谁厉害" ("Li Bai and Lü Bu, who is stronger"), while the predefined dictionary contains the words "李白" (Li Bai) and "吕布" (Lü Bu). Result B overlaps with the words in the predefined dictionary, indicating that the words in result B are more accurate.
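The credibility signal described above can be sketched minimally, under the assumption that a hypothesis has already been segmented into words: credibility is taken here as the fraction of hypothesis words found in the predefined dictionary. The segmentation and the two-entry dictionary are illustrative, not the patent's actual scoring.

```python
def credibility(tokens: list[str], dictionary: set[str]) -> float:
    """Fraction of hypothesis tokens that appear in the predefined dictionary."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in dictionary)
    return hits / len(tokens)

domain_dict = {"李白", "吕布"}                 # e.g. game-character names
hyp_a = ["你不", "和", "吕布", "谁", "厉害"]   # "李白" misrecognized as "你不"
hyp_b = ["李白", "和", "吕布", "谁", "厉害"]   # correct hypothesis
assert credibility(hyp_b, domain_dict) > credibility(hyp_a, domain_dict)
```

Here hypothesis B scores higher because two of its words hit the dictionary versus one for hypothesis A, matching the overlap intuition in the example above.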
Details of the above feature information are introduced in the Fig. 4 embodiment below and are not repeated here.
Optionally, selecting the target speech recognition result from the n results according to the feature information of the n results comprises: calculating a confidence score for each of the n speech recognition results according to its feature information; and selecting, from the n results, the result with the highest confidence score as the target speech recognition result.
Calculating the confidence scores of the n speech recognition results can be implemented by a machine learning model. The machine learning model may be a Markov continuous speech recognition model, a neural network model, an SVM (Support Vector Machine), or the like; the embodiments of this application are not limited in this respect.
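The scoring-and-selection step can be sketched as follows. A hand-weighted linear scorer stands in for the trained model (Markov model, neural network, SVM, and so on); the feature names, weights, and scores are made-up placeholders for illustration only.

```python
def confidence(features: dict[str, float], weights: dict[str, float]) -> float:
    """Toy linear scorer: weighted sum of the feature values."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def select_target(hypotheses: dict[str, dict[str, float]],
                  weights: dict[str, float]) -> str:
    """Return the id of the hypothesis with the highest confidence score."""
    return max(hypotheses, key=lambda h: confidence(hypotheses[h], weights))

weights = {"adaptation": 0.6, "credibility": 0.4}
hyps = {
    "result_A": {"adaptation": 0.3, "credibility": 0.2},
    "result_B": {"adaptation": 0.8, "credibility": 0.9},
}
best = select_target(hyps, weights)   # result_B wins on both signals
```

In the patent's scheme the scorer is learned rather than hand-weighted, but the selection itself is still an argmax over per-result confidence scores.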
In summary, in the technical solutions provided by the embodiments of this application, speech data to be recognized is sent to multiple speech recognition engines for recognition, multiple speech recognition results are obtained, and one of them is selected as the target speech recognition result according to the feature information of the results. In the related art, speech recognition relies on a single general-purpose engine whose recognition performance in certain specific domains is relatively poor; in the technical solution provided by this application, the results of multiple engines are used as references and the best result is chosen from among them, which improves the accuracy of the recognition result.
Referring to Fig. 4, it shows a flowchart of a speech recognition method provided by another embodiment of this application. In the embodiments of this application, the method is mainly described as being applied to the background server in the implementation environment shown in Fig. 1. The method may include the following steps:
Step 401: obtain speech data to be recognized.
This step is the same as or similar to step 301 in the Fig. 3 embodiment and is not repeated here.
Step 402: send the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1.
This step is the same as or similar to step 302 in the Fig. 3 embodiment and is not repeated here.
Step 403: calculate a confidence score for each of the n speech recognition results according to its feature information.
The confidence score reflects the accuracy of a speech recognition result: the higher the score, the more accurate the result.
Calculating the confidence scores of the n speech recognition results according to their feature information may include the following two steps: for the i-th speech recognition result among the n results, obtain the feature information of the i-th result; then input that feature information into a machine learning model to obtain the confidence score of the i-th result, i being a positive integer less than or equal to n.
The feature information reflects abstract characteristics of the speech data to be recognized, of the engine that output the i-th result, and of the i-th result itself. The feature information includes but is not limited to at least one of the following: the domain feature of the speech data to be recognized; the regional feature of the speech data to be recognized; the gender feature of the speech data to be recognized; the history feature of the speech data to be recognized; the domain feature of the engine that output the i-th result; the regional feature of that engine; the gender feature of that engine; the history feature of that engine; and the matching degree between the i-th result and a predefined dictionary.
The domain feature of the speech data to be recognized reflects the domain to which the subject of the speech data belongs, such as the gaming, music, children's stories, politics, economics, or science and technology domain; the domain feature of the engine that output the i-th result reflects that engine's confidence in recognizing speech from those domains.
The regional feature of the speech data to be recognized reflects the accent of the region where the user who input the data is located, such as a Northeastern, Cantonese, or Shanghai accent; the regional feature of the engine that output the i-th result reflects that engine's confidence in recognizing speech data with different regional features.
The gender feature of the speech data to be recognized reflects the gender of the user who input the data, such as a male or female voice; the gender feature of the engine that output the i-th result reflects that engine's confidence in recognizing speech data with different gender features.
The history feature of the speech data to be recognized reflects whether the speech data appears in the history interaction log; the history feature of the engine that output the i-th result reflects that engine's confidence in recognizing speech data from the history interaction log.
The matching degree between the i-th speech recognition result and the predefined dictionary reflects the similarity between the i-th result and the words in the dictionary. The predefined dictionary may include the domain knowledge graph of the domain to which the subject of the i-th result belongs, the knowledge graph containing the proprietary vocabulary of that domain; it may also include a dictionary of past recognition results recorded in the history interaction log, a dictionary of high-frequency words collected in advance by background designers, and so on; the embodiments of this application are not limited in this respect.
In addition, the feature information may also include other features; the embodiments of this application are not limited in this respect.
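The feature information enumerated above might be assembled into a fixed-length vector along the following lines. The field names and the float encodings are assumptions made for illustration, not the patent's own representation.

```python
from dataclasses import dataclass

@dataclass
class RecognitionFeatures:
    """Toy container for the feature information of the i-th hypothesis."""
    domain_match: float    # engine's confidence for the utterance's domain
    region_match: float    # engine's confidence for the speaker's accent
    gender_match: float    # engine's confidence for the speaker's gender
    history_hit: float     # 1.0 if the utterance appears in the interaction log
    dict_match: float      # overlap between the hypothesis and the dictionary

    def as_vector(self) -> list[float]:
        """Flatten into the order a scoring model would consume."""
        return [self.domain_match, self.region_match, self.gender_match,
                self.history_hit, self.dict_match]

f = RecognitionFeatures(0.9, 0.7, 0.5, 1.0, 0.4)
```

A fixed ordering like `as_vector` matters because the trained scoring model expects each feature at a stable position in its input.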
After the multiple pieces of feature information corresponding to the i-th speech recognition result are obtained, they can be input into a pre-trained machine learning model to calculate the confidence score of the i-th result. The input of the machine learning model is the multiple feature information items of a speech recognition result, and the output is the confidence score of that result.
Step 404: from the n speech recognition results, select the result with the highest confidence score as the target speech recognition result.
The higher the confidence score, the more accurate the result. Therefore, after the confidence scores of the n results are obtained, the result with the highest score can be selected as the target speech recognition result.
In addition, when two or more results are tied for the highest confidence score, any one of them can be chosen as the target speech recognition result.
Optionally, after the target speech recognition result is obtained, the erroneous words in it can be corrected and rewritten to obtain a corrected target speech recognition result.
Because the speech recognition engines are general-purpose engines, their accuracy for certain specific scenarios is insufficient, and partially erroneous words may appear, affecting human-computer interaction. Therefore, error-correction rewriting is introduced: it can find the erroneous words in the target result and correct and rewrite them to obtain a corrected target speech recognition result, which the terminal's interactive system then uses to respond, improving the accuracy of speech recognition.
The above error correction rewriting of the erroneous words in the target speech recognition result may include the following steps 405-406.
Step 405: input the target speech recognition result into the error correction rewriting system.
Since the above speech recognition engines are general-purpose speech recognition engines whose accuracy is insufficient for certain specific scenarios, some erroneous words may appear, affecting human-computer interaction. Therefore, the target speech recognition result is input into the error correction rewriting system, the erroneous words in the target speech recognition result can be found and rewritten, and the accuracy of speech recognition is improved.
The error correction rewriting system is used to determine the erroneous words in the target speech recognition result, obtain the correct words corresponding to the erroneous words, and rewrite the erroneous words in the target speech recognition result as the correct words, obtaining the error-corrected target speech recognition result. For example, if the target speech recognition result is "ni bai and Lü Bu, who is stronger", the error correction rewriting system can determine that "ni bai" in the target speech recognition result is an erroneous word, further obtain the correct word corresponding to "ni bai", such as "Li Bai", and rewrite the erroneous word "ni bai" as the correct word "Li Bai", obtaining the error-corrected target speech recognition result "Li Bai and Lü Bu, who is stronger". As another example, if the target speech recognition result is "our Zhou Jielun's song", the error correction rewriting system can determine that "our" in the target speech recognition result is an erroneous word, further obtain the correct word corresponding to "our", such as "play", and rewrite the erroneous word "our" as the correct word "play", obtaining the error-corrected target speech recognition result "play Zhou Jielun's song".
Optionally, with reference to Fig. 5, the error correction rewriting system 500 includes a first rewriting layer 501, a second rewriting layer 502, and a third rewriting layer 503. The first rewriting layer 501 is used to rewrite high-frequency erroneous words in the target speech recognition result, the second rewriting layer 502 is used to rewrite erroneous words related to the domain of the target speech recognition result, and the third rewriting layer 503 is used to rewrite redundant erroneous words in the target speech recognition result.
Optionally, the rewriting precision of the first rewriting layer, the second rewriting layer, and the third rewriting layer increases successively. After the error correction rewriting of the first rewriting layer succeeds, the error correction rewriting of the second and third rewriting layers is no longer executed; after the error correction rewriting of the first rewriting layer fails, the error correction rewriting of the second rewriting layer is executed; after the error correction rewriting of the second rewriting layer succeeds, the error correction rewriting of the third rewriting layer is no longer executed; and after the error correction rewriting of the second rewriting layer fails, the error correction rewriting of the third rewriting layer is executed.
In the following, with reference to Fig. 6, the contents of the first, second, and third rewriting layers are described in detail.
1. The first rewriting layer 501 includes a whitelist and/or at least one rewriting rule. The whitelist refers to a mapping table between high-frequency erroneous words and the correct words corresponding to those high-frequency erroneous words. When it is detected that the target speech recognition result includes a high-frequency erroneous word, error correction rewriting is performed on the high-frequency erroneous word according to the whitelist and/or the rewriting rules. The mapping table may be collected from historical interaction logs, or may be determined in advance by back-end designers; the embodiments of the present application do not limit this.
The rewriting rules are rules set for high-frequency erroneous words, and may include different rewriting methods for the same high-frequency erroneous word in different domains. For example, for the high-frequency erroneous word "I put": when the voice data belongs to the music domain, "I put" is rewritten as "play"; when the voice data belongs to the game domain, "I put" is rewritten as "we". The rewriting rules may be collected from historical interaction logs, or may be determined in advance by back-end designers; the embodiments of the present application do not limit this.
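A minimal sketch of the first rewriting layer, combining the whitelist lookup with a domain-dependent rewriting rule, might look like the following. All table entries are illustrative, not taken from a real interaction log.

```python
# Sketch of the first rewriting layer: whitelist plus domain-dependent rules.
WHITELIST = {"ni bai": "Li Bai"}  # high-frequency erroneous word -> correct word
RULES = {"I put": {"music": "play", "game": "we"}}  # same word, per-domain targets

def first_layer(text, domain):
    # Whitelist: unconditional replacement of known high-frequency errors.
    for wrong, right in WHITELIST.items():
        text = text.replace(wrong, right)
    # Rules: the same erroneous word is rewritten differently per domain.
    for wrong, by_domain in RULES.items():
        if wrong in text and domain in by_domain:
            text = text.replace(wrong, by_domain[domain])
    return text

print(first_layer("I put Zhou Jielun's song", "music"))
# -> play Zhou Jielun's song
print(first_layer("ni bai and Lü Bu who is stronger", "game"))
# -> Li Bai and Lü Bu who is stronger
```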
2. The second rewriting layer 502 includes a first error detection module, a second error detection module, and a rewriting module. The first error detection module is used to call a language model to detect the erroneous words related to the domain of the target speech recognition result. Optionally, the language model may be an N-gram model. In this application, N takes the values 1, 2, and 3, i.e., a unigram (1-gram) model, a bigram (2-gram) model, and a trigram (3-gram) model. In other embodiments, N may also be a natural number greater than 3; the embodiments of the present application do not limit this. Detecting the domain-related erroneous words of the target speech recognition result with the N-gram models mainly includes the following steps:
(1) Input the target speech recognition result into the N-gram models to obtain the corresponding score values.
After the target speech recognition result is obtained, it can be input into the above unigram model, bigram model, and trigram model respectively. Optionally, before being input into the above models, the target speech recognition result may be segmented to obtain the word-pair lists corresponding to the target speech recognition result. Illustratively, taking the bigram model as an example, suppose the target speech recognition result is "our Wang Fei's red bean"; the word pairs of the bigram model are then: [our, Wang Fei], [Wang Fei, 's], ['s, red bean]. Similarly, the word lists of the unigram and trigram models are obtained. Further, the word lists are input into the corresponding models respectively; for any word among them, each model can calculate two score values. In the embodiment of the present application, for any word in the target speech recognition result, the score values calculated by the unigram, bigram, and trigram models can thus be obtained, six score values in total. Optionally, the score values can be calculated using the longest common substring and/or edit distance methods.
(2) Calculate the mean score from the score values corresponding to the above N-gram models.
In the embodiment of the present application, for any word, the score values calculated by the unigram, bigram, and trigram models can be obtained, six score values in total; further, the mean score of each word can be obtained.
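The per-word averaging in step (2) reduces to the following. The six score values per word are hypothetical stand-ins; in the described system they would come from the unigram, bigram, and trigram models.

```python
# Sketch of step (2): average the six n-gram score values of each word.
def mean_scores(scores_per_word):
    """scores_per_word: {word: [six n-gram score values]} -> {word: mean score}."""
    return {w: sum(s) / len(s) for w, s in scores_per_word.items()}

scores = {  # hypothetical n-gram scores; a low-scoring word is suspect
    "play":        [0.9, 0.8, 0.9, 0.85, 0.8, 0.9],
    "Zhou Jielun": [0.8, 0.9, 0.85, 0.8, 0.9, 0.85],
    "guiding":     [0.2, 0.1, 0.15, 0.1, 0.2, 0.15],
}
means = mean_scores(scores)
```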
(3) Detect the domain-related erroneous words of the target speech recognition result with a filtering algorithm.
In the embodiment of the present application, the filtering algorithm is illustrated with the MAD (Mean Absolute Differences) algorithm as an example: the larger the MAD value, the more likely the word is an erroneous word; further, the word with the largest MAD value can be determined as the erroneous word.
After the mean score of each word is obtained, the mean absolute difference of each word relative to the mean score can further be obtained, and according to the above filtering algorithm, the word with the largest mean absolute difference is determined as the erroneous word. In some other embodiments, the filtering algorithm may also be the SAD (Sum of Absolute Differences) algorithm, the SSD (Sum of Squared Differences) algorithm, the MSD (Mean Square Differences) algorithm, etc.; the embodiments of the present application do not limit this.
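Under the reading above, the MAD-style filter flags the word whose mean score deviates most from the overall mean. A minimal sketch, with hypothetical per-word mean scores in place of real N-gram output:

```python
# Sketch of step (3): MAD-style filtering — the word whose mean score has the
# largest absolute deviation from the overall mean is flagged as erroneous.
def detect_erroneous(mean_scores):
    overall = sum(mean_scores.values()) / len(mean_scores)
    return max(mean_scores, key=lambda w: abs(mean_scores[w] - overall))

means = {"play": 0.86, "Zhou Jielun": 0.85, "guiding": 0.15}  # illustrative
print(detect_erroneous(means))  # -> guiding
```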
The second error detection module is used to detect the domain-related erroneous words of the target speech recognition result according to a parsing algorithm, i.e., by analyzing the syntactic structure of the target speech recognition result (subject, predicate, object, attributive, adverbial, complement) or analyzing the dependency relationships between the words of the target speech recognition result, so as to detect the domain-related erroneous words of the target speech recognition result. This mainly includes the following steps:
(1) Detect the keyword in the target speech recognition result.
The keyword of a sentence is usually its predicate. Illustratively, suppose the target speech recognition result is "play the guiding of Zhou Jielun"; according to the syntactic structure it can be determined that "play" is the predicate, "Zhou Jielun" is an attributive, and "guiding" is the object, so "play" can be determined to be the keyword.
(2) Detect the entities in the target speech recognition result according to the above keyword.
The entities of a sentence are usually the words after the keyword. Illustratively, suppose the target speech recognition result is "play the guiding of Zhou Jielun"; the entities may then include "Zhou Jielun" and "guiding".
(3) Extract the target entity from the above entities and determine it as a domain-related erroneous word of the target speech recognition result.
The target entity may be a suspected erroneous word among the above entities. Among the entities "Zhou Jielun" and "guiding", the target entity is "guiding", and "guiding" is determined as a domain-related erroneous word of the target speech recognition result.
The rewriting module is used to rewrite the erroneous words detected by the first error detection module and the second error detection module as correct words.
The rewriting module may include a correct-word recall unit, a filter unit, and a rewriting unit. Optionally, the rewriting module may also include a sorting unit.
The correct-word recall unit is used to select, from the domain knowledge graph of the domain of the target speech recognition result, at least one candidate correct word whose similarity score with the erroneous word is greater than a preset threshold.
The similarity score may be a pinyin similarity score and/or a glyph similarity score. The domain knowledge graph may include the domain-specific dictionary of the domain of the target speech recognition result. The candidate words may be at least one domain-specific word of the domain of the target speech recognition result.
Illustratively, the process of determining the candidate correct words is introduced taking pinyin similarity as an example. Suppose the target speech recognition result is "play the guiding of Zhou Jielun" and belongs to the music domain; the detected erroneous word is "guiding", whose corresponding pinyin text is "dao xiang". The pinyin "dao xiang" is then compared against the pinyin of the music-domain dictionary and similarity scores are calculated; for example, the similarity score of "rice fragrance" is 90, the similarity score of "swing to" is 65, and the similarity score of "island to" is 50. A suitable preset threshold is chosen, such as 60, and the words in the music-domain dictionary whose scores exceed the preset threshold, such as "rice fragrance" and "swing to", are taken as candidate correct words.
Sequencing unit be used for by least one candidate correct corresponding similarity score of word according to the rule successively decreased into
Row sequence.Such as the similarity score of " rice fragrant " is 90 points, the similarity of " swinging to " is divided into 65 points, then will " rice is fragrant " as the
One candidate correct word, " swinging to " is as the second candidate correct word.
The filter unit is used to calculate the perplexity (PPL) score of each candidate correct word, and to determine the target correct word from the candidate correct words according to the perplexity scores. In the embodiment of the present application, the PPL (perplexity) score is used to characterize the accuracy of rewriting each of the at least one candidate correct word into the target speech recognition result. The perplexity can be calculated with the following formula:

PPL = (∏_{i=1}^{N} 1/P(w_i))^{1/N}

where N denotes the length of the target speech recognition result and P(w_i) is the probability of occurrence of the i-th word after the candidate correct word is substituted in. The smaller this perplexity, the larger P(w_i), the higher the accuracy of rewriting the i-th candidate correct word into the target speech recognition result, and the higher the candidate's PPL score.
Further, the candidate correct word with the highest PPL score can be determined as the target correct word.
For example, the target speech recognition result is "play the guiding of Zhou Jielun", and the candidate correct words include "rice fragrance" and "swing to", where the PPL score of "rice fragrance" is 95 and the PPL score of "swing to" is 55; therefore "rice fragrance" can be taken as the target correct word.
Optionally, the filter unit can call a language model to determine the target correct word from the above at least one candidate correct word.
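The filter unit's selection can be sketched with the standard perplexity formula above: substitute each candidate into the sentence, score the resulting word probabilities, and keep the candidate whose sentence has the lowest perplexity (i.e., the highest PPL score in the patent's sense). The probability values below are illustrative, not from a trained language model.

```python
import math

# Toy word probabilities of each rewritten sentence; illustrative values only.
CANDIDATE_PROBS = {
    "rice fragrance": [0.9, 0.8, 0.4],
    "swing to":       [0.9, 0.8, 0.01],
}

def perplexity(probs):
    """PPL = (prod 1/P(w_i))^(1/N): smaller means a more fluent sentence."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

def best_candidate(cands):
    # The most fluent rewrite = lowest perplexity of the rewritten sentence.
    return min(cands, key=lambda w: perplexity(cands[w]))

print(best_candidate(CANDIDATE_PROBS))  # -> rice fragrance
```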
The rewriting unit is used to rewrite the erroneous word in the target speech recognition result as the target correct word.
For example, the target speech recognition result is "play the guiding of Zhou Jielun" and "rice fragrance" is the target correct word, so "guiding" in the target speech recognition result can be replaced with "rice fragrance", obtaining the error-corrected target speech recognition result, i.e., "play the rice fragrance of Zhou Jielun".
Further, the error-corrected target speech recognition result can be output.
3. The third rewriting layer 503 is used to call a neural network model to rewrite the redundant erroneous words in the target speech recognition result.
Optionally, the neural network model has an end-to-end (sequence-to-sequence) framework, i.e., an encoder-decoder framework. The neural network model may use a CNN (Convolutional Neural Network) model, an RNN (Recurrent Neural Network) model, a Transformer model, an LSTM (Long Short-Term Memory) model, etc.; the embodiments of the present application do not limit this.
The redundant erroneous words may be repeated words in the target speech recognition result, or may be modal particles in the target speech recognition result. For example, the target speech recognition result is "I want I want to listen to a song"; after the redundant erroneous words are rewritten, the target speech recognition result is "I want to listen to a song".
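The target behavior on repeated phrases can be illustrated with a simple rule-based filter. This stands in only for the effect described; the patent's third rewriting layer uses a sequence-to-sequence neural model, not this heuristic.

```python
# Sketch of the third rewriting layer's effect on repeated n-grams:
# drop the first copy of any immediately repeated phrase of up to max_n words.
def drop_repeated_ngrams(text, max_n=3):
    words, out, i = text.split(), [], 0
    while i < len(words):
        for n in range(max_n, 0, -1):
            if words[i:i + n] and words[i:i + n] == words[i + n:i + 2 * n]:
                i += n  # skip the first copy of the repeated n-gram
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(drop_repeated_ngrams("I want I want to listen to a song"))
# -> I want to listen to a song
```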
Step 406: obtain the error-corrected target speech recognition result output by the error correction rewriting system.
The terminal can obtain the error-corrected target speech recognition result output by the error correction rewriting system, and further respond to the user through the interactive system of the terminal according to the error-corrected target speech recognition result.
It should be noted that, in the embodiment of the present application, the above error correction rewriting does not depend on the above general-purpose speech recognition engines, and can be executed directly by the terminal, or by a server serving the terminal.
In conclusion technical solution provided by the embodiments of the present application, multiple by the way that voice data to be identified to be sent to
Speech recognition engine is identified, obtains multiple speech recognition results, and most by confidence level score value in multiple speech recognition results
Target voice recognition result input error correction is further rewritten system as target voice recognition result by high speech recognition result
System carries out error correction rewriting, obtains the revised target voice recognition result of error correction.Compared to the relevant technologies, by multiple speech recognitions
As a result the middle highest speech recognition result of confidence level score value rather than relies on some voice as target voice recognition result
Identify engine as a result, improving the accuracy of recognition result.Also, also by being entangled to the speech recognition result of the selection
Mistake is rewritten, and the accuracy of recognition result is further improved.
In addition, error correction replacement system be directed to different types of erroneous words, as high frequency erroneous words, the erroneous words of fields and
The erroneous words etc. of redundancy effectively improve the conjunction of final speech recognition result using different error correction rewrite strategies or model
Rationality.
In addition, in the embodiment of the present application, above-mentioned error correction is rewritten independent of general language identification engine, it can be direct
By terminal, or the server execution of the terminal is serviced, saves time and the cost of speech recognition.
In the following, with reference to Fig. 7, taking the language model of the first error detection module being an N-gram model and the filtering algorithm being the MAD algorithm as an example, the entire flow of the second rewriting layer rewriting the domain-related erroneous words of the target speech recognition result is briefly introduced. Take the target speech recognition result "ni bai and Lü Bu, who is stronger" as an example. The first error detection module of the second rewriting layer calls the N-gram models, such as the unigram, bigram, and trigram models, and further obtains the score values calculated by the unigram, bigram, and trigram models, i.e., the 1-gram score, 2-gram score, and 3-gram score; the N-gram mean score is calculated from the 1-gram score, 2-gram score, and 3-gram score, and based on this N-gram mean score the MAD filtering algorithm is used to determine that "ni bai" is an erroneous word. The second error detection module determines that "ni bai" is an erroneous word through keyword detection, entity detection, and target-entity extraction. Afterwards, in the correct-word recall unit, at least one candidate correct word is retrieved from the domain knowledge graph according to pinyin similarity and/or glyph similarity; the sorting unit sorts the similarity scores corresponding to the candidate correct words in decreasing order; in the filter unit, the PPL score of each candidate correct word is calculated according to the order of the candidate correct words, and the candidate correct word with the highest PPL score is determined as the target correct word; finally, the rewriting unit rewrites the erroneous word in the target speech recognition result as the target correct word, obtaining the error-corrected target speech recognition result "Li Bai and Lü Bu, who is stronger". At this point, the error correction rewriting is complete.
The following are apparatus embodiments of the present application, which can be used to execute the method embodiments of the present application. For details not disclosed in the apparatus embodiments of the present application, please refer to the method embodiments of the present application.
Referring to Fig. 8, it shows a block diagram of a speech recognition apparatus provided by one embodiment of the present application. The apparatus has functions for implementing the above method examples; the functions may be implemented by hardware, or by hardware executing corresponding software. The apparatus may be the background server or the terminal described above, or may be set on the background server or the terminal. The apparatus 800 may include: a data acquisition module 810, a data transmission module 820, a result selection module 830, and a result rewriting module 840.
The data acquisition module 810 is used to obtain voice data to be recognized.
The data transmission module 820 is used to send the voice data to be recognized to n speech recognition engines to obtain n speech recognition results, where n is an integer greater than 1.
The result selection module 830 is used to select the target speech recognition result from the n speech recognition results according to the feature information of the n speech recognition results; the feature information of a speech recognition result is used to indicate the degree of adaptation between the speech recognition engine that outputs the speech recognition result and the voice data to be recognized, and the credibility of the words included in the speech recognition result.
In conclusion in technical solution provided by the embodiments of the present application, it is more by the way that voice data to be identified to be sent to
A speech recognition engine is identified, obtains multiple speech recognition results, and believe according to the feature of multiple speech recognition result
Breath, selected from multiple speech recognition result one as target voice recognition result.Compared in the related technology, voice is known
Speech recognition engine that Yi Laiyu be not single, and the speech recognition engine is a general-purpose platform, for the knowledge of certain specific areas
Other effect is relatively poor, in technical solution provided by the present application, using the speech recognition result conduct of multiple speech recognition engines
With reference to, and preferably speech recognition result is chosen from multiple speech recognition results, improve the accuracy of recognition result.
In some possible designs, as shown in Fig. 9, the result selection module 830 includes: a score calculation unit 831 and a result selection unit 832.
The score calculation unit 831 is used to calculate the confidence scores corresponding to the n speech recognition results according to the feature information of the n speech recognition results.
The result selection unit 832 is used to select, from the n speech recognition results, the speech recognition result with the highest confidence score as the target speech recognition result.
In some possible designs, as shown in Fig. 9, the apparatus 800 further includes: a result rewriting module 840.
The result rewriting module 840 is used to perform error correction rewriting on the erroneous words in the target speech recognition result to obtain the error-corrected target speech recognition result.
It should be noted that, when the apparatus provided in the above embodiments implements its functions, the division into the above functional modules is only used as an example for illustration; in practical applications, the above functions can be allocated to different functional modules as needed, i.e., the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
Referring to Fig. 10, it shows a structural schematic diagram of a computer device provided by one embodiment of the present application. The computer device is used to implement the speech recognition method provided in the above embodiments. The computer device may be the background server described above, or may be a terminal capable of voice interaction with a user, such as the smart phone, tablet computer, PC, intelligent robot, smart television, or smart speaker described above. Specifically:
The computer device 1000 includes a central processing unit (CPU) 1001, a system memory 1004 including a random access memory (RAM) 1002 and a read-only memory (ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The computer device 1000 further includes a basic input/output system (I/O system) 1006 that helps transmit information between the devices within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1012.
The basic input/output 1006 includes display 1008 for showing information and inputs for user
The input equipment 1009 of such as mouse, keyboard etc of information.Wherein the display 1008 and input equipment 1009 all pass through
The input and output controller 1010 for being connected to system bus 1005 is connected to central processing unit 1001.The basic input/defeated
System 1006 can also include input and output controller 1010 to touch for receiving and handling from keyboard, mouse or electronics out
Control the input of multiple other equipment such as pen.Similarly, input and output controller 1010 also provide output to display screen, printer or
Other kinds of output equipment.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable medium provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
Without loss of generality, the computer-readable medium may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state storage technologies, CD-ROM, DVD or other optical storage, tape cassettes, magnetic tape, disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage medium is not limited to the above. The above system memory 1004 and mass storage device 1007 may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1000 may also operate through a remote computer connected to a network such as the Internet. That is, the computer device 1000 may be connected to the network 1012 through a network interface unit 1011 connected to the system bus 1005; in other words, the network interface unit 1011 may also be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes at least one instruction, at least one program, a code set, or an instruction set, which is stored in the memory and configured to be executed by one or more processors to implement the above speech recognition method.
In an exemplary embodiment, a computer-readable storage medium is also provided; at least one instruction, at least one program, a code set, or an instruction set is stored in the storage medium, and the at least one instruction, the at least one program, the code set, or the instruction set, when executed by a processor, implements the above speech recognition method.
In an exemplary embodiment, a computer program product is also provided; when executed by a processor, the computer program product is used to implement the above speech recognition method.
It should be understood that "multiple" referred to herein means two or more. "And/or" describes the association relationship of associated objects, indicating that three kinds of relationships may exist; for example, "A and/or B" may indicate three situations: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
The foregoing are merely exemplary embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included within the scope of protection of the present application.
Claims (15)
1. A speech recognition method, characterized in that the method includes:
obtaining voice data to be recognized;
sending the voice data to be recognized to n speech recognition engines to obtain n speech recognition results, where n is an integer greater than 1; and
selecting a target speech recognition result from the n speech recognition results according to feature information of the n speech recognition results, wherein the feature information of a speech recognition result is used to indicate a degree of adaptation between the speech recognition engine outputting the speech recognition result and the voice data to be recognized, and a credibility of words included in the speech recognition result.
2. The method according to claim 1, characterized in that the selecting a target speech recognition result from the n speech recognition results according to the feature information of the n speech recognition results includes:
calculating confidence scores corresponding to the n speech recognition results according to the feature information of the n speech recognition results; and
selecting, from the n speech recognition results, the speech recognition result with the highest confidence score as the target speech recognition result.
3. The method according to claim 2, characterized in that the calculating confidence scores corresponding to the n speech recognition results according to the feature information of the n speech recognition results includes:
for the i-th speech recognition result among the n speech recognition results, obtaining the feature information of the i-th speech recognition result, where i is a positive integer less than or equal to n; and
inputting the feature information of the i-th speech recognition result into a machine learning model to obtain the confidence score corresponding to the i-th speech recognition result.
4. The method according to any one of claims 1 to 3, characterized in that, after the selecting a target speech recognition result from the n speech recognition results according to the feature information of the n speech recognition results, the method further includes:
performing error correction rewriting on erroneous words in the target speech recognition result to obtain an error-corrected target speech recognition result.
5. The method according to claim 4, wherein the performing error-correction rewriting on the erroneous words in the target speech recognition result to obtain the error-corrected target speech recognition result comprises:
inputting the target speech recognition result into an error-correction rewriting system, the error-correction rewriting system being configured to determine the erroneous words in the target speech recognition result, obtain correct words corresponding to the erroneous words, and rewrite the erroneous words in the target speech recognition result as the correct words to obtain the error-corrected target speech recognition result;
obtaining the error-corrected target speech recognition result output by the error-correction rewriting system.
6. The method according to claim 5, wherein the error-correction rewriting system comprises: a first rewrite layer, a second rewrite layer and a third rewrite layer;
wherein the first rewrite layer is configured to rewrite high-frequency erroneous words in the target speech recognition result, the second rewrite layer is configured to rewrite domain-specific erroneous words in the target speech recognition result, and the third rewrite layer is configured to rewrite redundant erroneous words in the target speech recognition result.
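The three-layer cascade in claim 6 can be pictured as rewrite functions applied in sequence. Only the layered structure comes from the claim; the layer bodies below are deliberately trivial placeholders.

```python
# Sketch of claim 6's cascade: three rewrite layers applied in order to
# the target recognition result. Each layer body here is a placeholder
# (real layers would use whitelists, detection models, etc.).

def layer1_high_freq(text):   # rewrite high-frequency erroneous words
    return text.replace("wether", "weather")

def layer2_domain(text):      # rewrite domain-specific erroneous words
    return text.replace("stok price", "stock price")

def layer3_redundancy(text):  # rewrite redundant (repeated) words
    out = []
    for word in text.split():
        if not out or word != out[-1]:
            out.append(word)  # drop immediate repetitions
    return " ".join(out)

def correct(text):
    for layer in (layer1_high_freq, layer2_domain, layer3_redundancy):
        text = layer(text)
    return text

fixed = correct("the the wether and stok price")
```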
7. The method according to claim 6, wherein the first rewrite layer is configured to:
upon detecting that the target speech recognition result comprises a high-frequency erroneous word, perform error-correction rewriting on the high-frequency erroneous word according to a white list and/or a rewrite rule, the white list being a mapping table between high-frequency erroneous words and the correct words corresponding to the high-frequency erroneous words, and the rewrite rule being a rewrite rule set for the high-frequency erroneous words.
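A minimal sketch of claim 7's whitelist-plus-rule rewriting follows. The whitelist entries and the regex rule are invented examples; the claim only specifies the two mechanisms, not their contents.

```python
# Sketch of claim 7's first rewrite layer: high-frequency erroneous words
# are replaced via a whitelist (error word -> correct word) and/or
# pattern-based rewrite rules. All entries below are made up.
import re

WHITELIST = {"there house": "their house"}        # error -> correct word
RULES = [(re.compile(r"\bgonna\b"), "going to")]  # rule-based rewrites

def rewrite_high_freq(text):
    for wrong, right in WHITELIST.items():
        text = text.replace(wrong, right)         # whitelist lookup
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)     # rewrite rules
    return text

fixed = rewrite_high_freq("gonna visit there house")
```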
8. The method according to claim 6, wherein the second rewrite layer comprises: a first error-detection module, a second error-detection module and a rewriting module;
the first error-detection module is configured to call a language model to detect domain-specific erroneous words in the target speech recognition result;
the second error-detection module is configured to detect domain-specific erroneous words in the target speech recognition result according to a parsing algorithm;
the rewriting module is configured to rewrite the erroneous words detected by the first error-detection module and the second error-detection module as correct words.
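One way to picture claim 8's language-model detection module is to flag words the model considers improbable. The unigram probabilities below are invented; a real system would use a full language model (the claim's second, parser-based module is not sketched here).

```python
# Sketch of claim 8's first error-detection module: a language model
# flags low-probability words as candidate erroneous words. The toy
# unigram table below is an assumption for illustration only.

LM_PROB = {"book": 0.2, "a": 0.3, "flight": 0.1, "flite": 0.0001}

def detect_errors(sentence, threshold=0.01):
    """Return words whose language-model probability falls below threshold."""
    return [w for w in sentence.split() if LM_PROB.get(w, 0.0) < threshold]

errors = detect_errors("book a flite")
```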
9. The method according to claim 8, wherein the rewriting module comprises a correct-word recall unit, a filter unit and a rewrite unit;
the correct-word recall unit is configured to select, from a domain knowledge graph of the field to which the target speech recognition result belongs, at least one candidate correct word whose similarity score with the erroneous word is greater than a preset threshold;
the filter unit is configured to calculate a perplexity score of each candidate correct word, and determine a target correct word from the candidate correct words according to the perplexity scores of the candidate correct words, the perplexity score characterizing the accuracy of rewriting the candidate correct word into the target speech recognition result;
the rewrite unit is configured to rewrite the erroneous word in the target speech recognition result as the target correct word.
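Claim 9's recall-filter-rewrite pipeline can be sketched with toy stand-ins: a character-overlap ratio in place of the domain knowledge graph's similarity score, and an inverse-count heuristic in place of a real language-model perplexity.

```python
# Sketch of claim 9: recall candidate correct words above a similarity
# threshold, keep the candidate whose substitution yields the lowest
# perplexity, then rewrite. Both scoring functions are toy assumptions.

def similarity(a, b):
    """Character-overlap ratio as a toy similarity score."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

def recall_candidates(error_word, domain_vocab, threshold=0.5):
    return [w for w in domain_vocab if similarity(error_word, w) > threshold]

def perplexity(sentence, lm_counts):
    """Lower is better; toy stand-in built from inverse word counts."""
    return sum(1.0 / lm_counts.get(tok, 0.1) for tok in sentence.split())

def rewrite(sentence, error_word, domain_vocab, lm_counts):
    candidates = recall_candidates(error_word, domain_vocab)
    best = min(candidates,
               key=lambda w: perplexity(sentence.replace(error_word, w),
                                        lm_counts))
    return sentence.replace(error_word, best)

vocab = {"fried", "flies", "freed"}
counts = {"order": 5, "fried": 8, "rice": 9, "flies": 1, "freed": 1}
out = rewrite("order fride rice", "fride", vocab, counts)
```

Here "fride" recalls "fried" and "freed" as candidates, and the perplexity filter prefers "fried" because the corrected sentence is more probable under the toy counts.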
10. The method according to claim 6, wherein the third rewrite layer is configured to:
call a neural network model to rewrite the redundant erroneous words in the target speech recognition result.
11. A speech recognition apparatus, wherein the apparatus comprises:
a data acquisition module, configured to obtain voice data to be identified;
a data transmission module, configured to send the voice data to be identified to n speech recognition engines to obtain n speech recognition results, the n being an integer greater than 1;
a result selection module, configured to select, according to feature information of the n speech recognition results, a target speech recognition result from the n speech recognition results; wherein the feature information of a speech recognition result is used to indicate a degree of adaptation between the speech recognition engine that outputs the speech recognition result and the voice data to be identified, and a credibility of the words comprised in the speech recognition result.
12. The apparatus according to claim 11, wherein the result selection module comprises:
a score calculation unit, configured to calculate, according to the feature information of the n speech recognition results, a confidence score corresponding to each of the n speech recognition results;
a result selection unit, configured to select, from the n speech recognition results, the speech recognition result with the highest confidence score as the target speech recognition result.
13. The apparatus according to claim 11 or 12, wherein the apparatus further comprises:
a result rewriting module, configured to perform error-correction rewriting on erroneous words in the target speech recognition result to obtain an error-corrected target speech recognition result.
14. A computer device, wherein the computer device comprises a processor and a memory, the memory storing at least one instruction, at least one program segment, a code set or an instruction set, which is loaded and executed by the processor to implement the method according to any one of claims 1 to 10.
15. A computer-readable storage medium, wherein the storage medium stores at least one instruction, at least one program segment, a code set or an instruction set, which is loaded and executed by a processor to implement the method according to any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910327337.6A CN110148416B (en) | 2019-04-23 | 2019-04-23 | Speech recognition method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110148416A true CN110148416A (en) | 2019-08-20 |
CN110148416B CN110148416B (en) | 2024-03-15 |
Family
ID=67593849
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910327337.6A Active CN110148416B (en) | 2019-04-23 | 2019-04-23 | Speech recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110148416B (en) |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020184016A1 (en) * | 2001-05-29 | 2002-12-05 | International Business Machines Corporation | Method of speech recognition using empirically determined word candidates |
US20030040907A1 (en) * | 2001-08-24 | 2003-02-27 | Sen-Chia Chang | Speech recognition system |
JP2003228393A (en) * | 2002-01-31 | 2003-08-15 | Nippon Telegr & Teleph Corp <Ntt> | Device and method for voice interaction, voice interaction program and recording medium therefor |
US20090259466A1 (en) * | 2008-04-15 | 2009-10-15 | Nuance Communications, Inc. | Adaptive Confidence Thresholds for Speech Recognition |
KR20110010233A (en) * | 2009-07-24 | 2011-02-01 | 고려대학교 산학협력단 | Apparatus and method for speaker adaptation by evolutional learning, and speech recognition system using thereof |
US20120179469A1 (en) * | 2011-01-07 | 2012-07-12 | Nuance Communication, Inc. | Configurable speech recognition system using multiple recognizers |
CN103440867A (en) * | 2013-08-02 | 2013-12-11 | 安徽科大讯飞信息科技股份有限公司 | Method and system for recognizing voice |
CN105869634A (en) * | 2016-03-31 | 2016-08-17 | 重庆大学 | Field-based method and system for feeding back text error correction after speech recognition |
CN106653007A (en) * | 2016-12-05 | 2017-05-10 | 苏州奇梦者网络科技有限公司 | Speech recognition system |
CN106683662A (en) * | 2015-11-10 | 2017-05-17 | 中国电信股份有限公司 | Speech recognition method and device |
CN107016995A (en) * | 2016-01-25 | 2017-08-04 | 福特全球技术公司 | The speech recognition based on acoustics and domain for vehicle |
CN107741928A (en) * | 2017-10-13 | 2018-02-27 | 四川长虹电器股份有限公司 | A kind of method to text error correction after speech recognition based on field identification |
WO2018059957A1 (en) * | 2016-09-30 | 2018-04-05 | Robert Bosch Gmbh | System and method for speech recognition |
CN108694940A (en) * | 2017-04-10 | 2018-10-23 | 北京猎户星空科技有限公司 | A kind of audio recognition method, device and electronic equipment |
CN109145281A (en) * | 2017-06-15 | 2019-01-04 | 北京嘀嘀无限科技发展有限公司 | Audio recognition method, device and storage medium |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110647987A (en) * | 2019-08-22 | 2020-01-03 | 腾讯科技(深圳)有限公司 | Method and device for processing data in application program, electronic equipment and storage medium |
CN110364146B (en) * | 2019-08-23 | 2021-07-27 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium |
CN110364146A (en) * | 2019-08-23 | 2019-10-22 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, speech recognition apparatus and storage medium |
CN110852087A (en) * | 2019-09-23 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Chinese error correction method and device, storage medium and electronic device |
CN110648668A (en) * | 2019-09-24 | 2020-01-03 | 上海依图信息技术有限公司 | Keyword detection device and method |
CN110765996A (en) * | 2019-10-21 | 2020-02-07 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN110765996B (en) * | 2019-10-21 | 2022-07-29 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN110597082A (en) * | 2019-10-23 | 2019-12-20 | 北京声智科技有限公司 | Intelligent household equipment control method and device, computer equipment and storage medium |
CN111081247A (en) * | 2019-12-24 | 2020-04-28 | 腾讯科技(深圳)有限公司 | Method for speech recognition, terminal, server and computer-readable storage medium |
CN111402861A (en) * | 2020-03-25 | 2020-07-10 | 苏州思必驰信息科技有限公司 | Voice recognition method, device, equipment and storage medium |
CN111402861B (en) * | 2020-03-25 | 2022-11-15 | 思必驰科技股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN111599359A (en) * | 2020-05-09 | 2020-08-28 | 标贝(北京)科技有限公司 | Man-machine interaction method, server, client and storage medium |
CN111627438A (en) * | 2020-05-21 | 2020-09-04 | 四川虹美智能科技有限公司 | Voice recognition method and device |
CN111883122B (en) * | 2020-07-22 | 2023-10-27 | 海尔优家智能科技(北京)有限公司 | Speech recognition method and device, storage medium and electronic equipment |
CN111883122A (en) * | 2020-07-22 | 2020-11-03 | 海尔优家智能科技(北京)有限公司 | Voice recognition method and device, storage medium and electronic equipment |
CN112151022A (en) * | 2020-09-25 | 2020-12-29 | 北京百度网讯科技有限公司 | Speech recognition optimization method, device, equipment and storage medium |
CN112509565A (en) * | 2020-11-13 | 2021-03-16 | 中信银行股份有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112509566B (en) * | 2020-12-22 | 2024-03-19 | 阿波罗智联(北京)科技有限公司 | Speech recognition method, device, equipment, storage medium and program product |
CN112509566A (en) * | 2020-12-22 | 2021-03-16 | 北京百度网讯科技有限公司 | Voice recognition method, device, equipment, storage medium and program product |
CN113096654B (en) * | 2021-03-26 | 2022-06-24 | 山西三友和智慧信息技术股份有限公司 | Computer voice recognition system based on big data |
CN113096654A (en) * | 2021-03-26 | 2021-07-09 | 山西三友和智慧信息技术股份有限公司 | Computer voice recognition system based on big data |
US11651139B2 (en) | 2021-06-15 | 2023-05-16 | Nanjing Silicon Intelligence Technology Co., Ltd. | Text output method and system, storage medium, and electronic device |
WO2022262542A1 (en) * | 2021-06-15 | 2022-12-22 | 南京硅基智能科技有限公司 | Text output method and system, storage medium, and electronic device |
WO2023273776A1 (en) * | 2021-06-30 | 2023-01-05 | 青岛海尔科技有限公司 | Speech data processing method and apparatus, and storage medium and electronic apparatus |
CN113658586A (en) * | 2021-08-13 | 2021-11-16 | 北京百度网讯科技有限公司 | Training method of voice recognition model, voice interaction method and device |
CN113658586B (en) * | 2021-08-13 | 2024-04-09 | 北京百度网讯科技有限公司 | Training method of voice recognition model, voice interaction method and device |
CN113782030A (en) * | 2021-09-10 | 2021-12-10 | 平安科技(深圳)有限公司 | Error correction method based on multi-mode speech recognition result and related equipment |
CN113782030B (en) * | 2021-09-10 | 2024-02-02 | 平安科技(深圳)有限公司 | Error correction method based on multi-mode voice recognition result and related equipment |
CN113793604A (en) * | 2021-09-14 | 2021-12-14 | 思必驰科技股份有限公司 | Speech recognition system optimization method and device |
CN113793604B (en) * | 2021-09-14 | 2024-01-05 | 思必驰科技股份有限公司 | Speech recognition system optimization method and device |
CN113793597A (en) * | 2021-09-15 | 2021-12-14 | 云知声智能科技股份有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN114446279A (en) * | 2022-02-18 | 2022-05-06 | 青岛海尔科技有限公司 | Voice recognition method, voice recognition device, storage medium and electronic equipment |
CN114842871A (en) * | 2022-03-25 | 2022-08-02 | 青岛海尔科技有限公司 | Voice data processing method and device, storage medium and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN110148416B (en) | 2024-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110148416A (en) | Audio recognition method, device, equipment and storage medium | |
US11302330B2 (en) | Clarifying questions for rewriting ambiguous user utterance | |
US11055355B1 (en) | Query paraphrasing | |
US11222030B2 (en) | Automatically augmenting message exchange threads based on tone of message | |
CN108847241B (en) | Method for recognizing conference voice as text, electronic device and storage medium | |
US9734193B2 (en) | Determining domain salience ranking from ambiguous words in natural speech | |
US20200335096A1 (en) | Pinyin-based method and apparatus for semantic recognition, and system for human-machine dialog | |
US20200380077A1 (en) | Architecture for resolving ambiguous user utterance | |
US20140316764A1 (en) | Clarifying natural language input using targeted questions | |
JP2021533397A (en) | Speaker dialification using speaker embedding and a trained generative model | |
CN112673421A (en) | Training and/or using language selection models to automatically determine a language for voice recognition of spoken utterances | |
WO2018045646A1 (en) | Artificial intelligence-based method and device for human-machine interaction | |
US11830482B2 (en) | Method and apparatus for speech interaction, and computer storage medium | |
US11238858B2 (en) | Speech interactive method and device | |
CN109192194A (en) | Voice data mask method, device, computer equipment and storage medium | |
WO2020151690A1 (en) | Statement generation method, device and equipment and storage medium | |
JP7063937B2 (en) | Methods, devices, electronic devices, computer-readable storage media, and computer programs for voice interaction. | |
CN110717021B (en) | Input text acquisition and related device in artificial intelligence interview | |
CN109492085B (en) | Answer determination method, device, terminal and storage medium based on data processing | |
CN114678027A (en) | Error correction method and device for voice recognition result, terminal equipment and storage medium | |
CN110020429A (en) | Method for recognizing semantics and equipment | |
WO2021129411A1 (en) | Text processing method and device | |
TWI734085B (en) | Dialogue system using intention detection ensemble learning and method thereof | |
CN116978367A (en) | Speech recognition method, device, electronic equipment and storage medium | |
CN114254634A (en) | Multimedia data mining method, device, storage medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||