CN110148416A - Speech recognition method, apparatus, device and storage medium - Google Patents
Speech recognition method, apparatus, device and storage medium
- Publication number
- CN110148416A CN110148416A CN201910327337.6A CN201910327337A CN110148416A CN 110148416 A CN110148416 A CN 110148416A CN 201910327337 A CN201910327337 A CN 201910327337A CN 110148416 A CN110148416 A CN 110148416A
- Authority
- CN
- China
- Prior art keywords
- recognition result
- speech recognition
- target voice
- result
- error correction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
Abstract
This application provides a speech recognition method, apparatus, device, and storage medium. The method includes: obtaining speech data to be recognized; sending the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1; and selecting a target speech recognition result from the n results according to the feature information of the n results. In the technical solution provided by this application, the results of multiple speech recognition engines are used as references and the best result is chosen from among them, which improves the accuracy of the recognition result.
Description
Technical field
The embodiments of this application relate to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, device, and storage medium.
Background
Speech recognition technology converts human speech into text and is widely used in all kinds of artificial intelligence products, such as intelligent dialogue robots, smart speakers, and intelligent translation devices.
The general process of speech recognition is as follows: a speech recognition device obtains the speech data input by a user and sends the speech data to a speech recognition engine; the engine recognizes the speech data and feeds the speech recognition result back to the device, which then outputs the result.
At present, speech recognition relies on a single speech recognition engine. Because that engine is a general-purpose platform, its recognition performance in certain specific domains is relatively poor, making the recognition result inaccurate.
Summary of the invention
The embodiments of this application provide a speech recognition method, apparatus, device, and storage medium, which can be used to solve the problem in the related art that a speech recognition engine performs relatively poorly in certain specific domains, making the recognition result inaccurate. The technical solution is as follows:
In one aspect, an embodiment of this application provides a speech recognition method, the method comprising:
obtaining speech data to be recognized;
sending the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1; and
selecting a target speech recognition result from the n speech recognition results according to the feature information of the n results, wherein the feature information of a speech recognition result indicates both the degree of adaptation between the engine that output the result and the speech data to be recognized, and the credibility of the words the result contains.
In another aspect, an embodiment of this application provides a speech recognition apparatus, the apparatus comprising:
a data acquisition module, configured to obtain speech data to be recognized;
a data transmission module, configured to send the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1; and
a result selection module, configured to select a target speech recognition result from the n speech recognition results according to the feature information of the n results, wherein the feature information of a speech recognition result indicates both the degree of adaptation between the engine that output the result and the speech data to be recognized, and the credibility of the words the result contains.
In another aspect, an embodiment of this application provides a computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the speech recognition method described in the above aspect.
In another aspect, an embodiment of this application provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the speech recognition method described in the above aspect.
In another aspect, an embodiment of this application provides a computer program product which, when executed by a processor, implements the speech recognition method described above.
The technical solutions provided by the embodiments of this application may include the following beneficial effects:
Speech data to be recognized is sent to multiple speech recognition engines for recognition, multiple speech recognition results are obtained, and one of them is selected as the target speech recognition result according to the feature information of the results. In the related art, speech recognition relies on a single general-purpose engine whose recognition performance in certain specific domains is relatively poor; in the technical solution provided by this application, the results of multiple engines are used as references and the best result is chosen from among them, which improves the accuracy of the recognition result.
Brief description of the drawings
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of this application;
Fig. 2 is a schematic diagram of a complete speech recognition process of this application;
Fig. 3 is a flowchart of a speech recognition method provided by an embodiment of this application;
Fig. 4 is a flowchart of a speech recognition method provided by another embodiment of this application;
Fig. 5 is a block diagram illustrating an error-correction rewriting system;
Fig. 6 is a block diagram illustrating another error-correction rewriting system;
Fig. 7 is a schematic diagram illustrating a second rewriting layer and its rewriting process;
Fig. 8 is a block diagram of a speech recognition apparatus provided by an embodiment of this application;
Fig. 9 is a block diagram of a speech recognition apparatus provided by an embodiment of this application;
Fig. 10 is a schematic structural diagram of a computer device provided by an embodiment of this application.
Detailed description of embodiments
To make the purposes, technical solutions, and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, it shows a schematic diagram of an implementation environment provided by an embodiment of this application. The implementation environment may include: a terminal 10, a background server 20, and a speech recognition server 30.
In the embodiments of this application, the terminal 10 may be equipped with a voice acquisition device, such as a microphone, a microphone array, or a transmitter, for obtaining the speech data input by a user. The background server 20 provides background services for the terminal 10.
Optionally, applications with a voice input function may be installed on the terminal 10, such as an instant messaging application, a voice input method application, or a voice assistant. At runtime, these applications call the voice acquisition device to collect the speech data input by the user.
Optionally, the background server 20 may also be the background server of the application with the voice input function.
In one possible embodiment, the application with the voice input function can not only collect the speech data input by the user but also recognize it.
In another possible embodiment, the application with the voice input function can only collect the speech data input by the user, and the collected speech data needs to be sent, via the background server 20, to the speech recognition server 30 for recognition; after recognition is completed, the speech recognition server 30 feeds the speech recognition result back to the terminal 10, via the background server 20, for output. The embodiments of this application are described using the example in which the speech recognition server 30 recognizes the speech data input by the user.
The terminal 10 may be an electronic device capable of voice interaction with the user, such as a smartphone, a tablet computer, a PC (Personal Computer), an intelligent robot, a smart TV, or a smart speaker.
In the embodiments of this application, a speech recognition engine runs on the speech recognition server 30. Optionally, different speech recognition servers may run different speech recognition engines; one speech recognition server may run a single speech recognition engine or multiple different speech recognition engines, which is not limited in the embodiments of this application.
The background server 20 and the speech recognition server 30 may each be one server, a server cluster composed of multiple servers, or a cloud computing service center.
The terminal 10 communicates with the background server 20 through a network, and the background server 20 communicates with the speech recognition server 30 through a network.
It should be noted that in some embodiments the number of speech recognition servers 30 is one, and that server runs multiple different speech recognition engines; in other embodiments there are multiple speech recognition servers 30, each running one speech recognition engine, with the engines differing from server to server.
In addition, the technical solutions provided by the embodiments of this application are applicable to speech recognition in many different languages, such as Chinese, English, French, German, Japanese, and Korean. The embodiments of this application are mainly described with respect to Chinese, but this does not limit the technical solutions of this application.
It should be noted that the technical solutions provided by the embodiments of this application can be applied to all kinds of artificial intelligence products, with application scenarios including but not limited to household, in-vehicle, and gaming scenarios.
Referring to Fig. 2, it illustrates a schematic diagram of a complete speech recognition process of this application. The user inputs speech data to be recognized through the voice acquisition device configured on the terminal 10 (such as an intelligent robot, smart TV, or smart speaker). The terminal 10 then sends the speech data, via the background server 20, to multiple speech recognition engines, such as speech recognition engines A, B, and C, and obtains the result of each engine, such as speech recognition results A, B, and C. These speech recognition engines run on the speech recognition server 30. The background server 20 can then select a target speech recognition result from the multiple results, and further perform error-correction rewriting on the target result to obtain a corrected target speech recognition result. During both target selection and error-correction rewriting, the history interaction log, the domain knowledge graph, user characteristics, and other content are taken into account. Finally, the background server 20 feeds the corrected target speech recognition result back to the terminal 10, so that the terminal 10 can respond to the user through its interactive system based on the corrected result.
In the following, the technical solutions of this application are introduced and explained through several embodiments.
Referring to Fig. 3, it shows a flowchart of a speech recognition method provided by an embodiment of this application. In the embodiments of this application, the method is mainly described as being applied to the background server in the implementation environment shown in Fig. 1. The method may include the following steps:
Step 301: obtain speech data to be recognized.
The speech data to be recognized refers to speech data input by a user. When the user wants to interact with the terminal by voice, the user can speak directly into the voice acquisition device on the terminal; correspondingly, the terminal obtains what the user says through the voice acquisition device as the speech data to be recognized.
Optionally, the terminal may collect the speech data to be recognized upon receiving a speech recognition instruction. The instruction may be triggered by the user through a specified operation, which may include a click operation, a slide operation, and the like; the embodiments of this application do not limit this.
For example, a speech recognition option may be provided on the terminal. When the user wants to interact with the terminal by voice, the user can click the option to trigger the speech recognition instruction. After receiving the instruction, the terminal obtains the speech data to be recognized.
Step 302: send the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1.
After obtaining the speech data to be recognized, the terminal can send it to multiple speech recognition engines for recognition. Correspondingly, after receiving the speech data to be recognized, each speech recognition engine converts it into text, that text being the speech recognition result.
The n speech recognition engines may run on one server or on multiple servers. The n engines have different corresponding characteristics, which may include domain characteristics, history characteristics, gender characteristics, regional characteristics, and the like.
In the embodiments of this application, the n speech recognition engines are general-purpose speech recognition engines.
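The fan-out of step 302 might be sketched as follows. The engine names and the `recognize_with_engine` stub are illustrative assumptions, not part of the patent; a real system would call each engine's own API over the network.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_with_engine(engine_name: str, audio: bytes) -> str:
    """Stand-in for a real engine call (e.g. an HTTP request to the engine)."""
    # A real implementation would send `audio` to the engine's endpoint
    # and return the decoded text hypothesis.
    return f"hypothesis from {engine_name}"

def fan_out(audio: bytes, engines: list[str]) -> dict[str, str]:
    """Send the same audio to every engine in parallel and collect n results."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(recognize_with_engine, name, audio)
                   for name in engines}
        return {name: f.result() for name, f in futures.items()}

results = fan_out(b"\x00\x01", ["engine_A", "engine_B", "engine_C"])
```

Issuing the n requests concurrently keeps the end-to-end latency close to that of the slowest single engine rather than the sum of all of them.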
Step 303: select a target speech recognition result from the n speech recognition results according to the feature information of the n results.
After the n speech recognition results output by the n engines are obtained, because the n engines have different confidence for different recognition scenarios, the accuracy of the n results may differ; therefore one of the n results is selected as the target speech recognition result.
The feature information of a speech recognition result indicates both the degree of adaptation between the engine that output the result and the speech data to be recognized, and the credibility of the words the result contains. The degree of adaptation reflects how accurately an engine recognizes the speech data to be recognized: the higher the adaptation, the more accurately the engine that output the result recognizes that data. For example, when the speech data to be recognized is "the battlefield is dominated by my people", if the adaptation between speech recognition engine A and the data is lower than the adaptation between engine B and the data, this indicates that engine B recognizes the speech data to be recognized more accurately.
The credibility reflects the degree of overlap between the words a speech recognition result contains and the words in a predefined dictionary: the higher the credibility, the greater the overlap, indicating that the words in the result are more accurate. For example, speech recognition result A is "你不和吕布谁厉害" ("Ni-bu and Lü Bu, who is stronger") and result B is "李白和吕布谁厉害" ("Li Bai and Lü Bu, who is stronger"), while the predefined dictionary contains the words "李白" (Li Bai) and "吕布" (Lü Bu). Result B overlaps with the words in the predefined dictionary, indicating that the words in result B are more accurate.
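The credibility signal described above can be sketched minimally, under the assumption that a hypothesis has already been segmented into words: credibility is taken here as the fraction of hypothesis words found in the predefined dictionary. The segmentation and the two-entry dictionary are illustrative, not the patent's actual scoring.

```python
def credibility(tokens: list[str], dictionary: set[str]) -> float:
    """Fraction of hypothesis tokens that appear in the predefined dictionary."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in dictionary)
    return hits / len(tokens)

domain_dict = {"李白", "吕布"}                 # e.g. game-character names
hyp_a = ["你不", "和", "吕布", "谁", "厉害"]   # "李白" misrecognized as "你不"
hyp_b = ["李白", "和", "吕布", "谁", "厉害"]   # correct hypothesis
assert credibility(hyp_b, domain_dict) > credibility(hyp_a, domain_dict)
```

Here hypothesis B scores higher because two of its words hit the dictionary versus one for hypothesis A, matching the overlap intuition in the example above.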
Details of the above feature information are introduced in the Fig. 4 embodiment below and are not repeated here.
Optionally, selecting the target speech recognition result from the n results according to the feature information of the n results comprises: calculating a confidence score for each of the n speech recognition results according to its feature information; and selecting, from the n results, the result with the highest confidence score as the target speech recognition result.
Calculating the confidence scores of the n speech recognition results can be implemented by a machine learning model. The machine learning model may be a Markov continuous speech recognition model, a neural network model, an SVM (Support Vector Machine), or the like; the embodiments of this application are not limited in this respect.
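The scoring-and-selection step can be sketched as follows. A hand-weighted linear scorer stands in for the trained model (Markov model, neural network, SVM, and so on); the feature names, weights, and scores are made-up placeholders for illustration only.

```python
def confidence(features: dict[str, float], weights: dict[str, float]) -> float:
    """Toy linear scorer: weighted sum of the feature values."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def select_target(hypotheses: dict[str, dict[str, float]],
                  weights: dict[str, float]) -> str:
    """Return the id of the hypothesis with the highest confidence score."""
    return max(hypotheses, key=lambda h: confidence(hypotheses[h], weights))

weights = {"adaptation": 0.6, "credibility": 0.4}
hyps = {
    "result_A": {"adaptation": 0.3, "credibility": 0.2},
    "result_B": {"adaptation": 0.8, "credibility": 0.9},
}
best = select_target(hyps, weights)   # result_B wins on both signals
```

In the patent's scheme the scorer is learned rather than hand-weighted, but the selection itself is still an argmax over per-result confidence scores.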
In summary, in the technical solutions provided by the embodiments of this application, speech data to be recognized is sent to multiple speech recognition engines for recognition, multiple speech recognition results are obtained, and one of them is selected as the target speech recognition result according to the feature information of the results. In the related art, speech recognition relies on a single general-purpose engine whose recognition performance in certain specific domains is relatively poor; in the technical solution provided by this application, the results of multiple engines are used as references and the best result is chosen from among them, which improves the accuracy of the recognition result.
Referring to Fig. 4, it shows a flowchart of a speech recognition method provided by another embodiment of this application. In the embodiments of this application, the method is mainly described as being applied to the background server in the implementation environment shown in Fig. 1. The method may include the following steps:
Step 401: obtain speech data to be recognized.
This step is the same as or similar to step 301 in the Fig. 3 embodiment and is not repeated here.
Step 402: send the speech data to be recognized to n speech recognition engines to obtain n speech recognition results, n being an integer greater than 1.
This step is the same as or similar to step 302 in the Fig. 3 embodiment and is not repeated here.
Step 403: calculate a confidence score for each of the n speech recognition results according to its feature information.
The confidence score reflects the accuracy of a speech recognition result: the higher the score, the more accurate the result.
Calculating the confidence scores of the n speech recognition results according to their feature information may include the following two steps: for the i-th speech recognition result among the n results, obtain the feature information of the i-th result; then input that feature information into a machine learning model to obtain the confidence score of the i-th result, i being a positive integer less than or equal to n.
The feature information reflects abstract characteristics of the speech data to be recognized, of the engine that output the i-th result, and of the i-th result itself. The feature information includes but is not limited to at least one of the following: the domain feature of the speech data to be recognized; the regional feature of the speech data to be recognized; the gender feature of the speech data to be recognized; the history feature of the speech data to be recognized; the domain feature of the engine that output the i-th result; the regional feature of that engine; the gender feature of that engine; the history feature of that engine; and the matching degree between the i-th result and a predefined dictionary.
The domain feature of the speech data to be recognized reflects the domain to which the subject of the speech data belongs, such as the gaming, music, children's stories, politics, economics, or science and technology domain; the domain feature of the engine that output the i-th result reflects that engine's confidence in recognizing speech from those domains.
The regional feature of the speech data to be recognized reflects the accent of the region where the user who input the data is located, such as a Northeastern, Cantonese, or Shanghai accent; the regional feature of the engine that output the i-th result reflects that engine's confidence in recognizing speech data with different regional features.
The gender feature of the speech data to be recognized reflects the gender of the user who input the data, such as a male or female voice; the gender feature of the engine that output the i-th result reflects that engine's confidence in recognizing speech data with different gender features.
The history feature of the speech data to be recognized reflects whether the speech data appears in the history interaction log; the history feature of the engine that output the i-th result reflects that engine's confidence in recognizing speech data from the history interaction log.
The matching degree between the i-th speech recognition result and the predefined dictionary reflects the similarity between the i-th result and the words in the dictionary. The predefined dictionary may include the domain knowledge graph of the domain to which the subject of the i-th result belongs, the knowledge graph containing the proprietary vocabulary of that domain; it may also include a dictionary of past recognition results recorded in the history interaction log, a dictionary of high-frequency words collected in advance by background designers, and so on; the embodiments of this application are not limited in this respect.
In addition, the feature information may also include other features; the embodiments of this application are not limited in this respect.
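The feature information enumerated above might be assembled into a fixed-length vector along the following lines. The field names and the float encodings are assumptions made for illustration, not the patent's own representation.

```python
from dataclasses import dataclass

@dataclass
class RecognitionFeatures:
    """Toy container for the feature information of the i-th hypothesis."""
    domain_match: float    # engine's confidence for the utterance's domain
    region_match: float    # engine's confidence for the speaker's accent
    gender_match: float    # engine's confidence for the speaker's gender
    history_hit: float     # 1.0 if the utterance appears in the interaction log
    dict_match: float      # overlap between the hypothesis and the dictionary

    def as_vector(self) -> list[float]:
        """Flatten into the order a scoring model would consume."""
        return [self.domain_match, self.region_match, self.gender_match,
                self.history_hit, self.dict_match]

f = RecognitionFeatures(0.9, 0.7, 0.5, 1.0, 0.4)
```

A fixed ordering like `as_vector` matters because the trained scoring model expects each feature at a stable position in its input.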
After the multiple pieces of feature information corresponding to the i-th speech recognition result are obtained, they can be input into a pre-trained machine learning model to calculate the confidence score of the i-th result. The input of the machine learning model is the multiple feature information items of a speech recognition result, and the output is the confidence score of that result.
Step 404: from the n speech recognition results, select the result with the highest confidence score as the target speech recognition result.
The higher the confidence score, the more accurate the result. Therefore, after the confidence scores of the n results are obtained, the result with the highest score can be selected as the target speech recognition result.
In addition, when two or more results are tied for the highest confidence score, any one of them can be chosen as the target speech recognition result.
Optionally, after the target speech recognition result is obtained, the erroneous words in it can be corrected and rewritten to obtain a corrected target speech recognition result.
Because the speech recognition engines are general-purpose engines, their accuracy for certain specific scenarios is insufficient, and partially erroneous words may appear, affecting human-computer interaction. Therefore, error-correction rewriting is introduced: it can find the erroneous words in the target result and correct and rewrite them to obtain a corrected target speech recognition result, which the terminal's interactive system then uses to respond, improving the accuracy of speech recognition.
The above error correction rewriting of the erroneous words in the target speech recognition result may include the following steps 405-406.
Step 405: input the target speech recognition result into the error correction rewriting system.
Since the above speech recognition engines are general-purpose speech recognition engines whose accuracy is insufficient for certain specific scenarios, some erroneous words may appear, affecting human-computer interaction. Therefore, the target speech recognition result is input into the error correction rewriting system, the erroneous words in the target speech recognition result can be found and rewritten, and the accuracy of speech recognition is improved.
The error correction rewriting system is used to determine the erroneous words in the target speech recognition result, obtain the correct words corresponding to the erroneous words, and rewrite the erroneous words in the target speech recognition result as the correct words, obtaining the error-corrected target speech recognition result. For example, if the target speech recognition result is "ni bai and Lü Bu, who is stronger", the error correction rewriting system can determine that "ni bai" in the target speech recognition result is an erroneous word, further obtain the correct word corresponding to "ni bai", such as "Li Bai", and rewrite the erroneous word "ni bai" as the correct word "Li Bai", obtaining the error-corrected target speech recognition result "Li Bai and Lü Bu, who is stronger". As another example, if the target speech recognition result is "our Zhou Jielun's song", the error correction rewriting system can determine that "our" in the target speech recognition result is an erroneous word, further obtain the correct word corresponding to "our", such as "play", and rewrite the erroneous word "our" as the correct word "play", obtaining the error-corrected target speech recognition result "play Zhou Jielun's song".
Optionally, with reference to Fig. 5, the error correction rewriting system 500 includes a first rewriting layer 501, a second rewriting layer 502, and a third rewriting layer 503. The first rewriting layer 501 is used to rewrite high-frequency erroneous words in the target speech recognition result, the second rewriting layer 502 is used to rewrite erroneous words related to the domain of the target speech recognition result, and the third rewriting layer 503 is used to rewrite redundant erroneous words in the target speech recognition result.
Optionally, the rewriting precision of the first rewriting layer, the second rewriting layer, and the third rewriting layer increases successively. After the error correction rewriting of the first rewriting layer succeeds, the error correction rewriting of the second and third rewriting layers is no longer executed; after the error correction rewriting of the first rewriting layer fails, the error correction rewriting of the second rewriting layer is executed; after the error correction rewriting of the second rewriting layer succeeds, the error correction rewriting of the third rewriting layer is no longer executed; and after the error correction rewriting of the second rewriting layer fails, the error correction rewriting of the third rewriting layer is executed.
In the following, with reference to Fig. 6, the contents of the first, second, and third rewriting layers are described in detail.
1. The first rewriting layer 501 includes a whitelist and/or at least one rewriting rule. The whitelist refers to a mapping table between high-frequency erroneous words and the correct words corresponding to those high-frequency erroneous words. When it is detected that the target speech recognition result includes a high-frequency erroneous word, error correction rewriting is performed on the high-frequency erroneous word according to the whitelist and/or the rewriting rules. The mapping table may be collected from historical interaction logs, or may be determined in advance by back-end designers; the embodiments of the present application do not limit this.
The rewriting rules are rules set for high-frequency erroneous words, and may include different rewriting methods for the same high-frequency erroneous word in different domains. For example, for the high-frequency erroneous word "I put": when the voice data belongs to the music domain, "I put" is rewritten as "play"; when the voice data belongs to the game domain, "I put" is rewritten as "we". The rewriting rules may be collected from historical interaction logs, or may be determined in advance by back-end designers; the embodiments of the present application do not limit this.
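A minimal sketch of the first rewriting layer, combining the whitelist lookup with a domain-dependent rewriting rule, might look like the following. All table entries are illustrative, not taken from a real interaction log.

```python
# Sketch of the first rewriting layer: whitelist plus domain-dependent rules.
WHITELIST = {"ni bai": "Li Bai"}  # high-frequency erroneous word -> correct word
RULES = {"I put": {"music": "play", "game": "we"}}  # same word, per-domain targets

def first_layer(text, domain):
    # Whitelist: unconditional replacement of known high-frequency errors.
    for wrong, right in WHITELIST.items():
        text = text.replace(wrong, right)
    # Rules: the same erroneous word is rewritten differently per domain.
    for wrong, by_domain in RULES.items():
        if wrong in text and domain in by_domain:
            text = text.replace(wrong, by_domain[domain])
    return text

print(first_layer("I put Zhou Jielun's song", "music"))
# -> play Zhou Jielun's song
print(first_layer("ni bai and Lü Bu who is stronger", "game"))
# -> Li Bai and Lü Bu who is stronger
```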
2. The second rewriting layer 502 includes a first error detection module, a second error detection module, and a rewriting module. The first error detection module is used to call a language model to detect the erroneous words related to the domain of the target speech recognition result. Optionally, the language model may be an N-gram model. In this application, N takes the values 1, 2, and 3, i.e., a unigram (1-gram) model, a bigram (2-gram) model, and a trigram (3-gram) model. In other embodiments, N may also be a natural number greater than 3; the embodiments of the present application do not limit this. Detecting the domain-related erroneous words of the target speech recognition result with the N-gram models mainly includes the following steps:
(1) Input the target speech recognition result into the N-gram models to obtain the corresponding score values.
After the target speech recognition result is obtained, it can be input into the above unigram model, bigram model, and trigram model respectively. Optionally, before being input into the above models, the target speech recognition result may be segmented to obtain the word-pair lists corresponding to the target speech recognition result. Illustratively, taking the bigram model as an example, suppose the target speech recognition result is "our Wang Fei's red bean"; the word pairs of the bigram model are then: [our, Wang Fei], [Wang Fei, 's], ['s, red bean]. Similarly, the word lists of the unigram and trigram models are obtained. Further, the word lists are input into the corresponding models respectively; for any word among them, each model can calculate two score values. In the embodiment of the present application, for any word in the target speech recognition result, the score values calculated by the unigram, bigram, and trigram models can thus be obtained, six score values in total. Optionally, the score values can be calculated using the longest common substring and/or edit distance methods.
(2) Calculate the mean score from the score values corresponding to the above N-gram models.
In the embodiment of the present application, for any word, the score values calculated by the unigram, bigram, and trigram models can be obtained, six score values in total; further, the mean score of each word can be obtained.
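The per-word averaging in step (2) reduces to the following. The six score values per word are hypothetical stand-ins; in the described system they would come from the unigram, bigram, and trigram models.

```python
# Sketch of step (2): average the six n-gram score values of each word.
def mean_scores(scores_per_word):
    """scores_per_word: {word: [six n-gram score values]} -> {word: mean score}."""
    return {w: sum(s) / len(s) for w, s in scores_per_word.items()}

scores = {  # hypothetical n-gram scores; a low-scoring word is suspect
    "play":        [0.9, 0.8, 0.9, 0.85, 0.8, 0.9],
    "Zhou Jielun": [0.8, 0.9, 0.85, 0.8, 0.9, 0.85],
    "guiding":     [0.2, 0.1, 0.15, 0.1, 0.2, 0.15],
}
means = mean_scores(scores)
```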
(3) Detect the domain-related erroneous words of the target speech recognition result with a filtering algorithm.
In the embodiment of the present application, the filtering algorithm is illustrated with the MAD (Mean Absolute Differences) algorithm as an example: the larger the MAD value, the more likely the word is an erroneous word; further, the word with the largest MAD value can be determined as the erroneous word.
After the mean score of each word is obtained, the mean absolute difference of each word relative to the mean score can further be obtained, and according to the above filtering algorithm, the word with the largest mean absolute difference is determined as the erroneous word. In some other embodiments, the filtering algorithm may also be the SAD (Sum of Absolute Differences) algorithm, the SSD (Sum of Squared Differences) algorithm, the MSD (Mean Square Differences) algorithm, etc.; the embodiments of the present application do not limit this.
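Under the reading above, the MAD-style filter flags the word whose mean score deviates most from the overall mean. A minimal sketch, with hypothetical per-word mean scores in place of real N-gram output:

```python
# Sketch of step (3): MAD-style filtering — the word whose mean score has the
# largest absolute deviation from the overall mean is flagged as erroneous.
def detect_erroneous(mean_scores):
    overall = sum(mean_scores.values()) / len(mean_scores)
    return max(mean_scores, key=lambda w: abs(mean_scores[w] - overall))

means = {"play": 0.86, "Zhou Jielun": 0.85, "guiding": 0.15}  # illustrative
print(detect_erroneous(means))  # -> guiding
```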
The second error detection module is used to detect the domain-related erroneous words of the target speech recognition result according to a parsing algorithm, i.e., by analyzing the syntactic structure of the target speech recognition result (subject, predicate, object, attributive, adverbial, complement) or analyzing the dependency relationships between the words of the target speech recognition result, so as to detect the domain-related erroneous words of the target speech recognition result. This mainly includes the following steps:
(1) Detect the keyword in the target speech recognition result.
The keyword of a sentence is usually its predicate. Illustratively, suppose the target speech recognition result is "play the guiding of Zhou Jielun"; according to the syntactic structure it can be determined that "play" is the predicate, "Zhou Jielun" is an attributive, and "guiding" is the object, so "play" can be determined to be the keyword.
(2) Detect the entities in the target speech recognition result according to the above keyword.
The entities of a sentence are usually the words after the keyword. Illustratively, suppose the target speech recognition result is "play the guiding of Zhou Jielun"; the entities may then include "Zhou Jielun" and "guiding".
(3) Extract the target entity from the above entities and determine it as a domain-related erroneous word of the target speech recognition result.
The target entity may be a suspected erroneous word among the above entities. Among the entities "Zhou Jielun" and "guiding", the target entity is "guiding", and "guiding" is determined as a domain-related erroneous word of the target speech recognition result.
The rewriting module is used to rewrite the erroneous words detected by the first error detection module and the second error detection module as correct words.
The rewriting module may include a correct-word recall unit, a filter unit, and a rewriting unit. Optionally, the rewriting module may also include a sorting unit.
The correct-word recall unit is used to select, from the domain knowledge graph of the domain of the target speech recognition result, at least one candidate correct word whose similarity score with the erroneous word is greater than a preset threshold.
The similarity score may be a pinyin similarity score and/or a glyph similarity score. The domain knowledge graph may include the domain-specific dictionary of the domain of the target speech recognition result. The candidate words may be at least one domain-specific word of the domain of the target speech recognition result.
Illustratively, the process of determining the candidate correct words is introduced taking pinyin similarity as an example. Suppose the target speech recognition result is "play the guiding of Zhou Jielun" and belongs to the music domain; the detected erroneous word is "guiding", whose corresponding pinyin text is "dao xiang". The pinyin "dao xiang" is then compared against the pinyin of the music-domain dictionary and similarity scores are calculated; for example, the similarity score of "rice fragrance" is 90, the similarity score of "swing to" is 65, and the similarity score of "island to" is 50. A suitable preset threshold is chosen, such as 60, and the words in the music-domain dictionary whose scores exceed the preset threshold, such as "rice fragrance" and "swing to", are taken as candidate correct words.
Sequencing unit be used for by least one candidate correct corresponding similarity score of word according to the rule successively decreased into
Row sequence.Such as the similarity score of " rice fragrant " is 90 points, the similarity of " swinging to " is divided into 65 points, then will " rice is fragrant " as the
One candidate correct word, " swinging to " is as the second candidate correct word.
The filter unit is used to calculate the perplexity (PPL) score of each candidate correct word, and to determine the target correct word from the candidate correct words according to the perplexity scores. In the embodiment of the present application, the PPL (perplexity) score is used to characterize the accuracy of rewriting each of the at least one candidate correct word into the target speech recognition result. The perplexity can be calculated with the following formula:

PPL = (∏_{i=1}^{N} 1/P(w_i))^{1/N}

where N denotes the length of the target speech recognition result and P(w_i) is the probability of occurrence of the i-th word after the candidate correct word is substituted in. The smaller this perplexity, the larger P(w_i), the higher the accuracy of rewriting the i-th candidate correct word into the target speech recognition result, and the higher the candidate's PPL score.
Further, the candidate correct word with the highest PPL score can be determined as the target correct word.
For example, the target speech recognition result is "play the guiding of Zhou Jielun", and the candidate correct words include "rice fragrance" and "swing to", where the PPL score of "rice fragrance" is 95 and the PPL score of "swing to" is 55; therefore "rice fragrance" can be taken as the target correct word.
Optionally, the filter unit can call a language model to determine the target correct word from the above at least one candidate correct word.
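The filter unit's selection can be sketched with the standard perplexity formula above: substitute each candidate into the sentence, score the resulting word probabilities, and keep the candidate whose sentence has the lowest perplexity (i.e., the highest PPL score in the patent's sense). The probability values below are illustrative, not from a trained language model.

```python
import math

# Toy word probabilities of each rewritten sentence; illustrative values only.
CANDIDATE_PROBS = {
    "rice fragrance": [0.9, 0.8, 0.4],
    "swing to":       [0.9, 0.8, 0.01],
}

def perplexity(probs):
    """PPL = (prod 1/P(w_i))^(1/N): smaller means a more fluent sentence."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

def best_candidate(cands):
    # The most fluent rewrite = lowest perplexity of the rewritten sentence.
    return min(cands, key=lambda w: perplexity(cands[w]))

print(best_candidate(CANDIDATE_PROBS))  # -> rice fragrance
```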
The rewriting unit is used to rewrite the erroneous word in the target speech recognition result as the target correct word.
For example, the target speech recognition result is "play the guiding of Zhou Jielun" and "rice fragrance" is the target correct word, so "guiding" in the target speech recognition result can be replaced with "rice fragrance", obtaining the error-corrected target speech recognition result, i.e., "play the rice fragrance of Zhou Jielun".
Further, the error-corrected target speech recognition result can be output.
3. The third rewriting layer 503 is used to call a neural network model to rewrite the redundant erroneous words in the target speech recognition result.
Optionally, the neural network model has an end-to-end (sequence-to-sequence) framework, i.e., an encoder-decoder framework. The neural network model may use a CNN (Convolutional Neural Network) model, an RNN (Recurrent Neural Network) model, a Transformer model, an LSTM (Long Short-Term Memory) model, etc.; the embodiments of the present application do not limit this.
The redundant erroneous words may be repeated words in the target speech recognition result, or may be modal particles in the target speech recognition result. For example, the target speech recognition result is "I want I want to listen to a song"; after the redundant erroneous words are rewritten, the target speech recognition result is "I want to listen to a song".
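The target behavior on repeated phrases can be illustrated with a simple rule-based filter. This stands in only for the effect described; the patent's third rewriting layer uses a sequence-to-sequence neural model, not this heuristic.

```python
# Sketch of the third rewriting layer's effect on repeated n-grams:
# drop the first copy of any immediately repeated phrase of up to max_n words.
def drop_repeated_ngrams(text, max_n=3):
    words, out, i = text.split(), [], 0
    while i < len(words):
        for n in range(max_n, 0, -1):
            if words[i:i + n] and words[i:i + n] == words[i + n:i + 2 * n]:
                i += n  # skip the first copy of the repeated n-gram
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(drop_repeated_ngrams("I want I want to listen to a song"))
# -> I want to listen to a song
```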
Step 406: obtain the error-corrected target speech recognition result output by the error correction rewriting system.
The terminal can obtain the error-corrected target speech recognition result output by the error correction rewriting system, and further respond to the user through the interactive system of the terminal according to the error-corrected target speech recognition result.
It should be noted that, in the embodiment of the present application, the above error correction rewriting does not depend on the above general-purpose speech recognition engines, and can be executed directly by the terminal, or by a server serving the terminal.
In conclusion technical solution provided by the embodiments of the present application, multiple by the way that voice data to be identified to be sent to
Speech recognition engine is identified, obtains multiple speech recognition results, and most by confidence level score value in multiple speech recognition results
Target voice recognition result input error correction is further rewritten system as target voice recognition result by high speech recognition result
System carries out error correction rewriting, obtains the revised target voice recognition result of error correction.Compared to the relevant technologies, by multiple speech recognitions
As a result the middle highest speech recognition result of confidence level score value rather than relies on some voice as target voice recognition result
Identify engine as a result, improving the accuracy of recognition result.Also, also by being entangled to the speech recognition result of the selection
Mistake is rewritten, and the accuracy of recognition result is further improved.
In addition, error correction replacement system be directed to different types of erroneous words, as high frequency erroneous words, the erroneous words of fields and
The erroneous words etc. of redundancy effectively improve the conjunction of final speech recognition result using different error correction rewrite strategies or model
Rationality.
In addition, in the embodiment of the present application, above-mentioned error correction is rewritten independent of general language identification engine, it can be direct
By terminal, or the server execution of the terminal is serviced, saves time and the cost of speech recognition.
In the following, with reference to Fig. 7, taking the language model of the first error detection module being an N-gram model and the filtering algorithm being the MAD algorithm as an example, the entire flow of the second rewriting layer rewriting the domain-related erroneous words of the target speech recognition result is briefly introduced. Take the target speech recognition result "ni bai and Lü Bu, who is stronger" as an example. The first error detection module of the second rewriting layer calls the N-gram models, such as the unigram, bigram, and trigram models, and further obtains the score values calculated by the unigram, bigram, and trigram models, i.e., the 1-gram score, 2-gram score, and 3-gram score; the N-gram mean score is calculated from the 1-gram score, 2-gram score, and 3-gram score, and based on this N-gram mean score the MAD filtering algorithm is used to determine that "ni bai" is an erroneous word. The second error detection module determines that "ni bai" is an erroneous word through keyword detection, entity detection, and target-entity extraction. Afterwards, in the correct-word recall unit, at least one candidate correct word is retrieved from the domain knowledge graph according to pinyin similarity and/or glyph similarity; the sorting unit sorts the similarity scores corresponding to the candidate correct words in decreasing order; in the filter unit, the PPL score of each candidate correct word is calculated according to the order of the candidate correct words, and the candidate correct word with the highest PPL score is determined as the target correct word; finally, the rewriting unit rewrites the erroneous word in the target speech recognition result as the target correct word, obtaining the error-corrected target speech recognition result "Li Bai and Lü Bu, who is stronger". At this point, the error correction rewriting is complete.
The following are apparatus embodiments of the present application, which can be used to execute the method embodiments of the present application. For details not disclosed in the apparatus embodiments of the present application, please refer to the method embodiments of the present application.
Referring to Fig. 8, it shows a block diagram of a speech recognition apparatus provided by one embodiment of the present application. The apparatus has functions for implementing the above method examples; the functions may be implemented by hardware, or by hardware executing corresponding software. The apparatus may be the background server or the terminal described above, or may be set on the background server or the terminal. The apparatus 800 may include: a data acquisition module 810, a data transmission module 820, a result selection module 830, and a result rewriting module 840.
The data acquisition module 810 is used to obtain voice data to be recognized.
The data transmission module 820 is used to send the voice data to be recognized to n speech recognition engines to obtain n speech recognition results, where n is an integer greater than 1.
The result selection module 830 is used to select the target speech recognition result from the n speech recognition results according to the feature information of the n speech recognition results; the feature information of a speech recognition result is used to indicate the degree of adaptation between the speech recognition engine that outputs the speech recognition result and the voice data to be recognized, and the credibility of the words included in the speech recognition result.
In conclusion in technical solution provided by the embodiments of the present application, it is more by the way that voice data to be identified to be sent to
A speech recognition engine is identified, obtains multiple speech recognition results, and believe according to the feature of multiple speech recognition result
Breath, selected from multiple speech recognition result one as target voice recognition result.Compared in the related technology, voice is known
Speech recognition engine that Yi Laiyu be not single, and the speech recognition engine is a general-purpose platform, for the knowledge of certain specific areas
Other effect is relatively poor, in technical solution provided by the present application, using the speech recognition result conduct of multiple speech recognition engines
With reference to, and preferably speech recognition result is chosen from multiple speech recognition results, improve the accuracy of recognition result.
In some possible designs, as shown in Fig. 9, the result selection module 830 includes: a score calculation unit 831 and a result selection unit 832.
The score calculation unit 831 is used to calculate the confidence scores corresponding to the n speech recognition results according to the feature information of the n speech recognition results.
The result selection unit 832 is used to select, from the n speech recognition results, the speech recognition result with the highest confidence score as the target speech recognition result.
In some possible designs, as shown in Fig. 9, the apparatus 800 further includes: a result rewriting module 840.
The result rewriting module 840 is used to perform error correction rewriting on the erroneous words in the target speech recognition result to obtain the error-corrected target speech recognition result.
It should be noted that, when the apparatus provided in the above embodiments implements its functions, the division into the above functional modules is only used as an example for illustration; in practical applications, the above functions can be allocated to different functional modules as needed, i.e., the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
Referring to Fig. 10, it shows a structural schematic diagram of a computer device provided by one embodiment of the present application. The computer device is used to implement the speech recognition method provided in the above embodiments. The computer device may be the background server described above, or may be a terminal capable of voice interaction with a user, such as the smart phone, tablet computer, PC, intelligent robot, smart television, or smart speaker described above. Specifically:
The computer device 1000 includes a central processing unit (CPU) 1001, a system memory 1004 including a random access memory (RAM) 1002 and a read-only memory (ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The computer device 1000 further includes a basic input/output system (I/O system) 1006 that helps transmit information between the devices within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1012.
The basic input/output 1006 includes display 1008 for showing information and inputs for user
The input equipment 1009 of such as mouse, keyboard etc of information.Wherein the display 1008 and input equipment 1009 all pass through
The input and output controller 1010 for being connected to system bus 1005 is connected to central processing unit 1001.The basic input/defeated
System 1006 can also include input and output controller 1010 to touch for receiving and handling from keyboard, mouse or electronics out
Control the input of multiple other equipment such as pen.Similarly, input and output controller 1010 also provide output to display screen, printer or
Other kinds of output equipment.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable medium provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
Without loss of generality, the computer-readable medium may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state storage technologies, CD-ROM, DVD or other optical storage, tape cassettes, magnetic tape, disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage medium is not limited to the above. The above system memory 1004 and mass storage device 1007 may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1000 may also operate through a remote computer connected to a network such as the Internet. That is, the computer device 1000 may be connected to the network 1012 through a network interface unit 1011 connected to the system bus 1005; in other words, the network interface unit 1011 may also be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes at least one instruction, at least one program, a code set, or an instruction set, which is stored in the memory and configured to be executed by one or more processors to implement the above speech recognition method.
In an exemplary embodiment, a computer-readable storage medium is also provided; at least one instruction, at least one program, a code set, or an instruction set is stored in the storage medium, and the at least one instruction, the at least one program, the code set, or the instruction set, when executed by a processor, implements the above speech recognition method.
In an exemplary embodiment, a computer program product is also provided; when executed by a processor, the computer program product is used to implement the above speech recognition method.
It should be understood that "multiple" referred to herein means two or more. "And/or" describes the association relationship of associated objects, indicating that three kinds of relationships may exist; for example, "A and/or B" may indicate three situations: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
The foregoing are merely exemplary embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included within the scope of protection of the present application.
Claims (15)
1. A speech recognition method, characterized in that the method includes:
obtaining voice data to be recognized;
sending the voice data to be recognized to n speech recognition engines to obtain n speech recognition results, where n is an integer greater than 1; and
selecting a target speech recognition result from the n speech recognition results according to feature information of the n speech recognition results, wherein the feature information of a speech recognition result is used to indicate a degree of adaptation between the speech recognition engine outputting the speech recognition result and the voice data to be recognized, and a credibility of words included in the speech recognition result.
2. The method according to claim 1, characterized in that the selecting a target speech recognition result from the n speech recognition results according to the feature information of the n speech recognition results includes:
calculating confidence scores corresponding to the n speech recognition results according to the feature information of the n speech recognition results; and
selecting, from the n speech recognition results, the speech recognition result with the highest confidence score as the target speech recognition result.
3. The method according to claim 2, characterized in that the calculating confidence scores corresponding to the n speech recognition results according to the feature information of the n speech recognition results includes:
for the i-th speech recognition result among the n speech recognition results, obtaining the feature information of the i-th speech recognition result, where i is a positive integer less than or equal to n; and
inputting the feature information of the i-th speech recognition result into a machine learning model to obtain the confidence score corresponding to the i-th speech recognition result.
4. The method according to any one of claims 1 to 3, characterized in that, after the selecting a target speech recognition result from the n speech recognition results according to the feature information of the n speech recognition results, the method further includes:
performing error correction rewriting on erroneous words in the target speech recognition result to obtain an error-corrected target speech recognition result.
5. The method according to claim 4, wherein the performing error-correction rewriting on the erroneous words in the target speech recognition result to obtain the error-corrected target speech recognition result comprises:
inputting the target speech recognition result into an error-correction rewriting system, the error-correction rewriting system being configured to determine the erroneous words in the target speech recognition result, obtain correct words corresponding to the erroneous words, and rewrite the erroneous words in the target speech recognition result as the correct words to obtain the error-corrected target speech recognition result;
obtaining the error-corrected target speech recognition result output by the error-correction rewriting system.
6. The method according to claim 5, wherein the error-correction rewriting system comprises: a first rewrite layer, a second rewrite layer and a third rewrite layer;
wherein the first rewrite layer is configured to rewrite high-frequency erroneous words in the target speech recognition result, the second rewrite layer is configured to rewrite domain-specific erroneous words in the target speech recognition result, and the third rewrite layer is configured to rewrite redundant erroneous words in the target speech recognition result.
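The three-layer cascade in claim 6 can be pictured as rewrite functions applied in sequence. Only the layered structure comes from the claim; the layer bodies below are deliberately trivial placeholders.

```python
# Sketch of claim 6's cascade: three rewrite layers applied in order to
# the target recognition result. Each layer body here is a placeholder
# (real layers would use whitelists, detection models, etc.).

def layer1_high_freq(text):   # rewrite high-frequency erroneous words
    return text.replace("wether", "weather")

def layer2_domain(text):      # rewrite domain-specific erroneous words
    return text.replace("stok price", "stock price")

def layer3_redundancy(text):  # rewrite redundant (repeated) words
    out = []
    for word in text.split():
        if not out or word != out[-1]:
            out.append(word)  # drop immediate repetitions
    return " ".join(out)

def correct(text):
    for layer in (layer1_high_freq, layer2_domain, layer3_redundancy):
        text = layer(text)
    return text

fixed = correct("the the wether and stok price")
```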
7. The method according to claim 6, wherein the first rewrite layer is configured to:
upon detecting that the target speech recognition result comprises a high-frequency erroneous word, perform error-correction rewriting on the high-frequency erroneous word according to a white list and/or a rewrite rule, the white list being a mapping table between high-frequency erroneous words and the correct words corresponding to the high-frequency erroneous words, and the rewrite rule being a rewrite rule set for the high-frequency erroneous words.
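A minimal sketch of claim 7's whitelist-plus-rule rewriting follows. The whitelist entries and the regex rule are invented examples; the claim only specifies the two mechanisms, not their contents.

```python
# Sketch of claim 7's first rewrite layer: high-frequency erroneous words
# are replaced via a whitelist (error word -> correct word) and/or
# pattern-based rewrite rules. All entries below are made up.
import re

WHITELIST = {"there house": "their house"}        # error -> correct word
RULES = [(re.compile(r"\bgonna\b"), "going to")]  # rule-based rewrites

def rewrite_high_freq(text):
    for wrong, right in WHITELIST.items():
        text = text.replace(wrong, right)         # whitelist lookup
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)     # rewrite rules
    return text

fixed = rewrite_high_freq("gonna visit there house")
```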
8. The method according to claim 6, wherein the second rewrite layer comprises: a first error-detection module, a second error-detection module and a rewriting module;
the first error-detection module is configured to call a language model to detect domain-specific erroneous words in the target speech recognition result;
the second error-detection module is configured to detect domain-specific erroneous words in the target speech recognition result according to a parsing algorithm;
the rewriting module is configured to rewrite the erroneous words detected by the first error-detection module and the second error-detection module as correct words.
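One way to picture claim 8's language-model detection module is to flag words the model considers improbable. The unigram probabilities below are invented; a real system would use a full language model (the claim's second, parser-based module is not sketched here).

```python
# Sketch of claim 8's first error-detection module: a language model
# flags low-probability words as candidate erroneous words. The toy
# unigram table below is an assumption for illustration only.

LM_PROB = {"book": 0.2, "a": 0.3, "flight": 0.1, "flite": 0.0001}

def detect_errors(sentence, threshold=0.01):
    """Return words whose language-model probability falls below threshold."""
    return [w for w in sentence.split() if LM_PROB.get(w, 0.0) < threshold]

errors = detect_errors("book a flite")
```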
9. The method according to claim 8, wherein the rewriting module comprises a correct-word recall unit, a filter unit and a rewrite unit;
the correct-word recall unit is configured to select, from a domain knowledge graph of the field to which the target speech recognition result belongs, at least one candidate correct word whose similarity score with the erroneous word is greater than a preset threshold;
the filter unit is configured to calculate a perplexity score of each candidate correct word, and determine a target correct word from the candidate correct words according to the perplexity scores of the candidate correct words, the perplexity score characterizing the accuracy of rewriting the candidate correct word into the target speech recognition result;
the rewrite unit is configured to rewrite the erroneous word in the target speech recognition result as the target correct word.
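Claim 9's recall-filter-rewrite pipeline can be sketched with toy stand-ins: a character-overlap ratio in place of the domain knowledge graph's similarity score, and an inverse-count heuristic in place of a real language-model perplexity.

```python
# Sketch of claim 9: recall candidate correct words above a similarity
# threshold, keep the candidate whose substitution yields the lowest
# perplexity, then rewrite. Both scoring functions are toy assumptions.

def similarity(a, b):
    """Character-overlap ratio as a toy similarity score."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

def recall_candidates(error_word, domain_vocab, threshold=0.5):
    return [w for w in domain_vocab if similarity(error_word, w) > threshold]

def perplexity(sentence, lm_counts):
    """Lower is better; toy stand-in built from inverse word counts."""
    return sum(1.0 / lm_counts.get(tok, 0.1) for tok in sentence.split())

def rewrite(sentence, error_word, domain_vocab, lm_counts):
    candidates = recall_candidates(error_word, domain_vocab)
    best = min(candidates,
               key=lambda w: perplexity(sentence.replace(error_word, w),
                                        lm_counts))
    return sentence.replace(error_word, best)

vocab = {"fried", "flies", "freed"}
counts = {"order": 5, "fried": 8, "rice": 9, "flies": 1, "freed": 1}
out = rewrite("order fride rice", "fride", vocab, counts)
```

Here "fride" recalls "fried" and "freed" as candidates, and the perplexity filter prefers "fried" because the corrected sentence is more probable under the toy counts.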
10. The method according to claim 6, wherein the third rewrite layer is configured to:
call a neural network model to rewrite the redundant erroneous words in the target speech recognition result.
11. A speech recognition apparatus, wherein the apparatus comprises:
a data acquisition module, configured to obtain voice data to be identified;
a data transmission module, configured to send the voice data to be identified to n speech recognition engines to obtain n speech recognition results, the n being an integer greater than 1;
a result selection module, configured to select, according to feature information of the n speech recognition results, a target speech recognition result from the n speech recognition results; wherein the feature information of a speech recognition result is used to indicate a degree of adaptation between the speech recognition engine that outputs the speech recognition result and the voice data to be identified, and a credibility of the words comprised in the speech recognition result.
12. The apparatus according to claim 11, wherein the result selection module comprises:
a score calculation unit, configured to calculate, according to the feature information of the n speech recognition results, a confidence score corresponding to each of the n speech recognition results;
a result selection unit, configured to select, from the n speech recognition results, the speech recognition result with the highest confidence score as the target speech recognition result.
13. The apparatus according to claim 11 or 12, wherein the apparatus further comprises:
a result rewriting module, configured to perform error-correction rewriting on erroneous words in the target speech recognition result to obtain an error-corrected target speech recognition result.
14. A computer device, wherein the computer device comprises a processor and a memory, the memory storing at least one instruction, at least one program segment, a code set or an instruction set, which is loaded and executed by the processor to implement the method according to any one of claims 1 to 10.
15. A computer-readable storage medium, wherein the storage medium stores at least one instruction, at least one program segment, a code set or an instruction set, which is loaded and executed by a processor to implement the method according to any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910327337.6A CN110148416B (en) | 2019-04-23 | 2019-04-23 | Speech recognition method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110148416A true CN110148416A (en) | 2019-08-20 |
CN110148416B CN110148416B (en) | 2024-03-15 |
Family
ID=67593849
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910327337.6A Active CN110148416B (en) | 2019-04-23 | 2019-04-23 | Speech recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110148416B (en) |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020184016A1 (en) * | 2001-05-29 | 2002-12-05 | International Business Machines Corporation | Method of speech recognition using empirically determined word candidates |
US20030040907A1 (en) * | 2001-08-24 | 2003-02-27 | Sen-Chia Chang | Speech recognition system |
JP2003228393A (en) * | 2002-01-31 | 2003-08-15 | Nippon Telegr & Teleph Corp <Ntt> | Device and method for voice interaction, voice interaction program and recording medium therefor |
US20090259466A1 (en) * | 2008-04-15 | 2009-10-15 | Nuance Communications, Inc. | Adaptive Confidence Thresholds for Speech Recognition |
KR20110010233A (en) * | 2009-07-24 | 2011-02-01 | 고려대학교 산학협력단 | Apparatus and method for speaker adaptation by evolutional learning, and speech recognition system using thereof |
US20120179469A1 (en) * | 2011-01-07 | 2012-07-12 | Nuance Communication, Inc. | Configurable speech recognition system using multiple recognizers |
CN103440867A (en) * | 2013-08-02 | 2013-12-11 | 安徽科大讯飞信息科技股份有限公司 | Method and system for recognizing voice |
CN105869634A (en) * | 2016-03-31 | 2016-08-17 | 重庆大学 | Field-based method and system for feeding back text error correction after speech recognition |
CN106653007A (en) * | 2016-12-05 | 2017-05-10 | 苏州奇梦者网络科技有限公司 | Speech recognition system |
CN106683662A (en) * | 2015-11-10 | 2017-05-17 | 中国电信股份有限公司 | Speech recognition method and device |
CN107016995A (en) * | 2016-01-25 | 2017-08-04 | 福特全球技术公司 | The speech recognition based on acoustics and domain for vehicle |
CN107741928A (en) * | 2017-10-13 | 2018-02-27 | 四川长虹电器股份有限公司 | A kind of method to text error correction after speech recognition based on field identification |
WO2018059957A1 (en) * | 2016-09-30 | 2018-04-05 | Robert Bosch Gmbh | System and method for speech recognition |
CN108694940A (en) * | 2017-04-10 | 2018-10-23 | 北京猎户星空科技有限公司 | A kind of audio recognition method, device and electronic equipment |
CN109145281A (en) * | 2017-06-15 | 2019-01-04 | 北京嘀嘀无限科技发展有限公司 | Audio recognition method, device and storage medium |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110647987A (en) * | 2019-08-22 | 2020-01-03 | 腾讯科技(深圳)有限公司 | Method and device for processing data in application program, electronic equipment and storage medium |
CN110364146B (en) * | 2019-08-23 | 2021-07-27 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium |
CN110364146A (en) * | 2019-08-23 | 2019-10-22 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, speech recognition apparatus and storage medium |
CN110852087A (en) * | 2019-09-23 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Chinese error correction method and device, storage medium and electronic device |
CN110648668A (en) * | 2019-09-24 | 2020-01-03 | 上海依图信息技术有限公司 | Keyword detection device and method |
CN110765996A (en) * | 2019-10-21 | 2020-02-07 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN110765996B (en) * | 2019-10-21 | 2022-07-29 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN110597082A (en) * | 2019-10-23 | 2019-12-20 | 北京声智科技有限公司 | Intelligent household equipment control method and device, computer equipment and storage medium |
CN111081247A (en) * | 2019-12-24 | 2020-04-28 | 腾讯科技(深圳)有限公司 | Method for speech recognition, terminal, server and computer-readable storage medium |
CN111402861A (en) * | 2020-03-25 | 2020-07-10 | 苏州思必驰信息科技有限公司 | Voice recognition method, device, equipment and storage medium |
CN111402861B (en) * | 2020-03-25 | 2022-11-15 | 思必驰科技股份有限公司 | Voice recognition method, device, equipment and storage medium |
CN111599359A (en) * | 2020-05-09 | 2020-08-28 | 标贝(北京)科技有限公司 | Man-machine interaction method, server, client and storage medium |
CN111627438A (en) * | 2020-05-21 | 2020-09-04 | 四川虹美智能科技有限公司 | Voice recognition method and device |
CN111883122B (en) * | 2020-07-22 | 2023-10-27 | 海尔优家智能科技(北京)有限公司 | Speech recognition method and device, storage medium and electronic equipment |
CN111883122A (en) * | 2020-07-22 | 2020-11-03 | 海尔优家智能科技(北京)有限公司 | Voice recognition method and device, storage medium and electronic equipment |
CN112151022A (en) * | 2020-09-25 | 2020-12-29 | 北京百度网讯科技有限公司 | Speech recognition optimization method, device, equipment and storage medium |
CN112509565A (en) * | 2020-11-13 | 2021-03-16 | 中信银行股份有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112509566B (en) * | 2020-12-22 | 2024-03-19 | 阿波罗智联(北京)科技有限公司 | Speech recognition method, device, equipment, storage medium and program product |
CN112509566A (en) * | 2020-12-22 | 2021-03-16 | 北京百度网讯科技有限公司 | Voice recognition method, device, equipment, storage medium and program product |
CN113096654B (en) * | 2021-03-26 | 2022-06-24 | 山西三友和智慧信息技术股份有限公司 | Computer voice recognition system based on big data |
CN113096654A (en) * | 2021-03-26 | 2021-07-09 | 山西三友和智慧信息技术股份有限公司 | Computer voice recognition system based on big data |
US11651139B2 (en) | 2021-06-15 | 2023-05-16 | Nanjing Silicon Intelligence Technology Co., Ltd. | Text output method and system, storage medium, and electronic device |
WO2022262542A1 (en) * | 2021-06-15 | 2022-12-22 | 南京硅基智能科技有限公司 | Text output method and system, storage medium, and electronic device |
WO2023273776A1 (en) * | 2021-06-30 | 2023-01-05 | 青岛海尔科技有限公司 | Speech data processing method and apparatus, and storage medium and electronic apparatus |
CN113658586A (en) * | 2021-08-13 | 2021-11-16 | 北京百度网讯科技有限公司 | Training method of voice recognition model, voice interaction method and device |
CN113658586B (en) * | 2021-08-13 | 2024-04-09 | 北京百度网讯科技有限公司 | Training method of voice recognition model, voice interaction method and device |
CN113782030A (en) * | 2021-09-10 | 2021-12-10 | 平安科技(深圳)有限公司 | Error correction method based on multi-mode speech recognition result and related equipment |
CN113782030B (en) * | 2021-09-10 | 2024-02-02 | 平安科技(深圳)有限公司 | Error correction method based on multi-mode voice recognition result and related equipment |
CN113793604A (en) * | 2021-09-14 | 2021-12-14 | 思必驰科技股份有限公司 | Speech recognition system optimization method and device |
CN113793604B (en) * | 2021-09-14 | 2024-01-05 | 思必驰科技股份有限公司 | Speech recognition system optimization method and device |
CN113793597A (en) * | 2021-09-15 | 2021-12-14 | 云知声智能科技股份有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN114446279A (en) * | 2022-02-18 | 2022-05-06 | 青岛海尔科技有限公司 | Voice recognition method, voice recognition device, storage medium and electronic equipment |
CN114842871A (en) * | 2022-03-25 | 2022-08-02 | 青岛海尔科技有限公司 | Voice data processing method and device, storage medium and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN110148416B (en) | 2024-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110148416A (en) | Audio recognition method, device, equipment and storage medium | |
US11302330B2 (en) | Clarifying questions for rewriting ambiguous user utterance | |
US11055355B1 (en) | Query paraphrasing | |
US11222030B2 (en) | Automatically augmenting message exchange threads based on tone of message | |
CN108847241B (en) | Method for recognizing conference voice as text, electronic device and storage medium | |
US9734193B2 (en) | Determining domain salience ranking from ambiguous words in natural speech | |
US20200335096A1 (en) | Pinyin-based method and apparatus for semantic recognition, and system for human-machine dialog | |
US20200380077A1 (en) | Architecture for resolving ambiguous user utterance | |
US20140316764A1 (en) | Clarifying natural language input using targeted questions | |
JP2021533397A (en) | Speaker dialification using speaker embedding and a trained generative model | |
CN112673421A (en) | Training and/or using language selection models to automatically determine a language for voice recognition of spoken utterances | |
WO2018045646A1 (en) | Artificial intelligence-based method and device for human-machine interaction | |
US11830482B2 (en) | Method and apparatus for speech interaction, and computer storage medium | |
US11238858B2 (en) | Speech interactive method and device | |
CN109192194A (en) | Voice data mask method, device, computer equipment and storage medium | |
WO2020151690A1 (en) | Statement generation method, device and equipment and storage medium | |
JP7063937B2 (en) | Methods, devices, electronic devices, computer-readable storage media, and computer programs for voice interaction. | |
CN110717021B (en) | Input text acquisition and related device in artificial intelligence interview | |
CN109492085B (en) | Answer determination method, device, terminal and storage medium based on data processing | |
CN114678027A (en) | Error correction method and device for voice recognition result, terminal equipment and storage medium | |
CN110020429A (en) | Method for recognizing semantics and equipment | |
WO2021129411A1 (en) | Text processing method and device | |
TWI734085B (en) | Dialogue system using intention detection ensemble learning and method thereof | |
CN116978367A (en) | Speech recognition method, device, electronic equipment and storage medium | |
CN114254634A (en) | Multimedia data mining method, device, storage medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||