WO2014129033A1 - Speech recognition system and speech recognition device - Google Patents
Speech recognition system and speech recognition device
- Publication number
- WO2014129033A1 (PCT/JP2013/081288)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- recognition result
- speech recognition
- voice
- unit
- server
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L15/34—Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Definitions
- the present invention relates to a speech recognition system that performs speech recognition on a server side and a client side, and a technique for improving speech recognition accuracy in a client-side speech recognition apparatus in the speech recognition system.
- In order to improve speech recognition performance, speech recognition systems that perform recognition on both the server side and the client side have been used.
- For example, a method has been proposed in which speech recognition is first performed on the client side and, when the recognition score indicating the accuracy of the client-side result is determined to be poor, speech recognition is performed on the server side and the server-side speech recognition result is adopted.
- A method has also been proposed in which client-side and server-side speech recognition are performed simultaneously in parallel, the recognition score of the client-side result is compared with that of the server-side result, and the result with the better score is adopted.
- In addition, a method has been proposed in which the server side transmits part-of-speech information (common nouns, particles, etc.) together with the speech recognition result, and the client side uses the received part-of-speech information to correct the recognition result, for example by replacing common nouns with proper nouns.
- However, when the server side does not transmit a recognition score, or when the calculation method of the transmitted recognition score is unknown (for example, when only the client-side speech recognition is developed in-house and another company's speech recognition server is used), the server-side and client-side recognition scores cannot be compared accurately, and a highly accurate speech recognition result cannot be selected.
- The present invention has been made to solve the above problems. An object of the invention is to suppress the delay time from the input of speech to the acquisition of the speech recognition result, and to select a highly accurate speech recognition result even when information other than the speech recognition result itself, such as the recognition score or part-of-speech information transmitted by the server side, cannot be used.
- A speech recognition system according to the invention includes a server device and a speech recognition device. The server device includes a server-side receiving unit that receives speech data input from the speech recognition device, a server-side speech recognition unit that performs speech recognition on the speech data received by the server-side receiving unit and generates server-side speech recognition result candidates, and a server-side transmission unit that transmits the server-side speech recognition result candidates generated by the server-side speech recognition unit to the speech recognition device.
- The speech recognition device includes a voice input unit that converts input uttered speech into speech data, a client-side speech recognition unit that performs speech recognition on the speech data converted by the voice input unit and generates a client-side speech recognition result candidate, a client-side transmission unit that transmits the speech data converted by the voice input unit to the server device, a client-side receiving unit that receives the server-side speech recognition result candidates, a recognition result candidate comparison unit that compares the plurality of server-side speech recognition result candidates received by the client-side receiving unit and detects differing texts, a recognition result integration unit that integrates the client-side speech recognition result candidate and the server-side speech recognition result candidates based on the detection result of the recognition result candidate comparison unit and determines the speech recognition result, and an output unit that outputs the speech recognition result determined by the recognition result integration unit.
- According to the present invention, it is possible to suppress the delay time from when speech is input until the speech recognition result is acquired, and to select a highly accurate speech recognition result.
- FIG. 1 is a block diagram illustrating the configuration of the speech recognition system according to Embodiment 1.
- FIG. 2 is a flowchart showing the operation of the speech recognition system according to Embodiment 1.
- FIG. 3 is a diagram showing a generation example of a speech recognition result of the speech recognition system according to Embodiment 1.
- FIG. 4 is a block diagram showing the configuration of the speech recognition system according to Embodiment 2.
- FIG. 5 is a flowchart showing the operation of the speech recognition system according to Embodiment 2.
- FIG. 6 is a diagram showing a generation example of a speech recognition result of the speech recognition system according to Embodiment 2.
- FIG. 7 is a diagram showing a pattern storage example of utterance rules of the speech recognition system according to Embodiment 2.
- A block diagram illustrates the configuration of the speech recognition system according to Embodiment 3.
- A flowchart shows the first and third operations of the speech recognition system according to Embodiment 3.
- A diagram illustrates an accumulation example of the input speech/recognition result storage unit of the speech recognition system according to Embodiment 3.
- A flowchart shows a second operation of the speech recognition system according to Embodiment 3, and a diagram shows its correction database.
- A flowchart shows the operation of the speech recognition system according to Embodiment 4, and a diagram shows a generation example of its speech recognition result.
- A diagram illustrates a pattern storage example of utterance rules of the speech recognition system according to Embodiment 6.
- A diagram illustrates an accumulation example of the input speech/recognition result storage unit of the speech recognition system according to Embodiment 7, and an example of its correction database.
- A diagram illustrates a generation example of a speech recognition result of the speech recognition system according to Embodiment 8.
- A diagram illustrates a pattern storage example of utterance rules of the speech recognition system according to Embodiment 8.
- FIG. 1 is a block diagram showing a configuration of a speech recognition system according to Embodiment 1 of the present invention.
- the speech recognition system includes a speech recognition server (server device) 100 and a speech recognition device 200.
- the speech recognition server 100 includes a reception unit (server side reception unit) 101, a server side speech recognition unit 102, and a transmission unit (server side transmission unit) 103.
- The speech recognition server 100 has a function of performing speech recognition on speech data received from the speech recognition apparatus 200 and transmitting the speech recognition result to the speech recognition apparatus 200.
- the receiving unit 101 receives voice data from the voice recognition device 200.
- the server-side voice recognition unit 102 recognizes the voice data received by the receiving unit 101 and generates server-side voice recognition result candidates.
- the transmission unit 103 transmits the server-side speech recognition result candidate generated by the server-side speech recognition unit 102 to the speech recognition apparatus 200.
- The speech recognition apparatus 200 includes a voice input unit 201, a client-side speech recognition unit 202, a transmission unit (client-side transmission unit) 203, a receiving unit (client-side receiving unit) 204, a recognition result candidate comparison unit 205, a recognition result integration unit 206, and an output unit 207, and has a function of performing speech recognition on speech data input via a microphone and outputting the speech recognition result.
- the voice input unit 201 converts a user's uttered voice input via a microphone or the like into voice data that is a data signal.
- the client-side voice recognition unit 202 recognizes the voice data converted by the voice input unit 201 and generates a client-side voice recognition result candidate.
- the transmission unit 203 transmits the voice data input from the voice input unit 201 to the voice recognition server 100.
- the receiving unit 204 receives the server-side speech recognition result candidate transmitted from the speech recognition server 100.
- the recognition result candidate comparison unit 205 compares text information included in a plurality of server side speech recognition result candidates transmitted from the speech recognition server 100 via the receiving unit 204, and detects a partial text having a difference.
- The recognition result integration unit 206 integrates the speech recognition result candidates based on the client-side speech recognition result candidate generated by the client-side speech recognition unit 202, the server-side speech recognition result candidate received by the receiving unit 204, and the detection result of the recognition result candidate comparison unit 205, and determines the speech recognition result.
- the output unit 207 outputs the voice recognition result determined by the recognition result integration unit 206 to an output device such as a monitor or a speaker.
- FIG. 2 is a flowchart showing the operation of the speech recognition system according to the first embodiment of the present invention, and FIG. 3 is a diagram showing a generation example of a speech recognition result of the speech recognition system according to the first embodiment of the present invention.
- the receiving unit 101 receives the voice data transmitted in step ST3, and outputs the received voice data to the server side voice recognition unit 102 (step ST4).
- the server-side voice recognition unit 102 performs voice recognition on the voice data input in step ST4 and generates server-side voice recognition result candidates (step ST5).
- the transmitting unit 103 transmits the server-side speech recognition result candidate text information generated in step ST5 to the speech recognition apparatus 200 (step ST6).
- Since the server-side speech recognition unit 102 recognizes arbitrary sentences, when it performs speech recognition on the speech data "Set destination to Ofuna clock specialty store" received from the speech recognition apparatus 200, it acquires, as shown in FIG. 3, a server-side speech recognition result candidate list 303 that includes the server-side speech recognition result candidate 301 "Set destination to Ofunato light specialty store" and the server-side speech recognition result candidate 302 "Set destination to rich watch shop".
- The transmission unit 103 transmits the server-side speech recognition result candidate list 303 to the speech recognition apparatus 200.
- The client-side speech recognition unit 202 performs speech recognition on the speech data input in step ST2, generates a client-side speech recognition result candidate, and outputs the obtained candidate to the recognition result integration unit 206 (step ST7).
- Since the client-side speech recognition unit 202 recognizes only voice operation commands and place-name information near the current location, when the user utters "Set destination to Ofuna clock specialty store", the client-side speech recognition unit 202 recognizes the voice operation command "destination" and the nearby place name "Ofuna clock specialty store", and acquires the client-side speech recognition result candidate list 305 including the client-side speech recognition result candidate 304 "Set destination to Ofuna clock specialty store". In the example of FIG. 3, the client-side speech recognition result candidate list 305 contains only this one candidate.
- The receiving unit 204 of the speech recognition apparatus 200 outputs the received server-side speech recognition result candidates to the recognition result candidate comparison unit 205 and the recognition result integration unit 206 (step ST8).
- the recognition result candidate comparison unit 205 determines whether or not the server side speech recognition result candidate input in step ST8 includes a plurality of speech recognition result candidates (step ST9).
- the recognition result candidate comparison unit 205 further compares the texts of the respective speech recognition result candidates and detects a partial text having a difference (step ST10).
- the recognition result candidate comparison unit 205 determines whether or not a partial text with a difference is detected (step ST11).
- When a differing partial text is detected (step ST11; YES), the recognition result candidate comparison unit 205 outputs the detection result, including the differing partial text, to the recognition result integration unit 206 (step ST12).
- For example, in FIG. 3, the server-side speech recognition result candidate list 303 includes the two server-side speech recognition result candidates 301 and 302; comparing their texts detects the differing partial texts "Ofunato light" and "rich watch".
- When no difference is detected (step ST11; NO), the recognition result candidate comparison unit 205 outputs a detection result indicating that no difference was detected to the recognition result integration unit 206 (step ST13). For example, in FIG. 3, if the server-side speech recognition result candidate list 303 contained only the server-side speech recognition result candidate 301, no differing partial text would be detected.
- The recognition result integration unit 206 refers to the detection result input in step ST12 or step ST13 and determines whether there is a differing partial text (step ST14). If there is (step ST14; YES), the recognition result integration unit 206 replaces the text of the differing partial text with the text of the client-side speech recognition result candidate generated in step ST7 and takes the result as the speech recognition result (step ST15). The speech recognition result is then output to the output unit 207 (step ST16).
- If the surrounding texts are not found in the client-side speech recognition result candidate, the partial texts used for the search are shortened, for example to "destination" and "specialty store", and the search is performed again using the shortened partial texts.
- In the example of FIG. 3, the differing partial texts "Ofunato light" and "rich watch" are enclosed by the leading text "Set destination to" and the trailing text "specialty store" in the server-side speech recognition result candidates 301 and 302, and these surrounding texts are used to locate the replacement text in the client-side speech recognition result candidate.
- When there is no differing partial text (step ST14; NO), the recognition result integration unit 206 takes the server-side speech recognition result candidate received by the receiving unit 204 in step ST8 as the speech recognition result (step ST17) and outputs it to the output unit 207 (step ST16). The speech recognition system repeats the above processing continuously.
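The comparison-and-replacement flow of steps ST9 through ST17 can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the helper names (`differing_span`, `integrate`) are invented, and English stand-in strings replace the Japanese utterances of FIG. 3.

```python
import os

def differing_span(candidates):
    # Recognition result candidate comparison unit 205 (steps ST10-ST12):
    # keep the text shared by all candidates and isolate the differing
    # partial texts between the common prefix and common suffix.
    prefix = os.path.commonprefix(candidates)
    suffix = os.path.commonprefix([c[::-1] for c in candidates])[::-1]
    middles = [c[len(prefix):len(c) - len(suffix)] for c in candidates]
    return prefix, middles, suffix

def integrate(server_candidates, client_partial_text):
    # Recognition result integration unit 206 (steps ST14-ST17): when the
    # server-side candidates differ, splice the client-side partial text
    # into the differing span; otherwise adopt the server result as-is.
    if len(set(server_candidates)) < 2:
        return server_candidates[0]
    prefix, _, suffix = differing_span(server_candidates)
    return prefix + client_partial_text + suffix

server = ["set destination to Ofunato light specialty store",
          "set destination to rich watch specialty store"]
print(integrate(server, "Ofuna clock"))
```

Note that no recognition score appears anywhere: only the candidate texts themselves are compared, which is the point of the embodiment.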
- As described above, according to Embodiment 1, the texts of the server-side speech recognition result candidates are compared with each other, a differing partial text is detected, and the detected partial text is replaced with the corresponding partial text of the client-side speech recognition result candidate generated by the speech recognition apparatus 200 to obtain the final speech recognition result. Therefore, even when using a speech recognition server whose recognition score calculation method is unknown, the server-side and client-side speech recognition result candidates can be integrated without using the recognition score, and a highly accurate speech recognition result can be output.
- In addition, since the apparatus includes the recognition result candidate comparison unit 205, which detects differing partial texts simply by comparing the texts of the server-side speech recognition result candidates, and the recognition result integration unit 206, which replaces the differing partial texts, no complicated parsing or recalculation of recognition scores is required, and the functions of the speech recognition apparatus can be realized with a low CPU processing load.
- Furthermore, since the speech recognition apparatus 200 transmits the speech data to the speech recognition server 100 at the same time as the speech data is input to the client-side speech recognition unit 202, the speech recognition result can be acquired from the speech recognition server 100 earlier, and the delay time until the speech recognition result is determined and output can be reduced.
- In Embodiment 1 described above, when a plurality of server-side speech recognition result candidates are acquired from the speech recognition server 100, their texts are compared with each other to detect differing partial texts, and the partial text is replaced based simply on whether a difference exists. However, the number of differing server-side speech recognition result candidates and the type of difference may also be used as decision criteria. For example, when there are three server-side speech recognition result candidates and the differing partial text is different in all three, the reliability is determined to be 1/3; when the partial text differs in only one candidate, the reliability is determined to be 2/3. Only partial texts whose reliability is 1/3 or less are replaced with the text of the client-side speech recognition result candidate from the client-side speech recognition unit 202. This improves the accuracy of speech recognition and yields a more accurate speech recognition result.
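The 1/3 and 2/3 reliability values in the variation above follow directly from majority agreement among the candidates. A short sketch under the same assumption (the function names are invented):

```python
from collections import Counter
from fractions import Fraction

def span_reliability(partial_texts):
    # Fraction of candidates that agree on the most common partial text:
    # with three candidates, a three-way disagreement gives 1/3 and a
    # two-vs-one split gives 2/3, matching the values described above.
    most_common_count = Counter(partial_texts).most_common(1)[0][1]
    return Fraction(most_common_count, len(partial_texts))

def should_replace(partial_texts, threshold=Fraction(1, 3)):
    # Replace with the client-side candidate text only when the
    # reliability of the differing span is 1/3 or less.
    return span_reliability(partial_texts) <= threshold
```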
- Also, in Embodiment 1 described above, the texts of the server-side speech recognition result candidates are compared and only one differing partial text is detected. When there are multiple differing partial texts, however, it may be determined that the reliability of the entire server-side speech recognition result candidate is low, and the user may be asked to input the speech again. This prevents an incorrect speech recognition result from being output.
- Further, in Embodiment 1 described above, when a plurality of server-side speech recognition result candidates are acquired from the speech recognition server 100, the differing portion of the server-side candidate text is replaced with the client-side candidate text. Alternatively, the client-side speech recognition unit 202 may calculate a recognition score, and the text may be replaced only when the calculated score is equal to or greater than a preset threshold. This improves the accuracy of speech recognition and allows a more accurate speech recognition result to be output.
- In Embodiment 1, the server-side speech recognition result candidate texts are compared with each other and the differing partial text is replaced with the client-side speech recognition result candidate. Embodiment 2 shows a configuration in which the text is divided at the differing partial text and recombined based on the client-side speech recognition result candidate.
- FIG. 4 is a block diagram showing the configuration of the speech recognition system according to the second embodiment of the present invention.
- The speech recognition system of the second embodiment is also composed of the speech recognition server 100 and a speech recognition apparatus 200′.
- In addition to the configuration of the speech recognition apparatus 200 shown in FIG. 1, the speech recognition apparatus 200′ of the second embodiment includes an input rule determination unit 211 and an input rule storage unit 212.
- Parts that are the same as or correspond to the components of the speech recognition system according to the first embodiment are denoted by the same reference numerals as in FIG. 1, and their description is omitted or simplified.
- the input rule determination unit 211 extracts a keyword from the client side speech recognition result candidate generated by the client side speech recognition unit 202 and determines the utterance rule of the input speech.
- the input rule storage unit 212 is a database that stores patterns of utterance rules for input speech.
- The recognition result integration unit 206′ integrates the speech recognition result candidates based on the client-side speech recognition result candidate generated by the client-side speech recognition unit 202, the server-side speech recognition result candidate received by the receiving unit 204, the detection result of the recognition result candidate comparison unit 205, and the utterance rule determined by the input rule determination unit 211, and determines the speech recognition result.
- FIG. 5 is a flowchart showing the operation of the speech recognition system according to the second embodiment of the present invention, FIG. 6 is a diagram showing a generation example of the speech recognition result of the speech recognition system according to the second embodiment, and FIG. 7 is a diagram showing a pattern storage example of the utterance rules of the speech recognition system according to the second embodiment.
- The same steps as those in the speech recognition system according to the first embodiment are denoted by the same reference numerals as in FIG. 2, and their description is omitted or simplified.
- the speech recognition apparatus 200 ′ performs the processing of steps ST1, ST2, and ST7, and performs speech recognition on the input speech data.
- Since the client-side speech recognition unit 202 recognizes only voice operation commands, in the example shown in FIG. 6, speech recognition is performed on the speech data "Mail, arrival is delayed due to traffic jam" input by the user, and a client-side speech recognition result candidate list 405 consisting of the single client-side speech recognition result candidate 404 "Mail" is acquired.
- the acquired client side speech recognition result candidate is output to the recognition result integration unit 206 ′ and the input rule determination unit 211.
- The input rule determination unit 211 matches voice operation commands by referring to the client-side speech recognition result candidate input from the client-side speech recognition unit 202 and the utterance rule patterns stored in the input rule storage unit 212.
- As shown in FIG. 7, each utterance rule pattern 500 stored in the input rule storage unit 212 consists of a voice operation command 501 and an utterance rule 502 of the input speech; for example, for the voice operation command 501 "Mail", the utterance rule 502 of the input speech is "command (mail) + free sentence".
- When the client-side speech recognition result candidate is "Mail", the input rule determination unit 211 acquires "command + free sentence", the utterance rule 502 corresponding to the matching voice operation command 501 "Mail", and outputs the acquired utterance rule to the recognition result integration unit 206′ (step ST21).
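The command-to-rule lookup performed by the input rule determination unit 211 can be pictured as a simple table match. The table contents below are invented stand-ins for the utterance rule pattern 500; only the "mail" entry is suggested by the text above.

```python
# Hypothetical contents of the input rule storage unit 212; FIG. 7 shows
# such a table, but these concrete entries are invented for the sketch.
UTTERANCE_RULES = {
    "mail": "command + free sentence",
    "destination": "command + place name",
}

def determine_utterance_rule(client_candidate):
    # Input rule determination unit 211: look for a stored voice-operation
    # command in the client-side candidate text and return its rule.
    for command, rule in UTTERANCE_RULES.items():
        if command in client_candidate:
            return command, rule
    return None, None
```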
- the speech recognition server 100 performs the same processing as step ST4 to step ST6, and transmits the obtained server-side speech recognition result candidate to the speech recognition apparatus 200 ′.
- Since the server-side speech recognition unit 102 recognizes arbitrary sentences, speech recognition is performed on the received speech data "Mail, arrival is delayed due to traffic jam", and the server-side speech recognition result candidate 401 "Disappear, arrival is delayed due to traffic jam" and the server-side speech recognition result candidate 402 "Visible, arrival is delayed due to traffic jam" are acquired.
- The two acquired server-side speech recognition result candidates 401 and 402 are output to the speech recognition apparatus 200′ as the server-side speech recognition result candidate list 403.
- The speech recognition apparatus 200′ performs the processing from step ST8 to step ST13. Hereinafter, the example of FIG. 6 is used for description.
- The recognition result candidate comparison unit 205 compares the server-side speech recognition result candidate 401 "Disappear, arrival is delayed due to traffic jam" and the server-side speech recognition result candidate 402 "Visible, arrival is delayed due to traffic jam" in the server-side speech recognition result candidate list 403, detects "Disappear" and "Visible" as the differing partial texts, and outputs the detection result to the recognition result integration unit 206′.
- The recognition result integration unit 206′ determines whether the text of the server-side speech recognition result candidates needs to be divided, based on the client-side speech recognition result candidate generated by the client-side speech recognition unit 202 in step ST7, the utterance rule determined by the input rule determination unit 211 in step ST21, the server-side speech recognition result candidates received by the receiving unit 204 in step ST8, and the difference detection result input from the recognition result candidate comparison unit 205 in step ST12 or step ST13 (step ST22).
- In the examples of FIGS. 6 and 7, the client-side speech recognition result candidate 404 "Mail" is input from the client-side speech recognition unit 202, and the server-side speech recognition result candidate list 403 consisting of the server-side speech recognition result candidates 401 and 402 is input from the receiving unit 204. Since the texts of candidates 401 and 402 do not contain "Mail", the utterance rule input from the input rule determination unit 211 is "command + free sentence", and a detection result indicating that a difference was detected is input from the recognition result candidate comparison unit 205, it is determined that the text needs to be divided.
- When it is determined that text division is necessary, the recognition result integration unit 206′ divides the text of the server-side speech recognition result candidates received by the receiving unit 204, using the differing partial text as the dividing point (step ST23).
- In the example of FIG. 6, "Disappear" is detected as the differing partial text of the server-side speech recognition result candidate 401, so its text is divided into "Disappear" and "arrival is delayed due to traffic jam".
- Next, based on the utterance rule input from the input rule determination unit 211, the recognition result integration unit 206′ combines the text divided in step ST23 with the voice operation command corresponding to the client-side speech recognition result candidate, and outputs the result to the output unit 207 as the speech recognition result (step ST24).
- In the example shown in FIG. 6, based on the utterance rule "command + free sentence", the voice operation command "Mail" and the divided text corresponding to the free sentence, "arrival is delayed due to traffic jam", are combined to obtain the speech recognition result "Mail, arrival is delayed due to traffic jam".
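Steps ST23 and ST24 — dividing the server-side text at the differing partial text and recombining it with the voice operation command under the "command + free sentence" rule — might be sketched as follows. The function names and English example strings are invented (the FIG. 6 utterances are Japanese).

```python
def split_at_difference(candidate, diff_text):
    # Step ST23: divide the candidate text using the differing partial
    # text as the dividing point.
    head, _, tail = candidate.partition(diff_text)
    parts = [head.strip(), diff_text, tail.strip()]
    return [p for p in parts if p]

def combine(rule, command, parts):
    # Step ST24: combine the divided text with the voice-operation
    # command according to the utterance rule.
    if rule == "command + free sentence":
        return command + ", " + parts[-1]  # keep only the free-sentence part
    return " ".join(parts)

# English stand-ins for the FIG. 6 example (the original is Japanese).
parts = split_at_difference("disappear arrival is delayed due to traffic jam",
                            "disappear")
print(combine("command + free sentence", "mail", parts))
```

The misrecognized command span ("disappear") is discarded entirely; only the free-sentence portion of the server result survives into the final output.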
- On the other hand, when it is determined that text division is unnecessary, the recognition result integration unit 206′ uses the server-side speech recognition result candidate received in step ST8 as the speech recognition result (step ST25).
- the voice recognition result is output to the output unit 207 (step ST16).
- Note that the recognition result integration unit 206′ determines that text division is unnecessary when the client-side speech recognition result candidate text input from the client-side speech recognition unit 202 is included in the server-side speech recognition result candidates received by the receiving unit 204, when the utterance rule input from the input rule determination unit 211 is "command only", or when the detection result input from the recognition result candidate comparison unit 205 indicates that no difference was detected.
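The three division-unnecessary conditions just listed amount to a small predicate for step ST22. A sketch with an invented function name:

```python
def needs_division(client_text, server_candidates, rule, difference_detected):
    # Step ST22: division is unnecessary when the client-side text already
    # appears in a server-side candidate, when the utterance rule is
    # "command only", or when no differing partial text was detected.
    if any(client_text in c for c in server_candidates):
        return False
    if rule == "command only":
        return False
    if not difference_detected:
        return False
    return True
```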
- As described above, according to the second embodiment, the server-side speech recognition result candidate texts are compared with each other to detect differing partial texts, the text is divided on the basis of the differing partial text, and the divided text is combined with the client-side speech recognition result candidate text on the basis of the utterance rule. Therefore, even when a speech recognition server whose method of calculating the numerical value indicating the accuracy of the speech recognition result (recognition score) is unknown is used, the server-side and client-side speech recognition result candidates can be integrated without using the recognition score, and a more accurate speech recognition result can be output.
- In addition, since the text is divided on the basis of the differing partial text and the divided text is combined with the client-side speech recognition result candidate text, even when the speech recognition server cannot recognize the voice operation command with high accuracy, only the free-sentence partial text can be used without using the text corresponding to the voice operation command, and a more accurate speech recognition result can be output.
- Furthermore, the configuration includes the recognition result candidate comparison unit 205, which detects differing partial texts by comparing the server-side speech recognition result candidate texts without performing complicated parsing or recalculation of the recognition score, and the recognition result integration unit 206′, which divides the text on the basis of the differing partial texts and combines it with the client-side speech recognition result candidate text; therefore, the function of the speech recognition device can be realized while suppressing the processing load on the CPU.
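The pairwise comparison performed by the recognition result candidate comparison unit 205 can be sketched as a common-prefix/common-suffix scan, which needs neither parsing nor a recognition score. The character-level granularity and the function name `find_difference` are illustrative assumptions; the embodiment does not prescribe a particular comparison algorithm.

```python
def find_difference(text_a: str, text_b: str):
    """Detect the differing partial texts of two candidate strings by
    stripping their common head and tail."""
    # Longest common prefix.
    p = 0
    while p < min(len(text_a), len(text_b)) and text_a[p] == text_b[p]:
        p += 1
    # Longest common suffix of the remainders (must not overlap the prefix).
    s = 0
    while (s < min(len(text_a), len(text_b)) - p
           and text_a[len(text_a) - 1 - s] == text_b[len(text_b) - 1 - s]):
        s += 1
    return text_a[p:len(text_a) - s], text_b[p:len(text_b) - s]

# Example modeled on the English candidates of FIG. 16.
diff = find_difference("SEND S AND S TO JOHN TAKE CARE YOURSELF",
                       "SEND S AND ASKED JOHN TAKE CARE YOURSELF")
print(diff)  # ('S TO', 'ASKED')
```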
- Since the configuration detects low-reliability portions by comparing the recognition result texts, the amount of calculation is suppressed without performing complicated syntax analysis, and the function of the speech recognition device 200′ can be realized using a CPU with low arithmetic performance.
- Furthermore, since the speech recognition device 200′ transmits the voice data to the speech recognition server 100 at the same time as inputting it to the client-side speech recognition unit 202, the speech recognition result can be acquired from the speech recognition server 100 earlier, and the delay time until the result is finalized can be shortened.
- In the second embodiment, “command only”, “command + free sentence”, and “command + place name” are given as examples of utterance rule patterns, but the position of the voice operation command may be limited to only the beginning or the end of the utterance as an utterance rule. In this case, if a difference exists in a portion other than the beginning or end of the server-side speech recognition result candidates, it is determined that a recognition error has occurred in a portion other than the voice operation command, and the user can be requested to input the voice again. This suppresses the output of an incorrect speech recognition result.
- In the second embodiment, the configuration in which the input rule storage unit 212 is provided in the speech recognition device 200′ is shown; however, the configuration may be such that an utterance rule pattern stored externally is acquired.
- Embodiment 3. In the second embodiment described above, the server-side speech recognition result candidate texts are compared with each other, and the server-side speech recognition result candidate text is divided on the basis of the differing partial texts. The third embodiment shows a configuration in which a change in the server-side speech recognition result candidates is detected so that the text can always be divided.
- FIG. 8 is a block diagram showing the configuration of the speech recognition system according to the third embodiment of the present invention.
- The speech recognition system of the third embodiment is also configured by the speech recognition server 100 and the speech recognition device 200″.
- The speech recognition device 200″ according to Embodiment 3 is obtained by adding a recognition result candidate correction unit 221 and an input speech/recognition result storage unit 222 to the speech recognition device 200′ illustrated in FIG. 4, and by deleting the recognition result candidate comparison unit 205.
- Hereinafter, the same or corresponding parts as those of the speech recognition systems according to the first and second embodiments are denoted by the same reference numerals as those used in FIG. 1 or FIG. 4, and their description is omitted or simplified.
- The recognition result candidate correction unit 221 automatically transmits voice data to the speech recognition server 100 when the speech recognition device 200″ is activated, and creates a correction database 221a for correcting voice operation commands on the basis of the speech recognition results received from the speech recognition server 100.
- the input voice / recognition result storage unit 222 is a buffer that stores the voice data converted by the voice input unit 201 and the voice recognition result generated by the recognition result integration unit 206 ′′ in association with each other.
- the recognition result integration unit 206 ′′ integrates the server-side speech recognition result candidate and the client-side speech recognition result candidate using the correction database 221a created by the recognition result candidate correction unit 221.
- The operation of the speech recognition system according to the third embodiment is divided into three parts: a first operation performed when speech is input while no data is stored in the input speech/recognition result storage unit 222; a second operation performed when the speech recognition device 200″ is activated, in which the correction database 221a is created; and a third operation performed when speech is input while data is accumulated in the input speech/recognition result storage unit 222 and the correction database 221a has been created. In the following, the same steps as those in the speech recognition systems according to the first or second embodiment are denoted by the same reference numerals as those used in FIG. 2 or FIG. 5, and their description is omitted or simplified.
- FIG. 9 is a flowchart showing the first and third operations of the speech recognition system according to the third embodiment of the present invention, and FIG. 10 is a diagram showing an accumulation example of the input speech/recognition result storage unit.
- First, the input speech/recognition result storage unit 222 accumulates the voice data input in step ST2′ as, for example, “voice data (1)” in the format shown in FIG. 10 (step ST31).
- As shown in FIG. 10, the input voice information 600 is configured by associating a voice operation command 601 with voice data 602.
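The association of FIG. 10 can be sketched as a simple record type. The class and field names below are hypothetical, and a string stands in for the raw audio.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InputVoiceInfo:
    """One entry of the input voice information 600: voice data 602
    associated with a voice operation command 601."""
    voice_data: str                      # placeholder for the raw audio
    voice_command: Optional[str] = None  # filled in once recognized (step ST37)

# Buffer held by the input speech/recognition result storage unit 222.
storage = [InputVoiceInfo(voice_data="voice data (1)")]
# After client-side recognition, the result "mail" is stored as the command.
storage[0].voice_command = "mail"
```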
- Next, the speech recognition server 100 and the speech recognition device 200″ perform the same processing as in steps ST3 to ST7 and step ST21, as in the second embodiment.
- The reception unit 204 of the speech recognition device 200″ receives the server-side speech recognition result candidate transmitted from the speech recognition server 100 in step ST6, and outputs the received server-side speech recognition result candidate to the recognition result candidate correction unit 221 and the recognition result integration unit 206″ (step ST8′).
- Next, the recognition result candidate correction unit 221 collates the server-side speech recognition result candidate text input in step ST8′ with the correction database 221a (step ST32). In the first operation, no data is stored in the input speech/recognition result storage unit 222, so the correction database 221a has not yet been created. The recognition result candidate correction unit 221 therefore outputs a collation result indicating that there is no correction candidate to the recognition result integration unit 206″ (step ST33).
- The recognition result integration unit 206″ determines whether the server-side speech recognition result candidate can be divided into texts, from the client-side speech recognition result candidate generated by the client-side speech recognition unit 202 in step ST7, the utterance rule determined by the input rule determination unit 211 in step ST21, the server-side speech recognition result candidate received by the reception unit 204 in step ST8′, and the collation result acquired from the recognition result candidate correction unit 221 in step ST33 (step ST34).
- For example, when the client-side speech recognition result candidate 404 “mail” shown in FIG. 6 is input as the client-side speech recognition result candidate of the client-side speech recognition unit 202 and the server-side speech recognition result list 403 shown in FIG. 6 is input from the reception unit 204, “mail” is not included in the texts of the server-side speech recognition result candidates 401 and 402 included in the server-side speech recognition result list 403. In addition, the utterance rule input from the input rule determination unit 211 is “command + free sentence”, and the collation result indicating that there is no correction candidate is input from the recognition result candidate correction unit 221. As a result, the recognition result integration unit 206″ determines that the text cannot be divided. On the other hand, when the client-side speech recognition result candidate text input from the client-side speech recognition unit 202 is included in the server-side speech recognition result candidate input from the reception unit 204, it determines that the text can be divided.
- When it is determined that the text can be divided (step ST34; YES), the recognition result integration unit 206″ divides the server-side speech recognition result candidate text received by the reception unit 204 on the basis of the client-side speech recognition result candidate text input from the client-side speech recognition unit 202 (step ST35). Next, the recognition result integration unit 206″ combines the text divided in step ST35 with the voice operation command corresponding to the client-side speech recognition result candidate on the basis of the utterance rule input from the input rule determination unit 211 to obtain the speech recognition result (step ST24), and outputs the speech recognition result to the output unit 207 (step ST16).
- When it is determined that the text cannot be divided (step ST34; NO), the recognition result integration unit 206″ uses the client-side speech recognition result candidate acquired in step ST7 as the speech recognition result (step ST36), and stores the speech recognition result in the input speech/recognition result storage unit 222 (step ST37).
- In the example of FIG. 10, the speech recognition result “mail” input from the client-side speech recognition unit 202 is stored as the voice operation command 601 corresponding to “voice data (1)” of the voice data 602.
- the above is the first operation of the speech recognition system according to the third exemplary embodiment.
- FIG. 11 is a flowchart showing the second operation of the speech recognition system according to the third embodiment of the present invention, and FIG. 12 is a diagram showing an example of the correction database of the speech recognition device of the speech recognition system according to the third embodiment of the present invention.
- When the speech recognition device 200″ is activated, the recognition result candidate correction unit 221 refers to the input speech/recognition result storage unit 222 and determines whether voice data is accumulated (step ST41). If voice data is not accumulated (step ST41; NO), the processing ends.
- When voice data is accumulated (step ST41; YES), the recognition result candidate correction unit 221 acquires the voice data accumulated in the input speech/recognition result storage unit 222 (step ST42), and transmits the acquired voice data to the speech recognition server 100 via the transmission unit 203 (step ST43).
- In the speech recognition server 100, the same processing as in steps ST4 to ST6 of the first embodiment described above is performed: speech recognition is performed on the transmitted voice data, and the server-side speech recognition result candidate is transmitted to the speech recognition device 200″ side.
- The reception unit 204 of the speech recognition device 200″ receives the server-side speech recognition result candidate transmitted from the speech recognition server 100 in step ST6, and outputs the received server-side speech recognition result candidate to the recognition result candidate correction unit 221 (step ST8″).
- The recognition result candidate correction unit 221 determines whether the server-side speech recognition result candidate input in step ST8″ matches the voice operation command stored in the input speech/recognition result storage unit 222 (step ST44). When the server-side speech recognition result candidate matches the voice operation command (step ST44; YES), the processing proceeds to step ST46.
- When the server-side speech recognition result candidate and the voice operation command do not match (step ST44; NO), information associating the voice operation command with the server-side speech recognition result candidate as a correction candidate is added to the correction database 221a (step ST45).
- In the example of FIG. 12, when the voice operation command 701 stored in the input speech/recognition result storage unit 222 is “mail” and the correction candidates 702, which are server-side speech recognition result candidates, are “defeated” and “visible”, information associating each of them with the voice operation command is added to the correction database 221a as correction data 700.
- Thereafter, the recognition result candidate correction unit 221 refers to the voice data stored in the input speech/recognition result storage unit 222 and determines whether processing has been performed for all the voice data (step ST46). If processing has been performed for all the voice data (step ST46; YES), the processing ends. On the other hand, if processing has not been performed for all the voice data (step ST46; NO), the processing returns to step ST42 and the above processing is repeated.
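The second operation of FIG. 11 can be sketched as the loop below. The function names, the dictionary representation of the correction database 221a, and the server stub are illustrative assumptions; the stub merely reproduces the FIG. 12 example, in which the command “mail” is misrecognized by the server as “defeated” or “visible”.

```python
def build_correction_database(storage, recognize_on_server):
    """Second operation (FIG. 11): send each stored voice data item to the
    server (steps ST42/ST43) and, when the server-side result differs from
    the stored voice operation command (step ST44; NO), record the server
    result as a correction candidate (step ST45)."""
    correction_db = {}  # maps correction candidate -> voice operation command
    for command, voice_data in storage:
        candidate = recognize_on_server(voice_data)  # steps ST4-ST6 on the server
        if candidate != command:
            correction_db[candidate] = command
    return correction_db

# Hypothetical server stub reproducing the FIG. 12 example.
def fake_server(voice_data):
    return {"voice data (1)": "defeated", "voice data (2)": "visible"}[voice_data]

db = build_correction_database(
    [("mail", "voice data (1)"), ("mail", "voice data (2)")], fake_server)
print(db)  # {'defeated': 'mail', 'visible': 'mail'}
```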
- the above is the second operation of the speech recognition system according to the third exemplary embodiment.
- In the third operation, the recognition result candidate correction unit 221 collates the server-side speech recognition result candidate text received in step ST8′ with the correction database 221a (step ST32).
- When the server-side speech recognition result candidate list 403 shown in FIG. 6 is input as the server-side speech recognition result candidates, the text of the server-side speech recognition result candidate 401 is collated with the correction candidates 702 of the correction data 700 constituting the correction database 221a shown in FIG. 12.
- When it is detected that the correction candidate “disappears” registered in the correction database 221a is included in the text of the server-side speech recognition result candidate 401, the correction candidate “disappears” and the corresponding voice operation command “mail” are output to the recognition result integration unit 206″ as the collation result (step ST33).
- In step ST34, the recognition result integration unit 206″ determines whether the server-side speech recognition result candidate can be divided into texts, from the client-side speech recognition result candidate generated by the client-side speech recognition unit 202 in step ST7, the utterance rule determined by the input rule determination unit 211 in step ST21, the server-side speech recognition result candidate received by the reception unit 204 in step ST8′, and the collation result input from the recognition result candidate correction unit 221 in step ST33.
- For example, the client-side speech recognition result candidate 404 “mail” shown in FIG. 6 is input as the client-side speech recognition result candidate of the client-side speech recognition unit 202, the utterance rule determined by the input rule determination unit 211 is “command + free sentence”, and the server-side speech recognition result list 403 shown in FIG. 6 is input from the reception unit 204. Although “mail” is not included in the texts of the server-side speech recognition result candidates 401 and 402 in the server-side speech recognition result list 403, “mail” is input as the collation result from the recognition result candidate correction unit 221, so it is determined that the text can be divided (step ST34; YES).
- In step ST35, the recognition result integration unit 206″ divides the server-side speech recognition result candidate text on the basis of the correction candidate “disappears” corresponding to the collation result “mail”.
- In step ST24, the text divided in step ST35 and the voice operation command corresponding to the client-side speech recognition result candidate are combined on the basis of the utterance rule information input from the input rule determination unit 211 to obtain the speech recognition result, and the speech recognition result is output to the output unit 207.
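The third operation, i.e., collating the server-side text with the correction database and dividing at a matched correction candidate, can be sketched as follows. The function name, the dictionary form of the correction database, and the comma-joined output are illustrative assumptions.

```python
def integrate_with_corrections(server_text, correction_db, utterance_rule):
    """Third operation sketch: collate the server-side candidate text with
    the correction database (step ST32); if a correction candidate is found,
    divide the text at that point (step ST35) and combine the associated
    voice operation command with the free sentence (step ST24)."""
    for candidate, command in correction_db.items():
        if candidate in server_text and utterance_rule == "command + free sentence":
            head, _, tail = server_text.partition(candidate)
            free_sentence = (head + tail).strip()
            return f"{command}, {free_sentence}"
    return server_text  # no correction candidate: use the server result as-is

# Example modeled on FIG. 6 / FIG. 12: "mail" was misrecognized as "disappears".
result = integrate_with_corrections(
    "disappears arrival is delayed due to traffic jam",
    {"disappears": "mail"}, "command + free sentence")
print(result)  # mail, arrival is delayed due to traffic jam
```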
- As described above, according to the third embodiment, the recognition result candidate correction unit 221, which creates the correction database 221a on the basis of the server-side speech recognition result candidates acquired by transmitting voice data input in the past to the speech recognition server 100, is provided. Therefore, even when the server-side speech recognition result candidate of the speech recognition server 100 does not match the voice operation command stored in the input speech/recognition result storage unit 222, if a correction candidate corresponding to the voice operation command matches part of the server-side speech recognition result candidate from the speech recognition server 100, the text can be divided using that portion as a reference, and the divided text and the client-side speech recognition result candidate text of the speech recognition device 200″ can be integrated on the basis of the input speech utterance rule information from the input rule determination unit 211.
- Consequently, even when the speech recognition server 100 is updated and its recognition results change, the system can follow the change, and the server-side and client-side speech recognition result candidates can be integrated to output a more accurate speech recognition result.
- In addition, since the recognition result integration unit 206″ divides the text on the basis of the differing portion and integrates the divided text with the client-side speech recognition result candidate of the speech recognition device 200″ on the basis of the utterance rule information input from the input rule determination unit 211, even when the speech recognition server 100 cannot recognize the voice operation command with high accuracy, only the free-sentence portion can be used without using the portion corresponding to the voice operation command, and a more accurate speech recognition result can be output.
- Furthermore, since the recognition result candidate correction unit 221 collates the server-side speech recognition result candidate text with the correction database 221a without performing complicated parsing or recalculation of the recognition score, the function of the speech recognition device 200″ can be realized while suppressing the processing load on the CPU.
- In addition, since the server-side speech recognition result candidate text is collated with the correction database 221a to detect low-reliability portions, the amount of calculation is suppressed without performing complicated syntax analysis, and the function of the speech recognition device 200″ can be realized using a CPU with low arithmetic performance.
- Furthermore, since the speech recognition device 200″ transmits the voice data to the speech recognition server 100 at the same time as inputting it to the client-side speech recognition unit 202, the speech recognition result can be acquired from the speech recognition server 100 earlier, and the delay time until the result is finalized can be shortened.
- Embodiment 4. In the third embodiment described above, a configuration has been shown in which a change in the server-side speech recognition result candidates of the speech recognition server 100 is detected so that the text can always be divided. The fourth embodiment shows a configuration for detecting proper nouns contained in the text divided as a free sentence.
- the voice recognition system of the fourth embodiment is also configured by the voice recognition server 100 and the voice recognition device 200 ′.
- the components of the voice recognition server 100 and the voice recognition device 200 ′ according to the fourth embodiment are the same as those of the voice recognition system according to the second embodiment, and thus description thereof is omitted.
- In the fourth embodiment, when the recognition result candidate comparison unit 205 compares the server-side speech recognition result candidates and detects a plurality of differing portions, it determines whether the texts of the detected portions have the same content within each candidate. When the recognition result candidate comparison unit 205 determines that the detected portions have the same content, the recognition result integration unit 206′ replaces the text determined to have the same content with the corresponding proper noun.
- FIG. 13 is a flowchart showing the operation of the speech recognition system according to Embodiment 4 of the present invention.
- FIG. 14 is a diagram showing a generation example of a speech recognition result of the speech recognition system according to the fourth embodiment of the present invention, and FIG. 15 is a diagram showing a pattern storage example of utterance rules.
- the same steps as those in the speech recognition system according to the second embodiment are denoted by the same reference numerals as those used in FIG. 5, and description thereof is omitted or simplified.
- the speech recognition apparatus 200 ′ performs the processing of step ST1 and step ST2, and the client side speech recognition unit 202 performs speech recognition on the input speech data (step ST7).
- The client-side speech recognition unit 202 recognizes only proper nouns registered in the address book or the like and voice operation commands. In the example shown in FIG. 14, for the user utterance “Mail to Kenji-san. Today, Kenji-san and I will respond.”, the client-side speech recognition unit 202 performs speech recognition that recognizes the proper noun “Kenji” and the voice operation command “san ni mail (mail to -san)”, and acquires the client-side speech recognition result candidate 804 “Mail to Kenji”.
- the client side speech recognition result candidate list 805 is composed of one client side speech recognition result candidate 804.
- the acquired client side speech recognition result candidate is output to the recognition result integration unit 206 ′ and the input rule determination unit 211.
- Next, the input rule determination unit 211 collates the voice operation command by referring to the client-side speech recognition result candidate input from the client-side speech recognition unit 202 and the utterance rule patterns stored in the input rule storage unit 212, and determines the utterance rule of the voice data input in step ST1 (step ST21). For example, when the client-side speech recognition result candidate 804 “Mail to Kenji” shown in FIG. 14 is collated with the utterance rule pattern 900 shown in FIG. 15, the matching voice operation command 901 “san ni mail” is detected, and the corresponding input speech utterance rule 902 “proper noun + command + free sentence” is acquired. The acquired input speech utterance rule is output to the recognition result integration unit 206′.
- When the server-side speech recognition result candidate list includes a plurality of server-side speech recognition result candidates, the recognition result candidate comparison unit 205 determines that a plurality of candidates are included (step ST9; YES), compares the texts of the respective speech recognition result candidates with each other to detect partial texts having differences (step ST10), and determines whether a differing partial text has been detected (step ST11). When a differing partial text is detected (step ST11; YES), the detection result is output to the recognition result integration unit 206′ (step ST12).
- In the example of FIG. 14, the server-side speech recognition result list 803 includes the two server-side speech recognition result candidates 801 and 802, whose text information is “Mail to prosecutor. Today, the prosecutor and I will respond.” and “Mail to Kenji. Today, Kenji and I will respond.”, respectively. Two differing portions exist, and within each candidate the two portions consist of the same text (“prosecutor” in the speech recognition result candidate 801, “Kenji” in the speech recognition result candidate 802).
- Next, the recognition result integration unit 206′ determines whether the proper noun included in the free sentence can be replaced, from the client-side speech recognition result candidate generated by the client-side speech recognition unit 202 in step ST7, the utterance rule determined by the input rule determination unit 211 in step ST21, the server-side speech recognition result candidate received by the reception unit 204 in step ST8, and the difference detection result input from the recognition result candidate comparison unit 205 in step ST12 or step ST13 (step ST51). The determination as to whether the proper noun can be replaced is specifically performed as follows.
- When the client-side speech recognition result candidate 804 “Mail to Kenji” of the client-side speech recognition unit 202 is input, and the server-side speech recognition result candidate list 803 composed of the server-side speech recognition result candidates 801 and 802 is input from the reception unit 204, it is first determined whether the texts of the server-side speech recognition result candidates 801 and 802 include the voice operation command “san ni mail”.
- Next, on the basis of the utterance rule information input from the input rule determination unit 211 (in the example of FIG. 15, the input speech utterance rule “proper noun + command + free sentence” corresponding to the voice operation command “san ni mail”), each candidate text is separated, using the voice operation command text as a reference, into the text corresponding to the proper noun (in the example of FIG. 14, “prosecutor” of the server-side speech recognition result candidate 801 and “Kenji” of the server-side speech recognition result candidate 802) and the text corresponding to the free sentence (in the example of FIG. 14, “Today, the prosecutor and I will respond” of the server-side speech recognition result candidate 801 and “Today, Kenji and I will respond” of the server-side speech recognition result candidate 802).
- It is then determined whether the text corresponding to the free sentence contains a portion that matches the proper noun text (in the example of FIG. 14, it is determined that such matching portions exist: “prosecutor” in the speech recognition result candidate 801 and “Kenji” in the speech recognition result candidate 802). If a portion matching the proper noun text exists in the free sentence, it is determined that the proper noun can be replaced.
- When it is determined that the proper noun can be replaced (step ST51; YES), the proper noun included in the text divided as the free sentence is replaced with the corresponding text on the basis of the detection result input from the recognition result candidate comparison unit 205 (step ST52).
- In the example of FIG. 14, the text “prosecutor” corresponding to the proper noun contained in the free-sentence text “Today, the prosecutor and I will respond” is replaced with the proper noun text “Kenji” recognized by the client-side speech recognition unit 202, yielding “Today, Kenji and I will respond”.
- Thereafter, the recognition result integration unit 206′ determines the speech recognition result by combining the divided text with the voice operation command corresponding to the client-side speech recognition result candidate (step ST24), and outputs the confirmed speech recognition result to the output unit 207 (step ST16).
- In the example of FIG. 14, the proper noun “Kenji”, the voice operation command “san ni mail”, and the text corresponding to the free sentence “Today, Kenji and I will respond” are combined, and “Mail to Kenji. Today, Kenji and I will respond.” is output as the speech recognition result.
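The replacement of step ST52 and the subsequent combination can be sketched as follows. The function name, the plain string replacement, and the output format are illustrative assumptions; the example is modeled loosely on FIG. 14, where the server misrecognized “Kenji” as “prosecutor”.

```python
def replace_proper_noun(free_sentence, misrecognized, proper_noun, command):
    """Step ST52 sketch: replace the misrecognized proper noun inside the
    free sentence with the proper noun recognized on the client side, then
    combine under the 'proper noun + command + free sentence' rule."""
    corrected = free_sentence.replace(misrecognized, proper_noun)
    return f"{proper_noun} {command}, {corrected}"

result = replace_proper_noun("today, prosecutor and I will respond",
                             "prosecutor", "Kenji", "san ni mail")
print(result)  # Kenji san ni mail, today, Kenji and I will respond
```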
- On the other hand, when it is determined that the proper noun cannot be replaced (step ST51; NO), the recognition result integration unit 206′ uses the server-side speech recognition result candidate received in step ST8 as the speech recognition result (step ST25), and the speech recognition result is output to the output unit 207 (step ST16).
- As described above, according to the fourth embodiment, the texts of the server-side speech recognition result candidates are compared with each other to detect differences, and when the differing partial text corresponds to the recognition result of the proper noun in the client-side speech recognition result candidate and the text corresponding to the proper noun is included in the text divided as the free sentence, the proper noun text included in the free sentence is replaced with the proper noun text recognized by the client-side speech recognition unit 202. Therefore, even when part-of-speech information is not given to the server-side speech recognition result candidates, the server-side and client-side speech recognition results can be integrated with high accuracy without using part-of-speech information, and a more accurate speech recognition result can be output.
- In the fourth embodiment, the recognition result candidate correction unit 221 and the input speech/recognition result storage unit 222 described in the third embodiment may also be used. In that case, when the voice operation command is not correctly recognized in the server-side speech recognition result candidate of the speech recognition server 100, the recognition result integration unit 206′ searches the correction database 221a, and when a correction candidate is found, it may determine that the text can be divided on the basis of the corresponding voice operation command. As a result, even when the voice operation command cannot be normally recognized by the speech recognition server 100, the text can be divided with high accuracy and a more accurate speech recognition result can be output.
- Embodiment 5. In the first embodiment described above, the processing operation of the speech recognition system has been described by taking as an example the case where speech uttered by the user in Japanese is input. In the fifth embodiment, the processing operation of the speech recognition system is described by taking as an example the case where speech uttered by the user in English is input.
- Since the configuration and operation of the speech recognition system of the fifth embodiment are the same as the configuration (see FIG. 1) and the operation (see FIG. 2) shown in the first embodiment, FIG. 1 and FIG. 2 are used for the explanation.
- FIG. 16 is a diagram showing a generation example of a voice recognition result of the voice recognition system according to the fifth embodiment of the present invention.
- In step ST5, for example, the server-side speech recognition unit 102 performs speech recognition on the voice data “Send SMS to John, Take care yourself.” received from the speech recognition device 200, with an arbitrary sentence as the recognition target, and obtains the server-side speech recognition result candidate list 313 shown in FIG. 16, which includes the server-side speech recognition result candidate 311 “SEND S AND S TO JOHN TAKE CARE YOURSELF” and the server-side speech recognition result candidate 312 “SEND S AND ASKED JOHN TAKE CARE YOURSELF”.
- In step ST7, for example, the client-side speech recognition unit 202 recognizes only voice operation commands and personal name information registered in advance in the address book. When the user utters “Send SMS to John, Take care yourself.”, the client-side speech recognition unit 202 recognizes the voice operation command “SEND SMS TO” and the person name “JOHN”, and acquires the client-side speech recognition result candidate 314 “SEND SMS TO JOHN” shown in FIG. 16.
- In step ST10, the server-side speech recognition result candidate list 313 includes the two server-side speech recognition result candidates 311 and 312, and their text information “SEND S AND S TO JOHN TAKE CARE YOURSELF” and “SEND S AND ASKED JOHN TAKE CARE YOURSELF” are compared with each other; the portion between the head text “SEND S AND” and the tail text “JOHN TAKE CARE YOURSELF” is detected as the differing partial text. That is, “S TO” of the server-side speech recognition result candidate 311 and “ASKED” of the server-side speech recognition result candidate 312 are detected as the differing partial texts.
- In step ST15, in the example of FIG. 16, for the partial text “S TO” of the server-side candidate 311, which is enclosed by the head text “SEND S AND” and the tail text “JOHN TAKE CARE YOURSELF”, a search is performed to determine whether partial texts matching “SEND S AND” and “JOHN” exist in the client-side candidate 314. The candidate 314 contains “JOHN” but does not contain the partial text “SEND S AND”. In that case, the partial text to be searched for is shortened, for example to “SEND”, and the search is performed again using the shortened partial text.
- In this way, even when speech uttered in English is input to the speech recognition apparatus 200, the same effect as in Embodiment 1 can be obtained.
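The difference-detection step described above (comparing two server-side candidates, keeping the common head and tail text, and extracting the differing middle spans) can be sketched as follows. This is a minimal word-level illustration under assumed names, not the patented implementation.

```python
def find_difference(cand_a: str, cand_b: str):
    """Return (head, diff_a, diff_b, tail) for two candidate texts.

    head/tail are the common leading and trailing word sequences;
    diff_a/diff_b are the differing partial texts in between.
    """
    a, b = cand_a.split(), cand_b.split()
    # Longest common prefix (the shared head text).
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    # Longest common suffix that does not overlap the prefix (the tail text).
    j = 0
    while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
        j += 1
    head = " ".join(a[:i])
    tail = " ".join(a[len(a) - j:])
    diff_a = " ".join(a[i:len(a) - j])
    diff_b = " ".join(b[i:len(b) - j])
    return head, diff_a, diff_b, tail
```

Applied to the candidates of FIG. 16, this yields the head “SEND S AND”, the tail “JOHN TAKE CARE YOURSELF”, and the differing partial texts “S TO” and “ASKED”.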
- Embodiment 6.
- In Embodiment 2 described above, the processing operation of the speech recognition system was described taking as an example the case where speech uttered by the user in Japanese is input. In this Embodiment 6, the processing operation of the speech recognition system is described taking as an example the case where speech uttered by the user in English is input. The configuration and operation of the speech recognition system of Embodiment 6 are the same as the configuration (see FIG. 4) and operation (see FIG. 5) shown in Embodiment 2, so FIGS. 4 and 5 are used in the description.
- FIG. 17 is a diagram showing an example of generating a speech recognition result in the speech recognition system according to Embodiment 6 of the present invention, and FIG. 18 is a diagram showing an example of stored utterance-rule patterns.
- The speech recognition apparatus 200' performs the processing of steps ST1, ST2, and ST7 and performs speech recognition on the input voice data. For example, when the client-side speech recognition unit 202 recognizes only voice operation commands, it performs speech recognition on the user's voice data “Search for pictures of the golden gate bridge.” and obtains the single client-side candidate 414 “SEARCH FOR”; in the example of FIG. 17, the client-side result list 415 consists of this single candidate 414.
- The input rule determination unit 211 collates the client-side candidate input from the client-side speech recognition unit 202 against the voice operation commands of the utterance-rule patterns stored in the input rule storage unit 212, and determines the utterance rule of the voice data input in step ST1.
- In the example of FIG. 18, the utterance-rule pattern 510 stored in the input rule storage unit 212 consists of a voice operation command 511 and an utterance rule 512 of the input speech; for example, when the voice operation command 511 is “SEARCH FOR”, the utterance rule 512 obtained for the input speech is “command + keyword”. The input rule determination unit 211 thus acquires the utterance rule 512 “command + keyword” corresponding to the matching voice operation command 511 “SEARCH FOR”.
- In steps ST4 to ST6, the server-side speech recognition unit 102, which recognizes arbitrary sentences, performs speech recognition on the received voice data “Search for pictures of the golden gate bridge.” in the example of FIG. 17, and obtains the server-side candidate 411 “SYSTEM PICTURES OF THE GOLDEN GATE BRIDGE” and the server-side candidate 412 “SISTER PICTURES OF THE GOLDEN GATE BRIDGE”. The two acquired candidates 411 and 412 are output to the speech recognition apparatus 200' as the server-side speech recognition result candidate list 413.
- The example of FIG. 17 will be used in the description. In the server-side candidate list 413, the server-side candidate 411 “SYSTEM PICTURES OF THE GOLDEN GATE BRIDGE” and the server-side candidate 412 “SISTER PICTURES OF THE GOLDEN GATE BRIDGE” are compared, “SYSTEM” and “SISTER” are detected as differing partial texts, and the detection result is output to the recognition result integration unit 206'.
- The recognition result integration unit 206' determines whether text division of the server-side candidates is necessary, based on the client-side candidate generated by the client-side speech recognition unit 202 in step ST7, the utterance rule determined by the input rule determination unit 211 in step ST21, the server-side candidates received by the reception unit 204 in step ST8, and the difference detection result input from the recognition result candidate comparison unit 205 in step ST12 or ST13. When the client-side candidate 414 “SEARCH FOR” of the client-side speech recognition unit 202 is input together with the server-side candidate list 413 consisting of candidates 411 and 412 from the reception unit 204, the texts of candidates 411 and 412 do not contain “SEARCH FOR”, the utterance rule input from the input rule determination unit 211 is “command + keyword”, and a detection result indicating that a difference was detected is input from the recognition result candidate comparison unit 205; it is therefore determined that text division is necessary.
- In step ST23, the recognition result integration unit 206' divides the text of the server-side candidates received by the reception unit 204 at the differing partial text. Since “SYSTEM” is detected as the differing partial text of the server-side candidate 411, the text is divided into the two parts “SYSTEM” and “PICTURES OF THE GOLDEN GATE BRIDGE”.
- step ST24 the recognition result integration unit 206 ′, based on the utterance rule input from the input rule determination unit 211, the voice operation command corresponding to the text divided in step ST23 and the client side speech recognition result candidate. Are output to the output unit 207 as a speech recognition result.
- the recognition result integration unit 206 ′ based on the utterance rule “command + keyword”, “SEARCH FOR PICTURES OF” combining the voice operation command “SEARCH FOR” and the divided text “PICTURES OF THE GOLDEN GATE BRIDGE” corresponding to the free text. “THE GOLDEN GATE BRIDGE” is the speech recognition result.
- the same effect as in the second embodiment can be obtained even when a voice uttered in English is input to the voice recognition device 200 ′.
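The split-and-combine step of Embodiment 6 can be sketched as follows, assuming the differing partial text (here “SYSTEM”) has already been detected. The function name is an assumption; the differing span is assumed to cover the misrecognized command, so it is replaced by the client-side command.

```python
def integrate(server_candidate: str, diff_text: str, command: str) -> str:
    """Split the server-side candidate at the differing partial text and
    substitute the client-side voice operation command for it."""
    head, _, keyword = server_candidate.partition(diff_text)
    # Re-join and normalize whitespace after the substitution.
    return " ".join((head + command + keyword).split())

result = integrate(
    "SYSTEM PICTURES OF THE GOLDEN GATE BRIDGE",  # server-side candidate 411
    "SYSTEM",                                     # detected differing text
    "SEARCH FOR",                                 # client-side command 414
)
```

With the values of FIG. 17 this produces “SEARCH FOR PICTURES OF THE GOLDEN GATE BRIDGE”, the speech recognition result described above.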
- Embodiment 7.
- In Embodiment 3 described above, the processing operation of the speech recognition system was described taking as an example the case where speech uttered by the user in Japanese is input. In this Embodiment 7, the processing operation of the speech recognition system is described taking as an example the case where speech uttered by the user in English is input. The configuration and operation of the speech recognition system of Embodiment 7 are the same as the configuration (see FIG. 8) and operation (see FIGS. 9 and 11) shown in Embodiment 3, so FIGS. 8, 9, and 11 are used in the description.
- The description is divided into three operations: as the first operation, the operation when speech input uttered in English is performed in a state where no data is stored in the input speech/recognition result storage unit 222; as the second operation, the operation of creating the correction database 221a when the speech recognition apparatus 200'' is started; and as the third operation, the operation when speech input uttered in English is performed in a state where data is stored in the input speech/recognition result storage unit 222 and the correction database 221a has been created.
- FIG. 19 is a diagram showing an example of data stored in the input speech/recognition result storage unit of the speech recognition system according to Embodiment 7 of the present invention. In step ST34 of the flowchart of FIG. 9, when the client-side candidate 414 “SEARCH FOR” shown in FIG. 17 is input as the client-side candidate of the client-side speech recognition unit 202 and the server-side result list 413 shown in FIG. 17 is input, the texts of the server-side candidates 411 and 412 contained in the list 413 do not contain “SEARCH FOR”. Further, the utterance rule input from the input rule determination unit 211 is “command + keyword”, and a collation result indicating that there is no correction candidate is input from the recognition result candidate correction unit 221, so the recognition result integration unit 206'' determines that the text cannot be divided.
- When the server-side candidates cannot be divided (step ST34; NO), in steps ST36 and ST37 the recognition result integration unit 206'' stores the client-side candidate acquired in step ST7 in the input speech/recognition result storage unit 222 as the speech recognition result. In the example of FIG. 19, the speech recognition result “SEARCH FOR” input from the client-side speech recognition unit 202 is stored as the voice operation command 611 corresponding to “voice data (1)” of the voice data 612.
- the above is the first operation of the speech recognition system according to the seventh embodiment.
- FIG. 20 is a diagram showing an example of a correction database of the speech recognition apparatus of the speech recognition system according to the seventh embodiment of the present invention.
- In step ST44 of the flowchart of FIG. 11, when a server-side candidate does not match the voice operation command (step ST44; NO), in step ST45 information associating the server-side candidate, as a correction candidate, with the voice operation command is added to the correction database 221a. In the example of FIG. 20, when the voice operation command 711 stored in the input speech/recognition result storage unit 222 is “SEARCH FOR” and the correction candidates 712, which are server-side candidates, are “SYSTEM” or “SISTER”, information associating them is added to the correction database 221a as correction data 710.
- the above is the second operation of the speech recognition system of the seventh embodiment.
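The second operation above (recording server-side misrecognitions of a known voice operation command as correction candidates) can be sketched as follows. The data structure is an assumption: a mapping from each command to the set of server-side texts observed for it.

```python
def update_correction_db(correction_db: dict, command: str, server_result: str) -> None:
    """Record a server-side result as a correction candidate when it does not
    match the stored voice operation command (cf. steps ST44/ST45)."""
    if server_result != command:
        correction_db.setdefault(command, set()).add(server_result)

db = {}
update_correction_db(db, "SEARCH FOR", "SYSTEM")      # mismatch: stored
update_correction_db(db, "SEARCH FOR", "SISTER")      # mismatch: stored
update_correction_db(db, "SEARCH FOR", "SEARCH FOR")  # match: nothing stored
```

After these calls, `db` associates the command “SEARCH FOR” with the correction candidates “SYSTEM” and “SISTER”, mirroring the correction data 710 of FIG. 20.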
- In step ST32, the recognition result candidate correction unit 221 collates the text of the server-side candidates received in step ST8' against the correction database 221a. For example, when the server-side candidate list 413 shown in FIG. 17 is input, the text of the server-side candidate 411 is collated against the correction candidates 712 of the correction data 710 constituting the correction database 221a shown in FIG. 20.
- In step ST34, the recognition result integration unit 206'' determines whether text division of the server-side candidates is possible, from the client-side candidate generated by the client-side speech recognition unit 202 in step ST7, the utterance rule determined by the input rule determination unit 211 in step ST21, the server-side candidates received by the reception unit 204 in step ST8', and the collation result input from the recognition result candidate correction unit 221 in step ST33. When the client-side candidate 414 “SEARCH FOR” shown in FIG. 17 is input, the utterance rule determined by the input rule determination unit 211 is “command + keyword”, the server-side result list 413 shown in FIG. 17 is input from the reception unit 204 although the texts of the server-side candidates 411 and 412 do not contain “SEARCH FOR”, and “SEARCH FOR” is input as the collation result from the recognition result candidate correction unit 221, it is determined that the text can be divided (step ST34; YES). In step ST35, the recognition result integration unit 206'' divides the text of the server-side candidate using the correction candidate “SYSTEM” corresponding to the collation result “SEARCH FOR” as a reference.
- In step ST24, the text divided based on the utterance-rule information input from the input rule determination unit 211 is combined with the voice operation command corresponding to the client-side candidate to obtain the speech recognition result, and the speech recognition result is output to the output unit 207.
- The above is the third operation of the speech recognition system of Embodiment 7.
- the same effect as in the third embodiment can be obtained even when a voice uttered in English is input to the voice recognition device 200 ′′.
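The third operation's collation step can be sketched as follows: each server-side candidate is checked against the correction database, and when a correction candidate such as “SYSTEM” appears in its text, the associated voice operation command is substituted. Names and structures are illustrative assumptions.

```python
def correct_candidate(correction_db: dict, server_text: str):
    """Return (command, corrected_text) when a stored correction candidate is
    found in the server-side candidate text, else None (cf. steps ST32/ST33)."""
    words = server_text.split()
    for command, candidates in correction_db.items():
        for cand in candidates:
            if cand in words:
                # Divide at the correction candidate and substitute the command.
                corrected = server_text.replace(cand, command, 1)
                return command, corrected
    return None

db = {"SEARCH FOR": {"SYSTEM", "SISTER"}}
match = correct_candidate(db, "SYSTEM PICTURES OF THE GOLDEN GATE BRIDGE")
```

With the correction data of FIG. 20, the server-side candidate 411 is corrected to “SEARCH FOR PICTURES OF THE GOLDEN GATE BRIDGE”.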
- Embodiment 8.
- In Embodiment 4 described above, the processing operation of the speech recognition system was described taking as an example the case where speech uttered by the user in Japanese is input. In this Embodiment 8, the processing operation of the speech recognition system is described taking as an example the case where speech uttered by the user in English is input. The configuration and operation of the speech recognition system of Embodiment 8 are the same as the configuration shown in Embodiment 3 (see FIG. 8) and the operation shown in Embodiment 4 (see FIG. 13), so FIGS. 8 and 13 are used in the description.
- FIG. 21 is a diagram showing an example of generating a speech recognition result in the speech recognition system according to Embodiment 8 of the present invention, and FIG. 22 is a diagram showing an example of stored utterance-rule patterns.
- The client-side speech recognition unit 202 performs speech recognition on the input voice data. For example, when the client-side speech recognition unit 202 recognizes only voice operation commands and proper nouns registered in an address book or the like, in the example shown in FIG. 21 it performs speech recognition on the voice data “Send e-mail to Jones, Happy birthday, Jones.”, recognizes the voice operation command “SEND E-MAIL TO” and the proper noun “JONES”, and obtains the client-side candidate 814 “SEND E-MAIL TO JONES”. In the example of FIG. 21, the client-side candidate list 815 consists of this single candidate 814, which is output to the recognition result integration unit 206' and the input rule determination unit 211.
- The input rule determination unit 211 collates the client-side candidate input from the client-side speech recognition unit 202 against the voice operation commands of the utterance-rule patterns stored in the input rule storage unit 212, and determines the utterance rule of the voice data input in step ST1.
- When the client-side candidate 814 “SEND E-MAIL TO JONES” shown in FIG. 21 is compared against the utterance-rule pattern 910 shown in FIG. 22, the matching voice operation command 911 “SEND E-MAIL TO” is detected, and the corresponding utterance rule 912 of the input speech, “command + proper noun + free sentence”, is acquired. The acquired utterance rule of the input speech is output to the recognition result integration unit 206'.
- In step ST11, the recognition result candidate comparison unit 205 determines whether a differing partial text has been detected. When a differing partial text is detected (step ST11; YES), in step ST12 the difference is output to the recognition result integration unit 206' as the detection result.
- The recognition result integration unit 206' determines whether the proper noun contained in the free sentence can be replaced. This determination is specifically performed as follows. When the client-side candidate 814 “SEND E-MAIL TO JONES” of the client-side speech recognition unit 202 is input together with the server-side candidate list 813 consisting of the server-side candidates 811 and 812 from the reception unit 204, it is determined whether the texts of the server-side candidates 811 and 812 contain the voice operation command “SEND E-MAIL TO”. Based on the text of the voice operation command and the utterance-rule information input from the input rule determination unit 211 (the utterance rule “command + proper noun + free sentence” of the input speech corresponding to the voice operation command “SEND E-MAIL TO” shown in FIG. 22), each candidate is divided into the text corresponding to the proper noun (in the example of FIG. 21, “JOHN” of the server-side candidate 811 and “JON” of the server-side candidate 812) and the text corresponding to the free sentence (in the example of FIG. 21, “HAPPY BIRTHDAY JOHN” of candidate 811 and “HAPPY BIRTHDAY JON” of candidate 812).
- When it is determined that the proper noun can be replaced (step ST51; YES), in step ST52 the proper noun contained in the text divided as the free sentence is replaced with the corresponding text based on the detection result input from the recognition result candidate comparison unit 205. In the example of FIG. 21, the text “JOHN” corresponding to the proper noun contained in the free-sentence text “HAPPY BIRTHDAY JOHN” is replaced with the proper-noun text “JONES” recognized by the client-side speech recognition unit 202, yielding “HAPPY BIRTHDAY JONES”.
- In step ST24, the recognition result integration unit 206', based on the utterance-rule information input from the input rule determination unit 211, combines the divided text with the voice operation command corresponding to the client-side candidate and confirms the speech recognition result. The voice operation command “SEND E-MAIL TO”, the proper noun “JONES”, and the free-sentence text “HAPPY BIRTHDAY JONES” are combined, and “SEND E-MAIL TO JONES HAPPY BIRTHDAY JONES” is confirmed as the speech recognition result.
- the same effect as in the fourth embodiment can be obtained even when a voice uttered in English is input to the voice recognition apparatus 200 ′′.
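The proper-noun replacement of Embodiment 8 can be sketched as follows, assuming the server-side candidate has already been divided into command, proper-noun, and free-sentence parts per the “command + proper noun + free sentence” rule. The function name and word-level replacement are assumptions.

```python
def replace_proper_noun(command: str, server_noun: str,
                        free_text: str, client_noun: str) -> str:
    """Replace the server-side proper-noun guess in the free sentence with the
    client-side proper noun, then recombine per the utterance rule."""
    free = " ".join(
        client_noun if word == server_noun else word
        for word in free_text.split()
    )
    return f"{command} {client_noun} {free}"

result = replace_proper_noun(
    "SEND E-MAIL TO",       # voice operation command 911
    "JOHN",                 # server-side proper-noun guess (candidate 811)
    "HAPPY BIRTHDAY JOHN",  # free-sentence part of candidate 811
    "JONES",                # client-side proper noun (candidate 814)
)
```

With the values of FIG. 21 this confirms “SEND E-MAIL TO JONES HAPPY BIRTHDAY JONES” as the speech recognition result.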
- The speech recognition system and speech recognition apparatus according to the present invention can be applied to various devices having a speech recognition function, and can provide optimally accurate speech recognition results even when an input containing a plurality of intentions is performed.
- 100 speech recognition server, 101 reception unit, 102 server-side speech recognition unit, 103 transmission unit, 200, 200', 200'' speech recognition apparatus, 201 voice input unit, 202 client-side speech recognition unit, 203 transmission unit, 204 reception unit, 205 recognition result candidate comparison unit, 206, 206', 206'' recognition result integration unit, 207 output unit, 211 input rule determination unit, 212 input rule storage unit, 221 recognition result candidate correction unit, 221a correction database, 222 input speech/recognition result storage unit.
Description
For example, in the speech recognition apparatus of Patent Document 1, a method is proposed in which speech recognition is first performed on the client side, and when the recognition score indicating the accuracy of the client-side speech recognition result is judged to be poor, speech recognition is performed on the server side and the server-side speech recognition result is adopted. A method is also proposed in which client-side and server-side speech recognition are performed simultaneously in parallel, the recognition score of the client-side result is compared with that of the server-side result, and the result with the better recognition score is adopted.
However, because these methods compare the client-side and server-side recognition scores and adopt the better one, when the server side does not transmit a recognition score, or when the calculation method of the recognition score transmitted by the server side is unknown (for example, when only the client-side speech recognition is developed in-house and another company's speech recognition server is used), the client-side recognition score cannot be compared accurately, and a highly accurate speech recognition result cannot be selected.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech recognition system according to Embodiment 1 of the present invention.
The speech recognition system is composed of a speech recognition server (server apparatus) 100 and a speech recognition apparatus 200.
The speech recognition server 100 includes a reception unit (server-side reception unit) 101, a server-side speech recognition unit 102, and a transmission unit (server-side transmission unit) 103, and has the function of performing speech recognition on voice data received from the speech recognition apparatus 200 and transmitting the speech recognition result to the speech recognition apparatus 200. The reception unit 101 receives the voice data from the speech recognition apparatus 200. The server-side speech recognition unit 102 performs speech recognition on the voice data received by the reception unit 101 and generates server-side speech recognition result candidates. The transmission unit 103 transmits the server-side speech recognition result candidates generated by the server-side speech recognition unit 102 to the speech recognition apparatus 200.
FIG. 2 is a flowchart showing the operation of the speech recognition system according to Embodiment 1 of the present invention, and FIG. 3 is a diagram showing an example of generating a speech recognition result in that system.
When speech uttered by the user is input (step ST1), the voice input unit 201 of the speech recognition apparatus 200 converts the input speech into voice data and outputs the converted voice data to the client-side speech recognition unit 202 and the transmission unit 203 (step ST2). The transmission unit 203 transmits the voice data input in step ST2 to the speech recognition server 100 (step ST3).
For example, the server-side speech recognition unit 102, which takes arbitrary sentences as recognition targets, performs speech recognition on the voice data「目的地、大船時計専門店に設定する」received from the speech recognition apparatus 200, and obtains the server-side speech recognition result candidate list 303 shown in FIG. 3, containing the server-side candidate 301「目的地を大船渡軽専門店に設定する」and the server-side candidate 302「目的地を豊富な時計専門店に設定する」. The transmission unit 103 transmits the list 303 to the speech recognition apparatus 200.
For example, when the client-side speech recognition unit 202 takes only voice operation commands and place-name information near the current location as recognition targets and the user speaks「目的地、大船時計専門店に設定する」, the client-side speech recognition unit 202 recognizes the voice operation command「目的地」and the nearby place name「大船時計専門店」, and obtains the client-side speech recognition result candidate list 305 shown in FIG. 3, containing the client-side candidate 304「目的地、大船時計専門店」. In the example of FIG. 3, the list 305 consists of this single candidate 304.
For example, in FIG. 3 the server-side candidate list 303 contains the two candidates 301 and 302; their texts「目的地を大船渡軽専門店に設定する」and「目的地を豊富な時計専門店に設定する」are compared, and the portion enclosed by the head text「目的地を」and the tail text「専門店に設定する」is detected as the differing partial text. Specifically,「大船渡軽」of candidate 301 and「豊富な時計」of candidate 302 are detected as differing partial texts.
For example, in FIG. 3, if the server-side candidate list 303 contained only candidate 301, no differing partial text would be detected.
For example, when three server-side candidates exist and all three differing partial texts differ, the reliability is judged to be 1/3; when only one candidate differs, the reliability is judged to be 2/3. Only partial texts whose reliability is 1/3 or less are replaced with the corresponding text of the client-side speech recognition result candidate.
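The reliability judgment described above (voting over the differing partial texts of the server-side candidates and replacing only low-reliability parts with the client-side text) can be sketched as follows. Function names and the voting formulation are illustrative assumptions.

```python
from collections import Counter

def reliability(partial_texts):
    """Fraction of candidates that agree on the most common partial text."""
    counts = Counter(partial_texts)
    return counts.most_common(1)[0][1] / len(partial_texts)

def choose_text(partial_texts, client_text, threshold=1/3):
    """Keep the majority server-side text, but fall back to the client-side
    text when reliability is at or below the threshold."""
    best = Counter(partial_texts).most_common(1)[0][0]
    return client_text if reliability(partial_texts) <= threshold else best
```

With three candidates that all disagree, the reliability is 1/3 and the client-side text is adopted; when two of three agree, the reliability is 2/3 and the majority server-side text is kept.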
This improves the accuracy of speech recognition and yields a more accurate speech recognition result.
This suppresses the output of erroneous speech recognition results.
This improves the accuracy of speech recognition and allows a more accurate speech recognition result to be output.
Embodiment 1 described above compares the texts of the server-side candidates and replaces the differing partial text with the client-side candidate. Embodiment 2 instead splits the server-side candidate text at the differing partial text and combines the split text with data based on the client-side candidate.
FIG. 5 is a flowchart showing the operation of the speech recognition system according to Embodiment 2 of the present invention, FIG. 6 is a diagram showing an example of generating a speech recognition result in that system, and FIG. 7 is a diagram showing an example of stored utterance-rule patterns. In the flowchart of FIG. 5, steps identical to those of the speech recognition system of Embodiment 1 carry the same reference signs as in FIG. 2, and their description is omitted or simplified.
For example, when the client-side speech recognition unit 202 takes only voice operation commands as recognition targets, in the example of FIG. 6 it performs speech recognition on the user's voice data「メール、渋滞で到着が遅れます。」and obtains the single client-side candidate 404「メール」. In FIG. 6, the client-side result list 405 consists of this single candidate 404, which is output to the recognition result integration unit 206' and the input rule determination unit 211.
As shown in FIG. 7, the utterance-rule pattern 500 stored in the input rule storage unit 212 consists of a voice operation command 501 and an utterance rule 502 of the input speech; for example, when the voice operation command 501 is「メール」, the utterance rule 502 obtained for the input speech is "command (メール) + free sentence".
As shown in FIG. 6, when the client-side candidate 404 is「メール」, the input rule determination unit 211 acquires the utterance rule 502 "command + free sentence" corresponding to the matching voice operation command 501「メール」. The acquired utterance rule of the input speech is output to the recognition result integration unit 206'.
For example, when the server-side speech recognition unit 102 takes arbitrary sentences as recognition targets, it performs speech recognition on the received voice data「メール、渋滞で到着が遅れます。」and obtains the server-side candidate 401「滅入る、渋滞で到着が遅れます」and the server-side candidate 402「見える、渋滞で到着が遅れます」. The two acquired candidates 401 and 402 are output to the speech recognition apparatus 200' as the server-side candidate list 403.
In the examples of FIGS. 6 and 7, when the client-side candidate 404「メール」is input together with the server-side candidate list 403 consisting of candidates 401 and 402 from the reception unit 204, the texts of candidates 401 and 402 do not contain「メール」, the utterance rule input from the input rule determination unit 211 is "command + free sentence", and a detection result indicating that a difference was detected is input from the recognition result candidate comparison unit 205; it is therefore determined that text division is necessary.
In the example of FIG. 6,「滅入る」is detected as the differing partial text of candidate 401, so the text is divided into the two parts「滅入る」and「渋滞で到着が遅れます」.
In the example of FIG. 6, based on the utterance rule "command + free sentence", the voice operation command「メール」and the divided text corresponding to the free sentence,「渋滞で到着が遅れます」, are combined into「メール、渋滞で到着が遅れます」as the speech recognition result.
When the text of the client-side candidate input from the client-side speech recognition unit 202 is contained in a server-side candidate received by the reception unit 204, the recognition result integration unit 206' determines that text division is unnecessary.
Likewise, when the utterance rule input from the input rule determination unit 211 is "command only", it determines that text division is unnecessary.
Further, when the detection result input from the recognition result candidate comparison unit 205 indicates that no difference was detected, it determines that text division is unnecessary.
In this case, when a difference occurs in a portion other than the head or tail of the server-side candidates, it can be judged that a recognition error occurred outside the voice operation command, and the user can be asked to re-input the speech. This suppresses the output of erroneous speech recognition results.
Embodiment 2 described above compares the server-side candidate texts and divides the server-side candidate text at the differing partial text. Embodiment 3 instead detects changes in the server-side candidates and always enables text division.
The speech recognition system of Embodiment 3 is also composed of the speech recognition server 100 and a speech recognition apparatus 200''. The speech recognition apparatus 200'' of Embodiment 3 adds a recognition result candidate correction unit 221 and an input speech/recognition result storage unit 222 to the speech recognition apparatus 200' shown in FIG. 4, and removes the recognition result candidate comparison unit 205. In the following, components identical or corresponding to those of the speech recognition systems of Embodiments 1 and 2 carry the same reference signs as in FIG. 1 or FIG. 4, and their description is omitted or simplified.
In the following, steps identical to those of the speech recognition systems of Embodiments 1 and 2 carry the same reference signs as in FIG. 2 or FIG. 5, and their description is omitted or simplified.
First, the first operation will be described with reference to FIGS. 9 and 10 and FIG. 6 of Embodiment 2.
FIG. 9 is a flowchart showing the first and third operations of the speech recognition system according to Embodiment 3 of the present invention, and FIG. 10 is a diagram showing an example of data stored in the input speech/recognition result storage unit.
When speech uttered by the user is input (step ST1), the voice input unit 201 of the speech recognition apparatus 200'' converts the input uttered speech into voice data and outputs the converted voice data to the client-side speech recognition unit 202, the transmission unit 203, and the input speech/recognition result storage unit 222 (step ST2'). The input speech/recognition result storage unit 222 stores the voice data input in step ST2' as "voice data (1)", for example in the format shown in FIG. 10 (step ST31).
In the example of FIG. 10, a voice operation command 601 and voice data 602 are associated with each other to form the input speech information 600.
On the other hand, when the text of the client-side candidate input from the client-side speech recognition unit 202 is contained in a server-side candidate input from the reception unit 204, it is determined that text division is possible.
The above is the first operation of the speech recognition system of Embodiment 3.
Next, the second operation will be described with reference to FIGS. 11 and 12.
FIG. 11 is a flowchart showing the second operation of the speech recognition system according to Embodiment 3 of the present invention, and FIG. 12 is a diagram showing an example of the correction database of the speech recognition apparatus of that system.
When the speech recognition apparatus 200'' starts up, the recognition result candidate correction unit 221 refers to the input speech/recognition result storage unit 222 and determines whether voice data is stored (step ST41). If no voice data is stored (step ST41; NO), the process ends. If voice data is stored (step ST41; YES), the stored voice data is acquired (step ST42) and transmitted to the speech recognition server 100 via the transmission unit 203 (step ST43).
The reception unit 204 of the speech recognition apparatus 200'' receives the server-side candidates transmitted from the speech recognition server 100 in step ST6 and outputs them to the recognition result candidate correction unit 221 (step ST8''). The correction unit 221 determines whether each server-side candidate input in step ST8'' matches a voice operation command stored in the input speech/recognition result storage unit 222 (step ST44). If they match (step ST44; YES), the process proceeds to step ST46.
In the example of FIG. 12, when the voice operation command 701 stored in the input speech/recognition result storage unit 222 is「メール」and the correction candidates 702, which are server-side candidates, are「滅入る」or「見える」, information associating them is added to the correction database 221a as correction data 700.
The above is the second operation of the speech recognition system of Embodiment 3.
Next, the third operation will be described with reference to the flowchart of FIG. 9 above. Description of processing identical to the first operation is omitted.
In step ST32, the recognition result candidate correction unit 221 collates the text of the server-side candidates received in step ST8' against the correction database 221a. For example, when the server-side candidate list 403 shown in FIG. 6 is input, the text of the server-side candidate 401 is collated against the correction candidates 702 of the correction data 700 constituting the correction database 221a shown in FIG. 12.
When it detects that the correction candidate「滅入る」of the correction database 221a is contained in the text of the server-side candidate 401, in step ST33 it outputs the correction candidate「滅入る」and the corresponding voice operation command「メール」to the recognition result integration unit 206'' as the collation result.
The above is the third operation of the speech recognition system of Embodiment 3.
As a result, even when the speech recognition server 100 is updated and its recognition results change, the system can follow the change, and the server-side and client-side speech recognition result candidates can be integrated to output a more accurate speech recognition result.
Embodiment 3 described above detects changes in the server-side candidates of the speech recognition server 100 and always enables text division. Embodiment 4 instead detects proper nouns contained in the text divided as a free sentence.
FIG. 13 is a flowchart showing the operation of the speech recognition system according to Embodiment 4 of the present invention. FIG. 14 shows an example of generating a speech recognition result in that system, and FIG. 15 is a diagram showing an example of stored utterance-rule patterns. In the following, steps identical to those of the speech recognition system of Embodiment 2 carry the same reference signs as in FIG. 5, and their description is omitted or simplified.
For example, when the client-side speech recognition unit 202 takes only voice operation commands and proper nouns registered in an address book or the like as recognition targets, in the example of FIG. 14 it performs speech recognition on the user's voice data「健児さんにメール、本日は私と健児さんで対応します」, recognizes the proper noun「健児」and the voice operation command「さんにメール」, and obtains the client-side candidate 804「健児さんにメール」. In the example of FIG. 14, the client-side candidate list 805 consists of this single candidate 804, which is output to the recognition result integration unit 206' and the input rule determination unit 211.
For example, comparing the client-side candidate 804「健児さんにメール」of FIG. 14 against the utterance-rule pattern 900 of FIG. 15 detects the matching voice operation command 901「さんにメール」, and the corresponding utterance rule 902 of the input speech, "proper noun + command + free sentence", is acquired. The acquired utterance rule of the input speech is output to the recognition result integration unit 206'.
In the example of FIG. 14, the server-side result list 803 contains the two server-side candidates 801 and 802; their texts「検事さんにメール、本日は私と検事さんで対応します」and「賢治さんにメール、本日は私と賢治さんで対応します」are compared, and it is detected that there are two differing portions and that each candidate uses the same text in both places (「検事」for candidate 801 and「賢治」for candidate 802).
In the examples of FIGS. 14 and 15, when the client-side candidate 804「健児さんにメール」is input together with the server-side candidate list 803 consisting of candidates 801 and 802 from the reception unit 204, it is determined whether the texts of candidates 801 and 802 contain the voice operation command「さんにメール」.
In the example of FIG. 14, the text「検事」corresponding to the proper noun contained in the free-sentence text「本日は私と検事さんで対応します」is replaced with the proper-noun text「健児」recognized by the client-side speech recognition unit 202, yielding「本日は私と健児さんで対応します」.
In the example of FIG. 14, based on the utterance rule "proper noun + command + free sentence", the proper noun「健児」, the voice operation command「さんにメール」, and the free-sentence text「本日は私と健児さんで対応します」are combined, and「健児さんにメール、本日は私と健児さんで対応します」is confirmed as the speech recognition result.
In Embodiment 1 above, the processing operation of the speech recognition system was described for the case where speech uttered by the user in Japanese is input; in Embodiment 5, the processing operation is described for the case where speech uttered by the user in English is input. The configuration and operation of the speech recognition system of Embodiment 5 are the same as the configuration (see FIG. 1) and operation (see FIG. 2) shown in Embodiment 1, so FIGS. 1 and 2 are used in the description.
In Embodiment 2 above, the processing operation of the speech recognition system was described for the case where speech uttered by the user in Japanese is input; in Embodiment 6, the processing operation is described for the case where speech uttered by the user in English is input. The configuration and operation of the speech recognition system of Embodiment 6 are the same as the configuration (see FIG. 4) and operation (see FIG. 5) shown in Embodiment 2, so FIGS. 4 and 5 are used in the description.
For example, when the client-side speech recognition unit 202 takes only voice operation commands as recognition targets, in the example of FIG. 17 it performs speech recognition on the user's voice data "Search for pictures of the golden gate bridge." and obtains the single client-side candidate 414 "SEARCH FOR". In the example of FIG. 17, the client-side result list 415 consists of this single candidate 414.
In the example of FIG. 18, the utterance-rule pattern 510 stored in the input rule storage unit 212 consists of a voice operation command 511 and an utterance rule 512 of the input speech; for example, when the voice operation command 511 is "SEARCH FOR", the utterance rule 512 obtained for the input speech is "command + keyword".
In the example of FIG. 17, when the client-side candidate 414 is "SEARCH FOR", the input rule determination unit 211 acquires the utterance rule 512 "command + keyword" corresponding to the matching voice operation command 511 "SEARCH FOR".
In the example of FIG. 17, "SYSTEM" is detected as the differing partial text of the server-side candidate 411, so the text is divided into the two parts "SYSTEM" and "PICTURES OF THE GOLDEN GATE BRIDGE".
In the example of FIG. 17, based on the utterance rule "command + keyword", the voice operation command "SEARCH FOR" and the divided text corresponding to the free sentence, "PICTURES OF THE GOLDEN GATE BRIDGE", are combined into "SEARCH FOR PICTURES OF THE GOLDEN GATE BRIDGE" as the speech recognition result.
In Embodiment 3 above, the processing operation of the speech recognition system was described for the case where speech uttered by the user in Japanese is input; in Embodiment 7, the processing operation is described for the case where speech uttered by the user in English is input. The configuration and operation of the speech recognition system of Embodiment 7 are the same as the configuration (see FIG. 8) and operation (see FIGS. 9 and 11) shown in Embodiment 3, so FIGS. 8, 9, and 11 are used in the description.
First, the first operation will be described with reference to FIGS. 9 and 19 and FIG. 17 of Embodiment 6. Description of operations identical to those of Embodiment 3 is omitted.
FIG. 19 is a diagram showing an example of data stored in the input speech/recognition result storage unit of the speech recognition system according to Embodiment 7 of the present invention.
In step ST34 of the flowchart of FIG. 9, the recognition result integration unit 206'' determines whether text division of the server-side candidates is possible, from the client-side candidate generated by the client-side speech recognition unit 202 in step ST7, the utterance rule determined by the input rule determination unit 211 in step ST21, the server-side candidates received by the reception unit 204 in step ST8', and the collation result acquired by the recognition result candidate correction unit 221 in step ST33.
In the example of FIG. 19, the speech recognition result "SEARCH FOR" input from the client-side speech recognition unit 202 is stored as the voice operation command 611 corresponding to "voice data (1)" of the voice data 612.
The above is the first operation of the speech recognition system of Embodiment 7.
Next, the second operation will be described with reference to FIGS. 11 and 20.
FIG. 20 is a diagram showing an example of the correction database of the speech recognition apparatus of the speech recognition system according to Embodiment 7 of the present invention.
In step ST44 of the flowchart of FIG. 11, when a server-side candidate does not match the voice operation command (step ST44; NO), in step ST45 information associating the server-side candidate, as a correction candidate, with the voice operation command is added to the correction database 221a.
The above is the second operation of the speech recognition system of Embodiment 7.
Next, the third operation will be described with reference to the flowchart of FIG. 9 above.
In step ST32, the recognition result candidate correction unit 221 collates the text of the server-side candidates received in step ST8' against the correction database 221a. For example, when the server-side candidate list 413 shown in FIG. 17 is input, the text of the server-side candidate 411 is collated against the correction candidates 712 of the correction data 710 constituting the correction database 221a shown in FIG. 20.
When it detects that the correction candidate "SYSTEM" of the correction database 221a is contained in the text of the server-side candidate 411, in step ST33 it outputs the correction candidate "SYSTEM" and the corresponding voice operation command "SEARCH FOR" to the recognition result integration unit 206'' as the collation result.
The above is the third operation of the speech recognition system of Embodiment 7.
In Embodiment 4 above, the processing operation of the speech recognition system was described for the case where speech uttered by the user in Japanese is input; in Embodiment 8, the processing operation is described for the case where speech uttered by the user in English is input. The configuration and operation of the speech recognition system of Embodiment 8 are the same as the configuration shown in Embodiment 3 (see FIG. 8) and the operation shown in Embodiment 4 (see FIG. 13), so FIGS. 8 and 13 are used in the description.
For example, when the client-side speech recognition unit 202 takes only voice operation commands and proper nouns registered in an address book or the like as recognition targets, in the example of FIG. 21 it performs speech recognition on the user's voice data "Send e-mail to Jones, Happy birthday, Jones.", recognizes the voice operation command "SEND E-MAIL TO" and the proper noun "JONES", and obtains the client-side candidate 814 "SEND E-MAIL TO JONES". In the example of FIG. 21, the client-side candidate list 815 consists of this single candidate 814, which is output to the recognition result integration unit 206' and the input rule determination unit 211.
For example, comparing the client-side candidate 814 "SEND E-MAIL TO JONES" of FIG. 21 against the utterance-rule pattern 910 of FIG. 22 detects the matching voice operation command 911 "SEND E-MAIL TO", and the corresponding utterance rule 912 of the input speech, "command + proper noun + free sentence", is acquired. The acquired utterance rule of the input speech is output to the recognition result integration unit 206'.
In the example of FIG. 21, the server-side result list 813 contains the two server-side candidates 811 and 812; their texts "SEND E-MAIL TO JOHN HAPPY BIRTHDAY JOHN" and "SEND E-MAIL TO JON HAPPY BIRTHDAY JON" are compared, and it is detected that there are two differing portions and that each candidate uses the same text in both places ("JOHN" for candidate 811 and "JON" for candidate 812).
固有名詞の置き換えが可能であるか否かの判定は、具体的に次のように行われる。図21および図22の例では、クライアント側音声認識部202のクライアント側音声認識結果候補814「SEND E-MAIL TO JONES」が入力され、受信部204からサーバ側音声認識結果候補811,812で構成されるサーバ側音声認識結果候補リスト813が入力された場合、サーバ側音声認識結果候補811,812のテキストに音声操作コマンド「SEND E-MAIL TO」が含まれているか否かを判定する。
図21の例では、自由文として分割したテキスト「HAPPY BIRTHDAY JOHN」の中に含まれる固有名詞に対応するテキスト「JOHN」を、クライアント側音声認識部202で認識した固有名詞のテキスト「JONES」と置き換えて「HAPPY BIRTHDAY JONES」とする。
In the example of FIG. 21, based on the utterance rule "command + proper noun + free sentence", the voice operation command "SEND E-MAIL TO", the proper noun "JONES", and the text "HAPPY BIRTHDAY JONES" corresponding to the free sentence are concatenated into "SEND E-MAIL TO JONES HAPPY BIRTHDAY JONES", which is confirmed as the speech recognition result.
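The replacement and recombination described above can be sketched as follows; the splitting logic and all names are illustrative assumptions for the FIG. 21 example, not the patented implementation.

```python
def integrate_result(server_text, command, client_proper_noun, server_proper_noun):
    """Split the server-side text according to the utterance rule
    "command + proper noun + free sentence", substitute the client-side
    proper noun, and recombine the parts into the confirmed result."""
    assert server_text.startswith(command)
    remainder = server_text[len(command):].strip()
    # remainder = "<proper noun> <free sentence>"
    _noun, free_sentence = remainder.split(" ", 1)
    # Replace the misrecognized proper noun inside the free sentence as well.
    free_sentence = free_sentence.replace(server_proper_noun, client_proper_noun)
    return " ".join([command, client_proper_noun, free_sentence])

confirmed = integrate_result(
    "SEND E-MAIL TO JOHN HAPPY BIRTHDAY JOHN",
    "SEND E-MAIL TO", "JONES", "JOHN",
)
```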
Claims (5)
- A speech recognition system comprising a server device and a client-side speech recognition device connected to the server device, wherein
the server device comprises:
a server-side receiving unit that receives voice data input from the speech recognition device;
a server-side speech recognition unit that performs speech recognition on the voice data received by the server-side receiving unit and generates server-side speech recognition result candidates; and
a server-side transmitting unit that transmits the server-side speech recognition result candidates generated by the server-side speech recognition unit to the speech recognition device, and
the speech recognition device comprises:
a voice input unit that converts input uttered speech into the voice data;
a client-side speech recognition unit that performs speech recognition on the voice data converted by the voice input unit and generates a client-side speech recognition result candidate;
a client-side transmitting unit that transmits the voice data converted by the voice input unit to the server device;
a client-side receiving unit that receives the server-side speech recognition result candidates transmitted by the server-side transmitting unit;
a recognition result candidate comparison unit that compares the plurality of server-side speech recognition result candidates received by the client-side receiving unit and detects differing text;
a recognition result integration unit that integrates the client-side speech recognition result candidate and the server-side speech recognition result candidates on the basis of the client-side speech recognition result candidate, the server-side speech recognition result candidates, and the detection result of the recognition result candidate comparison unit, and confirms a speech recognition result; and
an output unit that outputs the speech recognition result confirmed by the recognition result integration unit. - The speech recognition system according to claim 1, wherein the speech recognition device further comprises
an input rule determination unit that compares the client-side speech recognition result with utterance rule patterns, each of which associates a predetermined keyword with an utterance rule for that keyword, and determines the utterance rule of the voice data, and
the recognition result integration unit integrates the client-side speech recognition result candidate and the server-side speech recognition result candidates on the basis of the client-side speech recognition result, the server-side speech recognition result, the detection result of the recognition result candidate comparison unit, and the utterance rule determined by the input rule determination unit. - The speech recognition system according to claim 2, wherein the speech recognition device further comprises:
an input voice/recognition result storage unit that stores, in association with each other, the voice data converted by the voice input unit and the speech recognition result confirmed by the recognition result integration unit; and
a recognition result candidate correction unit that, at device startup, obtains server-side speech recognition result candidates for the voice data accumulated in the input voice/recognition result storage unit to create a database, and collates the created database against the server-side speech recognition result candidates received by the client-side receiving unit, and
the recognition result integration unit integrates the client-side speech recognition result candidate and the server-side speech recognition result candidates on the basis of the collation result of the recognition result candidate correction unit. - The speech recognition system according to claim 2, wherein the recognition result candidate comparison unit compares the plurality of server-side speech recognition result candidates received by the client-side receiving unit, detects a plurality of differing texts, and determines whether the detected texts indicate the same content, and
the recognition result integration unit, when the plurality of texts detected by the recognition result candidate comparison unit are determined to indicate the same content, replaces the detected texts with a proper noun based on the server-side speech recognition result. - A client-side speech recognition device connected to a server device having a speech recognition function, the device comprising:
a voice input unit that converts input uttered speech into voice data;
a client-side speech recognition unit that performs speech recognition on the voice data converted by the voice input unit and generates a client-side speech recognition result candidate;
a client-side transmitting unit that transmits the voice data converted by the voice input unit to the server device;
a client-side receiving unit that receives server-side speech recognition result candidates generated by the server device on the basis of the voice data transmitted by the client-side transmitting unit;
a recognition result candidate comparison unit that compares the plurality of server-side speech recognition result candidates received by the client-side receiving unit and detects differing text;
a recognition result integration unit that integrates the client-side speech recognition result candidate and the server-side speech recognition result candidates on the basis of the client-side speech recognition result candidate, the server-side speech recognition result candidates, and the detection result of the recognition result candidate comparison unit, and confirms a speech recognition result; and
an output unit that outputs the speech recognition result confirmed by the recognition result integration unit.
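As a rough illustration of the data flow recited in claim 1, the units can be wired together as below. This is a simplified sketch under stated assumptions: all class and method names are invented for the example, the server round trip is stubbed with a callable, and the integration policy (trust the client for the differing proper-noun tokens, the server elsewhere) is a simplification of the patent's logic.

```python
class SpeechRecognitionDevice:
    """Minimal sketch of the claim-1 client-side units (names illustrative)."""

    def __init__(self, client_recognize, server_recognize):
        self.client_recognize = client_recognize   # client-side speech recognition unit
        self.server_recognize = server_recognize   # stands in for transmit/receive units

    def recognize(self, voice_data):
        client_candidate = self.client_recognize(voice_data)
        server_candidates = self.server_recognize(voice_data)
        diff_positions = self._compare(server_candidates)   # comparison unit
        return self._integrate(client_candidate, server_candidates, diff_positions)

    @staticmethod
    def _compare(candidates):
        """Detect token positions where the server-side candidates disagree."""
        token_lists = [c.split() for c in candidates]
        return [i for i, tokens in enumerate(zip(*token_lists))
                if len(set(tokens)) > 1]

    @staticmethod
    def _integrate(client_candidate, server_candidates, diff_positions):
        """Trust the client for the differing (proper noun) tokens, the server elsewhere."""
        tokens = server_candidates[0].split()
        proper_noun = client_candidate.split()[-1]  # client recognized command + proper noun
        for i in diff_positions:
            tokens[i] = proper_noun
        return " ".join(tokens)

device = SpeechRecognitionDevice(
    client_recognize=lambda v: "SEND E-MAIL TO JONES",
    server_recognize=lambda v: ["SEND E-MAIL TO JOHN HAPPY BIRTHDAY JOHN",
                                "SEND E-MAIL TO JON HAPPY BIRTHDAY JON"],
)
result = device.recognize(b"...")
```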
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/649,128 US9761228B2 (en) | 2013-02-25 | 2013-11-20 | Voice recognition system and voice recognition device |
CN201380073708.3A CN105027198B (zh) | 2013-02-25 | 2013-11-20 | 语音识别***以及语音识别装置 |
JP2015501272A JP5921756B2 (ja) | 2013-02-25 | 2013-11-20 | 音声認識システムおよび音声認識装置 |
DE112013006728.5T DE112013006728B4 (de) | 2013-02-25 | 2013-11-20 | Spracherkennungssystem und Spracherkennungsgerät |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2013034574 | 2013-02-25 | ||
JP2013-034574 | 2013-02-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014129033A1 true WO2014129033A1 (ja) | 2014-08-28 |
Family
ID=51390848
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2013/081288 WO2014129033A1 (ja) | 2013-11-20 | Speech recognition system and speech recognition device |
Country Status (5)
Country | Link |
---|---|
US (1) | US9761228B2 (ja) |
JP (1) | JP5921756B2 (ja) |
CN (1) | CN105027198B (ja) |
DE (1) | DE112013006728B4 (ja) |
WO (1) | WO2014129033A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105225665A (zh) * | 2015-10-15 | 2016-01-06 | 桂林电子科技大学 | A speech recognition method and speech recognition device |
JP2018045193A (ja) * | 2016-09-16 | 2018-03-22 | 株式会社リコー | Communication terminal, speech conversion method, and program |
JP2022001930A (ja) * | 2020-06-22 | 2022-01-06 | 徹 江崎 | Active learning system and active learning program |
JP2022128594A (ja) * | 2021-02-23 | 2022-09-02 | アバイア マネジメント エル.ピー. | Word-based representation of communication session quality |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2012308731B2 (en) | 2011-09-13 | 2017-07-13 | Medrobotics Corporation | Highly articulated probes with anti-twist link arrangement, methods of formation thereof, and methods of performing medical procedures |
KR102246893B1 (ko) * | 2013-12-11 | 2021-04-30 | 삼성전자주식회사 | Interactive system, control method thereof, interactive server, and control method thereof |
WO2016014026A1 (en) * | 2014-07-22 | 2016-01-28 | Nuance Communications, Inc. | Systems and methods for speech-based searching of content repositories |
CN106782546A (zh) * | 2015-11-17 | 2017-05-31 | 深圳市北科瑞声科技有限公司 | Speech recognition method and device |
CN105374357B (zh) * | 2015-11-23 | 2022-03-29 | 青岛海尔智能技术研发有限公司 | A speech recognition method and device, and speech control system |
CN106228975A (zh) * | 2016-09-08 | 2016-12-14 | 康佳集团股份有限公司 | Speech recognition system and method for a mobile terminal |
KR101700099B1 (ko) * | 2016-10-11 | 2017-01-31 | 미디어젠(주) | Automatic evaluation system for composite performance of hybrid speech recognition |
US10971157B2 (en) * | 2017-01-11 | 2021-04-06 | Nuance Communications, Inc. | Methods and apparatus for hybrid speech recognition processing |
KR102389625B1 (ko) * | 2017-04-30 | 2022-04-25 | 삼성전자주식회사 | Electronic device for processing user utterances and control method of the electronic device |
KR102552486B1 (ko) * | 2017-11-02 | 2023-07-06 | 현대자동차주식회사 | Speech recognition apparatus and method for a vehicle |
JP2019200393A (ja) * | 2018-05-18 | 2019-11-21 | シャープ株式会社 | Determination device, electronic apparatus, response system, method of controlling determination device, and control program |
KR20200045851A (ko) * | 2018-10-23 | 2020-05-06 | 삼성전자주식회사 | Electronic device and system providing a speech recognition service |
CN109524002A (zh) * | 2018-12-28 | 2019-03-26 | 江苏惠通集团有限责任公司 | Intelligent speech recognition method and device |
EP3690876A1 (de) | 2019-01-30 | 2020-08-05 | Siemens Healthcare GmbH | System for performing magnetic resonance tomography and method for controlling an MR scanner |
JP6718182B1 (ja) * | 2019-05-08 | 2020-07-08 | 株式会社インタラクティブソリューションズ | Misconversion dictionary creation system |
CN114223029A (zh) * | 2019-08-13 | 2022-03-22 | 三星电子株式会社 | Server supporting speech recognition of a device and operating method of the server |
CN110853635B (zh) * | 2019-10-14 | 2022-04-01 | 广东美的白色家电技术创新中心有限公司 | Speech recognition method, audio annotation method, computer device, and storage device |
CN111063347B (zh) * | 2019-12-12 | 2022-06-07 | 安徽听见科技有限公司 | Real-time speech recognition method, server, and client |
JP2021152589A (ja) * | 2020-03-24 | 2021-09-30 | シャープ株式会社 | Control device for electronic apparatus, control program, control method, and electronic apparatus |
JP2021191231A (ja) * | 2020-06-04 | 2021-12-13 | トランスポーテーション アイピー ホールディングス,エルエルシー | Electric power supply system |
US11798549B2 (en) * | 2021-03-19 | 2023-10-24 | Mitel Networks Corporation | Generating action items during a conferencing session |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004245938A (ja) * | 2003-02-12 | 2004-09-02 | Fujitsu Ten Ltd | Speech recognition device and program |
JP2009237439A (ja) * | 2008-03-28 | 2009-10-15 | Kddi Corp | Speech recognition device, speech recognition method, and speech recognition program for a mobile terminal |
JP2010085536A (ja) * | 2008-09-30 | 2010-04-15 | Fyuutorekku:Kk | Speech recognition system, speech recognition method, speech recognition client, and program |
WO2011163538A1 (en) * | 2010-06-24 | 2011-12-29 | Honda Motor Co., Ltd. | Communication system and method between an on-vehicle voice recognition system and an off-vehicle voice recognition system |
Family Cites Families (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5642519A (en) * | 1994-04-29 | 1997-06-24 | Sun Microsystems, Inc. | Speech interpreter with a unified grammer compiler |
US5596679A (en) * | 1994-10-26 | 1997-01-21 | Motorola, Inc. | Method and system for identifying spoken sounds in continuous speech by comparing classifier outputs |
JP3741156B2 (ja) * | 1995-04-07 | 2006-02-01 | ソニー株式会社 | Speech recognition device, speech recognition method, and speech translation device |
US6064959A (en) * | 1997-03-28 | 2000-05-16 | Dragon Systems, Inc. | Error correction in speech recognition |
US5864805A (en) * | 1996-12-20 | 1999-01-26 | International Business Machines Corporation | Method and apparatus for error correction in a continuous dictation system |
US6574596B2 (en) * | 1999-02-08 | 2003-06-03 | Qualcomm Incorporated | Voice recognition rejection scheme |
EP1181684B1 (en) * | 1999-03-26 | 2004-11-03 | Scansoft, Inc. | Client-server speech recognition |
JP2001188555A (ja) * | 1999-12-28 | 2001-07-10 | Sony Corp | Information processing device and method, and recording medium |
AU2001259446A1 (en) * | 2000-05-02 | 2001-11-12 | Dragon Systems, Inc. | Error correction in speech recognition |
CN1180398C (zh) * | 2000-05-26 | 2004-12-15 | 封家麒 | A speech recognition method and system |
US6671669B1 (en) * | 2000-07-18 | 2003-12-30 | Qualcomm Incorporated | combined engine system and method for voice recognition |
US7143040B2 (en) * | 2000-07-20 | 2006-11-28 | British Telecommunications Public Limited Company | Interactive dialogues |
US6754629B1 (en) * | 2000-09-08 | 2004-06-22 | Qualcomm Incorporated | System and method for automatic voice recognition using mapping |
CN1151489C (zh) * | 2000-11-15 | 2004-05-26 | 中国科学院自动化研究所 | Speech recognition method for Chinese personal names, place names, and organization names |
US6985862B2 (en) * | 2001-03-22 | 2006-01-10 | Tellme Networks, Inc. | Histogram grammar weighting and error corrective training of grammar weights |
US6701293B2 (en) * | 2001-06-13 | 2004-03-02 | Intel Corporation | Combining N-best lists from multiple speech recognizers |
GB2383459B (en) * | 2001-12-20 | 2005-05-18 | Hewlett Packard Co | Speech recognition system and method |
JP2003295893A (ja) * | 2002-04-01 | 2003-10-15 | Omron Corp | Speech recognition system, device, speech recognition method, speech recognition program, and computer-readable recording medium storing a speech recognition program |
JP2005010691A (ja) * | 2003-06-20 | 2005-01-13 | P To Pa:Kk | Speech recognition device, speech recognition method, conversation control device, conversation control method, and programs therefor |
WO2007013521A1 (ja) * | 2005-07-26 | 2007-02-01 | Honda Motor Co., Ltd. | Device, method, and program for implementing interaction between a user and a machine |
US20080288252A1 (en) * | 2007-03-07 | 2008-11-20 | Cerra Joseph P | Speech recognition of speech recorded by a mobile communication facility |
US8949266B2 (en) * | 2007-03-07 | 2015-02-03 | Vlingo Corporation | Multiple web-based content category searching in mobile search application |
JP4812029B2 (ja) * | 2007-03-16 | 2011-11-09 | 富士通株式会社 | Speech recognition system and speech recognition program |
US8041565B1 (en) * | 2007-05-04 | 2011-10-18 | Foneweb, Inc. | Precision speech to text conversion |
JP2009128675A (ja) * | 2007-11-26 | 2009-06-11 | Toshiba Corp | Device, method, and program for recognizing speech |
US8219407B1 (en) * | 2007-12-27 | 2012-07-10 | Great Northern Research, LLC | Method for processing the output of a speech recognizer |
US8676577B2 (en) * | 2008-03-31 | 2014-03-18 | Canyon IP Holdings, LLC | Use of metadata to post process speech recognition output |
CN102341843B (zh) * | 2009-03-03 | 2014-01-29 | 三菱电机株式会社 | Speech recognition device |
US8892439B2 (en) * | 2009-07-15 | 2014-11-18 | Microsoft Corporation | Combination and federation of local and remote speech recognition |
US20110184740A1 (en) * | 2010-01-26 | 2011-07-28 | Google Inc. | Integration of Embedded and Network Speech Recognizers |
CN101807399A (zh) * | 2010-02-02 | 2010-08-18 | 华为终端有限公司 | A speech recognition method and device |
CN101923854B (zh) * | 2010-08-31 | 2012-03-28 | 中国科学院计算技术研究所 | An interactive speech recognition system and method |
US10032455B2 (en) * | 2011-01-07 | 2018-07-24 | Nuance Communications, Inc. | Configurable speech recognition system using a pronunciation alignment between multiple recognizers |
US9183843B2 (en) * | 2011-01-07 | 2015-11-10 | Nuance Communications, Inc. | Configurable speech recognition system using multiple recognizers |
US8954329B2 (en) * | 2011-05-23 | 2015-02-10 | Nuance Communications, Inc. | Methods and apparatus for acoustic disambiguation by insertion of disambiguating textual information |
US9443518B1 (en) * | 2011-08-31 | 2016-09-13 | Google Inc. | Text transcript generation from a communication session |
US20130073286A1 (en) * | 2011-09-20 | 2013-03-21 | Apple Inc. | Consolidating Speech Recognition Results |
US9275635B1 (en) * | 2012-03-08 | 2016-03-01 | Google Inc. | Recognizing different versions of a language |
US9530103B2 (en) * | 2013-04-04 | 2016-12-27 | Cypress Semiconductor Corporation | Combining of results from multiple decoders |
- 2013-11-20 JP JP2015501272A patent/JP5921756B2/ja not_active Expired - Fee Related
- 2013-11-20 US US14/649,128 patent/US9761228B2/en not_active Expired - Fee Related
- 2013-11-20 CN CN201380073708.3A patent/CN105027198B/zh not_active Expired - Fee Related
- 2013-11-20 WO PCT/JP2013/081288 patent/WO2014129033A1/ja active Application Filing
- 2013-11-20 DE DE112013006728.5T patent/DE112013006728B4/de not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
CN105027198A (zh) | 2015-11-04 |
CN105027198B (zh) | 2018-11-20 |
JPWO2014129033A1 (ja) | 2017-02-02 |
JP5921756B2 (ja) | 2016-05-24 |
US20160275950A1 (en) | 2016-09-22 |
US9761228B2 (en) | 2017-09-12 |
DE112013006728B4 (de) | 2020-10-01 |
DE112013006728T5 (de) | 2015-11-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5921756B2 (ja) | Speech recognition system and speech recognition device | |
US10643609B1 (en) | Selecting speech inputs | |
US7672846B2 (en) | Speech recognition system finding self-repair utterance in misrecognized speech without using recognized words | |
US10719507B2 (en) | System and method for natural language processing | |
US9076451B2 (en) | Operating system and method of operating | |
US10037758B2 (en) | Device and method for understanding user intent | |
US9449599B2 (en) | Systems and methods for adaptive proper name entity recognition and understanding | |
US9002705B2 (en) | Interactive device that recognizes input voice of a user and contents of an utterance of the user, and performs a response corresponding to the recognized contents | |
CN109937447B (zh) | 语音识别装置、语音识别*** | |
WO2015098109A1 (ja) | Speech recognition processing device, speech recognition processing method, and display device | |
JP5951161B2 (ja) | Speech recognition device and speech recognition method | |
JP4867622B2 (ja) | Speech recognition device and speech recognition method | |
WO2012004955A1 (ja) | Text correction method and recognition method | |
JP5335165B2 (ja) | Pronunciation information generation device, in-vehicle information device, and database generation method | |
EP3005152B1 (en) | Systems and methods for adaptive proper name entity recognition and understanding | |
JP5606951B2 (ja) | Speech recognition system and search system using the same | |
JP2002202797A (ja) | Speech recognition method | |
JP5396530B2 (ja) | Speech recognition device and speech recognition method | |
JP5160594B2 (ja) | Speech recognition device and speech recognition method | |
US11164578B2 (en) | Voice recognition apparatus, voice recognition method, and non-transitory computer-readable storage medium storing program | |
JP6322125B2 (ja) | Speech recognition device, speech recognition method, and speech recognition program | |
JP4639990B2 (ja) | Speech dialogue device and speech understanding result generation method | |
JP2007264229A (ja) | Dialogue device | |
JP3581044B2 (ja) | Speech dialogue processing method, speech dialogue processing system, and storage medium storing a program | |
US20230267926A1 (en) | False Suggestion Detection for User-Provided Content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201380073708.3 Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13875600 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2015501272 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 112013006728 Country of ref document: DE Ref document number: 1120130067285 Country of ref document: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 13875600 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 14649128 Country of ref document: US |