WO2012081788A1 - Voice recognition client system, voice recognition server system, and voice recognition method for processing online voice recognition - Google Patents
Voice recognition client system, voice recognition server system, and voice recognition method for processing online voice recognition
- Publication number
- WO2012081788A1 (international application PCT/KR2011/005394)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice recognition
- unit
- sound signal
- time
- client system
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/01—Assessment or evaluation of speech recognition systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/221—Announcement of recognition results
Definitions
- Embodiments of the present invention relate to a voice recognition client system, a voice recognition server system and a voice recognition method for processing online voice recognition.
- Speech recognition is the identification of linguistic content from speech by automatic means. Specifically, speech recognition may refer to a process of taking a speech waveform as input, identifying a word or word string, and extracting its meaning.
- In conventional voice recognition, the speech recognition result is generated using the entire input sound signal. That is, a search is performed using a voice recognition result generated only after all user utterances are finished, and the search result is then provided.
- Provided are a voice recognition client system, a voice recognition server system, and a voice recognition method that present the user with an intermediate result of voice recognition after the user starts speaking and before voice recognition ends, thereby reducing the user's concern about whether voice recognition is being performed correctly and enabling more accurate voice recognition.
- Also provided are a voice recognition client system, a voice recognition server system, and a voice recognition method that generate a partial sound signal by accumulating at least one unit sound signal input every predetermined unit time, generate a voice recognition intermediate result based on the partial sound signal, and provide that result to the user, thereby giving the user a sense of stability and showing the progress of voice recognition.
- According to one aspect, provided is a voice recognition client system for displaying a voice recognition result for a sound signal input from the start point to the end point of voice recognition. The voice recognition client system includes a communication unit that transmits the unit sound signal, input every predetermined unit time from the start point to the end point, to a voice recognition server system and receives a voice recognition intermediate result from the voice recognition server system, and a display unit that displays the received voice recognition intermediate result between the start point and the end point.
- In this case, the speech recognition intermediate result may be generated in the speech recognition server system through a partial sound signal in which at least one unit sound signal is accumulated according to the input time.
- When a plurality of voice recognition intermediate results are received, the display unit may sequentially display them between the start point and the end point.
- When one of the voice recognition intermediate results includes two or more results, the display unit may display all of the two or more results.
- The voice recognition client system may further include a user interface unit for receiving an event from the user; when one of two or more displayed results is selected through the event, the selected result may be fed back to the voice recognition server system and reflected in the voice recognition process.
- the voice recognition client system may further include an accuracy determiner that determines the accuracy of each of the two or more results when the voice recognition intermediate result includes two or more results.
- The display unit may display the two or more results in order of accuracy, or display only the result with the highest accuracy.
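The ordering step above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the candidate texts, scores, and function names are all invented for the example.

```python
# Illustrative sketch: order intermediate-result candidates by an
# accuracy score and pick what to display. All names are hypothetical.

def order_candidates(candidates):
    """Sort (text, accuracy) pairs from most to least accurate."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)

def display_text(candidates, top_n=1):
    """Return the texts to show: the single best result, or the top N."""
    ranked = order_candidates(candidates)
    return [text for text, _ in ranked[:top_n]]

candidates = [("gu", 0.41), ("9", 0.35), ("ko", 0.24)]
best = display_text(candidates)              # only the most accurate
top_two = display_text(candidates, top_n=2)  # the two best, in order
```

Either behavior described in the claim (show all in accuracy order, or show only the best) reduces to choosing `top_n`.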
- the voice recognition client system may further include a feature information extractor for extracting feature information from the input unit sound signal and an encoder for encoding the input unit sound signal.
- the communication unit may transmit the feature information and the encoded unit sound signal to the voice recognition server system as the input unit sound signal.
- According to another aspect, provided is a voice recognition server system for generating a voice recognition result using a sound signal received from a voice recognition client system. The voice recognition server system includes a receiving unit that receives, every unit time from the start point to the end point of voice recognition, the unit sound signal input to the voice recognition client system; a voice recognition result generating unit that generates a voice recognition intermediate result using a partial sound signal in which at least one unit sound signal is accumulated according to the input time; and a transmission unit that transmits the voice recognition intermediate result to the voice recognition client system. In this case, the voice recognition intermediate result is displayed through the display unit of the voice recognition client system between the start point and the end point.
- According to still another aspect, provided is a voice recognition client system for displaying a voice recognition result for a sound signal input from the start point to the end point of voice recognition, the system including a control unit that controls a voice recognition intermediate result, for a partial sound signal input up to at least one of a plurality of time points between the start point and the end point, to be displayed between the start point and the end point.
- According to still another aspect, provided is a voice recognition server system for generating a voice recognition result using a sound signal received from a voice recognition client system, the system including a voice recognition result generator that generates a voice recognition intermediate result using a partial sound signal input to the voice recognition client system from the start point of voice recognition up to at least one of a plurality of time points between the start point and the end point, and a transmission unit that transmits the voice recognition intermediate result to the voice recognition client system. In this case, the voice recognition intermediate result is displayed through the display unit of the voice recognition client system between the start point and the end point.
- According to yet another aspect, provided is a voice recognition method that includes transmitting the unit sound signal, input every unit time from the start point to the end point of voice recognition, to a voice recognition server system, receiving a voice recognition intermediate result from the voice recognition server system, and displaying the received voice recognition intermediate result between the start point and the end point.
- According to yet another aspect, provided is a voice recognition method for generating a voice recognition result using a sound signal received from a voice recognition client system, the method including receiving, every unit time from the start point to the end point of voice recognition, the unit sound signal input to the voice recognition client system; generating a voice recognition intermediate result using a partial sound signal in which at least one unit sound signal is accumulated according to the input time; and transmitting the voice recognition intermediate result to the voice recognition client system. In this case, the voice recognition intermediate result is displayed through the display unit of the voice recognition client system between the start point and the end point.
- a speech recognition method includes controlling an intermediate result of speech recognition for a sound signal to be displayed between a start point and an end point.
- According to yet another aspect, provided is a voice recognition method for generating a voice recognition result using a sound signal received from a voice recognition client system, the method including generating a voice recognition intermediate result using a partial sound signal input to the voice recognition client system from the start point of voice recognition up to at least one of a plurality of time points between the start point and the end point, and transmitting the voice recognition intermediate result to the voice recognition client system. In this case, the voice recognition intermediate result is displayed through the display unit of the voice recognition client system between the start point and the end point.
- By providing the user with an intermediate result of voice recognition after the user starts speaking and before voice recognition ends, the user's concern about whether voice recognition is being performed correctly can be reduced, and more accurate voice recognition can be performed.
- By generating a partial sound signal through accumulation of at least one unit sound signal input every predetermined unit time, generating a voice recognition intermediate result based on the partial sound signal, and providing that result to the user, the system can give the user a sense of stability and show the progress of voice recognition.
- FIG. 1 is a diagram illustrating an overall system for online voice recognition according to an embodiment of the present invention.
- FIG. 2 is a block diagram illustrating a function-specific structure of a voice recognition client system and a voice recognition server system according to an embodiment of the present invention.
- FIG. 3 is a block diagram illustrating an internal configuration of a speech recognition unit according to an embodiment of the present invention.
- FIG. 4 is a diagram illustrating a process of speech recognition according to an embodiment of the present invention.
- FIG. 5 is a diagram illustrating a voice recognition result according to time in a voice recognition process according to an embodiment of the present invention.
- FIG. 6 is a block diagram illustrating an internal configuration of a voice recognition client system and a voice recognition server system according to an embodiment of the present invention.
- FIG. 7 is a flowchart illustrating a voice recognition method performed by a voice recognition client system according to an embodiment of the present invention.
- FIG. 8 is a flowchart illustrating a voice recognition method performed by a voice recognition server system according to an embodiment of the present invention.
- FIG. 9 is a block diagram showing the internal configuration of a voice recognition client system and a voice recognition server system according to another embodiment of the present invention.
- FIG. 1 is a diagram illustrating an overall system for online voice recognition according to an embodiment of the present invention. FIG. 1 illustrates a user 110, a voice recognition client system 120, and a voice recognition server system 130.
- the voice recognition client system 120 may be a terminal of the user 110 or one module included in the terminal.
- the voice recognition client system 120 may extract a feature of the input voice.
- the voice recognition client system 120 may transmit the extracted feature to the voice recognition server system 130, and the voice recognition server system 130 may perform voice recognition using the received feature to generate a voice recognition result.
- The voice recognition server system 130 may transmit the generated voice recognition result to the voice recognition client system 120, and the voice recognition client system 120 may display the voice recognition result using a display device or the like, so that the user 110 can check the voice recognition result for the input voice.
- The voice recognition client system 120 and the voice recognition server system 130 may provide not only a voice recognition result for the entire sound signal input after all utterances of the user 110 are completed, but also a voice recognition intermediate result for the sound signal input up to each predetermined unit time after the user 110 starts speaking. For example, starting about one second after the user 110 begins to speak, an intermediate result of speech recognition may be provided to the user 110 every one second or every 0.5 seconds.
- More specifically, the voice recognition client system 120 may transmit the sound signal input every 20 milliseconds to the voice recognition server system 130, and the voice recognition server system 130 may perform voice recognition using a voice recognizer and return a voice recognition intermediate result to the voice recognition client system 120 every 500 milliseconds. In this case, the voice recognition client system 120 may display the received voice recognition intermediate result on the screen and provide it to the user 110.
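The timing relationship described above (20 ms transmission units, 500 ms result intervals) can be made concrete with a small sketch. The constants come from the example in the text; the function and its name are illustrative, not from the patent.

```python
# Sketch of the described timing: the client ships a unit sound signal
# every 20 ms, and the server returns an intermediate result every
# 500 ms, i.e. after every 25 accumulated units. Names are hypothetical.

UNIT_MS = 20
RESULT_MS = 500
UNITS_PER_RESULT = RESULT_MS // UNIT_MS  # 25 unit signals per result

def intermediate_points(total_units):
    """1-based unit indices at which an intermediate result is returned."""
    return [i for i in range(1, total_units + 1)
            if i % UNITS_PER_RESULT == 0]

# For two seconds of speech (100 units of 20 ms each), the client
# would receive four intermediate results.
points = intermediate_points(100)
```

This also shows why, later in the text, the first intermediate result covers 25 unit signals and the second covers 50: each result is computed over everything accumulated so far.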
- Through this, the user 110 may feel a sense of stability.
- As a specific example of presenting one of the voice recognition intermediate result candidates to the user: when the user 110, who wants to obtain "Gustav Klimt" as the voice recognition result, speaks the first syllable ("Gu"), the recognizer may hold several candidate interpretations of that syllable (in the original Korean, homophonous candidates such as "9", which is also pronounced "gu"). The system selects the most likely result so far and sends that candidate (e.g., "9") to the client. The user 110 sees "9" at that moment, but the displayed result may change to "Gu" as further speech arrives, and this process can be repeated until the final voice recognition result is shown.
- the voice recognition server system 130 may transmit the final voice recognition result to the voice recognition client system 120 by using the entire transmitted sound signal.
- the voice recognition client system 120 may provide the final voice recognition result to the user 110 by displaying the result on the screen.
- The user 110 may generate an event in the voice recognition client system 120 to select the end point of voice recognition. For example, even though the user 110 has finished speaking, voice recognition may not end due to ambient noise and may continue. In that case an incorrect speech recognition result may be generated, and the speech recognition time becomes long; therefore, the voice recognition client system 120 may control voice recognition to be terminated when a preset event is generated by the user 110. In this case, the voice recognition client system 120 and the voice recognition server system 130 may generate a final voice recognition result using the sound signal input up to the moment voice recognition ends.
- The final speech recognition result can be used as the user's input, such as a search query.
- FIG. 2 is a block diagram illustrating a function-specific structure of a voice recognition client system and a voice recognition server system according to an embodiment of the present invention. That is, the embodiment of FIG. 2 illustrates the internal configuration of the voice recognition client system 120 and the voice recognition server system 130 described with reference to FIG. 1.
- The voice recognition client system 120 may include a user interface unit 210, a sound signal compression unit 220, a feature extractor 230, and a client socket 240, and the voice recognition server system 130 may include a sound signal decompression unit 250, a voice recognition unit 260, a handler 270, and a listener socket 280.
- the user interface unit 210 may include a display device for displaying at least a voice recognition result and an input interface for receiving an event from a user. That is, the user interface unit 210 may include an interface for receiving an event from the user or displaying a voice recognition result to the user.
- the sound signal compression unit 220 receives and records a sound signal input through the microphone 290.
- For example, the sound signal compression unit 220 may receive the sound signal as 16 kHz mono audio.
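Combining this 16 kHz mono format with the 20-millisecond unit time used elsewhere in the document gives a concrete size for one unit sound signal. Note the 16-bit sample width below is an assumption for illustration; the text specifies only the rate and channel count.

```python
# Back-of-envelope sketch: raw size of one 20 ms unit sound signal at
# 16 kHz mono. The 16-bit PCM sample width is an assumption, not
# stated in the source.

SAMPLE_RATE_HZ = 16_000
UNIT_MS = 20
BYTES_PER_SAMPLE = 2  # assumed 16-bit PCM

samples_per_unit = SAMPLE_RATE_HZ * UNIT_MS // 1000   # samples per unit
bytes_per_unit = samples_per_unit * BYTES_PER_SAMPLE  # raw bytes per unit
```

Under these assumptions each unit carries 320 samples (640 raw bytes) before compression, which motivates the compression unit described here.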
- the feature extractor 230 extracts the feature from the sound signal.
- Since methods of extracting features from a sound signal such as a user's voice are widely known, a detailed description is omitted; in the present embodiment, any of various known methods may be used.
- the sound signal compression unit 220 encodes data for transmission to the voice recognition server system 130. That is, the feature extracted by the feature extractor 230 and the sound signal recorded by the sound signal compressor 220 may be encoded.
- That is, the voice recognition client system 120 may extract a feature from the sound signal input every unit time, from the moment voice recognition starts or from a predetermined time after it starts, and may encode the extracted feature together with the sound signal and transmit them to the voice recognition server system 130.
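The per-unit client loop just described (extract a feature, encode the audio, transmit both) can be sketched as follows. Every function here is an illustrative stub standing in for the real feature extractor, codec, and socket; none of these names come from the patent.

```python
# Minimal sketch of the client-side per-unit-time loop: for each unit
# sound signal, extract features, encode the audio, and hand both to a
# transport callback. All functions are illustrative stubs.

def extract_features(unit_signal):
    # Stand-in for a real front end (e.g. MFCCs); here, just the energy.
    return sum(s * s for s in unit_signal)

def encode(unit_signal):
    # Stand-in for an audio codec; here, a trivial byte packing.
    return bytes(bytearray(unit_signal))

def process_units(units, send):
    """Per unit time: extract, encode, and transmit feature + audio."""
    for unit in units:
        send(extract_features(unit), encode(unit))

sent = []
process_units([[1, 2], [3, 4]], lambda feat, enc: sent.append((feat, enc)))
# sent now holds one (feature, encoded-audio) pair per unit signal.
```

In the patent's arrangement the `send` callback would correspond to the communication unit writing a packet to the client socket.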
- the sound signal decompression unit 250 of the voice recognition server system 130 decompresses the compressed sound signal in the packet received from the voice recognition client system 120.
- the speech recognizer 260 obtains language data using the decompressed sound signal.
- The handler 270 may include client information about the connected voice recognition client system 120 and the socket to which the voice recognition client system 120 is connected. In this case, one handler 270 may be generated for each connected voice recognition client system.
- the listener socket 280 may include a socket waiting for a connection request of a voice recognition client system.
- The voice recognition server system 130 may use multi-threading so that a plurality of voice recognition client systems can use its resources efficiently.
- As described above, in order to provide voice recognition intermediate results to the user, the voice recognition client system 120 may transfer the sound signal input every unit time to the voice recognition server system 130, and the voice recognition intermediate result generated in the voice recognition server system 130 every unit time (which may be a different unit time from the transmission interval) may be transferred back to the voice recognition client system 120.
- the voice recognition client system 120 displays the transferred voice recognition intermediate result to the user, so that the user can recognize that the voice recognition process is in progress and can feel a sense of stability.
- the user interface 210 of the voice recognition client system 120 may receive an event for determining an end point of voice recognition from the user. In this case, the voice recognition client system 120 may terminate the voice recognition and recognize the voice recognition intermediate result of the voice signal input until the voice recognition is terminated as the final voice recognition result.
- the voice recognition unit 260 described with reference to FIG. 2 may include the acoustic model unit 310, the language model unit 330, and the decoder 350 as shown in FIG. 3.
- The voice database 320 and the query log 340 shown in FIG. 3 may be included in the voice recognition unit 260, or may be connected to the voice recognition unit 260 to provide data to it.
- the acoustic model unit 310 of the speech recognition unit 260 presents a matching value between the received feature and the recognition unit word.
- The acoustic model unit 310 may create a unit word model from the pre-built speech database 320 and calculate the degree of matching between the unit word model and the received feature.
- a matching method may also be performed using one of various known methods.
- the language model unit 330 builds a language model.
- For example, a bigram model or a trigram model may be used to build the language model. Since language model construction methods are already well known, a detailed description is omitted.
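For readers unfamiliar with the term, a bigram model of the kind mentioned above estimates the probability of the next word given the previous one from counts over a text corpus (here, the query log). The following toy sketch is illustrative only and makes no smoothing or tokenization claims about the patent's actual model.

```python
# A toy bigram language model: P(next | prev) estimated from counts
# over a tiny corpus of whitespace-tokenized "queries". Illustrative.

from collections import Counter

def train_bigram(sentences):
    """Estimate P(next | prev) by maximum likelihood over word pairs."""
    pair_counts, prev_counts = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        prev_counts.update(words[:-1])              # count each "prev" word
        pair_counts.update(zip(words, words[1:]))   # count (prev, next) pairs
    return {pair: count / prev_counts[pair[0]]
            for pair, count in pair_counts.items()}

model = train_bigram(["gangnam station bus", "gangnam station exit"])
# In this corpus, "gangnam" is always followed by "station".
```

A trigram model would condition on the previous two words instead of one, at the cost of needing far more training text.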
- The query log 340 described above may be used as the text database for constructing the language model.
- the query log 340 may include a user query log input for a search service.
- The decoder 350 generates a voice recognition result using the output of the acoustic model unit 310 and the output of the language model unit 330.
- the voice recognition result generated as described above may be transmitted to the voice recognition client system 120 described with reference to FIGS. 1 and 2.
- When the voice recognition client system 120 transmits the sound signal and feature input every unit time, the voice recognition server system 130 may likewise generate a voice recognition result every unit time (again, possibly a different unit time) using the received sound signal and features. In this case, the voice recognition server system 130 may transmit the generated voice recognition results to the voice recognition client system 120, and the voice recognition client system 120 may sequentially display the received voice recognition results during the voice recognition process. Therefore, the user can recognize that voice recognition is currently in progress and feel a sense of stability.
- Table 1 shows an example of the voice recognition intermediate results and final voice recognition result provided to a user who wants to input "bus from Jeongja Station to Gangnam Station".
- the order means the order in which the results of speech recognition are provided.
- FIG. 4 is a diagram illustrating a process of speech recognition according to an embodiment of the present invention.
- the first dotted line 410 means a process in which the voice recognition client system 120 is connected to the voice recognition server system 130.
- TCP / IP may be used for the connection.
- The first two-dot chain line 420 may indicate that the voice recognition client system 120 provides the voice recognition server system 130 with a first control packet containing information such as the protocol version or terminal information.
- The second two-dot chain line 430 may indicate that the voice recognition server system 130 provides the voice recognition client system 120 with a first response packet for the control packet.
- the solid lines within the first range 440 may mean that the voice recognition client system 120 provides the voice recognition server system 130 with a packet including a sound signal every unit time.
- For example, the voice recognition client system 120 may transmit a packet containing the sound signal input during each interval to the voice recognition server system 130 every 20 milliseconds.
- The dash-dotted lines in the second range 450 may indicate that the voice recognition server system 130 provides the voice recognition intermediate results, generated every unit time, and the final voice recognition result to the voice recognition client system 120.
- For example, the voice recognition server system 130 may generate a voice recognition intermediate result every 500 milliseconds using the partial sound signal formed by accumulating the received sound signals, and transmit it to the voice recognition client system 120.
- When the voice recognition server system 130 obtains the final result from the voice recognition unit 260 described with reference to FIG. 2, it may generate the final voice recognition result and transmit it to the voice recognition client system 120. When the voice recognition process is completed, the voice recognition server system 130 may discard the packets containing the received sound signal.
- The third two-dot chain line 460 may indicate that the voice recognition client system 120 notifies the server of connection termination by transmitting a second control packet to the voice recognition server system 130.
- The fourth two-dot chain line 470 may indicate that the voice recognition server system 130 transmits a second response packet for the second control packet to the voice recognition client system 120, confirming receipt of the connection termination notification.
- the second dotted line 480 may mean that the voice recognition client system 120 terminates the connection with the voice recognition server system 130.
- the packets used in FIG. 4 may basically consist of a header and a payload.
- The header is always included, while the payload is optional; that is, the payload may be included in the packet selectively, depending on the packet type.
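A header-plus-optional-payload packet like the one described can be sketched as follows. The concrete layout (a type byte plus a 4-byte length field) is invented for illustration; the patent does not specify any field format.

```python
# Sketch of a packet with a mandatory header and optional payload.
# The header layout (1-byte type + 4-byte big-endian payload length)
# is a hypothetical choice, not taken from the source document.

import struct

HEADER_SIZE = 5  # 1 type byte + 4 length bytes

def pack(packet_type, payload=b""):
    """Build header + optional payload."""
    return struct.pack(">BI", packet_type, len(payload)) + payload

def unpack(data):
    """Split a packet back into (type, payload)."""
    packet_type, length = struct.unpack(">BI", data[:HEADER_SIZE])
    return packet_type, data[HEADER_SIZE:HEADER_SIZE + length]

# A control packet may carry no payload; an audio packet carries bytes.
control = pack(1)
audio = pack(2, b"\x00\x01\x02")
```

The length field lets the receiver know whether a payload follows the header, mirroring the "selectively included" behavior described above.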
- FIG. 5 is a graph 500 illustrating voice recognition results over time in a voice recognition process according to an embodiment of the present invention.
- the graph 500 illustrates an intermediate process of speech recognition generated over time (horizontal axis) when the user wants to input a voice of "Gustav Klimt".
- an example of providing an intermediate result of speech recognition every unit time from the start point 510 of speech recognition is shown.
- At each point, the speech recognition intermediate result for the cumulative signal of all sound input up to that point is shown.
- The vertical axis indicates the matching likelihood of the intermediate candidates; the candidate displayed at the top (excluding candidates marked with X) has the highest probability at each unit time point.
- the candidate with the highest likelihood is displayed to the user as an intermediate result of speech recognition at that time.
- the next N candidates may be exposed to the user.
- In FIG. 5, it is depicted as if one syllable is input every unit time, but this is only for convenience of description: a given unit time may contain no voice at all, or may contain voice spanning a plurality of syllables.
- The unit time used as the reference for transmitting the sound signal and the unit time for generating and providing a voice recognition intermediate result may differ. For example, as described above, the unit sound signal input during each 20-millisecond interval may be transmitted from the voice recognition client system to the voice recognition server system.
- a voice recognition intermediate result may be generated every 500 milliseconds and transmitted from the voice recognition server system to the voice recognition client system.
- the first speech recognition intermediate result may include speech recognition results for 25 unit sound signals
- the second speech recognition intermediate result may include speech recognition results for 50 unit sound signals.
- FIG. 6 is a block diagram illustrating an internal configuration of a voice recognition client system and a voice recognition server system according to an embodiment of the present invention.
- As shown in FIG. 6, the voice recognition client system 610 may include a user interface unit 611, a feature information extractor 612, an encoder 613, a communication unit 614, and a display unit 615.
- the user interface 611, the feature information extractor 612, and the encoder 613 may be selectively included in the voice recognition client system 610 as necessary.
- As shown in FIG. 6, the voice recognition server system 620 may include a receiver 621, a partial sound signal generator 622, a voice recognition result generator 623, and a transmitter 624. Even in this case, the receiver 621 and the partial sound signal generator 622 may be selectively included in the voice recognition server system 620 as necessary.
- the user interface 611 receives an event from a user.
- Here, the event may include an event for initiating voice recognition, or an event used to select one result from a voice recognition intermediate result that includes two or more results.
- the feature information extractor 612 extracts feature information from the input unit sound signal.
- the encoder 613 encodes the input unit sound signal.
- the unit sound signal may include a sound signal input every predetermined time from the start point to the end point of voice recognition.
- the communication unit 614 transmits the unit sound signal to the voice recognition server system 620 every unit time, and receives a voice recognition intermediate result from the voice recognition server system 620.
- the communication unit 614 may transmit the feature information extracted by the feature information extractor 612 and the unit sound signal encoded by the encoder 613 to the voice recognition server system 620 every unit time.
- the voice recognition intermediate result may be generated through the partial sound signal in which at least one unit sound signal is accumulated according to an input time in the voice recognition server system 620.
- For example, if the voice recognition client system 610 transmits a unit sound signal every 20 milliseconds and the voice recognition server system 620 generates and transmits a voice recognition intermediate result every 500 milliseconds, the voice recognition server system 620 may generate the first voice recognition intermediate result using a partial sound signal in which the first 25 unit sound signals are accumulated. For the second voice recognition intermediate result, a partial sound signal in which 50 unit sound signals are accumulated may be used.
- the display unit 615 displays the received voice recognition intermediate result between the start point and the end point of the voice recognition.
- When a plurality of voice recognition intermediate results are received, the display unit 615 may sequentially display them between the start point and the end point. For example, in the Korean original, if the first intermediate result is the syllable "자" ("ja"), the second is "자동" ("jadong"), and the third is "자동차" ("jadongcha", meaning "car"), the display unit 615 may display "자", "자동", and "자동차" in sequence between the start point and the end point.
- the voice recognition client system 610 may further include a user interface unit (not shown) that receives an event for determining an end point of voice recognition from the user.
- a final voice recognition result may be generated using the unit sound signals input before the event is input. That is, the voice recognition client system 610 notifies the voice recognition server system 620 that voice recognition has ended, and the voice recognition server system 620 may be controlled either to produce the last generated voice recognition intermediate result as the final voice recognition result, or to generate the final voice recognition result from the unit sound signals input up to the end of voice recognition.
- the receiver 621 receives a unit sound signal input to the voice recognition client system 610 every unit time from the start point of the voice recognition to the end point.
- the partial sound signal generator 622 generates a partial sound signal by accumulating a predetermined number of unit sound signals transmitted from the voice recognition client system 610 every unit time.
- the speech recognition result generator 623 generates a voice recognition intermediate result using the partial sound signal generated by the partial sound signal generator 622. That is, the voice recognition result generator 623 may generate a voice recognition intermediate result from the at least one unit sound signal input up to a point in the middle of the user's utterance through the voice recognition client system 610. Basically, the voice recognition result generator 623 may generate a voice recognition intermediate result for the generated partial sound signal whenever a partial sound signal is generated.
- the transmitter 624 transmits the voice recognition intermediate result to the voice recognition client system 610. At this time, the transmitter 624 may transmit only the single most likely intermediate result to the voice recognition client system 610. In this case, the voice recognition server system 620 retains all of the intermediate result candidates, because a different candidate may turn out to be the most appropriate once more voice input arrives. For example, when '9', 'nose', 'old', and 'g' are candidates, only '9' is transmitted to the voice recognition client system 610, but the remaining candidates are not discarded; the voice recognition server system 620 continues to compute the matching degree of all candidates using the incoming voice.
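The candidate handling just described can be sketched as follows. The class and the numeric scores are hypothetical stand-ins; the source does not specify how matching degrees are computed, only that all candidates are retained and rescored while just the best one is transmitted.

```python
# Sketch of server-side candidate management: every candidate is kept and
# rescored as more audio arrives, but only the best one is transmitted.
# The scoring scheme is a stand-in; the source does not define it.

class CandidateTracker:
    def __init__(self, candidates):
        # e.g. homophone candidates for the same sound, each with a score
        self.scores = {c: 0.0 for c in candidates}

    def rescore(self, new_scores):
        """Update the matching degree of every candidate using newly
        received voice input (scores here are arbitrary examples)."""
        for cand, delta in new_scores.items():
            self.scores[cand] += delta

    def best(self):
        """Only the most likely candidate is sent to the client; the rest
        are retained for rescoring against future audio."""
        return max(self.scores, key=self.scores.get)

tracker = CandidateTracker(["9", "nose", "old", "g"])
tracker.rescore({"9": 0.9, "nose": 0.4, "old": 0.3, "g": 0.1})
print(tracker.best())  # "9" is transmitted; the other candidates survive
```

Keeping the full candidate set is what allows a later intermediate result to differ from an earlier one once additional audio shifts the matching degrees.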
- the voice recognition server system 620 may transmit a plurality of results to the client system 610 instead of one as a voice recognition intermediate result.
- when the voice recognition intermediate result includes two or more results, the voice recognition server system 620 may further include an accuracy determiner (not shown) for determining the accuracy of each of the two or more results.
- the transmitter 624 may transmit to the voice recognition client system 610 one of the following: a voice recognition intermediate result in which the two or more results are sorted in order of accuracy; a voice recognition intermediate result including the two or more results together with the accuracy of each; or a voice recognition intermediate result including only the result with the highest accuracy. For example, suppose that for the two results 'Gusta' and 'Kosdaq', the accuracy of 'Gusta' is 5, higher than the accuracy of 'Kosdaq' at 3. The transmitter 624 may then transmit the results ordered as 'Gusta', 'Kosdaq'; or the results annotated with their accuracy, such as 'Gusta-5', 'Kosdaq-3'; or a voice recognition intermediate result including only 'Gusta', the result with the highest accuracy.
- here, '-' is a symbol indicating that the number following it is the accuracy; it is chosen arbitrarily for this example, and the accuracy may be transmitted to the voice recognition client system 610 in various other ways.
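One possible encoding of such an accuracy-annotated intermediate result, using the arbitrary 'result-accuracy' notation from the example, might look like the sketch below. The comma delimiter and the helper names are assumptions for illustration; the source only says the accuracy may be conveyed in various ways.

```python
# Sketch of serializing/parsing accuracy-annotated intermediate results
# using the arbitrary 'result-accuracy' notation from the example.

def encode(results):
    """results: list of (text, accuracy) pairs, sorted by accuracy."""
    return ",".join(f"{text}-{acc}" for text, acc in results)

def decode(payload):
    pairs = []
    for item in payload.split(","):
        text, acc = item.rsplit("-", 1)  # rsplit in case text contains '-'
        pairs.append((text, int(acc)))
    return pairs

msg = encode([("Gusta", 5), ("Kosdaq", 3)])
print(msg)                # Gusta-5,Kosdaq-3
print(decode(msg)[0][0])  # Gusta (highest-accuracy result comes first)
```

Because the list is already sorted by accuracy, a client that wants only the best result can simply take the first decoded pair.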
- FIG. 7 is a flowchart illustrating a voice recognition method performed by a voice recognition client system according to an embodiment of the present invention.
- the voice recognition method according to the present embodiment may be performed by the voice recognition client system 610 described with reference to FIG. 6.
- FIG. 7 a voice recognition method will be described by explaining a process in which each step is performed by the voice recognition client system 610.
- the voice recognition client system 610 transmits a unit sound signal input every pre-selected unit time from the start point to the end point of the voice recognition to the voice recognition server system every unit time.
- the voice recognition client system 610 extracts feature information from a unit sound signal input for a unit time with respect to a sound signal input through an interface such as a microphone, and encodes the input unit sound signal.
- the voice recognition client system 610 may transmit the extracted feature information and the encoded unit sound signal to the voice recognition server system every unit time.
- the voice recognition server system may correspond to the voice recognition server system 620 described with reference to FIG. 6.
- the voice recognition client system 610 receives a voice recognition intermediate result from the voice recognition server system.
- the voice recognition intermediate result may be generated in the voice recognition server system through a partial sound signal in which at least one unit sound signal is accumulated according to the input time. For example, if the voice recognition client system 610 transmits a unit sound signal every 20 milliseconds and the voice recognition server system generates and transmits a voice recognition intermediate result every 500 milliseconds, the voice recognition server system may generate the first voice recognition intermediate result using a partial sound signal in which the first 25 unit sound signals are accumulated. To generate the second voice recognition intermediate result, a partial sound signal in which 50 unit sound signals are accumulated may be used.
- the voice recognition client system 610 displays the received voice recognition intermediate result between the start point and the end point of the voice recognition.
- the voice recognition client system 610 may sequentially display the plurality of voice recognition intermediate results between the start point and the end point. For example, when the first voice recognition intermediate result is '자' (ja), the second '자동' (jadong), and the third '자동차' (jadongcha, "car"), the voice recognition client system 610 may display ja, jadong, and jadongcha sequentially between the start point and the end point.
- each voice recognition intermediate result may include a single result, or may include two or more results.
- for example, the voice recognition intermediate result for the sound 'gu' may include candidates such as '9', 'gu', 'nose', and 'g'.
- the voice recognition client system 610 may show the candidates sorted by their matching score, or only the candidate with the highest score.
- the voice recognition client system 610 may further perform a step (not shown) of receiving an event for determining an end point of voice recognition from the user.
- a final voice recognition result may be generated using the unit sound signals input before the event is input. That is, the voice recognition client system 610 notifies the voice recognition server system 620 that voice recognition has ended, and the voice recognition server system 620 may be controlled either to produce the last generated voice recognition intermediate result as the final voice recognition result, or to generate the final voice recognition result from the unit sound signals input up to the end of voice recognition.
- FIG. 8 is a flowchart illustrating a voice recognition method performed by a voice recognition server system according to an embodiment of the present invention.
- the voice recognition method according to the present embodiment may be performed by the voice recognition server system 620 described with reference to FIG. 6.
- FIG. 8 a voice recognition method will be described by explaining a process in which each step is performed by the voice recognition server system 620.
- the voice recognition server system 620 receives a unit sound signal input to the voice recognition client system every unit time from the start point to the end point of the voice recognition.
- the voice recognition client system may correspond to the voice recognition client system 610 described with reference to FIG. 6.
- the voice recognition server system 620 generates a voice recognition intermediate result using the partial sound signal in which at least one unit sound signal is accumulated according to the input time. That is, the voice recognition server system 620 may generate a voice recognition intermediate result from the at least one unit sound signal input up to a point in the middle of the user's utterance through the voice recognition client system 610. Basically, the voice recognition server system 620 may generate a voice recognition intermediate result for the generated partial sound signal whenever a partial sound signal is generated.
- the partial sound signal may be generated by accumulating a predetermined number of unit sound signals transmitted from the voice recognition client system every unit time.
- the voice recognition server system 620 transmits the voice recognition intermediate result to the voice recognition client system.
- when a single voice recognition intermediate result includes two or more results, the voice recognition server system 620 may transmit the single voice recognition intermediate result including all of them to the voice recognition client system. For example, even if a single voice recognition intermediate result includes the four results '9', 'nose', 'phrase', and 'g', the voice recognition server system 620 may send all four results to the voice recognition client system as one voice recognition intermediate result.
- the voice recognition server system 620 may determine the accuracy of each of the two or more results.
- the voice recognition server system 620 may transmit to the voice recognition client system one of the following: a voice recognition intermediate result in which the two or more results are sorted in order of accuracy; a voice recognition intermediate result including the two or more results together with the accuracy of each; or a voice recognition intermediate result including only the result with the highest accuracy. For example, suppose that for the two results 'Gusta' and 'Kosdaq', the accuracy of 'Gusta' is 5, higher than the accuracy of 'Kosdaq' at 3. The voice recognition server system 620 may then transmit the results ordered as 'Gusta', 'Kosdaq'; or the results annotated with their accuracy, such as 'Gusta-5', 'Kosdaq-3'; or a voice recognition intermediate result including only 'Gusta', the result with the highest accuracy.
- here, '-' is a symbol indicating that the number following it is the accuracy; it is chosen arbitrarily for this example, and the accuracy may be transmitted to the voice recognition client system in various other ways.
- FIG. 9 is a block diagram showing the internal configuration of a voice recognition client system and a voice recognition server system according to another embodiment of the present invention.
- the voice recognition client system 910 may include a transmitter 911, a receiver 912, a display unit 913, and a controller 914.
- the transmitter 911, the receiver 912, and the display 913 may be selectively included in the voice recognition client system 910 as necessary.
- the voice recognition client system 910 may be one module included in a user's terminal. That is, the voice recognition client system 910 may include only the controller 914 to control the transmitter 911, the receiver 912, and the display 913 of the terminal to perform voice recognition.
- as shown in FIG. 9, the voice recognition server system 920 may include a receiver 921, a voice recognition result generator 922, and a transmitter 923.
- in the voice recognition client system 910, the transmitter 911 transmits the unit sound signal input every predetermined unit time to the voice recognition server system 920, and the receiver 912 receives the voice recognition intermediate result from the voice recognition server system 920.
- the display unit 913 displays the received voice recognition intermediate result between the start point and the end point of the voice recognition.
- the voice recognition intermediate result may be generated through a partial sound signal in which at least one unit sound signal of the transmitted unit sound signals is accumulated according to an input time.
- the partial sound signal may include a signal in which at least one unit sound signal is accumulated according to an input time, and the unit sound signal may include a sound signal input every unit time from a start point.
- as another example, the transmitter 911 in the voice recognition client system 910 transmits, to the voice recognition server system 920, the partial sound signal in which the unit sound signals input every unit time from the start point are accumulated according to the input time, and the receiver 912 receives, from the voice recognition server system 920, the voice recognition intermediate result generated through the partial sound signal.
- the display unit 913 displays the received voice recognition intermediate result between the start point and the end point of the voice recognition.
- that is, the voice recognition client system 910 may transmit to the voice recognition server system 920 either the unit sound signal input during each unit time, or a partial sound signal in which a certain number of unit sound signals are accumulated according to the input time.
- when unit sound signals are transmitted, the voice recognition server system 920 may generate the partial sound signal from the unit sound signals, and generate the voice recognition intermediate result using the generated partial sound signal.
- for example, when unit sound signals corresponding to 'gu', 's', 'ta', and 'v' are input over four unit times, the voice recognition client system 910 may transmit 'gu', 's', 'ta', and 'v' one per unit time. In this case, the voice recognition server system 920 may generate partial sound signals in which the unit sound signals are accumulated, such as 'gu', 'gus', 'gusta', and 'gustav', and generate a voice recognition intermediate result for each partial sound signal.
- as another example, a partial sound signal, i.e., a sound signal in which at least one unit sound signal is accumulated, may be transmitted from the voice recognition client system 910 to the voice recognition server system 920, and the voice recognition server system 920 may generate a voice recognition intermediate result directly from the received partial sound signal.
- for example, when unit sound signals corresponding to 'gu', 's', 'ta', and 'v' are input over four unit times, the voice recognition client system 910 may transmit the accumulated partial sound signals 'gu', 'gus', 'gusta', and 'gustav'.
- in this case, the voice recognition server system 920 may generate a voice recognition intermediate result using each received partial sound signal, such as 'gu', 'gus', 'gusta', and 'gustav'.
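The two transmission modes just described, sending unit signals for the server to accumulate versus sending already-accumulated partial signals, produce the same sequence of partial sound signals. The sketch below illustrates this with string concatenation standing in for accumulating audio buffers; the function names are invented for illustration.

```python
# Sketch of the two modes: in mode A the client sends unit sound signals
# and the server accumulates them; in mode B the client itself sends the
# accumulated partial sound signal each time. Strings stand in for audio.

def server_accumulate(unit_signals):
    """Mode A: the server builds partial signals from incoming units."""
    partials, buffer = [], ""
    for unit in unit_signals:
        buffer += unit            # append the new unit to the running buffer
        partials.append(buffer)   # each snapshot is one partial sound signal
    return partials

def client_accumulate(unit_signals):
    """Mode B: the client sends the accumulated partial signal directly."""
    return ["".join(unit_signals[: i + 1]) for i in range(len(unit_signals))]

units = ["gu", "s", "ta", "v"]
print(server_accumulate(units))  # ['gu', 'gus', 'gusta', 'gustav']
assert server_accumulate(units) == client_accumulate(units)
```

Either way, the recognizer sees the same growing prefixes ('gu', 'gus', 'gusta', 'gustav'); the modes differ only in which side performs the accumulation and how much data crosses the network.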
- the controller 914 controls the voice recognition intermediate result, for the partial sound signal input from the start point of voice recognition to at least one of a plurality of time points between the start point and the end point, to be displayed between the start point and the end point.
- the controller 914 may control the transmitter 911, the receiver 912, and the display 913 so that the voice recognition intermediate result is displayed between the start time and the end time.
- the controller 914 may control all of the two or more results to be displayed when a single voice recognition intermediate result includes two or more results. Even in this case, as described above, the voice recognition client system 910 may transmit the result selected through an event input from the user to the voice recognition server system 920, so that the selection is reflected in generating the next voice recognition intermediate result or the final voice recognition result.
- the voice recognition client system 910 may further include an accuracy determiner (not shown) that determines the accuracy of each of the two or more results when one voice recognition intermediate result includes two or more results.
- the controller 914 may control the two or more results to be displayed sorted in order of accuracy, or control only the result with the highest accuracy to be displayed.
- the controller 914 may control the plurality of voice recognition intermediate results to be sequentially displayed between a start time and an end time.
- the voice recognition client system 910 may further include a user interface unit (not shown) that receives an event for determining an end point of voice recognition from the user.
- a final voice recognition result may be generated using the partial sound signal input before the event is input. That is, the voice recognition client system 910 notifies the voice recognition server system 920 that voice recognition has ended, and the voice recognition server system 920 may be controlled either to produce the last generated voice recognition intermediate result as the final voice recognition result, or to generate the final voice recognition result from the partial sound signal input up to the end of voice recognition.
- the receiver 921 receives, from the voice recognition client system 910, either the unit sound signals input to the voice recognition client system 910 every predetermined unit time, or a partial sound signal in which at least one of the unit sound signals input to the voice recognition client system 910 is accumulated according to the input time.
- the voice recognition result generator 922 generates a voice recognition intermediate result using the partial sound signal input to the voice recognition client system 910 from the start point of voice recognition to at least one of a plurality of time points between the start point and the end point. That is, when the receiver 921 receives unit sound signals, the voice recognition result generator 922 may itself generate the partial sound signal from the unit sound signals and generate the voice recognition intermediate result using the generated partial sound signal; when the receiver 921 receives a partial sound signal, it may generate the voice recognition intermediate result using the received partial sound signal directly.
- the transmitter 923 transmits the voice recognition intermediate result to the voice recognition client system 910.
- the voice recognition intermediate result may be displayed through the display unit 913 of the voice recognition client system 910 between the start time and the end time.
- in the voice recognition method performed by the voice recognition client system 910, a first step (not shown) controls the voice recognition intermediate result, for the partial sound signal input from the start point of voice recognition to at least one of a plurality of time points between the start point and the end point, to be displayed between the start point and the end point. To this end, the voice recognition client system 910 may perform a second step (not shown) of controlling the unit sound signal input every predetermined unit time to be transmitted to the voice recognition server system 920, a third step (not shown) of controlling the voice recognition intermediate result to be received from the voice recognition server system 920, and a fourth step (not shown) of controlling the received voice recognition intermediate result to be displayed between the start point and the end point.
- as another example, the voice recognition client system 910 may control the partial sound signal, in which the unit sound signals input every unit time from the start point are accumulated according to the input time, to be transmitted to the voice recognition server system 920, after which the remaining steps may be performed.
- the voice recognition client system 910 may further perform a step (not shown) of receiving an event for determining an end point of voice recognition from the user.
- a final voice recognition result may be generated using the partial sound signal input before the event is input. That is, the voice recognition client system 910 notifies the voice recognition server system 920 that voice recognition has ended, and the voice recognition server system 920 may be controlled either to produce the last generated voice recognition intermediate result as the final voice recognition result, or to generate the final voice recognition result from the partial sound signal input up to the end of voice recognition.
- in the voice recognition method performed by the voice recognition server system 920, the voice recognition server system 920 receives either the unit sound signals input to the voice recognition client system 910 every predetermined unit time, or a partial sound signal in which such unit sound signals are accumulated. In the former case, the voice recognition server system 920 may itself generate the partial sound signal from the unit sound signals and generate the voice recognition intermediate result using the generated partial sound signal; in the latter case, it may generate the voice recognition intermediate result using the received partial sound signal.
- the voice recognition intermediate result may be displayed through the display unit 913 of the voice recognition client system 910 between the start time and the end time.
- as described above, by providing the user with voice recognition intermediate results after the user starts speaking and before voice recognition ends, the user's concern about whether voice recognition is being performed correctly can be reduced.
- in addition, by accumulating the unit sound signals input up to a given point in time to generate a partial sound signal, and generating and providing a voice recognition intermediate result based on the partial sound signal, the user can be given a sense of stability and shown the progress of voice recognition.
- Methods according to embodiments of the present invention may be implemented in the form of program instructions that can be executed by various computer means and recorded in a computer-readable medium.
- The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination.
- The program instructions recorded on the medium may be specially designed and constructed for the purposes of the present invention, or may be of the kind well known and available to those skilled in the computer software arts.
- The methods described above may be recorded in a computer-readable recording medium.
Claims (30)
- 1. A voice recognition client system for displaying a voice recognition result for a sound signal input from a start point to an end point of voice recognition, the system comprising: a communication unit which transmits, to a voice recognition server system every predetermined unit time, a unit sound signal input during each unit time from the start point to the end point, and receives a voice recognition intermediate result from the voice recognition server system; and a display unit which displays the received voice recognition intermediate result between the start point and the end point.
- 2. The voice recognition client system of claim 1, wherein the voice recognition intermediate result is generated in the voice recognition server system through a partial sound signal in which at least one unit sound signal is accumulated according to an input time.
- 3. The voice recognition client system of claim 1, wherein, when a plurality of voice recognition intermediate results are received from the voice recognition server system, the display unit sequentially displays the plurality of voice recognition intermediate results between the start point and the end point.
- 4. The voice recognition client system of claim 1, wherein, when the voice recognition intermediate result for each unit time includes two or more results, the display unit receives and displays all of the two or more results.
- 5. The voice recognition client system of claim 1, further comprising an accuracy determiner which, when the voice recognition intermediate result includes two or more results, determines the accuracy of each of the two or more results, wherein the display unit displays the two or more results sorted in order of accuracy, or displays the result with the highest accuracy.
- 6. The voice recognition client system of claim 1, further comprising: a feature information extractor which extracts feature information from the input unit sound signal; and an encoder which encodes the input unit sound signal, wherein the communication unit transmits the feature information and the encoded unit sound signal to the voice recognition server system as the input unit sound signal.
- 7. The voice recognition client system of claim 1, further comprising a user interface unit which receives, from a user, an event for determining the end point of voice recognition, wherein a final voice recognition result is generated using the unit sound signals input before the event is input.
- 8. A voice recognition server system for generating a voice recognition result using a sound signal received from a voice recognition client system, the system comprising: a receiver which receives a unit sound signal input to the voice recognition client system every unit time from a start point to an end point of voice recognition; a voice recognition result generator which generates a voice recognition intermediate result using a partial sound signal in which at least one unit sound signal is accumulated according to an input time; and a transmitter which transmits the voice recognition intermediate result to the voice recognition client system, wherein the voice recognition intermediate result is displayed through a display unit of the voice recognition client system between the start point and the end point.
- 9. The voice recognition server system of claim 8, further comprising a partial sound signal generator which generates the partial sound signal by accumulating a predetermined number of the unit sound signals transmitted from the voice recognition client system every unit time.
- 10. The voice recognition server system of claim 9, wherein the voice recognition result generator generates a voice recognition intermediate result for the generated partial sound signal whenever the partial sound signal is generated.
- 11. The voice recognition server system of claim 8, wherein, when a single voice recognition intermediate result includes two or more results, the transmitter transmits the single voice recognition intermediate result including all of the two or more results to the voice recognition client system.
- 12. The voice recognition server system of claim 8, further comprising an accuracy determiner which, when a single voice recognition intermediate result includes two or more results, determines the accuracy of each of the two or more results, wherein the transmitter transmits to the voice recognition client system one of: a voice recognition intermediate result including the two or more results sorted in order of accuracy; a voice recognition intermediate result including the two or more results and the accuracy of each of the two or more results; and a voice recognition intermediate result including the result with the highest accuracy.
- 13. A voice recognition client system for displaying a voice recognition result for a sound signal input from a start point to an end point of voice recognition, the system comprising a controller which controls a voice recognition intermediate result, for a partial sound signal input from the start point to at least one of a plurality of time points between the start point and the end point, to be displayed between the start point and the end point.
- 14. The voice recognition client system of claim 13, wherein the partial sound signal includes a signal in which at least one unit sound signal is accumulated according to an input time, and the unit sound signal includes a sound signal input during each unit time from the start point.
- 15. The voice recognition client system of claim 13, wherein, when a plurality of voice recognition intermediate results are received from the voice recognition server system, the controller controls the plurality of voice recognition intermediate results to be sequentially displayed between the start point and the end point.
- 16. The voice recognition client system of claim 13, further comprising: a transmitter which transmits a unit sound signal input during each predetermined unit time to a voice recognition server system; a receiver which receives a voice recognition intermediate result from the voice recognition server system; and a display unit which displays the received voice recognition intermediate result between the start point and the end point, wherein the voice recognition intermediate result is generated through a partial sound signal in which at least one of the transmitted unit sound signals is accumulated according to an input time.
- 17. The voice recognition client system of claim 13, further comprising: a transmitter which transmits, to a voice recognition server system, a partial sound signal in which unit sound signals input every unit time from the start point are accumulated according to an input time; a receiver which receives, from the voice recognition server system, a voice recognition intermediate result generated through the partial sound signal; and a display unit which displays the received voice recognition intermediate result between the start point and the end point.
- 18. The voice recognition client system of claim 16 or 17, wherein the controller controls the transmitter, the receiver, and the display unit such that the voice recognition intermediate result is displayed between the start point and the end point.
- 19. The voice recognition client system of claim 13, wherein, when a single voice recognition intermediate result includes two or more results, the controller controls all of the two or more results to be displayed.
- 20. The voice recognition client system of claim 13, further comprising an accuracy determiner which, when a single voice recognition intermediate result includes two or more results, determines the accuracy of each of the two or more results, wherein the controller controls the two or more results to be displayed sorted in order of accuracy, or controls the result with the highest accuracy to be displayed.
- 21. The voice recognition client system of claim 13, further comprising a user interface unit which receives, from a user, an event for determining the end point of voice recognition, wherein a final voice recognition result is generated using the partial sound signal input before the event is input.
- 22. A voice recognition server system for generating a voice recognition result using a sound signal received from a voice recognition client system, the system comprising: a voice recognition result generator which generates a voice recognition intermediate result using a partial sound signal input to the voice recognition client system from a start point of voice recognition to at least one of a plurality of time points between the start point and an end point; and a transmitter which transmits the voice recognition intermediate result to the voice recognition client system, wherein the voice recognition intermediate result is displayed through a display unit of the voice recognition client system between the start point and the end point.
- 23. The voice recognition server system of claim 22, wherein the partial sound signal includes a signal in which at least one unit sound signal is accumulated according to an input time, and the unit sound signal includes a sound signal input during each unit time from the start point.
- 24. The voice recognition server system of claim 22, further comprising a receiver which receives, from the voice recognition client system, unit sound signals input to the voice recognition client system every predetermined unit time, wherein the voice recognition result generator generates the voice recognition intermediate result using the partial sound signal in which at least one of the received unit sound signals is accumulated according to an input time.
- 25. The voice recognition server system of claim 22, further comprising a receiver which receives, from the voice recognition client system, a partial sound signal in which at least one of the unit sound signals input to the voice recognition client system every predetermined unit time is accumulated according to an input time.
- 26. A voice recognition method for displaying a voice recognition result for a sound signal input from a start point to an end point of voice recognition, the method comprising: transmitting, to a voice recognition server system every predetermined unit time, a unit sound signal input during each unit time from the start point to the end point, and receiving a voice recognition intermediate result from the voice recognition server system; and displaying the received voice recognition intermediate result between the start point and the end point.
- 27. A voice recognition method for generating a voice recognition result using a sound signal received from a voice recognition client system, the method comprising: receiving a unit sound signal input to the voice recognition client system every unit time from a start point to an end point of voice recognition; generating a voice recognition intermediate result using a partial sound signal in which at least one unit sound signal is accumulated according to an input time; and transmitting the voice recognition intermediate result to the voice recognition client system, wherein the voice recognition intermediate result is displayed through a display unit of the voice recognition client system between the start point and the end point.
- 28. A voice recognition method for displaying a voice recognition result for a sound signal input from a start point to an end point of voice recognition, the method comprising controlling a voice recognition intermediate result, for a partial sound signal input from the start point to at least one of a plurality of time points between the start point and the end point, to be displayed between the start point and the end point.
- 29. A voice recognition method for generating a voice recognition result using a sound signal received from a voice recognition client system, the method comprising: generating a voice recognition intermediate result using a partial sound signal input to the voice recognition client system from a start point of voice recognition to at least one of a plurality of time points between the start point and an end point; and transmitting the voice recognition intermediate result to the voice recognition client system, wherein the voice recognition intermediate result is displayed through a display unit of the voice recognition client system between the start point and the end point.
- 30. A computer-readable recording medium on which a program for performing the method of any one of claims 26 to 29 is recorded.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/995,085 US9318111B2 (en) | 2010-12-16 | 2011-07-21 | Voice recognition client system for processing online voice recognition, voice recognition server system, and voice recognition method |
JP2013544373A JP2014505270A (ja) | 2010-12-16 | 2011-07-21 | オンライン音声認識を処理する音声認識クライアントシステム、音声認識サーバシステム及び音声認識方法 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2010-0129217 | 2010-12-16 | ||
KR1020100129217A KR101208166B1 (ko) | 2010-12-16 | 2010-12-16 | 온라인 음성인식을 처리하는 음성인식 클라이언트 시스템, 음성인식 서버 시스템 및 음성인식 방법 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012081788A1 true WO2012081788A1 (ko) | 2012-06-21 |
Family
ID=46244864
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2011/005394 WO2012081788A1 (ko) | 2010-12-16 | 2011-07-21 | 온라인 음성인식을 처리하는 음성인식 클라이언트 시스템, 음성인식 서버 시스템 및 음성인식 방법 |
Country Status (4)
Country | Link |
---|---|
US (1) | US9318111B2 (ko) |
JP (2) | JP2014505270A (ko) |
KR (1) | KR101208166B1 (ko) |
WO (1) | WO2012081788A1 (ko) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016175443A1 (ko) * | 2015-04-30 | 2016-11-03 | 주식회사 아마다스 | 음성 인식을 이용한 정보 검색 방법 및 장치 |
CN115188368A (zh) * | 2022-06-30 | 2022-10-14 | 北京百度网讯科技有限公司 | 语音测试方法、装置、电子设备及存储介质 |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20130133629A (ko) | 2012-05-29 | 2013-12-09 | 삼성전자주식회사 | 전자장치에서 음성명령을 실행시키기 위한 장치 및 방법 |
CN103076893B (zh) * | 2012-12-31 | 2016-08-17 | 百度在线网络技术(北京)有限公司 | 一种用于实现语音输入的方法与设备 |
KR102301880B1 (ko) * | 2014-10-14 | 2021-09-14 | 삼성전자 주식회사 | 전자 장치 및 이의 음성 대화 방법 |
EP3282447B1 (en) * | 2015-03-31 | 2020-08-26 | Sony Corporation | Progressive utterance analysis for successively displaying early suggestions based on partial semantic parses for voice control; real time progressive semantic utterance analysis for visualization and actions control |
KR102365757B1 (ko) * | 2015-09-09 | 2022-02-18 | 삼성전자주식회사 | 인식 장치, 인식 방법 및 협업 처리 장치 |
JP6766991B2 (ja) * | 2016-07-13 | 2020-10-14 | 株式会社富士通ソーシアルサイエンスラボラトリ | 端末装置、翻訳方法、及び、翻訳プログラム |
US10339224B2 (en) | 2016-07-13 | 2019-07-02 | Fujitsu Social Science Laboratory Limited | Speech recognition and translation terminal, method and non-transitory computer readable medium |
KR102502220B1 (ko) | 2016-12-20 | 2023-02-22 | 삼성전자주식회사 | 전자 장치, 그의 사용자 발화 의도 판단 방법 및 비일시적 컴퓨터 판독가능 기록매체 |
US10229682B2 (en) | 2017-02-01 | 2019-03-12 | International Business Machines Corporation | Cognitive intervention for voice recognition failure |
JP2019016206A (ja) * | 2017-07-07 | 2019-01-31 | 株式会社富士通ソーシアルサイエンスラボラトリ | 音声認識文字表示プログラム、情報処理装置、及び、音声認識文字表示方法 |
KR102412523B1 (ko) * | 2017-07-18 | 2022-06-24 | 삼성전자주식회사 | 음성 인식 서비스 운용 방법, 이를 지원하는 전자 장치 및 서버 |
KR102443079B1 (ko) | 2017-12-06 | 2022-09-14 | 삼성전자주식회사 | 전자 장치 및 그의 제어 방법 |
EP3888080A4 (en) * | 2018-11-27 | 2022-07-13 | LG Electronics Inc. | MULTIMEDIA DEVICE FOR VOICE COMMAND PROCESSING |
US11211063B2 (en) | 2018-11-27 | 2021-12-28 | Lg Electronics Inc. | Multimedia device for processing voice command |
US11538481B2 (en) * | 2020-03-18 | 2022-12-27 | Sas Institute Inc. | Speech segmentation based on combination of pause detection and speaker diarization |
KR20240068017A (ko) * | 2022-11-08 | 2024-05-17 | 한국전자기술연구원 | 턴프리 대화 방법 및 장치 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005283972A (ja) * | 2004-03-30 | 2005-10-13 | Advanced Media Inc | Speech recognition method, and information presentation method and information presentation apparatus using the speech recognition method |
JP2005331616A (ja) * | 2004-05-18 | 2005-12-02 | Nippon Telegr & Teleph Corp <Ntt> | Client-server speech recognition method, apparatus used therefor, program thereof, and recording medium |
JP2010048890A (ja) * | 2008-08-19 | 2010-03-04 | Ntt Docomo Inc | Client device, recognition result feedback method, recognition result feedback program, server device, speech recognition model update method, speech recognition model update program, speech recognition system, speech recognition method, speech recognition program |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11175093A (ja) | 1997-12-08 | 1999-07-02 | Nippon Telegr & Teleph Corp <Ntt> | Speech recognition confirmation response method |
US20030182113A1 (en) * | 1999-11-22 | 2003-09-25 | Xuedong Huang | Distributed speech recognition for mobile communication devices |
US7409349B2 (en) * | 2001-05-04 | 2008-08-05 | Microsoft Corporation | Servers for web enabled speech recognition |
JP2004094077A (ja) | 2002-09-03 | 2004-03-25 | Nec Corp | Speech recognition apparatus, control method, and program |
US7774694B2 (en) * | 2002-12-06 | 2010-08-10 | 3M Innovative Properties Company | Method and system for server-based sequential insertion processing of speech recognition results |
JP2005037615A (ja) | 2003-07-18 | 2005-02-10 | Omron Corp | Client device, speech recognition server, distributed speech recognition system, speech recognition program, and computer-readable recording medium |
US7729912B1 (en) * | 2003-12-23 | 2010-06-01 | At&T Intellectual Property Ii, L.P. | System and method for latency reduction for automatic speech recognition using partial multi-pass results |
JP4297349B2 (ja) * | 2004-03-30 | 2009-07-15 | KDDI Corporation | Speech recognition system |
TWI251754B (en) * | 2004-12-16 | 2006-03-21 | Delta Electronics Inc | Method for optimizing loads of speech/user recognition system |
CA2648617C (en) * | 2006-04-05 | 2017-12-12 | Yap, Inc. | Hosted voice recognition system for wireless devices |
US8352264B2 (en) * | 2008-03-19 | 2013-01-08 | Canyon IP Holdings, LLC | Corrective feedback loop for automated speech recognition |
US8352261B2 (en) * | 2008-03-07 | 2013-01-08 | Canyon IP Holdings, LLC | Use of intermediate speech transcription results in editing final speech transcription results |
US20090070109A1 (en) | 2007-09-12 | 2009-03-12 | Microsoft Corporation | Speech-to-Text Transcription for Personal Communication Devices |
JP5495612B2 (ja) * | 2008-04-23 | 2014-05-21 | Canon Inc | Camera control apparatus and method |
US8019608B2 (en) * | 2008-08-29 | 2011-09-13 | Multimodal Technologies, Inc. | Distributed speech recognition using one way communication |
JP4902617B2 (ja) * | 2008-09-30 | 2012-03-21 | FueTrek Co., Ltd. | Speech recognition system, speech recognition method, speech recognition client, and program |
US8965545B2 (en) * | 2010-09-30 | 2015-02-24 | Google Inc. | Progressive encoding of audio |
- 2010
  - 2010-12-16 KR KR1020100129217 patent/KR101208166B1/ko active IP Right Grant
- 2011
  - 2011-07-21 JP JP2013544373A patent/JP2014505270A/ja active Pending
  - 2011-07-21 WO PCT/KR2011/005394 patent/WO2012081788A1/ko active Application Filing
  - 2011-07-21 US US13/995,085 patent/US9318111B2/en active Active
- 2015
  - 2015-06-10 JP JP2015117281 patent/JP6139598B2/ja active Active
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016175443A1 (ko) * | 2015-04-30 | 2016-11-03 | Amadas Co., Ltd. | Method and apparatus for information search using voice recognition |
US10403277B2 (en) | 2015-04-30 | 2019-09-03 | Amadas Co., Ltd. | Method and apparatus for information search using voice recognition |
CN115188368A (zh) * | 2022-06-30 | 2022-10-14 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Voice testing method and apparatus, electronic device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
US9318111B2 (en) | 2016-04-19 |
KR101208166B1 (ko) | 2012-12-04 |
US20140316776A1 (en) | 2014-10-23 |
JP6139598B2 (ja) | 2017-05-31 |
JP2014505270A (ja) | 2014-02-27 |
KR20120067680A (ko) | 2012-06-26 |
JP2015179287A (ja) | 2015-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2012081788A1 (ko) | Voice recognition client system, voice recognition server system, and voice recognition method for processing online voice recognition | |
CN113327609B (zh) | Method and apparatus for speech recognition | |
WO2015068947A1 (ko) | System for understanding utterance content based on keyword extraction from recorded voice data, and indexing method and utterance content understanding method using the system | |
WO2016194740A1 (ja) | Speech recognition device, speech recognition system, terminal used in the speech recognition system, and method for generating a speaker identification model | |
JP5042194B2 (ja) | Apparatus and method for updating speaker templates | |
US8521525B2 (en) | Communication control apparatus, communication control method, and non-transitory computer-readable medium storing a communication control program for converting sound data into text data | |
WO2019143022A1 (ko) | User authentication method using voice commands, and electronic device therefor | |
JP2003308087A (ja) | Grammar update system and method | |
JP2009169139A (ja) | Speech recognition apparatus | |
WO2019208860A1 (ko) | Method for recording/outputting multi-party conversations using voice recognition technology, and apparatus therefor | |
US20100178956A1 (en) | Method and apparatus for mobile voice recognition training | |
WO2020054980A1 (ko) | Method and apparatus for phoneme-based speaker model adaptation | |
WO2021251539A1 (ko) | Method and apparatus for implementing interactive messages using an artificial neural network | |
JP2018045001A (ja) | Speech recognition system, information processing apparatus, program, and speech recognition method | |
WO2021091145A1 (en) | Electronic apparatus and method thereof | |
JP7026004B2 (ja) | Conversation assistance device, conversation assistance method, and program | |
JP2007322523A (ja) | Speech translation apparatus and method | |
TW200304638A (en) | Network-accessible speaker-dependent voice models of multiple persons | |
JP2018174442A (ja) | Conference support system, conference support method, program for conference support apparatus, and program for terminal | |
WO2009104332A1 (ja) | Utterance division system, utterance division method, and utterance division program | |
KR20120127773A (ko) | Voice recognition information search system and method | |
JP2017068061A (ja) | Communication terminal and speech recognition system | |
WO2015030340A1 (ko) | Terminal device and hands-free device for hands-free automatic interpretation service, and hands-free automatic interpretation service method | |
CN108174030B (zh) | Method for implementing customized voice control, mobile terminal, and readable storage medium | |
WO2016129188A1 (ja) | Speech recognition processing apparatus, speech recognition processing method, and program | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 11848718; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2013544373; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| WWE | Wipo information: entry into national phase | Ref document number: 13995085; Country of ref document: US |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 11848718; Country of ref document: EP; Kind code of ref document: A1 |