WO2011121978A1 - Voice-recognition system, device, method and program - Google Patents

Voice-recognition system, device, method and program

Info

Publication number
WO2011121978A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
recognition
speech recognition
speech
data
Prior art date
Application number
PCT/JP2011/001826
Other languages
French (fr)
Japanese (ja)
Inventor
祐 北出
Original Assignee
NEC Corporation (日本電気株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corporation (日本電気株式会社)
Priority to JP2012508079A priority Critical patent/JPWO2011121978A1/en
Publication of WO2011121978A1 publication Critical patent/WO2011121978A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/32 - Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/01 - Assessment or evaluation of speech recognition systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0631 - Creating reference templates; Clustering

Definitions

  • the present invention relates to a voice recognition system, apparatus, method, and program, and more particularly, to a voice recognition system, apparatus, method, and program using a plurality of voice data.
  • An example of a speech recognition apparatus with a recognition result selection function using a plurality of microphones is described in Patent Document 1 (Japanese Patent Laid-Open No. 10-232691).
  • The speech recognition apparatus of Patent Document 1 comprises a microphone attached to the speaker's body at a position not fixed relative to the mouth, a recognition unit that recognizes the voice signal input from the microphone and outputs a recognition result, and a comprehensive processing unit that compares the recognition results output from the recognition unit and selects and outputs the recognition result with the highest accuracy.
  • With this configuration, voice input can be performed even if the speaker's posture changes.
  • As a value indicating the accuracy of a recognition result, the distance between the speaker's mouth and the microphone is used, and the recognition result is selected according to this accuracy.
  • An object of the present invention is to provide a speech recognition system, apparatus, method, and program that solve the above-described problem of degraded speech recognition accuracy.
  • The speech recognition apparatus of the present invention comprises voice recognition means for recognizing each of a plurality of voice data obtained by inputting a speaker's utterance under different recording conditions, and recognition result selection means for comparing the plurality of speech recognition results obtained by the voice recognition means and selecting an optimum one.
  • The speech recognition system of the present invention comprises a plurality of voice input means for inputting voice under mutually different recording conditions, voice recognition means for recognizing each of the plurality of voice data input from the voice input means, and recognition result selection means for comparing the plurality of speech recognition results obtained by the voice recognition means and selecting an optimum one.
  • The data processing method of the speech recognition apparatus of the present invention is a data processing method of a speech recognition apparatus that recognizes voice data, in which the speech recognition apparatus recognizes each of a plurality of voice data input under different recording conditions, compares the plurality of speech recognition results obtained, and selects an optimum one.
  • The computer program of the present invention is a computer program for realizing a speech recognition apparatus that recognizes voice data, and causes a computer to execute a procedure for recognizing each of a plurality of voice data input under different recording conditions, and a procedure for comparing the plurality of speech recognition results obtained and selecting an optimum one.
  • The various components of the present invention do not necessarily have to exist independently of one another: a plurality of components may be formed as a single member, a single component may be formed of a plurality of members, a certain component may be part of another component, part of a certain component may overlap part of another component, and so on.
  • Although the data processing method and the computer program of the present invention describe a plurality of procedures in order, the described order does not limit the order in which the procedures are executed. When implementing the data processing method and computer program of this invention, the order of the procedures can therefore be changed within a range that does not affect the content.
  • Furthermore, the plurality of procedures of the data processing method and computer program of the present invention are not limited to being executed at mutually different timings. Another procedure may start during the execution of a certain procedure, or the execution timing of one procedure may partly or wholly overlap that of another.
  • According to the present invention, a speech recognition system, apparatus, method, and program that improve speech recognition accuracy are provided.
  • FIG. 1 is a functional block diagram showing a configuration of a speech recognition system according to an embodiment of the present invention.
  • As shown in the figure, the speech recognition apparatus 100 includes a speech recognition unit 102 that recognizes each of a plurality of voice data d1, d2, ..., dn (where n is a natural number) obtained by inputting a speaker's utterance under different recording conditions, and a recognition result selection unit 104 that compares the plurality of speech recognition results t1, t2, ..., tn obtained by the speech recognition unit 102 and selects an optimum one.
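To make the division of labor concrete, here is a minimal sketch in Python of the apparatus of FIG. 1; the function name and the assumption that a recognizer returns a (text, confidence) pair are illustrative, not from the patent.

```python
# A minimal sketch, assuming a recognizer that returns a (text, confidence)
# pair per stream; names here are illustrative, not from the patent.
from typing import Callable, List, Tuple

def recognize_and_select(
    streams: List[bytes],
    recognize: Callable[[bytes], Tuple[str, float]],
) -> str:
    # Speech recognition unit 102: recognize every input stream d1..dn.
    results = [recognize(d) for d in streams]  # t1..tn
    # Recognition result selection unit 104: compare and keep the optimum.
    best_text, _ = max(results, key=lambda r: r[1])
    return best_text
```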
  • In this embodiment, the speech recognition apparatus 100 includes, for example, a CPU (Central Processing Unit), memory, hard disk, and communication device (not shown), and can be realized by a server computer or personal computer connected to input devices such as a keyboard and mouse and output devices such as a display and printer, or by an equivalent device. Each function of each unit can be realized by the CPU reading the program stored on the hard disk into memory and executing it.
  • Each component of the speech recognition apparatus 100 is realized by an arbitrary combination of hardware and software, centered on the CPU and memory of an arbitrary computer, a program loaded into the memory that realizes the components shown in the figure, a storage unit such as a hard disk that stores the program, and a network connection interface. It will be understood by those skilled in the art that there are various modifications to the implementation method and apparatus. Each figure described below shows functional blocks, not a hardware configuration.
  • the voice recognition system recognizes and automatically records the voice of a speaker in a conference or lecture.
  • Conferences and lectures are held at various venues and in various facilities and environments. In many cases, the existing audio equipment is used at the venue. Therefore, there are a wide variety of acoustic devices such as microphones, amplifiers, and mixers, and there are countless combinations thereof.
  • Depending on the venue and speaker, voice recognition accuracy is not stable when, for example, temporary noise occurs or the speaker changes.
  • When a fixed microphone such as a stand microphone or boundary microphone is used, if the speaker moves while speaking, the distance to the microphone increases, making it difficult to pick up the speaker's voice.
  • A configuration that solves the problem of speaker movement by attaching a pin microphone to the speaker's chest is also conceivable.
  • However, the microphone may then come into contact with clothing or the body and pick up noise. That is, in normal speech the optimum input device may be a stand microphone, while when the speaker moves it may become a pin microphone: the optimum microphone can change dynamically. Thus, there is a problem that voice recognition accuracy is not stable when the situation changes midway.
  • To solve such problems, the speech recognition system of the present invention compares a plurality of recognition results obtained from voice data input under a plurality of different recording conditions, selects the optimum one, and outputs it as the recognition result.
  • For example, a plurality of types of microphones are prepared; if the microphones are of the same type, settings such as the input level are made to differ in advance.
  • When existing equipment is used, if a plurality of microphones are already set differently, they can be applied as they are.
  • Considering speaker movement, microphones are preferably installed in advance at the places where the speaker is expected to move; for a lecture, for example, in front of the whiteboard in addition to the stage where the speaker talks.
  • a hand microphone or the like may be prepared for questions from listeners at the venue.
  • Even when multiple microphones are prepared under the same recording conditions, for example the same type of microphone set to the same input level, the situation may change midway, such as a microphone failing or noise occurring, as described above.
  • The speech recognition system of the present invention can be applied even when the recording conditions end up differing for each microphone as a result of such changes.
  • In the present embodiment, the voice data input devices may be those already present at the venue, or input devices provided as part of the speech recognition system. That is, according to the speech recognition system of the present invention, the accuracy of speech recognition can be improved regardless of what kinds of voice input devices are prepared and how they are combined.
  • Recording conditions are the various conditions under which a speaker's voice is recorded with a microphone, and are of two kinds: those fixed in advance before use, and those that change with the situation during use. Examples of the former include the microphone type, installation location, input level, sensitivity, correction processing method, and stationary noise such as air conditioning; examples of the latter include the speaker (voice volume, gender, etc.), the distance between the sound source or speaker and the microphone, the ambient noise level, and the microphone's input level or sensitivity (when they change due to failure, etc.).
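As an illustration only, the two kinds of recording conditions could be captured in a record like the following; every field name is a hypothetical choice, not from the patent.

```python
# Hypothetical record of the two kinds of recording conditions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecordingCondition:
    # Fixed in advance before use
    mic_type: str                       # e.g. "stand", "boundary", "pin", "hand"
    location: str                       # e.g. "stage", "whiteboard front"
    input_level: float
    sensitivity: float
    correction: Optional[str] = None    # correction processing method
    stationary_noise_db: float = 0.0    # e.g. air conditioning
    # Changing with the situation during use
    speaker: Optional[str] = None       # voice volume, gender, etc.
    mic_distance_m: Optional[float] = None
    ambient_noise_db: Optional[float] = None
```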
  • the speech recognition apparatus 110 includes a speech segment adjustment unit 112, a speech recognition unit 102, and a recognition result selection integration unit 114.
  • The speech recognition apparatus 110 differs from the speech recognition apparatus 100 in that the speech segment adjustment unit 112 detects the utterance sections of each voice data, and the recognition result selection integration unit 114 integrates the recognition results selected for each utterance section before output.
  • The speech segment adjustment unit 112 receives a plurality of series of voice data d1, d2, ..., dn as input and detects the utterance sections of each series. It then adjusts the utterance sections so that the same utterance is included across the plurality of series of voice data.
  • Here, an "utterance section" means a section, detected (or automatically detected) by the speech segment adjustment unit 112 from a series of input voice data, that contains voice data actually spoken by the speaker. The subsequent speech recognition unit executes speech recognition with one utterance section as one processing unit. That is, the speech segment adjustment unit 112 adjusts each segment of voice data to be recognized so that it covers the same section across the plurality of voice data (sections whose start time and end time are the same; hereinafter the start and end times are called the "start/end times").
  • For example, suppose the speech segment adjustment unit 112 detects utterance sections DS11, DS12, ..., DS1a (where a is a natural number) from the first series of voice data d1, utterance sections DS21, DS22, ..., DS2b (where b is a natural number) from the second series of voice data d2, and utterance sections DSn1, DSn2, ..., DSnc (where c is a natural number) from the n-th series of voice data dn. The utterance sections are not shown in the figure.
  • The speech segment adjustment unit 112 adjusts the utterance sections so that the first utterance section DS11 of the first series d1, the first utterance section DS21 of the second series d2, and the first utterance section DSn1 of the n-th series dn contain the same utterance. Similarly, the sections are adjusted so that the second utterance section DS12 of d1, the second utterance section DS22 of d2, and the second utterance section DSn2 of dn contain the same utterance, and the recognition target section is determined. The remaining utterance sections are adjusted in the same way.
  • For example, if the first utterance section DS21 of the second voice data d2 is detected shorter than the first utterance sections of the first voice data d1 and the n-th voice data dn, DS21 is lengthened to match the first utterance sections of the other voice data. In this way, when the utterance section of one voice data is detected shorter than those of the others, or the utterance sections are misaligned, the start/end times of the utterance sections are adjusted so that the plurality of voice data are synchronized.
  • Depending on the detection results, a plurality of utterance sections in one voice data may correspond to a single utterance section in another. For example, consider the case where the first utterance section DS11 of the first series d1 runs from the 1st to the 4th second, the first utterance section DS21 of the second series d2 runs from the 1st to the 2nd second, and the second utterance section DS22 of d2 runs from the 2nd to the 4th second. In this case, DS11 of d1 and DS21 and DS22 of d2 are adjusted to form the same utterance section, and the recognition target section after adjustment runs from the 1st to the 4th second.
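A minimal sketch of one way to compute such common recognition target sections, assuming utterance sections are given as (start, end) times in seconds; merging overlapping intervals across streams by union is an assumption consistent with the worked example above, not the patent's prescribed method.

```python
# Merge the utterance intervals detected on every stream into common
# recognition target sections by taking the union of overlapping intervals.
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def align_sections(per_stream: List[List[Interval]]) -> List[Interval]:
    intervals = sorted(iv for stream in per_stream for iv in stream)
    merged: List[Interval] = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:   # overlaps the previous section
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# The worked example above: DS11 = 1-4 s on d1; DS21 = 1-2 s and
# DS22 = 2-4 s on d2 collapse into one section from 1 s to 4 s.
print(align_sections([[(1.0, 4.0)], [(1.0, 2.0), (2.0, 4.0)]]))  # [(1.0, 4.0)]
```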
  • The speech recognition unit 102 performs speech recognition on the same recognition target sections of the plurality of series of voice data d1, d2, ..., dn (the first recognition target sections DS′11, DS′21, ..., DS′n1, through the m-th recognition target sections DS′1m, DS′2m, ..., DS′nm, where m is a natural number), and outputs the plurality of speech recognition results corresponding to each same recognition target section.
  • Alternatively, speech recognition may be performed in units of utterance sections, and after recognition the results may be aligned to the adjusted recognition target sections.
  • The recognition result selection integration unit 114 compares, for each recognition target section, the plurality of speech recognition results t1, t2, ..., tn output for the same recognition target sections of the plurality of series of voice data (the first recognition target sections DS′11, DS′21, ..., DS′n1, through the m-th recognition target sections DS′1m, DS′2m, ..., DS′nm), and selects the optimum one for each recognition target section. The recognition result selection integration unit 114 then integrates the speech recognition results selected for each recognition target section and outputs them as the speech recognition result T of the series of voice data. For example, the speech recognition result of DS′11 may be selected for the first recognition target section, and the speech recognition result of DS′22 for the second.
  • Here, the speech recognition unit 102 can process the plurality of voice data d1, d2, ..., dn under the same speech recognition processing conditions; that is, the same language model and dictionary can be used.
  • Sound is collected by a plurality of voice input units 10 (U1, U2, ..., Un), and the corresponding plurality of series of voice data d1, d2, ..., dn are input respectively.
  • the voice input unit 10 can be various types of microphones, for example, a stand microphone, a boundary microphone, a pin microphone, a hand microphone, or the like.
  • Microphones can be installed right in front of the speaker, for example near the mouth, or on the speaker's chest as with a pin microphone, or at a position away from the speaker.
  • Possible uses also include installing a microphone where the speaker may move, such as in front of the whiteboard, or using a wireless pin microphone or hand microphone while moving, without fixing the installation location.
  • the multiple audio input units 10 have different recording conditions. These recording conditions may be set by the recording condition setting unit 20. For example, the type and location of the microphone may be different, and the sound input level, sensitivity, correction processing method, and the like of each microphone may be different.
  • The microphone, amplifier, or mixer serving as the voice input unit 10 may be adjusted according to setting values stored in a setting storage unit (not shown) of the recording condition setting unit 20, or may be set automatically by a setting adjustment device (not shown) of the recording condition setting unit 20.
  • the microphone, amplifier, or mixer can be adjusted manually by the user according to the recording conditions and the situation of each venue or speaker.
  • The recognition result selection integration unit 114 compares each of the plurality of speech recognition results corresponding to recognition target sections containing the same utterance of the plurality of series of voice data d1, d2, ..., dn output from the speech recognition unit 102, selects an optimum one for each recognition target section, integrates the speech recognition results selected for each section, and outputs them as the speech recognition result T of a series of voice data.
  • Let the plurality of speech recognition results corresponding to the first recognition target sections DS′11, DS′21, ..., DS′n1, which contain the same utterance of the plurality of series of voice data d1, d2, ..., dn, be TS11, TS21, ..., TSn1; let the results corresponding to the second recognition target sections DS′12, DS′22, ..., DS′n2 be TS12, TS22, ..., TSn2; and let the results corresponding to the m-th recognition target sections DS′1m, DS′2m, ..., DS′nm be TS1m, TS2m, ..., TSnm. The speech recognition results TS11 to TSnm are not shown in the figure.
  • The recognition result selection integration unit 114 compares, for each recognition target section, the recognition results of the plurality of voice data output from the speech recognition unit 102, selects the optimum one, and outputs the combined result.
  • For example, an optimum result is selected for each recognition target section, such as the recognition result TS11 of the first voice data d1 for the first recognition target section, the recognition result TS22 of the second voice data d2 for the second recognition target section, and the recognition result TSnm of the n-th voice data dn for the m-th recognition target section.
  • The recognition result selection integration unit 114 can then integrate the recognition results selected for each recognition target section and output them as the recognition result T of the series of voice data.
  • the optimum one is selected for each recognition target section, but the present invention is not limited to this.
  • the recognition result can be selected in units shorter than one utterance section, for example, word level.
  • As a method of comparing a plurality of speech recognition results and selecting the optimum one, for example, the text data of the speech recognition results are compared with one another and a majority vote is taken, selecting the result that agrees with the most others, that is, the result most similar to the greatest number of the recognition results.
  • Alternatively, information obtained along with the recognition results, such as the acoustic score, language score, and reliability, can be used. That is, when taking a majority vote over the speech recognition results, recognition result information such as reliability can serve as a weight for each result. It is also possible to decide whether to adopt a recognition result based on a threshold on its recognition result information. These approaches may also be combined.
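A hedged sketch of such a reliability-weighted majority vote with threshold-based rejection; voting on whole section texts is a simplification, since the text notes that selection can also be done at word level.

```python
# Reliability-weighted majority vote over the recognition results of one
# recognition target section, with threshold-based rejection.
from collections import defaultdict
from typing import List, Tuple

def select_by_weighted_vote(
    results: List[Tuple[str, float]],   # (recognized text, reliability)
    threshold: float = 0.0,
) -> str:
    votes = defaultdict(float)
    for text, reliability in results:
        if reliability >= threshold:    # drop results below the threshold
            votes[text] += reliability  # reliability acts as the vote weight
    return max(votes, key=votes.get) if votes else ""

# Two streams agree, so their shared text beats one high-scoring outlier.
print(select_by_weighted_vote(
    [("meeting starts", 0.8), ("meeting starts", 0.7), ("meat in stars", 0.9)]))
```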
  • In the present embodiment, the input conditions of each voice input unit 10 are not included in the recognition result selection conditions. Regardless of the input conditions, only information obtained from the recognition results is used for comparison, and by selecting the optimum one, the speech recognition result can be kept accurate.
  • the recognition result T of the recognition result selection / integration unit 114 is output, for example, as text data, recorded in a storage unit (not shown) or a recording medium, and provided to the user.
  • The speech recognition system of the present invention can also be provided to the user as a SaaS (Software as a Service) type service.
  • the recognition result can be provided to the user so as to be browsed by referring to the web page from the user terminal via the network.
  • the recognition result can be provided to the user by downloading as necessary or distributing to a predetermined mail address designated by the user.
  • the speech recognition apparatus 110 can be realized by a computer.
  • The computer program according to the present embodiment describes, for the computer realizing the speech recognition apparatus 110, a procedure for recognizing each of a plurality of voice data input under different recording conditions, and a procedure for comparing the plurality of speech recognition results obtained and selecting an optimum one.
  • More specifically, the computer program causes the computer realizing the speech recognition apparatus 110 to execute: a procedure for accepting a plurality of series of voice data input under different recording conditions; a procedure for detecting the utterance sections of each series; a procedure for adjusting the recognition target sections so that the same utterance is included across the plurality of series; a procedure for performing speech recognition on each recognition target section containing the same utterance of the adjusted series and outputting the corresponding plurality of speech recognition results; a procedure for comparing, for each recognition target section containing the same utterance, the plurality of speech recognition results of the output voice data and selecting the optimum one; and a procedure for integrating the speech recognition results selected for each recognition target section and outputting them as the speech recognition result of a series of voice data.
  • the computer program of this embodiment may be recorded on a computer-readable storage medium.
  • the recording medium is not particularly limited, and various forms can be considered.
  • the program may be loaded from a recording medium into a computer memory, or downloaded to a computer through a network and loaded into the memory.
  • FIG. 3 is a flowchart showing an example of the operation of the speech recognition system of the present embodiment.
  • The data processing method of the speech recognition apparatus 110 is a data processing method of a speech recognition apparatus that recognizes voice data: the speech recognition apparatus 110 recognizes each of a plurality of voice data input under different recording conditions (step S105), then compares the plurality of speech recognition results obtained and selects an optimum one (step S107).
  • More specifically, the speech segment adjustment unit 112 of the speech recognition apparatus 110 receives the voice data d1, d2, ..., dn collected by the plurality of voice input units 10 under different recording conditions (step S101). The speech segment adjustment unit 112 then detects the utterance sections of each voice data and adjusts them so that the same utterance is included (step S103).
  • the speech recognition unit 102 recognizes the plurality of speech data output from the speech segment adjustment unit 112 for each speech segment (step S105).
  • the recognition result corresponding to each utterance section of the plurality of speech data is output from the speech recognition unit 102 to the recognition result selection integration unit 114.
  • the recognition result selection / integration unit 114 compares a plurality of speech recognition results for each utterance section, and selects an optimum one from them (step S107).
  • the recognition result selection / integration unit 114 integrates the recognition results for each selected utterance section, and outputs them as a series of speech data recognition results T (step S109).
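Putting the steps of FIG. 3 together, the overall flow might look as follows; this reuses align_sections and select_by_weighted_vote from the sketches above, and detect_intervals and recognize_section are hypothetical stand-ins for the speech segment adjustment unit 112 and the speech recognition unit 102.

```python
# Sketch of FIG. 3's flow; detect_intervals and recognize_section are
# injected callables standing in for units 112 and 102 of the patent.
def run_pipeline(streams, detect_intervals, recognize_section):
    per_stream = [detect_intervals(d) for d in streams]          # S101, S103
    sections = align_sections(per_stream)                        # S103
    selected = []
    for start, end in sections:
        results = [recognize_section(d, start, end) for d in streams]  # S105
        selected.append(select_by_weighted_vote(results))        # S107
    return " ".join(selected)                                    # S109: result T
```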
  • According to the speech recognition system of the present embodiment, even if some voice data have poor input conditions, the obtained speech recognition results are compared and the optimum one is selected, so the speech recognition result can be kept accurate.
  • The voice input units 10 may be of any type and any setting; as long as their settings differ from one another, the result of whichever input condition proves better can be adopted.
  • Furthermore, according to the speech recognition system of the present embodiment, the optimum result can be selected for each utterance section from a series of voice data, so even if the situation changes partway through the series, the speech recognition result of other voice data can be adopted from that point on, and the speech recognition result can be kept accurate.
  • The same applies when the situation changes midway, for example when the speaker moves away from a fixed microphone, the voice volume changes because the speaker has changed, some microphones malfunction, or noise occurs; and likewise when the speaker returns to the position of the fixed microphone, a malfunctioning microphone recovers, or the noise subsides, the accuracy of the speech recognition result can be maintained. The reason is that it is possible to switch midway to the source giving the optimum speech recognition result.
  • Furthermore, the speech recognition unit 102 can perform speech recognition on the plurality of voice data under the same recognition processing conditions, that is, the same language model and the same acoustic model. Since results recognized under the same recognition processing conditions are evaluated, the relative quality of multiple voice data with different recording conditions can easily be compared using the various recognition parameters and scores obtained from the recognition results and the speech recognition processing.
  • FIG. 4 is a functional block diagram showing the configuration of the speech recognition system according to the embodiment of the present invention.
  • The speech recognition system according to this embodiment differs from the above embodiment in that the recognition result selection integration unit 214 records the conditions at the time of speech recognition processing of the recognition result selected from the plurality of recognition results, and feeds them back as conditions for the speech segment adjustment and recognition result selection of subsequent voice data.
  • In the speech recognition system of the present embodiment, the speech recognition apparatus 200 further includes a processing condition recording unit that records, in a processing condition storage unit (condition storage unit 210), the speech recognition processing conditions under which the speech recognition unit 102 obtained the plurality of speech recognition results, for each speech recognition processing unit (utterance section or recognition processing section).
  • the recognition result selection / integration unit 214 refers to the processing condition storage unit (condition storage unit 210) and selects a speech recognition result for each speech recognition processing unit (utterance section) in consideration of the speech recognition processing conditions.
  • The speech recognition apparatus 200 includes the condition storage unit 210, which stores the input conditions of the plurality of voice data d1, d2, ..., dn for each utterance section (or recognition target section), and can further include an input condition recording unit (not shown) that stores, in the condition storage unit 210 for each utterance section (or recognition target section), the input conditions of the voice data at the time a speech recognition result is selected or not selected by the recognition result selection integration unit 214.
  • the voice segment adjustment unit 212 may refer to the condition storage unit 210 and adjust the speech segment in consideration of input conditions of a plurality of input voice data.
  • The input conditions can include, for example, the power level of the input voice data, the S/N ratio, the difference or ratio of the power level relative to other voice data, and the difference of the S/N ratio relative to other voice data.
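As one plausible reading of these input conditions, the per-frame power level and S/N ratio could be computed as follows; the decibel reference and frame handling are assumptions made for the sketch.

```python
# Frame power in dB and S/N ratio, plus a cross-stream difference, as one
# plausible reading of the input conditions named above.
import math
from typing import Sequence

def power_db(samples: Sequence[float]) -> float:
    mean_sq = sum(s * s for s in samples) / max(len(samples), 1)
    return 10.0 * math.log10(mean_sq + 1e-12)   # epsilon avoids log(0)

def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    return power_db(speech) - power_db(noise)   # speech power over noise floor

def power_diff_db(a: Sequence[float], b: Sequence[float]) -> float:
    return power_db(a) - power_db(b)            # difference with another stream
```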
  • Specifically, the speech recognition apparatus 200 of the present embodiment includes the same speech recognition unit 102 as the speech recognition apparatus 110 of the above embodiment, together with a condition storage unit 210, a speech segment adjustment unit 212, and a recognition result selection integration unit 214.
  • The condition storage unit 210 can store, for each voice data and for each utterance section (or recognition target section), the speech recognition processing conditions at the time the recognition result of that utterance section was selected, and the input conditions of the voice input unit 10.
  • the speech recognition processing conditions can include a recognition result (not shown) of the speech section of the speech data, an acoustic score, a language score, reliability, and the like.
  • the input conditions of the voice input unit 10 can include an input power level, an S / N ratio, and the like.
  • Acoustic information such as the power and S/N ratio and information obtained during analysis are sent from the speech segment adjustment unit 212 to the condition storage unit 210.
  • the selection flag is assigned to each utterance section (recognition target section).
  • As described above, selection can also be made in units shorter than an utterance section, such as the word level; in that case a flag can be assigned per selected unit, for example per word, and stored in the condition storage unit 210.
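A sketch of what one entry in the condition storage unit 210 might hold, per voice data and per utterance section; the field names and the keying scheme are hypothetical.

```python
# Hypothetical shape of one entry in the condition storage unit 210,
# keyed by stream index and by the start/end times of the section.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class SectionRecord:
    recognition_result: str
    acoustic_score: float
    language_score: float
    reliability: float
    input_power_db: float
    snr_db: float
    selected: bool = False                                      # selection flag
    word_flags: Dict[str, bool] = field(default_factory=dict)   # word-level flags

ConditionStore = Dict[Tuple[int, float, float], SectionRecord]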
  • the recognition result selection integration unit 214 refers to the condition storage unit 210 and selects a recognition result in consideration of the input condition or the speech recognition processing condition stored in the condition storage unit 210.
  • the speech interval adjustment unit 212 may detect and adjust the utterance interval by referring to the condition storage unit 210 and considering the input conditions stored in the condition storage unit 210.
  • For example, the recognition result selection integration unit 214 selects the recognition result using the stored information as a weight.
  • the condition storage unit 210 may store an identification model for identifying whether the utterance section (or recognition target section, word, phrase, etc.) or the recognition result is selected or rejected.
  • a base identification model is learned in advance using voice data different from the input voice (given as a teacher) and stored in the condition storage unit 210.
  • For example, the speech segment adjustment unit 212 uses the identification model stored in the condition storage unit 210 to acquire, based on various feature amounts obtained from the input speech, a determination of whether to select or reject an utterance section (or recognition target section, word, phrase, etc.), or a score obtained from the identification model. The speech segment adjustment unit 212 then detects and adjusts the utterance sections using this result.
  • the recognition result selection / integration unit 214 uses the identification model stored in the condition storage unit 210 to determine whether to select or reject the recognition result based on various feature quantities and scores obtained (or identification). Get the score obtained from the model. Then, the recognition result selection integration unit 214 selects and rejects the recognition result using the result. It is also conceivable to update the identification model sequentially by adding the final adjustment result and recognition result of the speech section.
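As a hedged illustration of such an identification model with sequential updates, an incrementally trained logistic classifier (here scikit-learn's SGDClassifier with partial_fit) could stand in; the feature vector [reliability, S/N ratio] and the update label are assumptions for the sketch, since in practice the final adjustment and recognition results would supply the labels.

```python
# Incremental select/reject classifier standing in for the identification
# model; SGDClassifier supports the sequential updates described above.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")        # logistic select/reject model
classes = np.array([0, 1])                    # 0 = reject, 1 = select

# Learn a base model in advance from held-out voice data (the teacher).
X_base = np.array([[0.9, 20.0], [0.2, 3.0]])  # [reliability, snr_db]
y_base = np.array([1, 0])
model.partial_fit(X_base, y_base, classes=classes)

# At run time: decide on a candidate section, read off the model's score,
# and later fold the final outcome back in as a sequential update.
x = np.array([[0.7, 12.0]])
select = bool(model.predict(x)[0])
score = float(model.decision_function(x)[0])
final_label = int(select)                     # in practice: the final result
model.partial_fit(x, np.array([final_label]))
```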
  • In the present embodiment, the speech segment adjustment unit 212 and the recognition result selection integration unit 214 are configured to refer to the condition storage unit 210 directly. However, the present invention is not limited to this: another determination unit (not shown) may refer to the condition storage unit 210, determine whether the conditions recorded there need to be considered by the speech segment adjustment unit 212 or the recognition result selection integration unit 214, and, when necessary, notify the required conditions to the speech segment adjustment unit 212 or the recognition result selection integration unit 214.
  • the speech recognition apparatus 200 of the present embodiment can be realized by a computer.
  • In addition to the procedures of the computer program of the above embodiment, the computer program of the present embodiment describes, for the computer realizing the speech recognition apparatus 200, a procedure for recording in the condition storage unit 210, for each recognition target section, the speech recognition processing conditions in the speech recognition unit 102 at the time a speech recognition result is selected or not selected, and a procedure for referring to the condition storage unit 210 and selecting the recognition result for each recognition target section in consideration of the speech recognition processing conditions.
  • FIG. 6 is a flowchart showing an example of the operation of the speech recognition system of this embodiment.
  • The operation of the speech recognition apparatus 200 includes steps S203 to S208 in addition to steps S101, S105, and S109, which are similar to those in the flowchart of the above embodiment in FIG. 3.
  • First, the speech segment adjustment unit 212 of the speech recognition apparatus 200 receives the voice data d1, d2, ..., dn collected by the plurality of voice input units 10 under different recording conditions (step S101). The speech segment adjustment unit 212 then detects the utterance sections of each voice data and adjusts them so that the same utterance is included (step S203). At this time, the speech segment adjustment unit 212 refers to the condition storage unit 210 and detects and adjusts the utterance sections in consideration of the input conditions.
  • Furthermore, the speech segment adjustment unit 212 records the input conditions in the condition storage unit 210 for each voice data and each utterance section (or recognition processing section) (step S204). The speech recognition unit 102 then recognizes the plurality of voice data output from the speech segment adjustment unit 212 for each recognition processing section (step S105). As a result, the recognition results corresponding to each recognition processing section of the plurality of voice data are output from the speech recognition unit 102 to the recognition result selection integration unit 214. The recognition result selection integration unit 214 then compares the plurality of speech recognition results for each recognition processing section and selects the optimum one (step S207). At this time, the recognition result selection integration unit 214 refers to the condition storage unit 210 and selects the recognition result in consideration of the input conditions or speech recognition processing conditions.
  • Furthermore, the recognition result selection integration unit 214 adds to the condition storage unit 210 the speech recognition processing conditions for each utterance section of each voice data and a selection flag indicating whether the voice data of that section was adopted (step S208). The recognition result selection integration unit 214 then integrates the recognition results selected for each recognition processing section and outputs them as the recognition result T of a series of voice data (step S109).
  • According to the speech recognition system of the present embodiment, in addition to the same effects as the above embodiment, the conditions of voice data selected or not selected in the past can be taken into account when selecting the speech recognition result. Processing can therefore reflect the tendencies of the different recording conditions arising from the circumstances of each venue, and recognition accuracy can be improved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed is a voice-recognition device (100) equipped with a voice-recognition unit (102) that performs voice recognition on a plurality of voice data obtained by inputting a speaker's voice under different recording conditions, and a recognition-result selection unit (104) that selects an optimum result by comparing the plurality of voice-recognition results obtained by the voice-recognition unit (102).

Description

Speech recognition system, apparatus, method, and program
 The present invention relates to a speech recognition system, apparatus, method, and program, and more particularly to a speech recognition system, apparatus, method, and program that use a plurality of voice data.
 An example of a speech recognition apparatus with a recognition result selection function using a plurality of microphones is described in Patent Document 1 (Japanese Patent Laid-Open No. 10-232691). The speech recognition apparatus of Patent Document 1 comprises a microphone attached to the speaker's body at a position not fixed relative to the mouth, which is the speaker's voice source; a recognition unit that recognizes the voice signal input from the microphone and outputs a recognition result; and a comprehensive processing unit that compares the recognition results output from the recognition unit and selects and outputs the recognition result with the highest accuracy. With this configuration, voice input can be performed even if the speaker's posture changes. As a value indicating the accuracy of a recognition result, the distance between the speaker's mouth and the microphone is used, and the recognition result is selected according to this accuracy.
Japanese Patent Laid-Open No. 10-232691
 In recent years, there has been a growing need for systems that recognize and automatically record the speech of speakers at conferences and lectures. However, conferences and lectures are held at various venues under various facilities and environments. The venue's existing audio equipment is often used, and acoustic devices such as microphones, amplifiers, and mixers are extremely diverse, with countless combinations. Moreover, when speakers change at a lecture hall, for example, recording conditions such as the audio equipment settings are generally not changed for each speaker. As a result, if a speaker's voice is too loud for the settings, a recognition result containing many errors is output; conversely, if it is too quiet, the voice sections may not be detected, and speech recognition accuracy is degraded.
 An object of the present invention is to provide a speech recognition system, apparatus, method, and program that solve the above-described problem of degraded speech recognition accuracy.
 The speech recognition apparatus of the present invention comprises:
 voice recognition means for recognizing each of a plurality of voice data obtained by inputting a speaker's utterance under different recording conditions; and
 recognition result selection means for comparing a plurality of speech recognition results obtained by the voice recognition means and selecting an optimum one.
 The speech recognition system of the present invention comprises:
 a plurality of voice input means for inputting voice under mutually different recording conditions;
 voice recognition means for recognizing each of the plurality of voice data input from the voice input means; and
 recognition result selection means for comparing a plurality of speech recognition results obtained by the voice recognition means and selecting an optimum one.
 The data processing method of the speech recognition apparatus of the present invention is a data processing method of a speech recognition apparatus that recognizes voice data, in which the speech recognition apparatus recognizes each of a plurality of voice data input under different recording conditions, compares the plurality of speech recognition results obtained, and selects an optimum one.
 The computer program of the present invention is a computer program for realizing a speech recognition apparatus that recognizes voice data, and causes a computer to execute a procedure for recognizing each of a plurality of voice data input under different recording conditions, and a procedure for comparing the plurality of speech recognition results obtained and selecting an optimum one.
 Any combination of the above components, and any conversion of the expression of the present invention between a method, an apparatus, a system, a recording medium, a computer program, and so on, are also effective as aspects of the present invention.
 The various components of the present invention do not necessarily have to exist independently of one another: a plurality of components may be formed as a single member, a single component may be formed of a plurality of members, a certain component may be part of another component, part of a certain component may overlap part of another component, and so on.
 Although the data processing method and the computer program of the present invention describe a plurality of procedures in order, the described order does not limit the order in which the procedures are executed. When implementing the data processing method and computer program of this invention, the order of the procedures can therefore be changed within a range that does not affect the content.
 Furthermore, the plurality of procedures of the data processing method and computer program of the present invention are not limited to being executed at mutually different timings. Another procedure may start during the execution of a certain procedure, or the execution timing of one procedure may partly or wholly overlap that of another.
 According to the present invention, a speech recognition system, apparatus, method, and program that improve speech recognition accuracy are provided.
 The above-described object and other objects, features, and advantages will become more apparent from the preferred embodiments described below and the accompanying drawings.
FIG. 1 is a functional block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing an example of the configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 3 is a flowchart showing an example of the operation of a speech recognition system according to an embodiment of the present invention.
FIG. 4 is a functional block diagram showing an example of the configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example of the structure of the condition storage unit of a speech recognition system according to an embodiment of the present invention.
FIG. 6 is a flowchart showing an example of the operation of a speech recognition system according to an embodiment of the present invention.
 Hereinafter, embodiments of the present invention will be described with reference to the drawings. In all the drawings, the same reference numerals are given to the same components, and duplicated description is omitted as appropriate.
(First embodiment)
 FIG. 1 is a functional block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.
 As shown in the figure, in the speech recognition system of the present embodiment, the speech recognition apparatus 100 includes a speech recognition unit 102 that recognizes each of a plurality of voice data d1, d2, ..., dn (where n is a natural number) obtained by inputting a speaker's utterance under different recording conditions, and a recognition result selection unit 104 that compares the plurality of speech recognition results t1, t2, ..., tn obtained by the speech recognition unit 102 and selects an optimum one.
 In this embodiment, the speech recognition apparatus 100 includes, for example, a CPU (Central Processing Unit), memory, hard disk, and communication device (not shown), and can be realized by a server computer or personal computer connected to input devices such as a keyboard and mouse and output devices such as a display and printer, or by an equivalent device. Each function of each unit can be realized by the CPU reading the program stored on the hard disk into memory and executing it.
 In the following drawings, the configuration of parts not related to the essence of the present invention is omitted and not shown.
 Each component of the speech recognition apparatus 100 is realized by an arbitrary combination of hardware and software, centered on the CPU and memory of an arbitrary computer, a program loaded into the memory that realizes the components shown in the figure, a storage unit such as a hard disk that stores the program, and a network connection interface. It will be understood by those skilled in the art that there are various modifications to the implementation method and apparatus. Each figure described below shows functional blocks, not a hardware configuration.
 The speech recognition system of this embodiment recognizes and automatically records the speech of speakers at conferences and lectures. Conferences and lectures are held at various venues under various facilities and environments, and the venue's existing audio equipment is often used. Acoustic devices such as microphones, amplifiers, and mixers are therefore extremely diverse, with countless combinations.
 Also, when speakers change at a lecture hall, for example, recording conditions such as the audio equipment settings are generally not changed for each speaker. As a result, if a speaker's voice is too loud for the settings, a recognition result containing many errors is output; conversely, if it is too quiet, the voice sections may not be detected.
 Furthermore, depending on the venue and speaker, speech recognition accuracy is not stable when, for example, temporary noise occurs or the speaker changes. Alternatively, when a fixed microphone such as a stand microphone or boundary microphone is used, if the speaker moves while speaking, the distance to the microphone increases, making it difficult to pick up the speaker's voice.
 A configuration that solves the problem of speaker movement by attaching a pin microphone to the speaker's chest is also conceivable. However, the microphone may then come into contact with clothing or the body and pick up noise. That is, in normal speech the optimum input device may be a stand microphone, while when the speaker moves it may become a pin microphone: the optimum microphone can change dynamically.
 Thus, there is a problem that speech recognition accuracy is not stable when the situation changes midway.
 To solve such problems, the speech recognition system of the present invention compares a plurality of recognition results obtained from voice data input under a plurality of different recording conditions, selects the optimum one, and outputs it as the recognition result. For example, a plurality of types of microphones are prepared; if the microphones are of the same type, settings such as the input level are made to differ in advance. Alternatively, when existing equipment is used, if a plurality of microphones are already set differently, they can be applied as they are.
 Considering speaker movement, microphones are preferably installed in advance at the places where the speaker is expected to move; for a lecture, for example, in front of the whiteboard in addition to the stage where the speaker talks. A hand microphone or the like may also be prepared for questions from listeners at the venue. Moreover, even when multiple microphones are prepared under the same recording conditions, for example the same type of microphone set to the same input level, the situation may change midway, such as a microphone failing or noise occurring, as described above. The speech recognition system of the present invention can be applied even when the recording conditions end up differing for each microphone as a result of such changes.
 In the present embodiment, the voice data input devices may be those already present at the venue, or input devices provided as part of the speech recognition system. That is, according to the speech recognition system of the present invention, the accuracy of speech recognition can be improved regardless of what kinds of voice input devices are prepared and how they are combined.
 The recording conditions are the various conditions under which a speaker's voice is recorded with a microphone, and they fall into two types: conditions fixed in advance before use, and conditions that change with the situation during use. Examples of the former include the microphone type, installation location, input level, sensitivity, correction processing method, and stationary noise such as air conditioning. Examples of the latter include the speaker (voice volume, gender, and so on), the distance between the sound source or speaker and the microphone, the ambient noise level, and the microphone's input level and sensitivity (when these change because of a failure or the like).
 Specifically, as shown in FIG. 2, in the speech recognition system of this embodiment the speech recognition apparatus 110 includes a speech segment adjustment unit 112, a speech recognition unit 102, and a recognition result selection/integration unit 114. Hereinafter, this embodiment is described taking the speech recognition apparatus 110 as an example. The speech recognition apparatus 110 differs from the speech recognition apparatus 100 in that the speech segment adjustment unit 112 detects the utterance segments of each set of voice data, and in that the recognition result selection/integration unit 114 integrates and outputs the recognition results selected for each utterance segment.
 The speech segment adjustment unit 112 receives a plurality of series of voice data d1, d2, ..., dn as input and detects the utterance segments of each series. The speech segment adjustment unit 112 then adjusts the utterance segments so that the same utterance is included across the series of voice data d1, d2, ..., dn.
 Here, an "utterance segment" means a segment, detected by the speech segment adjustment unit 112 or detected automatically, that contains the voice data of what the speaker actually uttered within the input series of voice data. The subsequent speech recognition unit then executes speech recognition with each utterance segment as one processing unit. That is, the speech segment adjustment unit 112 adjusts the segmentation so that each unit of voice data to be recognized covers the same segment across the plurality of voice data (segments whose start times and end times coincide; hereinafter the start and end times are called the "start/end times").
 For example, suppose the speech segment adjustment unit 112 detects utterance segments DS11, DS12, ..., DS1a (where a is a natural number) in the first series of voice data d1, utterance segments DS21, DS22, ..., DS2b (where b is a natural number) in the second series of voice data d2, and utterance segments DSn1, DSn2, ..., DSnc (where c is a natural number) in the n-th series of voice data dn. The utterance segments are not illustrated.
 The speech segment adjustment unit 112 then adjusts the segments so that the first utterance segment DS11 of the first series of voice data d1, the first utterance segment DS21 of the second series of voice data d2, and the first utterance segment DSn1 of the n-th series of voice data dn each contain the same utterance. Similarly, it adjusts the second utterance segment DS12 of d1, the second utterance segment DS22 of d2, and the second utterance segment DSn2 of dn so that they contain the same utterance, and determines the recognition target segment. The remaining utterance segments are adjusted in the same way.
 Specifically, suppose that, among the first utterance segments of the first voice data d1, the second voice data d2, and the n-th voice data dn, the detected first utterance segment DS21 of d2 is shorter than the first utterance segments of the other voice data. In that case, the segment is lengthened to match the first utterance segments of the other data. In other words, when differing recording conditions cause the utterance segment detected in one series of voice data to be shorter than the corresponding segments in the others, producing a misalignment, the plurality of voice data are synchronized and the start/end times of the utterance segments are adjusted.
 Note that what is one utterance segment in one series of voice data may be detected as multiple utterance segments in another. Consider, for example, the case where the first utterance segment DS11 of the first series d1 runs from second 1 to second 4, while in the second series d2 the first utterance segment DS21 runs from second 1 to second 2 and the second utterance segment DS22 runs from second 2 to second 4. In this case, the segments are adjusted so that DS11 of d1 and the combination of DS21 and DS22 of d2 form the same utterance segment, and the adjusted recognition target segment runs from second 1 to second 4.
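As a purely illustrative sketch, not part of the disclosed embodiment, one way such segment alignment could be implemented is to pool the utterance segments detected in all streams and merge any that touch or overlap, so that each merged group's overall start/end times define one shared recognition target segment. The function name and interval representation below are assumptions made for the example.

```python
def align_segments(streams):
    """Merge per-stream utterance segments into shared recognition target
    segments (a minimal sketch; intervals are (start, end) in seconds).

    streams: list of lists of (start, end) tuples, one list per
    microphone stream, e.g. [[(1.0, 4.0)], [(1.0, 2.0), (2.0, 4.0)]].
    Returns the list of (start, end) target segments common to all streams.
    """
    # Pool every detected segment from every stream and sort by start time.
    pooled = sorted(seg for stream in streams for seg in stream)
    targets = []
    for start, end in pooled:
        # Segments that touch or overlap the current group are merged,
        # so DS11 (1-4 s) absorbs DS21 (1-2 s) and DS22 (2-4 s).
        if targets and start <= targets[-1][1]:
            targets[-1][1] = max(targets[-1][1], end)
        else:
            targets.append([start, end])
    return [tuple(t) for t in targets]

print(align_segments([[(1.0, 4.0)], [(1.0, 2.0), (2.0, 4.0)]]))
# -> [(1.0, 4.0)], matching the example in the text
```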
 The speech recognition unit 102 performs speech recognition on each identical recognition target segment (the first recognition target segments DS′11, DS′21, DS′n1; the m-th recognition target segments DS′1m, DS′2m, DS′nm; and so on, where m is a natural number) of the plurality of series of voice data d1, d2, ..., dn synchronized by the speech segment adjustment unit 112, and outputs the plurality of speech recognition results corresponding to each identical recognition target segment. Alternatively, speech recognition may be performed per utterance segment, with the recognition results aligned to the adjusted recognition target segments after recognition.
 The recognition result selection/integration unit 114 compares, for each identical recognition target segment (the first recognition target segments DS′11, DS′21, DS′n1; the m-th recognition target segments DS′1m, DS′2m, DS′nm; and so on) of the series of voice data d1, d2, ..., dn, the corresponding speech recognition results t1, t2, ..., tn output from the speech recognition unit 102, and selects the optimum one for each recognition target segment. The recognition result selection/integration unit 114 then integrates the speech recognition results selected for the respective recognition target segments and outputs them as the speech recognition result T for the series of voice data. For example, the speech recognition result of DS′11 is selected for the first recognition target segment, and the speech recognition result of DS′22 is selected for the second.
 In this embodiment, the speech recognition unit 102 can perform speech recognition on the plurality of voice data d1, d2, ..., dn under the same speech recognition processing conditions; that is, the same language model, dictionary, and so on can be used.
 In this embodiment, sound is collected by a plurality of voice input units 10 (U1, U2, ..., Un), from which the series of voice data d1, d2, ..., dn are respectively input. The voice input units 10 can be microphones of various types, for example stand microphones, boundary microphones, pin microphones, or hand microphones.
 Various microphone placements are conceivable. For example, a microphone can be placed directly in front of the speaker, that is, at the mouth or, like a pin microphone, at the speaker's chest, or at a position away from the speaker. A microphone can also be placed where the speaker is likely to move, for example in front of a whiteboard, or a wireless pin or hand microphone can be used while moving, without a fixed installation location.
 The plurality of voice input units 10 each operate under different recording conditions. These recording conditions may be set by the recording condition setting unit 20. For example, the microphones may differ in type or installation location, or in voice input level, sensitivity, correction processing method, and so on.
 For example, the microphones, amplifiers, or mixers serving as the voice input units 10 may be adjusted according to setting values stored in a setting storage unit (not shown) of the recording condition setting unit 20, or may be set automatically by a setting adjustment device (not shown) of the recording condition setting unit 20. A user can also adjust the microphones, amplifiers, or mixers manually according to the recording conditions and the circumstances of the venue, the speakers, and so on.
 The recognition result selection/integration unit 114 compares the speech recognition results output from the speech recognition unit 102 for each recognition target segment containing the same utterance across the series of voice data d1, d2, ..., dn, selects the optimum result for each recognition target segment, integrates the results selected for the respective segments, and outputs them as the speech recognition result T for the series of voice data.
 For example, let TS11, TS21, ..., TSn1 be the speech recognition results corresponding to the first recognition target segments DS′11, DS′21, ..., DS′n1 containing the same utterance across the series of voice data d1, d2, ..., dn; let TS12, TS22, ..., TSn2 be those corresponding to the second recognition target segments DS′12, DS′22, ..., DS′n2; and let TS1m, TS2m, ..., TSnm be those corresponding to the m-th recognition target segments DS′1m, DS′2m, ..., DS′nm. The speech recognition results TS11 to TSnm corresponding to the respective recognition target segments are not illustrated.
 The recognition result selection/integration unit 114 compares the recognition results of the plurality of voice data output from the speech recognition unit 102 segment by segment, selects the optimum one for each recognition target segment, joins the selections together, and outputs them. For example, the recognition result TS11 of the first voice data d1 is selected for the first recognition target segment, the recognition result TS22 of the second voice data d2 for the second, and the recognition result TSnm of the n-th voice data dn for the m-th: the optimum result is selected for each recognition target segment. The recognition result selection/integration unit 114 can then integrate the recognition results selected for the respective segments and output them as the recognition result T for the series of voice data. Although in this embodiment the optimum result is selected per recognition target segment, the invention is not limited to this; recognition results can also be selected in units shorter than one utterance segment, for example at the word level.
 Various methods are conceivable for selecting recognition results in the recognition result selection/integration unit 114. One example is the ROVER method (J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)", Proceedings of the IEEE (Institute of Electrical and Electronics Engineers) Workshop on Automatic Speech Recognition and Understanding (ASRU), 1997, pp. 347-354).
 That is, the text data of the speech recognition results are compared with one another, and a majority vote is taken that selects the result obtained most often, that is, the result for which the most similar results are obtained among the plurality of recognition results, to determine the output recognition result sequence. Alternatively, information obtained together with the recognition results, such as acoustic scores, language scores, and confidence measures, can be used. That is, when taking the majority vote over the speech recognition results, recognition result information such as confidence can be used to weight each result. It is further conceivable to decide whether to adopt a recognition result by applying a threshold to its recognition result information. These approaches may also be combined.
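The following is a minimal sketch, not taken from the patent, of confidence-weighted majority voting over aligned word hypotheses in the spirit of ROVER. The word alignment step is assumed to have been done already, and the names, the mixing weight alpha, and the rejection threshold are illustrative assumptions.

```python
from collections import defaultdict

def vote(word_hypotheses, alpha=0.7, reject_below=0.2):
    """Pick one word per aligned slot by confidence-weighted majority vote
    (a sketch in the spirit of ROVER; alignment is assumed done).

    word_hypotheses: list of slots; each slot is a list of
    (word, confidence) pairs, one pair per recognizer output.
    alpha: mixing weight between relative vote count and mean confidence.
    reject_below: slots whose best score falls below this are dropped,
    i.e. threshold-based adoption of a result.
    """
    output = []
    for slot in word_hypotheses:
        counts = defaultdict(int)
        confs = defaultdict(list)
        for word, conf in slot:
            counts[word] += 1
            confs[word].append(conf)
        n = len(slot)
        # Score each candidate word by a weighted mix of how many
        # recognizers agree on it and how confident they are on average.
        best_word, best_score = max(
            ((w, alpha * counts[w] / n
                 + (1 - alpha) * sum(confs[w]) / len(confs[w]))
             for w in counts),
            key=lambda x: x[1])
        if best_score >= reject_below:
            output.append(best_word)
    return output

slots = [[("voice", 0.9), ("voice", 0.8), ("choice", 0.4)],
         [("recognition", 0.7), ("recognition", 0.9), ("recognition", 0.6)]]
print(vote(slots))  # -> ['voice', 'recognition']
```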
 In the speech recognition system of the present invention, the input conditions of the individual voice input units 10 are not included in the criteria for selecting recognition results. By comparing only the information obtained from the recognition results, irrespective of the input conditions, and selecting the optimum one, the accuracy of the speech recognition results can be maintained.
 The recognition result T of the recognition result selection/integration unit 114 is output, for example, as text data, recorded in a storage unit (not shown) or on a recording medium, and provided to the user.
 The speech recognition system of the present invention can also be provided to users as a SaaS (Software as a Service) offering. In a SaaS system, the recognition results can be made available for browsing by referring to a web page from a user terminal over a network. The recognition results can further be provided by download on demand, or by delivery to a predetermined e-mail address specified by the user. These provision methods are not particularly limited, and various forms are conceivable.
 As described above, the speech recognition apparatus 110 of this embodiment can be realized by a computer.
 The computer program of this embodiment is written so as to cause a computer implementing the speech recognition apparatus 110 to execute a procedure for recognizing each of a plurality of voice data input under different recording conditions, and a procedure for comparing the plurality of speech recognition results thus obtained and selecting the optimum one.
 The computer program of this embodiment is further written so as to cause a computer implementing the speech recognition apparatus 110 to execute: a procedure for accepting input of a plurality of series of the voice data recorded under different recording conditions and detecting the utterance segments of each series; a procedure for adjusting the recognition target segments so that they contain the same utterance across the series of voice data; a procedure for performing speech recognition on each recognition target segment containing the same utterance in the adjusted series of voice data and outputting the plurality of speech recognition results corresponding to each such segment; a procedure for comparing, for each recognition target segment containing the same utterance, the plurality of output speech recognition results and selecting the optimum one per segment; and a procedure for integrating the speech recognition results selected for the respective segments and outputting them as the speech recognition result for the series of voice data.
 The computer program of this embodiment may be recorded on a computer-readable storage medium. The recording medium is not particularly limited, and various forms are conceivable. The program may be loaded from a recording medium into the computer's memory, or downloaded to the computer over a network and loaded into memory.
 With the configuration described above, the data processing method performed by the speech recognition apparatus 110 of this embodiment is described below. FIG. 3 is a flowchart showing an example of the operation of the speech recognition system of this embodiment.
 The data processing method of the speech recognition apparatus 110 according to the embodiment of the present invention is a data processing method of a speech recognition apparatus that recognizes voice data, in which the speech recognition apparatus 110 recognizes each of a plurality of voice data input under different recording conditions (step S105) and compares the plurality of speech recognition results obtained to select the optimum one (step S107).
 More specifically, first, the speech segment adjustment unit 112 of the speech recognition apparatus 110 receives the voice data d1, d2, ..., dn collected by the plurality of voice input units 10 under respectively different recording conditions (step S101). The speech segment adjustment unit 112 then detects the utterance segments of each set of voice data and adjusts the segments against one another so that each contains the same utterance (step S103).
 The speech recognition unit 102 then performs recognition on the plurality of voice data output from the speech segment adjustment unit 112, one utterance segment at a time (step S105). As a result, the recognition results corresponding to the utterance segments of the voice data are output from the speech recognition unit 102 to the recognition result selection/integration unit 114. The recognition result selection/integration unit 114 compares the speech recognition results for each utterance segment and selects the optimum one among them (step S107). The recognition result selection/integration unit 114 then integrates the recognition results selected for the utterance segments and outputs them as the recognition result T for the series of voice data (step S109).
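Put together, the flow of steps S101 to S109 could look like the following sketch. This is my own illustration, not code from the patent: recognize() stands in for any recognizer run with identical models on every stream, and align_segments() and vote() are the hypothetical helpers sketched earlier.

```python
def run_pipeline(streams, detect_segments, recognize):
    """Sketch of the S101-S109 flow: segment, align, recognize each
    stream per target segment, vote per segment, then concatenate.

    streams: list of audio streams (one per microphone).
    detect_segments: stream -> list of (start, end) utterance segments.
    recognize: (stream, segment) -> list of (word, confidence) pairs.
    """
    # S103: detect per-stream utterance segments and align them.
    targets = align_segments([detect_segments(s) for s in streams])
    final_words = []
    for seg in targets:
        # S105: recognize the same target segment in every stream.
        hyps = [recognize(s, seg) for s in streams]
        # S107: per-slot vote across streams (word positions are assumed
        # already aligned, i.e. equal length across streams).
        slots = [list(slot) for slot in zip(*hyps)]
        final_words.extend(vote(slots))
    # S109: integrate the selections into one recognition result T.
    return " ".join(final_words)
```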
 As described above, according to the speech recognition system of the embodiment of the present invention, even if some of the plurality of voice data were captured under poor input conditions, the accuracy of the speech recognition results can be maintained by comparing the plurality of speech recognition results obtained and selecting the optimum one. The voice input units 10 may be of any type and use any settings; by giving them mutually different settings, a good result obtained from even one of those settings can be adopted.
 Furthermore, according to the speech recognition system of this embodiment, the optimum result can be selected for each utterance segment within a series of voice data, so even if the situation changes partway through the series, the speech recognition result of other voice data can be adopted from that point on, and the accuracy of the speech recognition results can be maintained. This applies, for example, when the speaker moves away from a fixed microphone, when the voice volume changes because one speaker is replaced by another, when some microphones malfunction, or when noise arises partway through. Likewise, accuracy can be maintained when the speaker returns to the position of the fixed microphone, when a malfunctioning microphone recovers, or when the noise subsides, because the system can switch partway through to whichever source yields the optimum speech recognition result.
 That is, by preparing a plurality of microphones with different recording conditions and, according to the situation, evaluating which microphone's voice data yields the better recognition result, selecting it, and switching, the characteristics of each microphone can be exploited effectively as the situation demands.
 Moreover, in the speech recognition system of this embodiment, the speech recognition unit 102 can process the plurality of voice data under the same recognition processing conditions, that is, with the same language model or the same acoustic model. Because results recognized under identical conditions are being evaluated, voice data recorded under different recording conditions can easily be ranked by comparing the recognition results and the various feature values and scores obtained in the speech recognition process.
(Second Embodiment)
 FIG. 4 is a functional block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.
 The speech recognition system of this embodiment differs from the above embodiment in that the recognition result selection/integration unit 214 records the conditions in effect during the speech recognition processing of the recognition result selected from among the plurality of recognition results, and feeds them back as conditions for the speech segment adjustment and recognition result selection of subsequent voice data.
 Furthermore, in the speech recognition system of this embodiment, the speech recognition apparatus 200 further includes a processing condition storage unit (condition storage unit 210) that stores, for each speech recognition processing unit (utterance segment or recognition processing segment) processed by the speech recognition unit 102, the speech recognition processing conditions of the speech recognition unit 102 under which the plurality of speech recognition results were obtained, and a processing condition recording unit that records in the processing condition storage unit (condition storage unit 210), for each speech recognition processing unit (utterance segment or recognition processing segment), the speech recognition processing conditions in the speech recognition unit 102 when a speech recognition result was selected, or not selected, by the recognition result selection/integration unit 214.
 The recognition result selection/integration unit 214 refers to the processing condition storage unit (condition storage unit 210) and selects a speech recognition result for each speech recognition processing unit (utterance segment) in consideration of the speech recognition processing conditions.
 In the speech recognition system of this embodiment, the speech recognition apparatus 200 can further include a condition storage unit 210 that stores, for each utterance segment (or recognition target segment), the input conditions under which the plurality of voice data d1, d2, ..., dn were input, and an input condition recording unit (not shown) that records in the condition storage unit 210, for each utterance segment (or recognition target segment), the input conditions of the voice data when a speech recognition result was selected, or not selected, by the recognition result selection/integration unit 214.
 The speech segment adjustment unit 212 may refer to the condition storage unit 210 and adjust the utterance segments in consideration of the input conditions of the plurality of input voice data.
 Here, the input conditions can include, for example, the power level of the input voice data, its S/N ratio, the difference or ratio of its power level relative to other voice data, or the difference of its S/N ratio relative to other voice data.
 Specifically, the speech recognition apparatus 200 of this embodiment includes the same speech recognition unit 102 as the speech recognition apparatus 110 of the above embodiment, and further includes a condition storage unit 210, a speech segment adjustment unit 212, and a recognition result selection/integration unit 214.
 As shown in FIG. 5, for example, the condition storage unit 210 can hold, for each set of voice data and further for each utterance segment (or recognition target segment): a selection flag indicating whether the recognition result of that segment of that voice data was adopted; the speech recognition processing conditions in effect when the recognition result of that segment was selected; and the input conditions of the voice input unit 10. The speech recognition processing conditions can include the recognition result (not shown) of that utterance segment of that voice data together with its acoustic score, language score, confidence, and the like. The input conditions of the voice input unit 10 can include the input power level, the S/N ratio, and the like.
 For each utterance segment (or recognition target segment) of each set of voice data, acoustic information such as power and S/N ratio, together with information obtained during analysis, can be sent from the speech segment adjustment unit 212 to the condition storage unit 210 and stored. In this embodiment a selection flag is assigned per utterance segment (recognition target segment), but, as described above, selection is also possible in units shorter than an utterance segment, such as the word level; a flag can therefore be assigned at the selected unit, for example the word level, and stored in the condition storage unit 210.
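As an illustration of what one row of the FIG. 5 table might look like in code, the following is a hypothetical sketch; the field names are assumptions based on the conditions the text enumerates, not identifiers from the patent.

```python
from dataclasses import dataclass

@dataclass
class ConditionRecord:
    """One entry of the condition store: per voice-data stream, per
    utterance (or recognition target) segment. Purely illustrative."""
    stream_id: int          # which voice data d1..dn
    segment_id: int         # which utterance / recognition target segment
    selected: bool          # selection flag: was this result adopted?
    acoustic_score: float   # speech recognition processing conditions
    language_score: float
    confidence: float
    power_level: float      # input conditions of the voice input unit
    snr: float              # signal-to-noise ratio
```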
 Returning to FIG. 4, the recognition result selection/integration unit 214 refers to the condition storage unit 210 and selects recognition results in consideration of the input conditions or speech recognition processing conditions stored there. The speech segment adjustment unit 212 may likewise refer to the condition storage unit 210 and detect and adjust the utterance segments in consideration of the stored input conditions.
 For example, based on the results stored in the condition storage unit 210 for earlier speech segments, a threshold may be applied so that audio whose power falls below a certain value is not treated as a speech segment. It is also possible to estimate, from the power, the S/N ratio, and various scores such as the language score and acoustic score, whether the word currently undergoing selection among the multiple recognition results is likely to be selected; the recognition result selection/integration unit 214 can then select the recognition result with that information factored in as a weight.
 As another example, the condition storage unit 210 may store a discriminative model that identifies whether a given utterance segment (or recognition target segment, word, phrase, etc.) or recognition result was selected or rejected. That is, a base discriminative model is trained in advance on voice data different from the input voice (given as supervision) and stored in the condition storage unit 210. When speech is input, the speech segment adjustment unit 212 uses the discriminative model stored in the condition storage unit 210 to obtain, from various feature values derived from the input speech, a decision on whether to select or reject the utterance segment (or recognition target segment, word, phrase, etc.), or a score obtained from the discriminative model. The speech segment adjustment unit 212 then adjusts the speech segments based on that result.
 Furthermore, the recognition result selection/integration unit 214 uses the discriminative model stored in the condition storage unit 210 to obtain, based on the various feature values and scores, a decision on whether to select or reject a recognition result (or a score obtained from the discriminative model), and selects or rejects recognition results using that result. The discriminative model could also be updated incrementally by adding the final speech segment adjustment results and recognition results.
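A minimal sketch of this accept/reject model follows, assuming scikit-learn's LogisticRegression as a stand-in discriminative model (the patent does not name a particular model) and the per-segment features recorded above; the training values are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: per-segment feature vectors
# [power_level, snr, acoustic_score, language_score, confidence]
# with labels 1 = segment/result was selected, 0 = rejected.
X_train = np.array([[62.0, 18.0, -1200.0, -310.0, 0.91],
                    [40.0,  4.0, -2100.0, -520.0, 0.35],
                    [58.0, 15.0, -1350.0, -340.0, 0.84],
                    [38.0,  3.0, -2300.0, -600.0, 0.22]])
y_train = np.array([1, 0, 1, 0])

# Base model learned in advance from data other than the input speech.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# At run time, score a new segment's features; the probability can serve
# as the select/reject decision or as a weight in the majority vote.
x_new = np.array([[55.0, 12.0, -1500.0, -380.0, 0.70]])
print(model.predict(x_new), model.predict_proba(x_new)[0, 1])
```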
 Although the configuration here has the speech segment adjustment unit 212 and the recognition result selection/integration unit 214 refer to the condition storage unit 210, the invention is not limited to this; another determination unit (not shown) may refer to the condition storage unit 210 and determine whether the speech segment adjustment unit 212 or the recognition result selection/integration unit 214 needs to take the recorded conditions into account, and notify the relevant unit of the necessary conditions when it does.
 As described above, the speech recognition apparatus 200 of this embodiment can be realized by a computer.
 The computer program of this embodiment is written so as to cause a computer implementing the speech recognition apparatus 200 to execute, in addition to the procedures of the computer program of the above embodiment, a procedure for recording in the condition storage unit 210, for each utterance segment (or recognition target segment), the input conditions of the voice data when a speech recognition result was selected or not selected, and a procedure for referring to the condition storage unit 210 and adjusting the utterance segments in consideration of the input conditions of the plurality of input voice data.
 The computer program of this embodiment is also written so as to cause the computer implementing the speech recognition apparatus 200 to execute a procedure for recording in the condition storage unit 210, for each recognition target segment, the speech recognition processing conditions in the speech recognition unit 102 when a speech recognition result was selected or not selected, and a procedure for referring to the condition storage unit 210 and selecting recognition results for each recognition target segment in consideration of the speech recognition processing conditions.
 The operation of the speech recognition system of this embodiment configured as described above is explained below.
 FIG. 6 is a flowchart showing an example of the operation of the speech recognition system of this embodiment.
 In the speech recognition system of this embodiment, the speech recognition apparatus 200 performs steps S101, S105, and S109, which are the same as in the flowchart of the above embodiment in FIG. 3, and additionally steps S203 to S208.
 First, the speech segment adjustment unit 212 of the speech recognition apparatus 200 receives the voice data d1, d2, ..., dn collected by the plurality of voice input units 10 under respectively different recording conditions (step S101). The speech segment adjustment unit 212 then detects the utterance segments of each set of voice data and adjusts them against one another so that each contains the same utterance (step S203). At this time, the speech segment adjustment unit 212 refers to the condition storage unit 210 and detects and adjusts the utterance segments in consideration of the input conditions.
 The speech segment adjustment unit 212 then records the input conditions in the condition storage unit 210 for each set of voice data and each utterance segment (or recognition processing segment) (step S204). The speech recognition unit 102 performs recognition on the plurality of voice data output from the speech segment adjustment unit 212, one recognition processing segment at a time (step S105). As a result, the recognition results corresponding to the recognition processing segments of the voice data are output from the speech recognition unit 102 to the recognition result selection/integration unit 214. The recognition result selection/integration unit 214 compares the speech recognition results for each recognition processing segment and selects the optimum one among them (step S207). At this time, the recognition result selection/integration unit 214 refers to the condition storage unit 210 and selects the recognition result in consideration of the input conditions or the speech recognition processing conditions.
 The recognition result selection/integration unit 214 then appends to the condition storage unit 210 the speech recognition processing conditions of each utterance segment of each set of voice data, together with a selection flag indicating whether the voice data of that segment was adopted (step S208). The recognition result selection/integration unit 214 then integrates the recognition results selected for the recognition processing segments and outputs them as the recognition result T for the series of voice data (step S109).
 As explained above, the speech recognition system of this embodiment provides the same effects as the above embodiment and, because the speech recognition processing conditions of voice data selected or not selected in the past are taken into account when selecting a speech recognition result, the processing can reflect the tendencies of the recording conditions that differ with the circumstances of each venue, making it possible to improve recognition accuracy.
 Embodiments of the present invention have been described above with reference to the drawings, but these are illustrations of the present invention, and various configurations other than the above can also be adopted.
 While the present invention has been described with reference to embodiments and examples, the present invention is not limited to the above embodiments and examples. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
 This application claims priority based on Japanese Patent Application No. 2010-076195 filed on March 29, 2010, the entire disclosure of which is incorporated herein.

Claims (19)

  1.  A speech recognition apparatus comprising:
     speech recognition means for recognizing each of a plurality of voice data obtained by inputting a speaker's uttered speech under different recording conditions; and
     recognition result selection means for comparing a plurality of speech recognition results obtained by the speech recognition means and selecting an optimum one.
  2.  The speech recognition apparatus according to claim 1, further comprising
     speech segment adjustment means for accepting input of a plurality of series of the voice data, detecting each utterance segment of each of the plurality of series of the voice data, and adjusting the utterance segments so that the same utterance is included across the plurality of series of the voice data,
     wherein the speech recognition means performs speech recognition processing on the same utterance of the plurality of series of the voice data adjusted by the speech segment adjustment means and outputs a plurality of speech recognition results corresponding to the same utterance, and
     the recognition result selection means performs comparison and selection among the plurality of the speech recognition results corresponding to the same utterance of the plurality of series of the voice data output from the speech recognition means, and integrates them for output as one optimum speech recognition result.
  3.  The speech recognition apparatus according to claim 1 or 2, wherein the recognition result selection means compares the plurality of the speech recognition results and selects the one for which more similar results are obtained.
  4.  The speech recognition apparatus according to any one of claims 1 to 3, wherein the recognition result selection means selects the optimum result based on recognition result information obtained when the voice data is subjected to speech recognition processing by the speech recognition means.
  5.  The speech recognition apparatus according to claim 4, wherein the recognition result information is an acoustic score, a language score, or a confidence.
  6.  The speech recognition apparatus according to claim 5, wherein, when the recognition result selection means takes a majority vote to select the result for which more similar results are obtained, the recognition result information is used as a weight for the speech recognition results.
  7.  The speech recognition apparatus according to claim 5 or 6, wherein, when the recognition result selection means takes a majority vote to select the result for which more similar results are obtained, whether to adopt a speech recognition result is decided by a threshold on the recognition result information.
  8.  The speech recognition apparatus according to any one of claims 2 to 7, further comprising:
     a processing condition storage unit that stores, for each speech recognition processing unit processed by the speech recognition means, the speech recognition processing conditions of the speech recognition means when the plurality of the speech recognition results were obtained; and
     processing condition recording means for recording in the processing condition storage unit, for each speech recognition processing unit, the speech recognition processing conditions in the speech recognition means when a speech recognition result was selected, or not selected, by the recognition result selection means,
     wherein the recognition result selection means refers to the processing condition storage unit and selects a speech recognition result for each speech recognition processing unit in consideration of the speech recognition processing conditions.
  9.  The speech recognition apparatus according to any one of claims 1 to 8, wherein the speech recognition means performs speech recognition processing on the plurality of the voice data under the same speech recognition processing conditions.
  10.  The speech recognition apparatus according to any one of claims 1 to 9, wherein the plurality of the voice data are respectively collected and input by a plurality of voice input devices.
  11.  A speech recognition system comprising:
     a plurality of voice input means for inputting speech under respectively different recording conditions;
     speech recognition means for recognizing each of a plurality of voice data input from the voice input means; and
     recognition result selection means for comparing a plurality of speech recognition results obtained by the speech recognition means and selecting an optimum one.
  12.  A data processing method of a speech recognition apparatus that recognizes voice data, wherein the speech recognition apparatus:
     recognizes each of a plurality of voice data input under different recording conditions; and
     compares a plurality of speech recognition results obtained by the speech recognition and selects an optimum one.
  13.  The data processing method of a speech recognition apparatus according to claim 12, wherein the speech recognition apparatus:
     accepts input of a plurality of series of the voice data, detects each utterance segment of each of the plurality of series of the voice data, and adjusts the utterance segments so that the same utterance is included across the plurality of series of the voice data;
     performs speech recognition processing on the same utterance of the adjusted plurality of series of the voice data and outputs a plurality of speech recognition results corresponding to the same utterance; and
     performs comparison and selection among the plurality of the speech recognition results corresponding to the same utterance of the plurality of series of the voice data, and integrates them for output as one optimum speech recognition result.
  14.  The data processing method of a speech recognition apparatus according to claim 12 or 13, wherein the speech recognition apparatus selects the optimum result based on recognition result information obtained when the voice data is subjected to speech recognition processing.
  15.  The data processing method of a speech recognition apparatus according to any one of claims 12 to 14, wherein the speech recognition apparatus compares the plurality of the speech recognition results and selects the one for which more similar results are obtained.
  16.  The data processing method of a speech recognition apparatus according to any one of claims 13 to 15, wherein the speech recognition apparatus:
     comprises a processing condition storage unit that stores, for each speech recognition processing unit processed by the speech recognition means, the speech recognition processing conditions of the speech recognition means when the plurality of the speech recognition results were obtained;
     stores in the processing condition storage unit, for each speech recognition processing unit, the speech recognition processing conditions at the time of the speech recognition when a speech recognition result was selected or not selected; and
     refers to the processing condition storage unit and selects a speech recognition result for each speech recognition processing unit in consideration of the speech recognition processing conditions.
  17.  The data processing method of a speech recognition apparatus according to any one of claims 12 to 16, wherein the speech recognition apparatus performs speech recognition processing on the plurality of the voice data under the same speech recognition processing conditions.
  18.  The data processing method of a speech recognition apparatus according to any one of claims 12 to 17, wherein the plurality of the voice data are respectively collected and input by a plurality of voice input devices.
  19.  A computer program for realizing a speech recognition apparatus that recognizes voice data, the computer program causing a computer to execute:
     a procedure for recognizing each of a plurality of voice data input under different recording conditions; and
     a procedure for comparing a plurality of speech recognition results obtained by the speech recognition and selecting an optimum one.
PCT/JP2011/001826 2010-03-29 2011-03-28 Voice-recognition system, device, method and program WO2011121978A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2012508079A JPWO2011121978A1 (en) 2010-03-29 2011-03-28 Speech recognition system, apparatus, method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2010-076195 2010-03-29
JP2010076195 2010-03-29

Publications (1)

Publication Number Publication Date
WO2011121978A1 true WO2011121978A1 (en) 2011-10-06

Family

ID=44711741

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2011/001826 WO2011121978A1 (en) 2010-03-29 2011-03-28 Voice-recognition system, device, method and program

Country Status (2)

Country Link
JP (1) JPWO2011121978A1 (en)
WO (1) WO2011121978A1 (en)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6129896A (en) * 1984-07-20 1986-02-10 日本電信電話株式会社 Word voice recognition equipment
JPH02178699A (en) * 1988-12-28 1990-07-11 Nec Corp Voice recognition device
JPH0683388A (en) * 1992-09-04 1994-03-25 Fujitsu Ten Ltd Speech recognition device
JP3017118B2 (en) * 1997-02-20 2000-03-06 日本電気ロボットエンジニアリング株式会社 Voice recognition device with recognition result selection function using multiple microphones
JP3903738B2 (en) * 2001-05-23 2007-04-11 日本電気株式会社 Information recording / retrieval apparatus, method, program, and recording medium
JP2003140691A (en) * 2001-11-07 2003-05-16 Hitachi Ltd Voice recognition device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000148185A (en) * 1998-11-13 2000-05-26 Matsushita Electric Ind Co Ltd Recognition device and method
WO2008096582A1 (en) * 2007-02-06 2008-08-14 Nec Corporation Recognizer weight learning device, speech recognizing device, and system
JP2008250059A (en) * 2007-03-30 2008-10-16 Advanced Telecommunication Research Institute International Voice recognition device, voice recognition system and voice recognition method

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013154010A1 (en) * 2012-04-09 2013-10-17 クラリオン株式会社 Voice recognition server integration device and voice recognition server integration method
US9524718B2 (en) 2012-04-09 2016-12-20 Clarion Co., Ltd. Speech recognition server integration device that is an intermediate module to relay between a terminal module and speech recognition server and speech recognition server integration method
KR101736109B1 (en) * 2015-08-20 2017-05-16 현대자동차주식회사 Speech recognition apparatus, vehicle having the same, and method for controlling thereof
US9704487B2 (en) 2015-08-20 2017-07-11 Hyundai Motor Company Speech recognition solution based on comparison of multiple different speech inputs
CN109473096A (en) * 2017-09-08 2019-03-15 北京君林科技股份有限公司 A kind of intelligent sound equipment and its control method

Also Published As

Publication number Publication date
JPWO2011121978A1 (en) 2013-07-04

Similar Documents

Publication Publication Date Title
US9354687B2 (en) Methods and apparatus for unsupervised wakeup with time-correlated acoustic events
US9514747B1 (en) Reducing speech recognition latency
US20180061396A1 (en) Methods and systems for keyword detection using keyword repetitions
JP2023041843A (en) Voice section detection apparatus, voice section detection method, and program
JP6812843B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
US8751230B2 (en) Method and device for generating vocabulary entry from acoustic data
CN112074901A (en) Speech recognition login
EP2388778B1 (en) Speech recognition
US20090119103A1 (en) Speaker recognition system
US9335966B2 (en) Methods and apparatus for unsupervised wakeup
US20140156276A1 (en) Conversation system and a method for recognizing speech
US20220343895A1 (en) User-defined keyword spotting
US9031841B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
US9460714B2 (en) Speech processing apparatus and method
TW202223877A (en) User speech profile management
WO2011121978A1 (en) Voice-recognition system, device, method and program
CN109155128B (en) Acoustic model learning device, acoustic model learning method, speech recognition device, and speech recognition method
EP3195314B1 (en) Methods and apparatus for unsupervised wakeup
KR20120046627A (en) Speaker adaptation method and apparatus
KR101283271B1 (en) Apparatus for language learning and method thereof
Yella et al. Information bottleneck based speaker diarization of meetings using non-speech as side information
KR100622019B1 (en) Voice interface system and method
KR20140035164A (en) Method operating of speech recognition system
KR20200129007A (en) Utterance verification device and method
KR102661005B1 (en) Method and Device for speaker's sound separation from a multi-channel speech signals of multiple speaker

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 11762226; Country of ref document: EP; Kind code of ref document: A1)

WWE Wipo information: entry into national phase (Ref document number: 2012508079; Country of ref document: JP)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 11762226; Country of ref document: EP; Kind code of ref document: A1)