WO2011121978A1 - Voice-recognition system, device, method and program - Google Patents

Voice-recognition system, device, method and program

Info

Publication number
WO2011121978A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
recognition
speech recognition
speech
data
Prior art date
Application number
PCT/JP2011/001826
Other languages
French (fr)
Japanese (ja)
Inventor
祐 北出
Original Assignee
NEC Corporation (日本電気株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corporation (日本電気株式会社)
Priority to JP2012508079A priority Critical patent/JPWO2011121978A1/en
Publication of WO2011121978A1 publication Critical patent/WO2011121978A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/32 - Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/01 - Assessment or evaluation of speech recognition systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0631 - Creating reference templates; Clustering

Definitions

  • the present invention relates to a voice recognition system, apparatus, method, and program, and more particularly, to a voice recognition system, apparatus, method, and program using a plurality of voice data.
  • An example of a speech recognition apparatus with a recognition result selection function using a plurality of microphones is described in Patent Document 1 (Japanese Patent Laid-Open No. 10-232691).
  • The speech recognition apparatus of Patent Document 1 comprises a microphone attached to the speaker's body at a position not fixed relative to the mouth, a recognition unit that recognizes the voice signal input from the microphone and outputs a recognition result, and a comprehensive processing unit that compares the recognition results output from the recognition unit and selects and outputs the recognition result with the highest accuracy.
  • With this configuration, voice input can be performed even if the speaker's posture changes.
  • As a value indicating the accuracy of a recognition result, the distance between the speaker's mouth and the microphone is used, and the recognition result is selected according to this accuracy.
  • An object of the present invention is to provide a speech recognition system, apparatus, method, and program that solve the above-described problem of degraded speech recognition accuracy.
  • The speech recognition apparatus of the present invention comprises voice recognition means for recognizing each of a plurality of voice data obtained by inputting a speaker's utterance under different recording conditions, and recognition result selection means for comparing the plurality of speech recognition results obtained by the voice recognition means and selecting an optimum one.
  • The speech recognition system of the present invention comprises a plurality of voice input means for inputting voice under mutually different recording conditions, voice recognition means for recognizing each of the plurality of voice data input from the voice input means, and recognition result selection means for comparing the plurality of speech recognition results obtained by the voice recognition means and selecting an optimum one.
  • The data processing method of the speech recognition apparatus of the present invention is a data processing method of a speech recognition apparatus that recognizes voice data, in which the speech recognition apparatus recognizes each of a plurality of voice data input under different recording conditions, compares the plurality of speech recognition results obtained, and selects an optimum one.
  • The computer program of the present invention is a computer program for realizing a speech recognition apparatus that recognizes voice data, and causes a computer to execute a procedure for recognizing each of a plurality of voice data input under different recording conditions, and a procedure for comparing the plurality of speech recognition results obtained and selecting an optimum one.
  • The various components of the present invention do not necessarily have to exist independently of one another: a plurality of components may be formed as a single member, a single component may be formed of a plurality of members, a certain component may be part of another component, part of a certain component may overlap part of another component, and so on.
  • Although the data processing method and the computer program of the present invention describe a plurality of procedures in order, the described order does not limit the order in which the procedures are executed. When implementing the data processing method and computer program of this invention, the order of the procedures can therefore be changed within a range that does not affect the content.
  • Furthermore, the plurality of procedures of the data processing method and computer program of the present invention are not limited to being executed at mutually different timings. Another procedure may start during the execution of a certain procedure, or the execution timing of one procedure may partly or wholly overlap that of another.
  • According to the present invention, a speech recognition system, apparatus, method, and program that improve speech recognition accuracy are provided.
  • FIG. 1 is a functional block diagram showing a configuration of a speech recognition system according to an embodiment of the present invention.
  • As shown in the figure, the speech recognition apparatus 100 includes a speech recognition unit 102 that recognizes each of a plurality of voice data d1, d2, ..., dn (where n is a natural number) obtained by inputting a speaker's utterance under different recording conditions, and a recognition result selection unit 104 that compares the plurality of speech recognition results t1, t2, ..., tn obtained by the speech recognition unit 102 and selects an optimum one.
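To make the division of labor concrete, here is a minimal sketch in Python of the apparatus of FIG. 1; the function name and the assumption that a recognizer returns a (text, confidence) pair are illustrative, not from the patent.

```python
# A minimal sketch, assuming a recognizer that returns a (text, confidence)
# pair per stream; names here are illustrative, not from the patent.
from typing import Callable, List, Tuple

def recognize_and_select(
    streams: List[bytes],
    recognize: Callable[[bytes], Tuple[str, float]],
) -> str:
    # Speech recognition unit 102: recognize every input stream d1..dn.
    results = [recognize(d) for d in streams]  # t1..tn
    # Recognition result selection unit 104: compare and keep the optimum.
    best_text, _ = max(results, key=lambda r: r[1])
    return best_text
```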
  • In this embodiment, the speech recognition apparatus 100 includes, for example, a CPU (Central Processing Unit), memory, hard disk, and communication device (not shown), and can be realized by a server computer or personal computer connected to input devices such as a keyboard and mouse and output devices such as a display and printer, or by an equivalent device. Each function of each unit can be realized by the CPU reading the program stored on the hard disk into memory and executing it.
  • Each component of the speech recognition apparatus 100 is realized by an arbitrary combination of hardware and software, centered on the CPU and memory of an arbitrary computer, a program loaded into the memory that realizes the components shown in the figure, a storage unit such as a hard disk that stores the program, and a network connection interface. It will be understood by those skilled in the art that there are various modifications to the implementation method and apparatus. Each figure described below shows functional blocks, not a hardware configuration.
  • the voice recognition system recognizes and automatically records the voice of a speaker in a conference or lecture.
  • Conferences and lectures are held at various venues and in various facilities and environments. In many cases, the existing audio equipment is used at the venue. Therefore, there are a wide variety of acoustic devices such as microphones, amplifiers, and mixers, and there are countless combinations thereof.
  • Depending on the venue and speaker, voice recognition accuracy is not stable when, for example, temporary noise occurs or the speaker changes.
  • When a fixed microphone such as a stand microphone or boundary microphone is used, if the speaker moves while speaking, the distance to the microphone increases, making it difficult to pick up the speaker's voice.
  • A configuration that solves the problem of speaker movement by attaching a pin microphone to the speaker's chest is also conceivable.
  • However, the microphone may then come into contact with clothing or the body and pick up noise. That is, in normal speech the optimum input device may be a stand microphone, while when the speaker moves it may become a pin microphone: the optimum microphone can change dynamically. Thus, there is a problem that voice recognition accuracy is not stable when the situation changes midway.
  • To solve such problems, the speech recognition system of the present invention compares a plurality of recognition results obtained from voice data input under a plurality of different recording conditions, selects the optimum one, and outputs it as the recognition result.
  • For example, a plurality of types of microphones are prepared; if the microphones are of the same type, settings such as the input level are made to differ in advance.
  • When existing equipment is used, if a plurality of microphones are already set differently, they can be applied as they are.
  • Considering speaker movement, microphones are preferably installed in advance at the places where the speaker is expected to move; for a lecture, for example, in front of the whiteboard in addition to the stage where the speaker talks.
  • a hand microphone or the like may be prepared for questions from listeners at the venue.
  • Even when multiple microphones are prepared under the same recording conditions, for example the same type of microphone set to the same input level, the situation may change midway, such as a microphone failing or noise occurring, as described above.
  • The speech recognition system of the present invention can be applied even when the recording conditions end up differing for each microphone as a result of such changes.
  • In the present embodiment, the voice data input devices may be those already present at the venue, or input devices provided as part of the speech recognition system. That is, according to the speech recognition system of the present invention, the accuracy of speech recognition can be improved regardless of what kinds of voice input devices are prepared and how they are combined.
  • Recording conditions are the various conditions under which a speaker's voice is recorded with a microphone, and are of two kinds: those fixed in advance before use, and those that change with the situation during use. Examples of the former include the microphone type, installation location, input level, sensitivity, correction processing method, and stationary noise such as air conditioning; examples of the latter include the speaker (voice volume, gender, etc.), the distance between the sound source or speaker and the microphone, the ambient noise level, and the microphone's input level or sensitivity (when they change due to failure, etc.).
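As an illustration only, the two kinds of recording conditions could be captured in a record like the following; every field name is a hypothetical choice, not from the patent.

```python
# Hypothetical record of the two kinds of recording conditions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecordingCondition:
    # Fixed in advance before use
    mic_type: str                       # e.g. "stand", "boundary", "pin", "hand"
    location: str                       # e.g. "stage", "whiteboard front"
    input_level: float
    sensitivity: float
    correction: Optional[str] = None    # correction processing method
    stationary_noise_db: float = 0.0    # e.g. air conditioning
    # Changing with the situation during use
    speaker: Optional[str] = None       # voice volume, gender, etc.
    mic_distance_m: Optional[float] = None
    ambient_noise_db: Optional[float] = None
```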
  • the speech recognition apparatus 110 includes a speech segment adjustment unit 112, a speech recognition unit 102, and a recognition result selection integration unit 114.
  • The speech recognition apparatus 110 differs from the speech recognition apparatus 100 in that the speech segment adjustment unit 112 detects the utterance sections of each voice data, and the recognition result selection integration unit 114 integrates the recognition results selected for each utterance section before output.
  • The speech segment adjustment unit 112 receives a plurality of series of voice data d1, d2, ..., dn as input and detects the utterance sections of each series. It then adjusts the utterance sections so that the same utterance is included across the plurality of series of voice data.
  • Here, an "utterance section" means a section, detected (or automatically detected) by the speech segment adjustment unit 112 from a series of input voice data, that contains voice data actually spoken by the speaker. The subsequent speech recognition unit executes speech recognition with one utterance section as one processing unit. That is, the speech segment adjustment unit 112 adjusts each segment of voice data to be recognized so that it covers the same section across the plurality of voice data (sections whose start time and end time are the same; hereinafter the start and end times are called the "start/end times").
  • For example, suppose the speech segment adjustment unit 112 detects utterance sections DS11, DS12, ..., DS1a (where a is a natural number) from the first series of voice data d1, utterance sections DS21, DS22, ..., DS2b (where b is a natural number) from the second series of voice data d2, and utterance sections DSn1, DSn2, ..., DSnc (where c is a natural number) from the n-th series of voice data dn. The utterance sections are not shown in the figure.
  • The speech segment adjustment unit 112 adjusts the utterance sections so that the first utterance section DS11 of the first series d1, the first utterance section DS21 of the second series d2, and the first utterance section DSn1 of the n-th series dn contain the same utterance. Similarly, the sections are adjusted so that the second utterance section DS12 of d1, the second utterance section DS22 of d2, and the second utterance section DSn2 of dn contain the same utterance, and the recognition target section is determined. The remaining utterance sections are adjusted in the same way.
  • For example, if the first utterance section DS21 of the second voice data d2 is detected shorter than the first utterance sections of the first voice data d1 and the n-th voice data dn, DS21 is lengthened to match the first utterance sections of the other voice data. In this way, when the utterance section of one voice data is detected shorter than those of the others, or the utterance sections are misaligned, the start/end times of the utterance sections are adjusted so that the plurality of voice data are synchronized.
  • Depending on the detection results, a plurality of utterance sections in one voice data may correspond to a single utterance section in another. For example, consider the case where the first utterance section DS11 of the first series d1 runs from the 1st to the 4th second, the first utterance section DS21 of the second series d2 runs from the 1st to the 2nd second, and the second utterance section DS22 of d2 runs from the 2nd to the 4th second. In this case, DS11 of d1 and DS21 and DS22 of d2 are adjusted to form the same utterance section, and the recognition target section after adjustment runs from the 1st to the 4th second.
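A minimal sketch of one way to compute such common recognition target sections, assuming utterance sections are given as (start, end) times in seconds; merging overlapping intervals across streams by union is an assumption consistent with the worked example above, not the patent's prescribed method.

```python
# Merge the utterance intervals detected on every stream into common
# recognition target sections by taking the union of overlapping intervals.
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def align_sections(per_stream: List[List[Interval]]) -> List[Interval]:
    intervals = sorted(iv for stream in per_stream for iv in stream)
    merged: List[Interval] = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:   # overlaps the previous section
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# The worked example above: DS11 = 1-4 s on d1; DS21 = 1-2 s and
# DS22 = 2-4 s on d2 collapse into one section from 1 s to 4 s.
print(align_sections([[(1.0, 4.0)], [(1.0, 2.0), (2.0, 4.0)]]))  # [(1.0, 4.0)]
```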
  • The speech recognition unit 102 performs speech recognition on the same recognition target sections of the plurality of series of voice data d1, d2, ..., dn (the first recognition target sections DS′11, DS′21, ..., DS′n1, through the m-th recognition target sections DS′1m, DS′2m, ..., DS′nm, where m is a natural number), and outputs the plurality of speech recognition results corresponding to each same recognition target section.
  • Alternatively, speech recognition may be performed in units of utterance sections, and after recognition the results may be aligned to the adjusted recognition target sections.
  • The recognition result selection integration unit 114 compares, for each recognition target section, the plurality of speech recognition results t1, t2, ..., tn output for the same recognition target sections of the plurality of series of voice data (the first recognition target sections DS′11, DS′21, ..., DS′n1, through the m-th recognition target sections DS′1m, DS′2m, ..., DS′nm), and selects the optimum one for each recognition target section. The recognition result selection integration unit 114 then integrates the speech recognition results selected for each recognition target section and outputs them as the speech recognition result T of the series of voice data. For example, the speech recognition result of DS′11 may be selected for the first recognition target section, and the speech recognition result of DS′22 for the second.
  • Here, the speech recognition unit 102 can process the plurality of voice data d1, d2, ..., dn under the same speech recognition processing conditions; that is, the same language model and dictionary can be used.
  • Sound is collected by a plurality of voice input units 10 (U1, U2, ..., Un), and the corresponding plurality of series of voice data d1, d2, ..., dn are input respectively.
  • the voice input unit 10 can be various types of microphones, for example, a stand microphone, a boundary microphone, a pin microphone, a hand microphone, or the like.
  • Microphones can be installed right in front of the speaker, for example near the mouth, or on the speaker's chest as with a pin microphone, or at a position away from the speaker.
  • Possible uses also include installing a microphone where the speaker may move, such as in front of the whiteboard, or using a wireless pin microphone or hand microphone while moving, without fixing the installation location.
  • the multiple audio input units 10 have different recording conditions. These recording conditions may be set by the recording condition setting unit 20. For example, the type and location of the microphone may be different, and the sound input level, sensitivity, correction processing method, and the like of each microphone may be different.
  • The microphone, amplifier, or mixer serving as the voice input unit 10 may be adjusted according to setting values stored in a setting storage unit (not shown) of the recording condition setting unit 20, or may be set automatically by a setting adjustment device (not shown) of the recording condition setting unit 20.
  • the microphone, amplifier, or mixer can be adjusted manually by the user according to the recording conditions and the situation of each venue or speaker.
  • The recognition result selection integration unit 114 compares each of the plurality of speech recognition results corresponding to recognition target sections containing the same utterance of the plurality of series of voice data d1, d2, ..., dn output from the speech recognition unit 102, selects an optimum one for each recognition target section, integrates the speech recognition results selected for each section, and outputs them as the speech recognition result T of a series of voice data.
  • Let the plurality of speech recognition results corresponding to the first recognition target sections DS′11, DS′21, ..., DS′n1, which contain the same utterance of the plurality of series of voice data d1, d2, ..., dn, be TS11, TS21, ..., TSn1; let the results corresponding to the second recognition target sections DS′12, DS′22, ..., DS′n2 be TS12, TS22, ..., TSn2; and let the results corresponding to the m-th recognition target sections DS′1m, DS′2m, ..., DS′nm be TS1m, TS2m, ..., TSnm. The speech recognition results TS11 to TSnm are not shown in the figure.
  • The recognition result selection integration unit 114 compares, for each recognition target section, the recognition results of the plurality of voice data output from the speech recognition unit 102, selects the optimum one, and outputs the combined result.
  • For example, an optimum result is selected for each recognition target section, such as the recognition result TS11 of the first voice data d1 for the first recognition target section, the recognition result TS22 of the second voice data d2 for the second recognition target section, and the recognition result TSnm of the n-th voice data dn for the m-th recognition target section.
  • The recognition result selection integration unit 114 can then integrate the recognition results selected for each recognition target section and output them as the recognition result T of the series of voice data.
  • the optimum one is selected for each recognition target section, but the present invention is not limited to this.
  • the recognition result can be selected in units shorter than one utterance section, for example, word level.
  • As a method of comparing a plurality of speech recognition results and selecting the optimum one, for example, the text data of the speech recognition results are compared with one another and a majority vote is taken, selecting the result that agrees with the most others, that is, the result most similar to the greatest number of the recognition results.
  • Alternatively, information obtained along with the recognition results, such as the acoustic score, language score, and reliability, can be used. That is, when taking a majority vote over the speech recognition results, recognition result information such as reliability can serve as a weight for each result. It is also possible to decide whether to adopt a recognition result based on a threshold on its recognition result information. These approaches may also be combined.
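A hedged sketch of such a reliability-weighted majority vote with threshold-based rejection; voting on whole section texts is a simplification, since the text notes that selection can also be done at word level.

```python
# Reliability-weighted majority vote over the recognition results of one
# recognition target section, with threshold-based rejection.
from collections import defaultdict
from typing import List, Tuple

def select_by_weighted_vote(
    results: List[Tuple[str, float]],   # (recognized text, reliability)
    threshold: float = 0.0,
) -> str:
    votes = defaultdict(float)
    for text, reliability in results:
        if reliability >= threshold:    # drop results below the threshold
            votes[text] += reliability  # reliability acts as the vote weight
    return max(votes, key=votes.get) if votes else ""

# Two streams agree, so their shared text beats one high-scoring outlier.
print(select_by_weighted_vote(
    [("meeting starts", 0.8), ("meeting starts", 0.7), ("meat in stars", 0.9)]))
```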
  • In the present embodiment, the input conditions of each voice input unit 10 are not included in the recognition result selection conditions. Regardless of the input conditions, only information obtained from the recognition results is used for comparison, and by selecting the optimum one, the speech recognition result can be kept accurate.
  • the recognition result T of the recognition result selection / integration unit 114 is output, for example, as text data, recorded in a storage unit (not shown) or a recording medium, and provided to the user.
  • The speech recognition system of the present invention can also be provided to the user as a SaaS (Software as a Service) type service.
  • the recognition result can be provided to the user so as to be browsed by referring to the web page from the user terminal via the network.
  • the recognition result can be provided to the user by downloading as necessary or distributing to a predetermined mail address designated by the user.
  • the speech recognition apparatus 110 can be realized by a computer.
  • The computer program according to the present embodiment describes, for the computer realizing the speech recognition apparatus 110, a procedure for recognizing each of a plurality of voice data input under different recording conditions, and a procedure for comparing the plurality of speech recognition results obtained and selecting an optimum one.
  • More specifically, the computer program causes the computer realizing the speech recognition apparatus 110 to execute: a procedure for accepting a plurality of series of voice data input under different recording conditions; a procedure for detecting the utterance sections of each series; a procedure for adjusting the recognition target sections so that the same utterance is included across the plurality of series; a procedure for performing speech recognition on each recognition target section containing the same utterance of the adjusted series and outputting the corresponding plurality of speech recognition results; a procedure for comparing, for each recognition target section containing the same utterance, the plurality of speech recognition results of the output voice data and selecting the optimum one; and a procedure for integrating the speech recognition results selected for each recognition target section and outputting them as the speech recognition result of a series of voice data.
  • the computer program of this embodiment may be recorded on a computer-readable storage medium.
  • the recording medium is not particularly limited, and various forms can be considered.
  • the program may be loaded from a recording medium into a computer memory, or downloaded to a computer through a network and loaded into the memory.
  • FIG. 3 is a flowchart showing an example of the operation of the speech recognition system of the present embodiment.
  • The data processing method of the speech recognition apparatus 110 is a data processing method of a speech recognition apparatus that recognizes voice data: the speech recognition apparatus 110 recognizes each of a plurality of voice data input under different recording conditions (step S105), then compares the plurality of speech recognition results obtained and selects an optimum one (step S107).
  • More specifically, the speech segment adjustment unit 112 of the speech recognition apparatus 110 receives the voice data d1, d2, ..., dn collected by the plurality of voice input units 10 under different recording conditions (step S101). The speech segment adjustment unit 112 then detects the utterance sections of each voice data and adjusts them so that the same utterance is included (step S103).
  • the speech recognition unit 102 recognizes the plurality of speech data output from the speech segment adjustment unit 112 for each speech segment (step S105).
  • the recognition result corresponding to each utterance section of the plurality of speech data is output from the speech recognition unit 102 to the recognition result selection integration unit 114.
  • the recognition result selection / integration unit 114 compares a plurality of speech recognition results for each utterance section, and selects an optimum one from them (step S107).
  • the recognition result selection / integration unit 114 integrates the recognition results for each selected utterance section, and outputs them as a series of speech data recognition results T (step S109).
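Putting the steps of FIG. 3 together, the overall flow might look as follows; this reuses align_sections and select_by_weighted_vote from the sketches above, and detect_intervals and recognize_section are hypothetical stand-ins for the speech segment adjustment unit 112 and the speech recognition unit 102.

```python
# Sketch of FIG. 3's flow; detect_intervals and recognize_section are
# injected callables standing in for units 112 and 102 of the patent.
def run_pipeline(streams, detect_intervals, recognize_section):
    per_stream = [detect_intervals(d) for d in streams]          # S101, S103
    sections = align_sections(per_stream)                        # S103
    selected = []
    for start, end in sections:
        results = [recognize_section(d, start, end) for d in streams]  # S105
        selected.append(select_by_weighted_vote(results))        # S107
    return " ".join(selected)                                    # S109: result T
```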
  • According to the speech recognition system of the present embodiment, even if some voice data have poor input conditions, the obtained speech recognition results are compared and the optimum one is selected, so the speech recognition result can be kept accurate.
  • The voice input units 10 may be of any type and any setting; as long as their settings differ from one another, the result of whichever input condition proves better can be adopted.
  • Furthermore, according to the speech recognition system of the present embodiment, the optimum result can be selected for each utterance section from a series of voice data, so even if the situation changes partway through the series, the speech recognition result of other voice data can be adopted from that point on, and the speech recognition result can be kept accurate.
  • The same applies when the situation changes midway, for example when the speaker moves away from a fixed microphone, the voice volume changes because the speaker has changed, some microphones malfunction, or noise occurs; and likewise when the speaker returns to the position of the fixed microphone, a malfunctioning microphone recovers, or the noise subsides, the accuracy of the speech recognition result can be maintained. The reason is that it is possible to switch midway to the source giving the optimum speech recognition result.
  • Furthermore, the speech recognition unit 102 can perform speech recognition on the plurality of voice data under the same recognition processing conditions, that is, the same language model and the same acoustic model. Since results recognized under the same recognition processing conditions are evaluated, the relative quality of multiple voice data with different recording conditions can easily be compared using the various recognition parameters and scores obtained from the recognition results and the speech recognition processing.
  • FIG. 4 is a functional block diagram showing the configuration of the speech recognition system according to the embodiment of the present invention.
  • The speech recognition system according to this embodiment differs from the above embodiment in that the recognition result selection integration unit 214 records the conditions at the time of speech recognition processing of the recognition result selected from the plurality of recognition results, and feeds them back as conditions for the speech segment adjustment and recognition result selection of subsequent voice data.
  • In the speech recognition system of the present embodiment, the speech recognition apparatus 200 further includes a processing condition recording unit that records, in a processing condition storage unit (condition storage unit 210), the speech recognition processing conditions under which the speech recognition unit 102 obtained the plurality of speech recognition results, for each speech recognition processing unit (utterance section or recognition processing section).
  • the recognition result selection / integration unit 214 refers to the processing condition storage unit (condition storage unit 210) and selects a speech recognition result for each speech recognition processing unit (utterance section) in consideration of the speech recognition processing conditions.
  • The speech recognition apparatus 200 includes the condition storage unit 210, which stores the input conditions of the plurality of voice data d1, d2, ..., dn for each utterance section (or recognition target section), and can further include an input condition recording unit (not shown) that stores, in the condition storage unit 210 for each utterance section (or recognition target section), the input conditions of the voice data at the time a speech recognition result is selected or not selected by the recognition result selection integration unit 214.
  • the voice segment adjustment unit 212 may refer to the condition storage unit 210 and adjust the speech segment in consideration of input conditions of a plurality of input voice data.
  • The input conditions can include, for example, the power level of the input voice data, the S/N ratio, the difference or ratio of the power level relative to other voice data, and the difference of the S/N ratio relative to other voice data.
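As one plausible reading of these input conditions, the per-frame power level and S/N ratio could be computed as follows; the decibel reference and frame handling are assumptions made for the sketch.

```python
# Frame power in dB and S/N ratio, plus a cross-stream difference, as one
# plausible reading of the input conditions named above.
import math
from typing import Sequence

def power_db(samples: Sequence[float]) -> float:
    mean_sq = sum(s * s for s in samples) / max(len(samples), 1)
    return 10.0 * math.log10(mean_sq + 1e-12)   # epsilon avoids log(0)

def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    return power_db(speech) - power_db(noise)   # speech power over noise floor

def power_diff_db(a: Sequence[float], b: Sequence[float]) -> float:
    return power_db(a) - power_db(b)            # difference with another stream
```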
  • Specifically, the speech recognition apparatus 200 of the present embodiment includes the same speech recognition unit 102 as the speech recognition apparatus 110 of the above embodiment, together with a condition storage unit 210, a speech segment adjustment unit 212, and a recognition result selection integration unit 214.
  • The condition storage unit 210 can store, for each voice data and for each utterance section (or recognition target section), the speech recognition processing conditions at the time the recognition result of that utterance section was selected, and the input conditions of the voice input unit 10.
  • the speech recognition processing conditions can include a recognition result (not shown) of the speech section of the speech data, an acoustic score, a language score, reliability, and the like.
  • the input conditions of the voice input unit 10 can include an input power level, an S / N ratio, and the like.
  • Acoustic information such as the power and S/N ratio and information obtained during analysis are sent from the speech segment adjustment unit 212 to the condition storage unit 210.
  • the selection flag is assigned to each utterance section (recognition target section).
  • As described above, selection can also be made in units shorter than an utterance section, such as the word level; in that case a flag can be assigned per selected unit, for example per word, and stored in the condition storage unit 210.
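A sketch of what one entry in the condition storage unit 210 might hold, per voice data and per utterance section; the field names and the keying scheme are hypothetical.

```python
# Hypothetical shape of one entry in the condition storage unit 210,
# keyed by stream index and by the start/end times of the section.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class SectionRecord:
    recognition_result: str
    acoustic_score: float
    language_score: float
    reliability: float
    input_power_db: float
    snr_db: float
    selected: bool = False                                      # selection flag
    word_flags: Dict[str, bool] = field(default_factory=dict)   # word-level flags

ConditionStore = Dict[Tuple[int, float, float], SectionRecord]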
  • the recognition result selection integration unit 214 refers to the condition storage unit 210 and selects a recognition result in consideration of the input condition or the speech recognition processing condition stored in the condition storage unit 210.
  • the speech interval adjustment unit 212 may detect and adjust the utterance interval by referring to the condition storage unit 210 and considering the input conditions stored in the condition storage unit 210.
  • For example, the recognition result selection integration unit 214 selects the recognition result using the stored information as a weight.
  • the condition storage unit 210 may store an identification model for identifying whether the utterance section (or recognition target section, word, phrase, etc.) or the recognition result is selected or rejected.
  • a base identification model is learned in advance using voice data different from the input voice (given as a teacher) and stored in the condition storage unit 210.
  • For example, the speech segment adjustment unit 212 uses the identification model stored in the condition storage unit 210 to acquire, based on various feature amounts obtained from the input speech, a determination of whether to select or reject an utterance section (or recognition target section, word, phrase, etc.), or a score obtained from the identification model. The speech segment adjustment unit 212 then detects and adjusts the utterance sections using this result.
  • the recognition result selection / integration unit 214 uses the identification model stored in the condition storage unit 210 to determine whether to select or reject the recognition result based on various feature quantities and scores obtained (or identification). Get the score obtained from the model. Then, the recognition result selection integration unit 214 selects and rejects the recognition result using the result. It is also conceivable to update the identification model sequentially by adding the final adjustment result and recognition result of the speech section.
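As a hedged illustration of such an identification model with sequential updates, an incrementally trained logistic classifier (here scikit-learn's SGDClassifier with partial_fit) could stand in; the feature vector [reliability, S/N ratio] and the update label are assumptions for the sketch, since in practice the final adjustment and recognition results would supply the labels.

```python
# Incremental select/reject classifier standing in for the identification
# model; SGDClassifier supports the sequential updates described above.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")        # logistic select/reject model
classes = np.array([0, 1])                    # 0 = reject, 1 = select

# Learn a base model in advance from held-out voice data (the teacher).
X_base = np.array([[0.9, 20.0], [0.2, 3.0]])  # [reliability, snr_db]
y_base = np.array([1, 0])
model.partial_fit(X_base, y_base, classes=classes)

# At run time: decide on a candidate section, read off the model's score,
# and later fold the final outcome back in as a sequential update.
x = np.array([[0.7, 12.0]])
select = bool(model.predict(x)[0])
score = float(model.decision_function(x)[0])
final_label = int(select)                     # in practice: the final result
model.partial_fit(x, np.array([final_label]))
```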
  • In the present embodiment, the speech segment adjustment unit 212 and the recognition result selection integration unit 214 are configured to refer to the condition storage unit 210 directly. However, the present invention is not limited to this: another determination unit (not shown) may refer to the condition storage unit 210, determine whether the conditions recorded there need to be considered by the speech segment adjustment unit 212 or the recognition result selection integration unit 214, and, when necessary, notify the required conditions to the speech segment adjustment unit 212 or the recognition result selection integration unit 214.
  • the speech recognition apparatus 200 of the present embodiment can be realized by a computer.
  • In addition to the procedures of the computer program of the above embodiment, the computer program of the present embodiment describes, for the computer realizing the speech recognition apparatus 200, a procedure for recording in the condition storage unit 210, for each recognition target section, the speech recognition processing conditions in the speech recognition unit 102 at the time a speech recognition result is selected or not selected, and a procedure for referring to the condition storage unit 210 and selecting the recognition result for each recognition target section in consideration of the speech recognition processing conditions.
  • FIG. 6 is a flowchart showing an example of the operation of the speech recognition system of this embodiment.
  • The operation of the speech recognition apparatus 200 includes steps S203 to S208 in addition to steps S101, S105, and S109, which are similar to those in the flowchart of the above embodiment in FIG. 3.
  • First, the speech segment adjustment unit 212 of the speech recognition apparatus 200 receives the voice data d1, d2, ..., dn collected by the plurality of voice input units 10 under different recording conditions (step S101). The speech segment adjustment unit 212 then detects the utterance sections of each voice data and adjusts them so that the same utterance is included (step S203). At this time, the speech segment adjustment unit 212 refers to the condition storage unit 210 and detects and adjusts the utterance sections in consideration of the input conditions.
  • Furthermore, the speech segment adjustment unit 212 records the input conditions in the condition storage unit 210 for each voice data and each utterance section (or recognition processing section) (step S204). The speech recognition unit 102 then recognizes the plurality of voice data output from the speech segment adjustment unit 212 for each recognition processing section (step S105). As a result, the recognition results corresponding to each recognition processing section of the plurality of voice data are output from the speech recognition unit 102 to the recognition result selection integration unit 214. The recognition result selection integration unit 214 then compares the plurality of speech recognition results for each recognition processing section and selects the optimum one (step S207). At this time, the recognition result selection integration unit 214 refers to the condition storage unit 210 and selects the recognition result in consideration of the input conditions or speech recognition processing conditions.
  • Furthermore, the recognition result selection integration unit 214 adds to the condition storage unit 210 the speech recognition processing conditions for each utterance section of each voice data and a selection flag indicating whether the voice data of that section was adopted (step S208). The recognition result selection integration unit 214 then integrates the recognition results selected for each recognition processing section and outputs them as the recognition result T of a series of voice data (step S109).
  • According to the speech recognition system of the present embodiment, in addition to the same effects as the above embodiment, the conditions of voice data selected or not selected in the past can be taken into account when selecting the speech recognition result. Processing can therefore reflect the tendencies of the different recording conditions arising from the circumstances of each venue, and recognition accuracy can be improved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed is a voice-recognition device (100) equipped with a voice-recognition unit (102) that performs voice recognition on a plurality of voice data obtained by inputting a speaker's voice under different recording conditions, and a recognition-result selection unit (104) that selects an optimum result by comparing the plurality of voice-recognition results obtained by the voice-recognition unit (102).

Description

Speech recognition system, apparatus, method, and program
 The present invention relates to a speech recognition system, apparatus, method, and program, and more particularly to a speech recognition system, apparatus, method, and program that use a plurality of voice data.
 An example of a speech recognition apparatus with a recognition result selection function using a plurality of microphones is described in Patent Document 1 (Japanese Patent Laid-Open No. 10-232691). The speech recognition apparatus of Patent Document 1 comprises a microphone attached to the speaker's body at a position not fixed relative to the mouth, which is the speaker's voice source; a recognition unit that recognizes the voice signal input from the microphone and outputs a recognition result; and a comprehensive processing unit that compares the recognition results output from the recognition unit and selects and outputs the recognition result with the highest accuracy. With this configuration, voice input can be performed even if the speaker's posture changes. As a value indicating the accuracy of a recognition result, the distance between the speaker's mouth and the microphone is used, and the recognition result is selected according to this accuracy.
Japanese Patent Laid-Open No. 10-232691
 In recent years, there has been a growing need for systems that recognize and automatically record the speech of speakers at conferences and lectures. However, conferences and lectures are held at various venues under various facilities and environments. The venue's existing audio equipment is often used, and acoustic devices such as microphones, amplifiers, and mixers are extremely diverse, with countless combinations. Moreover, when speakers change at a lecture hall, for example, recording conditions such as the audio equipment settings are generally not changed for each speaker. As a result, if a speaker's voice is too loud for the settings, a recognition result containing many errors is output; conversely, if it is too quiet, the voice sections may not be detected, and speech recognition accuracy is degraded.
 An object of the present invention is to provide a speech recognition system, apparatus, method, and program that solve the above-described problem of degraded speech recognition accuracy.
 The speech recognition apparatus of the present invention comprises:
 voice recognition means for recognizing each of a plurality of voice data obtained by inputting a speaker's utterance under different recording conditions; and
 recognition result selection means for comparing a plurality of speech recognition results obtained by the voice recognition means and selecting an optimum one.
 The speech recognition system of the present invention comprises:
 a plurality of voice input means for inputting voice under mutually different recording conditions;
 voice recognition means for recognizing each of the plurality of voice data input from the voice input means; and
 recognition result selection means for comparing a plurality of speech recognition results obtained by the voice recognition means and selecting an optimum one.
 The data processing method of the speech recognition apparatus of the present invention is a data processing method of a speech recognition apparatus that recognizes voice data, in which the speech recognition apparatus recognizes each of a plurality of voice data input under different recording conditions, compares the plurality of speech recognition results obtained, and selects an optimum one.
 The computer program of the present invention is a computer program for realizing a speech recognition apparatus that recognizes voice data, and causes a computer to execute a procedure for recognizing each of a plurality of voice data input under different recording conditions, and a procedure for comparing the plurality of speech recognition results obtained and selecting an optimum one.
 Any combination of the above components, and any conversion of the expression of the present invention between a method, an apparatus, a system, a recording medium, a computer program, and so on, are also effective as aspects of the present invention.
 The various components of the present invention do not necessarily have to exist independently of one another: a plurality of components may be formed as a single member, a single component may be formed of a plurality of members, a certain component may be part of another component, part of a certain component may overlap part of another component, and so on.
 Although the data processing method and the computer program of the present invention describe a plurality of procedures in order, the described order does not limit the order in which the procedures are executed. When implementing the data processing method and computer program of this invention, the order of the procedures can therefore be changed within a range that does not affect the content.
 Furthermore, the plurality of procedures of the data processing method and computer program of the present invention are not limited to being executed at mutually different timings. Another procedure may start during the execution of a certain procedure, or the execution timing of one procedure may partly or wholly overlap that of another.
 According to the present invention, a speech recognition system, apparatus, method, and program that improve speech recognition accuracy are provided.
 The above-described object and other objects, features, and advantages will become more apparent from the preferred embodiments described below and the accompanying drawings.
FIG. 1 is a functional block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing an example of the configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 3 is a flowchart showing an example of the operation of a speech recognition system according to an embodiment of the present invention.
FIG. 4 is a functional block diagram showing an example of the configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example of the structure of the condition storage unit of a speech recognition system according to an embodiment of the present invention.
FIG. 6 is a flowchart showing an example of the operation of a speech recognition system according to an embodiment of the present invention.
 Hereinafter, embodiments of the present invention will be described with reference to the drawings. In all the drawings, the same reference numerals are given to the same components, and duplicated description is omitted as appropriate.
(First embodiment)
 FIG. 1 is a functional block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.
 As shown in the figure, in the speech recognition system of the present embodiment, the speech recognition apparatus 100 includes a speech recognition unit 102 that recognizes each of a plurality of voice data d1, d2, ..., dn (where n is a natural number) obtained by inputting a speaker's utterance under different recording conditions, and a recognition result selection unit 104 that compares the plurality of speech recognition results t1, t2, ..., tn obtained by the speech recognition unit 102 and selects an optimum one.
 In this embodiment, the speech recognition apparatus 100 includes, for example, a CPU (Central Processing Unit), memory, hard disk, and communication device (not shown), and can be realized by a server computer or personal computer connected to input devices such as a keyboard and mouse and output devices such as a display and printer, or by an equivalent device. Each function of each unit can be realized by the CPU reading the program stored on the hard disk into memory and executing it.
 In the following drawings, the configuration of parts not related to the essence of the present invention is omitted and not shown.
 Each component of the speech recognition apparatus 100 is realized by an arbitrary combination of hardware and software, centered on the CPU and memory of an arbitrary computer, a program loaded into the memory that realizes the components shown in the figure, a storage unit such as a hard disk that stores the program, and a network connection interface. It will be understood by those skilled in the art that there are various modifications to the implementation method and apparatus. Each figure described below shows functional blocks, not a hardware configuration.
 The speech recognition system of this embodiment recognizes and automatically records the speech of speakers at conferences and lectures. Conferences and lectures are held at various venues under various facilities and environments, and the venue's existing audio equipment is often used. Acoustic devices such as microphones, amplifiers, and mixers are therefore extremely diverse, with countless combinations.
 Also, when speakers change at a lecture hall, for example, recording conditions such as the audio equipment settings are generally not changed for each speaker. As a result, if a speaker's voice is too loud for the settings, a recognition result containing many errors is output; conversely, if it is too quiet, the voice sections may not be detected.
 Furthermore, depending on the venue and speaker, speech recognition accuracy is not stable when, for example, temporary noise occurs or the speaker changes. Alternatively, when a fixed microphone such as a stand microphone or boundary microphone is used, if the speaker moves while speaking, the distance to the microphone increases, making it difficult to pick up the speaker's voice.
 A configuration that solves the problem of speaker movement by attaching a pin microphone to the speaker's chest is also conceivable. However, the microphone may then come into contact with clothing or the body and pick up noise. That is, in normal speech the optimum input device may be a stand microphone, while when the speaker moves it may become a pin microphone: the optimum microphone can change dynamically.
 Thus, there is a problem that speech recognition accuracy is not stable when the situation changes midway.
 To solve such problems, the speech recognition system of the present invention compares a plurality of recognition results obtained from voice data input under a plurality of different recording conditions, selects the optimum one, and outputs it as the recognition result. For example, a plurality of types of microphones are prepared; if the microphones are of the same type, settings such as the input level are made to differ in advance. Alternatively, when existing equipment is used, if a plurality of microphones are already set differently, they can be applied as they are.
 Considering speaker movement, microphones are preferably installed in advance at the places where the speaker is expected to move; for a lecture, for example, in front of the whiteboard in addition to the stage where the speaker talks. A hand microphone or the like may also be prepared for questions from listeners at the venue. Moreover, even when multiple microphones are prepared under the same recording conditions, for example the same type of microphone set to the same input level, the situation may change midway, such as a microphone failing or noise occurring, as described above. The speech recognition system of the present invention can be applied even when the recording conditions end up differing for each microphone as a result of such changes.
 In the present embodiment, the voice data input devices may be those already present at the venue, or input devices provided as part of the speech recognition system. That is, according to the speech recognition system of the present invention, the accuracy of speech recognition can be improved regardless of what kinds of voice input devices are prepared and how they are combined.
 The recording conditions are the various conditions under which a speaker's voice is recorded with a microphone, and they fall into two types: conditions fixed in advance before use, and conditions that change with the situation during use. Examples of the former include the microphone type, installation location, input level, sensitivity, correction processing method, and stationary noise such as air conditioning. Examples of the latter include the speaker (voice volume, gender, and so on), the distance between the sound source or speaker and the microphone, the ambient noise level, and the microphone's input level and sensitivity (when these change because of a failure or the like).
 Specifically, as shown in FIG. 2, in the speech recognition system of this embodiment the speech recognition apparatus 110 includes a speech segment adjustment unit 112, a speech recognition unit 102, and a recognition result selection/integration unit 114. Hereinafter, this embodiment is described taking the speech recognition apparatus 110 as an example. The speech recognition apparatus 110 differs from the speech recognition apparatus 100 in that the speech segment adjustment unit 112 detects the utterance segments of each set of voice data, and in that the recognition result selection/integration unit 114 integrates and outputs the recognition results selected for each utterance segment.
 The speech segment adjustment unit 112 receives a plurality of series of voice data d1, d2, ..., dn as input and detects the utterance segments of each series. The speech segment adjustment unit 112 then adjusts the utterance segments so that the same utterance is included across the series of voice data d1, d2, ..., dn.
 Here, an "utterance segment" means a segment, detected by the speech segment adjustment unit 112 or detected automatically, that contains the voice data of what the speaker actually uttered within the input series of voice data. The subsequent speech recognition unit then executes speech recognition with each utterance segment as one processing unit. That is, the speech segment adjustment unit 112 adjusts the segmentation so that each unit of voice data to be recognized covers the same segment across the plurality of voice data (segments whose start times and end times coincide; hereinafter the start and end times are called the "start/end times").
 For example, suppose the speech segment adjustment unit 112 detects utterance segments DS11, DS12, ..., DS1a (where a is a natural number) in the first series of voice data d1, utterance segments DS21, DS22, ..., DS2b (where b is a natural number) in the second series of voice data d2, and utterance segments DSn1, DSn2, ..., DSnc (where c is a natural number) in the n-th series of voice data dn. The utterance segments are not illustrated.
 The speech segment adjustment unit 112 then adjusts the segments so that the first utterance segment DS11 of the first series of voice data d1, the first utterance segment DS21 of the second series of voice data d2, and the first utterance segment DSn1 of the n-th series of voice data dn each contain the same utterance. Similarly, it adjusts the second utterance segment DS12 of d1, the second utterance segment DS22 of d2, and the second utterance segment DSn2 of dn so that they contain the same utterance, and determines the recognition target segment. The remaining utterance segments are adjusted in the same way.
 Specifically, suppose that, among the first utterance segments of the first voice data d1, the second voice data d2, and the n-th voice data dn, the detected first utterance segment DS21 of d2 is shorter than the first utterance segments of the other voice data. In that case, the segment is lengthened to match the first utterance segments of the other data. In other words, when differing recording conditions cause the utterance segment detected in one series of voice data to be shorter than the corresponding segments in the others, producing a misalignment, the plurality of voice data are synchronized and the start/end times of the utterance segments are adjusted.
 Note that what is one utterance segment in one series of voice data may be detected as multiple utterance segments in another. Consider, for example, the case where the first utterance segment DS11 of the first series d1 runs from second 1 to second 4, while in the second series d2 the first utterance segment DS21 runs from second 1 to second 2 and the second utterance segment DS22 runs from second 2 to second 4. In this case, the segments are adjusted so that DS11 of d1 and the combination of DS21 and DS22 of d2 form the same utterance segment, and the adjusted recognition target segment runs from second 1 to second 4.
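As a purely illustrative sketch, not part of the disclosed embodiment, one way such segment alignment could be implemented is to pool the utterance segments detected in all streams and merge any that touch or overlap, so that each merged group's overall start/end times define one shared recognition target segment. The function name and interval representation below are assumptions made for the example.

```python
def align_segments(streams):
    """Merge per-stream utterance segments into shared recognition target
    segments (a minimal sketch; intervals are (start, end) in seconds).

    streams: list of lists of (start, end) tuples, one list per
    microphone stream, e.g. [[(1.0, 4.0)], [(1.0, 2.0), (2.0, 4.0)]].
    Returns the list of (start, end) target segments common to all streams.
    """
    # Pool every detected segment from every stream and sort by start time.
    pooled = sorted(seg for stream in streams for seg in stream)
    targets = []
    for start, end in pooled:
        # Segments that touch or overlap the current group are merged,
        # so DS11 (1-4 s) absorbs DS21 (1-2 s) and DS22 (2-4 s).
        if targets and start <= targets[-1][1]:
            targets[-1][1] = max(targets[-1][1], end)
        else:
            targets.append([start, end])
    return [tuple(t) for t in targets]

print(align_segments([[(1.0, 4.0)], [(1.0, 2.0), (2.0, 4.0)]]))
# -> [(1.0, 4.0)], matching the example in the text
```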
 The speech recognition unit 102 performs speech recognition on each identical recognition target segment (the first recognition target segments DS′11, DS′21, DS′n1; the m-th recognition target segments DS′1m, DS′2m, DS′nm; and so on, where m is a natural number) of the plurality of series of voice data d1, d2, ..., dn synchronized by the speech segment adjustment unit 112, and outputs the plurality of speech recognition results corresponding to each identical recognition target segment. Alternatively, speech recognition may be performed per utterance segment, with the recognition results aligned to the adjusted recognition target segments after recognition.
 The recognition result selection/integration unit 114 compares, for each identical recognition target segment (the first recognition target segments DS′11, DS′21, DS′n1; the m-th recognition target segments DS′1m, DS′2m, DS′nm; and so on) of the series of voice data d1, d2, ..., dn, the corresponding speech recognition results t1, t2, ..., tn output from the speech recognition unit 102, and selects the optimum one for each recognition target segment. The recognition result selection/integration unit 114 then integrates the speech recognition results selected for the respective recognition target segments and outputs them as the speech recognition result T for the series of voice data. For example, the speech recognition result of DS′11 is selected for the first recognition target segment, and the speech recognition result of DS′22 is selected for the second.
 In this embodiment, the speech recognition unit 102 can perform speech recognition on the plurality of voice data d1, d2, ..., dn under the same speech recognition processing conditions; that is, the same language model, dictionary, and so on can be used.
 In this embodiment, sound is collected by a plurality of voice input units 10 (U1, U2, ..., Un), from which the series of voice data d1, d2, ..., dn are respectively input. The voice input units 10 can be microphones of various types, for example stand microphones, boundary microphones, pin microphones, or hand microphones.
 Various microphone placements are conceivable. For example, a microphone can be placed directly in front of the speaker, that is, at the mouth or, like a pin microphone, at the speaker's chest, or at a position away from the speaker. A microphone can also be placed where the speaker is likely to move, for example in front of a whiteboard, or a wireless pin or hand microphone can be used while moving, without a fixed installation location.
 The plurality of voice input units 10 each operate under different recording conditions. These recording conditions may be set by the recording condition setting unit 20. For example, the microphones may differ in type or installation location, or in voice input level, sensitivity, correction processing method, and so on.
 For example, the microphones, amplifiers, or mixers serving as the voice input units 10 may be adjusted according to setting values stored in a setting storage unit (not shown) of the recording condition setting unit 20, or may be set automatically by a setting adjustment device (not shown) of the recording condition setting unit 20. A user can also adjust the microphones, amplifiers, or mixers manually according to the recording conditions and the circumstances of the venue, the speakers, and so on.
 The recognition result selection/integration unit 114 compares the speech recognition results output from the speech recognition unit 102 for each recognition target segment containing the same utterance across the series of voice data d1, d2, ..., dn, selects the optimum result for each recognition target segment, integrates the results selected for the respective segments, and outputs them as the speech recognition result T for the series of voice data.
 For example, let TS11, TS21, ..., TSn1 be the speech recognition results corresponding to the first recognition target segments DS′11, DS′21, ..., DS′n1 containing the same utterance across the series of voice data d1, d2, ..., dn; let TS12, TS22, ..., TSn2 be those corresponding to the second recognition target segments DS′12, DS′22, ..., DS′n2; and let TS1m, TS2m, ..., TSnm be those corresponding to the m-th recognition target segments DS′1m, DS′2m, ..., DS′nm. The speech recognition results TS11 to TSnm corresponding to the respective recognition target segments are not illustrated.
 The recognition result selection/integration unit 114 compares the recognition results of the plurality of voice data output from the speech recognition unit 102 segment by segment, selects the optimum one for each recognition target segment, joins the selections together, and outputs them. For example, the recognition result TS11 of the first voice data d1 is selected for the first recognition target segment, the recognition result TS22 of the second voice data d2 for the second, and the recognition result TSnm of the n-th voice data dn for the m-th: the optimum result is selected for each recognition target segment. The recognition result selection/integration unit 114 can then integrate the recognition results selected for the respective segments and output them as the recognition result T for the series of voice data. Although in this embodiment the optimum result is selected per recognition target segment, the invention is not limited to this; recognition results can also be selected in units shorter than one utterance segment, for example at the word level.
 Various methods are conceivable for selecting recognition results in the recognition result selection/integration unit 114. One example is the ROVER method (J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)", Proceedings of the IEEE (Institute of Electrical and Electronics Engineers) Workshop on Automatic Speech Recognition and Understanding (ASRU), 1997, pp. 347-354).
 That is, the text data of the speech recognition results are compared with one another, and a majority vote is taken that selects the result obtained most often, that is, the result for which the most similar results are obtained among the plurality of recognition results, to determine the output recognition result sequence. Alternatively, information obtained together with the recognition results, such as acoustic scores, language scores, and confidence measures, can be used. That is, when taking the majority vote over the speech recognition results, recognition result information such as confidence can be used to weight each result. It is further conceivable to decide whether to adopt a recognition result by applying a threshold to its recognition result information. These approaches may also be combined.
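The following is a minimal sketch, not taken from the patent, of confidence-weighted majority voting over aligned word hypotheses in the spirit of ROVER. The word alignment step is assumed to have been done already, and the names, the mixing weight alpha, and the rejection threshold are illustrative assumptions.

```python
from collections import defaultdict

def vote(word_hypotheses, alpha=0.7, reject_below=0.2):
    """Pick one word per aligned slot by confidence-weighted majority vote
    (a sketch in the spirit of ROVER; alignment is assumed done).

    word_hypotheses: list of slots; each slot is a list of
    (word, confidence) pairs, one pair per recognizer output.
    alpha: mixing weight between relative vote count and mean confidence.
    reject_below: slots whose best score falls below this are dropped,
    i.e. threshold-based adoption of a result.
    """
    output = []
    for slot in word_hypotheses:
        counts = defaultdict(int)
        confs = defaultdict(list)
        for word, conf in slot:
            counts[word] += 1
            confs[word].append(conf)
        n = len(slot)
        # Score each candidate word by a weighted mix of how many
        # recognizers agree on it and how confident they are on average.
        best_word, best_score = max(
            ((w, alpha * counts[w] / n
                 + (1 - alpha) * sum(confs[w]) / len(confs[w]))
             for w in counts),
            key=lambda x: x[1])
        if best_score >= reject_below:
            output.append(best_word)
    return output

slots = [[("voice", 0.9), ("voice", 0.8), ("choice", 0.4)],
         [("recognition", 0.7), ("recognition", 0.9), ("recognition", 0.6)]]
print(vote(slots))  # -> ['voice', 'recognition']
```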
 In the speech recognition system of the present invention, the input conditions of the individual voice input units 10 are not included in the criteria for selecting recognition results. By comparing only the information obtained from the recognition results, irrespective of the input conditions, and selecting the optimum one, the accuracy of the speech recognition results can be maintained.
 The recognition result T of the recognition result selection/integration unit 114 is output, for example, as text data, recorded in a storage unit (not shown) or on a recording medium, and provided to the user.
 The speech recognition system of the present invention can also be provided to users as a SaaS (Software as a Service) offering. In a SaaS system, the recognition results can be made available for browsing by referring to a web page from a user terminal over a network. The recognition results can further be provided by download on demand, or by delivery to a predetermined e-mail address specified by the user. These provision methods are not particularly limited, and various forms are conceivable.
 As described above, the speech recognition apparatus 110 of this embodiment can be realized by a computer.
 The computer program of this embodiment is written so as to cause a computer implementing the speech recognition apparatus 110 to execute a procedure for recognizing each of a plurality of voice data input under different recording conditions, and a procedure for comparing the plurality of speech recognition results thus obtained and selecting the optimum one.
 The computer program of this embodiment is further written so as to cause a computer implementing the speech recognition apparatus 110 to execute: a procedure for accepting input of a plurality of series of the voice data recorded under different recording conditions and detecting the utterance segments of each series; a procedure for adjusting the recognition target segments so that they contain the same utterance across the series of voice data; a procedure for performing speech recognition on each recognition target segment containing the same utterance in the adjusted series of voice data and outputting the plurality of speech recognition results corresponding to each such segment; a procedure for comparing, for each recognition target segment containing the same utterance, the plurality of output speech recognition results and selecting the optimum one per segment; and a procedure for integrating the speech recognition results selected for the respective segments and outputting them as the speech recognition result for the series of voice data.
 The computer program of this embodiment may be recorded on a computer-readable storage medium. The recording medium is not particularly limited, and various forms are conceivable. The program may be loaded from a recording medium into the computer's memory, or downloaded to the computer over a network and loaded into memory.
 With the configuration described above, the data processing method performed by the speech recognition apparatus 110 of this embodiment is described below. FIG. 3 is a flowchart showing an example of the operation of the speech recognition system of this embodiment.
 The data processing method of the speech recognition apparatus 110 according to the embodiment of the present invention is a data processing method of a speech recognition apparatus that recognizes voice data, in which the speech recognition apparatus 110 recognizes each of a plurality of voice data input under different recording conditions (step S105) and compares the plurality of speech recognition results obtained to select the optimum one (step S107).
 More specifically, first, the speech segment adjustment unit 112 of the speech recognition apparatus 110 receives the voice data d1, d2, ..., dn collected by the plurality of voice input units 10 under respectively different recording conditions (step S101). The speech segment adjustment unit 112 then detects the utterance segments of each set of voice data and adjusts the segments against one another so that each contains the same utterance (step S103).
 The speech recognition unit 102 then performs recognition on the plurality of voice data output from the speech segment adjustment unit 112, one utterance segment at a time (step S105). As a result, the recognition results corresponding to the utterance segments of the voice data are output from the speech recognition unit 102 to the recognition result selection/integration unit 114. The recognition result selection/integration unit 114 compares the speech recognition results for each utterance segment and selects the optimum one among them (step S107). The recognition result selection/integration unit 114 then integrates the recognition results selected for the utterance segments and outputs them as the recognition result T for the series of voice data (step S109).
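Put together, the flow of steps S101 to S109 could look like the following sketch. This is my own illustration, not code from the patent: recognize() stands in for any recognizer run with identical models on every stream, and align_segments() and vote() are the hypothetical helpers sketched earlier.

```python
def run_pipeline(streams, detect_segments, recognize):
    """Sketch of the S101-S109 flow: segment, align, recognize each
    stream per target segment, vote per segment, then concatenate.

    streams: list of audio streams (one per microphone).
    detect_segments: stream -> list of (start, end) utterance segments.
    recognize: (stream, segment) -> list of (word, confidence) pairs.
    """
    # S103: detect per-stream utterance segments and align them.
    targets = align_segments([detect_segments(s) for s in streams])
    final_words = []
    for seg in targets:
        # S105: recognize the same target segment in every stream.
        hyps = [recognize(s, seg) for s in streams]
        # S107: per-slot vote across streams (word positions are assumed
        # already aligned, i.e. equal length across streams).
        slots = [list(slot) for slot in zip(*hyps)]
        final_words.extend(vote(slots))
    # S109: integrate the selections into one recognition result T.
    return " ".join(final_words)
```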
 As described above, according to the speech recognition system of the embodiment of the present invention, even if some of the plurality of voice data were captured under poor input conditions, the accuracy of the speech recognition results can be maintained by comparing the plurality of speech recognition results obtained and selecting the optimum one. The voice input units 10 may be of any type and use any settings; by giving them mutually different settings, a good result obtained from even one of those settings can be adopted.
 Furthermore, according to the speech recognition system of this embodiment, the optimum result can be selected for each utterance segment within a series of voice data, so even if the situation changes partway through the series, the speech recognition result of other voice data can be adopted from that point on, and the accuracy of the speech recognition results can be maintained. This applies, for example, when the speaker moves away from a fixed microphone, when the voice volume changes because one speaker is replaced by another, when some microphones malfunction, or when noise arises partway through. Likewise, accuracy can be maintained when the speaker returns to the position of the fixed microphone, when a malfunctioning microphone recovers, or when the noise subsides, because the system can switch partway through to whichever source yields the optimum speech recognition result.
 That is, by preparing a plurality of microphones with different recording conditions and, according to the situation, evaluating which microphone's voice data yields the better recognition result, selecting it, and switching, the characteristics of each microphone can be exploited effectively as the situation demands.
 Moreover, in the speech recognition system of this embodiment, the speech recognition unit 102 can process the plurality of voice data under the same recognition processing conditions, that is, with the same language model or the same acoustic model. Because results recognized under identical conditions are being evaluated, voice data recorded under different recording conditions can easily be ranked by comparing the recognition results and the various feature values and scores obtained in the speech recognition process.
(Second Embodiment)
 FIG. 4 is a functional block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.
 The speech recognition system of this embodiment differs from the above embodiment in that the recognition result selection/integration unit 214 records the conditions in effect during the speech recognition processing of the recognition result selected from among the plurality of recognition results, and feeds them back as conditions for the speech segment adjustment and recognition result selection of subsequent voice data.
 Furthermore, in the speech recognition system of this embodiment, the speech recognition apparatus 200 further includes a processing condition storage unit (condition storage unit 210) that stores, for each speech recognition processing unit (utterance segment or recognition processing segment) processed by the speech recognition unit 102, the speech recognition processing conditions of the speech recognition unit 102 under which the plurality of speech recognition results were obtained, and a processing condition recording unit that records in the processing condition storage unit (condition storage unit 210), for each speech recognition processing unit (utterance segment or recognition processing segment), the speech recognition processing conditions in the speech recognition unit 102 when a speech recognition result was selected, or not selected, by the recognition result selection/integration unit 214.
 The recognition result selection/integration unit 214 refers to the processing condition storage unit (condition storage unit 210) and selects a speech recognition result for each speech recognition processing unit (utterance segment) in consideration of the speech recognition processing conditions.
 In the speech recognition system of this embodiment, the speech recognition apparatus 200 can further include a condition storage unit 210 that stores, for each utterance segment (or recognition target segment), the input conditions under which the plurality of voice data d1, d2, ..., dn were input, and an input condition recording unit (not shown) that records in the condition storage unit 210, for each utterance segment (or recognition target segment), the input conditions of the voice data when a speech recognition result was selected, or not selected, by the recognition result selection/integration unit 214.
 The speech segment adjustment unit 212 may refer to the condition storage unit 210 and adjust the utterance segments in consideration of the input conditions of the plurality of input voice data.
 Here, the input conditions can include, for example, the power level of the input voice data, its S/N ratio, the difference or ratio of its power level relative to other voice data, or the difference of its S/N ratio relative to other voice data.
 Specifically, the speech recognition apparatus 200 of this embodiment includes the same speech recognition unit 102 as the speech recognition apparatus 110 of the above embodiment, and further includes a condition storage unit 210, a speech segment adjustment unit 212, and a recognition result selection/integration unit 214.
 As shown in FIG. 5, for example, the condition storage unit 210 can hold, for each set of voice data and further for each utterance segment (or recognition target segment): a selection flag indicating whether the recognition result of that segment of that voice data was adopted; the speech recognition processing conditions in effect when the recognition result of that segment was selected; and the input conditions of the voice input unit 10. The speech recognition processing conditions can include the recognition result (not shown) of that utterance segment of that voice data together with its acoustic score, language score, confidence, and the like. The input conditions of the voice input unit 10 can include the input power level, the S/N ratio, and the like.
 For each utterance segment (or recognition target segment) of each set of voice data, acoustic information such as power and S/N ratio, together with information obtained during analysis, can be sent from the speech segment adjustment unit 212 to the condition storage unit 210 and stored. In this embodiment a selection flag is assigned per utterance segment (recognition target segment), but, as described above, selection is also possible in units shorter than an utterance segment, such as the word level; a flag can therefore be assigned at the selected unit, for example the word level, and stored in the condition storage unit 210.
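As an illustration of what one row of the FIG. 5 table might look like in code, the following is a hypothetical sketch; the field names are assumptions based on the conditions the text enumerates, not identifiers from the patent.

```python
from dataclasses import dataclass

@dataclass
class ConditionRecord:
    """One entry of the condition store: per voice-data stream, per
    utterance (or recognition target) segment. Purely illustrative."""
    stream_id: int          # which voice data d1..dn
    segment_id: int         # which utterance / recognition target segment
    selected: bool          # selection flag: was this result adopted?
    acoustic_score: float   # speech recognition processing conditions
    language_score: float
    confidence: float
    power_level: float      # input conditions of the voice input unit
    snr: float              # signal-to-noise ratio
```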
 Returning to FIG. 4, the recognition result selection/integration unit 214 refers to the condition storage unit 210 and selects recognition results in consideration of the input conditions or speech recognition processing conditions stored there. The speech segment adjustment unit 212 may likewise refer to the condition storage unit 210 and detect and adjust the utterance segments in consideration of the stored input conditions.
 For example, based on the results stored in the condition storage unit 210 for earlier speech segments, a threshold may be applied so that audio whose power falls below a certain value is not treated as a speech segment. It is also possible to estimate, from the power, the S/N ratio, and various scores such as the language score and acoustic score, whether the word currently undergoing selection among the multiple recognition results is likely to be selected; the recognition result selection/integration unit 214 can then select the recognition result with that information factored in as a weight.
 As another example, the condition storage unit 210 may store a discriminative model that identifies whether a given utterance segment (or recognition target segment, word, phrase, etc.) or recognition result was selected or rejected. That is, a base discriminative model is trained in advance on voice data different from the input voice (given as supervision) and stored in the condition storage unit 210. When speech is input, the speech segment adjustment unit 212 uses the discriminative model stored in the condition storage unit 210 to obtain, from various feature values derived from the input speech, a decision on whether to select or reject the utterance segment (or recognition target segment, word, phrase, etc.), or a score obtained from the discriminative model. The speech segment adjustment unit 212 then adjusts the speech segments based on that result.
 Furthermore, the recognition result selection/integration unit 214 uses the discriminative model stored in the condition storage unit 210 to obtain, based on the various feature values and scores, a decision on whether to select or reject a recognition result (or a score obtained from the discriminative model), and selects or rejects recognition results using that result. The discriminative model could also be updated incrementally by adding the final speech segment adjustment results and recognition results.
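A minimal sketch of this accept/reject model follows, assuming scikit-learn's LogisticRegression as a stand-in discriminative model (the patent does not name a particular model) and the per-segment features recorded above; the training values are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: per-segment feature vectors
# [power_level, snr, acoustic_score, language_score, confidence]
# with labels 1 = segment/result was selected, 0 = rejected.
X_train = np.array([[62.0, 18.0, -1200.0, -310.0, 0.91],
                    [40.0,  4.0, -2100.0, -520.0, 0.35],
                    [58.0, 15.0, -1350.0, -340.0, 0.84],
                    [38.0,  3.0, -2300.0, -600.0, 0.22]])
y_train = np.array([1, 0, 1, 0])

# Base model learned in advance from data other than the input speech.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# At run time, score a new segment's features; the probability can serve
# as the select/reject decision or as a weight in the majority vote.
x_new = np.array([[55.0, 12.0, -1500.0, -380.0, 0.70]])
print(model.predict(x_new), model.predict_proba(x_new)[0, 1])
```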
 Although the configuration here has the speech segment adjustment unit 212 and the recognition result selection/integration unit 214 refer to the condition storage unit 210, the invention is not limited to this; another determination unit (not shown) may refer to the condition storage unit 210 and determine whether the speech segment adjustment unit 212 or the recognition result selection/integration unit 214 needs to take the recorded conditions into account, and notify the relevant unit of the necessary conditions when it does.
 As described above, the speech recognition apparatus 200 of this embodiment can be realized by a computer.
 The computer program of this embodiment is written so as to cause a computer implementing the speech recognition apparatus 200 to execute, in addition to the procedures of the computer program of the above embodiment, a procedure for recording in the condition storage unit 210, for each utterance segment (or recognition target segment), the input conditions of the voice data when a speech recognition result was selected or not selected, and a procedure for referring to the condition storage unit 210 and adjusting the utterance segments in consideration of the input conditions of the plurality of input voice data.
 The computer program of this embodiment is also written so as to cause the computer implementing the speech recognition apparatus 200 to execute a procedure for recording in the condition storage unit 210, for each recognition target segment, the speech recognition processing conditions in the speech recognition unit 102 when a speech recognition result was selected or not selected, and a procedure for referring to the condition storage unit 210 and selecting recognition results for each recognition target segment in consideration of the speech recognition processing conditions.
 The operation of the speech recognition system of this embodiment configured as described above is explained below.
 FIG. 6 is a flowchart showing an example of the operation of the speech recognition system of this embodiment.
 In the speech recognition system of this embodiment, the speech recognition apparatus 200 performs steps S101, S105, and S109, which are the same as in the flowchart of the above embodiment in FIG. 3, and additionally steps S203 to S208.
 First, the speech segment adjustment unit 212 of the speech recognition apparatus 200 receives the voice data d1, d2, ..., dn collected by the plurality of voice input units 10 under respectively different recording conditions (step S101). The speech segment adjustment unit 212 then detects the utterance segments of each set of voice data and adjusts them against one another so that each contains the same utterance (step S203). At this time, the speech segment adjustment unit 212 refers to the condition storage unit 210 and detects and adjusts the utterance segments in consideration of the input conditions.
 The speech segment adjustment unit 212 then records the input conditions in the condition storage unit 210 for each set of voice data and each utterance segment (or recognition processing segment) (step S204). The speech recognition unit 102 performs recognition on the plurality of voice data output from the speech segment adjustment unit 212, one recognition processing segment at a time (step S105). As a result, the recognition results corresponding to the recognition processing segments of the voice data are output from the speech recognition unit 102 to the recognition result selection/integration unit 214. The recognition result selection/integration unit 214 compares the speech recognition results for each recognition processing segment and selects the optimum one among them (step S207). At this time, the recognition result selection/integration unit 214 refers to the condition storage unit 210 and selects the recognition result in consideration of the input conditions or the speech recognition processing conditions.
 The recognition result selection/integration unit 214 then appends to the condition storage unit 210 the speech recognition processing conditions of each utterance segment of each set of voice data, together with a selection flag indicating whether the voice data of that segment was adopted (step S208). The recognition result selection/integration unit 214 then integrates the recognition results selected for the recognition processing segments and outputs them as the recognition result T for the series of voice data (step S109).
 As explained above, the speech recognition system of this embodiment provides the same effects as the above embodiment and, because the speech recognition processing conditions of voice data selected or not selected in the past are taken into account when selecting a speech recognition result, the processing can reflect the tendencies of the recording conditions that differ with the circumstances of each venue, making it possible to improve recognition accuracy.
 Embodiments of the present invention have been described above with reference to the drawings, but these are illustrations of the present invention, and various configurations other than the above can also be adopted.
 While the present invention has been described with reference to embodiments and examples, the present invention is not limited to the above embodiments and examples. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
 This application claims priority based on Japanese Patent Application No. 2010-076195 filed on March 29, 2010, the entire disclosure of which is incorporated herein.

Claims (19)

  1.  A speech recognition apparatus comprising:
     speech recognition means for recognizing each of a plurality of voice data obtained by inputting a speaker's uttered speech under different recording conditions; and
     recognition result selection means for comparing a plurality of speech recognition results obtained by the speech recognition means and selecting an optimum one.
  2.  The speech recognition apparatus according to claim 1, further comprising
     speech segment adjustment means for accepting input of a plurality of series of the voice data, detecting each utterance segment of each of the plurality of series of the voice data, and adjusting the utterance segments so that the same utterance is included across the plurality of series of the voice data,
     wherein the speech recognition means performs speech recognition processing on the same utterance of the plurality of series of the voice data adjusted by the speech segment adjustment means and outputs a plurality of speech recognition results corresponding to the same utterance, and
     the recognition result selection means performs comparison and selection among the plurality of the speech recognition results corresponding to the same utterance of the plurality of series of the voice data output from the speech recognition means, and integrates them for output as one optimum speech recognition result.
  3.  The speech recognition apparatus according to claim 1 or 2, wherein the recognition result selection means compares the plurality of the speech recognition results and selects the one for which more similar results are obtained.
  4.  The speech recognition apparatus according to any one of claims 1 to 3, wherein the recognition result selection means selects the optimum result based on recognition result information obtained when the voice data is subjected to speech recognition processing by the speech recognition means.
  5.  The speech recognition apparatus according to claim 4, wherein the recognition result information is an acoustic score, a language score, or a confidence.
  6.  The speech recognition apparatus according to claim 5, wherein, when the recognition result selection means takes a majority vote to select the result for which more similar results are obtained, the recognition result information is used as a weight for the speech recognition results.
  7.  The speech recognition apparatus according to claim 5 or 6, wherein, when the recognition result selection means takes a majority vote to select the result for which more similar results are obtained, whether to adopt a speech recognition result is decided by a threshold on the recognition result information.
  8.  The speech recognition apparatus according to any one of claims 2 to 7, further comprising:
     a processing condition storage unit that stores, for each speech recognition processing unit processed by the speech recognition means, the speech recognition processing conditions of the speech recognition means when the plurality of the speech recognition results were obtained; and
     processing condition recording means for recording in the processing condition storage unit, for each speech recognition processing unit, the speech recognition processing conditions in the speech recognition means when a speech recognition result was selected, or not selected, by the recognition result selection means,
     wherein the recognition result selection means refers to the processing condition storage unit and selects a speech recognition result for each speech recognition processing unit in consideration of the speech recognition processing conditions.
  9.  The speech recognition apparatus according to any one of claims 1 to 8, wherein the speech recognition means performs speech recognition processing on the plurality of the voice data under the same speech recognition processing conditions.
  10.  The speech recognition apparatus according to any one of claims 1 to 9, wherein the plurality of the voice data are respectively collected and input by a plurality of voice input devices.
  11.  A speech recognition system comprising:
     a plurality of voice input means for inputting speech under respectively different recording conditions;
     speech recognition means for recognizing each of a plurality of voice data input from the voice input means; and
     recognition result selection means for comparing a plurality of speech recognition results obtained by the speech recognition means and selecting an optimum one.
  12.  A data processing method of a speech recognition apparatus that recognizes voice data, wherein the speech recognition apparatus:
     recognizes each of a plurality of voice data input under different recording conditions; and
     compares a plurality of speech recognition results obtained by the speech recognition and selects an optimum one.
  13.  The data processing method of a speech recognition apparatus according to claim 12, wherein the speech recognition apparatus:
     accepts input of a plurality of series of the voice data, detects each utterance segment of each of the plurality of series of the voice data, and adjusts the utterance segments so that the same utterance is included across the plurality of series of the voice data;
     performs speech recognition processing on the same utterance of the adjusted plurality of series of the voice data and outputs a plurality of speech recognition results corresponding to the same utterance; and
     performs comparison and selection among the plurality of the speech recognition results corresponding to the same utterance of the plurality of series of the voice data, and integrates them for output as one optimum speech recognition result.
  14.  The data processing method of a speech recognition apparatus according to claim 12 or 13, wherein the speech recognition apparatus selects the optimum result based on recognition result information obtained when the voice data is subjected to speech recognition processing.
  15.  The data processing method of a speech recognition apparatus according to any one of claims 12 to 14, wherein the speech recognition apparatus compares the plurality of the speech recognition results and selects the one for which more similar results are obtained.
  16.  The data processing method of a speech recognition apparatus according to any one of claims 13 to 15, wherein the speech recognition apparatus:
     comprises a processing condition storage unit that stores, for each speech recognition processing unit processed by the speech recognition means, the speech recognition processing conditions of the speech recognition means when the plurality of the speech recognition results were obtained;
     stores in the processing condition storage unit, for each speech recognition processing unit, the speech recognition processing conditions at the time of the speech recognition when a speech recognition result was selected or not selected; and
     refers to the processing condition storage unit and selects a speech recognition result for each speech recognition processing unit in consideration of the speech recognition processing conditions.
  17.  The data processing method of a speech recognition apparatus according to any one of claims 12 to 16, wherein the speech recognition apparatus performs speech recognition processing on the plurality of the voice data under the same speech recognition processing conditions.
  18.  The data processing method of a speech recognition apparatus according to any one of claims 12 to 17, wherein the plurality of the voice data are respectively collected and input by a plurality of voice input devices.
  19.  A computer program for realizing a speech recognition apparatus that recognizes voice data, the computer program causing a computer to execute:
     a procedure for recognizing each of a plurality of voice data input under different recording conditions; and
     a procedure for comparing a plurality of speech recognition results obtained by the speech recognition and selecting an optimum one.
PCT/JP2011/001826 2010-03-29 2011-03-28 Voice-recognition system, device, method and program WO2011121978A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2012508079A JPWO2011121978A1 (en) 2010-03-29 2011-03-28 Speech recognition system, apparatus, method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2010-076195 2010-03-29
JP2010076195 2010-03-29

Publications (1)

Publication Number Publication Date
WO2011121978A1 true WO2011121978A1 (en) 2011-10-06

Family

ID=44711741

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2011/001826 WO2011121978A1 (en) 2010-03-29 2011-03-28 Voice-recognition system, device, method and program

Country Status (2)

Country Link
JP (1) JPWO2011121978A1 (en)
WO (1) WO2011121978A1 (en)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6129896A (en) * 1984-07-20 1986-02-10 日本電信電話株式会社 Word voice recognition equipment
JPH02178699A (en) * 1988-12-28 1990-07-11 Nec Corp Voice recognition device
JPH0683388A (en) * 1992-09-04 1994-03-25 Fujitsu Ten Ltd Speech recognition device
JP3017118B2 (en) * 1997-02-20 2000-03-06 日本電気ロボットエンジニアリング株式会社 Voice recognition device with recognition result selection function using multiple microphones
JP3903738B2 (en) * 2001-05-23 2007-04-11 日本電気株式会社 Information recording / retrieval apparatus, method, program, and recording medium
JP2003140691A (en) * 2001-11-07 2003-05-16 Hitachi Ltd Voice recognition device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000148185A (en) * 1998-11-13 2000-05-26 Matsushita Electric Ind Co Ltd Recognition device and method
WO2008096582A1 (en) * 2007-02-06 2008-08-14 Nec Corporation Recognizer weight learning device, speech recognizing device, and system
JP2008250059A (en) * 2007-03-30 2008-10-16 Advanced Telecommunication Research Institute International Voice recognition device, voice recognition system and voice recognition method

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013154010A1 (en) * 2012-04-09 2013-10-17 クラリオン株式会社 Voice recognition server integration device and voice recognition server integration method
US9524718B2 (en) 2012-04-09 2016-12-20 Clarion Co., Ltd. Speech recognition server integration device that is an intermediate module to relay between a terminal module and speech recognition server and speech recognition server integration method
KR101736109B1 (en) * 2015-08-20 2017-05-16 현대자동차주식회사 Speech recognition apparatus, vehicle having the same, and method for controlling thereof
US9704487B2 (en) 2015-08-20 2017-07-11 Hyundai Motor Company Speech recognition solution based on comparison of multiple different speech inputs
CN109473096A (en) * 2017-09-08 2019-03-15 北京君林科技股份有限公司 A kind of intelligent sound equipment and its control method

Also Published As

Publication number Publication date
JPWO2011121978A1 (en) 2013-07-04

Similar Documents

Publication Publication Date Title
US9354687B2 (en) Methods and apparatus for unsupervised wakeup with time-correlated acoustic events
US9514747B1 (en) Reducing speech recognition latency
US20180061396A1 (en) Methods and systems for keyword detection using keyword repetitions
JP2023041843A (en) Voice section detection apparatus, voice section detection method, and program
JP6812843B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
US8751230B2 (en) Method and device for generating vocabulary entry from acoustic data
CN112074901A (en) Speech recognition login
EP2388778B1 (en) Speech recognition
US20090119103A1 (en) Speaker recognition system
US9335966B2 (en) Methods and apparatus for unsupervised wakeup
US20140156276A1 (en) Conversation system and a method for recognizing speech
US20220343895A1 (en) User-defined keyword spotting
US9031841B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
US9460714B2 (en) Speech processing apparatus and method
TW202223877A (en) User speech profile management
WO2011121978A1 (en) Voice-recognition system, device, method and program
CN109155128B (en) Acoustic model learning device, acoustic model learning method, speech recognition device, and speech recognition method
EP3195314B1 (en) Methods and apparatus for unsupervised wakeup
KR20120046627A (en) Speaker adaptation method and apparatus
KR101283271B1 (en) Apparatus for language learning and method thereof
Yella et al. Information bottleneck based speaker diarization of meetings using non-speech as side information
KR100622019B1 (en) Voice interface system and method
KR20140035164A (en) Method operating of speech recognition system
KR20200129007A (en) Utterance verification device and method
KR102661005B1 (en) Method and Device for speaker's sound separation from a multi-channel speech signals of multiple speaker

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 11762226; Country of ref document: EP; Kind code of ref document: A1)

WWE Wipo information: entry into national phase (Ref document number: 2012508079; Country of ref document: JP)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 11762226; Country of ref document: EP; Kind code of ref document: A1)