JP2007086554A

JP2007086554A - Voice recognition device and program for voice recognition processing

Info

Publication number: JP2007086554A
Application number: JP2005276996A
Authority: JP
Inventors: Keisuke Yoshizaki; 圭祐吉崎; Tomonori Ikumi; 智則伊久美; Tomonari Kakino; 友成柿野; Naoki Sekine; 直樹関根
Original assignee: Toshiba TEC Corp
Current assignee: Toshiba TEC Corp
Priority date: 2005-09-26
Filing date: 2005-09-26
Publication date: 2007-04-05

Abstract

PROBLEM TO BE SOLVED: To raise the recognition rate of voice recognition processing. SOLUTION: When a voice is input to a microphone array 119 made up of two or more microphone elements 120, the individual microphone elements output sound signals of two or more systems. A microphone array processing part 131 applies a noise signal reducing operation to these sound signals, exploiting the fact that the voice and noise signals contained in them differ from system to system, and produces a processed signal with an improved SN ratio. An utterance section detecting part 141 detects the utterance section from this processed signal, which raises the detection accuracy. An utterance section extracting part 151 then takes the utterance signal out of the utterance section detected in this way, using not the processed signal, whose voice content has been altered, but a sound signal that has not been altered. COPYRIGHT: (C)2007,JPO&INPIT

Description

本発明は、マイクロフォンアレー技術を利用する音声認識装置及び音声認識処理用プログラムに関する。 The present invention relates to a speech recognition apparatus and a speech recognition processing program that use a microphone array technology.

音声認識処理は、マイクロフォンから取り込んだ音声を登録されている認識対象語句と比較することで音声認識結果を得る技術である。このような音声認識処理は、雑音環境下において認識性能が著しく低下してしまうため、雑音対策が重要な課題となっている。 The voice recognition process is a technique for obtaining a voice recognition result by comparing a voice taken in from a microphone with a registered recognition target word / phrase. In such a speech recognition process, since the recognition performance is significantly deteriorated in a noisy environment, noise countermeasures are an important issue.

Noise signal reduction processing using a microphone array composed of a plurality of microphone elements has long been known as a countermeasure against such noise (see Non-Patent Document 1). When a voice is input to the microphone array, the individual microphone elements output sound signals of a plurality of systems. The processing exploits the fact that the voice signal and the noise signal contained in these sound signals differ from system to system, reduces the noise signal, and generates a processed signal with an improved S/N ratio.

このようなマイクロフォンアレーを用いた雑音信号の低減処理は、遅延和アレー処理と適用型マイクロフォンアレー処理とに大別することができる。 Such noise signal reduction processing using a microphone array can be broadly divided into delay-and-sum array processing and adaptive microphone array processing.

In delay-and-sum array processing, the sound signals of the plurality of systems output from the individual microphone elements are first brought into phase with respect to sound arriving from the target direction, and the in-phase sound signals are then added to form the processed signal. Sound arriving from the target direction is regarded as the voice signal, so bringing it into phase and adding it yields an emphasized voice signal. Signals arriving from any other direction are regarded as noise signals. They are not brought into phase, so their waveforms remain shifted in time relative to one another and addition emphasizes them only weakly. As a result, the voice signal is emphasized more than the noise signal, and the noise signal is reduced in relative terms.

In adaptive microphone array processing, the noise signals are brought into phase. The in-phase noise signal is then subtracted from the sound signals of the plurality of systems output from the individual microphone elements, which cancels the noise signal.

Non-Patent Document 1: 大賀寿郎 (Juro Ohga), 山崎芳男 (Yoshio Yamasaki), and 金田豊 (Yutaka Kaneda), "Acoustic Systems and Digital Processing" (音響システムとデジタル処理), edited and published by The Institute of Electronics, Information and Communication Engineers (IEICE).

Noise signal reduction processing using a microphone array reliably reduces the noise signal. On the other hand, this processing also alters (distorts) the voice signal. The alteration occurs whether delay-and-sum array processing or adaptive microphone array processing is used. Speech recognition processing is therefore executed on an altered voice signal, which lowers the recognition rate.

本発明の目的は、音声認識処理に際して、認識率を向上させることである。 An object of the present invention is to improve the recognition rate in speech recognition processing.

The speech recognition apparatus of the present invention comprises: means for applying noise signal reduction processing to sound signals of a plurality of systems, which are output from the individual microphone elements when a voice is input to a microphone array composed of a plurality of microphone elements, by exploiting the fact that the voice signal and the noise signal contained in the sound signals differ from system to system, and thereby generating a processed signal with an improved S/N ratio; means for detecting an utterance section on the basis of the processed signal and outputting it as utterance section information; means for extracting an utterance signal from the sound signal within the utterance section specified by the utterance section information; and means for applying speech recognition processing to the extracted utterance signal to obtain a recognition result.

In the invention according to claim 1, an utterance signal is extracted from the sound signal of one system, and speech recognition processing is applied to this utterance signal.

In the invention according to claim 2, utterance signals within the utterance section specified by the utterance section information are extracted from the sound signals of a plurality of systems, speech recognition processing is applied to the extracted utterance signals of the plurality of systems to obtain a plurality of recognition results, and one of the recognition results is selected and output in accordance with an adoption definition that defines the relationship between the success or failure of recognition for each utterance signal and the utterance signal to be adopted.

In the invention according to claim 3, utterance signals within the utterance section specified by the utterance section information are extracted from the sound signals of a plurality of systems, speech recognition processing is applied to the extracted utterance signals of the plurality of systems to obtain a plurality of recognition results, a recognition score expressing the certainty of each recognition result is calculated, and the recognition result corresponding to the highest recognition score is selected and output.
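The score-based selection described for claim 3 can be sketched as follows. This is only an illustration: the `Recognition` record, its field names, and the 0-to-1 score scale are assumptions made for the example, and the patent does not say how the recognition score is computed.

```python
from typing import NamedTuple, Optional, Sequence


class Recognition(NamedTuple):
    """One recognition result (hypothetical record, not from the patent)."""
    system: str    # which microphone system the utterance signal came from
    text: str      # recognized phrase
    score: float   # assumed certainty score; higher means more certain


def select_by_score(results: Sequence[Recognition]) -> Optional[Recognition]:
    """Return the result with the highest recognition score, or None if empty."""
    if not results:
        return None
    return max(results, key=lambda r: r.score)
```

For example, given one result per system, `select_by_score` returns the record whose `score` field is largest; ties go to the first such record because `max` keeps the first maximum it encounters.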

In the invention according to claim 4, utterance signals within the utterance section specified by the utterance section information are extracted from the sound signals of a plurality of systems, speech recognition processing is applied to the extracted utterance signals of the plurality of systems to obtain a plurality of recognition results, the volume of each sound signal from which an utterance signal is extracted is calculated, and the recognition result corresponding to the sound signal with the minimum volume is selected and output.
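A minimal sketch of the volume-based selection of claim 4 is given below. The patent does not define how the volume is calculated; root-mean-square amplitude is used here purely as an assumption, and the function names and the dictionary layout are hypothetical.

```python
import math
from typing import Dict, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude, used here as an assumed volume measure."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def select_by_min_volume(signals: Dict[str, Sequence[float]],
                         results: Dict[str, str]) -> str:
    """Return the recognition result of the system whose sound signal is quietest.

    `signals` maps a system name to its sound signal and `results` maps the
    same system names to their recognition results.
    """
    quietest = min(signals, key=lambda name: rms(signals[name]))
    return results[quietest]
```

With two systems "A" and "B", the function computes the RMS of each sound signal and returns the recognition result stored under the name of the quieter one.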

更に、本発明は、コンピュータにインストールされ、当該コンピュータに上記処理各手段を実行させる音声認識処理用プログラムをも規定する。 Furthermore, the present invention also defines a voice recognition processing program that is installed in a computer and causes the computer to execute each of the processing means.

According to the present invention, the utterance section is detected on the basis of the processed signal with an improved S/N ratio, which raises the detection accuracy, and the utterance signal is extracted from the unaltered sound signal within the utterance section detected with this higher accuracy and then subjected to speech recognition processing. The recognition rate can therefore be improved.

A first embodiment of the present invention will be described with reference to FIGS. 1 and 2.

FIG. 1 is a block diagram showing the hardware configuration of the speech recognition apparatus according to the present embodiment. The speech recognition apparatus 101 of the present embodiment is realized by a microcomputer. The microcomputer includes a CPU 102 that executes various arithmetic processes and centrally controls each unit. A ROM 103 that stores fixed data, a RAM 104 that stores variable data in a rewritable manner, and an HDD 105 are connected to the CPU 102 via a bus line 106.

Also connected to the CPU 102 via the bus line 106 are a magnetic disk drive 108 that writes information to and reads information from a magnetic disk 107, an optical disk drive 110 that reads information from various optical disks 109 such as CD and DVD media and writes information to writable optical disks 109, various I/Os 111, and a communication interface 112.

また、音声認識装置１０１を構成するマイクロコンピュータは、ディスプレイ１１３に情報を出力し、キーボード１１４及びポインティングデバイス１１５から情報を入力することができる。そのために、ディスプレイ１１３は表示制御回路１１６を介して、キーボード１１４及びポインティングデバイス１１５は入力制御回路１１７を介して、それぞれＣＰＵ１０２に接続されている。表示制御回路１１６及び入力制御回路１１７は、バスライン１０６に接続されてＣＰＵ１０２との間で通信自在である。 Further, the microcomputer constituting the speech recognition apparatus 101 can output information to the display 113 and input information from the keyboard 114 and the pointing device 115. For this purpose, the display 113 is connected to the CPU 102 via the display control circuit 116, and the keyboard 114 and the pointing device 115 are connected to the CPU 102 via the input control circuit 117. The display control circuit 116 and the input control circuit 117 are connected to the bus line 106 and can communicate with the CPU 102.

Furthermore, the microcomputer constituting the speech recognition apparatus 101 includes a voice input circuit 118. As an example, the voice input circuit 118 is formed as an integrated circuit on an expansion board (not shown), which is inserted into an expansion slot (not shown) of the microcomputer constituting the speech recognition apparatus 101. A microphone array 119 is connected to the voice input circuit 118. The microphone array 119 is composed of a plurality of microphone elements 120 (see FIG. 2); it picks up input sound such as voice with these microphone elements 120 and outputs it from each microphone element 120 individually. The microphone array 119 therefore outputs as many systems of sound signals as there are microphone elements 120. The voice input circuit 118 includes an amplifier 121 and an analog-digital converter 122 for each microphone element 120 of the microphone array 119 (see FIG. 2). The voice input circuit 118 can thus convert the sound signals input to the microphone array 119 into digital signals and output them onto the bus line 106 as digitized sound signals of as many systems as there are microphone elements 120.

In another embodiment, the voice input circuit 118 can also be implemented in software. In terms of processing speed, however, it is preferable to implement the voice input circuit 118 as an integrated circuit.

In still another embodiment, the microphone array 119 itself may incorporate the amplifiers 121 and the analog-digital converters 122. Seen from the microcomputer constituting the speech recognition apparatus 101, the microphone array 119 is a separately attached component, and the housing (not shown) of such a microphone array 119 may contain the amplifiers 121 and the analog-digital converters 122. In this case, the voice input circuit 118 need only have, as its main configuration, a configuration that outputs the digitized sound signals output from the analog-digital converters 122 to, for example, the RAM 104.

Various processing programs can be installed on the HDD 105 of the microcomputer constituting the speech recognition apparatus 101. Typically an OS (operating system) is installed, and a speech recognition processing program is also installed on the HDD 105. As one example, the speech recognition processing program is stored on the magnetic disk 107, read via the magnetic disk drive 108, and installed on the HDD 105. As another example, it is stored on the optical disk 109, read via the optical disk drive 110, and installed on the HDD 105. As yet another example, it may be downloaded from a host machine (in the case of an intranet, for example) or a web page (in the case of the Internet, for example) connected via the communication interface 112 and installed on the HDD 105. In each of these examples, the HDD 105, the magnetic disk 107, and the optical disk 109 serve as storage media that store the speech recognition processing program.

音声認識装置１０１を構成するマイクロコンピュータの起動時、処理速度の高速度化を図るために、ＨＤＤ１０５にインストールされたＯＳの全部又は一部がＲＡＭ１０４にコピーされる。同様の目的で、ＨＤＤ１０５にインストールされた音声認識用処理プログラムも、一例としてその起動時等のタイミングで、その全部又は一部がＲＡＭ１０４にコピーされる。これにより、音声認識用処理プログラムは、単独で、あるいはＯＳと協働して、ＣＰＵ１０２に各種機能を実行させる。これらの機能は、音声認識用処理プログラムが意図する目的達成手段としても認識し得る。 When the microcomputer constituting the speech recognition apparatus 101 is activated, all or part of the OS installed in the HDD 105 is copied to the RAM 104 in order to increase the processing speed. For the same purpose, the speech recognition processing program installed in the HDD 105 is also copied in its entirety or in part to the RAM 104, for example, at the time of activation or the like. Thereby, the speech recognition processing program causes the CPU 102 to execute various functions independently or in cooperation with the OS. These functions can also be recognized as an objective achieving means intended by the speech recognition processing program.

FIG. 2 is a functional block diagram of the speech recognition apparatus 101. It shows, as blocks, the various functions that the CPU 102 executes in accordance with the activated speech recognition processing program in the microcomputer constituting the speech recognition apparatus 101. As these functional blocks, the speech recognition apparatus 101 has a microphone array processing unit 131, an utterance section detection unit 141, an utterance section extraction unit 151, and a speech recognition unit 161.

The microphone array processing unit 131 receives the sound signals of the plurality of systems that are output from the microphone elements 120 of the microphone array 119, amplified by the amplifiers 121, and digitized by the analog-digital converters 122. Exploiting the fact that the voice signal and the noise signal contained in these sound signals differ from system to system, it applies noise signal reduction processing and generates a processed signal with an improved S/N ratio. The noise signal reduction processing by the microphone array processing unit 131 is executed by delay-and-sum array processing in one example and by adaptive microphone array processing in another.

As described above, delay-and-sum array processing brings the sound signals of the plurality of systems output from the individual microphone elements 120 into phase with respect to sound arriving from the target direction, and then adds the in-phase sound signals to form the processed signal. Delay-and-sum array processing therefore requires that the target direction from which the voice signal arrives be known. Under this condition, sound arriving from the target direction is the voice signal, and bringing it into phase and adding it yields an emphasized voice signal. Signals arriving from any other direction are noise signals and are not brought into phase. Their waveforms therefore remain shifted in time relative to one another, and addition emphasizes them only weakly. As a result, the voice signal is emphasized more than the noise signal, and the noise signal is reduced in relative terms.
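The delay-and-sum idea can be sketched in a few lines. The sketch assumes that the per-system delays toward the target direction are already known as whole numbers of samples, and it divides the sum by the number of systems so that the output stays at the level of a single system; both choices, and the function name `delay_and_sum`, are assumptions made for illustration rather than details taken from the patent.

```python
from typing import List, Sequence


def delay_and_sum(systems: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Bring the target sound into phase across systems, then add.

    delays[i] is the number of samples by which the target sound arrives
    later at system i than at the earliest system (assumed known).
    """
    if len(systems) != len(delays):
        raise ValueError("one delay is required per system")
    # Number of samples that remain once each system is shifted by its delay.
    length = min(len(sig) - d for sig, d in zip(systems, delays))
    if length <= 0:
        return []
    out = []
    for n in range(length):
        # Reading sample n + d from each system lines up the target sound.
        total = sum(sig[n + d] for sig, d in zip(systems, delays))
        out.append(total / len(systems))
    return out
```

When the same target waveform reaches a second microphone two samples later, calling `delay_and_sum([a, b], [0, 2])` adds the two copies in phase, whereas sound from other directions stays misaligned and is emphasized less.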

As described above, adaptive microphone array processing brings the noise signals into phase and subtracts the in-phase noise signal from the sound signals of the plurality of systems output from the individual microphone elements 120, thereby cancelling the noise signal. Unlike delay-and-sum array processing, adaptive microphone array processing does not require the delay amounts, in other words the arrival direction of the noise signal, to be known in advance. The phase of the sound signal of one system output from a certain microphone element 120 is taken as the reference, the sound signals of the other systems output from the other microphone elements 120 are delayed while the power of the subtraction output is monitored, and each delay amount is set so that the power of the subtraction output is minimized. When the power of the subtraction output reaches its minimum, the noise has been cancelled.
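The delay search described above can be illustrated with a small sketch for two systems. It tries each candidate delay, computes the power of the subtraction output, and keeps the delay with the smallest power. Restricting the search to non-negative whole-sample delays, the exhaustive search itself, and the function names are simplifying assumptions for this example; the patent only states that the delay is set so that the subtraction output power is minimized.

```python
from typing import Sequence, Tuple


def residual_power(ref: Sequence[float], other: Sequence[float],
                   delay: int) -> float:
    """Mean power of ref[n] - other[n - delay] over the overlapping samples."""
    end = min(len(ref), len(other) + delay)
    diffs = [ref[n] - other[n - delay] for n in range(delay, end)]
    if not diffs:
        return float("inf")  # no overlap, so this delay cannot be evaluated
    return sum(d * d for d in diffs) / len(diffs)


def find_cancelling_delay(ref: Sequence[float], other: Sequence[float],
                          max_delay: int) -> Tuple[int, float]:
    """Return (delay, power) for the delay that minimizes the subtraction power."""
    best = min(range(max_delay + 1),
               key=lambda d: residual_power(ref, other, d))
    return best, residual_power(ref, other, best)
```

If the noise reaches the reference system three samples later than the other system, the search settles on a delay of three samples, at which point the subtraction output contains no noise.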

Delay-and-sum array processing and adaptive microphone array processing have been introduced above as examples of the noise signal reduction processing by the microphone array processing unit 131. The microphone array processing unit 131 of the present embodiment is not limited to these, however: any form of processing may be used as long as it reduces the noise signal in the sound signals of the plurality of systems output from the microphone elements 120 of the microphone array 119 by exploiting the fact that the voice signal and the noise signal contained in those sound signals differ from system to system. The various kinds of processing that the microphone array processing unit 131 may execute can readily be implemented by referring to Non-Patent Document 1 mentioned above.

The utterance section detection unit 141 detects an utterance section on the basis of the processed signal output by the microphone array processing unit 131 and outputs it as utterance section information. This processing can be implemented with any of various conventionally known methods, such as detection from the rise and fall of the voice power envelope or detection by extracting the fundamental frequency.
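As one possible reading of the power-envelope method mentioned above, the sketch below splits the processed signal into fixed-length frames, computes the mean power of each frame, and reports the span from the first to the last frame whose power exceeds a threshold. The frame length, the threshold value, and the function name are hypothetical choices for illustration; the patent leaves the detection method open.

```python
from typing import Optional, Sequence, Tuple


def detect_utterance_section(processed: Sequence[float],
                             frame_len: int = 4,
                             threshold: float = 0.01
                             ) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the utterance, end exclusive.

    A frame counts as speech when its mean power exceeds `threshold`.
    Returns None when no frame does.
    """
    active_starts = []
    for start in range(0, len(processed), frame_len):
        frame = processed[start:start + frame_len]
        power = sum(x * x for x in frame) / len(frame)
        if power > threshold:
            active_starts.append(start)
    if not active_starts:
        return None
    end = min(active_starts[-1] + frame_len, len(processed))
    return active_starts[0], end
```

The returned pair plays the role of the utterance section information: it carries only positions in time, so it can be applied to any signal that shares the same sample indices.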

発話区間抽出部１５１は、一の系統の音信号から発話区間情報によって特定される発話区間内の発話信号を抽出する処理を実行する。どの音信号を採用するかは、予め固定的に定めておけば良い。発話区間抽出部１５１が実施する発話信号の抽出処理は、従来から採用されている各種の処理によって容易に実施可能である。 The utterance section extraction unit 151 executes a process of extracting an utterance signal in the utterance section specified by the utterance section information from the sound signal of one system. Which sound signal is to be used may be fixedly determined in advance. The speech signal extraction process performed by the speech segment extraction unit 151 can be easily performed by various processes conventionally employed.

The first important point of the present embodiment is that the utterance section specified by the utterance section information is used as the section from which the utterance signal is extracted. The utterance section information represents the utterance section detected by the utterance section detection unit 141 on the basis of the processed signal generated by the microphone array processing unit 131 described above. The microphone array processing unit 131 generates, from the sound signals digitized by the analog-digital converters 122, a sound signal in which the noise signal has been reduced, that is, a processed signal with an improved S/N ratio. Because the utterance section detection unit 141 detects the utterance section on the basis of this processed signal, it can detect the utterance section with high accuracy.

The second important point of the present embodiment is that the utterance section extraction unit 151 extracts the utterance signal not from the processed signal generated by the microphone array processing unit 131 but from the sound signal of one system among the sound signals converted into digital signals by the analog-digital converters 122. As described above, the processed signal generated by the microphone array processing unit 131 has excellent characteristics in terms of noise signal reduction, but the voice signal in it has been altered. If the utterance signal were extracted from such an altered processed signal, the recognition rate of the subsequent speech recognition processing in the speech recognition unit 161 would fall. In the present embodiment, the utterance section extraction unit 151 therefore uses the sound signal of one system converted into a digital signal by the analog-digital converter 122 and extracts the utterance signal from it. This is based on the finding that, even though the sound signal before processing by the microphone array processing unit 131 contains a noise signal, speech recognition processing on that noisy sound signal suffers a smaller drop in recognition rate than speech recognition processing on the processed signal, in which the sound signal has been altered by the microphone array processing unit 131.
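The extraction step can be sketched as a plain slice of the unprocessed sound signal. The sketch assumes that the utterance section is given as a pair of sample indices (end exclusive) and that the processed signal and the raw system share the same sample indexing; the function name `extract_utterance` is hypothetical.

```python
from typing import List, Sequence, Tuple


def extract_utterance(raw_signal: Sequence[float],
                      section: Tuple[int, int]) -> List[float]:
    """Cut the utterance signal out of an unprocessed sound signal.

    `section` is (start, end) in samples, end exclusive. It is assumed to
    have been detected on the noise-reduced processed signal, while the
    samples returned here come from the raw, unaltered system.
    """
    start, end = section
    if not 0 <= start <= end <= len(raw_signal):
        raise ValueError("utterance section lies outside the signal")
    return list(raw_signal[start:end])
```

The design point is that the processed signal contributes only the section boundaries, and every sample handed to the recognizer is taken from the sound signal that the noise reduction never touched.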

音声認識処理は、抽出された発話信号を辞書に登録された認識対象語句と比較し、近似する認識対象語句を抽出する、という処理である。このような音声認識処理は、従来から知られている様々な手法によって実行可能である。 The speech recognition process is a process of comparing an extracted speech signal with a recognition target word / phrase registered in a dictionary and extracting an approximate recognition target word / phrase. Such speech recognition processing can be executed by various methods known in the art.

The speech recognition unit 161 applies speech recognition processing to the extracted utterance signal, that is, the utterance signal extracted from the sound signal of one system converted into a digital signal by the analog-digital converter 122, and obtains a recognition result. As a result, the present embodiment can improve the recognition rate of the speech recognition processing by the speech recognition unit 161.

The present embodiment has described an example in which the speech recognition apparatus 101 is realized by a microcomputer. In another embodiment, all or part of the microphone array processing unit 131, the utterance section detection unit 141, the utterance section extraction unit 151, and the speech recognition unit 161 shown in FIG. 2 may be realized by an integrated circuit.

A second embodiment of the present invention will be described with reference to FIGS. 3 to 5. Parts identical to those of the first embodiment are denoted by the same reference numerals, and their description is omitted.

図３は、本実施の形態の音声認識装置１０１の機能ブロック図である。本実施の形態が第１の実施の形態と相違する点は、発話区間抽出部１５１が取り込むデジタル化された音信号の数である。本実施の形態では、一例として、マイクロフォンアレー１１９は、図３中で「Ａ」と「Ｂ」と表記される二つのマイクロフォン素子１２０を備えている。そして、発話区間抽出部１５１は、それらの二つのマイクロフォン素子１２０が出力する両系統の音信号のいずれをも取り込み、両系統の信号から発話区間検出部１４１によって検出された発話区間情報によって特定される発話区間内の発話信号を抽出する処理を実行する。そして、音声認識部１６１は、それらの二系統の抽出された発話信号に対して音声認識処理を施し、二種類の認識結果を得る。 FIG. 3 is a functional block diagram of the speech recognition apparatus 101 according to the present embodiment. This embodiment is different from the first embodiment in the number of digitized sound signals captured by the utterance section extraction unit 151. In the present embodiment, as an example, the microphone array 119 includes two microphone elements 120 denoted as “A” and “B” in FIG. Then, the utterance section extraction unit 151 takes in both sound signals output from the two microphone elements 120 and is specified by the utterance section information detected by the utterance section detection unit 141 from the signals of both systems. A process for extracting an utterance signal within the utterance section is executed. Then, the speech recognition unit 161 performs speech recognition processing on these two extracted speech signals to obtain two types of recognition results.

In another embodiment, the microphone array 119 may include three or more microphone elements 120 and output digitized sound signals of three or more systems, each of which is sent to the microphone array processing unit 131. In this case, the utterance section extraction unit 151 may extract utterance signals from the digitized sound signals of all the systems output by the microphone array 119, or only from those of some of the systems.

FIG. 4 is a schematic diagram illustrating the adoption definition. In the present embodiment, the speech recognition processing program has an adoption definition 201. The adoption definition 201 defines the relationship between the success or failure of recognition for each utterance signal and the utterance signal to be adopted. In FIG. 4, recognition result A is the recognition result based on the sound signal output from the microphone element 120 labeled "A" in FIG. 3 and digitized by the analog-digital converter 122, and recognition result B is the recognition result based on the sound signal output from the microphone element 120 labeled "B" in FIG. 3 and digitized by the analog-digital converter 122. The selection rule in FIG. 4 indicates which recognition result is selected. As shown in FIG. 4, when recognition results A and B are both successful, recognition result A is selected. The same applies when recognition result A is successful and recognition result B is a failure. When recognition result A is a failure and recognition result B is successful, recognition result B is selected. When recognition results A and B are both failures, an error results and neither recognition result is adopted.
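The four cases of the adoption definition map naturally onto a lookup table. In the sketch below, a recognition failure is represented by `None` and an error is also reported as `None`; these conventions and the names `ADOPTION_DEFINITION` and `select_result` are assumptions for illustration, not part of the patent.

```python
from typing import Dict, Optional, Tuple

# (A succeeded, B succeeded) -> which result to adopt; None means error.
ADOPTION_DEFINITION: Dict[Tuple[bool, bool], Optional[str]] = {
    (True, True): "A",
    (True, False): "A",
    (False, True): "B",
    (False, False): None,
}


def select_result(result_a: Optional[str],
                  result_b: Optional[str]) -> Optional[str]:
    """Pick one recognition result according to the adoption definition.

    A result of None stands for a recognition failure (an assumption).
    """
    adopted = ADOPTION_DEFINITION[(result_a is not None, result_b is not None)]
    if adopted == "A":
        return result_a
    if adopted == "B":
        return result_b
    return None  # both failed: error, neither result is adopted
```

For instance, `select_result(None, "ramen")` yields `"ramen"`, while `select_result("udon", "ramen")` yields `"udon"` because recognition result A takes priority whenever it succeeds.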

Returning to FIG. 3, the speech recognition apparatus 101 of the present embodiment includes a speech recognition result selection unit 171, which the speech recognition apparatus 101 of the first embodiment does not have. The speech recognition result selection unit 171 is one of the functions that the CPU 102 executes according to the speech recognition processing program. It selects and outputs one of the recognition results according to the adoption definition 201 illustrated in FIG. 4.

Therefore, according to the present embodiment, the utterance section extraction unit 151 takes in both systems of sound signals output by the two microphone elements 120 and extracts an utterance signal from each. The speech recognition unit 161 then performs speech recognition processing on these two systems of utterance signals and obtains two recognition results. Finally, the speech recognition result selection unit 171 selects and outputs one of the recognition results according to the adoption definition 201.
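The flow in this paragraph (cut the same utterance section out of each unprocessed channel, then recognize each channel separately) can be sketched as follows. This is an illustration under stated assumptions: the function name `recognize_all_channels`, the representation of the utterance section as a pair of sample indices, and the `recognize` callable are all hypothetical, and the recognizer used in the demonstration is a stand-in, not a real speech recognizer.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple


def recognize_all_channels(
    raw_channels: Dict[str, Sequence[float]],
    section: Tuple[int, int],
    recognize: Callable[[Sequence[float]], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Extract the utterance section from every raw channel and recognize it.

    `section` is (start, end) in samples, as detected on the noise-reduced
    processed signal. The slices are taken from the unmodified sound signals.
    """
    start, end = section
    results: Dict[str, Optional[str]] = {}
    for name, samples in raw_channels.items():
        utterance = samples[start:end]        # role of extraction unit 151
        results[name] = recognize(utterance)  # role of recognition unit 161
    return results


# Stand-in recognizer for demonstration only: reports the slice length.
demo = recognize_all_channels(
    {"A": [0.0] * 10, "B": [0.1] * 10},
    (2, 7),
    lambda utterance: f"{len(utterance)} samples",
)
print(demo)
```

The resulting per-channel dictionary is what a selection step such as the adoption definition 201 would then consume.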

FIG. 5 is a schematic diagram illustrating how a recognition result is selected according to the adoption definition 201. Suppose that, as a result of the speech recognition processing in the speech recognition unit 161, recognition result A is a failure while recognition result B succeeds with the result “ramen”. The speech recognition result selection unit 171 then selects and outputs recognition result B, that is, “ramen”, according to the adoption definition 201.

As described above, according to the present embodiment, one of the two recognition results is selected according to the adoption definition 201 and output, so that the recognition rate of the speech recognition processing is further improved.

A third embodiment of the present invention will be described with reference to FIG. 6. Parts identical to those of the second embodiment are denoted by the same reference numerals, and their description is omitted.

FIG. 6 is a schematic diagram illustrating recognition scores based on the recognition results. The present embodiment is the same as the second embodiment in that either recognition result A or recognition result B is adopted and output. It differs from the second embodiment in that it does not use the adoption definition 201 illustrated in FIG. 4. Instead, it chooses which recognition result to adopt based on the recognition score that accompanies each recognition result.

That is, the speech recognition result selection unit 171 calculates, for each recognition result, a recognition score expressing its confidence. Various conventional methods can be used to calculate the recognition score, so their description is omitted. The speech recognition result selection unit 171 then selects and outputs the recognition result corresponding to the highest recognition score.

For example, in FIG. 6, FIG. 6(a) shows recognition result A, based on the sound signal output from the microphone element 120 labeled “A” in FIG. 3 and digitized by the analog-to-digital converter 122, and FIG. 6(b) shows recognition result B, based on the sound signal output from the microphone element 120 labeled “B” in FIG. 3 and digitized by the analog-to-digital converter 122. For recognition result A, three candidates, “ramen”, “rayu” (chili oil), and “menma”, are obtained at ranks 1, 2, and 3 from the utterance signal that the utterance section extraction unit 151 extracted from the sound signal. The recognition scores calculated by the speech recognition result selection unit 171 are 70 for “ramen”, 50 for “rayu”, and 20 for “menma”. Rank 4 in FIG. 6(a) is a recognition failure, for which a recognition score of 10 is calculated. For recognition result B, three candidates, “rayu”, “ramen”, and “menma”, are obtained at ranks 1, 2, and 4 from the utterance signal that the utterance section extraction unit 151 extracted from the sound signal. The recognition scores calculated by the speech recognition result selection unit 171 are 60 for “rayu”, 30 for “ramen”, and 5 for “menma”. Rank 3 in FIG. 6(b) is a recognition failure, for which a recognition score of 10 is calculated. In the example of FIG. 6, therefore, the speech recognition result selection unit 171 selects and outputs the recognition result corresponding to the highest recognition score, 70, that is, “ramen”.
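The score-based selection can be sketched with the values of FIG. 6. The data layout (a list of word and score pairs per channel, with `None` marking a recognition failure) and the function name `select_by_score` are assumptions of this sketch. Skipping failure entries during selection is also an assumption. The specification only states that the result with the highest recognition score is selected, and in this example the failure entries score too low to matter either way.

```python
from typing import List, Optional, Tuple

# (recognized word, or None for a recognition failure; recognition score)
Candidate = Tuple[Optional[str], int]


def select_by_score(*channels: List[Candidate]) -> str:
    """Return the word with the highest recognition score over all channels."""
    candidates = [
        (word, score)
        for channel in channels
        for word, score in channel
        if word is not None  # assumption: failure entries are never output
    ]
    if not candidates:
        raise ValueError("no successful recognition result in any channel")
    best_word, _best_score = max(candidates, key=lambda c: c[1])
    return best_word


# Values from FIG. 6(a) and FIG. 6(b), listed in rank order.
result_a = [("ramen", 70), ("rayu", 50), ("menma", 20), (None, 10)]
result_b = [("rayu", 60), ("ramen", 30), (None, 10), ("menma", 5)]

print(select_by_score(result_a, result_b))  # ramen
```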

As described above, according to the present embodiment, the recognition result with the highest recognition score among those accompanying the two recognition results is selected and output, so that the recognition rate of the speech recognition processing is further improved.

A fourth embodiment of the present invention will be described with reference to FIGS. 7 and 8. Parts identical to those of the second embodiment are denoted by the same reference numerals, and their description is omitted. The present embodiment is the same as the second embodiment in that either recognition result A or recognition result B is adopted and output. It differs from the second embodiment in that it does not use the adoption definition 201 illustrated in FIG. 4. Instead, it chooses which recognition result to adopt according to the relative volume of the two systems of utterance signals extracted by the utterance section extraction unit 151.

FIG. 7 is a functional block diagram of the speech recognition apparatus according to the present embodiment. The speech recognition apparatus 101 of the present embodiment includes a volume calculation unit 181 as one of the functions that the CPU 102 executes according to the speech recognition processing program. The volume calculation unit 181 takes in the utterance signals that the utterance section extraction unit 151 extracts from both systems of sound signals output by the two microphone elements 120, and calculates their volume. The volume is easily calculated by referring to the amplitudes of the two systems of utterance signals.

The speech recognition result selection unit 171 selects and outputs the recognition result corresponding to the sound signal with the lowest volume among the volumes calculated by the volume calculation unit 181. The reason is that both systems of utterance signals contain the speaker's voice signal, so the louder utterance signal is expected to contain more noise signal in addition to that voice signal.

FIG. 8 is a schematic diagram illustrating the volume calculated for each of the two systems of sound signals. In FIG. 8, utterance signal A is the utterance signal based on the sound signal output from the microphone element 120 labeled “A” in FIG. 3 and digitized by the analog-to-digital converter 122, and utterance signal B is the utterance signal based on the sound signal output from the microphone element 120 labeled “B” in FIG. 3 and digitized by the analog-to-digital converter 122. In the example of FIG. 8, utterance signal A is at −20 dB and utterance signal B is at −25 dB. The speech recognition result selection unit 171 therefore selects and outputs the recognition result corresponding to utterance signal B, which has the lower volume. In other words, “ramen”, the recognition result corresponding to utterance signal A, is not selected, and “rayu”, the recognition result corresponding to utterance signal B, is selected.
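The volume-based selection can be sketched as follows. The specification says only that the volume is obtained by referring to the amplitude of the utterance signals. Measuring it as an RMS level in decibels relative to a full scale of 1.0 is an assumption of this sketch, as are the names `rms_level_db` and `select_quietest`.

```python
import math
from typing import Dict, Sequence


def rms_level_db(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale 1.0 (assumed volume measure)."""
    if not samples:
        raise ValueError("empty utterance signal")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Clamp to avoid log10(0) for an all-zero signal.
    return 20.0 * math.log10(max(rms, 1e-12))


def select_quietest(levels_db: Dict[str, float],
                    results: Dict[str, str]) -> str:
    """Return the recognition result of the channel with the lowest volume."""
    quietest = min(levels_db, key=levels_db.get)
    return results[quietest]


# Values from FIG. 8: utterance signal A at -20 dB, B at -25 dB.
levels = {"A": -20.0, "B": -25.0}
results = {"A": "ramen", "B": "rayu"}
print(select_quietest(levels, results))  # rayu
```

In practice the levels would come from `rms_level_db` applied to each extracted utterance signal. The figure's values are used directly here so that the selection matches the example in the text.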

As described above, according to the present embodiment, the recognition result corresponding to the utterance signal with the lower volume is selected from the two recognition results and output, so that the recognition rate of the speech recognition processing is further improved.

FIG. 1 is a block diagram showing the hardware configuration of a speech recognition apparatus as a first embodiment of the present invention.
FIG. 2 is a functional block diagram of the speech recognition apparatus.
FIG. 3 is a functional block diagram of a speech recognition apparatus showing a second embodiment of the present invention.
FIG. 4 is a schematic diagram illustrating the adoption definition.
FIG. 5 is a schematic diagram illustrating how a recognition result is selected according to the adoption definition.
FIG. 6 is a schematic diagram illustrating recognition scores based on recognition results, showing the second embodiment of the present invention.
FIG. 7 is a functional block diagram of a speech recognition apparatus showing the second embodiment of the present invention.
FIG. 8 is a schematic diagram illustrating the volume calculated for each of the two systems of sound signals.

Explanation of symbols

119 Microphone array
120 Microphone element
201 Adoption definition

Claims

A speech recognition apparatus comprising:
means for applying noise signal reduction processing to sound signals of a plurality of systems, which are output from the individual microphone elements when speech is input to a microphone array composed of a plurality of the microphone elements, by utilizing the fact that the voice signal and the noise signal included in the sound signals differ from system to system, and for generating a processed signal with an improved S/N ratio;
means for detecting an utterance section based on the processed signal and outputting it as utterance section information;
means for extracting, from the sound signal of one system, an utterance signal within the utterance section specified by the utterance section information; and
means for performing speech recognition processing on the extracted utterance signal to obtain a recognition result.

A speech recognition apparatus comprising:
means for applying noise signal reduction processing to sound signals of a plurality of systems, which are output from the individual microphone elements when speech is input to a microphone array composed of a plurality of the microphone elements, by utilizing the fact that the voice signal and the noise signal included in the sound signals differ from system to system, and for generating a processed signal with an improved S/N ratio;
means for detecting an utterance section based on the processed signal and outputting it as utterance section information;
means for extracting, from the sound signals of a plurality of systems, utterance signals within the utterance section specified by the utterance section information;
means for performing speech recognition processing on the extracted utterance signals of the plurality of systems to obtain a plurality of recognition results; and
means for selecting and outputting one of the recognition results according to an adoption definition that defines the relationship between the success or failure of recognition for the utterance signals and the utterance signal to be adopted.

A speech recognition apparatus comprising:
means for applying noise signal reduction processing to sound signals of a plurality of systems, which are output from the individual microphone elements when speech is input to a microphone array composed of a plurality of the microphone elements, by utilizing the fact that the voice signal and the noise signal included in the sound signals differ from system to system, and for generating a processed signal with an improved S/N ratio;
means for detecting an utterance section based on the processed signal and outputting it as utterance section information;
means for extracting, from the sound signals of a plurality of systems, utterance signals within the utterance section specified by the utterance section information;
means for performing speech recognition processing on the extracted utterance signals of the plurality of systems to obtain a plurality of recognition results;
means for calculating, for each recognition result, a recognition score expressing its confidence; and
means for selecting and outputting the recognition result corresponding to the highest recognition score.

A speech recognition apparatus comprising:
means for applying noise signal reduction processing to sound signals of a plurality of systems, which are output from the individual microphone elements when speech is input to a microphone array composed of a plurality of the microphone elements, by utilizing the fact that the voice signal and the noise signal included in the sound signals differ from system to system, and for generating a processed signal with an improved S/N ratio;
means for detecting an utterance section based on the processed signal and outputting it as utterance section information;
means for extracting, from the sound signals of a plurality of systems, utterance signals within the utterance section specified by the utterance section information;
means for performing speech recognition processing on the extracted utterance signals of the plurality of systems to obtain a plurality of recognition results;
means for calculating the volume of the sound signals from which the utterance signals are extracted; and
means for selecting and outputting the recognition result corresponding to the sound signal with the lowest volume.

A speech recognition processing program that is installed on a computer and causes the computer to execute:
means for applying noise signal reduction processing to sound signals of a plurality of systems, which are output from the individual microphone elements when speech is input to a microphone array composed of a plurality of the microphone elements, by utilizing the fact that the voice signal and the noise signal included in the sound signals differ from system to system, and for generating a processed signal with an improved S/N ratio;
means for detecting an utterance section based on the processed signal and outputting it as utterance section information;
means for extracting, from the sound signal of one system, an utterance signal within the utterance section specified by the utterance section information; and
means for performing speech recognition processing on the extracted utterance signal to obtain a recognition result.

A speech recognition processing program that is installed on a computer and causes the computer to execute:
means for applying noise signal reduction processing to sound signals of a plurality of systems, which are output from the individual microphone elements when speech is input to a microphone array composed of a plurality of the microphone elements, by utilizing the fact that the voice signal and the noise signal included in the sound signals differ from system to system, and for generating a processed signal with an improved S/N ratio;
means for detecting an utterance section based on the processed signal and outputting it as utterance section information;
means for extracting, from the sound signals of a plurality of systems, utterance signals within the utterance section specified by the utterance section information;
means for performing speech recognition processing on the extracted utterance signals of the plurality of systems to obtain a plurality of recognition results; and
means for selecting and outputting one of the recognition results according to an adoption definition that defines the relationship between the success or failure of recognition for the utterance signals and the utterance signal to be adopted.

A speech recognition processing program that is installed on a computer and causes the computer to execute:
means for applying noise signal reduction processing to sound signals of a plurality of systems, which are output from the individual microphone elements when speech is input to a microphone array composed of a plurality of the microphone elements, by utilizing the fact that the voice signal and the noise signal included in the sound signals differ from system to system, and for generating a processed signal with an improved S/N ratio;
means for detecting an utterance section based on the processed signal and outputting it as utterance section information;
means for extracting, from the sound signals of a plurality of systems, utterance signals within the utterance section specified by the utterance section information;
means for performing speech recognition processing on the extracted utterance signals of the plurality of systems to obtain a plurality of recognition results;
means for calculating, for each recognition result, a recognition score expressing its confidence; and
means for selecting and outputting the recognition result corresponding to the highest recognition score.

A speech recognition processing program that is installed on a computer and causes the computer to execute:
means for applying noise signal reduction processing to sound signals of a plurality of systems, which are output from the individual microphone elements when speech is input to a microphone array composed of a plurality of the microphone elements, by utilizing the fact that the voice signal and the noise signal included in the sound signals differ from system to system, and for generating a processed signal with an improved S/N ratio;
means for detecting an utterance section based on the processed signal and outputting it as utterance section information;
means for extracting, from the sound signals of a plurality of systems, utterance signals within the utterance section specified by the utterance section information;
means for performing speech recognition processing on the extracted utterance signals of the plurality of systems to obtain a plurality of recognition results;
means for calculating the volume of the sound signals from which the utterance signals are extracted; and
means for selecting and outputting the recognition result corresponding to the sound signal with the lowest volume.