JP2001337694A

JP2001337694A - Method for presuming speech source position, method for recognizing speech, and method for emphasizing speech

Info

Publication number: JP2001337694A
Application number: JP2000115323A
Authority: JP
Inventors: Akira Kurematsu (榑松 明); Takayuki Nagai (長井 隆行)
Original assignee: Individual
Current assignee: Individual
Priority date: 2000-03-24
Filing date: 2000-04-17
Publication date: 2001-12-07

Abstract

PROBLEM TO BE SOLVED: To provide a sound source position estimation method, a speech recognition method, and a speech enhancement method that raise the recognition rate of a target speech signal in a noisy environment such as the interior of a car. SOLUTION: The methods use a delay-and-sum array unit 4, a speech recognition unit 5, and receiving units 2. A sound source position estimation unit 3 estimates the position of a sound source 1. Based on this estimate, the delay-and-sum array unit 4 forms a directional characteristic toward the position of the sound source 1 and enhances the target speech. The speech recognition unit 5 then recognizes the input signal supplied from the receiving units 2 as speech.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a sound source position estimation method, a speech recognition method, and a speech enhancement method.

[0002]

[Description of the Related Art] Inside a car, for example, various noises such as road noise and radio sound are present. Speech recognition in a car therefore suffers from a low recognition rate. In such a noisy environment, the S/N ratio of the target speech (the speech signal to be recognized) is generally low, and noise removal or speech enhancement is not easy.

[0003]

[Problem to Be Solved by the Invention] The present invention has been made in view of the above circumstances. Its object is to provide a sound source position estimation method, a speech recognition method, and a speech enhancement method that can improve the recognition rate of a target speech signal in a noisy environment, for example inside an automobile.

[0004]

[Means for Solving the Problem] The sound source position estimation method according to claim 1 uses a plurality of receiving units that receive a speech signal emitted from a sound source, and estimates the position of the sound source in two dimensions (the X-Y directions) based on the maximum value of P(x, y) given by the following equation (a).

(Equation 5) where

(Equation 6) ω: angular frequency; T: transpose; x_i, y_i: coordinates of each receiving unit. * denotes the complex conjugate transpose. The left-hand side of equation (b) is the vector quantity given by the following equation (e), where S_q(t) is the amplitude of the sound wave received by the M receiving units.

(Equation 7) In equation (c), v_q denotes a component of the eigenvectors V = [v_1, ..., v_M] of the correlation matrix of s(t), where s(t) in the following equation (f) is the received signal when K sound waves arrive at each of the M receiving units.

(Equation 8) Here, n_m(t) denotes the noise component at each receiving unit.
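The images for equations (a) to (f) are not reproduced in this text. The definitions given above (delays tied to receiver coordinates, eigenvectors of the correlation matrix, a quantity P(x, y) to be maximized) are consistent with the standard MUSIC formulation, so one plausible reading is sketched below. It is an assumption, not the patent's verified equations: the speed of sound c, the source coordinates (x_q, y_q), and the ordering of the eigenvectors by descending eigenvalue (so that v_{K+1}, ..., v_M span the noise subspace) are introduced here for illustration, and no mapping onto the letter labels (a) to (f) is claimed.

```latex
\begin{align}
\tau_i(x,y) &= \frac{1}{c}\sqrt{(x-x_i)^2+(y-y_i)^2}, \\
\mathbf{a}(x,y) &= \bigl[e^{-j\omega\tau_1(x,y)},\,\ldots,\,e^{-j\omega\tau_M(x,y)}\bigr]^{T}, \\
\mathbf{s}(t) &= \sum_{q=1}^{K} S_q(t)\,\mathbf{a}(x_q,y_q) + \mathbf{n}(t),
  \qquad \mathbf{n}(t) = \bigl[n_1(t),\ldots,n_M(t)\bigr]^{T}, \\
R &= E\bigl[\mathbf{s}(t)\,\mathbf{s}(t)^{*}\bigr],
  \qquad R_n = \sum_{q=K+1}^{M} \mathbf{v}_q\,\mathbf{v}_q^{*}, \\
P(x,y) &= \frac{1}{\mathbf{a}(x,y)^{*}\,R_n\,\mathbf{a}(x,y)}.
\end{align}
```

Under this reading, P(x, y) becomes large where the steering vector a(x, y) is nearly orthogonal to the noise subspace, which is why its maximum indicates the source position.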

[0005] The speech recognition method according to claim 2 uses a delay-and-sum array unit, a speech recognition unit, and receiving units. The position of a sound source is first estimated. Based on this estimate, the delay-and-sum array unit forms a directional characteristic toward the position of the sound source to enhance the target speech. The speech recognition unit then recognizes the input signal supplied from the receiving units as speech.

[0006] The speech enhancement method according to claim 3 uses a delay-and-sum array unit, a speech recognition unit, a pitch extraction unit, a speech synthesis unit, and receiving units, and includes the following steps.
(1) Estimating the position of a sound source and, based on this estimate, forming a directional characteristic toward the position of the sound source with the delay-and-sum array unit to enhance the target speech.
(2) Recognizing, with the speech recognition unit, the input signal supplied to the receiving units as speech.
(3) Extracting a pitch frequency with the pitch extraction unit based on the output of the delay-and-sum array unit.
(4) Combining the output of the pitch extraction unit and the output of the speech recognition unit in the speech synthesis unit to obtain a speech output.

[0007] The speech recognition method according to claim 4 is the method of claim 2 in which the position of the sound source is estimated by the sound source position estimation method according to claim 1.

[0008] The speech enhancement method according to claim 5 is the method of claim 3 in which the position of the sound source is estimated by the sound source position estimation method according to claim 1.

[0009] Claim 6 is the sound source position estimation method according to claim 1, the speech recognition method according to claim 2 or 4, or the speech enhancement method according to claim 3 or 5, in which the receiving units are microphones.

[0010]

[Embodiments of the Invention] A sound source position estimation method according to one embodiment of the present invention, and speech recognition and enhancement methods using it, are described with reference to the accompanying drawings.

First, the speech recognition apparatus used in this embodiment is outlined with reference to FIG. 1. In this embodiment the sound source 1 is assumed to be a (human) speaker, but it may also be speech from a loudspeaker; the type of sound source is not limited. The speaker serving as the sound source 1 is assumed to be sitting, for example, in the driver's seat, the front passenger seat, or a rear seat of a car. The apparatus comprises, as its main components, a plurality of receiving units 2, a sound source position estimation unit 3, a delay-and-sum array unit 4, a speech recognition unit 5, a pitch extraction unit 6, and a speech synthesis unit 7. In this embodiment the receiving units 2 are microphones, and together they form a microphone array.

A speech signal emitted from the sound source 1 (meaning a signal in the speech band; the source and the transmission medium are not limited) is received by the receiving units 2 and converted into an input signal (generally an electrical signal). This input signal is fed to the sound source position estimation unit 3, which estimates the position on a two-dimensional plane (the x-y plane) as follows.

The premises of the position estimation are explained first. Let S_q(t) be the amplitude of the sound wave that arrives from coordinates (x, y) and is received by the receiving units 2. Expressing the amplitudes at the individual receiving units 2 (M in number) as a vector gives the following equation (1).

(Equation 9) where

(Equation 10) ω: angular frequency; T: transpose; x_i, y_i: coordinates of the i-th microphone. τ_i denotes the shift in sound-wave arrival time caused by the difference in sound source position. The amplitude spectrum s(t) of the received signal when K sound waves arrive can then be expressed as the sum of the individual sound waves, as follows.

(Equation 11) Here, n_m(t) is the noise component at each microphone. Next, the matrix R_n is obtained from equation (3) below using the eigenvectors V = [v_1, ..., v_M] of the correlation matrix R of s(t), where * denotes the complex conjugate transpose.

(Equation 12) Because the arrival-direction vector of the signal can be regarded as orthogonal to the noise subspace, the position of the sound source on the two-dimensional plane can be estimated by finding the maximum of P(x, y) in the following equation (4).

(Equation 13) That is, the sound source position estimation unit 3 estimates the position of the sound source on the two-dimensional plane by finding the maximum of P(x, y) described above. The estimated position is sent to the delay-and-sum array unit 4. Limiting the search range for the sound source position in the sound source position estimation unit 3 reduces both the amount of computation and the number of erroneous estimates. In a car, for example, these benefits are obtained by searching only the vicinity of the seats. Furthermore, the estimation accuracy can also be improved by having the sound source position estimation unit 3 operate not on the raw input signal from the receiving units 2 but on the 100 Hz to 4 kHz band, where speech is mainly present.
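As an illustration only, the search for the maximum of P(x, y) over a restricted grid can be sketched in Python with NumPy. The sketch assumes the standard narrowband MUSIC form (the patent's equation images are not reproduced in this text), a speed of sound of 340 m/s, a known number of sources, and complex narrowband snapshots at a single angular frequency. The function names `steering_vector`, `music_spectrum`, and `estimate_position` are hypothetical and do not come from the patent.

```python
import numpy as np

C = 340.0  # assumed speed of sound [m/s]; not specified in the patent


def steering_vector(x, y, mics, omega):
    """Phase vector for a source at (x, y); mics is an (M, 2) array of coordinates."""
    dist = np.hypot(mics[:, 0] - x, mics[:, 1] - y)
    tau = dist / C  # propagation delay to each microphone [s]
    return np.exp(-1j * omega * tau)


def music_spectrum(snapshots, mics, omega, n_sources, xs, ys):
    """Evaluate P(x, y) on the grid xs x ys.

    snapshots: (M, T) complex narrowband samples at angular frequency omega.
    """
    n_mics, n_snap = snapshots.shape
    # Sample correlation matrix R of the received signal
    R = snapshots @ snapshots.conj().T / n_snap
    # eigh returns eigenvalues in ascending order, so the first
    # M - K eigenvectors span the (assumed) noise subspace
    _, eigvecs = np.linalg.eigh(R)
    noise_vecs = eigvecs[:, : n_mics - n_sources]
    Rn = noise_vecs @ noise_vecs.conj().T
    P = np.empty((len(xs), len(ys)))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            a = steering_vector(x, y, mics, omega)
            P[i, j] = 1.0 / np.real(a.conj() @ Rn @ a)
    return P


def estimate_position(snapshots, mics, omega, n_sources, xs, ys):
    """Return the grid point (x, y) at which P(x, y) is largest."""
    P = music_spectrum(snapshots, mics, omega, n_sources, xs, ys)
    i, j = np.unravel_index(np.argmax(P), P.shape)
    return xs[i], ys[j]
```

Restricting `xs` and `ys` to the area around the seats corresponds to the limited search range described above, and choosing `omega` inside the 100 Hz to 4 kHz band corresponds to the band restriction.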

[0011] The delay-and-sum (DS) array unit 4 adds a delay to the signal received by each receiving unit 2 and then sums the results, thereby enhancing the sound at a target position (reference: Toshiro Ohashi, Yoshio Yamazaki, and Yutaka Kaneda, "Acoustic Systems and Digital Processing," IEICE, March 1995). In this embodiment, the position estimated by the sound source position estimation unit 3 is taken as the target position, a directional characteristic is formed toward those coordinates, and the target speech is enhanced. Because the delay-and-sum array is not sufficiently effective at low frequencies, components in which speech is almost absent, for example 0 Hz to 100 Hz, are removed by a filter (not shown). The configuration of the delay-and-sum array itself is well known, so further detail is omitted.
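The patent leaves the delay-and-sum array to the cited literature. A minimal sketch of the idea, under assumptions, is given here: each channel is time-aligned to the target position by a phase shift in the frequency domain and the channels are then averaged. The function name `delay_and_sum`, the 340 m/s speed of sound, and the use of FFT-based (circular) fractional delays are choices made for this illustration and are not taken from the patent.

```python
import numpy as np


def delay_and_sum(signals, mics, target, fs, c=340.0):
    """Steer a microphone array toward target = (x, y).

    signals: (M, N) real array, one row per microphone.
    mics:    (M, 2) array of microphone coordinates [m].
    fs:      sampling rate [Hz]; c is an assumed speed of sound [m/s].
    Returns the length-N enhanced signal.
    """
    n_samples = signals.shape[1]
    dist = np.hypot(mics[:, 0] - target[0], mics[:, 1] - target[1])
    # Arrival delay of each channel relative to the closest microphone [s]
    delays = (dist - dist.min()) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Advance each channel by its delay (circular shift via phase rotation),
    # so that sound from the target position adds coherently
    aligned = spectra * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)
```

Sound from the target position adds in phase across channels while uncorrelated noise averages down, which is the enhancement effect described above. A real-time implementation would more likely use per-channel delay lines or fractional-delay filters than a whole-signal FFT.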

[0012] The speech recognition unit 5 performs speech recognition on the input signal containing the target speech enhanced by the delay-and-sum array unit 4, which makes it possible to recognize the target speech accurately. The pitch extraction unit 6 extracts the pitch frequency of the input signal. The speech synthesis unit 7 synthesizes a speech signal from the output of the pitch extraction unit 6 and the output of the speech recognition unit 5. This improves the perceived quality of the speech signal obtained by the speech recognition unit 5, so that a good-quality speech signal can be obtained when the speech signal itself is needed, as with a car phone. The configurations of the speech recognition unit 5, the pitch extraction unit 6, and the speech synthesis unit 7 are themselves well known, so further description is omitted.
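The patent treats pitch extraction as known art and gives no algorithm. One common approach, shown here purely as an assumption about what such a unit might do, is to pick the strongest peak of the frame's autocorrelation within a plausible pitch range. The function name `pitch_autocorr` and the 60 Hz to 400 Hz search range are hypothetical choices for this sketch.

```python
import numpy as np


def pitch_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch frequency [Hz] of one voiced frame.

    frame: 1-D real array of samples; fs: sampling rate [Hz].
    fmin/fmax bound the search range (assumed values, not from the patent).
    """
    frame = frame - frame.mean()  # remove DC offset
    # One-sided autocorrelation: index k holds the value at lag k
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)  # shortest period considered
    lag_max = int(fs / fmin)  # longest period considered
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / lag
```

The frame must be longer than `fs / fmin` samples for the full lag range to be covered. In the apparatus described above, such an estimator would be applied frame by frame to the output of the delay-and-sum array unit 4.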

[0013]

[Experimental Example] An example in which speaker position estimation and speech recognition were performed by simulation according to the method of this embodiment is described. The assumed situation is recognition of a speaker's utterance inside a car. The simulation conditions are listed below, and the results are shown in FIG. 2.

(Simulation conditions)
Number of microphones (receiving units): 16
Noise sources: 6 locations (in-car noise such as road noise; S/N ratio: 0 dB)
Change of speaker position: from x-y coordinates (0.2, 0.2) (left rear seat) to coordinates (0.4, 1) at frame 43

FIGS. 2(a) and 2(b) show the estimated speaker position in each frame. In the voiced intervals (the time during which speech is being uttered), the estimate switches from coordinates (0.2, 0.2) to coordinates (0.4, 1) around frame 43. This shows that the position is estimated almost correctly.

[0014] Furthermore, FIGS. 2(c) to 2(d) show that the method of this example can recognize speech close to the original speech.

[0015] In this embodiment, x and y denote two arbitrarily chosen dimensions and are not limited to a horizontal plane. The specific means for implementing each unit of this embodiment may be hardware, software, a network, a combination of these, or any other means, as is obvious to those skilled in the art. Furthermore, the embodiment and example described above are merely illustrative and do not indicate configurations essential to the present invention. The configuration of each unit is not limited to the above as long as the purpose of the present invention is achieved.

[0016]

[Effects of the Invention] According to the estimation method of claim 1, the position of a sound source in two dimensions can be estimated relatively accurately.

[0017] According to the speech recognition method of claim 2, relatively accurate speech recognition can be performed in a noisy environment.

[0018] According to the speech enhancement method of claim 3, perceptually good-quality speech can be obtained from speech information acquired in a noisy environment.

[0019] According to the speech recognition method of claim 4, speech recognition can be based on a relatively accurately estimated sound source position, which improves recognition accuracy.

[0020] According to the speech enhancement method of claim 5, speech enhancement can be based on a relatively accurately estimated sound source position, which further improves the perceived quality of the resulting speech.

[Brief description of the drawings]

FIG. 1 is a functional block diagram for explaining a speech recognition method according to one embodiment of the present invention.

FIG. 2 shows graphs of the results of one example of the present invention: (a) shows the estimated x-axis position versus frame number, (b) shows the estimated y-axis position versus frame number, (c) shows the time waveform of the original speech signal, (d) shows the time waveform of the original speech signal with noise added, and (e) shows the time waveform of the recognized speech signal.

[Explanation of symbols]

1 sound source
2 receiving unit
3 sound source position estimation unit
4 delay-and-sum array unit
5 speech recognition unit
6 pitch estimation unit
7 speech synthesis unit

Claims

[Claims]

1. A sound source position estimation method, characterized by using a plurality of receiving units that receive a speech signal emitted from a sound source, and estimating the position of the sound source in two dimensions (the X-Y directions) based on the maximum value of P(x, y) given by the following equation (a).
(Equation 1) where
(Equation 2) ω: angular frequency; T: transpose; x_i, y_i: coordinates of each receiving unit. * denotes the complex conjugate transpose. The left-hand side of equation (b) is the vector quantity given by the following equation (e), where S_q(t) is the amplitude of the sound wave received by the M receiving units.
(Equation 3) In equation (c), v_q denotes a component of the eigenvectors V = [v_1, ..., v_M] of the correlation matrix of s(t), where s(t) in the following equation (f) is the received signal when K sound waves arrive at each of the M receiving units.
(Equation 4) Here, n_m(t) denotes the noise component at each receiving unit.

2. A speech recognition method using a delay-and-sum array unit, a speech recognition unit, and receiving units, characterized by first estimating the position of a sound source, then, based on this estimate, forming a directional characteristic toward the position of the sound source with the delay-and-sum array unit to enhance the target speech, and then recognizing, with the speech recognition unit, the input signal supplied from the receiving units as speech.

3. A speech enhancement method using a delay-and-sum array unit, a speech recognition unit, a pitch extraction unit, a speech synthesis unit, and receiving units, and including the following steps.
(1) Estimating the position of a sound source and, based on this estimate, forming a directional characteristic toward the position of the sound source with the delay-and-sum array unit to enhance the target speech.
(2) Recognizing, with the speech recognition unit, the input signal supplied to the receiving units as speech.
(3) Extracting a pitch frequency with the pitch extraction unit based on the output of the delay-and-sum array unit.
(4) Combining the output of the pitch extraction unit and the output of the speech recognition unit in the speech synthesis unit to obtain a speech output.

4. The speech recognition method according to claim 2, characterized in that the position of the sound source is estimated by the sound source position estimation method according to claim 1.

5. The speech enhancement method according to claim 3, characterized in that the position of the sound source is estimated by the sound source position estimation method according to claim 1.

6. The sound source position estimation method according to claim 1, the speech recognition method according to claim 2 or 4, or the speech enhancement method according to claim 3 or 5, characterized in that the receiving units are microphones.