JP7159767B2

JP7159767B2 - Audio signal processing program, audio signal processing method, and audio signal processing device

Info

Publication number: JP7159767B2
Application number: JP2018189754A
Authority: JP
Inventors: 潤高橋; 拓也上村; 健太郎村瀬
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-10-05
Filing date: 2018-10-05
Publication date: 2022-10-25
Anticipated expiration: 2038-10-05
Also published as: JP2020060612A

Description

The present invention relates to an audio signal processing program, an audio signal processing method, and an audio signal processing device.

For example, there are techniques that apply speech recognition to audio recorded on a non-transitory recording medium to obtain the utterance content and utterance times, and then create subtitles representing the utterance content or build a corpus in which specific terms can be searched for within the utterance content. However, when the audio recorded on the non-transitory recording medium contains noise, speech recognition accuracy deteriorates.

For example, there is a technique that removes noise by using multiple microphones during sound pickup to obtain the direction of arrival of the sound. In general, however, information about the microphones used to pick up audio recorded on a non-transitory recording medium is unknown, so it is difficult to apply this technique to such recorded audio.

JP 2004-020679 A
JP 2008-076975 A

Vincent et al., "Extracting and Composing Robust Features with Denoising Autoencoders", Proc. of ICML 2008, July 2008, pp. 1096-1103
"5. RWCP speech database", online, [retrieved September 19, 2018], Internet (http://research.nii.ac.jp/src/RWCP-SP96.html)

As a technique for removing noise without using information about the placement of the microphones used to pick up the audio, there is, for example, a technique that generates a noise removal filter from the audio signal and applies the generated filter to the amplitude spectrum of the audio signal.

However, when the signal-to-noise ratio is small and the noise resembles the speech to be extracted, it is difficult to remove the noise properly even if a noise removal filter is applied to the amplitude spectrum of the audio. The noise resembles the speech to be extracted when, for example, the noise is an utterance by a speaker other than the target speaker. Applying the noise removal filter to the phase spectrum as well makes it possible to remove the noise properly, but increases the processing load.

In one aspect, an object of the present invention is to make it possible to properly remove noise from audio while limiting the processing load.

In one embodiment, a time-frequency transform is performed on an audio signal to obtain a frequency spectrum corresponding to the audio signal, and a complex filter that removes a noise component contained in the audio signal is generated based on the obtained frequency spectrum. At least one of two comparisons is performed: a comparison of the value of the real part of the complex filter with a first value, and a comparison of the value of the imaginary part of the complex filter with a second value. The first value is the value of the real part of a complex filter generated when no noise component is present, and the second value is the value of the imaginary part of a complex filter generated when no noise component is present. When the difference between the real part and the first value is small, it is decided to apply the amplitude component of the complex filter to the frequency spectrum; when that difference is not small, it is decided to apply the amplitude component and the phase component of the complex filter to the frequency spectrum. Alternatively, when the difference between the imaginary part and the second value is small, it is decided to apply the amplitude component of the complex filter to the frequency spectrum; when that difference is not small, it is decided to apply the amplitude component and the phase component. Alternatively, when the difference between the real part and the first value is small and the difference between the imaginary part and the second value is small, it is decided to apply the amplitude component of the complex filter to the frequency spectrum. In this last case, when either the difference between the real part and the first value or the difference between the imaginary part and the second value is not small, it is decided to apply the amplitude component and the phase component of the complex filter to the frequency spectrum.

In one aspect, the present invention makes it possible to properly remove noise from audio while limiting the processing load.

A block diagram showing an example of the audio signal processing device according to the embodiment.
A block diagram showing an example of the complex filter generation unit.
A block diagram showing an example of the hardware of the audio signal processing device according to the embodiment.
A flowchart showing an example of the flow of audio signal processing according to the embodiment.
A table for explaining the threshold of the filter determination unit.
A table showing an example of the time required for audio signal processing for each processing pattern.
A table showing an example of the proportion of overlapping utterances.
A block diagram showing an example of an audio signal processing device.
A block diagram showing an example of the hardware of a server.
A block diagram showing an example of an audio signal processing device.
A block diagram showing an example of an audio signal processing device.
A block diagram showing an example of the hardware of a client.

An example of an embodiment is described in detail below with reference to the drawings.

The audio signal processing device 10 shown in FIG. 1 includes an audio input unit 12, a time-frequency transform unit 14, a complex filter generation unit 16, a filter determination unit 18, a filter application component determination unit 20, a filter application unit 22, an inverse time-frequency transform unit 24, and an audio output unit 26. The audio input unit 12 converts input sound into an audio signal.

The time-frequency transform unit 14 performs a time-frequency transform on one frame of the audio signal to convert it into a frequency spectrum. The time-frequency transform may be, for example, a fast Fourier transform (hereinafter, FFT), and one frame may be, for example, 10 ms long.
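The framing step can be sketched as follows. This is only an illustration: the text specifies a 10 ms frame but no sampling rate, so the 16 kHz default and the helper name `split_into_frames` are assumptions made for the example.

```python
def split_into_frames(samples, sample_rate_hz=16000, frame_ms=10):
    """Split a sample sequence into consecutive, non-overlapping frames.

    sample_rate_hz is an assumed value; only the 10 ms frame length
    comes from the description. A trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


signal = [0.0] * 400                 # 25 ms of silence at the assumed 16 kHz
frames = split_into_frames(signal)   # two full 160-sample frames
```

Each returned frame would then be passed to the time-frequency transform.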

The complex filter generation unit 16 uses, for example, the frequency spectra of N frames to generate a complex filter that removes noise contained in the audio corresponding to those frequency spectra. N may be, for example, 100. The complex filter M is expressed, for example, by expression (1).
$$M = F(Y) \tag{1}$$

Y is the frequency spectrum, and F is the generative model of the complex filter. The generative model may be, for example, a denoising autoencoder (hereinafter, DAE) 44, as illustrated in FIG. 2. The DAE 44 includes an encoder that compresses the input information and a decoder that expands the information and outputs it; by compressing the information once, it removes noise, which is unnecessary information.

The input to the DAE 44 is the frequency spectra of N frames obtained by applying a time-frequency transform in the FFT 43 to a signal containing the noise signal 42 and the audio signal 41. The frequency spectrum of one frame contains as many data items as the number of frequency samples, which may be, for example, 256. The output of the DAE 44 is one complex filter per data item. The audio signal 46 is obtained by applying the complex filters to the frequency spectrum obtained by the FFT 43 and then applying the inverse time-frequency transform in the inverse FFT unit 45.

The DAE 44 is trained so that the audio signal 46 becomes equal to the audio signal 41. The audio signal 41 is, for example, an audio signal corresponding to an utterance by the target speaker, and the noise signal 42 is, for example, an audio signal corresponding to an utterance by a speaker other than the target speaker. The generative model is not limited to a DAE; it may be any existing model that generates a complex filter from an audio signal containing a noise component.

The filter determination unit 18 determines, based on the real part of the complex filter, the magnitude of the noise component contained in the audio signal used to generate the complex filter. The complex filter M is expressed, for example, by expression (2).
$$M = F(Y) = a + bi \tag{2}$$

When the audio signal corresponding to the frequency spectrum Y contains no noise component, the real part a = 1.0 and the imaginary part b = 0.0. The smaller the noise component contained in the audio signal corresponding to the frequency spectrum Y, the closer the real part a is to 1.0, which is an example of the first value, and the closer the imaginary part b is to 0.0, which is an example of the second value.

Therefore, the closer the real part a of the generated complex filter M is to 1.0, the smaller the noise component contained in the audio signal can be judged to be, and the further the real part is from 1.0, the larger the noise component can be judged to be. Specifically, a noise judgment value is calculated, for example, by expression (3).
$$\text{noise judgment value} = 1.0 - (\text{average of the real parts of the complex filters}) \tag{3}$$

The average of the real parts of the complex filters can be calculated by summing the real parts a of the generated complex filters and dividing the sum by the number of complex filters.
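The noise judgment value (1.0 minus the average real part) can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the embodiment; the function name `noise_judgment_value` and the sample filter values are assumptions made for the example.

```python
def noise_judgment_value(filters):
    """Return 1.0 minus the mean real part of the given complex filters."""
    if not filters:
        raise ValueError("at least one complex filter is required")
    mean_real = sum(m.real for m in filters) / len(filters)
    return 1.0 - mean_real


# Hypothetical filter values, chosen only for illustration.
clean = [complex(1.0, 0.0)] * 4                        # no noise: a = 1.0, b = 0.0
noisy = [0.5 + 0.2j, 0.6 - 0.1j, 0.4 + 0.3j, 0.5 + 0.0j]

print(noise_judgment_value(clean))   # 0.0
print(noise_judgment_value(noisy))   # larger value: real parts are far from 1.0
```

A value near 0.0 indicates filters close to the noise-free case, and larger values indicate more noise.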

Based on the noise judgment value, the filter application component determination unit 20 decides whether to apply only the amplitude component of the complex filter to the frequency spectrum or to apply both the amplitude component and the phase component. For example, it decides to apply the amplitude component of the complex filter when the noise judgment value is less than or equal to a first predetermined value, and to apply the amplitude component and the phase component when the noise judgment value is greater than the first predetermined value. The first predetermined value may be, for example, 0.30.

That is, when the noise component of the audio signal is judged to be small, the filter application component determination unit 20 decides to apply the amplitude component of the complex filter to the frequency spectrum. When the noise component of the audio signal is judged to be large, it decides to apply not only the amplitude component but also the phase component of the complex filter to the frequency spectrum.
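The decision rule described above can be sketched as follows. The function and constant names are hypothetical; the default of 0.30 is the example value given for the first predetermined value.

```python
AMPLITUDE_ONLY = "amplitude"
AMPLITUDE_AND_PHASE = "amplitude+phase"


def choose_components(noise_value, first_predetermined_value=0.30):
    """Pick which filter components to apply for a given noise judgment value."""
    if noise_value <= first_predetermined_value:
        return AMPLITUDE_ONLY          # little noise: the cheaper path suffices
    return AMPLITUDE_AND_PHASE         # much noise: also correct the phase


print(choose_components(0.30))  # amplitude
print(choose_components(0.46))  # amplitude+phase
```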

A complex filter generated by the filter generation model generally contains errors. Applying only the amplitude component of such a filter reduces the processing load but also lowers the noise removal performance. Applying both the amplitude component and the phase component raises the noise removal performance but also increases the processing load. A typical speech recognition engine can recognize speech properly even when the audio signal contains a noise component, as long as that component is within a tolerable amount. Therefore, if the noise component that remains after applying only the amplitude component of the complex filter is within the tolerable amount, applying only the amplitude component is a useful way to reduce the processing load.

The filter application unit 22 applies to the frequency spectrum the components of the complex filter that were chosen for application. Expression (3) shows the frequency spectrum S obtained by applying the amplitude component of the complex filter M to the frequency spectrum Y.
$$S = |M| \cdot |Y| \tag{3}$$

Expression (4) shows the frequency spectrum S obtained by applying the amplitude component and the phase component of the complex filter M to the frequency spectrum Y.
$$S = M * Y = |M| \cdot |Y| \cdot \left(\cos(\theta_M + \theta_Y) + i \cdot \sin(\theta_M + \theta_Y)\right) \tag{4}$$
Here $\theta_M$ denotes the phase component of the complex filter M, and $\theta_Y$ denotes the phase component of the frequency spectrum Y.

According to expression (3), one multiplication is performed when the amplitude component of the complex filter is applied to the frequency spectrum. According to expression (4), five multiplications and three additions or subtractions are performed when the amplitude component and the phase component are applied. That is, applying both the amplitude component and the phase component of the complex filter imposes a larger processing load than applying only the amplitude component.
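The two ways of applying the filter can be compared with a short Python sketch using the standard `cmath` module. The function names and the sample values of M and Y are hypothetical and serve only to show that the polar form in expression (4) agrees with the direct complex product.

```python
import cmath


def apply_amplitude(m, y):
    """Amplitude-only application: S = |M| * |Y| (a real-valued magnitude)."""
    return abs(m) * abs(y)


def apply_amplitude_and_phase(m, y):
    """Polar form of expression (4): magnitudes multiply, phases add."""
    magnitude = abs(m) * abs(y)
    angle = cmath.phase(m) + cmath.phase(y)
    return cmath.rect(magnitude, angle)


m = 0.6 + 0.3j   # hypothetical filter coefficient for one frequency bin
y = 2.0 - 1.0j   # hypothetical spectrum value for the same bin

s_amp = apply_amplitude(m, y)
s_full = apply_amplitude_and_phase(m, y)

# The polar form agrees with the direct complex product M * Y.
assert cmath.isclose(s_full, m * y, rel_tol=1e-12)
# Both variants have the same magnitude; only the phase handling differs.
assert cmath.isclose(abs(s_full), s_amp, rel_tol=1e-12)
```

The sketch illustrates the algebra only; it does not reproduce the operation counts quoted in the text.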

The inverse time-frequency transform unit 24 performs an inverse time-frequency transform on the frequency spectrum to which the complex filter has been applied and obtains the audio signal corresponding to that frequency spectrum. The inverse time-frequency transform may be, for example, an inverse fast Fourier transform (hereinafter, IFFT). The audio output unit 26 outputs the sound corresponding to the audio signal obtained by the inverse time-frequency transform unit 24.

As an example, the audio signal processing device 10 includes a CPU (Central Processing Unit) 51, a primary storage unit 52, a secondary storage unit 53, an external interface 54, a microphone 31A, and a speaker 31B, as shown in FIG. 3. The CPU 51 is an example of a hardware processor. The CPU 51, the primary storage unit 52, the secondary storage unit 53, the external interface 54, the microphone 31A, and the speaker 31B are connected to one another via a bus 59.

The primary storage unit 52 is, for example, a volatile memory such as a RAM (Random Access Memory). The secondary storage unit 53 is, for example, a non-volatile memory such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive).

The secondary storage unit 53 includes a program storage area 53A and a data storage area 53B. As an example, the program storage area 53A stores programs such as an audio signal processing program that performs noise removal. As an example, the data storage area 53B stores audio signals and intermediate data generated while the audio signal processing program is running.

The CPU 51 reads the audio signal processing program from the program storage area 53A and loads it into the primary storage unit 52. By loading and executing the audio signal processing program, the CPU 51 operates as the time-frequency transform unit 14, the complex filter generation unit 16, the filter determination unit 18, the filter application component determination unit 20, the filter application unit 22, and the inverse time-frequency transform unit 24 of FIG. 1.

Programs such as the audio signal processing program may be stored on an external server and loaded into the primary storage unit 52 via a network. They may also be stored on a non-transitory recording medium such as a DVD (Digital Versatile Disc) and loaded into the primary storage unit 52 via a recording medium reader.

An external device is connected to the external interface 54, which handles the exchange of various kinds of information between the external device and the CPU 51. The microphone 31A is an example of the audio input unit 12 and converts input sound into an audio signal. The speaker 31B is an example of the audio output unit 26 and outputs, for example, the sound corresponding to an audio signal from which the noise component has been removed. The microphone 31A and the speaker 31B need not be built into the audio signal processing device 10; they may be connected to it as external devices via the external interface 54.

The audio signal processing device 10 may be, for example, a personal computer, a smartphone, or a dedicated device.

Next, an overview of the operation of the audio signal processing that removes noise is given. FIG. 4 illustrates the flow of the audio signal processing. In step 101, the CPU 51 reads one frame of the audio signal corresponding to the sound input from the microphone 31A.

In step 102, the CPU 51 performs an FFT on the audio signal that was read and obtains a frequency spectrum. In step 103, the CPU 51 determines whether the predetermined number N of frames has been read. If the determination in step 103 is negative, the CPU 51 returns to step 101. If the determination in step 103 is affirmative, the CPU 51 generates a complex filter in step 104 using the frequency spectra of the N frames.

In step 105, the CPU 51 determines, based on the real part of the generated complex filter, whether the audio signal read in step 101 contains a large noise component. If the noise component is judged to be large in step 105, the CPU 51 decides in step 107 to apply both the amplitude component and the phase component of the complex filter to remove the noise. If the noise component is judged to be small, the CPU 51 decides in step 106 to apply the amplitude component of the complex filter.

In step 108, the CPU 51 applies the components of the complex filter chosen in step 106 or step 107 to the frequency spectrum obtained in step 102. In step 109, the CPU 51 performs an IFFT on the frequency spectrum to which the complex filter has been applied and obtains an audio signal.

In step 110, the CPU 51 outputs the sound corresponding to the audio signal through the speaker 31B and ends the processing of the N frames of the audio signal.
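Steps 101 to 110 can be sketched end to end as follows. This is a rough sketch under several assumptions that are not taken from the description: a naive DFT stands in for the FFT and IFFT, `all_pass_model` is a hypothetical placeholder for the trained generative model F (it always returns 1 + 0i), and the amplitude-only branch scales each spectrum value by |M| while keeping the phase of Y so that the inverse transform yields a usable signal.

```python
import cmath


def dft(frame):
    """Naive O(n^2) discrete Fourier transform (stand-in for the FFT of step 102)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse transform returning real samples (stand-in for the IFFT of step 109)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def all_pass_model(spectra):
    """Hypothetical placeholder for the trained model F: every filter is 1 + 0i."""
    return [[complex(1.0, 0.0) for _ in spectrum] for spectrum in spectra]


def process_block(frames, model=all_pass_model, threshold=0.30):
    spectra = [dft(frame) for frame in frames]            # steps 101 to 103
    filters = model(spectra)                              # step 104
    reals = [m.real for row in filters for m in row]
    noise_value = 1.0 - sum(reals) / len(reals)           # step 105
    use_phase = noise_value > threshold                   # step 106 or step 107
    output = []
    for spectrum, row in zip(spectra, filters):           # step 108
        if use_phase:
            filtered = [m * y for m, y in zip(row, spectrum)]
        else:
            # Assumption: scale by |M| and keep the phase of Y.
            filtered = [abs(m) * y for m, y in zip(row, spectrum)]
        output.append(idft(filtered))                     # step 109
    return use_phase, output                              # step 110 would play the output


frames = [[0.0, 1.0, 0.0, -1.0], [0.5, 0.5, -0.5, -0.5]]
use_phase, restored = process_block(frames)
```

With the all-pass placeholder the noise judgment value is 0.0, so the amplitude-only branch runs and the restored frames match the input up to rounding error.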

Next, the threshold used in step 105 to determine whether the audio signal contains a large noise component is described. FIG. 5 illustrates the signal-to-noise ratio (hereinafter, SNR) of an audio signal, the average of the real parts of the complex filters generated from that audio signal, and 1.0 minus that average, that is, the noise judgment value. The value 1.0 is the real part of a complex filter generated from an audio signal that contains no noise component. A larger SNR means a smaller noise component.

FIG. 5 also illustrates the signal-to-distortion ratio (hereinafter, SDR) obtained when the amplitude component of the complex filter is applied to the frequency spectrum corresponding to the audio signal, and the SDR obtained when the amplitude component and the phase component are applied. The SDR is the logarithmic ratio of the "signal component" to the "noise component plus the speech distortion component caused by restoration"; a larger value means that the noise component has been removed more appropriately.

A typical speech recognition engine can achieve adequate recognition accuracy when the SDR of the audio signal is greater than 15.0 [dB]. Therefore, only the amplitude component is applied to the frequency spectrum when the resulting SDR is greater than, for example, 20.0 [dB]. In FIG. 5, when the SDR obtained by applying the amplitude component to the frequency spectrum is 22.5 [dB], that is, greater than 20.0 [dB], the corresponding noise judgment value is 0.30, so 0.30 can be used, for example, as the threshold, that is, the first predetermined value.

On the other hand, when the noise judgment value exceeds the threshold of 0.30, for example when it is 0.46, applying only the amplitude component to the frequency spectrum yields an SDR of 18.6 [dB], which is below 20.0 [dB]. Because the speech recognition engine might then fail to achieve adequate recognition accuracy, the amplitude component and the phase component of the complex filter are applied to the frequency spectrum. This raises the SDR to 22.8 [dB], which exceeds 20.0 [dB]. Even when the SNR is -5 [dB], for example, applying the amplitude component and the phase component of the complex filter to the frequency spectrum yields an SDR of 15.4 [dB], which exceeds 15.0 [dB], so the speech recognition engine can achieve adequate recognition accuracy.
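The way the threshold follows from the measurements can be sketched as follows, using only the two amplitude-only rows quoted above (0.30 with 22.5 dB and 0.46 with 18.6 dB). The helper name `pick_threshold` and the selection rule itself are assumptions made for illustration; the description only states that 0.30 can be used.

```python
# (noise judgment value, SDR in dB with amplitude-only filtering), as quoted in the text.
AMPLITUDE_ONLY_SDR = [(0.30, 22.5), (0.46, 18.6)]


def pick_threshold(rows, target_sdr_db=20.0):
    """Return the largest noise judgment value whose amplitude-only SDR exceeds the target."""
    acceptable = [value for value, sdr in rows if sdr > target_sdr_db]
    if not acceptable:
        raise ValueError("no row meets the target SDR")
    return max(acceptable)


print(pick_threshold(AMPLITUDE_ONLY_SDR))  # 0.3
```

Lowering `target_sdr_db` toward the 15.0 dB floor of the recognition engine would admit higher thresholds and therefore less phase processing.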

However, the present embodiment is not limited to the example in which the threshold for the noise judgment value is 0.30. An appropriate threshold can be set according to the performance of the speech recognition engine in use or the desired processing load. Instead of comparing the noise judgment value with the threshold, the average of the real parts of the complex filters may be compared with a threshold. In that case the threshold may be, for example, 1.0 minus the first predetermined value. For example, the noise component of the audio signal may be judged to be small when the average of the real parts of the complex filters exceeds 0.7.

FIG. 6 illustrates the processing time required for the audio signal processing. The audio signal processing program used here is written in Python 3. For each of processing patterns 1 to 5, the processing was run 1000 times with 256 frames of the audio signal and 256 frequency samples.

Processing patterns 1 and 2 are cases in which the real part of the complex filter is not evaluated. In processing pattern 1, the amplitude component of the complex filter is applied to all frequency spectra; in processing pattern 2, the amplitude component and the phase component are applied to all frequency spectra. The processing time is 1.95 [seconds] for processing pattern 1 and 3.73 [seconds] for processing pattern 2.

Processing patterns 3 to 5 are cases in which the real part of the complex filter is evaluated. In processing pattern 3, the amplitude component of the complex filter is applied to 50% of the frequency spectra, and the amplitude component and the phase component are applied to the other 50%. In processing pattern 4, the amplitude component is applied to 20% of the frequency spectra, and the amplitude component and the phase component are applied to the other 80%. In processing pattern 5, the amplitude component and the phase component are applied to 100% of the frequency spectra. The processing time is 2.43 [seconds] for processing pattern 3, 3.03 [seconds] for processing pattern 4, and 3.95 [seconds] for processing pattern 5.

The processing time is 3.73 [seconds] for processing pattern 2 and 3.95 [seconds] for processing pattern 5. That is, processing pattern 5 takes 0.22 [seconds] longer, which is the time needed to evaluate the real part of the complex filter. In that evaluation, computing the average of the real parts requires, for each one-frame audio signal, as many additions as the number of frequency samples and one division.

However, in a situation where multiple speakers are talking, the proportion of speech in which the utterances of multiple speakers overlap, that is, speech containing a large amount of noise components in the form of utterances by speakers other than the target speaker, is not large, as illustrated in FIG. 7. FIG. 7 illustrates the utterance time and proportion of single utterances and overlapping utterances in a spoken dialogue database (RWCP-SP96).

In the spoken dialogue database, of the utterance sections of 48 dialogues in which the two speakers are a customer and a clerk, 22.2% are single utterances by the customer, 61.4% are single utterances by the clerk, and 16.3% are overlapping utterances by the customer and the clerk. That is, speech that is likely to be determined to be noisy enough to warrant applying the amplitude and phase components of the complex filter to the frequency spectrum accounts for 16.3% of all utterance sections, which is less than 20%.

On the other hand, in FIG. 6, the processing time of processing pattern 3, which applies the amplitude and phase components of the complex filter to 50% of the frequency spectra, is 2.43 [seconds]. The processing time of processing pattern 4, which applies the amplitude and phase components to 80% of the frequency spectra, is 3.03 [seconds]. Therefore, even when the amplitude and phase components are applied to 80% of the frequency spectra, the processing time is 0.70 [seconds] shorter than in processing pattern 2, which applies the amplitude and phase components to all frequency spectra without determining the real part of the complex filter. That is, the processing load is reduced.
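The comparison above is simple arithmetic on the processing times reported for FIG. 6, and can be reproduced with a short sketch. The dictionary and function names are assumptions introduced for illustration; only the five timing values come from the text.

```python
# Processing times in seconds, as reported in the text for FIG. 6.
PROCESSING_TIME_S = {1: 1.95, 2: 3.73, 3: 2.43, 4: 3.03, 5: 3.95}


def saving_vs_pattern2(pattern: int) -> float:
    """Seconds saved relative to pattern 2 (a negative value means slower)."""
    return round(PROCESSING_TIME_S[2] - PROCESSING_TIME_S[pattern], 2)


# Extra time spent on the real-part determination (pattern 5 versus pattern 2).
determination_overhead_s = round(PROCESSING_TIME_S[5] - PROCESSING_TIME_S[2], 2)

for pattern in (3, 4, 5):
    print(pattern, saving_vs_pattern2(pattern))
print("overhead", determination_overhead_s)
```

Pattern 5 comes out slower than pattern 2 by exactly the determination overhead, while patterns 3 and 4 still come out ahead.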

An example has been described in which, in step 105, the noise component contained in the audio signal is determined based on the real part of the generated complex filter, but the present embodiment is not limited to this. For example, the noise component may be determined to be small when the average value of the imaginary parts of the complex filters, calculated by summing the imaginary parts b of the generated complex filters and dividing the sum by the number of complex filters, is close to 0.0. Alternatively, the noise component may be determined to be small when the average value of the real parts of the complex filters is close to 1.0 and the average value of the imaginary parts is close to 0.0. The threshold for determining whether the average value of the imaginary parts is close to 0.0 may be, for example, 0.30. This threshold is an example of the second predetermined value.
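The three determination variants (real part only, imaginary part only, or both) might be expressed as in the sketch below. The names `Mode` and `noise_is_small` are hypothetical. The default `imag_threshold` of 0.30 is the example value given above, whereas `real_threshold` has no default because this passage does not state the first predetermined value; the caller must supply it.

```python
from enum import Enum
from typing import Sequence

FIRST_VALUE = 1.0   # real part expected when no noise component is present
SECOND_VALUE = 0.0  # imaginary part expected when no noise component is present


class Mode(Enum):
    REAL = "real"
    IMAG = "imag"
    BOTH = "both"


def noise_is_small(
    frame_filter: Sequence[complex],
    mode: Mode,
    real_threshold: float,         # first predetermined value (assumed, caller-supplied)
    imag_threshold: float = 0.30,  # second predetermined value (example from the text)
) -> bool:
    """Return True when the averaged filter is close to the noise-free values."""
    count = len(frame_filter)
    if count == 0:
        raise ValueError("frame_filter must contain at least one coefficient")
    mean_real = sum(h.real for h in frame_filter) / count
    mean_imag = sum(h.imag for h in frame_filter) / count
    real_ok = abs(mean_real - FIRST_VALUE) <= real_threshold
    imag_ok = abs(mean_imag - SECOND_VALUE) <= imag_threshold
    if mode is Mode.REAL:
        return real_ok
    if mode is Mode.IMAG:
        return imag_ok
    return real_ok and imag_ok
```

When the function returns True, only the amplitude component would be applied; otherwise both the amplitude and phase components would be applied.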

As illustrated in FIG. 8, the audio signal processing device 10 may include a client 81A and a server 82A connected via a wired or wireless network. In this case, the client 81A includes, for example, the audio input unit 12 and the audio output unit 26 of FIG. 1. The server 82A includes the time-frequency transform unit 14, the complex filter generation unit 16, the filter determination unit 18, the filter application component determination unit 20, the filter application unit 22, and the time-frequency inverse transform unit 24.

The hardware configuration of the client 81A may be the same as that of the audio signal processing device 10 in FIG. 2. As illustrated in FIG. 9, the hardware configuration of the server 82A differs from that of the audio signal processing device 10 in FIG. 2 in that it does not include the microphone 31A and the speaker 31B. However, the CPU 51D, primary storage unit 52D, secondary storage unit 53D, and external interface 54D in FIG. 9 may be the same as the CPU 51, primary storage unit 52, secondary storage unit 53, and external interface 54 in FIG. 2, so a detailed description is omitted. Like the secondary storage unit 53, the secondary storage unit 53D includes a program storage area 53AD and a data storage area 53BD.

Separating the functions of the audio signal processing device 10 between the client 81A and the server 82A further reduces the processing load on the client 81A and allows the client 81A to be made smaller and lighter, which improves its portability.

As illustrated in FIG. 10, the audio signal processing device 10 may include a client 81B, a first server 82B, and a second server 82C connected via a wired or wireless network. As illustrated in FIG. 11, the client 81B includes the audio input unit 12 and the text output unit 27. The first server 82B includes the time-frequency transform unit 14, the complex filter generation unit 16, the filter determination unit 18, the filter application component determination unit 20, the filter application unit 22, and the time-frequency inverse transform unit 24. The second server 82C includes the speech recognition unit 25.

The hardware configurations of the first server 82B and the second server 82C may be the same as that of the server 82A. As illustrated in FIG. 12, the hardware configuration of the client 81B differs from that of the audio signal processing device 10 in FIG. 2 in that it has a display 31C, which is an example of the text output unit, in place of the speaker 31B.

The second server 82C receives from the first server 82B the audio signal from which the noise components have been removed, converts the audio signal into text by performing speech recognition, and transmits the text to the client 81B. The client 81B receives the text and displays it on the display 31C. Existing technology may be used for the speech recognition. Converting the denoised audio signal into text by speech recognition makes the information contained in the audio signal searchable as text, which increases the usefulness of the information. Separating the functions of the audio signal processing device 10 among the client 81B, the first server 82B, and the second server 82C further reduces the processing load on the client 81B. This allows the client 81B to be made smaller and lighter, which improves its portability.
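The message flow among the client 81B, the first server 82B, and the second server 82C can be sketched with in-process calls standing in for network messages. Everything below is hypothetical scaffolding: the class and method names are introduced here, and the `denoise` and `recognize` callables are placeholders for the noise removal units and for whatever existing speech recognizer is used.

```python
from typing import Callable, List

Signal = List[float]


class Client:
    """Stands in for client 81B: captures audio and displays returned text."""

    def __init__(self) -> None:
        self.displayed: List[str] = []

    def send(self, captured: Signal, first_server: "FirstServer") -> None:
        first_server.receive(captured, reply_to=self)

    def show_text(self, text: str) -> None:
        self.displayed.append(text)  # stands in for the display 31C


class SecondServer:
    """Stands in for second server 82C (speech recognition unit 25)."""

    def __init__(self, recognize: Callable[[Signal], str]) -> None:
        self.recognize = recognize

    def receive(self, cleaned: Signal, reply_to: Client) -> None:
        reply_to.show_text(self.recognize(cleaned))


class FirstServer:
    """Stands in for first server 82B (noise removal, units 14 to 24)."""

    def __init__(self, denoise: Callable[[Signal], Signal],
                 second_server: SecondServer) -> None:
        self.denoise = denoise
        self.second_server = second_server

    def receive(self, captured: Signal, reply_to: Client) -> None:
        self.second_server.receive(self.denoise(captured), reply_to)
```

Keeping the client limited to capture and display is what lets the heavier processing stay on the servers.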

The audio signal processing devices illustrated in FIGS. 8 and 10 are examples, and the present embodiment is not limited to them. For example, instead of the second server 82C of FIG. 10 including the speech recognition unit 25, the first server 82B may include the speech recognition unit 25 and the second server 82C may be omitted. In addition, the audio signal processing device 10 of FIG. 1 may include the speech recognition unit 25 and the text output unit 27 in place of, or in addition to, the audio output unit 26.

An example has been described in which speech is input from the audio input unit 12 and either speech is output from the audio output unit 26 or text corresponding to the speech is output from the text output unit 27, but the present embodiment is not limited to this. For example, audio signal data stored in advance in a file may be read, and the audio signal data from which the noise components have been removed may be saved to a file. The file may be stored, for example, in the data storage area 53B of the secondary storage unit 53 or the data storage area 53BD of the secondary storage unit 53D.

The present embodiment can be applied to, for example, creating subtitles or meeting minutes from speech picked up in a noisy environment. The noise may be utterances by speakers other than the target speaker, or environmental noise such as the operating sound of an air conditioner.

The complex filter may be generated using the frequency spectrum of a predetermined frequency band rather than the frequency spectra of all frequency samples. Likewise, the filter determination may use the complex filters corresponding to a predetermined frequency band rather than those corresponding to all frequency samples. The flowchart illustrated in FIG. 4 is an example, and the order of the steps may be changed.
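Restricting the filter determination to a predetermined frequency band could look like the following sketch. It assumes the filter coefficients are ordered by frequency bin, with bin k centred at k × sample rate / FFT size; that layout, the function name, and the parameter names are assumptions and are not specified in the text.

```python
from typing import Sequence


def mean_real_part_in_band(
    frame_filter: Sequence[complex],
    sample_rate_hz: float,
    fft_size: int,
    low_hz: float,
    high_hz: float,
) -> float:
    """Average the real parts of the coefficients whose bins fall in the band."""
    bin_width_hz = sample_rate_hz / fft_size
    selected = [
        h.real
        for k, h in enumerate(frame_filter)
        if low_hz <= k * bin_width_hz <= high_hz  # keep only in-band bins
    ]
    if not selected:
        raise ValueError("no frequency bin falls inside the requested band")
    return sum(selected) / len(selected)
```

Using fewer bins reduces the number of additions per frame, at the cost of ignoring noise that lies outside the chosen band.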

In the present embodiment, a time-frequency transform is performed on an audio signal to obtain a frequency spectrum corresponding to the audio signal, and a complex filter that removes noise components contained in the audio signal is generated based on the obtained frequency spectrum. At least one of two comparisons is performed: a comparison between the real part value of the complex filter and a first value, and a comparison between the imaginary part value of the complex filter and a second value. The first value is the real part value of the complex filter generated when no noise component is present, and the second value is the imaginary part value of the complex filter generated when no noise component is present. When the difference between the real part value and the first value is small, it is determined that the amplitude component of the complex filter is to be applied to the frequency spectrum; when that difference is not small, it is determined that the amplitude and phase components of the complex filter are to be applied to the frequency spectrum. Alternatively, when the difference between the imaginary part value and the second value is small, it is determined that the amplitude component is to be applied; when that difference is not small, it is determined that the amplitude and phase components are to be applied. Alternatively, when the difference between the real part value and the first value is small and the difference between the imaginary part value and the second value is small, it is determined that the amplitude component is to be applied. In this case, when the difference between the real part value and the first value is not small, or the difference between the imaginary part value and the second value is not small, it is determined that the amplitude and phase components are to be applied.

In the present embodiment, an audio signal is used to generate a complex filter that removes noise components from that audio signal. When it is determined, based on the generated complex filter, that the audio signal contains few noise components, it is determined that the amplitude component of the complex filter is to be applied to the audio signal; when it is determined that the audio signal contains many noise components, it is determined that the amplitude and phase components of the complex filter are to be applied. This makes it possible to appropriately remove noise from speech while limiting the processing load.
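One plausible reading of the two application modes is sketched below. It assumes that "applying the amplitude component" scales each spectral value by the filter magnitude |h|, and that "applying the amplitude and phase components" is a full complex multiplication by h. This passage does not give the formulas, so both interpretations and all function names are assumptions.

```python
from typing import List, Sequence


def apply_amplitude(spectrum: Sequence[complex],
                    frame_filter: Sequence[complex]) -> List[complex]:
    """Scale each bin by |h| only; the phase of the input is left unchanged."""
    return [x * abs(h) for x, h in zip(spectrum, frame_filter)]


def apply_amplitude_and_phase(spectrum: Sequence[complex],
                              frame_filter: Sequence[complex]) -> List[complex]:
    """Multiply each bin by h, which scales by |h| and rotates by arg(h)."""
    return [x * h for x, h in zip(spectrum, frame_filter)]


def denoise_frame(spectrum: Sequence[complex],
                  frame_filter: Sequence[complex],
                  noise_is_small: bool) -> List[complex]:
    """Pick the cheaper amplitude-only path when little noise was detected."""
    if len(spectrum) != len(frame_filter):
        raise ValueError("spectrum and frame_filter must have the same length")
    if noise_is_small:
        return apply_amplitude(spectrum, frame_filter)
    return apply_amplitude_and_phase(spectrum, frame_filter)
```

Under this reading the two paths give the same result whenever the filter coefficients are non-negative real numbers, which is consistent with skipping the phase component when the filter is close to the noise-free value.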

The following supplementary notes are further disclosed regarding each of the above embodiments.

(Appendix 1)
An audio signal processing program for causing a computer to execute noise removal processing comprising:
performing a time-frequency transform on an audio signal to obtain a frequency spectrum corresponding to the audio signal;
generating, based on the obtained frequency spectrum, a complex filter that removes noise components contained in the audio signal;
performing at least one of a comparison between a real part value of the complex filter and a first value, which is the real part value of the complex filter generated when the noise components are not present, and a comparison between an imaginary part value of the complex filter and a second value, which is the imaginary part value of the complex filter generated when the noise components are not present; and
determining to remove the noise components by: applying the amplitude component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is small, and applying the amplitude component and the phase component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is not small; applying the amplitude component of the complex filter to the frequency spectrum when the difference between the imaginary part value and the second value is small, and applying the amplitude component and the phase component of the complex filter to the frequency spectrum when the difference between the imaginary part value and the second value is not small; or applying the amplitude component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is small and the difference between the imaginary part value and the second value is small, and applying the amplitude component and the phase component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is not small or the difference between the imaginary part value and the second value is not small.
(Appendix 2)
Acquiring the audio signal converted from the audio by the audio input unit,
applying to the frequency spectrum the components of the complex filter determined to be applied to the frequency spectrum;
inverse time-frequency transforming the frequency spectrum to which the complex filter is applied to an audio signal;
outputting audio corresponding to the audio signal subjected to the inverse time-frequency transform from an audio output unit;
The audio signal processing program of appendix 1.
(Appendix 3)
Acquiring the audio signal converted from the audio by the audio input unit,
applying to the frequency spectrum the components of the complex filter determined to be applied to the frequency spectrum;
inverse time-frequency transforming the frequency spectrum to which the complex filter is applied to an audio signal;
converting the time-frequency inverse-transformed speech signal into text by speech recognition;
outputting the converted text from a text output unit;
The audio signal processing program of appendix 1.
(Appendix 4)
the first value is 1.0;
the second value is 0.0;
the difference between the real part value and the first value is small when the difference between the average value of the real parts of the complex filters and the first value is less than or equal to a first predetermined value, and the difference between the imaginary part value and the second value is small when the difference between the average value of the imaginary parts of the complex filters and the second value is less than or equal to a second predetermined value;
The audio signal processing program according to any one of Appendices 1 to 3.
(Appendix 5)
The complex filter is generated by a complex filter generation model trained using machine learning to output the complex filter when the frequency spectrum is input.
The audio signal processing program according to any one of Appendices 1 to 4.
(Appendix 6)
An audio signal processing method in which a computer performs:
performing a time-frequency transform on an audio signal to obtain a frequency spectrum corresponding to the audio signal;
generating, based on the obtained frequency spectrum, a complex filter that removes noise components contained in the audio signal;
performing at least one of a comparison between a real part value of the complex filter and a first value, which is the real part value of the complex filter generated when the noise components are not present, and a comparison between an imaginary part value of the complex filter and a second value, which is the imaginary part value of the complex filter generated when the noise components are not present; and
determining to remove the noise components by: applying the amplitude component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is small, and applying the amplitude component and the phase component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is not small; applying the amplitude component of the complex filter to the frequency spectrum when the difference between the imaginary part value and the second value is small, and applying the amplitude component and the phase component of the complex filter to the frequency spectrum when the difference between the imaginary part value and the second value is not small; or applying the amplitude component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is small and the difference between the imaginary part value and the second value is small, and applying the amplitude component and the phase component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is not small or the difference between the imaginary part value and the second value is not small.
(Appendix 7)
Acquiring the audio signal converted from the audio by the audio input unit,
applying to the frequency spectrum the components of the complex filter determined to be applied to the frequency spectrum;
inverse time-frequency transforming the frequency spectrum to which the complex filter is applied to an audio signal;
outputting audio corresponding to the audio signal subjected to the inverse time-frequency transform from an audio output unit;
The audio signal processing method of appendix 6.
(Appendix 8)
Acquiring the audio signal converted from the audio by the audio input unit,
applying to the frequency spectrum the components of the complex filter determined to be applied to the frequency spectrum;
inverse time-frequency transforming the frequency spectrum to which the complex filter is applied to an audio signal;
converting the time-frequency inverse-transformed speech signal into text by speech recognition;
outputting the converted text from a text output unit;
The audio signal processing method of appendix 6.
(Appendix 9)
the first value is 1.0;
the second value is 0.0;
the difference between the real part value and the first value is small when the difference between the average value of the real parts of the complex filters and the first value is less than or equal to a first predetermined value, and the difference between the imaginary part value and the second value is small when the difference between the average value of the imaginary parts of the complex filters and the second value is less than or equal to a second predetermined value;
The audio signal processing method according to any one of appendices 6 to 8.
(Appendix 10)
The complex filter is generated by a complex filter generation model trained using machine learning to output the complex filter when the frequency spectrum is input.
The audio signal processing method according to any one of appendices 6 to 9.
(Appendix 11)
An audio signal processing device comprising:
a time-frequency transform unit that performs a time-frequency transform on an audio signal to obtain a frequency spectrum corresponding to the audio signal;
a complex filter generation unit that generates, based on the obtained frequency spectrum, a complex filter that removes noise components contained in the audio signal;
a filter determination unit that performs at least one of a comparison between a real part value of the complex filter and a first value, which is the real part value of the complex filter generated when the noise components are not present, and a comparison between an imaginary part value of the complex filter and a second value, which is the imaginary part value of the complex filter generated when the noise components are not present; and
a filter application component determination unit that determines to remove the noise components by: applying the amplitude component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is small, and applying the amplitude component and the phase component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is not small; applying the amplitude component of the complex filter to the frequency spectrum when the difference between the imaginary part value and the second value is small, and applying the amplitude component and the phase component of the complex filter to the frequency spectrum when the difference between the imaginary part value and the second value is not small; or applying the amplitude component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is small and the difference between the imaginary part value and the second value is small, and applying the amplitude component and the phase component of the complex filter to the frequency spectrum when the difference between the real part value and the first value is not small or the difference between the imaginary part value and the second value is not small.
(Appendix 12)
The audio signal processing device of Appendix 11, further comprising:
an audio input unit that converts input speech into the audio signal and acquires the audio signal;
a filter application unit that applies to the frequency spectrum the component of the complex filter that was determined to be applied to the frequency spectrum;
a time-frequency inverse transform unit that performs an inverse time-frequency transform converting the frequency spectrum to which the complex filter has been applied into an audio signal; and
an audio output unit that outputs sound corresponding to the inverse-transformed audio signal.
(Appendix 13)
The audio signal processing device of Appendix 11, further comprising:
an audio input unit that converts input speech into the audio signal and acquires the audio signal;
a filter application unit that applies to the frequency spectrum the component of the complex filter that was determined to be applied to the frequency spectrum;
a time-frequency inverse transform unit that performs an inverse time-frequency transform converting the frequency spectrum to which the complex filter has been applied into an audio signal;
a speech recognition unit that converts the inverse-transformed audio signal into text by speech recognition; and
a text output unit that outputs the converted text.
(Appendix 14)
the first value is 1.0;
the second value is 0.0;
the filter determination unit determines that the difference between the real part value and the first value is small when the difference between the average value of the real parts of the complex filters and the first value is less than or equal to a first predetermined value, and determines that the difference between the imaginary part value and the second value is small when the difference between the average value of the imaginary parts of the complex filters and the second value is less than or equal to a second predetermined value;
The audio signal processing device according to any one of appendices 11 to 13.
(Appendix 15)
When the frequency spectrum is input, the complex filter generation unit generates the complex filter using a complex filter generation model trained using machine learning to output the complex filter.
The audio signal processing device according to any one of appendices 11 to 14.
(Appendix 16)
The audio signal processing device of Appendix 12, comprising:
a server including the time-frequency transform unit, the complex filter generation unit, the filter determination unit, the filter application component determination unit, the filter application unit, and the time-frequency inverse transform unit; and
a client including the audio input unit and the audio output unit.
(Appendix 17)
The audio signal processing device of Appendix 13, comprising:
a first server including the time-frequency transform unit, the complex filter generation unit, the filter determination unit, the filter application component determination unit, the filter application unit, and the time-frequency inverse transform unit;
a second server including the speech recognition unit; and
a client including the audio input unit and the text output unit.

10 audio signal processing device
16 complex filter generation unit
18 filter determination unit
20 filter application component determination unit
51 CPU
52 primary storage unit
53 secondary storage unit
31A microphone
31B speaker

Claims

An audio signal processing program that causes a computer to execute noise removal processing comprising:
performing a time-frequency transform on an audio signal to obtain a frequency spectrum corresponding to the audio signal;
generating, based on the obtained frequency spectrum, a complex filter that removes a noise component contained in the audio signal;
performing at least one of a comparison between the value of the real part of the complex filter and a first value, which is the value of the real part of a complex filter generated when the noise component is not present, and a comparison between the value of the imaginary part of the complex filter and a second value, which is the value of the imaginary part of a complex filter generated when the noise component is not present; and
determining to remove the noise component by one of the following:
when the difference between the value of the real part and the first value is small, applying the amplitude component of the complex filter to the frequency spectrum, and when that difference is not small, applying the amplitude component and the phase component of the complex filter to the frequency spectrum;
when the difference between the value of the imaginary part and the second value is small, applying the amplitude component of the complex filter to the frequency spectrum, and when that difference is not small, applying the amplitude component and the phase component of the complex filter to the frequency spectrum; or
when the difference between the value of the real part and the first value is small and the difference between the value of the imaginary part and the second value is small, applying the amplitude component of the complex filter to the frequency spectrum, and when either of those differences is not small, applying the amplitude component and the phase component of the complex filter to the frequency spectrum,
wherein the first value is 1.0,
the second value is 0.0,
the difference between the value of the real part and the first value is small when the difference between the average value of the real parts of the complex filters and the first value is equal to or less than a first predetermined value, and
the difference between the value of the imaginary part and the second value is small when the difference between the average value of the imaginary parts of the complex filters and the second value is equal to or less than a second predetermined value.
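
The determination recited in this claim can be illustrated with a minimal Python sketch of the variant in which both comparisons are performed. It is not the patented implementation. The function names, the use of NumPy, the absolute difference, and the tolerance of 0.05 (standing in for the first and second predetermined values) are assumptions made for illustration only. "Applying the amplitude component" is interpreted here as scaling each bin by the filter magnitude, and "applying the amplitude and phase components" as a full complex multiplication.

```python
import numpy as np

FIRST_VALUE = 1.0   # real part expected when no noise is present (from the claim)
SECOND_VALUE = 0.0  # imaginary part expected when no noise is present (from the claim)


def differences_are_small(filt, real_tol=0.05, imag_tol=0.05):
    """Compare the mean real and imaginary parts of the filter with 1.0 and 0.0.

    real_tol and imag_tol are hypothetical stand-ins for the first and second
    predetermined values; 0.05 is an arbitrary illustrative choice.
    """
    real_small = abs(np.mean(filt.real) - FIRST_VALUE) <= real_tol
    imag_small = abs(np.mean(filt.imag) - SECOND_VALUE) <= imag_tol
    return bool(real_small), bool(imag_small)


def apply_complex_filter(spectrum, filt, real_tol=0.05, imag_tol=0.05):
    """Apply amplitude only, or amplitude and phase, depending on the comparison."""
    real_small, imag_small = differences_are_small(filt, real_tol, imag_tol)
    if real_small and imag_small:
        # Amplitude only: scale each bin and leave the phase of the spectrum untouched.
        return spectrum * np.abs(filt)
    # Amplitude and phase: full complex multiplication.
    return spectrum * filt
```

Under these assumptions, a filter whose mean is close to 1.0 + 0.0j leaves the phase of the spectrum unchanged, while any other filter also rotates the phase of each bin.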

The audio signal processing program according to claim 1, further causing the computer to execute:
acquiring the audio signal converted from sound by an audio input unit;
applying to the frequency spectrum the components of the complex filter that were determined to be applied to the frequency spectrum;
performing an inverse time-frequency transform that converts the frequency spectrum to which the complex filter has been applied into an audio signal; and
outputting, from an audio output unit, sound corresponding to the audio signal obtained by the inverse time-frequency transform.
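
The transform, filter application, and inverse transform steps of this claim can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation described in the patent: the short-time Fourier transform with a Hann window, the frame length and hop size, the function names, and the `make_filter` callback (a hypothetical stand-in for the complex filter generation) are all assumptions. Audio input and output are omitted.

```python
import numpy as np


def stft(x, frame_len=256, hop=128):
    """Time-frequency transform: windowed frames followed by a real FFT."""
    window = np.hanning(frame_len)
    padded = np.pad(x, frame_len)  # zero-pad both ends so every sample is fully covered
    n_frames = 1 + (len(padded) - frame_len) // hop
    frames = np.stack([padded[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, length, frame_len=256, hop=128):
    """Inverse transform: weighted overlap-add normalised by the summed squared window."""
    window = np.hanning(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros((spec.shape[0] - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame * window
        norm[i * hop:i * hop + frame_len] += window ** 2
    out /= np.maximum(norm, 1e-8)
    return out[frame_len:frame_len + length]  # drop the padding added in stft


def denoise(x, make_filter, amplitude_only=False, frame_len=256, hop=128):
    """Transform, apply the chosen filter components, and transform back."""
    spec = stft(x, frame_len, hop)
    filt = make_filter(spec)  # hypothetical: returns one complex gain per bin
    filtered = spec * (np.abs(filt) if amplitude_only else filt)
    return istft(filtered, len(x), frame_len, hop)
```

With a filter of all ones the sketch returns the input signal up to floating-point error, which is a convenient sanity check before substituting a real filter generator.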

The audio signal processing program according to claim 1, further causing the computer to execute:
acquiring the audio signal converted from sound by an audio input unit;
applying to the frequency spectrum the components of the complex filter that were determined to be applied to the frequency spectrum;
performing an inverse time-frequency transform that converts the frequency spectrum to which the complex filter has been applied into an audio signal;
converting the audio signal obtained by the inverse time-frequency transform into text by speech recognition; and
outputting the converted text from a text output unit.

The audio signal processing program according to any one of claims 1 to 3, wherein the complex filter is generated by a complex filter generation model that has been trained by machine learning to output the complex filter when a frequency spectrum is input.

An audio signal processing method in which a computer executes:
performing a time-frequency transform on an audio signal to obtain a frequency spectrum corresponding to the audio signal;
generating, based on the obtained frequency spectrum, a complex filter that removes a noise component contained in the audio signal;
performing at least one of a comparison between the value of the real part of the complex filter and a first value, which is the value of the real part of a complex filter generated when the noise component is not present, and a comparison between the value of the imaginary part of the complex filter and a second value, which is the value of the imaginary part of a complex filter generated when the noise component is not present; and
determining to remove the noise component by one of the following:
when the difference between the value of the real part and the first value is small, applying the amplitude component of the complex filter to the frequency spectrum, and when that difference is not small, applying the amplitude component and the phase component of the complex filter to the frequency spectrum;
when the difference between the value of the imaginary part and the second value is small, applying the amplitude component of the complex filter to the frequency spectrum, and when that difference is not small, applying the amplitude component and the phase component of the complex filter to the frequency spectrum; or
when the difference between the value of the real part and the first value is small and the difference between the value of the imaginary part and the second value is small, applying the amplitude component of the complex filter to the frequency spectrum, and when either of those differences is not small, applying the amplitude component and the phase component of the complex filter to the frequency spectrum,
wherein the first value is 1.0,
the second value is 0.0,
the difference between the value of the real part and the first value is small when the difference between the average value of the real parts of the complex filters and the first value is equal to or less than a first predetermined value, and
the difference between the value of the imaginary part and the second value is small when the difference between the average value of the imaginary parts of the complex filters and the second value is equal to or less than a second predetermined value.

An audio signal processing device comprising:
a time-frequency transform unit that performs a time-frequency transform on an audio signal to obtain a frequency spectrum corresponding to the audio signal;
a complex filter generation unit that generates, based on the obtained frequency spectrum, a complex filter that removes a noise component contained in the audio signal;
a filter determination unit that performs at least one of a comparison between the value of the real part of the complex filter and a first value, which is the value of the real part of a complex filter generated when the noise component is not present, and a comparison between the value of the imaginary part of the complex filter and a second value, which is the value of the imaginary part of a complex filter generated when the noise component is not present; and
a filter application component determination unit that determines to remove the noise component by one of the following:
when the difference between the value of the real part and the first value is small, applying the amplitude component of the complex filter to the frequency spectrum, and when that difference is not small, applying the amplitude component and the phase component of the complex filter to the frequency spectrum;
when the difference between the value of the imaginary part and the second value is small, applying the amplitude component of the complex filter to the frequency spectrum, and when that difference is not small, applying the amplitude component and the phase component of the complex filter to the frequency spectrum; or
when the difference between the value of the real part and the first value is small and the difference between the value of the imaginary part and the second value is small, applying the amplitude component of the complex filter to the frequency spectrum, and when either of those differences is not small, applying the amplitude component and the phase component of the complex filter to the frequency spectrum,
wherein the first value is 1.0,
the second value is 0.0,
the difference between the value of the real part and the first value is small when the difference between the average value of the real parts of the complex filters and the first value is equal to or less than a first predetermined value, and
the difference between the value of the imaginary part and the second value is small when the difference between the average value of the imaginary parts of the complex filters and the second value is equal to or less than a second predetermined value.