WO2017188141A1

WO2017188141A1 - Audio signal processing device, audio signal processing method, and audio signal processing program

Info

Publication number: WO2017188141A1
Application number: PCT/JP2017/016019
Authority: WO
Inventors: 安藤　彰男
Original assignee: 国立大学法人富山大学
Priority date: 2016-04-27
Filing date: 2017-04-21
Publication date: 2017-11-02
Also published as: JPWO2017188141A1; JP6846822B2

Abstract

An audio signal processing device according to one embodiment is provided with an acceptance unit for accepting audio signals of a plurality of channels, a division unit for dividing the audio signal of each channel into a coherent component and a field component, and an output unit for outputting the coherent component and field component of each channel. In the dividing process, an estimated signal having highest correlation with the audio signal of a channel to be processed among estimated signals calculated using at least the audio signal of other than the channel to be processed, is extracted as the coherent component of the channel to be processed. Then, a difference of the audio signal of the channel to be processed and the coherent component is extracted as a field component.

Description

オーディオ信号処理装置、オーディオ信号処理方法、およびオーディオ信号処理プログラムAudio signal processing apparatus, audio signal processing method, and audio signal processing program

　本発明の一側面は、オーディオ信号処理装置、オーディオ信号処理方法、およびオーディオ信号処理プログラムに関する。 One aspect of the present invention relates to an audio signal processing device, an audio signal processing method, and an audio signal processing program.

　オーディオ信号のチャネル数を変更する手法が従来から知られている。具体的には、Ｍチャネルのオーディオ信号をＮチャネル（ただし、Ｎ＞Ｍ）のオーディオ信号に変換するアップミックスという手法と、Ｎチャネルのオーディオ信号をＭチャネルのオーディオ信号に変換するダウンミックスという手法が存在する。例えば、２チャネル（左チャネルおよび右チャネル）のオーディオ信号から５．１チャネルのオーディオ信号への変換はアップミックスの一例である。また、５．１チャネルのオーディオ信号から２チャネルのオーディオ信号への変換はダウンミックスの一例である。 A method for changing the number of audio signal channels has been known. Specifically, a method called upmix that converts an M-channel audio signal into an N-channel (where N> M) audio signal, and a method called downmix that converts an N-channel audio signal into an M-channel audio signal Exists. For example, conversion from a 2-channel (left channel and right-channel) audio signal to a 5.1-channel audio signal is an example of upmixing. The conversion from a 5.1 channel audio signal to a 2 channel audio signal is an example of downmixing.

　例えば下記特許文献１には、テレビ・ラジオのスポーツ実況番組のステレオ放送を、迫力ある臨場感と聴き取りやすいアナウンスとするサラウンド再生装置が記載されている。この装置はフロント左／右チャンネル信号創成手段、フロントセンタチャンネル信号創成手段、およびリア左／右サラウンドチャンネル信号創成手段を有する。フロント左／右チャンネル信号創成手段は、２チャンネル音声信号入力に対して、マトリックス処理を行って得たフロント左／右チャンネル用各音声信号に、残響音を選択的に付加すると共にフロント用音量調整を行い、フロント左／右チャンネル用各音声信号として出力する。フロントセンタチャンネル信号創成手段は、２チャンネル音声信号入力から、同相成分を抽出して得た音声信号に、残響音を付加せずにフロントセンタチャンネル用音声信号としてセンタ用音量調整を行って出力する。リア左／右サラウンドチャンネル信号創成手段は、マトリックス処理を行って得たフロント左／右チャンネル用各音声信号に、残響音を付加すると共にリア用音量調整を行い、リア左／右チャンネル用各音声信号として出力する。 For example, the following Patent Document 1 describes a surround playback device that makes a stereo broadcast of a live sports television / radio program a powerful presence and an easy-to-listen announcement. The apparatus has front left / right channel signal creation means, front center channel signal creation means, and rear left / right surround channel signal creation means. Front left / right channel signal creation means selectively adds reverberant sound to front left / right channel audio signals obtained by performing matrix processing on 2-channel audio signal input, and adjusts front volume And output as audio signals for the front left / right channels. The front center channel signal generation means adjusts and outputs the center volume as a front center channel audio signal without adding a reverberation sound to the audio signal obtained by extracting the in-phase component from the 2-channel audio signal input. . The rear left / right surround channel signal creation means adds reverberant sound to the front left / right channel audio signals obtained by performing matrix processing, adjusts the rear volume, and outputs the rear left / right channel audio signals. Output as a signal.

　下記非特許文献１，２はいずれも、アップミックスの手法を記載する文献である。非特許文献１には、ステレオ信号を帯域分割し、帯域ごとにステレオ信号を主信号とアンビエンス信号とに分割し、アンビエンス信号を５．１チャネルの後方チャネルから再生する手法が記載されている。非特許文献２には、ステレオ信号を帯域分割した後に、そのステレオ信号を直接音成分と残響音成分とに分割し、残響音成分を側方から再生する方法が記載されている。 The following

non-patent documents

1 and 2 are documents that describe the upmix technique. Non-Patent Document 1 describes a method of dividing a stereo signal into bands, dividing the stereo signal into a main signal and an ambience signal for each band, and reproducing the ambience signal from the rear channel of 5.1 channels. Non-Patent Document 2 describes a method of dividing a stereo signal into bands and then dividing the stereo signal into a direct sound component and a reverberation sound component and reproducing the reverberation sound component from the side.

　下記非特許文献３，４はいずれも、多チャネルのオーディオ信号を２チャネルのオーディオ信号のペアに分割することで、３チャネル以上のオーディオ信号を生成する手法を開示する。 The following Non-Patent Documents 3 and 4 each disclose a method of generating an audio signal of three or more channels by dividing a multi-channel audio signal into a pair of two-channel audio signals.

特開２００７－２８０６５号公報JP 2007-28065 A

　特許文献１に記載のサラウンド再生装置は原音に残響音を付加するため、再生音の雰囲気（例えば音色）が原音から変わったり損なわれたりしてしまう。これに対して非特許文献１，２に記載の手法は残響音を付加するものではないが、原理上、２チャネルのオーディオ信号（すなわち、ステレオ信号）にしか適用できない。 Since the surround playback apparatus described in Patent Document 1 adds a reverberation sound to the original sound, the atmosphere (for example, timbre) of the playback sound is changed or damaged from the original sound. On the other hand, the methods described in

Non-Patent Documents

1 and 2 do not add reverberation sound, but can be applied only to two-channel audio signals (that is, stereo signals) in principle.

　非特許文献３，４に記載の手法では、２チャネルのオーディオ信号の間で相関が高い成分をコヒーレント成分として抽出するので、二つのスピーカの中間付近に位置する音の情報を取得することになる。したがって、３チャネル以上のオーディオ・システムでは、任意の二つのスピーカの中間付近の音の情報だけしかコヒーレント成分として抽出することができず、全スピーカで囲まれた領域の中央部分に位置する音の情報を抽出することができない。 In the methods described in Non-Patent Documents 3 and 4, since a component having a high correlation between two-channel audio signals is extracted as a coherent component, information on a sound located near the middle of the two speakers is acquired. . Therefore, in an audio system with three or more channels, only the sound information near the middle of any two speakers can be extracted as a coherent component, and the sound located in the central part of the area surrounded by all the speakers can be extracted. Information cannot be extracted.

　そこで、原音のチャネル数にかかわらず、オーディオ信号のチャネル数を変更する際に原音の雰囲気を可能な限り維持する手法が望まれている。 Therefore, there is a demand for a technique for maintaining the atmosphere of the original sound as much as possible when changing the number of channels of the audio signal regardless of the number of channels of the original sound.

　本発明の一側面に係るオーディオ信号処理装置は、複数のチャネルのオーディオ信号を受け付ける受付部と、オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割部であって、分割処理が、分割処理の対象となる一つのチャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルのオーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルのオーディオ信号との相関が最も高い推定信号を該対象チャネルのコヒーレント成分として抽出するステップと、対象チャネルのオーディオ信号と該対象チャネルのコヒーレント成分との差分を該対象チャネルのフィールド成分として抽出するステップとを含む、該分割部と、分割部により抽出された各チャネルのコヒーレント成分およびフィールド成分を出力する出力部とを備える。 An audio signal processing device according to an aspect of the present invention is a receiving unit that receives audio signals of a plurality of channels, and a dividing unit that executes a dividing process for dividing an audio signal into a coherent component and a field component for each channel. In the case where the division processing is performed using one channel that is the target of the division processing as the target channel, the estimation signal calculated by using at least the audio signal of the channel other than the target channel and the audio signal of the target channel Extracting an estimated signal having the highest correlation as a coherent component of the target channel; and extracting a difference between an audio signal of the target channel and a coherent component of the target channel as a field component of the target channel. Divider and each channel extracted by the divider And an output unit for outputting the coherent component and field component.

　本発明の一側面に係るオーディオ信号処理方法は、オーディオ信号処理装置が、複数のチャネルのオーディオ信号を受け付ける受付ステップと、オーディオ信号処理装置が、オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割ステップであって、分割処理が、分割処理の対象となる一つのチャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルのオーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルのオーディオ信号との相関が最も高い推定信号を該対象チャネルのコヒーレント成分として抽出するステップと、対象チャネルのオーディオ信号と該対象チャネルのコヒーレント成分との差分を該対象チャネルのフィールド成分として抽出するステップとを含む、該分割ステップと、オーディオ信号処理装置が、分割ステップにおいて抽出された各チャネルのコヒーレント成分およびフィールド成分を出力する出力ステップとを含む。 An audio signal processing method according to an aspect of the present invention includes an accepting step in which an audio signal processing device receives audio signals of a plurality of channels, and a division in which the audio signal processing device divides the audio signal into a coherent component and a field component. A division step for performing processing for each channel, and the division processing is calculated using at least an audio signal of a channel other than the target channel when one channel to be divided is set as a target channel. Extracting an estimated signal having the highest correlation with the audio signal of the target channel among the estimated signals as a coherent component of the target channel; and calculating a difference between the audio signal of the target channel and the coherent component of the target channel. Extract as field component of And a step comprises the said dividing step, the audio signal processing device, and an output step of outputting coherent component and field component of each channel extracted in dividing step.

　本発明の一側面に係るオーディオ信号処理プログラムは、複数のチャネルのオーディオ信号を受け付ける受付ステップと、オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割ステップであって、分割処理が、分割処理の対象となる一つのチャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルのオーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルのオーディオ信号との相関が最も高い推定信号を該対象チャネルのコヒーレント成分として抽出するステップと、対象チャネルのオーディオ信号と該対象チャネルのコヒーレント成分との差分を該対象チャネルのフィールド成分として抽出するステップとを含む、該分割ステップと、分割ステップにおいて抽出された各チャネルのコヒーレント成分およびフィールド成分を出力する出力ステップとをコンピュータに実行させる。 An audio signal processing program according to one aspect of the present invention is a reception step for receiving audio signals of a plurality of channels, and a division step for executing division processing for dividing the audio signal into coherent components and field components for each channel. In the case where the division processing is performed using one channel that is the target of the division processing as the target channel, the estimation signal calculated by using at least the audio signal of the channel other than the target channel and the audio signal of the target channel Extracting an estimated signal having the highest correlation as a coherent component of the target channel; and extracting a difference between an audio signal of the target channel and a coherent component of the target channel as a field component of the target channel. Split steps and split steps To execute an output step of outputting coherent component and field component of each channel extracted in up to the computer.

　このような側面においては、対象チャネル以外のチャネルのオーディオ信号を用いて推定され、且つ該対象チャネルの実際のオーディオ信号ごとの相関が最も高い信号が該対象チャネルのコヒーレント成分として抽出される。また、対象チャネルの実際のオーディオ信号とそのコヒーレント成分との差分が該対象チャネルのフィールド成分として抽出される。このコヒーレント成分およびフィールド成分は各チャネルについて得られる。このように、音を追加することなく元のオーディオ信号のみを用いて各チャネルのコヒーレント成分およびフィールド成分を求めることで、原音の雰囲気を可能な限り維持することができる。加えて、コヒーレント成分およびフィールド成分は元のチャネル数の分だけ求めることができるので、この手法は原音のチャネル数にかかわらず適用できる。 In such an aspect, a signal that is estimated using an audio signal of a channel other than the target channel and has the highest correlation for each actual audio signal of the target channel is extracted as a coherent component of the target channel. Further, the difference between the actual audio signal of the target channel and its coherent component is extracted as the field component of the target channel. This coherent component and field component are obtained for each channel. In this way, by obtaining the coherent component and field component of each channel using only the original audio signal without adding sound, the atmosphere of the original sound can be maintained as much as possible. In addition, since the coherent component and the field component can be obtained by the number of original channels, this method can be applied regardless of the number of channels of the original sound.

　本発明の一側面によれば、原音のチャネル数にかかわらず、オーディオ信号のチャネル数を変更する際に原音の雰囲気を可能な限り維持することができる。 According to one aspect of the present invention, the atmosphere of the original sound can be maintained as much as possible when the number of channels of the audio signal is changed regardless of the number of channels of the original sound.

実施形態に係るオーディオ信号処理の例を示す図である。It is a figure which shows the example of the audio signal process which concerns on embodiment. 実施形態に係るオーディオ信号処理装置として機能するコンピュータのハードウェア構成を示す図である。It is a figure which shows the hardware constitutions of the computer which functions as the audio signal processing apparatus which concerns on embodiment. 実施形態に係るオーディオ信号処理装置の機能構成を示す図である。It is a figure which shows the function structure of the audio signal processing apparatus which concerns on embodiment. オーディオ信号を処理する単位であるブロックを示す図である。It is a figure which shows the block which is a unit which processes an audio signal. ある一つのチャネルにおける処理を示す図である。It is a figure which shows the process in a certain channel. 実施形態に係るオーディオ信号処理装置の動作を示すフローチャートである。It is a flowchart which shows operation | movement of the audio signal processing apparatus which concerns on embodiment. 図６に示すコヒーレント成分の抽出の詳細を示すフローチャートである。It is a flowchart which shows the detail of extraction of the coherent component shown in FIG. 実施形態に係るオーディオ信号処理プログラムの構成を示す図である。It is a figure which shows the structure of the audio signal processing program which concerns on embodiment. 従来の手法におけるコヒーレント成分の抽出の例を示す図である。It is a figure which shows the example of extraction of the coherent component in the conventional method. 実施形態におけるコヒーレント成分の抽出の例を示す図である。It is a figure which shows the example of extraction of the coherent component in embodiment.

　以下、添付図面を参照しながら本発明の実施形態を詳細に説明する。なお、図面の説明において同一または同等の要素には同一の符号を付し、重複する説明を省略する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, the same or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.

　図１～図５を参照しながら、実施形態に係るオーディオ信号処理装置１０の機能および構成を説明する。オーディオ信号処理装置１０は、複数のチャネルのオーディオ信号のそれぞれをコヒーレント成分とフィールド成分とに分割するコンピュータである。オーディオ信号は、ヒトが聴くことができる周波数帯域（一般に約２０Ｈｚ～２００００Ｈｚ）の音を含むデジタル信号であり、必要に応じてアナログ信号に変換される。オーディオ信号で示される音の例として声、音楽、映像の音、自然音、あるいはこれらの任意の組合せが挙げられるが、これらに限定されるものではない。 The function and configuration of the audio signal processing apparatus 10 according to the embodiment will be described with reference to FIGS. The audio signal processing apparatus 10 is a computer that divides each audio signal of a plurality of channels into a coherent component and a field component. The audio signal is a digital signal including sound in a frequency band (generally about 20 Hz to 20000 Hz) that can be heard by humans, and is converted into an analog signal as necessary. Examples of sound represented by the audio signal include, but are not limited to, voice, music, video sound, natural sound, or any combination thereof.

　図１は、オーディオ信号処理装置１０によるオーディオ信号の処理の一例を示し、より具体的には、２チャネル（ＬチャネルおよびＲチャネル）、すなわちステレオのオーディオ信号の処理を示す。オーディオ信号処理装置１０は各チャネルの信号をコヒーレント成分とフィールド成分とに分割する。 FIG. 1 shows an example of audio signal processing by the audio signal processing apparatus 10, and more specifically shows processing of two channels (L channel and R channel), that is, stereo audio signals. The audio signal processing apparatus 10 divides each channel signal into a coherent component and a field component.

　ある一つのチャネルのコヒーレント成分とは、他のチャネルのオーディオ信号との相関が高い成分である。ある一つのチャネルのフィールド成分とは、該チャネルのオーディオ信号（すなわち、元の信号）と該チャネルのコヒーレント成分との差分である。より具体的には、フィールド成分はオーディオ信号からコヒーレント成分を差し引くことで得られる成分である。コヒーレント成分は明瞭な方向性を有する音であるのに対して、フィールド成分は、拡散性を持つ、周囲を取り巻くような音（ａｍｂｉｅｎｔ　ｓｏｕｎｄ）である。以下では、フィールド成分に対応する音を「フィールド音」ともいう。 A coherent component of one channel is a component having a high correlation with an audio signal of another channel. The field component of one channel is the difference between the audio signal of the channel (ie, the original signal) and the coherent component of the channel. More specifically, the field component is a component obtained by subtracting the coherent component from the audio signal. The coherent component is a sound having a clear direction, whereas the field component is an ambient sound having a diffusive nature. Hereinafter, the sound corresponding to the field component is also referred to as “field sound”.

　図１は、オーディオ信号処理装置１０がＬチャネルのオーディオ信号をＬチャネルのコヒーレント成分Ｌγおよびフィールド成分Ｌφに分割し、Ｒチャネルのオーディオ信号をＲチャネルのコヒーレント成分Ｒγおよびフィールド成分Ｒφに分割することを示す。コヒーレント成分ＬγはＲチャネルのオーディオ信号との相関が高い成分であり、コヒーレント成分ＲγはＬチャネルのオーディオ信号との相関が高い成分である。 FIG. 1 shows that an audio signal processing apparatus 10 divides an L channel audio signal into an L channel coherent component Lγ and a field component Lφ, and an R channel audio signal into an R channel coherent component Rγ and a field component Rφ. Indicates. The coherent component Lγ is a component having a high correlation with the R channel audio signal, and the coherent component Rγ is a component having a high correlation with the L channel audio signal.

　図１は２チャネルのオーディオ信号の処理を示すが、オーディオ信号処理装置１０は任意の個数のオーディオ信号を処理してよい。オーディオ信号処理装置１０は３以上のチャネルのオーディオ信号を処理してもよく、例えば、８Ｋスーパーハイビジョン用の２２．２チャネルのオーディオ信号を処理してもよい。 FIG. 1 shows the processing of a two-channel audio signal, but the audio signal processing apparatus 10 may process an arbitrary number of audio signals. The audio signal processing apparatus 10 may process audio signals of three or more channels. For example, the audio signal processing apparatus 10 may process 22.2 channel audio signals for 8K Super Hi-Vision.

　三次元空間での音の方向、距離、広がりを再現可能な立体音響効果を実現するために、複数チャネルのオーディオ信号は、三次元空間内に分散して配置された複数のマイクにより記録される。複数チャネルのオーディオ信号は、複数の目的音（ｏｂｊｅｃｔ　ｓｏｕｎｄ）が互いに混ざったり目的音がフィールド音と混ざったりしたかたちで記録される。一般に音源からの距離は個々のマイクで異なるため、ある特定の音が到着する時間はマイク毎に異なり、その結果、記録されたオーディオ信号のコヒーレントが低くなる。コヒーレント成分を各チャネルのオーディオ信号から取り出すことができれば、音の明瞭性および見かけの音源の幅（ＡＳＷ：Ａｐｐａｒｅｎｔ　Ｓｏｕｒｃｅ　Ｗｉｄｔｈ）を改善することができる。また、フィールド成分を抽出してこれをアップミックスに用いることで、良好なアンビエンス効果（聴取者の周囲を音が取り巻くような感じ）を生み出すことが可能になる。一般に、コヒーレント成分は主たる音源から発せられる目的音（例えば、歌声、楽器の音、スピーカから発せられる音など）に相当し、フィールド成分は、音の方向性が明瞭でない音（例えば、エコー、うなりなど）に相当する。 In order to achieve a three-dimensional sound effect that can reproduce the direction, distance, and spread of sound in a three-dimensional space, multi-channel audio signals are recorded by a plurality of microphones arranged in a three-dimensional space. . The audio signals of a plurality of channels are recorded in such a manner that a plurality of target sounds (object sound) are mixed with each other or the target sound is mixed with a field sound. In general, since the distance from a sound source differs among individual microphones, the time at which a specific sound arrives differs from microphone to microphone, and as a result, the coherence of the recorded audio signal becomes low. If the coherent component can be extracted from the audio signal of each channel, the clarity of the sound and the apparent sound source width (ASW: Appearance Source Width) can be improved. Further, by extracting the field component and using it for the upmix, it is possible to produce a good ambience effect (feeling that the sound surrounds the listener). In general, the coherent component corresponds to a target sound (for example, singing voice, instrument sound, or sound emitted from a speaker) emitted from the main sound source, and the field component is a sound whose directionality is not clear (for example, echo, beat). Etc.).

　Ｎ個のチャネルのうちｌ番目のチャネルのオーディオ信号をｘ_ｌ（ｎ）とすると、このオーディオ信号ｘ_ｌ（ｎ）はＭ個の目的音ｑ_ｌｍ（ｎ）（ｍ＝１，…，Ｍ）とフィールド音ｖ_ｌ（ｎ）とから成る。すなわち、オーディオ信号ｘ_ｌ（ｎ）は式（１）で示される。

Assuming that the audio signal of the l-th channel among the N channels is x _l (n), the audio signal x _l (n) is M target sounds q _lm (n) (m = 1,..., M). And field sound v _l (n). That is, the audio signal x _l (n) is expressed by the equation (1).

　この式（１）で示されるように、目的音とフィールド音とは互いに統計的に独立と見なすことができる。オーディオ信号ｘ_ｌ（ｎ）のコヒーレント成分γ_ｌ（ｎ）は式（２）で示される。

As shown in the equation (1), the target sound and the field sound can be regarded as being statistically independent from each other. The coherent component γ _l (n) of the audio signal x _l (n) is expressed by Equation (2).

　オーディオ信号ｘ_ｌ（ｎ）のフィールド成分φ_ｌ（ｎ）は式（３）で示される。

The field component φ _l (n) of the audio signal x _l (n) is expressed by Expression (3).

　オーディオ信号処理装置１０の具体的な実現方法は限定されない。例えば、オーディオ信号処理装置１０はパーソナル・コンピュータ、サーバ、携帯端末などのコンピュータに所定のプログラム（例えば、後述するオーディオ信号処理プログラムＰ１）をインストールすることで実現されてもよい。あるいは、アンプなどの音響機器がオーディオ信号処理装置１０として機能してもよい。 The specific method for realizing the audio signal processing apparatus 10 is not limited. For example, the audio signal processing apparatus 10 may be realized by installing a predetermined program (for example, an audio signal processing program P1 described later) in a computer such as a personal computer, a server, or a portable terminal. Alternatively, an audio device such as an amplifier may function as the audio signal processing device 10.

　図２は、オーディオ信号処理装置１０として機能するコンピュータ１００の一般的なハードウェア構成を示す。コンピュータ１００は、オペレーティングシステムやアプリケーション・プログラムなどを実行するプロセッサ（例えばＣＰＵ）１０１と、ＲＯＭおよびＲＡＭで構成される主記憶部１０２と、ハードディスクやフラッシュメモリなどで構成される補助記憶部１０３と、ネットワークカードまたは無線通信モジュールで構成される通信制御部１０４と、キーボードやマウスなどの入力装置１０５と、モニタなどの出力装置１０６とを備える。 FIG. 2 shows a general hardware configuration of the computer 100 functioning as the audio signal processing apparatus 10. The computer 100 includes a processor (for example, CPU) 101 that executes an operating system, application programs, and the like, a main storage unit 102 that includes a ROM and a RAM, an auxiliary storage unit 103 that includes a hard disk, a flash memory, and the like. A communication control unit 104 configured by a network card or a wireless communication module, an input device 105 such as a keyboard and a mouse, and an output device 106 such as a monitor are provided.

　オーディオ信号処理装置１０の各機能要素は、プロセッサ１０１または主記憶部１０２の上に所定のソフトウェア（例えば、後述するオーディオ信号処理プログラムＰ１）を読み込ませてそのソフトウェアを実行させることで実現される。プロセッサ１０１はそのソフトウェアに従って、通信制御部１０４、入力装置１０５、または出力装置１０６を動作させ、主記憶部１０２または補助記憶部１０３におけるデータの読み出し及び書き込みを行う。処理に必要なデータまたはデータベースは主記憶部１０２または補助記憶部１０３内に格納される。 Each functional element of the audio signal processing apparatus 10 is realized by reading predetermined software (for example, an audio signal processing program P1 described later) on the processor 101 or the main storage unit 102 and executing the software. The processor 101 operates the communication control unit 104, the input device 105, or the output device 106 in accordance with the software, and reads and writes data in the main storage unit 102 or the auxiliary storage unit 103. Data or a database necessary for processing is stored in the main storage unit 102 or the auxiliary storage unit 103.

　なお、オーディオ信号処理装置１０は１台のコンピュータで構成されてもよいし、複数台のコンピュータで構成されてもよい。複数台のコンピュータを用いる場合には、これらのコンピュータがインターネットやイントラネットなどの通信ネットワークを介して接続されることで、論理的に一つのオーディオ信号処理装置１０が構築される。 Note that the audio signal processing apparatus 10 may be composed of one computer or a plurality of computers. When a plurality of computers are used, one audio signal processing apparatus 10 is logically constructed by connecting these computers via a communication network such as the Internet or an intranet.

　図３は、オーディオ信号処理装置１０の機能構成を示す。図３に示すように、オーディオ信号処理装置１０は機能的構成要素として受付部１１、分割部１２、および出力部１３を備える。 FIG. 3 shows a functional configuration of the audio signal processing apparatus 10. As shown in FIG. 3, the audio signal processing apparatus 10 includes a receiving unit 11, a dividing unit 12, and an output unit 13 as functional components.

　受付部１１は、複数のチャネルのオーディオ信号を受け付ける機能要素である。「オーディオ信号を受け付ける」とは、オーディオ信号処理装置１０がオーディオ信号を任意の手法で取得することである。言い換えると、「オーディオ信号を受け付ける」とは、オーディオ信号がオーディオ信号処理装置１０に入力されることを意味する。各チャネルのオーディオ信号を受け付ける具体的な手法は限定されない。例えば、受付部１１はデータベースまたは他の装置にアクセスしてオーディオ信号のデータファイルを読み出すことでそのオーディオ信号を受け付けてもよい。あるいは、受付部１１は他の装置から通信ネットワーク経由で送られてきたオーディオ信号を受信してもよい。あるいは、受付部１１はオーディオ信号処理装置１０で入力されたオーディオ信号を取得してもよい。いずれにしても、受付部１１は受け付けた各チャネルのオーディオ信号を分割部１２に出力する。 The reception unit 11 is a functional element that receives audio signals of a plurality of channels. “Accepting an audio signal” means that the audio signal processing apparatus 10 acquires an audio signal by an arbitrary method. In other words, “accepting an audio signal” means that the audio signal is input to the audio signal processing apparatus 10. A specific method for receiving the audio signal of each channel is not limited. For example, the reception unit 11 may receive an audio signal by accessing a database or another device and reading out a data file of the audio signal. Or the reception part 11 may receive the audio signal sent via the communication network from the other apparatus. Alternatively, the reception unit 11 may acquire an audio signal input from the audio signal processing device 10. In any case, the receiving unit 11 outputs the received audio signal of each channel to the dividing unit 12.

　分割部１２は、各チャネルのオーディオ信号をコヒーレント成分とフィールド成分とに分割する機能要素である。以下の説明は、分割部１２が式（４）で示されるＮチャネルのオーディオ信号｛ｘ_ｌ（ｎ）｜ｌ＝１，…，Ｎ｝を処理することを前提とする。

The dividing unit 12 is a functional element that divides the audio signal of each channel into a coherent component and a field component. The following description is based on the premise that the dividing unit 12 processes the N-channel audio signal {x _l (n) | l = 1,..., N} expressed by Expression (4).

　まず、分割部１２は各チャネルのオーディオ信号を複数の時間区間の信号に分割する。具体的には、分割部１２は窓関数（例えば、カイザー・ベッセル窓）を用いてオーディオ信号を短い時間間隔（これを「フレーム」という）の信号に区切る。例えば、後述する変形離散コサイン変換（ＭＤＣＴ）において１０２４個の周波数点を用いるのであれば、分割部１２は２０４８点分の長さに相当するカイザー・ベッセル窓を用いてオーディオ信号を複数のフレームに分割する。通常、１フレーム内のサンプル数は適切な周波数分解能が得られるように決められるが、そのサンプル数はコヒーレント成分を推定するには十分ではない。そこで、分割部１２は連続する複数のフレーム（例えば２４個のフレーム）を一つの時間区間（これを「ブロック」という）の信号として設定する。図４はこのようなブロックの生成の概念を示し、より具体的には、２チャネル（ＬチャネルおよびＲチャネル）のオーディオ信号のそれぞれを複数のブロックに分割する処理を示す。 First, the dividing unit 12 divides the audio signal of each channel into a plurality of time interval signals. Specifically, the dividing unit 12 divides the audio signal into signals having a short time interval (referred to as “frame”) using a window function (for example, Kaiser-Bessel window). For example, if 1024 frequency points are used in the modified discrete cosine transform (MDCT) described later, the dividing unit 12 uses a Kaiser-Bessel window corresponding to the length of 2048 points to divide the audio signal into a plurality of frames. To divide. Usually, the number of samples in one frame is determined so as to obtain an appropriate frequency resolution, but the number of samples is not sufficient for estimating the coherent component. Therefore, the dividing unit 12 sets a plurality of continuous frames (for example, 24 frames) as a signal of one time section (referred to as “block”). FIG. 4 shows the concept of such block generation. More specifically, FIG. 4 shows a process of dividing each of two-channel (L channel and R channel) audio signals into a plurality of blocks.

　各チャネルのオーディオ信号を複数のブロックに分割すると、分割部１２は各チャネルの各ブロックに対して以下の処理を実行する。本明細書では、オーディオ信号をコヒーレント成分とフィールド成分とに分ける対象（すなわち、分割処理の対象）となるチャネルを「対象チャネル」という。ここでは、ある一つの対象チャネルにおける処理を説明する。 When the audio signal of each channel is divided into a plurality of blocks, the dividing unit 12 executes the following processing for each block of each channel. In this specification, a channel that is a target for dividing an audio signal into a coherent component and a field component (that is, a target of division processing) is referred to as a “target channel”. Here, processing in a certain target channel will be described.

　分割部１２は、対象チャネルのコヒーレント成分を抽出し、その後に該対象チャネルのフィールド成分を抽出する。図５は、その一連の処理の前半に相当する、コヒーレント成分の抽出の概念を示す。分割部１２は、フィルタバンクを用いて、対象チャネルであるｌ番目のチャネルのオーディオ信号ｘ_ｌ（ｎ）をＫ個の周波数帯域（サブバンド）の信号（これを「サブバンド信号」という。）に分割する。そして、分割部１２は各サブバンドにおいて、対象チャネル以外の他のチャネルのオーディオ信号を用いてコヒーレント成分γ_ｌ ^（ｋ）（ｎ）（ｋ＝１，…，Ｋ）を抽出する。分割部１２はこの抽出の際に最小二乗法を用いる。そして、分割部１２は全サブバンドのコヒーレント成分を加算することで、対象チャネルのコヒーレント成分γ_ｌ（ｎ）を抽出する。その後、分割部１２は、元のオーディオ信号ｘ_ｌ（ｎ）からコヒーレント成分γ_ｌ（ｎ）を差し引くことでフィールド成分φ_ｌ（ｎ）を抽出する。 The dividing unit 12 extracts a coherent component of the target channel, and then extracts a field component of the target channel. FIG. 5 shows the concept of extraction of coherent components corresponding to the first half of the series of processes. Using the filter bank, the dividing unit 12 converts the audio signal x _l (n) of the l-th channel, which is the target channel, into K frequency band (subband) signals (referred to as “subband signals”). Divide into Then, the dividing unit 12 extracts coherent components γ _l ^(k) (n) (k = 1,..., K) in each subband using audio signals of channels other than the target channel. The dividing unit 12 uses a least square method for this extraction. Then, the dividing unit 12 extracts the coherent components γ _l (n) of the target channel by adding the coherent components of all the subbands. Thereafter, the dividing unit 12 extracts the field component φ _l (n) by subtracting the coherent component γ _l (n) from the original audio signal x _l (n).

　分割部１２は対象チャネルのオーディオ信号の各ブロックについて以下の処理を実行する。 The dividing unit 12 executes the following processing for each block of the audio signal of the target channel.

　分割部１２はフィルタバンクを用いて各チャネルのオーディオ信号ｘ_ｌ（ｎ）をＫ個のサブバンド信号ｘ_ｌ ^（ｋ）（ｎ）に分割する。この分割は式（５）で示される。

The dividing unit 12 divides the audio signal x _l (n) of each channel into K subband signals x _l ^(k) (n) using a filter bank. This division is expressed by equation (5).

　なお、式（５）で示されるサブバンド信号ｘ_ｌ ^（ｋ）（ｎ）は時間領域での信号であり、したがって、時間領域サブバンド信号である。周波数領域での信号を用いる上記の非特許文献１～４の手法と異なり、オーディオ信号処理装置１０は時間領域サブバンド信号を用いるので、連続する任意のフレーム数の信号を一つのブロック信号として処理することで推定区間長を伸ばすことができる。この結果、得られたコヒーレント成分の音質を損なうことなく各チャネルのオーディオ信号を処理することができる。 Incidentally, wherein the subband signals represented by _{^{(5) x l (k)}} (n) is the signal in the time domain, therefore, it is a time-domain subband signals. Unlike the above-described methods of Non-Patent Documents 1 to 4 that use signals in the frequency domain, the audio signal processing apparatus 10 uses time-domain subband signals, and therefore processes a signal with an arbitrary number of consecutive frames as one block signal. By doing so, the estimated section length can be extended. As a result, the audio signal of each channel can be processed without impairing the sound quality of the obtained coherent component.

　続いて、分割部１２はこのサブバンド信号ｘ_ｌ ^（ｋ）（ｎ）を、対象チャネル以外のＮ－１個のチャネルの同帯域（同じサブバンド）のサブバンド信号｛ｘ_ｍ ^（ｋ）（ｎ）｜ｍ＝１，…，ｌ－１，ｌ＋１，…，Ｎ｝の線形結合から推定する。ある１ブロックに対応するこの線形結合は式（６）で示される。

Subsequently, the dividing unit 12 _{converts the} subband signal x _l ^(k) (n) into subband signals {x _m ^(k) (n) in the same band (same subband) of N−1 channels other than the target channel. n) Estimate from a linear combination of | m = 1,..., l−1, l + 1,. This linear combination corresponding to a certain block is expressed by Equation (6).

　推定信号

は、他チャネル（対象チャネル以外のＮ－１個のチャネル）の同帯域の信号との相関が高い成分と考えることができる。対象チャネルのサブバンド信号とこの推定信号との推定誤差ｅ_ｌ ^（ｋ）（ｎ）は式（７）で示される。

Estimated signal

Can be considered as a component having a high correlation with signals in the same band of other channels (N−1 channels other than the target channel). An estimation error e _l ^(k) (n) between the subband signal of the target channel and the estimated signal is expressed by Expression (7).

　分割部１２は、この推定誤差を最小にする係数｛ａ_ｍ ^（ｋ）｜ｍ＝１，…，ｌ－１，ｌ＋１，…，Ｎ｝を最小二乗法で求める。最小化すべき誤差関数は式（８）で示される。

The dividing unit 12 obtains coefficients {a _m ^(k) | m = 1,..., L−1, l + 1,..., N} that minimize the estimation error by the least square method. The error function to be minimized is given by equation (8).

　ここで、

とすると、最適な係数群

は式（９）を満たす。

here,

Then, the optimal coefficient group

Satisfies equation (9).

　この式（９）をｍ＝１，…，ｌ－１，ｌ＋１，…，Ｎで連立させると式（１０）が得られる。

ここで、

である。 When this equation (9) is made simultaneous with m = 1,..., L−1, l + 1,.

here,

It is.

　ｋ番目のサブバンドにおける対象チャネルの係数ベクトルａ＾_ｌ ^（ｋ）は式（１１）により得られる。

The coefficient vector a ^ _l ^(k) of the target channel in the k-th subband is obtained by Expression (11).

　ｋ番目のサブバンドにおける対象チャネルのコヒーレント成分γ_ｌ ^（ｋ）（ｎ）は式（１２）により得られる。このコヒーレント成分γ_ｌ ^（ｋ）（ｎ）は、対象チャネル以外のチャネルのオーディオ信号を用いて算出される推定信号のうち該対象チャネルのオーディオ信号との相関が最も高い推定信号に相当する。

The coherent component γ _l ^(k) (n) of the target channel in the kth subband is obtained by Equation (12). This coherent component γ _l ^(k) (n) corresponds to an estimated signal having the highest correlation with the audio signal of the target channel among the estimated signals calculated using the audio signals of channels other than the target channel.

　分割部１２はすべてのサブバンドについてコヒーレント成分を求める。そして、分割部１２は全サブバンドのコヒーレント成分を加算することで対象チャネルのコヒーレント成分を求める。この処理は式（１３）で示される。

The dividing unit 12 obtains coherent components for all subbands. Then, the dividing unit 12 obtains the coherent component of the target channel by adding the coherent components of all the subbands. This process is expressed by equation (13).

　さらに、分割部１２は対象チャネルの元のオーディオ信号からそのコヒーレント成分を差し引くことで、対象チャネルのフィールド成分を求める。この処理は上記式（３）で示される。 Further, the dividing unit 12 obtains the field component of the target channel by subtracting the coherent component from the original audio signal of the target channel. This processing is expressed by the above formula (3).

　なお、分割部１２は、各サブバンドにおいてオーディオ信号からコヒーレント成分を差し引くことでフィールド成分を求め、全サブバンドのフィールド成分を加算することで対象チャネルのフィールド成分を求めてもよい。具体的には、ｋ番目のサブバンドにおける対象チャネルのフィールド成分φ_ｌ ^（ｋ）（ｎ）は式（１４）により得られる。

対象チャネルのフィールド成分φ_ｌ（ｎ）は式（１５）により得られる。

The dividing unit 12 may obtain a field component by subtracting a coherent component from the audio signal in each subband, and may obtain a field component of the target channel by adding the field components of all subbands. Specifically, the field component φ _l ^(k) (n) of the target channel in the k-th subband is obtained by Expression (14).

The field component φ _l (n) of the target channel is obtained by Equation (15).

　分割部１２は上記の処理を対象チャネルのオーディオ信号の各ブロックに対して実行する。そして、分割部１２は全ブロックのコヒーレント成分を連結することで対象チャネルのコヒーレント成分を抽出する。また、分割部１２は全ブロックのフィールド成分を連結することで対象チャネルのフィールド成分を生成する。 The dividing unit 12 performs the above processing on each block of the audio signal of the target channel. Then, the dividing unit 12 extracts the coherent component of the target channel by connecting the coherent components of all blocks. Further, the dividing unit 12 generates the field component of the target channel by concatenating the field components of all blocks.

　分割部１２は複数のチャネルのそれぞれを対象チャネルとして設定して上記の処理を実行することで、全チャネルについてコヒーレント成分およびフィールド成分を生成する。そして、分割部１２は全チャネルのコヒーレント成分およびフィールド成分を出力部１３に出力する。 The dividing unit 12 generates a coherent component and a field component for all channels by setting each of a plurality of channels as a target channel and executing the above processing. Then, the division unit 12 outputs the coherent components and field components of all channels to the output unit 13.

　このように、分割部１２は各チャネルのオーディオ信号に別の信号を追加することなく（すなわち、原音に別の音を追加することなく）、各チャネルのオーディオ信号をコヒーレント成分とフィールド成分とに分割する。 As described above, the dividing unit 12 does not add another signal to the audio signal of each channel (that is, without adding another sound to the original sound), and converts the audio signal of each channel into a coherent component and a field component. To divide.

　出力部１３は、分割部１２により生成された各チャネルのコヒーレント成分およびフィールド成分を処理結果として出力する機能要素である。この処理結果は、Ｎチャネルから２Ｎチャネルへのアップミックスを実現したものであるということができる。処理結果の出力方法は何ら限定されない。例えば、出力部１３は処理結果をメモリやデータベースなどの記憶装置に格納してもよいし、通信ネットワークを介して他の装置に送信してもよい。あるいは、出力部１３は各チャネルのコヒーレント成分およびフィールド成分を対応するスピーカに出力してもよい。いずれにしても、オーディオ信号処理装置１０による処理結果を用いて、既存の音声素材を、より多くのチャネル数を持つコンテンツの制作に利用したり、より多くのチャネルを有するオーディオ・システムで再生したりすることが可能になる。 The output unit 13 is a functional element that outputs the coherent component and field component of each channel generated by the dividing unit 12 as a processing result. This processing result can be said to be an upmix from N channel to 2N channel. The output method of the processing result is not limited at all. For example, the output unit 13 may store the processing result in a storage device such as a memory or a database, or may transmit the processing result to another device via a communication network. Alternatively, the output unit 13 may output the coherent component and field component of each channel to a corresponding speaker. In any case, using the processing result of the audio signal processing apparatus 10, existing audio material can be used for production of contents having a larger number of channels, or reproduced by an audio system having a larger number of channels. It becomes possible to do.

　オーディオ信号処理装置１０は、Ｎチャネルのオーディオ信号を２Ｎより大きい数のチャネルにアップミックスしてもよい。具体的には、オーディオ信号処理装置１０は、抽出した複数のフィールド成分を下記参考文献に記載の手法で無相関化することで、チャネル間の相関が互いに異なる信号を生成する。これにより、Ｎより多い個数のフィールド成分が得られる。例えば、ステレオの音声素材を５．１チャネルの音声素材に変換したり、５．１チャネルのオーディオ・システムを用いてより高い臨場感で再生したりすることができる。あるいは、５．１チャネルの音声素材を２２．２チャネルの音声素材に変換したり、２２．２チャネルのオーディオ・システムを用いてより高い臨場感で再生したりすることができる。
　（参考文献）J. Breebaart and C. Fallar, “Spatial Audio Processing - MPEG Surround and Other Applications,” Wiley, 2007. The audio signal processing apparatus 10 may upmix an N-channel audio signal into a number of channels larger than 2N. Specifically, the audio signal processing apparatus 10 generates signals having different correlations between channels by decorrelating the extracted plurality of field components using a technique described in the following reference. Thereby, more than N field components are obtained. For example, stereo audio material can be converted into 5.1 channel audio material, and can be reproduced with higher presence using a 5.1 channel audio system. Alternatively, 5.1 channel audio material can be converted to 22.2 channel audio material, or reproduced with higher presence using a 22.2 channel audio system.
(Reference) J. Breebaart and C. Fallar, “Spatial Audio Processing-MPEG Surround and Other Applications,” Wiley, 2007.

　オーディオ信号処理装置１０は、Ｎチャネルのオーディオ信号を、２Ｎより小さいＪ個のオーディオ信号（ただし、Ｊ＞Ｎ）のオーディオ信号にアップミックスしてもよい。具体的には、オーディオ信号処理装置１０はＮ個のフィールド成分をミキシングすることで、ＮチャネルからＪチャネルへのアップミックスを実現する。 The audio signal processing apparatus 10 may upmix the N-channel audio signal into audio signals of J audio signals smaller than 2N (where J> N). Specifically, the audio signal processing apparatus 10 realizes an upmix from the N channel to the J channel by mixing N field components.

　オーディオ信号処理装置１０による処理結果はアップミックスだけでなくダウンミックスにも利用可能である。 The processing result by the audio signal processing apparatus 10 can be used not only for upmixing but also for downmixing.

　次に、図６および図７を参照しながら、オーディオ信号処理装置１０の動作を説明するとともに本実施形態に係るオーディオ信号処理方法について説明する。オーディオ信号処理装置１０では、まず、受付部１１が複数のチャネルのオーディオ信号を受け付ける（受付ステップ）。続いて、分割部１２がオーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する（分割ステップ）。そして、出力部１３が各チャネルのコヒーレント成分およびフィールド成分を出力する（出力ステップ）。以下では、特に重要な分割部１２の処理（分割ステップ）について詳しく説明する。 Next, the operation of the audio signal processing apparatus 10 will be described with reference to FIGS. 6 and 7, and the audio signal processing method according to the present embodiment will be described. In the audio signal processing apparatus 10, first, the reception unit 11 receives audio signals of a plurality of channels (reception step). Subsequently, the dividing unit 12 executes a dividing process for dividing each audio signal into a coherent component and a field component for each channel (dividing step). Then, the output unit 13 outputs the coherent component and field component of each channel (output step). Hereinafter, a particularly important process (dividing step) of the dividing unit 12 will be described in detail.

　図６は、一つの対象チャネルのコヒーレント成分およびフィールド成分を生成する処理を示す。 FIG. 6 shows a process of generating a coherent component and a field component of one target channel.

　まず、分割部１２は各チャネルのオーディオ信号を複数のブロックに分割する（ステップＳ１１）。なお、ステップＳ１１において分割した各チャネルおよび各ブロックのオーディオ信号を保存することで、２番目以降の対象チャネルを処理する際にはステップＳ１１を省略することができる。 First, the dividing unit 12 divides the audio signal of each channel into a plurality of blocks (step S11). Note that by storing the audio signal of each channel and each block divided in step S11, step S11 can be omitted when processing the second and subsequent target channels.

　続いて、分割部１２は対象チャネルの複数のブロックのうちの一つを処理対象として設定する（ステップＳ１２）。続いて、分割部１２は、対象チャネル以外のチャネルのオーディオ信号を用いて算出される推定信号のうち、対象チャネルのオーディオ信号との相関が最も高い推定信号を、対象チャネルのコヒーレント成分として抽出する（ステップＳ１３）。続いて、分割部１２は、対象チャネルのオーディオ信号とそのコヒーレント成分との差分を、対象チャネルのフィールド成分として抽出する（ステップＳ１４）。このような処理により、分割部１２は対象チャネルの１ブロックのコヒーレント成分およびフィールド成分を得る。 Subsequently, the dividing unit 12 sets one of a plurality of blocks of the target channel as a processing target (step S12). Subsequently, the dividing unit 12 extracts an estimated signal having the highest correlation with the audio signal of the target channel from among the estimated signals calculated using the audio signals of channels other than the target channel as a coherent component of the target channel. (Step S13). Subsequently, the dividing unit 12 extracts a difference between the audio signal of the target channel and the coherent component thereof as a field component of the target channel (step S14). By such processing, the dividing unit 12 obtains a coherent component and a field component of one block of the target channel.

　分割部１２は一つのブロックを処理すると次のブロックの処理に移る（ステップＳ１５参照）。すなわち、分割部１２は次のブロックを処理対象として設定し（ステップＳ１２）、そのブロックのコヒーレント成分およびフィールド成分を生成する（ステップＳ１３およびＳ１４）。分割部１２はすべてのブロックについてステップＳ１２～Ｓ１４の処理を実行し、全ブロックのコヒーレント成分およびフィールド成分を生成する（ステップＳ１５においてＹＥＳ）。そして、分割部１２は全ブロックのコヒーレント成分を連結することで対象チャネルの最終的なコヒーレント成分を得ると共に、全ブロックのフィールド成分を連結することで対象チャネルの最終的なフィールド成分を得る。 When the dividing unit 12 processes one block, the process proceeds to the next block (see step S15). That is, the dividing unit 12 sets the next block as a processing target (step S12), and generates a coherent component and a field component of the block (steps S13 and S14). The dividing unit 12 executes the processing of steps S12 to S14 for all blocks, and generates coherent components and field components of all blocks (YES in step S15). Then, the dividing unit 12 obtains the final coherent component of the target channel by concatenating the coherent components of all blocks, and obtains the final field component of the target channel by concatenating the field components of all blocks.

　図７は、図６におけるステップＳ１３の処理の詳細、すなわち、対象チャネルのコヒーレント成分を生成する処理の詳細を示す。図７に示す処理は対象チャネルのオーディオ信号の各ブロックについて実行される。 FIG. 7 shows details of the processing in step S13 in FIG. 6, that is, details of processing for generating a coherent component of the target channel. The process shown in FIG. 7 is executed for each block of the audio signal of the target channel.

　まず、分割部１２は各チャネル（対象チャネルおよびすべての他チャネル）について、ブロック信号を複数のサブバンドに分割することで複数のサブバンド信号を生成する（ステップＳ１３１）。続いて、分割部１２は複数のサブバンドのうちの一つを処理対象として設定する（ステップＳ１３２）。続いて、分割部１２は、対象チャネル以外のチャネルのサブバンド信号を用いて算出される推定信号のうち、対象チャネルのサブバンド信号との相関が最も高い推定信号を、処理対象であるサブバンドにおける対象チャネルのコヒーレント成分として抽出する（ステップＳ１３３）。分割部１２はすべてのサブバンドについてステップＳ１３２およびＳ１３３の処理を実行する（ステップＳ１３４参照）。対象チャネルについて全サブバンドのコヒーレント成分を生成すると（ステップＳ１３４においてＹＥＳ）、分割部１２はそれらのコヒーレント成分を加算することで対象チャネルのコヒーレント成分（より具体的には、１ブロック分のコヒーレント成分）を生成する（ステップＳ１３５）。 First, the dividing unit 12 generates a plurality of subband signals by dividing the block signal into a plurality of subbands for each channel (target channel and all other channels) (step S131). Subsequently, the dividing unit 12 sets one of a plurality of subbands as a processing target (step S132). Subsequently, the dividing unit 12 selects an estimated signal having the highest correlation with the subband signal of the target channel from among the estimated signals calculated using the subband signals of channels other than the target channel. As a coherent component of the target channel at (step S133). The dividing unit 12 executes the processes of steps S132 and S133 for all subbands (see step S134). When the coherent components of all subbands are generated for the target channel (YES in step S134), the dividing unit 12 adds the coherent components to add the coherent components of the target channel (more specifically, the coherent components for one block). ) Is generated (step S135).

　次に、図８を参照しながら、コンピュータをオーディオ信号処理装置１０として機能させるためのオーディオ信号処理プログラムＰ１を説明する。 Next, an audio signal processing program P1 for causing a computer to function as the audio signal processing apparatus 10 will be described with reference to FIG.

　オーディオ信号処理プログラムＰ１はメインモジュールＰ１０、受付モジュールＰ１１、分割モジュールＰ１２、および出力モジュールＰ１３を含む。メインモジュールＰ１０は、オーディオ信号の処理を統括的に実行する部分である。受付モジュールＰ１１、分割モジュールＰ１２、および出力モジュールＰ１３を実行することにより実現される機能はそれぞれ、上記の受付部１１、分割部１２、および出力部１３の機能と同様である。 The audio signal processing program P1 includes a main module P10, a reception module P11, a division module P12, and an output module P13. The main module P10 is a part that performs overall processing of audio signals. The functions realized by executing the reception module P11, the division module P12, and the output module P13 are the same as the functions of the reception unit 11, the division unit 12, and the output unit 13, respectively.

　オーディオ信号処理プログラムＰ１は、例えば、ＣＤ－ＲＯＭやＤＶＤ－ＲＯＭ、半導体メモリなどの有形の記録媒体に固定的に記録された上で提供されてもよい。あるいは、オーディオ信号処理プログラムＰ１は、搬送波に重畳されたデータ信号として通信ネットワークを介して提供されてもよい。 The audio signal processing program P1 may be provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, DVD-ROM, or semiconductor memory. Alternatively, the audio signal processing program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.

　以上説明したように、本発明の一側面に係るオーディオ信号処理装置は、複数のチャネルのオーディオ信号を受け付ける受付部と、オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割部であって、分割処理が、分割処理の対象となる一つのチャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルのオーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルのオーディオ信号との相関が最も高い推定信号を該対象チャネルのコヒーレント成分として抽出するステップと、対象チャネルのオーディオ信号と該対象チャネルのコヒーレント成分との差分を該対象チャネルのフィールド成分として抽出するステップとを含む、該分割部と、分割部により抽出された各チャネルのコヒーレント成分およびフィールド成分を出力する出力部とを備える。 As described above, the audio signal processing device according to one aspect of the present invention executes a reception unit that receives audio signals of a plurality of channels and a division process that divides the audio signal into coherent components and field components for each channel. A dividing unit that performs division processing when the target channel is one channel that is the target of the division processing, the target of the estimated signals calculated using at least audio signals of channels other than the target channel An estimation signal having the highest correlation with the audio signal of the channel is extracted as a coherent component of the target channel, and a difference between the audio signal of the target channel and the coherent component of the target channel is extracted as a field component of the target channel. Including a step, and the dividing unit and the dividing unit And an output unit for outputting the coherent component and field component of each issued channels.

　このような側面においては、対象チャネル以外のチャネルのオーディオ信号を用いて推定され、且つ該対象チャネルの実際のオーディオ信号ごとの相関が最も高い信号が該対象チャネルのコヒーレント成分として抽出される。また、対象チャネルの実際のオーディオ信号とそのコヒーレント成分との差分が該対象チャネルのフィールド成分として抽出される。このコヒーレント成分およびフィールド成分は各チャネルについて得られる。このように、音を追加することなく元のオーディオ信号のみを用いて各チャネルのコヒーレント成分およびフィールド成分を求めることで、原音の雰囲気（例えば本来の音色）を可能な限りまたは完全に維持することができる。加えて、コヒーレント成分およびフィールド成分は元のチャネル数の分だけ求めることができるので、この手法は原音のチャネル数にかかわらず適用できる。例えば、本発明の一側面は２チャネル、３チャネル、５．１チャネル、２２．２チャネルなどの任意のチャネル数のオーディオ信号に対して適用できる。 In such an aspect, a signal that is estimated using an audio signal of a channel other than the target channel and has the highest correlation for each actual audio signal of the target channel is extracted as a coherent component of the target channel. Further, the difference between the actual audio signal of the target channel and its coherent component is extracted as the field component of the target channel. This coherent component and field component are obtained for each channel. In this way, the coherent and field components of each channel are determined using only the original audio signal without adding sound, so that the atmosphere of the original sound (for example, the original tone) is maintained as completely or completely as possible. Can do. In addition, since the coherent component and the field component can be obtained by the number of original channels, this method can be applied regardless of the number of channels of the original sound. For example, one aspect of the present invention can be applied to audio signals having an arbitrary number of channels such as 2 channels, 3 channels, 5.1 channels, and 22.2 channels.

　図９および図１０を用いて上記側面の優位性を説明する。図９は従来の手法におけるコヒーレント成分の抽出の例を示す図であり、図１０は上記側面におけるコヒーレント成分の抽出の例を示す図である。図９，１０共に、三角形状に配置された三つのスピーカ９０からオーディオ信号が出力される例を示し、したがって、この例は３チャネルのオーディオ・システムを示す。 The superiority of the above aspect will be described with reference to FIGS. 9 and 10. FIG. 9 is a diagram illustrating an example of extraction of a coherent component in a conventional method, and FIG. 10 is a diagram illustrating an example of extraction of a coherent component in the above-described aspect. 9 and 10 both show an example in which audio signals are output from three speakers 90 arranged in a triangular shape, and thus this example shows a three-channel audio system.

　図９に示すように、上記の非特許文献３，４に記載の手法では、２チャネルのオーディオ信号の間で相関が高い成分をコヒーレント成分９１として抽出する（なお、破線９２はフィールド成分を示す）。したがって、このような従来の手法では、二つのスピーカ（チャネル）９０の中間部分９３に位置する音の情報しか取得することができず、三つのスピーカ（チャネル）９０で囲まれた領域の中央部分９４に位置する音の情報を抽出することができない。 As shown in FIG. 9, in the methods described in Non-Patent Documents 3 and 4 described above, a component having a high correlation between two-channel audio signals is extracted as a coherent component 91 (note that a broken line 92 indicates a field component) ). Therefore, in such a conventional method, only the information of the sound located in the middle portion 93 of the two speakers (channels) 90 can be acquired, and the central portion of the region surrounded by the three speakers (channels) 90 Information on the sound located at 94 cannot be extracted.

　これに対して上記側面では、あるスピーカ（チャネル）９０のコヒーレント成分が他のスピーカ（チャネル）９０の信号から推定される。そのため、図１０に示すように、三つのスピーカ（チャネル）９０で囲まれた領域の中央部分９５に位置する音の情報を抽出することができる。この中央部分９５は、図９における部分９３，９４の和に相当し得る。 On the other hand, in the above aspect, the coherent component of one speaker (channel) 90 is estimated from the signal of another speaker (channel) 90. Therefore, as shown in FIG. 10, it is possible to extract information on the sound located in the central portion 95 of the area surrounded by the three speakers (channels) 90. This central portion 95 may correspond to the sum of the

portions

93 and 94 in FIG.

　他の側面に係るオーディオ信号処理装置では、分割処理が、窓関数を用いてオーディオ信号を複数のフレームに区切る処理を各チャネルについて実行するステップと、連続する少なくとも二つのフレームを一つのブロックにまとめる処理を複数のフレームの全体に対して実行することで複数のブロックを生成する処理を各チャネルについて実行するステップと、ブロックのそれぞれにおいて対象チャネルのコヒーレント成分を抽出するステップとを含んでもよい。 In the audio signal processing device according to another aspect, the dividing process performs a process of dividing the audio signal into a plurality of frames using a window function for each channel, and combines at least two consecutive frames into one block. The process of generating a plurality of blocks by executing the process on the whole of the plurality of frames may be executed for each channel, and the step of extracting the coherent component of the target channel in each of the blocks may be included.

　複数のフレームで構成されるブロックを採用することで、コヒーレント成分の推定のためのサンプル数が多くなるので、コヒーレント成分をより精度良く抽出することが可能になる。 By adopting a block composed of a plurality of frames, the number of samples for estimating the coherent component increases, so that the coherent component can be extracted with higher accuracy.

　他の側面に係るオーディオ信号処理装置では、分割部が、各チャネルのオーディオ信号を複数のサブバンドに分割することで、各チャネルについて複数のサブバンド信号を生成するステップと、複数のサブバンドのそれぞれにおいて対象チャネルのコヒーレント成分を抽出するステップと、複数のサブバンドにおけるコヒーレント成分を加算することで対象チャネルのコヒーレント成分を抽出するステップとを含んでもよい。 In the audio signal processing device according to another aspect, the dividing unit divides the audio signal of each channel into a plurality of subbands, thereby generating a plurality of subband signals for each channel; Extracting the coherent component of the target channel in each, and extracting the coherent component of the target channel by adding the coherent components in a plurality of subbands may be included.

　一般に、音声処理では一部の周波数が他の周波数よりも重要であることが多い。サブバンド毎に処理することで、それぞれの周波数帯で要求される精度に応じてコヒーレント成分を抽出することができ、ひいてはコヒーレント成分およびフィールド成分を精度良く抽出することができる。 In general, in audio processing, some frequencies are often more important than other frequencies. By performing processing for each subband, a coherent component can be extracted according to the accuracy required in each frequency band, and thus a coherent component and a field component can be extracted with high accuracy.

　以下、実施例に基づいて本発明を具体的に説明するが、本発明はそれらに何ら限定されるものではない。 Hereinafter, the present invention will be specifically described based on examples, but the present invention is not limited thereto.

　表１に示される７個のステレオ音声素材（すなわち、２チャネルのオーディオ信号）を用意した。いずれの音声素材も市販のＣＤから入手したものであり、サンプリング周波数は４４．１ｋＨｚであった。表１の名前欄は曲名または楽曲の種類を示し、説明欄は演奏の形態を示す。ミキシング欄における「Ａｒｔｉｆｉｃａｌ」はミキシング処理が施された素材であることを示し、「Ｎａｔｕｒａｌ」はミキシング処理が施されていない素材であることを示す。長さ欄は再生時間を示す。

Seven stereo sound materials (that is, 2-channel audio signals) shown in Table 1 were prepared. All audio materials were obtained from commercially available CDs, and the sampling frequency was 44.1 kHz. The name column in Table 1 shows the song name or the type of song, and the explanation column shows the form of performance. “Artifical” in the mixing column indicates that the material has been subjected to mixing processing, and “Natural” indicates that the material has not been subjected to mixing processing. The length column shows the playback time.

　オーディオ信号を完全に再構築できるフィルタバンクを構築するために、変形離散コサイン変換（ＭＤＣＴ）を用いた重畳加算法を採用した。オーディオ信号を複数のフレームに分割するための窓関数としてカイザー・ベッセル窓を用いた。フレーム長は２０４８点とし、これは、ＭＤＣＴにおいて１０２４個の周波数点が得られることを意味する。その周波数点を表２に示すように２３個のサブバンドにまとめた。これらのサブバンドは、ＭＰＥＧ－２　ＡＡＣ標準を参考に、４８ｋＨｚ　ｌｏｎｇ　ＦＦＴ（高速フーリエ変換）における６９個のサブバンドを三つの連続するサブバンド毎に一つにまとめたものである。２４個のフレームを１ブロックとした。サンプリング周波数が４４．１ｋＨｚであれば、ブロック長は０．５８秒に相当するものであった。

In order to construct a filter bank that can completely reconstruct the audio signal, a superposition addition method using a modified discrete cosine transform (MDCT) was employed. The Kaiser-Bessel window was used as a window function for dividing the audio signal into a plurality of frames. The frame length is 2048 points, which means that 1024 frequency points are obtained in MDCT. The frequency points were grouped into 23 subbands as shown in Table 2. These subbands are a collection of 69 subbands in a 48 kHz long FFT (Fast Fourier Transform), one for every three consecutive subbands, with reference to the MPEG-2 AAC standard. 24 frames were taken as one block. If the sampling frequency was 44.1 kHz, the block length was equivalent to 0.58 seconds.

　実験結果をチャネル間の相互相関係数で評価した。原音、コヒーレント成分、およびフィールド成分の相互相関係数を表３に示す。コヒーレント成分は原音よりも高い相互相関を示した。このようなコヒーレント成分は原音よりも狭い音場の雰囲気をもたらす。一方、フィールド成分は、一個の素材（“Ｑｕｉｅｔ　Ｎｉｇｈｔ”）を除いて負の相互相関を示した。負の相互相関を示すフィールド成分を側方もしくは後方に設置したスピーカで再生すれば、良好なアンビエンス効果が得られる。その結果として、臨場感の高い音を再生することができる。

The experimental results were evaluated with the cross-correlation coefficient between channels. Table 3 shows cross-correlation coefficients of the original sound, the coherent component, and the field component. The coherent component showed higher cross-correlation than the original sound. Such a coherent component provides a sound field atmosphere narrower than the original sound. On the other hand, the field component showed a negative cross-correlation except for one material (“Quiet Night”). If a field component showing a negative cross-correlation is reproduced by a speaker installed on the side or rear, a good ambience effect can be obtained. As a result, it is possible to reproduce a sound with a high presence.

　以上、本発明をその実施形態に基づいて詳細に説明した。しかし、本発明は上記実施形態に限定されるものではない。本発明は、その要旨を逸脱しない範囲で様々な変形が可能である。 The present invention has been described in detail above based on the embodiments. However, the present invention is not limited to the above embodiment. The present invention can be variously modified without departing from the gist thereof.

　上記実施形態では、分割部１２が、ある一つの対象チャネルのコヒーレント成分を、該対象チャネル以外のチャネルのオーディオ信号を用いて推定した。この変形例として、分割部は、当該他チャネルのオーディオ信号と、対象チャネルの過去のオーディオ信号および当該他チャネルの過去のオーディオ信号の少なくとも一方とを用いて、該対象チャネルのコヒーレント成分を推定してもよい。ここで、「過去のオーディオ信号」とは、処理対象のブロックより時間的に前のブロックのオーディオ信号である。対象チャネルおよび他チャネルのうちの一方または双方の過去のオーディオ信号も用いて、処理対象のブロックにおける対象チャネルのオーディオ信号を推定することで、コヒーレント成分をより精度良く抽出することが期待できる。 In the above embodiment, the dividing unit 12 estimates a coherent component of a certain target channel using an audio signal of a channel other than the target channel. As a modification, the dividing unit estimates the coherent component of the target channel using the audio signal of the other channel and at least one of the past audio signal of the target channel and the past audio signal of the other channel. May be. Here, the “past audio signal” is an audio signal of a block temporally preceding the block to be processed. By estimating the audio signal of the target channel in the block to be processed using past audio signals of one or both of the target channel and the other channel, it can be expected to extract the coherent component with higher accuracy.

　少なくとも一つのプロセッサにより実行されるオーディオ信号処理方法の手順は上記実施形態での例に限定されない。例えば、オーディオ信号処理装置は上述したステップ（処理）の一部を省略してもよいし、別の順序で各ステップを実行してもよい。また、上述したステップのうちの任意の２以上のステップが組み合わされてもよいし、ステップの一部が修正又は削除されてもよい。あるいは、オーディオ信号処理装置は上記の各ステップに加えて他のステップを実行してもよい。 The procedure of the audio signal processing method executed by at least one processor is not limited to the example in the above embodiment. For example, the audio signal processing apparatus may omit some of the steps (processes) described above, or may execute the steps in a different order. Also, any two or more of the steps described above may be combined, or a part of the steps may be corrected or deleted. Alternatively, the audio signal processing apparatus may execute other steps in addition to the above steps.

　オーディオ信号処理装置は、二つの数値の大小関係を比較する際に、「以上」および「よりも大きい」という二つの基準のどちらを用いてもよく、「以下」および「未満」の二つの基準のうちのどちらを用いてもよい。このような基準の選択は、二つの数値の大小関係を比較する処理についての技術的意義を変更するものではない。 The audio signal processing apparatus may use either of the two criteria “greater than” and “greater than” when comparing the magnitude relationship between the two values, and the two criteria “less than” and “less than”. Either of these may be used. The selection of such a standard does not change the technical significance of the process of comparing the magnitude relationship between two numerical values.

　１０…オーディオ信号処理装置、１１…受付部、１２…分割部、１３…出力部、ｅｌ…推定誤差、Ｐ１…オーディオ信号処理プログラム、Ｐ１０…メインモジュール、Ｐ１１…受付モジュール、Ｐ１２…分割モジュール、Ｐ１３…出力モジュール。 DESCRIPTION OF SYMBOLS 10 ... Audio signal processing apparatus, 11 ... Acceptance part, 12 ... Dividing part, 13 ... Output part, el ... Estimation error, P1 ... Audio signal processing program, P10 ... Main module, P11 ... Accepting module, P12 ... Dividing module, P13 ... output module.

Claims

　複数のチャネルのオーディオ信号を受け付ける受付部と、
　前記オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割部であって、前記分割処理が、
　　前記分割処理の対象となる一つの前記チャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルの前記オーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルの前記オーディオ信号との相関が最も高い推定信号を該対象チャネルの前記コヒーレント成分として抽出するステップと、
　　前記対象チャネルの前記オーディオ信号と該対象チャネルの前記コヒーレント成分との差分を該対象チャネルの前記フィールド成分として抽出するステップと
を含む、該分割部と、
　前記分割部により抽出された各チャネルの前記コヒーレント成分および前記フィールド成分を出力する出力部と
を備えるオーディオ信号処理装置。 A reception unit for receiving audio signals of a plurality of channels;
A division unit that performs a division process for dividing the audio signal into a coherent component and a field component for each channel, wherein the division process includes:
Correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signal of a channel other than the target channel when one channel to be divided is the target channel Extracting the highest estimated signal as the coherent component of the target channel;
Extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and
An audio signal processing apparatus comprising: an output unit that outputs the coherent component and the field component of each channel extracted by the dividing unit.
　前記分割処理が、
　　窓関数を用いてオーディオ信号を複数のフレームに区切る処理を各チャネルについて実行するステップと、
　　連続する少なくとも二つの前記フレームを一つのブロックにまとめる処理を前記複数のフレームの全体に対して実行することで複数の前記ブロックを生成する処理を各チャネルについて実行するステップと、
　　前記ブロックのそれぞれにおいて前記対象チャネルの前記コヒーレント成分を抽出するステップと
を含む、
請求項１に記載のオーディオ信号処理装置。 The dividing process is
Performing for each channel a process of dividing the audio signal into a plurality of frames using a window function;
Executing a process of generating a plurality of the blocks by executing a process of grouping at least two consecutive frames into one block on the whole of the plurality of frames, and
Extracting the coherent component of the channel of interest in each of the blocks.
The audio signal processing apparatus according to claim 1.
　前記分割部が、
　　各チャネルのオーディオ信号を複数のサブバンドに分割することで、各チャネルについて複数のサブバンド信号を生成するステップと、
　　前記複数のサブバンドのそれぞれにおいて前記対象チャネルのコヒーレント成分を抽出するステップと、
　　前記複数のサブバンドにおけるコヒーレント成分を加算することで前記対象チャネルのコヒーレント成分を抽出するステップと
を含む、
請求項１または２に記載のオーディオ信号処理装置。 The dividing unit is
Generating a plurality of subband signals for each channel by dividing the audio signal of each channel into a plurality of subbands;
Extracting a coherent component of the target channel in each of the plurality of subbands;
Extracting the coherent component of the target channel by adding coherent components in the plurality of subbands.
The audio signal processing apparatus according to claim 1 or 2.
　オーディオ信号処理装置が、複数のチャネルのオーディオ信号を受け付ける受付ステップと、
　前記オーディオ信号処理装置が、前記オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割ステップであって、前記分割処理が、
　　前記分割処理の対象となる一つの前記チャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルの前記オーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルの前記オーディオ信号との相関が最も高い推定信号を該対象チャネルの前記コヒーレント成分として抽出するステップと、
　　前記対象チャネルの前記オーディオ信号と該対象チャネルの前記コヒーレント成分との差分を該対象チャネルの前記フィールド成分として抽出するステップと
を含む、該分割ステップと、
　前記オーディオ信号処理装置が、前記分割ステップにおいて抽出された各チャネルの前記コヒーレント成分および前記フィールド成分を出力する出力ステップと
を含むオーディオ信号処理方法。 An accepting step in which the audio signal processing device accepts audio signals of a plurality of channels;
The audio signal processing apparatus is a division step for performing division processing for dividing each of the audio signals into a coherent component and a field component for each channel, and the division processing includes:
Correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signal of a channel other than the target channel when one channel to be divided is the target channel Extracting the highest estimated signal as the coherent component of the target channel;
Extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and
An audio signal processing method, wherein the audio signal processing device includes an output step of outputting the coherent component and the field component of each channel extracted in the dividing step.
　複数のチャネルのオーディオ信号を受け付ける受付ステップと、
　前記オーディオ信号をコヒーレント成分とフィールド成分とに分割する分割処理を各チャネルについて実行する分割ステップであって、前記分割処理が、
　　前記分割処理の対象となる一つの前記チャネルを対象チャネルとした場合に、該対象チャネル以外のチャネルの前記オーディオ信号を少なくとも用いて算出される推定信号のうち該対象チャネルの前記オーディオ信号との相関が最も高い推定信号を該対象チャネルの前記コヒーレント成分として抽出するステップと、
　　前記対象チャネルの前記オーディオ信号と該対象チャネルの前記コヒーレント成分との差分を該対象チャネルの前記フィールド成分として抽出するステップと
を含む、該分割ステップと、
　前記分割ステップにおいて抽出された各チャネルの前記コヒーレント成分および前記フィールド成分を出力する出力ステップと
をコンピュータに実行させるオーディオ信号処理プログラム。 A reception step for receiving audio signals of a plurality of channels;
A division step of performing a division process for dividing the audio signal into a coherent component and a field component for each channel, wherein the division process includes:
Correlation with the audio signal of the target channel among estimated signals calculated using at least the audio signal of a channel other than the target channel when one channel to be divided is the target channel Extracting the highest estimated signal as the coherent component of the target channel;
Extracting the difference between the audio signal of the target channel and the coherent component of the target channel as the field component of the target channel; and
An audio signal processing program causing a computer to execute the output step of outputting the coherent component and the field component of each channel extracted in the dividing step.