JP2012109643A - Sound reproduction system, sound reproduction device and sound reproduction method

Info

Publication number: JP2012109643A
Application number: JP2010254608A
Authority: JP
Inventors: Yusuke Ikeda (池田 雄介); Seigo Enomoto (榎本 成悟); Satoru Nakamura (中村 哲); Shiro Ise (伊勢 史郎)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2010-11-15
Filing date: 2010-11-15
Publication date: 2012-06-07
Anticipated expiration: 2030-11-15
Also published as: JP5697079B2

Abstract

PROBLEM TO BE SOLVED: To enable smooth conversation by letting a listener recognize, from the reproduced sound, the direction in which the person producing the sound is facing. SOLUTION: A voice reproduction system 10a (10b, 10c) includes a computer 18 (26, 34). When the computer 18 receives, from the other computers 26 and 34, voice data corresponding to another user's voice and angle data corresponding to the direction of that user's face relative to the position of the local user, it convolves the received voice data with the voice filter corresponding to the face direction (angle) indicated by the angle data. The convolved voice data is then output through a speaker array 20, which has a plurality of loudspeakers and is connected to the computer 18. The voice is thus reproduced according to the direction of the face of the user of the other computer 26 or 34.

Description

The present invention relates to a sound reproduction system, a sound reproduction device, and a sound reproduction method, and more particularly to, for example, a sound reproduction system, device, and method that use a microphone unit having a plurality of microphones and a speaker unit having a plurality of loudspeakers.

An example of a conventional sound reproduction system of this kind is disclosed in Non-Patent Document 1. The three-dimensional sound field communication system disclosed there uses a sound field control (Boundary Surface Control: BoSC) reproduction system, which reproduces through 62-channel loudspeakers the acoustic data recorded with a 70-channel microphone array, so that users at remote locations can converse while sharing an acoustic space. Specifically, 62-channel sound field data, recorded in advance and convolved with an inverse filter, is stored on a server. Two client machines (PCs) placed at different locations are connected to this server via a network such as the Internet and a LAN. A three-dimensional sound field reproduction system is connected to each client machine. The server transmits the reproduced sound field selected by the users to both sound field reproduction systems (speaker array systems) simultaneously. Voice data corresponding to the voice of the user of each sound field reproduction system is transmitted over the network to the other client machine. Each client machine convolves the voice data (1 ch) of the other user in real time and outputs it superimposed on the sound field data (62 ch). Users at different locations can therefore share the sound field data output from the server and converse with each other.

Non-Patent Document 1: Seigo Enomoto, "1. Numerical Analysis Technology and Visualization/Auralization: 1.7 Three-Dimensional Sound Field Communication System," Acoustic Technology (音響技術), No. 148, Dec. 2009, pp. 37-42

However, in the three-dimensional sound field communication system of Non-Patent Document 1, each client machine merely convolves the voice data (1 ch) of the other user with a voice filter prepared in advance and outputs it superimposed on the sound field data (62 ch). The listener therefore cannot tell from the reproduced voice which way the other user is facing while speaking. Consequently, if further client machines and sound field reproduction systems are connected to the background-art system so that three or more users converse, it is difficult to recognize who is speaking to whom, and the conversation cannot proceed smoothly.

The main object of the present invention is therefore to provide a novel sound reproduction system, sound reproduction device, and sound reproduction method.

Another object of the present invention is to provide a sound reproduction system, a sound reproduction device, and a sound reproduction method with which the direction in which the source of a sound is facing can be recognized from the reproduced sound.

To solve the above problems, the present invention adopts the following configurations. The reference numerals in parentheses and the supplementary explanations indicate correspondence with the embodiments described later in order to aid understanding of the invention, and do not limit the invention in any way.

A first invention is a sound reproduction system comprising a plurality of sound reproduction devices, each including at least a speaker array having a plurality of first loudspeakers, wherein each sound reproduction device comprises: filter storage means for storing voice filter data corresponding to voice filters provided for each angle; sound detection means for detecting sound data corresponding to a sound produced by a user; angle detection means for detecting angle data corresponding to the direction in which the user produced the sound, relative to the direction of another user; data transmission means for transmitting the sound data detected by the sound detection means and the angle data detected by the angle detection means to the other sound reproduction devices; first data receiving means for receiving sound data and angle data from the other sound reproduction devices; sound processing means for reading from the filter storage means the voice filter data corresponding to the angle indicated by the angle data received by the first data receiving means, and for convolving the sound data received by the first data receiving means with the voice filter corresponding to the read voice filter data; and sound output means for outputting the sound data convolved by the sound processing means to the speaker array.

In the first invention, the sound reproduction system (10) comprises a plurality of sound reproduction devices (18, 20, 26, 28, 34, 36), each including at least a speaker array (20, 28, 36) having a plurality of first loudspeakers (230). Each sound reproduction device comprises filter storage means, sound detection means, angle detection means, data transmission means, first data receiving means, sound processing means, and sound output means. The filter storage means stores voice filter data corresponding to the voice filters provided for each angle. The sound detection means detects sound data corresponding to a sound produced by the user, for example the user's voice or the sound of an instrument the user is playing. The angle detection means detects angle data corresponding to the direction in which the user produced the sound, relative to the direction of another user. The data transmission means transmits the sound data detected by the sound detection means and the angle data detected by the angle detection means to the other sound reproduction devices. The first data receiving means receives sound data and angle data from the other sound reproduction devices. The sound processing means reads from the filter storage means the voice filter data corresponding to the angle indicated by the received angle data, and convolves the sound data received by the first data receiving means with the voice filter corresponding to the read voice filter data. The sound output means outputs the convolved sound data to the speaker array.

According to the first invention, a voice filter is stored for each angle, and sound data from another sound reproduction device is convolved with the voice filter corresponding to the angle indicated by the angle data from that same device, so the speaker array can reproduce the sound as emitted in the direction indicated by that angle. The direction in which the source of the sound is facing can thus be known from the reproduced sound. A user of the speaker array can therefore recognize from the reproduced sound who is speaking to whom, for example, and can converse smoothly.
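The receive-side processing of the first invention can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `nearest_angle`, `render_voice`, and `filter_bank`, the nearest-angle lookup rule, and the dictionary layout (angle in degrees mapped to one FIR filter per loudspeaker) are all assumptions introduced here.

```python
import numpy as np

def nearest_angle(angle_deg, available):
    """Return the stored angle closest to angle_deg, measured on the circle."""
    diffs = [abs((a - angle_deg + 180) % 360 - 180) for a in available]
    return available[int(np.argmin(diffs))]

def render_voice(voice, angle_deg, filter_bank):
    """Convolve a mono voice block with the filter set for the received angle.

    filter_bank (hypothetical layout): dict mapping an angle in degrees to an
    array of shape (n_speakers, n_taps), one FIR filter per loudspeaker.
    Returns an array of shape (n_speakers, len(voice) + n_taps - 1).
    """
    key = nearest_angle(angle_deg, sorted(filter_bank))
    filters = filter_bank[key]
    return np.stack([np.convolve(voice, h) for h in filters])
```

With a bank stored at, say, 30-degree steps (the step size is not stated in this section), a received angle of 350 degrees would select the 0-degree filter set, and each row of the result would drive one loudspeaker of the array.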

A second invention is dependent on the first invention, wherein the voice filters are generated on the basis of impulse responses measured, at a certain location, by a microphone array having a plurality of microphones and placed in a predetermined orientation, while a second loudspeaker placed so as to face the microphone array emits a stimulus sound and is rotated by a predetermined angle at a time.

In the second invention, a microphone array having a plurality of microphones is placed in a predetermined orientation at a certain location, and a second loudspeaker is placed so as to face the microphone array. That is, the microphone array stands in for the listener and the second loudspeaker stands in for the talker. The second loudspeaker emits a stimulus sound while being rotated by a predetermined angle at a time, and the impulse responses are measured by the microphone array. Transfer characteristics are obtained from the impulse response measured at each microphone, and a voice filter is generated for each rotation angle of the second loudspeaker.

According to the second invention, the voice filters are generated from impulse responses measured in advance at a certain location using a loudspeaker and a microphone array, so users conversing through the sound reproduction devices can feel as if they were conversing at that location.
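As a rough illustration of the step from measured impulse responses to transfer characteristics in the second invention, the sketch below takes the FFT of each microphone's impulse response for each rotation angle. The function names, the array layout, and the use of a plain FFT are assumptions made here; how the voice filters are then derived from these transfer characteristics is not described in this section and is not shown.

```python
import numpy as np

def transfer_functions(impulse_responses, n_fft=None):
    """FFT each impulse response along the last axis.

    impulse_responses: array of shape (n_mics, n_taps) for one rotation angle.
    Returns complex spectra of shape (n_mics, n_fft // 2 + 1).
    """
    ir = np.asarray(impulse_responses, dtype=float)
    return np.fft.rfft(ir, n=n_fft, axis=-1)

def build_angle_table(measurements):
    """Map each measured rotation angle to its per-microphone spectra.

    measurements (hypothetical layout): dict angle_deg -> (n_mics, n_taps).
    """
    return {angle: transfer_functions(ir)
            for angle, ir in sorted(measurements.items())}
```

Keeping one table entry per rotation angle mirrors the per-angle voice filters that the filter storage means is said to hold.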

A third invention is dependent on the second invention, wherein the second loudspeaker is placed in a direction at a predetermined angle from the front direction of the microphone array, at a predetermined distance from it.

In the third invention, the second loudspeaker is placed in a direction at a predetermined angle from the front direction of the microphone array, at a predetermined distance from it. When, for example, three parties at remote locations converse using this sound reproduction device, the users are assumed, as a virtual positional relationship, to be located at the vertices of an equilateral triangle whose sides have a predetermined length. The second loudspeaker and the microphone array are therefore arranged so as to reproduce that positional relationship.

According to the third invention, the loudspeaker and the microphone array are arranged so as to reproduce the virtual positional relationship, so when voice filters generated from impulse responses measured in that arrangement are used, the users can feel as if they were conversing at that location in that positional relationship.
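The equilateral-triangle layout can be checked with a short calculation. The sketch below assumes, purely for illustration, that each user's "front direction" points at the centre of the triangle; the source only says "a predetermined angle", so this convention and the function names are assumptions. Under that assumption the other two users lie 30 degrees to either side of the front direction, at a distance equal to the side length.

```python
import math

def triangle_vertices(side):
    """Vertices of an equilateral triangle centred on the origin."""
    r = side / math.sqrt(3)  # circumradius of an equilateral triangle
    return [(r * math.cos(math.radians(a)), r * math.sin(math.radians(a)))
            for a in (90, 210, 330)]

def bearing_from_front(listener, talker, centroid=(0.0, 0.0)):
    """Signed angle (degrees) of the talker relative to the listener's front.

    Assumption: the listener's front direction points at the centroid.
    """
    front = math.atan2(centroid[1] - listener[1], centroid[0] - listener[0])
    to_talker = math.atan2(talker[1] - listener[1], talker[0] - listener[0])
    deg = math.degrees(to_talker - front)
    return (deg + 180) % 360 - 180  # wrap into [-180, 180)
```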

A fourth invention is dependent on any one of the first to third inventions, wherein the microphone array is placed in a certain sound field; the system further comprises a server that records the sound field data detected by the microphone array, applies convolution processing to the sound field data, and transmits it to each sound reproduction device; each sound reproduction device further comprises second data receiving means for receiving the sound field data transmitted from the server; and the sound output means superimposes the sound field data received by the second data receiving means on the sound data convolved by the sound processing means and outputs the result to the speaker array.

In the fourth invention, the microphone array is placed in a certain sound field. The sound reproduction system further comprises a server (12). The server records the sound field data detected by the microphone array, applies convolution processing to it, and transmits it to each sound reproduction device. Each sound reproduction device further comprises second data receiving means, which receives the sound field data transmitted from the server. The sound output means superimposes the sound field data received by the second data receiving means on the sound data convolved by the sound processing means and outputs the result to the speaker array. The sound field is thus reproduced together with the sounds from the other sound reproduction devices.
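The superimposition performed by the sound output means amounts to a channel-wise sum of two multichannel signals. A minimal sketch, assuming both signals are already aligned in time and sampled at the same rate (the function name and the zero-padding of the shorter signal are choices made here, not taken from the source):

```python
import numpy as np

def superimpose(sound_field, voice_channels):
    """Add convolved voice channels onto sound field channels.

    Both inputs have shape (n_speakers, n_samples); the shorter one is
    treated as zero-padded at the end.
    """
    if sound_field.shape[0] != voice_channels.shape[0]:
        raise ValueError("channel counts must match")
    n = max(sound_field.shape[1], voice_channels.shape[1])
    out = np.zeros((sound_field.shape[0], n))
    out[:, :sound_field.shape[1]] += sound_field
    out[:, :voice_channels.shape[1]] += voice_channels
    return out
```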

According to the fourth invention, users conversing through the sound reproduction devices can, for example, converse while sharing a sound field.

A fifth invention is dependent on the fourth invention, wherein the speaker array has a first predetermined number of first loudspeakers and the microphone array has a second predetermined number of microphones; the system further comprises speaker selection means for selecting a third predetermined number, smaller than the first predetermined number, of first loudspeakers with high linear independence, and microphone selection means for selecting a fourth predetermined number, smaller than the second predetermined number, of microphones with high linear independence; the server records the sound field data using the fourth predetermined number of microphones and applies convolution processing to it; and the sound output means outputs the sound field data received by the second data receiving means using the third predetermined number of first loudspeakers.

In the fifth invention, the speaker array has a first predetermined number of first loudspeakers, and the microphone array has a second predetermined number of microphones. The speaker selection means selects a third predetermined number, smaller than the first predetermined number, of first loudspeakers with high linear independence. Likewise, the microphone selection means selects a fourth predetermined number, smaller than the second predetermined number, of microphones with high linear independence. The server therefore records the sound field data using the fourth predetermined number of microphones and applies convolution processing to it, and the sound output means outputs the sound field data received by the second data receiving means using the third predetermined number of first loudspeakers.

According to the fifth invention, the number of loudspeakers and microphones in use is reduced, which lightens the convolution processing load and reduces the amount of data transmitted. The sound field can therefore be shared, and conversation held, in real time. Moreover, because loudspeakers and microphones with high linear independence are selected, the sense of presence is not impaired even though their number is reduced.
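One plausible reading of "selecting loudspeakers with high linear independence" is a greedy Gram-Schmidt style selection, which the figure list hints at (FIG. 5). The sketch below is an assumption-laden illustration rather than the patent's procedure: it treats each loudspeaker as a real-valued column of a hypothetical transfer matrix `G` (rows are microphones), starts from a given first loudspeaker, and repeatedly picks the column with the largest component orthogonal to those already chosen.

```python
import numpy as np

def select_independent_columns(G, k, first=0, eps=1e-12):
    """Greedily pick k columns of G that are as linearly independent as possible.

    G: real array of shape (n_mics, n_speakers); k: number of columns to keep;
    first: index of the column selected first.
    """
    residual = np.array(G, dtype=float)
    selected = []
    idx = first
    for _ in range(k):
        norm = np.linalg.norm(residual[:, idx])
        if norm < eps:
            raise ValueError("fewer than k linearly independent columns")
        selected.append(idx)
        q = residual[:, idx] / norm
        # Remove the component along q from every column (Gram-Schmidt step).
        residual = residual - np.outer(q, q @ residual)
        norms = np.linalg.norm(residual, axis=0)
        norms[selected] = -1.0  # never pick a column twice
        idx = int(np.argmax(norms))
    return selected
```

The same routine could be applied to the transposed matrix to choose microphones for a fixed loudspeaker subset. Measured transfer functions are usually complex-valued per frequency, in which case the projection would need the conjugate transpose.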

A sixth invention is a sound reproduction device comprising: a speaker array having a plurality of loudspeakers; filter storage means for storing voice filter data corresponding to voice filters provided for each angle; sound detection means for detecting sound data corresponding to a sound produced by a user; angle detection means for detecting angle data corresponding to the direction in which the user produced the sound, relative to the direction of another user; data transmission means for transmitting the sound data detected by the sound detection means and the angle data detected by the angle detection means to another sound reproduction device; data receiving means for receiving sound data and angle data from another sound reproduction device; sound processing means for reading from the filter storage means the voice filter data corresponding to the angle indicated by the angle data received by the data receiving means, and for convolving the sound data received by the data receiving means with the voice filter corresponding to the read voice filter data; and sound output means for outputting the sound data convolved by the sound processing means to the speaker array.

A seventh invention is a sound reproduction method for a sound reproduction system comprising a plurality of sound reproduction devices, each including a speaker array having a plurality of loudspeakers and filter storage means for storing voice filter data corresponding to voice filters provided for each angle, wherein each sound reproduction device: (a) detects sound data corresponding to a sound produced by a user; (b) detects angle data corresponding to the direction in which the user produced the sound, relative to the direction of another user; (c) transmits the sound data detected in step (a) and the angle data detected in step (b) to the other sound reproduction devices; (d) receives sound data and angle data from the other sound reproduction devices; (e) reads from the filter storage means the voice filter data corresponding to the angle indicated by the angle data received in step (d), and convolves the sound data received in step (d) with the voice filter corresponding to the read voice filter data; and (f) outputs the sound data convolved in step (e) to the speaker array.
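Steps (c) and (d) require the sound data and the angle data to travel together. The source does not specify a wire format, so the frame layout below (a little-endian float32 angle, a uint32 sample count, then float32 samples) and the names `pack_frame` and `unpack_frame` are purely hypothetical; the sketch only shows one way to keep each audio block paired with its angle.

```python
import struct
import numpy as np

# Hypothetical header: angle in degrees (float32) and sample count (uint32).
HEADER = struct.Struct("<fI")

def pack_frame(angle_deg, samples):
    """Serialize one audio block together with the talker's angle."""
    pcm = np.asarray(samples, dtype="<f4")
    return HEADER.pack(angle_deg, pcm.size) + pcm.tobytes()

def unpack_frame(payload):
    """Recover (angle_deg, samples) from a frame built by pack_frame."""
    angle_deg, n = HEADER.unpack_from(payload)
    pcm = np.frombuffer(payload, dtype="<f4", count=n, offset=HEADER.size)
    return angle_deg, pcm
```

On the receiving side, the unpacked angle would select the voice filter for step (e) and the unpacked samples would be the data convolved with it.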

In the sixth and seventh inventions as well, the direction in which the source of a sound is facing can be known from the reproduced sound.

According to the present invention, a voice filter corresponding to the angle of the sound source is used, so the direction in which the source of the sound is facing can be known from the reproduced sound. Therefore, when, for example, several people at different locations converse using sound reproduction devices, they can tell from the reproduced sound who is speaking to whom and can converse smoothly.

The above object, other objects, features, and advantages of the present invention will become more apparent from the following detailed description of the embodiments, given with reference to the drawings.

FIG. 1 is an illustrative view showing an example of the sound field sharing system of the present invention.
FIG. 2 is an illustrative view showing an example of the microphone array shown in FIG. 1.
FIG. 3 is an illustrative view showing an example of the speaker array system used in the sound field sharing system shown in FIG. 1.
FIG. 4 is an illustrative view for explaining the principle of sound field reproduction.
FIG. 5 is an illustrative view for explaining the Gram-Schmidt orthogonalization method.
FIG. 6 is a graph showing how the mean and minimum of the evaluation index change when 24 loudspeakers are selected for 62 microphones, for each choice of the loudspeaker selected first.
FIG. 7 is an illustrative view showing the positions of the 24 selected loudspeakers when loudspeaker No. 60 is selected first.
FIG. 8 is an illustrative view showing the positions of the 8 microphones selected for the 24 selected loudspeakers.
FIG. 9 is a schematic view, from directly above, of the speaker array system used in the sound field sharing system of FIG. 1 while in use.
FIG. 10 is an illustrative view showing the virtual positional relationship when three parties converse using the sound field sharing system shown in FIG. 1.
FIG. 11 is a view, from directly above, of the real environment in which the impulse responses of speech according to the direction of the talker's face were measured in the virtual positional relationship shown in FIG. 10.
FIG. 12 is a graph showing an impulse response detected by one microphone of the microphone array and the same impulse response attenuated with a Hanning window.
FIG. 13 shows an illustrative view of the positions and orientations of the talker and the listener, and an illustrative view representing them with a loudspeaker and a microphone array.
FIG. 14 is a graph showing the mean angle error in the subjects' subjective evaluations in experiments in the real environment and the reproduced environment.
FIG. 15 is a bar graph showing, for each angle the talker faces, the mean angle error in the subjects' subjective evaluations in experiments in the real environment and the reproduced environment.
FIG. 16 is a diagram showing the correlation, between the real environment and the reproduced environment, of the mean angle error in the subjects' subjective judgments.

Referring to FIG. 1, the sound field sharing system 10 of this embodiment also functions as a sound reproduction system and includes a server 12. The server 12 is a general-purpose server, to which a microphone array 14 is connected. The server 12 is also connected to a computer 18, a computer 26, and a computer 34 via a network 16 such as the Internet, a LAN, or both. The computers 18, 26, and 34 are general-purpose PCs or workstations. A speaker array system 20, a microphone 22, and a camera 24 are connected to the computer 18. A speaker array system 28, a microphone 30, and a camera 32 are connected to the computer 26. Likewise, a speaker array system 36, a microphone 38, and a camera 40 are connected to the computer 34.

The sound field sharing system 10 shown in FIG. 1 includes three BoSC reproduction systems 10a, 10b, and 10c. As enclosed by the dotted frame in FIG. 1, the BoSC reproduction system 10a consists of the server 12, the microphone array 14, the network 16, the computer 18, the speaker array system 20, the microphone 22, and the camera 24. As enclosed by the one-dot chain frame in FIG. 1, the BoSC reproduction system 10b consists of the server 12, the microphone array 14, the network 16, the computer 26, the speaker array system 28, the microphone 30, and the camera 32. As enclosed by the two-dot chain frame in FIG. 1, the BoSC reproduction system 10c consists of the server 12, the microphone array 14, the network 16, the computer 34, the speaker array system 36, the microphone 38, and the camera 40.

Note that each pair, namely the computer 18 and the speaker array 20, the computer 26 and the speaker array 28, and the computer 34 and the speaker array 36, functions as a sound reproduction device for reproducing the sound field data detected by the microphone array 14, the voice data from the other BoSC systems 10a, 10b, and 10c, or both.

As shown in FIG. 2, the microphone array 14 includes a nearly spherical frame 14a and a stand 14b that supports it. The frame 14a is based on the structure of C80 fullerene and has 70 vertices, the 10 vertices at the bottom having been cut away. Although not illustrated, one omnidirectional microphone is attached at each of the 70 vertices on the surface (outer face) of the frame 14a. For example, the DPA 4060-BM can be used as the microphone. The stand 14b consists of a support shaft 140 and a tripod 142; the support shaft 140 passes through the cut-away bottom of the frame 14a and supports the top of the frame from the inside.

The rear side of the frame 14a is visible from the front except where it overlaps the front side, but for clarity the portions on the rear side are drawn with dotted lines in FIG. 2.

As shown in FIG. 3, each of the speaker array systems 20, 28, and 36 includes an elliptical dome 220 and four columns 222 that support it. The elliptical dome 220 is made up of four layers of mounts 220a, 220b, 220c, and 220d, made of wood for example. FIG. 3 shows the inside of the dome 220 as seen obliquely from below, and only part of the mount 220d and the columns 222 is shown. Although not illustrated, the dome 220 and the columns 222 are hollow, and the mounts (220a to 220d) themselves serve as sealed enclosures.

Seventy loudspeakers 230 are installed in each of the speaker array systems 20, 28, and 36. Specifically, 6 full-range units (Fostex FE83E), that is, loudspeakers 230, are installed on the frame 220a, 16 loudspeakers 230 on the frame 220b, 24 loudspeakers 230 on the frame 220c, and 16 loudspeakers 230 on the frame 220d. In addition, to supplement the low-frequency range, two subwoofer units (Fostex FW108N), which are also loudspeakers 230, are installed in each of the four column portions 222.
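As a quick check of the counts given above, the per-frame and per-column figures can be tallied as follows. This is only an illustrative sketch; the variable names are not from the specification.

```python
# Loudspeaker counts per frame and per column, as stated in the description.
# All identifiers here are illustrative names, not part of the specification.
FULL_RANGE_PER_FRAME = {"220a": 6, "220b": 16, "220c": 24, "220d": 16}
COLUMN_PORTIONS = 4
SUBWOOFERS_PER_COLUMN = 2

full_range_total = sum(FULL_RANGE_PER_FRAME.values())
subwoofer_total = COLUMN_PORTIONS * SUBWOOFERS_PER_COLUMN
total_loudspeakers = full_range_total + subwoofer_total

# Frame carrying the most loudspeakers (the ear height is aligned to it).
densest_frame = max(FULL_RANGE_PER_FRAME, key=FULL_RANGE_PER_FRAME.get)

print(full_range_total, subwoofer_total, total_loudspeakers, densest_frame)
```

The 62 full-range units correspond to the 62 ch figure used for the speaker array systems elsewhere in the description, with the 8 subwoofer units making up the total of 70.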

Each of the speaker array systems 20, 28, and 36 is installed in a sound field reproduction room (not shown). The sound field reproduction room is a soundproof room of 1.5 jo (a floor area of 1.5 tatami mats); a YAMAHA Woody Box (sound insulation performance Dr-30) is used. A chair with a lift (not shown) is also provided in the room so that, inside the dome portion 220 of the speaker array system 20, 28, or 36, the position (height) of the ears of the seated user can be set to the height of the frame 220c, which carries the largest number of loudspeakers 230.

The microphone array 14 and the sound field reproduction room (sound field reproduction system), which includes the computers (18, 26, 34) and the speaker array systems (20, 28, 36), are disclosed in Seigo Enomoto, "1. Numerical analysis technology and visualization/auralization, 1.7 Three-dimensional sound field communication system", 音響技術 (Acoustic Technology) No. 148, Dec. 2009, pp. 37-42, so further detailed description is omitted.

For example, in the sound field sharing system 10 shown in FIG. 1, the microphone array 14 is placed in a sound field such as an orchestra concert hall. The server 12 converts the sound field signals input from the microphone array 14 via an amplifier (not shown) into digital sound field data and convolves the sound field data with the inverse system. The server 12 then transmits the convolved sound field data to the computers 18, 26, and 34 via the network 16.

The computers 18, 26, and 34 each convert the sound field data from the server 12 into analog sound field signals and output them to the speaker array systems 20, 28, and 36, respectively. The sound field described above is thus reproduced in the speaker array systems 20, 28, and 36. Consequently, each user (not shown) of the speaker array systems 20, 28, and 36 can, even from a remote location, enjoy through the speaker array system, for example, a live orchestra recorded at a concert hall.

Each user can also input voice through the microphones 22, 30, and 38. The voice signal detected by the microphone 22 is converted into digital voice data by the computer 18 and transmitted to the computers 26 and 34 via the network 16. The computer 26 convolves the received voice data with a voice filter, converts the result into an analog voice signal, and outputs it to the speaker array system 28. Similarly, the computer 34 convolves the received voice data with a voice filter, converts the result into an analog voice signal, and outputs it to the speaker array system 36. More precisely, the computers 26 and 34 each superimpose the sound field data and the voice data and convert the superimposed data (hereinafter "sound data") into an analog signal (hereinafter "sound signal"). The same applies hereinafter. Thus, the sound field is reproduced together with the voices of the other users.
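The convolve-then-superimpose step performed by a receiving computer can be sketched as below. This is a minimal illustration under stated assumptions: the voice filter is taken to be one FIR tap list per loudspeaker channel, the convolution is written in direct form for clarity (a real-time system would likely use a block FFT method), and the function names `convolve` and `render_channels` are hypothetical.

```python
from typing import List, Sequence


def convolve(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Direct-form FIR convolution; output length is len(signal) + len(taps) - 1."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(taps):
            out[n + k] += x * h
    return out


def render_channels(
    voice: Sequence[float],
    voice_filters: Sequence[Sequence[float]],
    sound_field: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Filter the 1-ch voice into one signal per loudspeaker, then add the
    corresponding sound field channel (the superimposed "sound data")."""
    mixed = []
    for taps, field in zip(voice_filters, sound_field):
        voiced = convolve(voice, taps)
        channel = [0.0] * max(len(voiced), len(field))
        for n, sample in enumerate(voiced):
            channel[n] += sample
        for n, sample in enumerate(field):
            channel[n] += sample
        mixed.append(channel)
    return mixed


# Two-channel toy example with made-up sample values.
sound_data = render_channels(
    voice=[1.0, 2.0],
    voice_filters=[[1.0], [0.5, 0.5]],
    sound_field=[[0.25, 0.25], [0.0, 0.0, 0.0]],
)
print(sound_data)
```

The resulting per-channel lists correspond to the sound data that is converted to analog sound signals and sent to the speaker array system.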

Likewise, the voice signal detected by the microphone 30 is converted into digital voice data by the computer 26 and transmitted to the computers 18 and 34 via the network 16. The computer 18 convolves the received voice data with a voice filter, converts the result into an analog voice signal, and outputs it to the speaker array system 20. Similarly, the computer 34 convolves the received voice data with a voice filter, converts the result into an analog voice signal, and outputs it to the speaker array system 36. In other words, the computers 18 and 34 each convert sound data, in which the sound field data and the voice data are superimposed, into a sound signal.

Further, the voice signal detected by the microphone 38 is converted into digital voice data by the computer 34 and transmitted to the computers 18 and 26 via the network 16. As described above, the computers 18 and 26 each convolve the received voice data with a voice filter, convert the result into an analog voice signal, and output it to the speaker array systems 20 and 28, respectively.

Therefore, the user of the speaker array system 20, the user of the speaker array system 28, and the user of the speaker array system 36 can share a sound field and hold a three-way conversation.

Although a detailed description is omitted, headset microphones, for example, can be used as the microphones 22, 30, and 38.

Also, although a detailed description is omitted, each of the computers 18, 26, and 34 convolves the voice data from each of the other computers 18, 26, and 34 with a separate voice filter. For example, each of the computers 18, 26, and 34 can identify the other computers 18, 26, and 34 by the communication port or IP address in use.

Here, the principle of BoSC and a sound field reproduction system using BoSC are briefly described. In boundary surface control, based on the Kirchhoff-Helmholtz integral equation (KHIE), the sound field in the region V of the original sound field shown on the left side of FIG. 4 is reproduced in the region V' of the reproduced sound field shown on the right side of FIG. 4. It is assumed that the recording points r on the boundary S enclosing the region V and the control points r' on the boundary S' enclosing the region V' have the same relative positions; that is, Equation 1 is assumed to hold. The points s and s' are arbitrary points inside the respective regions.

[Equation 1]
|r − s| = |r' − s'|, s ∈ V, s' ∈ V'
In this case, by the KHIE, the sound pressures p(s) and p(s') in the regions, which contain no sound source, are given by Equations 2 and 3, respectively.

Here, ω is the angular frequency, ρ_0 is the density of the medium, p(r) and v_n(r) are the sound pressure and the particle velocity in the direction of the normal n at the point r on the boundary, and G(r|s) is the free-space Green's function.

From Equation 1, the relationship shown in Equation 4 holds, and Equation 5 then follows from Equation 4.

Equation 5 shows that if the secondary sources output signals such that the sound pressure and particle velocity on the boundary surface S recorded in the original sound field are matched in the reproduced sound field, the sound field in the region V is reproduced in the region V'.
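Equations 2 to 5 are not reproduced in this text. For orientation only, one standard way of writing the relations described above is sketched here. It assumes the time convention e^{jωt}, an outward normal n, and the free-space Green's function G(r|s) = e^{-jk|r-s|}/(4π|r-s|); the forms in the original drawings may differ in sign convention or notation.

```latex
% Assumed standard forms; not copied from the original equation images.
% KHIE for the source-free regions V and V' (cf. Equations 2 and 3):
p(\mathbf{s}) = \oint_{S} \left[ -j\omega\rho_0\, G(\mathbf{r}|\mathbf{s})\, v_n(\mathbf{r})
               - p(\mathbf{r})\, \frac{\partial G(\mathbf{r}|\mathbf{s})}{\partial n} \right] dS,
\qquad
p(\mathbf{s}') = \oint_{S'} \left[ -j\omega\rho_0\, G(\mathbf{r}'|\mathbf{s}')\, v_{n'}(\mathbf{r}')
               - p(\mathbf{r}')\, \frac{\partial G(\mathbf{r}'|\mathbf{s}')}{\partial n'} \right] dS'

% Consequence of |r - s| = |r' - s'| (cf. Equation 4):
G(\mathbf{r}|\mathbf{s}) = G(\mathbf{r}'|\mathbf{s}'), \qquad
\frac{\partial G(\mathbf{r}|\mathbf{s})}{\partial n} = \frac{\partial G(\mathbf{r}'|\mathbf{s}')}{\partial n'}

% Reproduction condition (cf. Equation 5):
\forall\, \mathbf{r} \in S,\ \mathbf{r}' \in S': \quad
p(\mathbf{r}) = p(\mathbf{r}'),\ v_n(\mathbf{r}) = v_{n'}(\mathbf{r}')
\ \Longrightarrow\
\forall\, \mathbf{s} \in V,\ \mathbf{s}' \in V': \quad p(\mathbf{s}) = p(\mathbf{s}')
```

Because the Green's function depends only on the distance between the boundary point and the interior point, equal boundary geometry makes the two integrands identical once the boundary pressure and particle velocity match.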

The outputs of the secondary sources are determined by convolving the signals observed at the recording points with an inverse filter that cancels the transfer characteristics from all secondary sources to all control points. Therefore, to realize a BoSC sound field reproduction system such as that shown in FIG. 4, it is important to design a stable and robust inverse filter (pinv(H)).

The inverse filter design method is disclosed in detail in the literature (S. Enomoto et al., "Three-dimensional sound field reproduction and recording systems based on boundary surface control principle", Proc. of 14th ICAD, Presentation o 16, 2008 Jun.), so it is described only briefly here.

A method for designing, in the frequency domain, a multichannel, multipoint-control inverse system (hereinafter simply "inverse system") with M secondary sources and N control points, as shown in FIG. 4, is briefly described. Here, "inverse system" is a collective term for the group of M × N inverse filters.

Let H_ji(ω) be the transfer function from secondary source i to control point j, X_j(ω) the input signal, and P_j(ω) the observed signal; these are related by Equation 6. Here, i is the secondary source number (1, 2, ..., M), j is the control point number (1, 2, ..., N), and W(ω) is the inverse system.

In this case, Equation 7 must be satisfied in order to obtain P(ω) = X(ω), where + denotes the pseudo-inverse. [W(ω)] is thereby defined as the inverse system of [H(ω)].

[Equation 7]
[W(ω)] = [H(ω)]⁺
It is well known that regularization is a reasonable method for solving inverse problems, and it has already been applied to sound reproduction systems (e.g., TOKUNO et al., "Inverse Filter of Sound Reproduction Systems Using Regularization", IEICE TRANS. FUNDAMENTALS, Vol. E80-A, No. 5, May 1997). Using regularization, the computed inverse matrix [W^(ω)] for rank([H(ω)]) = N is given by Equation 8 (for notational convenience "^" is written beside W here, but in Equation 8 it is written above W; the same applies hereinafter). In Equation 8, # denotes the conjugate transpose, −1 denotes the matrix inverse, β(ω) is the regularization parameter, and I_M is the M × M identity matrix. The same applies hereinafter.
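The regularized inverse at a single frequency can be sketched as follows. Equation 8 itself is not reproduced in this text, so the form used here, W = (H^# H + β I_M)^(-1) H^#, is an assumption chosen to match the symbols defined above (conjugate transpose, inverse, β, and an M × M identity matrix for an N × M matrix H). The helper names are hypothetical, and plain Python lists stand in for a linear algebra library.

```python
from typing import List

Matrix = List[List[complex]]


def conj_transpose(a: Matrix) -> Matrix:
    """Conjugate transpose (the '#' operation in the description)."""
    return [[a[r][c].conjugate() for r in range(len(a))] for c in range(len(a[0]))]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[complex(v) for v in a[i]] + [complex(v) for v in b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def regularized_inverse(h: Matrix, beta: float) -> Matrix:
    """Assumed form of Equation 8: W = (H^# H + beta * I_M)^-1 H^#  (M x N)."""
    h_sharp = conj_transpose(h)      # M x N
    gram = matmul(h_sharp, h)        # M x M
    for i in range(len(gram)):
        gram[i][i] += beta           # add beta * I_M
    return solve(gram, h_sharp)


# Toy 2 x 2 transfer matrix with made-up values.
H = [[1, 1j], [0, 2]]
print(matmul(H, regularized_inverse(H, 0.0)))   # close to the identity
print(regularized_inverse(H, 0.1))              # damped inverse
```

With β = 0 and a non-singular square H this reduces to the ordinary inverse; a positive β trades exactness for smaller filter gains, which is the stabilizing effect attributed to regularization.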

On the other hand, the inverse matrix [H(ω)]⁺ for rank([H(ω)]) = M, which appears on the right-hand side of Equation 7, is derived as Equation 9.

Equations 8 and 9 are interpreted as the least-squares solution and the minimum-norm solution (minimum-norm generalized inverse), respectively. When rank([H(ω)]) = N = M and [H(ω)] is not a singular (non-regular) matrix, [W(ω)] is given by [H(ω)]⁻¹. The time-domain inverse filter coefficients are obtained from the inverse discrete Fourier transform of [W^(ω)].

In a BoSC reproduction system, the arrangement of the loudspeakers 230 of the speaker array systems (20, 28, 36) and the arrangement of the microphones of the microphone array 14 affect the spatial sampling.

In Equations 8 and 9, selecting an appropriate regularization parameter β(ω) mitigates (removes) the instability of the inverse system. In this embodiment, the regularization parameter β(ω) is defined for each octave band. The inverse filters were calculated from impulse responses measured in advance, in a soundproof room, between each loudspeaker 230 and each microphone of the microphone array 14. Because measured impulse responses were used, the filters did not track variations caused by changes in the environment. In a real, varying environment, however, an adaptive MIMO (Multiple-Input Multiple-Output) inverse filter can be applied to the BoSC reproduction system.

If the microphone array 14 and the speaker array systems 20, 28, and 36 shown in FIGS. 1 to 3 are used as they are, the processing load on the server 12 is considerable. Specifically, since the microphone array 14 has 70 ch and the speaker array systems 20, 28, and 36 have 62 ch, the server 12 must convolve the sound field signal (sound field data) of each microphone of the microphone array 14 with the inverse system 62 × 70 times, and each of these convolutions must be carried out over the number of taps of the inverse system (in this embodiment, 2048 points × 2 taps = 4096).
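The scale of this load can be made concrete with a little arithmetic. The sketch below only multiplies out the figures quoted above; treating each tap as one multiply-accumulate of a direct time-domain convolution is an assumption made for illustration, not a statement about how the server 12 implements the convolution.

```python
# Figures quoted in the description.
MIC_CHANNELS = 70          # microphone array 14
SPEAKER_CHANNELS = 62      # speaker array system channels
TAPS = 2048 * 2            # inverse system length: 4096

# One inverse filter per (loudspeaker, microphone) pair.
filter_count = SPEAKER_CHANNELS * MIC_CHANNELS

# Assumption: direct time-domain convolution, one multiply-accumulate per tap,
# counted per output sample period across all filters.
macs_per_sample = filter_count * TAPS

print(filter_count)      # 4340
print(macs_per_sample)   # 17776640
```

Under that assumption every output sample period costs roughly 17.8 million multiply-accumulates before the sampling rate is even taken into account.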

In addition, because the amount of sound field data to be transmitted is enormous, each client (computer 18, 26, 34) requires a bandwidth of about 45 Mbps.
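The order of magnitude of this figure can be checked with a back-of-the-envelope calculation. The sampling rate and sample width below are assumptions introduced purely for illustration (the description does not state them here), as is the premise of uncompressed PCM for 62 channels.

```python
CHANNELS = 62              # sound field data channels sent to each client
SAMPLE_RATE_HZ = 48_000    # ASSUMPTION: not stated in the description
BITS_PER_SAMPLE = 16       # ASSUMPTION: not stated in the description


def stream_bitrate_bps(channels: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Raw bit rate of an uncompressed multichannel PCM stream."""
    return channels * sample_rate_hz * bits_per_sample


bitrate_bps = stream_bitrate_bps(CHANNELS, SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
bitrate_mbps = bitrate_bps / 1e6

print(bitrate_bps)    # 47616000
print(bitrate_mbps)   # 47.616
```

Under these assumed parameters the raw rate lands in the same range as the roughly 45 Mbps quoted above; different parameters would of course give a different figure.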

Furthermore, when the computers 18, 26, and 34 convolve the voice data corresponding to a user's voice with the voice filter, the processing load also becomes relatively large if all 70 ch are used.

It is therefore difficult to transmit the sound field data from the server 12 to the computers 18, 26, and 34 in real time, and naturally it is also difficult for the users of the speaker array systems 20, 28, and 36 to enjoy an orchestra or the like in real time. In other words, the sound field cannot be shared in real time, nor can the users converse in real time.

To avoid this, the processing load of the convolution and the amount of data to be transmitted could be reduced, for example, by reducing the number of microphones used in the microphone array 14 and the number of loudspeakers 230 used in the speaker array systems 20, 28, and 36. However, simply reducing the number of microphones and loudspeakers 230 is not enough; the realism of the reproduced sound field must not be impaired.

In this embodiment, therefore, the number of microphones and loudspeakers 230 used is reduced without impairing the sense of presence.

In this embodiment, Gram-Schmidt orthogonalization is first used to extract (select) the loudspeakers 230 to be used in the speaker array system 20 for the case where the 70 ch microphone array 14 is used. Then, for the case where the selected loudspeakers 230 are used, Gram-Schmidt orthogonalization is used to extract (select) the microphones to be used in the microphone array 14.

Although a detailed description is omitted, the extraction (selection) of the loudspeakers 230 and microphones to be used can be performed on the server 12, on the computers 18, 26, and 34, or on another computer (not shown).

Here, the basic algorithm for selecting loudspeakers 230 by Gram-Schmidt orthogonalization is described for a single frequency. If the linear independence among the N-dimensional column vectors contained in an N × M matrix is low, the matrix is said to be ill-conditioned. Degraded linear independence in [H(ω)] causes instability of the BoSC reproduction systems 10a, 10b, and 10c. Here, [H(ω)] of Equation 6 can be written as in Equation 10.

[Equation 10]
P(ω) = [H(ω)]Y(ω)
= {h_1(ω), …, h_M(ω)}Y(ω)
Here, Y(ω) = [W(ω)]X(ω), and h_i(ω) is an N-dimensional column vector contained in [H(ω)]. The column vector h_i(ω) holds the transfer functions, at frequency ω, between one loudspeaker 230 and each microphone of the microphone array 14. Selecting loudspeakers 230 by Gram-Schmidt orthogonalization therefore means selecting from [H(ω)] a set of column vectors h_i(ω) with high linear independence. The Gram-Schmidt orthogonalization algorithm is briefly described below.

At the n-th step of selecting loudspeakers 230, n−1 loudspeakers 230 have already been selected. The set of column vectors contained in [H] is denoted τ = {h_1, …, h_M}. S_{n−1} denotes the subset of vectors selected up to step n−1, and τ_{n−1} denotes the subset of vectors not yet used up to step n−1. v_{n−1} = {v_1, …, v_{n−1}} denotes an orthonormal basis of the plane spanned by the subset S_{n−1}.

For example, in the first step, one of all the loudspeakers 230 is chosen as the reference loudspeaker 230, and all loudspeakers 230 other than the reference loudspeaker 230 become the loudspeakers 230 to be evaluated (evaluation target loudspeakers 230). As described later, Gram-Schmidt orthogonalization is used to select one evaluation target loudspeaker 230 from among them in relation to the reference loudspeaker 230. In the next step, Gram-Schmidt orthogonalization is again used to select one of the remaining evaluation target loudspeakers 230, this time in relation to both the reference loudspeaker 230 chosen first and the evaluation target loudspeaker 230 selected in the preceding step. In this step, the loudspeaker 230 selected in the preceding step can thus also be regarded as a reference loudspeaker 230. This procedure is repeated.

The eight loudspeakers 230 that supplement the low-frequency range, however, are excluded from both the reference loudspeakers 230 and the evaluation target loudspeakers 230.

FIG. 5 shows an example of the plane spanned by the subset S_{n−1}. In the n-th step, h_n^ is selected so that its component perpendicular to the plane spanned by the subset S_{n−1} is maximized (as Equation 11 shows, "^" is actually written above h; the same applies hereinafter). The perpendicular component r_i of an arbitrary vector h_i contained in the subset τ_{n−1} is given by Equation 11.

[Equation 11]
r_i = h_i − p
Here, p is the projection of h_i onto the plane spanned by the subset S_{n−1}. The n-th loudspeaker 230 is determined so that the norm of the perpendicular component r_i is maximized, as expressed, for example, by Equation 12.

Here, J(h_i), the value of the evaluation index, is defined by Equation 13.

[Equation 13]
J(h_i) = ||r_i||
When the perpendicular component of h_n^ is denoted r_n^ ("^" is actually written above r; the same applies hereinafter), the n-th orthonormal vector v_n is determined according to Equation 14.

The value of the evaluation index maximized in the n-th step, J_n^ ("^" is actually written above J; the same applies hereinafter), is given by Equation 15.
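The single-frequency selection procedure described above can be sketched as follows. This is an illustrative reading, not the patented implementation: Equations 12, 14, and 15 are not reproduced in this text, so the sketch assumes the usual Gram-Schmidt forms (pick the column with the largest ||r_i||, then set v_n = r_n / ||r_n||), and it stops when the best remaining score falls below a positive threshold `j_thr`. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[complex]


def inner(a: Vector, b: Vector) -> complex:
    """Inner product a^# b (conjugate taken on the first argument)."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(sum(abs(x) ** 2 for x in a))


def perpendicular(h: Vector, basis: List[List[complex]]) -> List[complex]:
    """r = h - p, where p is the projection of h onto span(basis).
    `basis` must be orthonormal."""
    r = [complex(x) for x in h]
    for v in basis:
        c = inner(v, r)
        r = [x - c * y for x, y in zip(r, v)]
    return r


def select_columns(
    columns: List[Vector], first: int, j_thr: float
) -> Tuple[List[int], List[float]]:
    """Greedy selection starting from column `first` (the reference loudspeaker).
    Each step adds the column whose perpendicular component has the largest
    norm, and stops once that norm drops below j_thr (assumed > 0)."""
    first_norm = norm(columns[first])
    selected, scores = [first], [first_norm]
    basis = [[x / first_norm for x in columns[first]]]
    remaining = [i for i in range(len(columns)) if i != first]
    while remaining:
        residuals = {i: perpendicular(columns[i], basis) for i in remaining}
        best = max(remaining, key=lambda i: norm(residuals[i]))
        j_n = norm(residuals[best])
        if j_n < j_thr:
            break
        basis.append([x / j_n for x in residuals[best]])  # next orthonormal vector
        selected.append(best)
        scores.append(j_n)
        remaining.remove(best)
    return selected, scores


# Toy transfer vectors (made-up values): column 1 is nearly parallel to column 0.
cols = [[1, 0, 0], [1, 0.1, 0], [0, 1, 0], [0, 0, 2]]
print(select_columns(cols, first=0, j_thr=0.5))
```

In the toy example the nearly redundant column is never selected, which is the behavior the description relies on: loudspeakers whose transfer vectors add little linear independence are the ones dropped.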

The processing according to Equations 11 to 15 is repeated until the evaluation index value J_n^ becomes smaller than a preset threshold J_thr^. For a frequency band [ω_l, ω_h], however, two evaluation index values are obtained according to Equation 16.

Here, h_i¯ = {h_i(ω_l), …, h_i(ω_h)} (as Equation 16 shows, "¯" is actually written above h), K is the number of discrete frequencies ω_k, and a_k is an arbitrary weighting coefficient for the discrete frequency ω_k. The perpendicular component r_i(ω_k) and the orthonormal vector v_i(ω_k) are obtained separately for each discrete frequency, in the same way as for a single frequency. In the optimization, the evaluation index value J_avg is maximized, while the evaluation index value J_min is used to decide when the optimization ends. That is, selection of loudspeakers 230 ends when J_min^ < J_thr^.
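How the two band-wide values might be computed from per-frequency results is sketched below. Equation 16 is not reproduced in this text, so both definitions here are assumptions suggested only by the names and the symbols K and a_k: J_avg is taken as the weighted mean of ||r_i(ω_k)|| over the K discrete frequencies, and J_min as the smallest ||r_i(ω_k)|| in the band. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def band_indices(
    residual_norms: Sequence[float], weights: Sequence[float]
) -> Tuple[float, float]:
    """residual_norms[k] is ||r_i(w_k)|| for discrete frequency w_k and
    weights[k] is a_k.  Returns (J_avg, J_min) under the ASSUMED definitions:
    J_avg = (1/K) * sum_k a_k * ||r_i(w_k)||,  J_min = min_k ||r_i(w_k)||."""
    k_count = len(residual_norms)
    j_avg = sum(a * r for a, r in zip(weights, residual_norms)) / k_count
    j_min = min(residual_norms)
    return j_avg, j_min


# Made-up per-frequency norms for one candidate loudspeaker, uniform weights.
j_avg, j_min = band_indices([1.0, 0.5, 2.0], [1.0, 1.0, 1.0])
print(j_avg, j_min)
```

Separating the two values reflects the roles described above: the average drives the choice of the next loudspeaker, while the minimum flags any single frequency at which linear independence has collapsed.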

The optimization is disclosed in the literature (Asano, Suzuki, and Swanson, "Optimization of control source configuration in active control systems using Gram-Schmidt orthogonalization", Speech and Audio Processing, IEEE Transactions on, Mar. 1999).

In that literature, selection of loudspeakers 230 continues while the evaluation index value is at or above the threshold (J_min^ ≥ J_thr^). However, no method for determining an appropriate threshold has been established. In this embodiment, therefore, the maximum number of loudspeakers 230 of the speaker array systems (20, 28, 36) and the maximum number of microphones of the microphone array 14 with which the sound field sharing system 10 can share a sound field in real time were examined. Gram-Schmidt orthogonalization was then used to determine the numbers (placement positions) of the loudspeakers 230 up to that maximum.

As described above, in Gram-Schmidt orthogonalization each speaker position is determined on the basis of the speaker positions selected before it, so the selection result is strongly influenced by the speaker position selected first.

For example, reducing the number of loudspeakers 230 used to about one half (32), about one third (24), or about one quarter (16) was examined. FIG. 6 shows how the evaluation index values J_avg and J_min vary when 24 loudspeakers 230 are selected (that is, when 24 selection steps are executed). In FIG. 6, the horizontal axis shows the speaker position (see FIG. 10) of the loudspeaker 230 selected first (the reference loudspeaker 230), and the vertical axis shows the evaluation value (dB). The two solid lines show the evaluation index value J_avg and the change in the evaluation index value J_min, respectively.

Although a detailed description is omitted, the reference loudspeaker 230 selected first is, for example, varied in turn starting from number "1" (see FIG. 7) (2, 3, ..., 62); for each case a set of 24 speaker positions (loudspeaker 230 numbers) is selected, and the evaluation index values J_avg and J_min are calculated for each set. The sets of 24 selected speaker positions and the values J_avg and J_min calculated for each set are stored in the memory (not shown; a hard disk or RAM) of the computer mentioned above. Then, as described later, the one set whose evaluation index values J_avg and J_min satisfy a predetermined condition is chosen from among these sets, and the sound field is reproduced using the chosen set of 24 loudspeakers 230.

The free-space Green's function was used to obtain the transfer functions between each loudspeaker 230 of the speaker array systems (20, 28, 36) and the microphones of the microphone array 14. The upper-limit frequency for the stimuli described later was not restricted here. The configuration of the loudspeakers 230, however, was determined over the range from 20 Hz to 1 kHz at frequencies spaced 20 Hz apart. Although not shown, when the upper-limit frequency was not restricted, many of the loudspeakers 230 selected were those on the upper layers (frames 220a and 220b). In a three-dimensional sound reproduction system it is difficult to synthesize a wavefront arriving from a direction in which there are no loudspeakers 230 at all. The loudspeakers 230 should therefore be placed in every possible direction around the microphone array 14.

As described above, FIG. 6 is a line graph of the evaluation index values J_avg and J_min obtained when 24 selection steps were executed for the loudspeakers 230. As FIG. 6 shows, J_avg and J_min are largest when the loudspeaker 230 at speaker position "60" (see FIG. 7) is selected first and 24 loudspeakers 230 are selected in total.

In this embodiment, one set of 24 loudspeakers 230 whose evaluation index values J_avg and J_min satisfy a predetermined condition is selected from a plurality of sets (62 sets in this embodiment). Specifically, the set with the largest J_avg is selected. However, if J_min for that set is extremely low, a frequency with low linear independence exists, so the set is not an appropriate choice even though its J_avg is the largest; the sound field is unlikely to be reproduced correctly. In that case, the set with the next largest J_avg is selected. If J_min for that set is also extremely low, the set with the next largest J_avg after that is selected, and so on. Whether J_min is extremely low is determined by the computer using, for example, a preset threshold. This threshold is set by the developer or user of the sound field sharing system 10. Although not illustrated, J_avg and J_min gradually decrease as the number of selected loudspeakers 230 increases, so the threshold must also be varied according to the number of loudspeakers 230 to be selected.
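The selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: the function name `select_speaker_set`, the tuple layout, the threshold value, and the sample numbers are all hypothetical.

```python
def select_speaker_set(candidates, j_min_threshold):
    """Return the candidate with the largest J_avg whose J_min meets the threshold.

    candidates: list of (first_speaker_position, j_avg, j_min) tuples,
                one per candidate set (hypothetical layout).
    """
    # Walk the candidates from the largest J_avg downward and stop at the
    # first one whose J_min is not "extremely low".
    for cand in sorted(candidates, key=lambda c: c[1], reverse=True):
        if cand[2] >= j_min_threshold:
            return cand
    return None  # no candidate satisfies the threshold


# Made-up values for illustration only.
candidates = [(12, 0.80, 0.05), (60, 0.78, 0.40), (33, 0.70, 0.45)]
best = select_speaker_set(candidates, j_min_threshold=0.2)
```

In this toy data the set with the largest J_avg is rejected because its J_min falls below the threshold, so the set with the next largest J_avg is returned.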

Preliminary tests showed that, because of the performance of the server 12 and the computers 18, 26, 34 and the limits on communication speed including the network 16, the number (M) of loudspeakers 230 in the speaker array systems (20, 28, 36) and the number (N) of microphones in the microphone array 14 should be chosen so that the number of elements in [W(ω)] stays within M × N = 192. Since the number (M) of loudspeakers 230 was set to 24 as described above, the number (N) of selected microphones is at most 8.
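The arithmetic behind this limit is short. With the element budget of 192 and M = 24 taken from the text:

```latex
\[
M \times N \le 192, \qquad M = 24
\quad\Longrightarrow\quad
N \le \frac{192}{24} = 8
\]
```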

In this embodiment, the CPUs (not shown) of the server 12 and the computers 18, 26, 34 are Xeon (registered trademark) QuadCore × 2, and the memory (not shown) is 4 GB. The server 12 runs Windows (registered trademark) XP 64-bit as its operating system. The network 16 connecting the server 12 and the computers 18, 26, 34 consisted of an ultra-high-speed, high-performance research and development testbed network (JGN2plus: 1 Gbps) and a LAN (100 Mbps).

Although not illustrated, in the preliminary experiment the server 12 and the computer 18 were connected via the LAN described above, and the server 12 and the computers 26 and 34 were connected via JGN2plus and the LAN.

FIGS. 7A and 7B show the distribution of the positions of the 24 loudspeakers 230 obtained when, as described above, the loudspeaker 230 at speaker position “60” is selected first and 24 loudspeakers 230 are selected in total. FIG. 7A is a schematic view of the arrangement of the loudspeakers 230 seen from directly above, and FIG. 7B is a schematic view of the arrangement seen from the side. That is, FIG. 7A shows the horizontal distribution of the loudspeakers 230, and FIG. 7B shows their vertical distribution.

As can be seen from FIG. 7B, in the distribution shown in FIG. 7A the value in the height direction (Z direction) increases as the speaker position moves toward the center. That is, the loudspeakers 230 on the mount 220a occupy speaker positions “1” to “6”, those on the mount 220b occupy positions “7” to “22”, those on the mount 220c occupy positions “23” to “46”, and those on the mount 220d occupy positions “47” to “62”.

The eight loudspeakers 230 provided on the four pillars 222 to supplement the low-frequency range are not candidates for selection and are therefore not shown in FIGS. 7A and 7B.

In FIGS. 7A and 7B, the negative Y direction is the front, toward which the user's face points, and the positive Y direction is the rear, toward which the back of the user's head points. As shown in FIG. 7A, the negative X direction is the user's right and the positive X direction is the user's left. As shown in FIG. 7B, the negative Z direction is below the position of the user's ears and the positive Z direction is above it.

In FIG. 7A, the circle indicating the speaker position of the first-selected loudspeaker 230 (the circle labeled “60”) is shaded. The circles indicating the speaker positions of the loudspeakers 230 subsequently chosen by the iterations based on Gram-Schmidt orthogonalization (here, the circles labeled “1” to “6”, “7”, “9”, “11”, “13”, “15”, “17”, “19”, “21”, “23”, “31”, “35”, “48”, “51”, “54”, “56”, “58”, and “62”) are hatched. Circles without a pattern indicate the speaker positions of loudspeakers 230 that were not selected.

In FIG. 7B, different shapes (circle, triangle, square, diamond) are used according to the Z-direction position of each loudspeaker 230. The speaker position of the first-selected loudspeaker 230 is shown by a shape filled in black, and the speaker positions of the loudspeakers 230 selected second and later are shown by shapes filled in gray.

FIGS. 7A and 7B show that the selected loudspeakers 230 are distributed regularly across direction and height. As shown in FIG. 7A, when the distribution is viewed in plan from directly above, the selected loudspeakers 230 are distributed approximately symmetrically in both the vertical and horizontal directions of the figure. The same holds when the distribution is viewed from the side, as shown in FIG. 7B.

The microphones were selected by swapping the roles of the loudspeakers 230 of the speaker array systems (20, 28, 36) and the microphones of the microphone array 14 and then applying the Gram-Schmidt orthogonalization method described above. Because the selection method using Gram-Schmidt orthogonalization has already been described, the description is not repeated here.

図８は、図７（Ａ）および（Ｂ）に示した２４個のラウドスピーカ２３０の配列に対して、選択された８個のマイクロホンの配列を示す。図示は省略するが、マイクロホンの位置は、ラウドスピーカ２３０のスピーカ位置と同様に、番号が割り当てられている。図８では少し分かり難いが、ＸＹ平面を真上から平面的に見た場合には、選択されたマイクロホンはすべての方向に均等に分布している。 FIG. 8 shows an array of eight selected microphones with respect to the array of 24 loudspeakers 230 shown in FIGS. 7 (A) and (B). Although illustration is omitted, numbers are assigned to the positions of the microphones in the same manner as the speaker positions of the loudspeaker 230. Although it is difficult to understand in FIG. 8, when the XY plane is viewed from above, the selected microphones are evenly distributed in all directions.

In this way, the numbers of microphones and loudspeakers 230 were reduced using Gram-Schmidt orthogonalization. To evaluate the effect of this reduction, a sound source localization test in the horizontal plane was performed. The method and results of this test are disclosed in “Optimization of loudspeaker and microphone configurations for sound reproduction system based on boundary surface control principle - An optimizing approach using Gram-Schmidt orthogonalization and its evaluation -”, published by the inventors in August 2010, so their description is omitted here. As described above, the number of loudspeakers 230 was set to 24 as a result of this sound source localization test, and the number of microphones was set to 8 because of the constraints on the performance of the server 12 and other machines and on communication speed.

詳細な説明は省略するが、選択されたマイクロホンで検出された音場信号がマイクロホンアレイ１４からサーバ１２に与えられる。このとき、選択されていないマイクロホンは不能化される。つまり、サーバ１２は、選択されていないマイクロホンからの音場信号を検出しない。一方、コンピュータ１８、２６、３４は、選択されたラウドスピーカ２３０のみに、音場データや音声データを出力する。 Although a detailed description is omitted, the sound field signal detected by the selected microphone is supplied from the microphone array 14 to the server 12. At this time, unselected microphones are disabled. That is, the server 12 does not detect a sound field signal from a microphone that is not selected. On the other hand, the computers 18, 26, and 34 output sound field data and audio data only to the selected loudspeaker 230.

As described above, in this embodiment each speaker array system 20, 28, 36 outputs (reproduces) voice data corresponding to the voices of the other users together with the sound field data. Therefore, if the computers 18, 26, 34 simply convolve the voice data received from the other computers 18, 26, 34 with a voice filter, without taking the direction of the speaker's face into account, it is difficult to tell who is speaking to whom. The speaker could state his or her own name and the name of the listener (the other party) with every utterance, but that would not be natural conversation.

したがって、この実施例では、話者の顔の向き（発話の方向）を考慮した音声フィルタを用いるようにしてある。簡単に言うと、音響信号（この実施例では、音声信号）の伝達特性を考慮した音声フィルタが用いられる。 Therefore, in this embodiment, an audio filter that takes into account the direction of the speaker's face (the direction of speech) is used. In short, an audio filter that takes into account the transfer characteristics of an acoustic signal (in this embodiment, an audio signal) is used.

Although omitted from FIG. 3, the BoSC reproduction systems 10a, 10b, 10c have cameras 24, 32, 40, respectively, as shown in FIG. 1. As shown in FIG. 9, the camera 24 is attached to the mount 220d of the speaker array system 20 so that its lens (shooting direction) faces the user of the speaker array system 20 when the user is facing forward.

なお、図９では、上述のように選択した２４個のラウドスピーカ２３０がユーザの周囲を均等に囲むように模式的に示してある。 In FIG. 9, the 24 loudspeakers 230 selected as described above are schematically shown so as to uniformly surround the user.

Like the camera 24, the cameras 32 and 40 are attached to the mounts 220d of the speaker array systems 28 and 36, respectively.

さらに、上述したように、ユーザは、ヘッドセットのマイクロホン２２、３０、３８を装着してある。これは、ラウドスピーカ２３０から出力される音がマイクロホン２２、３０、３８で検出されるのを出来る限り防止して、ユーザが発生する音声のみを検出するようにするためである。 Further, as described above, the user is wearing headset microphones 22, 30, and 38. This is to prevent the sound output from the loudspeaker 230 from being detected by the microphones 22, 30 and 38 as much as possible and detect only the sound generated by the user.

コンピュータ１８、２６、３４は、各々に接続されたカメラ２４、３２、４０で撮影された映像（顔画像）を解析することにより、ユーザの顔の向き、すなわち正面方向に対する顔の角度を求める。顔画像から顔の向き等を求める方法は、既に周知であるため、その説明は省略するが、たとえば、特開平１０−２７４５１６号に開示の技術を用いることができる。 The computers 18, 26, and 34 analyze the images (face images) taken by the cameras 24, 32, and 40 connected to the computers 18, 26, and 34, thereby obtaining the face direction of the user, that is, the face angle with respect to the front direction. Since the method for obtaining the face orientation and the like from the face image is already known, the description thereof will be omitted. For example, the technique disclosed in Japanese Patent Laid-Open No. 10-274516 can be used.

However, the angle data transmitted to each of the other computers 18, 26, 34 represents the direction of the local user's (the speaker's) face relative to the position of the other user (the listener). Therefore, after the face direction is obtained from the face image, it is converted into an angle for which the position (direction) of the other user is the reference (0°).

To reflect the angle detected in this way in the reproduced voice, the transfer characteristics of the voice are measured and, as described above, a voice filter that takes these transfer characteristics into account is used. For simplicity, this embodiment assumes that the three parties using the sound reproduction system 10 are located, in some space, at the vertices of an equilateral triangle whose sides each have a predetermined length (2 m).

That is, as shown in FIG. 10, users A, B, and C are located at the vertices of an equilateral triangle with 2 m sides, and the front direction of each user is set along the perpendicular dropped from that user's vertex to the opposite side. In this virtual positional relationship, when user A speaks to user B, user A faces 30° to the right of the front direction; when user A speaks to user C, user A faces 30° to the left of the front direction. The same applies to users B and C, although the description is omitted.
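The 30° figures follow directly from the geometry of the equilateral triangle, as the short check below illustrates. The coordinates and the helper `signed_angle_deg` are assumptions made for this sketch; only the 2 m side length and the definition of the front direction come from the text.

```python
import math

SIDE = 2.0  # side length of the equilateral triangle in metres (from the text)

# Hypothetical coordinates: A faces the -Y direction, toward the midpoint of BC,
# so that A's right-hand side is the -X direction.
A = (0.0, SIDE * math.sqrt(3) / 2)
B = (-SIDE / 2, 0.0)
C = (SIDE / 2, 0.0)


def signed_angle_deg(origin, front_point, target):
    """Angle from the front direction to the target, in degrees.

    Positive values are counterclockwise (to the left) seen from above,
    negative values are clockwise (to the right).
    """
    fx, fy = front_point[0] - origin[0], front_point[1] - origin[1]
    tx, ty = target[0] - origin[0], target[1] - origin[1]
    cross = fx * ty - fy * tx
    dot = fx * tx + fy * ty
    return math.degrees(math.atan2(cross, dot))


midpoint_bc = ((B[0] + C[0]) / 2, (B[1] + C[1]) / 2)
angle_to_b = signed_angle_deg(A, midpoint_bc, B)  # about -30: 30 degrees to the right
angle_to_c = signed_angle_deg(A, midpoint_bc, C)  # about +30: 30 degrees to the left
```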

この仮想の位置関係を再現するべく、或る場所において、音声の伝達特性を検出した。図１１は、音声の伝達特性を検出した環境を真上から見た図である。図１１に示す或る場所は、小会議室であり、横が１０ｍで縦が３．９ｍの長方形状を有している。ただし、図１１からも分かるように、小会議室は、長方形の左上部において、内側に少し凹んでいる。 In order to reproduce this virtual positional relationship, sound transmission characteristics were detected at a certain place. FIG. 11 is a view of the environment in which the sound transfer characteristic is detected as seen from directly above. A certain place shown in FIG. 11 is a small meeting room having a rectangular shape with a width of 10 m and a length of 3.9 m. However, as can be seen from FIG. 11, the small meeting room is slightly recessed inward in the upper left corner of the rectangle.

また、小会議室には、音声の伝達特性を検出するためのラウドスピーカ５０およびマイクロホンアレイ５２が配置される。ラウドスピーカ５０としては、たとえば、人間が発生する音声に近似する音を再現可能なスピーカ（ＹＡＭＡＨＡＭＳＰ−３）が用いられる。また、マイクロホンアレイ５２としては、上述したマイクロホンアレイ１４と同じものが用いられる。ただし、音再現システム１０に用いられる場合と音声の伝達特性の検出に用いられる場合とを区別するために、異なる参照符号を付してある。 In the small conference room, a loudspeaker 50 and a microphone array 52 for detecting a sound transmission characteristic are arranged. As the loudspeaker 50, for example, a loudspeaker (YAMAHA MSP-3) capable of reproducing a sound that approximates a sound generated by a human being is used. As the microphone array 52, the same microphone array 14 as described above is used. However, in order to distinguish between the case where it is used for the sound reproduction system 10 and the case where it is used for detection of the transfer characteristic of sound, different reference numerals are given.

As can be seen from FIG. 11, the microphone array 52 is placed at the center along the lower wall of the small meeting room. Taking the front direction of the microphone array 52 as straight up in the figure, the loudspeaker 50 is placed in the direction rotated 30° to the left, at a position where, with the front of the loudspeaker 50 facing the microphone array 52, the distance between that front and the center of the microphone array 52 is 2 m. At that position the loudspeaker 50 is rotated through one full turn (360°) in 15° steps. At each 15° step, a sweep signal is output from the loudspeaker 50 as a stimulus, and the impulse response detected by each microphone m (m = 1, 2, ..., M) of the microphone array 52 is recorded as the transfer characteristic H_ang[m]. In this embodiment, M = 70 as described above. Here, ang is the angle that simulates the directivity of the sound source, namely the angle relative to the front direction of users A, B, and C described above. In this embodiment, the loudspeaker 50 is rotated counterclockwise in 15° steps. The sweep signal was a signal up to 24 kHz created using the Time Stretched Pulse method. The reverberation time of this small meeting room is about 0.6 seconds.

なお、１５°刻みでラウドスピーカ５０を回転させるのは、人間の聴覚によって識別可能な角度が２０°程度だからである。 The reason why the loudspeaker 50 is rotated in increments of 15 ° is that the angle that can be identified by human hearing is about 20 °.

In other words, in the arrangement of FIG. 11 the transfer characteristics are measured with the loudspeaker 50 acting as the speaker (the person talking) and with the listener assumed to be positioned so that his or her head (ear height) is at the center of the microphone array 52. To obtain the transfer characteristics H_ang[m] for every case in the virtual positional relationship shown in FIG. 10, it would therefore be necessary to measure H_ang[m] with the positions of the loudspeaker 50 and the microphone array 52 swapped, with the loudspeaker 50 moved to the position shown by the dotted line (rotated 30° to the right from the front direction of the microphone array 52), and with the positions of the dotted-line loudspeaker 50 and the microphone array 52 swapped. For simplicity, however, in this embodiment H_ang[m] is measured only for the arrangement of the loudspeaker 50 and the microphone array 52 shown by solid lines in FIG. 11, and this measurement is used by each of the computers 18, 26, 34.

FIG. 12 shows, as a dotted line, the waveform of the impulse response detected by one microphone of the microphone array 52 (called the “original impulse response” here to distinguish it from the “attenuated impulse response” described below). The original impulse response contains early reflections and late reflections. As noted above, the small meeting room of FIG. 11 is reverberant, so the response takes time to decay, and reproducing it faithfully would require an inverse filter longer than 2048 points. Real-time processing would then no longer be possible. In this embodiment, therefore, a Hanning window is applied so that the inverse filter length does not exceed 2048 points. The impulse response attenuated by the Hanning window is shown as a solid line in FIG. 12. The Hanning window is positioned so that the direct sound of the impulse response recorded by each microphone lies at its center. As FIG. 12 shows, the attenuated impulse response contains the early reflections sufficiently but none of the late reflections. Even so, when the transfer characteristics H_ang[m] based on the attenuated impulse response are used, the positional relationship between speaker and listener can be reproduced almost accurately, as if the users were conversing in the small meeting room shown in FIG. 11.
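The windowing step can be sketched as follows. This is a minimal illustration, assuming the direct sound is the largest-magnitude sample and that the window length is a free parameter; the names `hann_window` and `window_impulse_response`, the length of 2048 used in the call, and the toy impulse response are all assumptions, since the text states only that a Hanning window is centered on the direct sound.

```python
import math


def hann_window(length):
    """Symmetric Hann window of the given length (length >= 2)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (length - 1)))
            for n in range(length)]


def window_impulse_response(ir, length):
    """Apply a Hann window centred on the direct sound (largest absolute sample).

    Samples outside the window are discarded; window positions that fall
    outside the recording are treated as zeros.
    """
    peak = max(range(len(ir)), key=lambda n: abs(ir[n]))
    start = peak - length // 2
    window = hann_window(length)
    out = []
    for k in range(length):
        n = start + k
        sample = ir[n] if 0 <= n < len(ir) else 0.0
        out.append(sample * window[k])
    return out


# Toy impulse response: silence, a direct sound, then a slowly decaying tail.
ir = [0.0] * 100 + [1.0] + [0.5 * math.exp(-n / 400.0) for n in range(1, 3000)]
short_ir = window_impulse_response(ir, length=2048)
```

The window tapers the tail smoothly to zero, which keeps the reflections near the direct sound while removing the late part of the decay.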

Although not illustrated, each computer 18, 26, 34 stores data corresponding to the transfer characteristics H_ang[m] (transfer characteristic data) in memory (hard disk or RAM). Each computer 18, 26, 34 reads the transfer characteristic data for the angle ang indicated by the angle data sent from another computer 18, 26, 34, and reproduces the voice signal using a voice filter that takes the corresponding transfer characteristic H_ang[m] into account. Voice with directivity is thereby reproduced.

A concrete description follows. Let S be the acoustic signal recorded by a single microphone 22 (30, 38) (in this embodiment, the voice signal corresponding to the voice uttered by the user). Let G_inv[s, i] be the inverse filter for secondary sound source loudspeaker s (s = 1, 2, ..., N) and control point i (i = 1, 2, ..., M) in the BoSC reproduction system. The arrangement of the control points i is congruent with the microphone array 52, so m = i holds. The secondary sound source loudspeakers s are the loudspeakers 230, and N = 24 in this embodiment.

As shown in FIG. 13A, let θ be the direction in which the listener is located as seen from the speaker, and let α be the direction the speaker is facing; the orientation (angle) of the speaker relative to the listener is then α − θ. Representing the speaker and listener of FIG. 13A by the loudspeaker 50 and the microphone array 52 described above gives FIG. 13B. Accordingly, when voice including the utterance direction is reproduced using the transfer characteristic H_ang[m] for the angle ang = α − θ, the output signal R(s) from secondary sound source s in the BoSC reproduction system is given by Equation 17, where V[s] is the voice filter that takes the transfer characteristic H_ang[m] into account.
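Equation 17 itself is not reproduced in this text, so the following is only a sketch of how such a rendering step could look. It assumes that the output is the convolution R(s) = V[s] * S and, as a further assumption not confirmed by the source, that V[s] is formed by summing G_inv[s, i] convolved with H_ang[i] over the control points i. All function names and the toy filter values are hypothetical.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def add(a, b):
    """Element-wise sum of two sequences, zero-padding the shorter one."""
    n = max(len(a), len(b))
    return [(a[k] if k < len(a) else 0.0) + (b[k] if k < len(b) else 0.0)
            for k in range(n)]


def voice_filter(g_inv_s, h_ang):
    """ASSUMED composition: V[s] = sum over i of (G_inv[s, i] convolved with H_ang[i])."""
    v = []
    for g, h in zip(g_inv_s, h_ang):
        v = add(v, convolve(g, h))
    return v


def render(signal, g_inv, h_ang):
    """Output R(s) = V[s] convolved with S for every secondary source s."""
    return [convolve(voice_filter(g_inv_s, h_ang), signal) for g_inv_s in g_inv]


# Toy sizes: 2 loudspeakers, 2 control points, very short filters.
h_ang = [[1.0, 0.5], [0.0, 1.0]]            # H_ang[i] for one angle
g_inv = [[[1.0], [0.0]], [[0.0], [2.0]]]    # G_inv[s][i]
outputs = render([1.0, 0.0, -1.0], g_inv, h_ang)
```

Because V[s] depends only on the angle, it can be computed once per angle and stored, leaving a single convolution per loudspeaker at run time.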

In other words, each computer 18, 26, 34 stores, in internal memory such as RAM or a hard disk, data corresponding to the angle-dependent voice filters V[s] or transfer characteristics H_ang[m] (voice filter data or transfer characteristic data), and convolves the received voice data with the voice filter V[s] for the angle indicated by the angle data received from another computer 18, 26, 34. Because the transfer characteristics H_ang[m] are measured in 15° steps as described above, the voice filters V[s] are also available in 15° steps. When the voice filter V[s] for the angle indicated by the angle data is selected, the filter for whichever of 0°, 15°, ..., 330°, 345° is closest to that angle is therefore chosen. If the indicated angle lies exactly midway between two adjacent angles, as with 7.5° or 22.5°, the voice filter V[s] for one of the two angles, chosen according to a predetermined rule, is used. Possible rules include choosing the angle closer to the previous angle, choosing the smaller (or larger) angle, or choosing at random. Whichever rule is adopted, no problem arises because, as noted above, the resulting difference is smaller than the roughly 20° that human hearing can distinguish.
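The filter lookup described above amounts to snapping the received angle to the 15° measurement grid. The sketch below is one possible implementation; the function names are hypothetical, and the tie-breaking order (previous angle first, then the smaller angle) is just one combination of the rules the text lists as options.

```python
STEP_DEG = 15  # measurement step from the text


def circular_distance(a_deg, b_deg):
    """Smallest absolute difference between two angles on a 360-degree circle."""
    d = abs(a_deg - b_deg) % 360
    return min(d, 360 - d)


def nearest_measured_angle(angle_deg, previous_deg=None):
    """Map an arbitrary angle to the nearest multiple of 15 degrees in [0, 360).

    An exact tie (e.g. 7.5 degrees) goes to the candidate closer to the
    previously used angle if one is given, otherwise to the smaller candidate.
    """
    angle = angle_deg % 360.0
    lower = int(angle // STEP_DEG) * STEP_DEG
    upper = lower + STEP_DEG
    d_lower = angle - lower
    d_upper = upper - angle
    if d_lower < d_upper:
        chosen = lower
    elif d_upper < d_lower:
        chosen = upper
    elif (previous_deg is not None
          and circular_distance(upper, previous_deg)
          < circular_distance(lower, previous_deg)):
        chosen = upper
    else:
        chosen = lower
    return chosen % 360  # 360 wraps around to 0
```

For example, `nearest_measured_angle(7.5)` selects the 0° filter, while `nearest_measured_angle(7.5, previous_deg=30)` selects the 15° filter because it is closer to the previously used angle.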

Thus, in this embodiment, the voice filter V[s] incorporating the transfer characteristics H_ang[m] is generated from impulse responses measured in the small meeting room shown in FIG. 11, so users of the speaker array systems 20, 28, 36 can experience the sensation of conversing in that small meeting room while positioned at the vertices of an equilateral triangle with 2 m sides.

Accordingly, if the impulse responses are measured in another location, users can experience the sensation of conversing in that location. For example, if impulse responses are measured in the audience seats of the orchestra venue where the microphone array 14 is placed and voice filters are generated from them, users can experience the sensation of conversing while listening to a live orchestra in that venue.

The following experiment was conducted to perform a subjective evaluation of the speaker's face angle and the voice reproduction. The stimulus (stimulus sound) output from the loudspeaker 50 was the voice of a man in his 30s saying a common greeting (here, “konnichiwa”, that is, “hello”). The subjects were 10 Japanese people in their 20s or 30s, five women and five men.

In this experiment, the angles used in both of the two environments described below, namely the actual environment (hereinafter the “real environment”) and the environment reproduced by the sound field reproduction system (the speaker array system 20; 28 or 36 may also be used) (hereinafter the “reproduction environment”), range from 0° to 90° counterclockwise in 15° steps. The 0° position is the one at which the front of the loudspeaker 50 (the speaker's face) directly faces the microphone array 52 (the listener, that is, the subject). Using this angular range makes it possible to determine whether, in the assumed three-party relationship (the virtual positional relationship), it can be perceived acoustically which listener the speaker is addressing.

上述したように、この実施例では、２つの環境で主観評価を行った。１つ目は、実環境で回転しているラウドスピーカ５０を用いて音声を再現した場合についての主観評価である。２つ目は、再現環境で上記の音声フィルタＶ[ｓ]を使用して上記の角度範囲内で角度を変化させて音声を再現した場合についての主観評価である。 As described above, in this example, subjective evaluation was performed in two environments. The first is a subjective evaluation in the case where sound is reproduced using the loudspeaker 50 rotating in a real environment. The second is a subjective evaluation in the case where the sound is reproduced by changing the angle within the above angle range using the sound filter V [s] in the reproduction environment.

The first subjective evaluation experiment was carried out in the same location and under the same conditions as the impulse response measurement, with the loudspeaker 50 rotated in random order in the real environment. As described above, the loudspeaker 50 used to measure the impulse responses for the voice filter was also used to reproduce the voice in the real environment. Subjects made their evaluations at the position where the microphone array 52 had been placed when the impulse responses were measured. Subjects were allowed to turn their heads during the experiment. However, each subject adjusted his or her height, for example by sitting on a chair, so that the ears were at the height of the center of the spherical frame (14a in FIG. 2) of the microphone array 52. In addition, a curtain was placed in front of the loudspeaker 50 (between the subject and the loudspeaker 50) so that the subject could not see it.

Measurements with a sound pressure level meter showed that the curtain had only a slight effect on the sound field. The power output of the loudspeaker 50 was adjusted by someone other than the subjects, so the volume was not affected by the face angle or by the two environments (real and reproduction).

In the experiment for the second subjective evaluation, the computer 18 (or 26 or 34) and the speaker array system 20 (or 28 or 36) were used to output the stimulus sound through the voice filters V[s] described above, with the angle varied from 0° to 90° in 15° steps as described above.
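The reproduction step can be sketched as follows: pick the stored voice filter whose angle is closest to the requested one, then convolve the single-channel stimulus with that filter's impulse response for each loudspeaker channel. This is only an illustrative sketch. The names `convolve`, `nearest_angle`, `render` and the toy `filter_bank` are hypothetical and do not come from the embodiment, and the two-tap filters are made-up values rather than measured impulse responses.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def nearest_angle(angle, available):
    """Pick the stored filter angle closest to the requested angle."""
    return min(available, key=lambda a: abs(a - angle))


def render(mono, angle, filter_bank):
    """Return one convolved output signal per loudspeaker channel."""
    key = nearest_angle(angle, filter_bank.keys())
    return [convolve(mono, h) for h in filter_bank[key]]


# Hypothetical toy filter bank: 7 angles (0..90 in 15-degree steps),
# 2 loudspeaker channels, 2 taps per channel. Values are invented.
filter_bank = {a: [[1.0, 0.0], [a / 90.0, 0.5]] for a in range(0, 91, 15)}

# A requested angle of 32 degrees falls back to the 30-degree filter.
channels = render([1.0, 2.0], 32, filter_bank)
```

A real system would use long measured impulse responses and a fast (FFT-based) convolution; the direct form is used here only to keep the sketch self-contained.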

Before being asked about the voice direction, the subjects were told the position of the loudspeaker 50. In the experiment, the loudspeaker 50 was first rotated counterclockwise from 0° to 90° in 15° steps and then back (clockwise) from 90° to 0° in 15° steps to change the voice direction, and the subjects listened to the voice. For each question, the subject first heard the voice at the 0° position and then heard the voice twice at the same target angle. Because the voice direction varies from 0° to 90° in 15° steps, the subject had to choose one direction (angle) out of seven. The seven voice directions were tested in random order for each subject. Each subject answered 14 questions in total across the real environment and the reproduction environment.
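The trial schedule described above (seven directions, presented in random order in each of the two environments, 14 questions per subject) can be sketched as follows. The function names and the use of a per-subject seed are assumptions made for illustration; the embodiment does not state how the random order was generated.

```python
import random

# Seven candidate directions: 0, 15, ..., 90 degrees
ANGLES = list(range(0, 91, 15))


def trial_order(seed):
    """Return the seven directions in a reproducible random order."""
    rng = random.Random(seed)
    order = ANGLES[:]
    rng.shuffle(order)
    return order


def session(seed):
    """One subject's schedule: seven questions in each environment."""
    return {
        "real": trial_order(seed),
        "reproduction": trial_order(seed + 1),
    }
```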

In each environment, the angle error is defined as follows. In the real environment, it is the absolute error between the angle the loudspeaker 50 faces and the answered angle. In the reproduction environment, it is the absolute error between the direction (angle) of the reproduced voice and the answered angle. FIG. 14 shows box plots of the mean angle error over all subjects in each environment. As shown in FIG. 14, the mean angle errors in the real environment and the reproduction environment are 13.7° and 20.8°, respectively. Considering the virtual positional relationship among the three parties shown in FIG. 10 (each user placed at a vertex of an equilateral triangle), the mean angle error in the reproduction environment can be said to be small enough for a listener to perceive who is talking to whom.
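The angle error defined above is a mean absolute error between presented and answered angles. A minimal sketch, with a hypothetical function name and invented example answers (not data from the experiment):

```python
def mean_abs_angle_error(presented, answered):
    """Mean absolute difference, in degrees, between the presented
    direction and the direction the subject answered."""
    if len(presented) != len(answered) or not presented:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(abs(p - a) for p, a in zip(presented, answered))
    return total / len(presented)


# Invented example: errors of 0, 15, 0 and 30 degrees
example = mean_abs_angle_error([0, 15, 30, 45], [0, 30, 30, 15])
```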

However, the mean angle errors of the two environments differ by 7.1°. A two-tailed t-test shows that this difference is statistically significant (p < 0.05). It follows that the subjects found it harder to perceive the angle of the speaking direction in the reproduction environment than in the real environment. Most subjects also commented that perceiving the angle of the speaking direction was harder in the reproduction environment than in the real environment, and that the difference between the two lay in the length of the reverberation. In a shared space with sound reflections, such as the conference room used in the experiment, late reflections are thought to have a significant effect on perceiving the facing angle.
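The text states only that a two-tailed t-test was used. Assuming, for illustration, a paired test (the same subjects were evaluated in both environments), the test statistic could be computed as in the sketch below. The function name and the choice of a paired design are assumptions, and no p-value is computed here; the statistic would be compared with the two-sided critical value of Student's t distribution for the returned degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(real_errors, reproduction_errors):
    """Return (t, degrees_of_freedom) for per-subject paired differences.

    Assumes one mean angle error per subject and environment, listed in
    the same subject order, and that the differences are not all equal.
    """
    if len(real_errors) != len(reproduction_errors) or len(real_errors) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [b - a for a, b in zip(real_errors, reproduction_errors)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1
```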

FIG. 15 is a bar graph showing the mean angle error for each angle the talker faces (here, the angle the loudspeaker 50 faces, or the angle of the speaking direction reproduced by the speaker array system 20 (28, 36)). Bars with a lattice pattern show the mean angle error for the real environment, and hatched bars show the mean angle error for the reproduction environment.

As FIG. 15 shows, there is a marked difference between the two environments when the talker faces 90°. This is thought to be because some subjects could not perceive that the voice had rotated as far as 90°.

FIG. 16 shows a scatter plot of the mean angle error for each subject. That is, it shows the correlation between each subject's mean angle error in the real environment and in the reproduction environment. The number in each circle identifies an individual subject. Solid circles denote male subjects and dotted circles denote female subjects.

FIG. 16 shows that, for half of the subjects, the difference in perception of the speaking direction between the two environments is small. For the other half, perception of the angle of the speaking direction was more accurate in the real environment. One subject (female) whose answers differed very little between the two environments clearly perceived the angle of the speaking direction rotating from 0° to 90° in the reproduction environment. These results indicate that the ability to recognize the angle of the speaking direction varies between individuals, depending on the subject's (hearing) ability. FIG. 16 also shows that female subjects in particular showed almost no difference between the two environments.

In the subjective evaluation experiments, the output power of the loudspeaker 50 was controlled to keep the loudness (intensity) of the voice constant at every angle. When three parties actually converse using the sound reproduction system 10, however, the loudness (intensity) of the voice changes naturally with the direction (angle) the talker faces, so the speaking direction is expected to be easier to perceive.

According to this embodiment, not only the voice but also the direction of the talker's voice can be reproduced. Therefore, even when users at remote locations converse using their respective sound field reproduction systems, each user can perceive from the reproduced voice who is talking to whom, and the conversation can proceed smoothly.

In this embodiment, the voice of a user wearing a headset microphone is reproduced, but the invention need not be limited to this. The sound of an instrument played by the user or the sound of the user clapping may be reproduced instead. When the user plays an instrument, the orientation of the instrument must be detected, so, for example, a gyro sensor is provided on the instrument and the direction of the instrument is detected from the output of the gyro sensor. When the user's clapping is reproduced, a microphone is worn near the user's wrist, and a gyro sensor is provided near the wrist or the abdomen to detect the direction in which the user's hands are located or the orientation of the body.

In this embodiment, the orientation of the user's face is detected from video captured by a camera, but the invention need not be limited to this. For example, a gyro sensor may be attached to the user's head (the headset microphone), and the orientation of the user's face may be detected from the output of the gyro sensor.

In this embodiment, a loudspeaker and a microphone array are installed at a certain location, the transfer characteristics of the voice are obtained by measuring impulse responses, and the obtained transfer characteristics are reflected in the voice filters, but the invention need not be limited to this. For example, the transfer characteristic for each angle ang can also be calculated by simulation using the image source (mirror image) method. In that case, a reflectance is set for each virtual wall of the assumed environment, and reflected sound is generated accordingly.
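The image-source idea can be sketched for first-order reflections in a two-dimensional rectangular room. The sketch mirrors the receiver across each wall, so the direction from the source to each mirrored receiver is the direction in which that path leaves the source. Everything beyond the image-source principle itself is an assumption introduced for illustration: the function names, the 2-D room with one corner at the origin, the single frequency-independent reflectance, the 1/distance spreading, and the cardioid directivity standing in for the directional radiation of a talker or loudspeaker.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def cardioid(theta):
    """Assumed source directivity: 1 on-axis, 0 directly behind."""
    return 0.5 * (1.0 + math.cos(theta))


def first_order_paths(src, rcv, room, facing_deg, reflectance):
    """Return sorted (delay_seconds, gain) pairs for the direct path and
    the four first-order wall reflections.

    src, rcv    : (x, y) positions in metres
    room        : (width, depth) of a rectangle with a corner at (0, 0)
    facing_deg  : direction the source faces, 0 = +x axis
    reflectance : amplitude reflectance applied once per reflection
    """
    lx, ly = room
    rx, ry = rcv
    # The receiver itself, then its mirror image in each of the four walls
    images = [
        ((rx, ry), 1.0),
        ((-rx, ry), reflectance),
        ((2 * lx - rx, ry), reflectance),
        ((rx, -ry), reflectance),
        ((rx, 2 * ly - ry), reflectance),
    ]
    facing = math.radians(facing_deg)
    paths = []
    for (ix, iy), wall_gain in images:
        dx, dy = ix - src[0], iy - src[1]
        dist = math.hypot(dx, dy)
        departure = math.atan2(dy, dx)  # direction the path leaves the source
        gain = wall_gain * cardioid(departure - facing) / dist
        paths.append((dist / SPEED_OF_SOUND, gain))
    return sorted(paths)


# Source facing the receiver, then facing directly away from it
toward = first_order_paths((1.0, 1.0), (3.0, 1.0), (4.0, 3.0), 0.0, 0.5)
away = first_order_paths((1.0, 1.0), (3.0, 1.0), (4.0, 3.0), 180.0, 0.5)
```

Under these assumptions, turning the source away removes the direct sound while the reflections remain, which is how the simulated transfer characteristic comes to depend on the facing angle. A fuller simulation would add higher-order images, a third dimension, and one receiver per microphone of the array.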

Furthermore, this embodiment shows only the case in which the users are virtually positioned at the vertices of an equilateral triangle, but the invention need not be limited to this. If a large number of transfer characteristics are prepared by measuring or calculating impulse responses for various distances and for various angles of the loudspeaker relative to the front direction of the microphone array, the voice can be reproduced for a variety of positional relationships between users.

Furthermore, in this embodiment the sound field data detected by the microphone array is also reproduced, but the sound field data need not be reproduced.

In this embodiment, a conversation among three parties is reproduced, but a conversation between two parties, or among four or more parties, can also be reproduced. For example, for a conversation among four parties, the users may be virtually placed at the vertices of a square whose sides have a predetermined length. For a conversation among five parties, the users may be virtually placed at the vertices of a regular pentagon whose sides have a predetermined length. The same applies to other cases. Alternatively, the actual positional relationship may be expressed as a polygon with each user placed at one of its vertices. In every case, voice filters are prepared that take into account the transfer characteristics obtained by measurement or calculation.

In this embodiment, the numbers of microphones and loudspeakers used in the microphone array and the speaker array system are reduced in view of the current performance of servers and computers and the data transmission speed. If performance and transmission speed improve, it should be possible to reproduce the sound field data and voice data in real time without reducing those numbers.
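The virtual placement of users at the vertices of a regular polygon with a given side length can be sketched as follows. The function name and the coordinate convention (polygon centered at the origin, first user on the +x axis) are assumptions for illustration. The sketch covers three or more users only, since the text does not specify a layout for two parties.

```python
import math


def polygon_positions(n_users, side):
    """Place n_users at the vertices of a regular polygon whose sides
    all have length `side`. Returns a list of (x, y) positions."""
    if n_users < 3:
        raise ValueError("a polygon needs at least three vertices")
    # Circumradius of a regular n-gon with the given side length
    radius = side / (2.0 * math.sin(math.pi / n_users))
    return [
        (radius * math.cos(2.0 * math.pi * k / n_users),
         radius * math.sin(2.0 * math.pi * k / n_users))
        for k in range(n_users)
    ]


triangle = polygon_positions(3, 2.0)  # three parties
square = polygon_positions(4, 1.0)    # four parties
```

The distance and relative angle between any pair of vertices then indicate which measured or calculated transfer characteristics would be needed for that pair of users.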

DESCRIPTION OF SYMBOLS
10 … Sound field sharing system
12 … Server
14 … Microphone array
18, 26, 34 … Computer
20, 28, 36 … Speaker array system
22, 30, 38 … Microphone
24, 32, 40 … Camera

Claims

A sound reproduction system comprising a plurality of sound reproduction devices, each of which includes at least a speaker array having a plurality of first loudspeakers, wherein
each sound reproduction device comprises:
filter storage means for storing voice filter data corresponding to a voice filter provided for each angle;
sound detection means for detecting sound data corresponding to a sound generated by a user;
angle detection means for detecting angle data corresponding to the direction in which the user generated the sound, with the direction of another user as a reference;
data transmission means for transmitting the sound data detected by the sound detection means and the angle data detected by the angle detection means to another sound reproduction device;
first data receiving means for receiving sound data and angle data from another sound reproduction device;
sound processing means for reading, from the filter storage means, the voice filter data corresponding to the angle indicated by the angle data received by the first data receiving means, and for applying convolution processing to the sound data received by the first data receiving means using the voice filter corresponding to the read voice filter data; and
sound output means for outputting the sound data convolved by the sound processing means to the speaker array.

The sound reproduction system according to claim 1, wherein the voice filter is generated based on impulse responses measured by a microphone array having a plurality of microphones when, at a certain location, the microphone array is placed in a predetermined orientation, a second loudspeaker is placed so as to face the microphone array, and the second loudspeaker is made to emit a stimulus sound while being rotated by a predetermined angle at a time.

The sound reproduction system according to claim 2, wherein the second loudspeaker is placed at a predetermined distance in a direction at a predetermined angle from the front direction of the microphone array.

The sound reproduction system according to any one of claims 1 to 3, wherein
the microphone array is placed in a certain sound field,
the system further comprises a server that records sound field data detected by the microphone array, applies convolution processing to the sound field data, and transmits the sound field data to each of the sound reproduction devices,
each of the sound reproduction devices further comprises second data receiving means for receiving the sound field data transmitted from the server, and
the sound output means superimposes the sound field data received by the second data receiving means on the sound data convolved by the sound processing means and outputs the result to the speaker array.

The sound reproduction system according to claim 4, wherein
the speaker array has a first predetermined number of first loudspeakers,
the microphone array has a second predetermined number of microphones,
the system further comprises speaker selection means for selecting a third predetermined number, smaller than the first predetermined number, of first loudspeakers having high linear independence, and microphone selection means for selecting a fourth predetermined number, smaller than the second predetermined number, of microphones having high linear independence,
the server records the sound field data using the fourth predetermined number of microphones and applies the convolution processing to it, and
the sound output means outputs the sound field data received by the second data receiving means using the third predetermined number of first loudspeakers.

A sound reproduction device comprising:
a speaker array having a plurality of loudspeakers;
filter storage means for storing voice filter data corresponding to a voice filter provided for each angle;
sound detection means for detecting sound data corresponding to a sound generated by a user;
angle detection means for detecting angle data corresponding to the direction in which the user generated the sound, with the direction of another user as a reference;
data transmission means for transmitting the sound data detected by the sound detection means and the angle data detected by the angle detection means to another sound reproduction device;
data receiving means for receiving sound data and angle data from another sound reproduction device;
sound processing means for reading, from the filter storage means, the voice filter data corresponding to the angle indicated by the angle data received by the data receiving means, and for applying convolution processing to the sound data received by the data receiving means using the voice filter corresponding to the read voice filter data; and
sound output means for outputting the sound data convolved by the sound processing means to the speaker array.

A sound reproduction method for a sound reproduction system comprising a plurality of sound reproduction devices, each of which includes a speaker array having a plurality of loudspeakers and filter storage means for storing voice filter data corresponding to a voice filter provided for each angle, wherein
each sound reproduction device:
(a) detects sound data corresponding to a sound generated by a user;
(b) detects angle data corresponding to the direction in which the user generated the sound, with the direction of another user as a reference;
(c) transmits the sound data detected in step (a) and the angle data detected in step (b) to another sound reproduction device;
(d) receives sound data and angle data from another sound reproduction device;
(e) reads, from the filter storage means, the voice filter data corresponding to the angle indicated by the angle data received in step (d), and applies convolution processing to the sound data received in step (d) using the voice filter corresponding to the read voice filter data; and
(f) outputs the sound data convolved in step (e) to the speaker array.