JP2017168977A - Information processing apparatus, conference system, and method for controlling information processing apparatus

Info

Publication number: JP2017168977A
Application number: JP2016051100A
Authority: JP
Inventors: Miku Hakamatani (未来袴谷); Kiyoto Igarashi (清人五十嵐)
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2016-03-15
Filing date: 2016-03-15
Publication date: 2017-09-21
Anticipated expiration: 2036-03-15
Also published as: JP6590152B2

Abstract

PROBLEM TO BE SOLVED: To achieve speaker tracking even when a voice input unit is arranged at an arbitrary angle with respect to an apparatus body. SOLUTION: An information processing apparatus comprises: a microphone unit 70 having a microphone array 71 that includes a plurality of microphones 72; a speaker 115 that outputs voices; a camera 112 that photographs a predetermined range; a sound source direction detection part 130 that detects the directions of the voices input to the microphone array 71 on the basis of the voices input to the plurality of microphones 72; and a body part 50 having an imaging range control part 131 that controls a range of imaging performed by the camera 112 according to a result of detection performed by the sound source direction detection part 130. The sound source direction detection part 130 detects information on the relative position of the microphone unit 70 to the body part 50 on the basis of the voices output from the speaker 115 and input to the microphone array 71, and the imaging range control part 131 controls the imaging range on the basis of a result of detection and the position information. SELECTED DRAWING: Figure 2

Description

The present invention relates to an information processing apparatus, a conference system, and a method for controlling an information processing apparatus.

In recent years, conference systems (also called remote conference systems, teleconference systems, or video conference systems) have become widespread. These systems connect terminal devices (also called conference terminals) installed at remote locations (sites) via a network such as the Internet to hold remote conferences (also called teleconferences or video conferences).

A conference terminal in such a conference system is installed in a conference room or the like at each site and conducts a remote conference by exchanging images and audio of the conference attendees with the conference terminal at the other site. Specifically, each conference terminal captures the attendees of the remote conference with a camera, collects their voices with a microphone, and transmits the image data and audio data to the counterpart conference terminal. It also receives the image data and audio data transmitted from the counterpart conference terminal, displays a conference screen based on the received image data on a display unit (monitor), and outputs the audio data from a speaker. In this specification, "voice" is not limited to the human voice and also includes machine sounds such as a ringing tone emitted by a conference terminal.

It is known to use, as the microphone of a conference terminal, a microphone array in which a plurality of microphones for voice input are arranged. To enable smooth conversation between sites, some known terminals provide a speaker tracking function. They identify the direction from which a voice arrived based on the differences in the time at which sound from a source reaches each microphone of the array (referred to as the sound source direction detection function or sound source direction detection process), thereby detect which conference participant is actually speaking (referred to as the speaker), and capture that speaker with the camera.

For example, Patent Document 1 discloses an apparatus that obtains the position of a sound source from the time difference with which sound from the same source reaches a plurality of microphones, and controls the orientation and zoom amount of a camera toward the sound source based on the obtained position.

When the camera, microphone, and speaker of a conference terminal are built into the terminal's main body (also called the main body unit or apparatus body), the terminal is often installed near the monitor, where the participants' gaze converges, so that the camera can capture the participants broadly. However, when the microphone is integrated into the main body, the microphone is also close to the monitor, which makes it difficult to pick up the voices of participants seated far from the monitor.

To address this, it has been proposed to configure the microphone as a microphone unit (also called a voice input unit) separate from the main body of the conference terminal. With the microphone configured as a separate unit connected to the main body by wire or wirelessly, the microphone unit can be placed at a desired position away from the monitor (for example, at the center of the conference participants). This makes it possible to pick up the voice of every participant.

In a conference terminal whose microphone array is integrated into the main body, the speaker tracking function could be performed reliably using the sound source direction detection function. When the microphone array is instead configured as a microphone unit separate from the main body and connected by wire or wirelessly, the microphone unit can be placed more freely, but speaker tracking requires the microphone unit to be in a predetermined positional relationship to the main body. If the microphone unit is placed at a different position or angle each time to suit the usage environment, speaker tracking may not be performed correctly.

For example, the technique described in Patent Document 1 requires that the 0° direction of the turntable carrying the camera be parallel to the line connecting the two microphones, and the movable range is limited to parallel translation. To use an arrangement at any angle other than parallel, the angle has to be measured in advance.

Therefore, an object of the present invention is to provide an information processing apparatus capable of speaker tracking even when the voice input unit is arranged at an arbitrary angle with respect to the apparatus body.

To achieve this object, an information processing apparatus according to the present invention includes: a first device having voice input means that comprises a plurality of voice input units; and a second device having voice output means that outputs voice, imaging means that captures a predetermined range, sound source direction detection means that detects the direction from which voice is input to the voice input means based on the voice input to the plurality of voice input units, and imaging range control means that controls the imaging range of the imaging means according to the detection result of the sound source direction detection means. The sound source direction detection means detects information on the position of the first device relative to the second device based on voice that is output from the voice output means and input to the voice input means, and the imaging range control means controls the imaging range based on the detection result and the position information.

According to the present invention, speaker tracking can be achieved even when the voice input unit is arranged at an arbitrary angle with respect to the apparatus body.

FIG. 1 is a block diagram showing a configuration example of a video conference system.
FIG. 2 is a block diagram showing main internal configuration example (1) of a conference terminal.
FIG. 3 is an explanatory diagram of the sound source direction detection process.
FIG. 4 is a block diagram showing main internal configuration example (2) of a conference terminal.
FIG. 5 is an explanatory diagram of the imaging range of the camera when the microphone unit is at the reference placement position (angle).
FIG. 6 is an explanatory diagram showing a site: (A) an example in which the microphone unit is installed at the reference placement angle, (B) an example in which the microphone unit is not installed at the reference placement angle, and (C) an example showing the placement angle of the microphone unit after correction.
FIG. 7 is an explanatory diagram of speaker tracking by the camera: (A) a comparative example without an offset of the 0° direction, and (B) an example with an offset of the 0° direction.
FIG. 8 is a flowchart showing an example of speaker tracking control.

Hereinafter, the configuration according to the present invention will be described in detail based on the embodiment shown in FIGS. 1 to 8.

(Conference system configuration)
The configuration of a video conference system, which is one embodiment of the conference system according to the present invention, will be described.

FIG. 1 is a block diagram showing a configuration example of the video conference system 1. As shown in FIG. 1, the video conference system 1 includes a server 3 and a plurality of conference terminals 5 (5-1, 5-2, 5-3, 5-4, ...), which are connected via a network N such as the Internet. A server computer, a workstation, or the like can be used as the server 3. As the conference terminal 5, a dedicated conference terminal device (information processing apparatus) or a general-purpose information processing apparatus such as a personal computer can be used.

The server 3 monitors whether a communication connection is established with each conference terminal 5, calls the conference terminals 5 installed at the sites participating in a video conference (participating sites) when the conference starts, and, during the video conference, transfers the image data and audio data transmitted from the conference terminal 5 of each participating site that established a connection in response to the call to the conference terminals 5 of the other participating sites.

Each conference terminal 5 is installed in a conference room or the like at a remote site and is operated by the attendees of the video conference. During a video conference, the conference terminal 5 at each participating site transmits to the server 3 the image data of the attendees captured by the camera 112 (described later) and the audio data of the attendees collected by the microphone array 71. It also receives the image data and audio data transmitted from the conference terminals 5 at the other participating sites and transferred by the server 3, displays the image data as a conference screen on the display 120, and outputs the audio from the speaker 115.

For example, in a video conference in which the three conference terminals 5-1 to 5-3 shown in FIG. 1 participate, the image data and audio data transmitted from the conference terminal 5-1 are transferred under the control of the server 3 to the counterpart conference terminals 5-2 and 5-3, but not to the conference terminal 5-4. Similarly, the data transmitted from the conference terminal 5-2 are transferred to the conference terminals 5-1 and 5-3, and the data transmitted from the conference terminal 5-3 are transferred to the conference terminals 5-1 and 5-2; none of them is transferred to the conference terminal 5-4. In this way, in the video conference system 1, a video conference is held between participating sites at which two or more conference terminals 5 with an established communication connection to the server 3 are installed.
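The forwarding rule in this example can be sketched as follows. This is a minimal illustration only: the function name `forward_targets` and the use of string identifiers for terminals are assumptions made for the sketch and do not appear in the specification.

```python
def forward_targets(sender, participants):
    """Return the terminals that should receive data sent by `sender`.

    `participants` is the set of terminals taking part in the conference.
    Data from a participant goes to every other participant; a terminal
    outside the conference is never a source or a destination.
    """
    if sender not in participants:
        return []
    return sorted(t for t in participants if t != sender)


# Terminals 5-1 to 5-3 participate; terminal 5-4 does not.
participants = {"5-1", "5-2", "5-3"}
for sender in sorted(participants):
    print(sender, "->", forward_targets(sender, participants))
```

With this rule, data from terminal 5-1 goes to 5-2 and 5-3 only, which matches the behaviour described for the server 3.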

(Conference terminal configuration (1))
FIG. 2 is a block diagram showing main internal configuration example (1) of the conference terminal 5. As shown in FIG. 2, the microphone unit 70 of the conference terminal 5 is separate from the main body 50 of the conference terminal 5. In the example of FIG. 2, the microphone unit 70 is connected to the main body 50 by wire via a cable 70c.

The main body 50 of the conference terminal 5 includes: a CPU (Central Processing Unit) 101 that controls the overall operation of the conference terminal 5; a ROM (Read Only Memory) 102 that stores programs used to drive the CPU 101, such as an IPL (Initial Program Loader); a RAM (Random Access Memory) 103 used as a work area for the CPU 101; a flash memory 104 that stores various data such as the terminal program, image data, and audio data; an SSD (Solid State Drive) 105 that controls reading and writing of data from and to the flash memory 104 under the control of the CPU 101; a media drive 107 that controls reading and writing (storage) of data from and to a recording medium 106 such as a flash memory; an operation unit 108 operated, for example, to select a destination for the conference terminal 5; a power switch 109 for turning the power of the conference terminal 5 on and off; and a network I/F (Interface) 111 for transmitting data over the network N.

The operation unit 108 is implemented by input devices such as a keyboard, a mouse, a touch panel, and various switches, and outputs input data corresponding to the operation input to the CPU 101.

The network I/F 111 is used for data communication with external devices (for example, the server 3). It connects to the network N via a LAN and exchanges image data, audio data, and the like with the counterpart conference terminal 5 through the server 3. Depending on the connection mode, the network I/F 111 may be one that connects to Ethernet (registered trademark) with control conforming to 10Base-T, 100Base-TX, 1000Base-T, or the like (wired LAN), or one that performs control conforming to IEEE 802.11a/b/g/n (wireless LAN).

The main body 50 of the conference terminal 5 also includes: a built-in camera 112 that captures a subject under the control of the CPU 101 to obtain image data; an image sensor I/F 113 that controls driving of the camera 112; a built-in speaker 115 that outputs sound; an audio input/output I/F 116 that processes input and output of audio signals from the microphone array 71 of the microphone unit 70 and to the speaker 115 under the control of the CPU 101; a display I/F 117 that transmits image data to an external display 120 under the control of the CPU 101; an external device connection I/F 118 for connecting various external devices; and a bus line 110, such as an address bus and a data bus, that electrically connects the above components.

The camera 112 includes a lens and a solid-state image sensor that converts light into electric charge to digitize an image (video) of the subject. A CMOS (Complementary Metal Oxide Semiconductor) image sensor, a CCD (Charge Coupled Device) image sensor, or the like is used as the solid-state image sensor.

The camera 112 is used to input images of the conference attendees. It captures the scene in the conference room and outputs the generated image data to the CPU 101 as needed.

The speaker 115 is an audio output unit that outputs the audio data input from the CPU 101.

External devices such as an external camera and an external speaker can each be connected to the external device connection I/F 118 by a USB (Universal Serial Bus) cable or the like. For example, when an external camera is connected, the external camera may be operated in preference to the built-in camera 112 under the control of the CPU 101. Likewise, when an external speaker is connected, the external speaker may be driven in preference to the built-in speaker 115 under the control of the CPU 101.

Note that the recording medium 106 is detachable from the conference terminal 5. Any nonvolatile memory from and to which data is read and written under the control of the CPU 101 may be used in place of the flash memory 104, for example an EEPROM (Electrically Erasable and Programmable ROM).

Furthermore, the terminal program may be recorded as a file in an installable or executable format on a computer-readable recording medium, such as the recording medium 106, and distributed. The terminal program may also be stored in the ROM 102 instead of the flash memory 104.

The display 120 is a display unit, such as an LCD, an EL display, or a CRT display, that displays images of the subject, operation icons, and the like. It displays various screens, such as a conference screen showing the image data input from the CPU 101. The display 120 is connected to the display I/F 117 by a cable 120c. The cable 120c may be a cable for analog RGB (VGA) signals, a cable for component video, or a cable for HDMI (registered trademark) (High-Definition Multimedia Interface) or DVI (Digital Visual Interface) signals.

The microphone unit 70 is configured separately from the main body 50. The microphone unit 70 is a voice input unit having a microphone array 71 that includes a plurality of microphones 72 for inputting the voices of the conference attendees, and it transmits the audio data of the attendees input to the microphone array 71 to the main body 50 via the cable 70c.

The microphone array 71 is voice input means in which a plurality of microphones 72 for inputting the voices of the conference attendees are arranged, and it collects the voices of the attendees. Although this embodiment shows an example in which the microphone array 71 includes four microphones 72a to 72d, the number of microphones is not limited to four.

The CPU 101 comprehensively controls the operation of the conference terminal 5 by issuing instructions and transferring data to each part of the conference terminal 5, based on the image data input from the camera 112, the audio data input from the microphone array 71, the image data and audio data from the counterpart conference terminal 5 input through the network I/F 111, the input data from the operation unit 108, and the programs and data recorded in the flash memory 104 or the like. For example, after a communication connection with the server 3 is established in response to a call from the server 3, the CPU 101 repeatedly performs, in parallel, a process of transmitting the image data input from the camera 112 and the audio data input from the microphone array 71 to the server 3, and a process of receiving the image data and audio data from the counterpart conference terminal 5 transferred by the server 3.

Specifically, the CPU 101 encodes the image data input from the camera 112 and the audio data input from the microphone array 71 as they arrive during the video conference and outputs the encoded data to the network I/F 111, thereby transmitting them to the server 3. The CPU 101 performs encoding and decoding according to standards such as H.264/AVC and H.264/SVC.

In parallel with this, the CPU 101 receives, via the network I/F 111, the image data and audio data transmitted from the counterpart conference terminal 5 and transferred by the server 3. The CPU 101 has a codec function that decodes the received image data and audio data and sends them to the display 120 and the speaker 115, respectively. The images and audio input at the counterpart conference terminal 5 are thereby reproduced.

The CPU 101 also includes a sound source direction detection unit 130 that executes the sound source direction detection process based on the input from each microphone 72 of the microphone array 71.

FIG. 3 is an explanatory diagram of the sound source direction detection process executed by the sound source direction detection unit 130. The sound source direction detection process identifies the direction from which a voice was input based on the differences in the time at which sound from the source reaches each microphone 72 of the microphone array 71. For example, as shown in FIG. 3, when a voice from a person A, the sound source, is input to four microphones (microphone 1 to microphone 4), the input direction of the voice can be detected from the arrival time difference between microphone 1 and microphone 2 (Δt1), the arrival time difference between microphone 1 and microphone 3 (Δt2), and the arrival time difference between microphone 1 and microphone 4 (Δt3). A known or novel method can be applied as the sound source direction detection process.
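One way to turn such arrival time differences into a direction is a far-field least-squares fit, sketched below. The specification does not prescribe a method, so this is only one possible approach. It assumes a distant source (plane wavefront), known two-dimensional microphone coordinates, and a speed of sound of 343 m/s; the function names and the square microphone layout are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def simulate_delays(mic_positions, angle_deg, c=SPEED_OF_SOUND):
    """Arrival time differences relative to the first microphone for a
    distant source at bearing angle_deg (plane-wave model)."""
    ux, uy = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    x0, y0 = mic_positions[0]
    return [-((x - x0) * ux + (y - y0) * uy) / c for x, y in mic_positions[1:]]


def estimate_direction(mic_positions, delays, c=SPEED_OF_SOUND):
    """Estimate the source bearing in degrees from arrival time differences.

    mic_positions: [(x, y), ...] in metres; the first entry is the reference.
    delays: differences t(mic k) - t(reference) in seconds, one per other mic.
    Solves (p_k - p_ref) . u = -c * delay_k for the direction vector u
    by least squares (2x2 normal equations).
    """
    x0, y0 = mic_positions[0]
    sxx = sxy = syy = bx = by = 0.0
    for (x, y), dt in zip(mic_positions[1:], delays):
        dx, dy = x - x0, y - y0
        rhs = -c * dt
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
        bx += dx * rhs
        by += dy * rhs
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("microphones are collinear; bearing is ambiguous")
    ux = (syy * bx - sxy * by) / det
    uy = (sxx * by - sxy * bx) / det
    return math.degrees(math.atan2(uy, ux)) % 360.0


# Hypothetical layout: four microphones on a 10 cm square.
mics = [(0.05, 0.05), (-0.05, 0.05), (-0.05, -0.05), (0.05, -0.05)]
delays = simulate_delays(mics, 60.0)  # Δt1, Δt2, Δt3 relative to the first mic
print(estimate_direction(mics, delays))
```

In this sketch the three delays play the role of Δt1 to Δt3. With noise-free delays the fit recovers the simulated bearing up to floating-point error; measured delays would first have to be estimated from the microphone signals, for example by cross-correlation, which is outside this sketch.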

Although this embodiment describes an example in which the sound source direction detection process is executed by the CPU 101 of the main body 50, the microphone unit 70 may instead include a sound source direction processing unit (audio processing module) that executes the sound source direction detection process and notifies the CPU 101 of the detection result.

The CPU 101 also includes an imaging range control unit 131 that controls the imaging range of the camera 112 based on the detection result of the sound source direction detection unit 130. For example, the camera 112 is mounted so that its imaging direction can be turned, and the CPU 101 controls the turning based on the detected direction of the speaker. Alternatively, the camera 112 may be configured with a wide-angle lens so that all conference attendees fall within its field of view (angle of view), and the imaging range may be switched by digital processing based on the detected direction of the speaker.
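The digital variant can be pictured as choosing a crop window inside the wide-angle frame. The following is a minimal sketch under stated assumptions: an ideal rectilinear (pinhole) lens, a speaker direction already expressed relative to the camera's optical axis, and hypothetical parameter names. Converting an angle measured at the microphone unit into a camera-relative angle is not covered here.

```python
import math


def crop_window(angle_deg, frame_width, hfov_deg, crop_width):
    """Return (left, right) pixel columns of a crop centred on angle_deg.

    angle_deg: speaker direction from the optical axis, positive to the right.
    hfov_deg: horizontal field of view of the lens, below 180 degrees.
    Assumes an ideal pinhole projection; a real wide-angle lens would need
    distortion correction before this mapping applies.
    """
    half = hfov_deg / 2.0
    angle = max(-half, min(half, angle_deg))  # clamp to the field of view
    focal_px = (frame_width / 2.0) / math.tan(math.radians(half))
    center = frame_width / 2.0 + focal_px * math.tan(math.radians(angle))
    left = int(round(center - crop_width / 2.0))
    left = max(0, min(frame_width - crop_width, left))  # keep the window in frame
    return left, left + crop_width


# Speaker straight ahead in a 1920-pixel-wide frame with a 120 degree lens.
print(crop_window(0.0, 1920, 120.0, 640))
```

Clamping the window to the frame means a speaker near the edge of the field of view is shown off-centre rather than cropped out of the image.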

(Conference terminal configuration (2))
FIG. 4 is a block diagram showing main internal configuration example (2) of the conference terminal 5. As shown in FIG. 4, the microphone unit 70 of the conference terminal 5 is separate from the main body 50 of the conference terminal 5. In the example of FIG. 4, the microphone unit 70 is connected to the main body 50 by wireless communication.

The microphone unit 70 and the main body 50 include a wireless communication unit 73 and a wireless communication unit 114, respectively, and audio data is transmitted from the wireless communication unit 73 to the wireless communication unit 114. The other components are the same as in FIG. 2, so their description is omitted. The microphone unit 70 and the main body 50 may communicate using a known short-range wireless communication method such as Bluetooth (registered trademark) or NFC, or a novel short-range wireless communication method; the wireless communication method is not particularly limited.

(Speaker tracking control)
FIG. 5 is an explanatory diagram of the imaging range of the camera 112 in the main body 50 of the conference terminal 5 when the microphone unit 70 is at the reference placement position (angle).

FIG. 5 is a top view showing the main body 50 of the conference terminal 5 placed at one end of a table 90 and the microphone unit 70, which is wirelessly connected to the main body 50, placed at the center of the table 90.

The reference position of the microphone unit 70 is defined, for example, as follows. The microphone unit 70 is provided with a reference mark to be aligned toward the main body 50, and the orientation in which this mark faces the front of the main body 50 is taken as the reference placement angle (0°). In the case of a wired microphone unit 70, the installation position obtained when the cable 70c is pulled straight out to the front of the main body 50 is normally the reference placement angle (0°).

The sound source direction detection unit 130 performs sound source direction detection processing on the sound input to each microphone 72 of the microphone unit 70, and thereby detects the angle of the sound source (the speaker) relative to the microphone unit 70. The imaging range control unit 131 then controls the orientation of the camera 112 (or the region cropped from its image) and its focus according to the angle detected by the sound source direction detection unit 130. This embodiment describes an example that determines which of four focus areas the detected angle falls in, but the number of angular divisions detected by the sound source direction detection unit 130 is not limited to four; increasing the number of detectable divisions naturally allows more precise speaker tracking.

In FIG. 5, for example, the following four focus areas are set according to the detected angle, and the imaging range control unit 131 points the camera 112 at the corresponding focus area and brings it into focus.
First focus area (focus on the left area): 45° to 135°
Second focus area (focus on the front): 135° to 225°
Third focus area (focus on the right area): 225° to 315°
Fourth focus area (no focus): 315° to 45°
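The mapping from a detected angle to one of the four focus areas can be sketched as follows. This is only an illustration: the function name `focus_area` and the returned labels are hypothetical, and the choice of which area owns each boundary angle (45°, 135°, 225°, 315°) is an assumption, since the ranges above share their end points.

```python
def focus_area(angle_deg: float) -> str:
    """Classify a detected angle into one of the four focus areas of FIG. 5.

    Assumption: each range includes its lower bound and excludes its upper
    bound. The source does not say which area owns a boundary angle.
    """
    a = angle_deg % 360.0  # normalize into [0, 360)
    if 45.0 <= a < 135.0:
        return "left"    # first focus area
    if 135.0 <= a < 225.0:
        return "front"   # second focus area
    if 225.0 <= a < 315.0:
        return "right"   # third focus area
    return "none"        # fourth focus area: 315° to 45°, wraps through 0°
```

The fourth area is handled as the fall-through case because it is the only range that crosses 0°.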

However, FIG. 5 assumes that the microphone unit 70 is installed at the reference arrangement angle. If the arrangement position or arrangement angle of the microphone unit 70 deviates from this, the speaker may not be tracked correctly.

The information processing apparatus (conference terminal 5) according to the present embodiment therefore comprises a first device (microphone unit 70) and a second device (main body 50). The first device has voice input means (microphone array 71) made up of a plurality of voice input units (microphones 72). The second device has voice output means (speaker 115) that outputs sound, imaging means (camera 112) that images a predetermined range, sound source direction detection means (sound source direction detection unit 130) that detects the direction from which sound enters the voice input means based on the sound input to the plurality of voice input units, and imaging range control means (imaging range control unit 131) that controls the imaging range of the imaging means according to the detection result of the sound source direction detection means. The sound source direction detection means detects position information of the first device relative to the second device based on sound that is output from the voice output means and input to the voice input means, and the imaging range control means controls the imaging range based on the detection result and the position information. The terms in parentheses are the reference numerals and application examples used in the embodiment.

In other words, the arrangement angle of the microphone unit 70 relative to the main body 50 is detected as needed and correction control is performed based on the detected angle, so that speaker tracking control by the camera 112 can be executed accurately even when the microphone unit 70 is installed at an arbitrary angle and an arbitrary position.

Speaker tracking control by the conference terminal 5 according to the present embodiment is described with reference to FIG. 6. FIG. 6 is an explanatory diagram showing one site, in which the main body 50 of the conference terminal 5 is placed at one end of the table 90 and the microphone unit 70 is placed in the center of the table 90. Conference participants A to F are positioned around the table 90.

FIG. 6(A) shows the microphone unit 70 installed at the reference arrangement angle (0°). In this embodiment, the 0° direction of the microphone unit 70 points toward the main body 50, and angles are measured clockwise. The reference arrangement angle only needs to be set in advance between the microphone unit 70 and the main body 50, and is of course not limited to the example of FIG. 6(A).

The conference terminal 5 according to the present embodiment enables speaker tracking even when the microphone unit 70 is not installed at the reference arrangement position. Consider, for example, the case shown in FIG. 6(B), in which the microphone unit 70 is installed with its 0° direction not pointing toward the main body 50. In FIG. 6(B), the main body 50 lies in the 90° direction of the microphone unit 70.

When the microphone unit 70 is correctly installed at the 0° position shown in FIG. 6(A), speaker tracking is achieved by detecting the direction of the sound source input to the microphone array 71 through sound source direction detection processing and controlling the camera 112 based on the detected angle.

However, when the microphone unit 70 is installed at an angle other than the reference angle, as in FIG. 6(B), the detection result of the sound source direction detection unit 130 differs from the actual sound source direction. The imaging range control unit 131 controls the camera 112 based on the detected angle, but because it does so relative to the reference angle set in advance between the microphone unit 70 and the main body 50, speaker tracking control fails.

The conference terminal 5 according to the present embodiment therefore first causes a predetermined sound output from the speaker 115 of the main body 50 to be input to the microphone array 71. The angle detected when sound source direction detection processing is executed on this input (that is, the direction of the main body 50 as seen from the microphone array 71) is then taken as the reference angle (0°), and the angle scale is offset accordingly.

That is, as shown in FIG. 6(C), the microphone unit 70 is actually arranged as in FIG. 6(B), with its 90° direction facing the main body 50; this angle (90°) is offset so that it becomes 0°, and angles are measured clockwise from it. Thereafter, the voices of participants A to F are detected with this offset 0° direction as the reference.

FIG. 7 is an explanatory diagram of speaker tracking by the camera 112 of the conference terminal 5. FIG. 7(A) illustrates a comparative example in which the 0° direction is not offset, and FIG. 7(B) illustrates the embodiment, in which the 0° direction is offset.

For example, as shown in FIG. 7(A), suppose the detection result of the sound source direction detection unit 130 is in the range of 180° to 225° and the actual speaker is participant B. With the angles before offsetting, the range of 180° to 225° is shifted by 90° from its true position, so if the imaging range is controlled using this position information as is, an area different from where the speaker actually is becomes the focus area.

By contrast, as described with reference to FIG. 6(C), using the angles after the 0° direction has been offset allows the range of 180° to 225° to be detected as the range of 90° to 135°, so that the area where the speaker actually is can be set correctly as the focus area based on the detected position information.
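The arithmetic behind this correction can be sketched in a few lines. The function name `apply_offset` and its parameter names are hypothetical; the sketch assumes that angles are measured clockwise in degrees and that the offset is the direction of the main body 50 as detected by the microphone unit 70, as in FIG. 6(B).

```python
def apply_offset(raw_deg: float, body_dir_deg: float) -> float:
    """Re-express a raw detected angle relative to the main body direction.

    raw_deg:      angle reported on the microphone unit's own scale
    body_dir_deg: direction of the main body on that same scale
                  (90 in the FIG. 6(B) example)
    Returns the angle on the offset scale, where the main body lies at 0°.
    """
    return (raw_deg - body_dir_deg) % 360.0


# FIG. 7 example: the main body is detected at 90°, so the raw range
# 180° to 225° corresponds to 90° to 135° on the offset scale.
low = apply_offset(180.0, 90.0)
high = apply_offset(225.0, 90.0)
```

The modulo keeps the result in the range 0° to 360° even when the raw angle is smaller than the offset.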

The speaker tracking control described so far is now explained with reference to the flowchart of FIG. 8. FIG. 8 is a flowchart illustrating an example of speaker tracking control by the conference terminal 5 according to the present embodiment.

First, a conference is started when the conference start button of the conference terminal 5 is pressed or when a conference call arrives from the conference terminal 5 of the other party (S101).

At the start of the conference, the conference terminal 5 causes the speaker 115 to output the outgoing call tone for calling the other party's conference terminal 5, or the incoming call tone for a call from the other party's conference terminal 5 (S102). This tone is, for example, the ringing tone of the conference terminal 5 played at the start of a conference.

When the outgoing or incoming call tone output from the speaker 115 is input to each microphone 72 of the microphone array 71 of the microphone unit 70 (S103), the sound source direction detection unit 130 detects the direction of the sound source, that is, the direction (angle) of the speaker 115 of the main body 50 relative to the microphone unit 70, through sound source direction detection processing (S104).

Next, the sound source direction detection unit 130 calculates how far the direction from which the sound was input, that is, the detected angle, deviates from the specified reference angle (0°) (S105).

The angle calculated in S105 is then used as the reference angle for speaker tracking control. That is, the 0° position of the microphone unit 70 as placed is offset by the angle calculated in S105 (S106).

After the installation angle of the microphone unit 70 at the start of the conference has been corrected through S101 to S106, the actual conference begins.

Once the conference has started, the microphone unit 70 waits for sound input (S107). When sound is input to the microphone unit 70, it is determined whether the sound was output from the speaker 115 of the conference terminal 5 (that is, sound originating at the connected conference terminal 5) or was produced at the local site (that is, speech by a nearby conference participant) (S108).

The determination in S108 is made by the CPU 101 acting as determination means. For example, the CPU 101 determines whether the sound was input to the microphone unit 70 while audio data received from the other party's conference terminal 5 was being output as sound, and thereby determines whether the input is sound output from the speaker 115 of the conference terminal 5 or speech by a conference participant. Alternatively, the determination may be based on whether a predetermined process that runs while the speaker 115 is outputting sound (such as echo canceller processing) is currently running.

When the sound input to the microphone unit 70 is sound output from the speaker 115 of the conference terminal 5 (S108: YES), the sound source direction detection unit 130 again detects the direction of the sound source through sound source direction detection processing (S109; the same processing as S104).

It is then determined whether the detected direction of the main body 50 relative to the microphone unit 70 is the same as the previously detected value (S110). If it is the same (S110: YES), the arrangement position of the microphone unit 70 has not changed, so no new correction value is calculated and the process returns to waiting for sound input to the microphone unit 70 (S107).

If the direction differs from the previously detected value (S110: NO), the arrangement position of the microphone unit 70 has changed during the conference. The process therefore returns to S105, the deviation between the detected angle and the reference angle is calculated again, and the 0° position of the microphone unit 70 is offset based on the result (S105, S106). The process then again waits for sound input to the microphone unit 70 (S107).

By recalculating the angle as needed whenever sound output from the speaker 115 is input to the microphone unit 70 during the conference, the terminal can follow changes in the arrangement position of the microphone unit 70 that occur during the conference. The determination in S110 need not be executed every time sound output from the speaker 115 is input to the microphone unit 70; it may instead be executed with the elapse of a predetermined time as an additional condition.
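One way to combine the S110 comparison with the optional elapsed-time condition is sketched below. The class `OffsetTracker`, its method, and both thresholds are hypothetical. In particular, the source only says "the same direction"; treating directions within a small tolerance as the same is an assumption made here to absorb measurement jitter.

```python
def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


class OffsetTracker:
    """Keeps the main body direction used as the 0° reference (S105, S106)."""

    def __init__(self, tolerance_deg: float = 10.0, min_interval_s: float = 5.0):
        self.tolerance_deg = tolerance_deg    # assumed jitter tolerance
        self.min_interval_s = min_interval_s  # assumed "predetermined time"
        self.offset_deg = None
        self.last_check_s = None

    def on_loudspeaker_sound(self, detected_deg: float, now_s: float) -> bool:
        """Handle a direction detected from loudspeaker output (S109).

        Returns True when the offset was set or updated.
        """
        if self.offset_deg is None:
            # Initial calibration at the start of the conference.
            self.offset_deg = detected_deg % 360.0
            self.last_check_s = now_s
            return True
        if now_s - self.last_check_s < self.min_interval_s:
            return False  # skip S110 until the interval has elapsed
        self.last_check_s = now_s
        if angular_distance(detected_deg, self.offset_deg) <= self.tolerance_deg:
            return False  # S110: YES, microphone unit has not moved
        self.offset_deg = detected_deg % 360.0  # S110: NO, redo S105 and S106
        return True
```

Passing the current time in as `now_s` rather than reading a clock inside the method keeps the sketch deterministic and easy to exercise.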

In contrast, when the sound input to the microphone unit 70 in S108 is sound produced at the local site (S108: NO), the direction of the sound source input to the microphone unit 70 (that is, the direction of the speaker) is detected based on the 0° position that was offset in S106 (S111).

The imaging range control unit 131 then controls the camera 112 so that it tracks toward the detected direction of the speaker, and captures an image (S112). Until conference termination processing is performed (S113: YES), the process returns to S107 and waits for sound input to the microphone unit 70 (S113: NO).
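The branch taken at S108 for each sound input during the conference can be summarized as a single handler. This is a simplified sketch, not the flowchart of FIG. 8 itself: `TrackerState` and `handle_sound` are hypothetical names, the flag `from_loudspeaker` stands in for the S108 determination made by the CPU 101, and the returned angle stands in for what would be passed to the imaging range control unit 131 in S112.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrackerState:
    # Direction of the main body on the microphone unit's own scale,
    # set by the calibration at the start of the conference (S104 to S106).
    offset_deg: float = 0.0


def handle_sound(state: TrackerState, detected_deg: float,
                 from_loudspeaker: bool) -> Optional[float]:
    """Process one sound input received while waiting in S107.

    Returns the speaker direction on the offset scale (S111), or None
    when the sound came from the loudspeaker and was used only to check
    the position of the microphone unit.
    """
    if from_loudspeaker:                      # S108: YES
        detected = detected_deg % 360.0       # S109
        if detected != state.offset_deg:      # S110: NO
            state.offset_deg = detected       # S105, S106
        return None
    # S108: NO, a participant at the local site is speaking
    return (detected_deg - state.offset_deg) % 360.0  # S111
```

For example, after a loudspeaker sound detected at 90°, a participant detected at a raw angle of 200° is reported at 110° on the offset scale.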

In S102, the positional relationship between the microphone unit 70 and the main body 50 at the start of the conference is detected based on the outgoing or incoming call tone, but the positional relationship need not be detected from these tones. Any sound output from the main body 50 may be used to detect the positional relationship at the start of the conference. However, participants are unlikely to speak near the microphone unit 70 while the outgoing or incoming call tone is sounding, so detecting the position of the microphone unit 70 from these tones at the start of the conference is preferable.

According to the conference terminal of the present embodiment described above, speaker tracking can be achieved even when the voice input unit is arranged at an arbitrary angle relative to the apparatus main body.

That is, even in a configuration in which the microphone unit 70 having the microphone array 71 is provided separately from the main body 50 of the conference terminal 5, which has the speaker 115 and the camera 112, and is connected to it by wire or wirelessly, the positional relationship between the microphone unit 70 (microphone array 71) and the main body 50 is detected from the sound source direction detection result obtained when sound output from the speaker 115 of the main body 50 is input to the microphone array 71, and the arrangement angle of the microphone unit 70 is offset based on the detected angle. As a result, in a conference terminal 5 with a speaker tracking function, the microphone unit 70 can be placed at an arbitrary angle and position relative to the main body 50, with no restriction on arrangement position or arrangement angle. In other words, the speaker tracking function works accurately even when the microphone unit 70 is placed at an arbitrary position.

The embodiment described above is a preferred example of the present invention, but the invention is not limited to it, and various modifications can be made without departing from the gist of the invention.

For example, it is also preferable to apply person recognition technology to the image data of the camera 112 to detect a person (speaker) in the image data, and to zoom the camera 112 in on the detected speaker. The person recognition technology may be any known or new technology and is not particularly limited.

DESCRIPTION OF SYMBOLS
1 Video conference system
3 Server
5 Conference terminal
50 Main body
70 Microphone unit
70c Cable
71 Microphone array
72, 72a to 72d Microphone
73 Wireless communication unit
90 Table
101 CPU
102 ROM
103 RAM
104 Flash memory
105 SSD
106 Recording medium
107 Media drive
108 Operation unit
109 Power switch
110 Bus line
111 Network I/F
112 Camera
113 Image sensor I/F
114 Wireless communication unit
115 Speaker
116 Audio input/output I/F
117 Display I/F
118 External device connection I/F
120 Display
120c Cable
130 Sound source direction detection unit
131 Imaging range control unit
N Network

Japanese Unexamined Patent Application Publication No. H10-227849 (JP-A-10-227849)

Claims

An information processing apparatus comprising:
a first device having voice input means including a plurality of voice input units; and
a second device having
voice output means for outputting sound,
imaging means for imaging a predetermined range,
sound source direction detection means for detecting the input direction of sound to the voice input means based on sound input to the plurality of voice input units, and
imaging range control means for controlling the imaging range of the imaging means according to a detection result of the sound source direction detection means,
wherein the sound source direction detection means detects position information of the first device relative to the second device based on sound that is output from the voice output means and input to the voice input means, and
the imaging range control means controls the imaging range based on the detection result and the position information.

An information processing apparatus comprising:
a first device having
voice input means including a plurality of voice input units, and
sound source direction detection means for detecting the input direction of sound to the voice input means based on sound input to the plurality of voice input units; and
a second device having
voice output means for outputting sound,
imaging means for imaging a predetermined range, and
imaging range control means for controlling the imaging range of the imaging means according to a detection result of the sound source direction detection means,
wherein the sound source direction detection means detects position information of the first device relative to the second device based on sound that is output from the voice output means and input to the voice input means, and transmits the position information to the second device, and
the imaging range control means controls the imaging range based on the detection result and the position information.

The information processing apparatus according to claim 1 or 2, wherein the sound source direction detection means detects the position information based on a predetermined sound that is output from the voice output means and input to the voice input means when the information processing apparatus starts communication with another information processing apparatus.

The information processing apparatus according to claim 3, wherein the predetermined sound is an outgoing call tone produced when the information processing apparatus calls the other information processing apparatus, or an incoming call tone produced when the other information processing apparatus calls the information processing apparatus.

The information processing apparatus according to any one of claims 1 to 4, wherein
a reference angle of the first device relative to the second device is set in advance,
the sound source direction detection means detects an installation angle of the first device relative to the second device based on sound that is output from the voice output means and input to the voice input means, and corrects the reference angle based on the amount of deviation between the installation angle and the reference angle, and
the imaging range control means controls the imaging range using the corrected reference angle.

The information processing apparatus according to any one of claims 1 to 5, wherein the second device includes determination means for determining whether sound input to the voice input means results from output by the voice output means or from another source.

The information processing apparatus according to any one of claims 1 to 6, wherein, even after the information processing apparatus has started communication with another information processing apparatus, the sound source direction detection means detects the position information based on sound that is output from the voice output means and input to the voice input means, and updates the position information.

The information processing apparatus according to any one of claims 1 to 7, wherein the first device and the second device are connected by wire or wirelessly.

A conference system comprising the information processing apparatus according to any one of claims 1 to 8 as at least one of a plurality of conference terminals, wherein audio data and image data are transmitted and received between the conference terminals.

A method for controlling an information processing apparatus that comprises
a first device having voice input means including a plurality of voice input units, and
a second device having voice output means for outputting sound and imaging means for imaging a predetermined range,
the method comprising:
a sound source direction detection step of detecting the input direction of sound to the voice input means based on sound input to the plurality of voice input units; and
an imaging range control step of controlling the imaging range of the imaging means according to a detection result of the sound source direction detection step,
wherein, in the sound source direction detection step, position information of the first device relative to the second device is detected based on sound that is output from the voice output means and input to the voice input means, and
in the imaging range control step, the imaging range is controlled based on the detection result and the position information.