JP2005110103A

JP2005110103A - Voice normalizing method in video conference

Info

Publication number: JP2005110103A
Application number: JP2003343093A
Authority: JP
Inventors: Yoshikazu Inoue; 美和井上
Original assignee: Kyushu Electronics Systems Inc
Current assignee: Kyushu Electronics Systems Inc
Priority date: 2003-10-01
Filing date: 2003-10-01
Publication date: 2005-04-21

Abstract

PROBLEM TO BE SOLVED: To provide a voice normalizing method that lets a video conference be held with a sense of reality by matching the direction of each voice to the displayed content.

SOLUTION: In a video telephone call or video conference over a network, when video images of participants at a plurality of stations are displayed on a display device 10, the audio signals transmitted from the stations are mixed so that each is localized at its display position on the display device 10, and are reproduced in stereo. That is, the stereo function is used to adjust the balance with which the audio data streams from the network are converted back into sound and mixed, according to the display positions of the video images. Each voice is thus heard from the position corresponding to its video image, so that a realistic video conference can be realized.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a sound localization method for video conferences (including videophone calls) that use a network such as an in-house LAN or the Internet.

In a videophone call or video conference, several parties exchange video signals captured by cameras and audio signals picked up by microphones over a network such as a LAN (Local Area Network) or a broadband Internet connection, and converse or hold a meeting while seeing each other's faces. A video conference in particular is not limited to one-to-one calls: three or more people may take part, and the screen is then usually split into three or more sub-screens so that the faces of all participants are visible.

Recently, large-screen display devices such as plasma displays and liquid crystal displays have come into use. With a small screen, monaural audio posed no problem, but with a large screen there is a demand for stereo audio output to give a sense of presence.

JP-A-5-3567 (Patent Document 1) discloses a video communication apparatus that has audio input units and audio output units, together with a multiplexer/demultiplexer. The multiplexer/demultiplexer incorporates each piece of audio data from the audio input units into a communication frame for transmission, and separates the plural pieces of audio data contained in a received communication frame into individual audio data, which it passes to the audio output units. Because stereo audio is output, the sense of presence in the video conference is enhanced.

JP-A-10-84539 (Patent Document 2) discloses a stereo-audio video conference apparatus in which a video camera and a microphone are mounted on a pan head. A sound field generator produces a stereo audio signal from the monaural audio signal transferred from the remote video conference apparatus and decoded by an audio decoder, based on pan-head direction information indicating the direction of the pan head. The resulting stereo signal is reproduced in stereo by left and right speakers.

JP-A-2002-58005 (Patent Document 3) discloses a videophone/video conference system in which the transmitting apparatus sends the sum of the L (left) and R (right) channel audio signals as monaural audio on a first communication channel and sends the difference of the two signals as non-standard audio on a second communication channel. The receiving apparatus receives the sum as monaural audio and the difference as non-standard audio, and restores the original audio by computation on the two received signals.
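The sum/difference scheme described for Patent Document 3 can be illustrated with a minimal Python sketch. The function names, the use of plain sample lists, and the absence of any scaling on the transmitted channels are assumptions made for illustration; the text above does not give the cited document's actual signal format.

```python
def encode_sum_diff(left, right):
    """Return (sum, difference) channels from L and R sample lists."""
    total = [l + r for l, r in zip(left, right)]   # playable as plain mono
    diff = [l - r for l, r in zip(left, right)]    # extra "non-standard" channel
    return total, diff


def decode_sum_diff(total, diff):
    """Recover (left, right) from the sum and difference channels."""
    left = [(s + d) / 2 for s, d in zip(total, diff)]
    right = [(s - d) / 2 for s, d in zip(total, diff)]
    return left, right
```

A receiver that understands only the first channel still gets usable mono audio, while a receiver that has both channels can rebuild L and R exactly.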

In all of the above prior art, the audio of the participants in a multi-party videophone call or video conference is reproduced as stereo audio.
For example, when there are five participants (five stations), five sub-screens are displayed on the display device 10, as shown in FIG. 3(a). As shown in FIG. 3(b), the audio data from the stations S1 to S5 is mixed in equal proportions by the network interface 11 and restored to a stereo signal by the reproducing means 12. However, as shown in FIG. 3(c), every voice is reproduced without regard to the position or size of its sub-screen.

Patent Document 1: JP-A-5-3567
Patent Document 2: JP-A-10-84539
Patent Document 3: JP-A-2002-58005

As described above, the video conference systems disclosed in Patent Documents 1 to 3 reproduce the audio data transmitted over the network from multiple locations equally, regardless of the position or size of each sub-screen. A listener therefore cannot tell whether a remark came from the person displayed on the right side of the multi-screen or from the person on the left without watching the mouth movements in the video.
Likewise, for background noises and the like, unless the source of the sound is visible on screen, the listener cannot tell which sub-screen's party produced it.

An object of the present invention is to provide a sound localization method in which the audio of video on the right side of the screen is heard from the right and the audio of video on the left side is heard from the left, so that the direction of each sound matches the displayed content and a realistic video conference can be held.

A first configuration of the present invention is a sound localization method for video conferences characterized in that, when video images of participants at a plurality of stations are displayed on a display device in a videophone call or video conference over a network, the audio signals transmitted from the stations are mixed so that the position of each matches its display position on the display device, and are reproduced in stereo.
A second configuration of the present invention is characterized in that the positions of the audio signals are mixed and reproduced using a surround function.

In the present invention, in a video conference system using an in-house LAN or a broadband connection, when several remote parties are displayed, the mixing levels are adjusted according to their display positions, and a stereo or surround effect is used so that the position from which each voice is actually heard matches its display position.
In addition, raising the volume for a party displayed large and lowering it for a party displayed small gives the audio a sense of distance and makes the voice the listener wants to hear easier to follow.

When the audio data received from the network is converted back into sound, the stereo L and R channels are not reproduced at equal levels. Instead, the playback levels of the L and R channels are adjusted based on the position information of the video currently shown on the display device.
Because each video occupies a different position, each voice can be localized according to that position.
In addition, a sense of distance is created by using the display size information and making the audio level proportional to it.
When a video conference is held in this way, the remarks of people at different locations sound as if they come from the positions at which those people appear on the display device.
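The mapping from display geometry to channel levels described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Tile` type, the pixel units, the linear pan law, and the choice of a reference area for the size-proportional level are all hypothetical, since the description specifies no particular formula.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    """One participant's sub-screen (hypothetical type, pixel units)."""
    x: float        # left edge
    width: float
    height: float


def pan_gains(tile, display_width):
    """Map the tile's horizontal centre to (left_gain, right_gain).

    Linear pan law (an assumption): 0 = far left, 1 = far right.
    """
    centre = tile.x + tile.width / 2.0
    p = min(max(centre / display_width, 0.0), 1.0)
    return 1.0 - p, p


def size_level(tile, reference_area):
    """Audio level proportional to the displayed area."""
    return (tile.width * tile.height) / reference_area


def channel_gains(tile, display_width, reference_area):
    """Combine position-based panning with the size-based level."""
    left, right = pan_gains(tile, display_width)
    level = size_level(tile, reference_area)
    return left * level, right * level
```

With a linear law, a tile whose centre lies a quarter of the way across the display gets an L:R split of 0.75:0.25. A fully one-sided split would need either a tile centred on the display edge or a steeper pan law.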

According to the present invention, the stereo function is used to adjust, according to the display position of each video image, the balance applied when the audio data streams from the network are converted back into sound and mixed. Each voice is thus heard from the position corresponding to its video image, so that a realistic video conference can be realized.

Embodiments of the present invention are described below with reference to the drawings.
(First embodiment)
FIG. 1 shows a first embodiment of the present invention: (a) shows the multi-image display on the display device, (b) is a flowchart of the audio mixing, and (c) is an explanatory diagram showing the localization of each station during stereo reproduction.

This embodiment shows a video conference system with five locations (stations). Two sub-screens are arranged in the upper part of the display device 10 (split about the center) and three in the lower part (center, left, and right).
The audio data (stereo) of each of the stations S1 to S5 is transmitted over the network via the network interface 11. The stereo signals restored from the respective audio data are mixed, as shown at 12 in FIG. 1(b), according to the display position information and display size information held by the driver of the display device 10.

For example, setting the L:R ratio of the audio to 7:3 for the upper-left sub-screen, 3:7 for the upper-right, 10:0 for the lower-left, 5:5 for the lower-center, and 0:10 for the lower-right makes each voice sound as if it comes from the position of its sub-screen.
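The ratio table in this example can be applied as in the following Python sketch. It assumes one monaural sample list per sub-screen and treats each ratio as a pair of amplitude gains normalized to sum to 1. The dictionary keys and the function name are hypothetical, and the description does not say whether the ratios refer to amplitude or power.

```python
# L:R ratios stated in the example, keyed by sub-screen position.
LR_RATIOS = {
    "upper_left": (7, 3),
    "upper_right": (3, 7),
    "lower_left": (10, 0),
    "lower_center": (5, 5),
    "lower_right": (0, 10),
}


def mix_to_stereo(mono_by_tile, ratios=LR_RATIOS):
    """Mix one mono sample list per sub-screen into (left, right) lists."""
    if not mono_by_tile:
        return [], []
    length = min(len(samples) for samples in mono_by_tile.values())
    left = [0.0] * length
    right = [0.0] * length
    for tile, samples in mono_by_tile.items():
        l_part, r_part = ratios[tile]
        total = l_part + r_part
        l_gain, r_gain = l_part / total, r_part / total
        for i in range(length):
            left[i] += samples[i] * l_gain
            right[i] += samples[i] * r_gain
    return left, right
```

For instance, a signal from the lower-left sub-screen ends up entirely in the left channel, while one from the lower-center sub-screen is split evenly between the two.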

Furthermore, by making the upper sub-screens larger and the lower ones smaller, and setting the mixing levels to 3:2 accordingly, the parties displayed in the upper part feel closer, as shown in FIG. 1(c).
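As a rough illustration of the 3:2 level setting, the Python sketch below converts the row weights into relative gains and expresses the difference in decibels. Treating the ratio as an amplitude ratio and normalizing so the louder row plays at unity gain are assumptions; the names are hypothetical.

```python
import math

# Mixing levels for the upper and lower rows, from the 3:2 ratio in the text.
ROW_LEVELS = {"upper": 3, "lower": 2}


def row_gain(row, levels=ROW_LEVELS):
    """Relative amplitude gain, scaled so the loudest row plays at 1.0."""
    return levels[row] / max(levels.values())


def gain_to_db(gain):
    """Express an amplitude gain in decibels."""
    return 20.0 * math.log10(gain)
```

Under these assumptions the lower row plays at two thirds of the upper row's amplitude, which is about 3.5 dB quieter.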

The audio data of each of the stations S1 to S5 is basically a stereo signal. Even a monaural signal, however, can be distributed according to the position and size of its sub-screen, so that within the video conference system each of the stations S1 to S5 becomes a sound source whose localization is consistent with the position of its video.

(Second embodiment)
FIG. 2 is a layout diagram showing a second embodiment of the present invention. In this embodiment, a plurality of display devices 21, 22, and 23 are arranged to the left, to the right, and in front. For surround reproduction, a front speaker L31 and a front speaker R32 are placed at the front left and right, a center speaker 33 at the center, and a rear speaker L34 and a rear speaker R35 behind the attendees, so that sound is localized in the front-rear direction as well as left-right.

In this case, the output distribution among the speakers is determined by calculation from the positions of the display devices 21, 22, and 23, the positions of the speakers 31 to 35, and the positions of the attendees.
When video from several locations is displayed on any of the display devices 21, 22, and 23, that information is also factored into the output distribution.
There are various surround systems, and any of them may be used.
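One possible way to compute such an output distribution is sketched below in Python. It is not the method of the embodiment, which leaves the calculation open: the 2-D room coordinates, the cosine-based closeness weighting, the `sharpness` parameter, and all names are assumptions chosen to keep the example short.

```python
import math


def direction(listener, point):
    """Angle (radians) of `point` as seen from `listener` in 2-D room coordinates."""
    return math.atan2(point[1] - listener[1], point[0] - listener[0])


def surround_gains(listener, source, speakers, sharpness=4.0):
    """Distribute a source over speakers by angular closeness to its direction.

    `speakers` maps a speaker name to its (x, y) position. Gains sum to 1.
    """
    src = direction(listener, source)
    weights = {}
    for name, pos in speakers.items():
        diff = direction(listener, pos) - src
        # cos(diff) is 1 when aligned and -1 when opposite; map to [0, 1].
        closeness = (1.0 + math.cos(diff)) / 2.0
        weights[name] = closeness ** sharpness
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no speaker lies in a usable direction")
    return {name: w / total for name, w in weights.items()}


# Example layout: listener at the origin, front of the room along +y.
SPEAKERS = {
    "front_L31": (-1.0, 1.0),
    "front_R32": (1.0, 1.0),
    "center_33": (0.0, 1.0),
    "rear_L34": (-1.0, -1.0),
    "rear_R35": (1.0, -1.0),
}
```

With this layout, a source on a display straight ahead of the listener is carried mainly by the center speaker, and a source on a display to the listener's left is shared equally between the front-left and rear-left speakers.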

The present invention can be used in video conference systems in which each voice is heard from the position corresponding to its video image, so that a realistic video conference can be realized.

FIG. 1 shows a first embodiment of the present invention: (a) shows the multi-image display on the display device, (b) is a flowchart of the audio mixing, and (c) is an explanatory diagram showing the localization of each station during stereo reproduction.
FIG. 2 is a layout diagram showing a second embodiment of the present invention.
FIG. 3 shows a conventional video conference system: (a) shows the multi-image display on the display device, (b) is a flowchart of the audio mixing, and (c) is an explanatory diagram showing the localization of each station during stereo reproduction.

Explanation of symbols

10 Display device
11 Network interface
21 to 23 Display devices
31 to 35 Speakers

Claims

1. A sound localization method for video conferences, characterized in that, when video images of participants at a plurality of stations are displayed on a display device in a videophone call or video conference over a network, the audio signals transmitted from the stations are mixed so that the position of each matches its display position on the display device, and are reproduced in stereo.

2. The sound localization method for video conferences according to claim 1, characterized in that the positions of the audio signals are mixed and reproduced using a surround function.