JP3059022B2

JP3059022B2 - Video display device

Info

Publication number: JP3059022B2
Application number: JP5135755A
Authority: JP
Inventors: 憲治坂本
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1993-06-07
Filing date: 1993-06-07
Publication date: 2000-07-04
Anticipated expiration: 2015-07-04
Also published as: JPH06348811A

Description

【発明の詳細な説明】DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【産業上の利用分野】本発明は、人間の表情を人工的に
合成して表示装置の画面上に表示することにより、オペ
レータである人間と機械とのコミニュケーションを円滑
に行えるようにした動画像表示装置に関する。BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a moving image in which communication between a human being as an operator and a machine can be smoothly performed by artificially synthesizing human facial expressions and displaying them on a screen of a display device. It relates to a display device.

【０００２】[0002]

【従来の技術】従来、この種の表示装置としては、発声
する内容をテキストにし、このテキストに応じて発声す
る音声の口形と発声時間とを決定し、この決定した口形
の画像を人間の顔画像の口領域に合成し、この合成顔画
像をテキストから合成した音声と同期して表示する装置
がある。2. Description of the Related Art Conventionally, as a display device of this type, a uttered content is converted into text, a mouth shape and a utterance time of a voice to be uttered are determined in accordance with the text, and an image of the determined mouth shape is converted to a human face. There is a device that synthesizes a synthesized facial image with a mouth region of an image and displays the synthesized facial image in synchronization with a voice synthesized from text.

【０００３】また、音声認識技術を用いて入力音声の母
音を抽出し、この抽出した母音に応じた口形を決定し、
この決定した口形の画像を人間の顔画像の口領域に合成
し、この合成顔画像を入力音声の発声速度に同期して表
示する装置がある（原島ら；「表情アニメーション作成
のためのシナリオ記述ツールとリアルタイム動画像表
示」；信学技法ＨＣ９１−５７，Ｐ２３〜３０参照）。[0003] Further, a vowel of an input voice is extracted by using a voice recognition technique, and a mouth shape corresponding to the extracted vowel is determined.
There is a device that synthesizes the determined mouth shape image with the mouth region of the human face image and displays the synthesized face image in synchronization with the utterance speed of the input voice (Harashima et al .: “Scenario description for creating facial expression animation” Tool and Real-time Video Display "; see IEICE HC91-57, pp. 23-30).

【０００４】[0004]

【発明が解決しようとする課題】ところが、前述した２
種類の装置のうち、第１の装置は、テキストに応じて口
形を決定するため口形は正確に表現できるが、音声はテ
キストから合成するため自然性に欠けるといった不都合
がある。However, the above-mentioned 2
Among the types of devices, the first device determines the mouth shape according to the text, so that the mouth shape can be accurately expressed. However, since the speech is synthesized from the text, the first device has a disadvantage that it lacks naturalness.

【０００５】また、第２の装置は、音声認識技術によっ
て入力音声をテキスト化してから口形を決定するため、
音声は正確に表現できるが、口形は音声の認識を誤ると
不自然なものが選択され、正確な合成顔画像が得られな
いといった不都合がある。また、この装置は処理に時間
がかかり過ぎるといった不都合もある。Further, the second device determines the mouth shape after converting the input voice into text by voice recognition technology.
Although speech can be accurately expressed, if the mouth shape is incorrectly recognized, an unnatural one is selected, and there is a disadvantage that an accurate synthesized face image cannot be obtained. In addition, this device has a disadvantage that processing takes too long.

【０００６】そこで、本発明は、入力音声を短時間で正
確に識別し、入力音声と同期して正確な口形を有する顔
画像を表示することを目的とする。Accordingly, it is an object of the present invention to accurately identify an input voice in a short time and display a face image having an accurate mouth shape in synchronization with the input voice.

【０００７】[0007]

【課題を解決するための手段】本発明による動画像表示
装置は、発声する音声の低域および高域の周波数成分を
抽出する周波数検出部と、周波数検出部で抽出した周波
数成分に基づき発声する音声の口形を決定する口形選択
部と、この決定した口形に応じた口形画像を人間の顔画
像の口領域に合成して合成顔画像を得る画像合成部と、
音声の発声に同期して合成顔画像を表示する表示部とか
ら構成する。According to the present invention, there is provided a moving image display apparatus for extracting a low-frequency component and a high-frequency component of an uttered voice, and uttering based on the frequency components extracted by the frequency detecting unit. A mouth shape selection unit that determines a mouth shape of a voice, and an image synthesis unit that synthesizes a mouth shape image corresponding to the determined mouth shape with a mouth region of a human face image to obtain a synthesized face image,
A display unit for displaying the synthesized face image in synchronization with the utterance of the voice.

【０００８】この場合、周波数検出部は、周波数１２０
０Ｈｚを境にして、発声する音声の低域側の出力値を検
出する第１の帯域フィルタと高域側の出力値を検出する
第２の帯域フィルタとを備え、口形選択部はこの第１お
よび第２の帯域フィルタの両出力値に基づいて発声する
音声の口形を決定する。[0008] In this case, the frequency detecting unit detects the frequency 120
A first bandpass filter for detecting a low-frequency output value of the uttered voice and a second bandpass filter for detecting a high-frequency output value at 0 Hz are provided. And the mouth shape of the voice to be uttered is determined based on both output values of the second bandpass filter.

【０００９】[0009]

【作用】本発明の構成において、周波数検出部は発声す
る音声の低域および高域の周波数成分を抽出し、口形選
択部はこの抽出した周波数成分に基づいて口形を決定す
る。口形を決定するうえで重要なのは調音位置である
が、母音の調音位置は主に第１フォルマントおよび第２
フォルマントの位置によって決まる。In the structure of the present invention, the frequency detecting section extracts low and high frequency components of the uttered voice, and the mouth shape selecting section determines the mouth shape based on the extracted frequency components. It is the articulation position that is important in determining the mouth shape, but the articulation position of the vowel is mainly the first formant and the second formant.
Determined by the position of the formant.

【００１０】このため周波数検出部では低域側の第１の
帯域フィルタによって第１フォルマントの出力値を抽出
し、高域側の第２の帯域フィルタによって第２フォルマ
ントの出力値を抽出する。口形選択部はこの２種類の帯
域フィルタで抽出した出力値の割合から入力音声を摩擦
子音、バスバー、５種類の母音および無音に大別し、そ
れぞれに応じた口形を決定する。合成画像部はこうして
決定した口形に対する口形画像を顔画像の口領域に合成
して合成顔画像を得、表示部はこの合成画像を音声の発
声に同期して表示する。[0010] Therefore the frequency detecting unit extracts the output value of the first formant by a first band-pass filter of the low-frequency side, extracts the output value of the second formant by a second band-pass filter the high frequency side. The mouth shape selection unit roughly divides the input voice into a fricative consonant, a bus bar, five vowels, and a silence based on the ratio of the output values extracted by the two types of bandpass filters, and determines a mouth shape corresponding to each. The synthesized image unit synthesizes the mouth shape image corresponding to the determined mouth shape with the mouth region of the face image to obtain a synthesized face image, and the display unit displays the synthesized image in synchronization with the utterance of the voice.

【００１１】[0011]

【実施例】図１は、本発明による動画像表示装置の一実
施例を示すブロック図である。本実施例において、音声
格納部１は自然音声を格納するためのもので、予め収集
した音声であっても、リアルタイムで入力した音声であ
ってもよい。FIG. 1 is a block diagram showing an embodiment of a moving picture display device according to the present invention. In the present embodiment, the voice storage unit 1 stores natural voices, and may be voices collected in advance or voices input in real time.

【００１２】周波数検出部２は音声格納部１から入力さ
れた音声信号の低域および高域の周波数成分を抽出する
ためのもので、低域側の帯域フィルタＢＰＦ１と高域側
の帯域フィルタＢＰＦ２とによって構成されている。こ
の２つの帯域フィルタＢＰＦ１およびＢＰＦ２は、図２
に示すように、周波数１２００Ｈｚを境界とする周波数
特性を有しており、この境界周波数（１２００Ｈｚ）は
母音の第１フォルマントと第２フォルマントとを区別し
やすくするために設定した値である。The frequency detector 2 is for extracting low-frequency and high-frequency components of the audio signal input from the audio storage 1, and includes a low-band filter BPF1 and a high-band filter BPF2. And is constituted by. These two bandpass filters BPF1 and BPF2 are shown in FIG.
As shown in FIG. 5, the frequency characteristic has a frequency of 1200 Hz as a boundary, and this boundary frequency (1200 Hz) is a value set to make it easy to distinguish the first formant and the second formant of a vowel.

【００１３】口形選択部３は周波数検出部２における周
波数成分の検出結果に基づいて、表示すべき口形を決定
するためのもので、前述した２つの帯域フィルタＢＰＦ
１およびＢＰＦ２の出力値から口形を決定する。図３
は、口形選択部３における口形決定のための特性図で、
横軸に帯域フィルタＢＰＦ１の出力値、縦軸に帯域フィ
ルタＢＰＦ２の出力値を示し、この２つの出力値の相関
から口形を決定する。The mouth shape selection unit 3 is for determining the mouth shape to be displayed based on the detection result of the frequency component in the frequency detection unit 2.
The mouth shape is determined from the output values of 1 and BPF2. FIG.
Is a characteristic diagram for determining a mouth shape in the mouth shape selection unit 3,
The horizontal axis indicates the output value of the bandpass filter BPF1, and the vertical axis indicates the output value of the bandpass filter BPF2. The mouth shape is determined from the correlation between the two output values.

【００１４】図中、ｆｒ，ｒｅｌ，ａ，ｉ，ｕ，ｅ，
ｏ，ｍｍは、それぞれ口形を表す記号で、ｆｒは／ｓ／
（サ行音）のような摩擦性子音を発声するときの口形、
ｒｅｌは発声していないときの口形、ａは母音／ａ／を
発声するときの口形、ｉは母音／ｉ／を発声するときの
口形、ｕは母音／ｕ／を発声するときの口形、ｏは母音
／ｏ／を発声するときの口形、ｍｍは／ｍａ／を発声す
るときのような唇を閉じたときの口形である。In the figure, fr, rel, a, i, u, e,
o and mm are symbols representing the mouth shape, and fr is / s /
Mouth shape when uttering fricative consonants such as
rel is a mouth shape when not uttering, a is a mouth shape when vowel / a / is uttered, i is a mouth shape when vowel / i / is uttered, u is a mouth shape when vowel / u / is uttered, o Is the mouth shape when the vowel / o / is uttered, and mm is the mouth shape when the lips are closed such as when the / ma / is uttered.

【００１５】従って、例えば帯域フィルタＢＰＦ１の出
力レベルがａ１以下で、かつ帯域フィルタＢＰＦ２の出
力レベルがｂ１以上であれば、口形選択部３は記号ｆｒ
の口形を選択する。Therefore, for example, if the output level of the bandpass filter BPF1 is equal to or lower than a1 and the output level of the bandpass filter BPF2 is equal to or higher than b1, the mouth shape selecting unit 3 determines the symbol fr
Choose the mouth shape.

【００１６】画像合成部４は口形選択部３で選択した口
形に対応する画像データを口形画像格納部５から読み出
し、同時に顔画像格納部６から人間の顔画像を表示する
画像データを読み出して、図４に示すように、顔画像Ｆ
の口領域Ｍに、口形画像を嵌め込んで合成顔画像を形成
する。口形画像格納部５には、図５に示すように、前述
した８種類の口形ｆｒ，ｒｅｌ，ａ，ｉ，ｕ，ｅ，ｏ，
ｍｍに対応する口形の画像データが格納されている。The image synthesizing unit 4 reads out image data corresponding to the mouth shape selected by the mouth shape selecting unit 3 from the mouth shape image storage unit 5, and at the same time reads out image data for displaying a human face image from the face image storage unit 6, As shown in FIG.
The mouth shape image is fitted into the mouth region M of the image to form a composite face image. As shown in FIG. 5, the eight types of mouth shapes fr, rel, a, i, u, e, o, and
Mouth-shaped image data corresponding to mm is stored.

【００１７】表示部７は画像合成部４で形成した合成顔
画像を、同期制御部（不図示）からの制御によって、発
音部８から発声する音声出力と同期して表示するもので
ある。この場合、顔画像に嵌め込む口形画像の切り換え
を、口形が変化したときのみ実行するようにすれば、処
理量を大幅に削減することができる。The display unit 7 displays the synthesized face image formed by the image synthesizing unit 4 in synchronism with the sound output uttered from the sound generation unit 8 under the control of a synchronization control unit (not shown). In this case, if the switching of the mouth shape image to be fitted to the face image is executed only when the mouth shape changes, the processing amount can be significantly reduced.

【００１８】[0018]

【発明の効果】本発明によれば、簡易な手段で正確かつ
迅速に入力音声の口形を決定することができ、その結
果、音声と同期した違和感のない口形を有する合成顔画
像を表示することができるので、オペレータと機械との
間でのヒューマン・インターフェースが可能となる。According to the present invention, it is possible to accurately and quickly determine the mouth shape of an input voice by simple means, and as a result, to display a synthesized face image having a mouth shape without a sense of incongruity synchronized with the voice. , A human interface between the operator and the machine is possible.

【図面の簡単な説明】[Brief description of the drawings]

【図１】本発明による動画像表示装置の一実施例を示す
ブロック図である。FIG. 1 is a block diagram showing an embodiment of a moving image display device according to the present invention.

【図２】図１に示す周波数検出部の周波数特性を示す図
である。FIG. 2 is a diagram illustrating frequency characteristics of a frequency detection unit illustrated in FIG. 1;

【図３】図１に示す口形選択部の特性図である。FIG. 3 is a characteristic diagram of the mouth shape selection unit shown in FIG.

【図４】顔画像と口領域とを表す図である。FIG. 4 is a diagram illustrating a face image and a mouth area.

【図５】口形画像の一例を示す図である。FIG. 5 is a diagram illustrating an example of a mouth image.

【符号の説明】[Explanation of symbols]

１音声格納部２周波数検出部３口形選択部４画像合成部５口形画像格納部６顔画像格納部７表示部８発音部 Reference Signs List 1 voice storage unit 2 frequency detection unit 3 mouth shape selection unit 4 image synthesis unit 5 mouth shape image storage unit 6 face image storage unit 7 display unit 8 sound generation unit

フロントページの続き (58)調査した分野(Int.Cl.⁷，ＤＢ名) G06T 13/00 G06T 15/70 G10L 13/00 G10L 15/00 G10L 21/06 Continuation of the front page (58) Field surveyed (Int.Cl. ⁷ , DB name) G06T 13/00 G06T 15/70 G10L 13/00 G10L 15/00 G10L 21/06

Claims

(57)【特許請求の範囲】(57) [Claims]

【請求項１】周波数１２００Ｈｚを境として、発声す
る音声の低域側の出力値を検出する第１の帯域フィルタ
と発声する音声の高域側の出力値を検出する第２の帯域
フィルタとを備える周波数検出部と、前記第１および第２の帯域フィルタの両出力値に基づい
て前記発声する音声の口形を決定する口形選択部と、前記決定した口形に応じた口形画像を人間の顔画像の口
領域に合成して合成顔画像を得る画像合成部と、前記音声の発声に同期して前記合成顔画像を表示する表
示部と、を備えることを特徴とする動画像表示装置。1. Speaking at a frequency of 1200 Hz as a boundary
First bandpass filter for detecting an output value on the low frequency side of a voice
Second band for detecting the output value on the high frequency side of the voice uttering
A frequency detection unit comprising a filter, and both output values of the first and second bandpass filters.
And mouth shape selection unit for determining the mouth shape of the sound to the utterance Te, and the synthetic to the mouth shape image corresponding to the determined mouth shape to the mouth area of a human face image obtaining synthetic face image image synthesizing unit, the utterance of the speech A display unit for displaying the composite face image in synchronization with the display unit.