JPH06348811A - Moving image display device - Google Patents

Moving image display device

Info

Publication number
JPH06348811A
JPH06348811A JP5135755A JP13575593A
Authority
JP
Japan
Prior art keywords
mouth shape
image
voice
face image
mouth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP5135755A
Other languages
Japanese (ja)
Other versions
JP3059022B2 (en)
Inventor
Kenji Sakamoto
憲治 坂本
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sharp Corp
Original Assignee
Sharp Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sharp Corp filed Critical Sharp Corp
Priority to JP5135755A priority Critical patent/JP3059022B2/en
Publication of JPH06348811A publication Critical patent/JPH06348811A/en
Application granted granted Critical
Publication of JP3059022B2 publication Critical patent/JP3059022B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

PURPOSE: To identify an input voice accurately in a short time and to obtain a face image whose mouth shape is accurate and synchronized with the input voice, by determining the mouth shape of the uttered voice from sampled frequency components and synthesizing a mouth shape image into the mouth area of a human face image.

CONSTITUTION: A frequency detecting unit 2 extracts the low-band and high-band frequency components of the audio signal supplied from a voice storage unit 1. A mouth shape selecting unit 3 determines the mouth shape to be displayed from the frequency components detected by the frequency detecting unit 2. An image synthesizing unit 4 reads the image data for the selected mouth shape from a mouth shape image storage unit 5, reads the image data representing a human face image from a face image storage unit 6, and forms a synthesized face image by fitting the mouth shape image into the mouth area of the face image. A display unit 7 then outputs the synthesized image formed by the image synthesizing unit 4 under the control of a synchronization control unit, in synchronization with the voice output from a sound generating unit 8.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a moving image display device that artificially synthesizes human facial expressions and displays them on the screen of a display device so that a human operator and a machine can communicate smoothly.

[0002]

[Prior Art] Conventionally, one known display device of this type converts the content to be uttered into text, determines from that text the mouth shape and the utterance time of the voice to be uttered, synthesizes an image of the determined mouth shape into the mouth area of a human face image, and displays the synthesized face image in synchronization with the voice synthesized from the text.

[0003] There is also a device that uses speech recognition to extract the vowels of an input voice, determines a mouth shape corresponding to each extracted vowel, synthesizes an image of the determined mouth shape into the mouth area of a human face image, and displays the synthesized face image in synchronization with the utterance speed of the input voice (see Harashima et al., "A Scenario Description Tool and Real-Time Moving Image Display for Creating Facial Expression Animation," IEICE Technical Report HC91-57, pp. 23-30).

[0004]

[Problems to be Solved by the Invention] Of the two devices described above, the first determines the mouth shape from the text, so the mouth shape can be represented accurately, but the voice is synthesized from the text and therefore lacks naturalness.

[0005] The second device converts the input voice into text by speech recognition before determining the mouth shape, so the voice itself is reproduced accurately; if the voice is misrecognized, however, an unnatural mouth shape is selected and an accurate synthesized face image cannot be obtained. This device also has the drawback that its processing takes too long.

[0006] An object of the present invention is therefore to identify an input voice accurately in a short time and to display a face image having an accurate mouth shape in synchronization with the input voice.

[0007]

[Means for Solving the Problems] A moving image display device according to the present invention comprises a frequency detecting unit that extracts the low-band and high-band frequency components of the voice to be uttered, a mouth shape selecting unit that determines the mouth shape of the uttered voice on the basis of the frequency components extracted by the frequency detecting unit, an image synthesizing unit that synthesizes a mouth shape image corresponding to the determined mouth shape into the mouth area of a human face image to obtain a synthesized face image, and a display unit that displays the synthesized face image in synchronization with the utterance of the voice.

[0008] In this case, the frequency detecting unit comprises a first band-pass filter that detects the low band of the uttered voice and a second band-pass filter that detects its high band, and the mouth shape selecting unit determines the mouth shape of the uttered voice from the output values of both band-pass filters.
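As a reading aid only (not part of the patent), the following Python sketch shows how the four units of [0007] and [0008] could be wired together for a single audio frame. Every helper in it is a crude stand-in: an FFT split instead of the two band-pass filters, an invented mouth shape rule, and invented image sizes. Paragraphs [0012] to [0017] below describe what each unit actually does.

```python
# Sketch only (not from the patent): one audio frame flowing through the four
# units of [0007]/[0008]. All helpers are crude stand-ins.
import numpy as np

FS = 8000  # assumed sampling rate in Hz

def detect_bands(frame):
    """Frequency detecting unit: spectral energy below / above the 1200 Hz boundary."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
    return spectrum[freqs < 1200].sum(), spectrum[freqs >= 1200].sum()

def select_mouth_shape(low, high):
    """Mouth shape selecting unit: toy rule, not the partition of FIG. 3."""
    if low + high < 1.0:
        return "rel"                      # treat very low energy as silence
    return "fr" if high > low else "a"

def composite_face(face, mouth, top_left=(60, 35)):
    """Image synthesizing unit: paste the mouth image into the mouth area."""
    out = face.copy()
    y, x = top_left
    out[y:y + mouth.shape[0], x:x + mouth.shape[1]] = mouth
    return out

# The display unit would show `composed` while the frame is played back as audio.
face = np.zeros((120, 100), dtype=np.uint8)
mouths = {k: np.full((20, 30), 255, dtype=np.uint8) for k in ("rel", "fr", "a")}
frame = np.sin(2 * np.pi * 250 * np.arange(256) / FS)
low, high = detect_bands(frame)
composed = composite_face(face, mouths[select_mouth_shape(low, high)])
```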

[0009]

[Operation] In the configuration of the present invention, the frequency detecting unit extracts the low-band and high-band frequency components of the uttered voice, and the mouth shape selecting unit determines the mouth shape from the extracted frequency components. What matters most in determining the mouth shape is the place of articulation, and for vowels the place of articulation is determined mainly by the positions of the first and second formants.

[0010] The frequency detecting unit therefore extracts the first formant with the first band-pass filter on the low-band side and the second formant with the second band-pass filter on the high-band side. From the ratio of the frequency components extracted by these two band-pass filters, the mouth shape selecting unit roughly classifies the input voice into fricative consonants, a voice bar (buzz bar), the five vowels, and silence, and determines the mouth shape corresponding to each class. The image synthesizing unit synthesizes the mouth shape image for the determined mouth shape into the mouth area of the face image to obtain a synthesized face image, and the display unit displays this synthesized image in synchronization with the utterance of the voice.

[0011]

[Embodiment] FIG. 1 is a block diagram showing an embodiment of the moving image display device according to the present invention. In this embodiment, the voice storage unit 1 stores natural speech, which may be speech collected in advance or speech input in real time.

[0012] The frequency detecting unit 2 extracts the low-band and high-band frequency components of the audio signal supplied from the voice storage unit 1, and is composed of a band-pass filter BPF1 for the low band and a band-pass filter BPF2 for the high band. As shown in FIG. 2, these two band-pass filters BPF1 and BPF2 have frequency characteristics that meet at 1200 Hz; this boundary frequency (1200 Hz) is a value chosen to make it easy to separate the first formant of a vowel from its second formant.
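A rough illustration (not taken from the patent) of this two-filter front end: the sketch below builds a low-band and a high-band band-pass filter that meet at the 1200 Hz boundary and reports an output level for each frame. The passband edges (100-1200 Hz and 1200-3400 Hz), the filter order, the per-frame RMS level measure, and the sampling rate are all assumptions; the real characteristics are those of FIG. 2.

```python
# Sketch of the frequency detecting unit 2 (assumed passbands, order and sample rate;
# the patent only fixes the 1200 Hz boundary between BPF1 and BPF2).
import numpy as np
from scipy.signal import butter, lfilter

FS = 8000  # assumed sampling rate in Hz

# BPF1: low-band filter below the 1200 Hz boundary (first-formant region).
b1, a1 = butter(4, [100, 1200], btype="bandpass", fs=FS)
# BPF2: high-band filter above the 1200 Hz boundary (second-formant region).
b2, a2 = butter(4, [1200, 3400], btype="bandpass", fs=FS)

def band_levels(frame):
    """Return the RMS output level of BPF1 and BPF2 for one audio frame."""
    low = lfilter(b1, a1, frame)
    high = lfilter(b2, a2, frame)
    return np.sqrt(np.mean(low ** 2)), np.sqrt(np.mean(high ** 2))

# Example: a 1 kHz tone mostly excites BPF1, a 2 kHz tone mostly excites BPF2.
t = np.arange(0, 0.02, 1 / FS)
print(band_levels(np.sin(2 * np.pi * 1000 * t)))
print(band_levels(np.sin(2 * np.pi * 2000 * t)))
```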

[0013] The mouth shape selecting unit 3 determines the mouth shape to be displayed on the basis of the frequency components detected by the frequency detecting unit 2; that is, it determines the mouth shape from the output values of the two band-pass filters BPF1 and BPF2 described above. FIG. 3 is a characteristic diagram used by the mouth shape selecting unit 3 to determine the mouth shape, with the output value of BPF1 on the horizontal axis and the output value of BPF2 on the vertical axis; the mouth shape is determined from the relationship between these two output values.

[0014] In the figure, fr, rel, a, i, u, e, o, and mm are symbols representing mouth shapes: fr is the mouth shape when uttering a fricative consonant such as /s/, rel is the mouth shape when not uttering, a, i, u, e, and o are the mouth shapes when uttering the vowels /a/, /i/, /u/, /e/, and /o/, respectively, and mm is the mouth shape with the lips closed, as when uttering /ma/.

[0015] For example, if the output level of the band-pass filter BPF1 is a1 or less and the output level of the band-pass filter BPF2 is b1 or more, the mouth shape selecting unit 3 selects the mouth shape fr.
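Of the FIG. 3 partition, only this fr rule is spelled out in the text, so the sketch below implements that single rule and fills the remaining branches with openly hypothetical placeholders; the threshold values a1 and b1 are likewise invented parameters.

```python
# Sketch of the mouth shape selecting unit 3. Only the `fr` branch follows the rule
# stated in [0015]; every other branch is a hypothetical placeholder, since the
# actual partition of FIG. 3 is not reproduced in the text.

def select_mouth_shape(bpf1_level, bpf2_level, a1=0.05, b1=0.20):
    """Map the BPF1/BPF2 output levels to one of the eight mouth shape symbols."""
    if bpf1_level <= a1 and bpf2_level >= b1:
        return "fr"          # fricative consonant such as /s/ (rule from [0015])
    if bpf1_level <= a1 and bpf2_level < b1:
        return "rel"         # placeholder: little energy in either band -> not uttering
    if bpf2_level < b1:
        return "mm"          # placeholder: low-band energy only -> closed lips / voice bar
    # Placeholder split of the vowel region by the balance of the two bands.
    ratio = bpf2_level / bpf1_level
    if ratio > 3.0:
        return "i"
    if ratio > 1.5:
        return "e"
    if ratio > 0.8:
        return "a"
    if ratio > 0.4:
        return "o"
    return "u"

print(select_mouth_shape(0.02, 0.30))  # -> "fr"
```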

[0016] The image synthesizing unit 4 reads the image data corresponding to the mouth shape selected by the mouth shape selecting unit 3 from the mouth shape image storage unit 5 and, at the same time, reads the image data representing a human face image from the face image storage unit 6; as shown in FIG. 4, it then forms a synthesized face image by fitting the mouth shape image into the mouth area M of the face image F. As shown in FIG. 5, the mouth shape image storage unit 5 stores mouth shape image data corresponding to the eight mouth shapes fr, rel, a, i, u, e, o, and mm described above.
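A minimal version of this fitting step, assuming the face image and the mouth shape images are plain NumPy arrays and that the mouth area M is a fixed rectangle whose position and size are invented for the example:

```python
# Sketch of the image synthesizing unit 4: fit the selected mouth shape image
# into the mouth area M of the face image F. Array shapes and the position of
# the mouth area are assumptions made for this example.
import numpy as np

def composite_face(face_img, mouth_img, mouth_top_left=(300, 220)):
    """Return a copy of the face image with the mouth image pasted into area M."""
    out = face_img.copy()
    y, x = mouth_top_left
    h, w = mouth_img.shape[:2]
    out[y:y + h, x:x + w] = mouth_img   # overwrite the mouth region
    return out

# Hypothetical 480x400 RGB face image and an 80x160 RGB mouth image from storage.
face = np.zeros((480, 400, 3), dtype=np.uint8)
mouth = np.full((80, 160, 3), 200, dtype=np.uint8)
synthesized = composite_face(face, mouth)
```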

[0017] The display unit 7 displays the synthesized face image formed by the image synthesizing unit 4 in synchronization with the voice output uttered from the sound generating unit 8, under the control of a synchronization control unit (not shown). In this case, if the mouth shape image fitted into the face image is switched only when the mouth shape changes, the amount of processing can be reduced substantially.
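The saving can be illustrated with a short loop that rebuilds the composited frame only when the selected mouth shape changes; the shape sequence, frame period, and the show() stand-in are all invented for the example.

```python
# Sketch of the display unit 7 with the optimization from [0017]: the composited
# frame is rebuilt only when the selected mouth shape changes.
import time

def show(frame):
    """Stand-in for actually drawing the frame on screen."""
    print(frame)

def run_display(shape_sequence, compose, frame_period=0.04):
    """Display one frame per audio frame, recompositing only on a shape change."""
    prev_shape = None
    frame = None
    for shape in shape_sequence:        # mouth shape selected for each audio frame
        if shape != prev_shape:         # the optimization described in [0017]
            frame = compose(shape)      # image synthesizing unit is invoked only here
            prev_shape = shape
        show(frame)                     # display unit, paced with the voice output
        time.sleep(frame_period)        # stand-in for the synchronization control unit

run_display(["rel", "a", "a", "a", "i", "i", "rel"],
            compose=lambda s: f"face with mouth '{s}'")
```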

[0018]

[Effects of the Invention] According to the present invention, the mouth shape of an input voice can be determined accurately and quickly by simple means, and as a result a synthesized face image whose mouth shape is synchronized with the voice and free of unnaturalness can be displayed, enabling a human interface between an operator and a machine.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the moving image display device according to the present invention.

FIG. 2 is a diagram showing the frequency characteristics of the frequency detecting unit shown in FIG. 1.

FIG. 3 is a characteristic diagram of the mouth shape selecting unit shown in FIG. 1.

FIG. 4 is a diagram showing a face image and a mouth area.

FIG. 5 is a diagram showing examples of mouth shape images.

[Explanation of Reference Numerals]

1 voice storage unit
2 frequency detecting unit
3 mouth shape selecting unit
4 image synthesizing unit
5 mouth shape image storage unit
6 face image storage unit
7 display unit
8 sound generating unit

Claims (2)

[Claims]

[Claim 1] A moving image display device comprising: a frequency detecting unit that extracts low-band and high-band frequency components of a voice to be uttered; a mouth shape selecting unit that determines a mouth shape of the uttered voice on the basis of the frequency components extracted by the frequency detecting unit; an image synthesizing unit that synthesizes a mouth shape image corresponding to the determined mouth shape into a mouth area of a human face image to obtain a synthesized face image; and a display unit that displays the synthesized face image in synchronization with utterance of the voice.
[Claim 2] The moving image display device according to claim 1, wherein the frequency detecting unit comprises a first band-pass filter that detects a low band of the uttered voice and a second band-pass filter that detects a high band of the uttered voice, and the mouth shape selecting unit determines the mouth shape of the uttered voice on the basis of output values of both the first and second band-pass filters.
JP5135755A 1993-06-07 1993-06-07 Video display device Expired - Fee Related JP3059022B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP5135755A JP3059022B2 (en) 1993-06-07 1993-06-07 Video display device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP5135755A JP3059022B2 (en) 1993-06-07 1993-06-07 Video display device

Publications (2)

Publication Number Publication Date
JPH06348811A (en) 1994-12-22
JP3059022B2 (en) 2000-07-04

Family

ID=15159111

Family Applications (1)

Application Number Title Priority Date Filing Date
JP5135755A Expired - Fee Related JP3059022B2 (en) 1993-06-07 1993-06-07 Video display device

Country Status (1)

Country Link
JP (1) JP3059022B2 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW270123B (en) 1993-07-02 1996-02-11 Ciba Geigy

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
USRE42000E1 (en) 1996-12-13 2010-12-14 Electronics And Telecommunications Research Institute System for synchronization between moving picture and a text-to-speech converter
USRE42647E1 (en) 1997-05-08 2011-08-23 Electronics And Telecommunications Research Institute Text-to speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same
US7096079B2 (en) 1999-10-14 2006-08-22 Sony Computer Entertainment Inc. Audio processing and image generating apparatus, audio processing and image generating method, recording medium and program
JP2002133445A (en) * 2000-10-30 2002-05-10 Namco Ltd Image processing device, image processing method and information storage medium
JP2012150363A (en) * 2011-01-20 2012-08-09 Kddi Corp Message image editing program and message image editing apparatus
CN108847234A (en) * 2018-06-28 2018-11-20 广州华多网络科技有限公司 Lip reading synthetic method, device, electronic equipment and storage medium
CN112770062A (en) * 2020-12-22 2021-05-07 北京奇艺世纪科技有限公司 Image generation method and device
CN112770062B (en) * 2020-12-22 2024-03-08 北京奇艺世纪科技有限公司 Image generation method and device

Also Published As

Publication number Publication date
JP3059022B2 (en) 2000-07-04

Similar Documents

Publication Publication Date Title
EP3226245B1 (en) System and method to insert visual subtitles in videos
Le Goff et al. A text-to-audiovisual-speech synthesizer for french
US6317716B1 (en) Automatic cueing of speech
US5940797A (en) Speech synthesis method utilizing auxiliary information, medium recorded thereon the method and apparatus utilizing the method
US4913539A (en) Apparatus and method for lip-synching animation
US6109923A (en) Method and apparatus for teaching prosodic features of speech
Childers et al. Gender recognition from speech. Part II: Fine analysis
JP2518683B2 (en) Image combining method and apparatus thereof
JP2003186379A (en) Program for voice visualization processing, program for voice visualization figure display and for voice and motion image reproduction processing, program for training result display, voice-speech training apparatus and computer system
KR950035447A (en) Video Signal Processing System Using Speech Analysis Automation and Its Method
Sako et al. HMM-based text-to-audio-visual speech synthesis.
KR20000005183A (en) Image synthesizing method and apparatus
JPH06348811A (en) Moving image display device
Barker et al. Evidence of correlation between acoustic and visual features of speech
JP2002023716A (en) Presentation system and recording medium
JPH05232856A (en) Method and device for speech visualization and language learning device using the same
Sawashima Glottal adjustments for English obstruents
US11114101B2 (en) Speech recognition with image signal
JP2003216173A (en) Method, device and program of synchronous control of synthetic voice and video
Athanasopoulos et al. 3D immersive karaoke for the learning of foreign language pronunciation
Granström et al. Neglected dimensions in speech synthesis
WO2019150234A1 (en) Speech recognition with image signal
KR100359988B1 (en) real-time speaking rate conversion system
Ogata et al. Inverse estimation of the vocal tract shape from speech sounds including consonants using a vocal tract mapping interface
Ohtsuka et al. Aperiodicity control in ARX-based speech analysis-synthesis method

Legal Events

Date Code Title Description
FPAY Renewal fee payment (event date is renewal date of database). Free format text: PAYMENT UNTIL: 20080421; Year of fee payment: 8
FPAY Renewal fee payment (event date is renewal date of database). Free format text: PAYMENT UNTIL: 20090421; Year of fee payment: 9
FPAY Renewal fee payment (event date is renewal date of database). Free format text: PAYMENT UNTIL: 20100421; Year of fee payment: 10
FPAY Renewal fee payment (event date is renewal date of database). Free format text: PAYMENT UNTIL: 20110421; Year of fee payment: 11
FPAY Renewal fee payment (event date is renewal date of database). Free format text: PAYMENT UNTIL: 20120421; Year of fee payment: 12
LAPS Cancellation because of no payment of annual fees