JP2006163269A

JP2006163269A - Language learning apparatus

Info

Publication number: JP2006163269A
Application number: JP2004358336A
Authority: JP
Inventors: Yukio Tada; 幸生多田
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2004-12-10
Filing date: 2004-12-10
Publication date: 2006-06-22

Abstract

PROBLEM TO BE SOLVED: To provide a language learning apparatus capable of recording the shape and movement of a user's mouth.
SOLUTION: The language learning apparatus comprises: a photographing means for photographing user's moving images indicating the shape and movement of a speaker's mouth; a first storage means for storing the shape and movement of a model speaker as model moving images; a generation means for generating synthetic moving images from the model moving images stored in the first storage means and the user's moving images photographed by the photographing means; and a second storage means for storing the synthetic moving images generated by the generation means.
COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a language learning apparatus having a camera for photographing the shape of a speaker's mouth.

In language learning, whether of a foreign language or of one's native language, and especially in self-study of pronunciation or speaking, a widely used method is to play back a model voice recorded on a medium such as a CD (Compact Disc) and to pronounce or speak in imitation of it. The aim is to acquire correct pronunciation by imitating the model voice. In foreign-language learning in particular, the learner may have to master sounds that do not exist in the native language. For example, the English "th" sound does not exist in Japanese. Such sounds are difficult to learn: a Japanese speaker learning English for the first time may hear a native English speaker pronounce "th" and want to imitate it, yet have no idea how the sound is produced.

To learn sounds that do not exist in the native language, it is not enough for the student (learner) to hear the model voice; the shape and movement of the mouth when the model voice is uttered must also be shown on video so that the student can check them visually. First of all, therefore, the shape and movement of the mouth of the model speaker (teacher) must be shown on video. By checking the teacher's mouth shape and movement on video and imitating them while pronouncing, the learner (student) can acquire a more accurate way of pronouncing.
To make learning more efficient, it is also effective to show the student's own mouth shape and movement on video. In addition to imitating the teacher's voice by ear, the student imitates the teacher's mouth shape and movement while visually checking his or her own, and can thereby acquire a more accurate way of pronouncing.

Techniques for presenting or recording video of the shape and movement of a speaker's mouth include, for example, those described in Non-Patent Document 1 and Patent Documents 1 and 2. Non-Patent Document 1 discloses a technique in which the speaker's mouth is reflected in a mirror so that the speaker can see it. Patent Document 1 discloses a technique for a speech recognition system in which the region around the speaker's mouth is photographed from the side with a video camera and the lip shape is recognized, thereby improving the recognition accuracy for the uttered speech. Patent Document 2 discloses a technique for a speech recognition system in which light with a specific pattern is projected onto the speaker's articulatory organs (inside and around the mouth) and changes in the pattern are read by a camera to recognize the motion of the articulatory organs, thereby improving speech recognition accuracy.
Non-Patent Document 1: Takakazu Nakamura and 3 others, "Voice practice system for hearing-impaired children 'Ai-chan's Hand'", Savemation Review, Yamatake Corporation, August 1998, Vol. 16, pp. 98-105
Patent Document 1: JP 2000-112496 A
Patent Document 2: JP 2000-67214 A

In the technique of Non-Patent Document 1, the speaker (student) sees his or her mouth in a mirror, so the shape and movement of the mouth cannot be reviewed at leisure afterwards. Moreover, the student cannot compare the teacher's mouth shape and movement with his or her own, which makes it difficult to confirm whether his or her way of pronouncing is really correct. In the techniques of Patent Documents 1 and 2, the movement of the speaker's mouth is captured and recorded by a camera, but the captured video is used only to improve the speech recognition rate and contributes nothing to feeding the video back to the speaker.

The present invention has been made in view of the above circumstances, and its object is to provide a language learning apparatus with which a user can record the shape and movement of his or her own mouth and check them afterwards.

To solve the above problem, the present invention provides a language learning apparatus comprising: photographing means for capturing a user video showing the shape and movement of a speaker's mouth; first storage means for storing the shape and movement of a model speaker's mouth as a model video; generating means for generating a composite video from the model video stored in the first storage means and the user video captured by the photographing means; and second storage means for storing the composite video generated by the generating means.
In a preferred embodiment, the language learning apparatus may further comprise display means for displaying the composite video generated by the generating means.
In another preferred embodiment, the language learning apparatus may further comprise parameter designating means for designating a parameter that determines the form of the composite video generated by the generating means, and the generating means may generate the composite video in the form determined by the parameter designated by the parameter designating means.

According to the present invention, the user video and the model video are combined, so a user engaged in language learning can visually recognize the differences between his or her own pronunciation and way of speaking and those of the model speaker.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 shows the configuration of a language learning apparatus 1 according to an embodiment of the present invention. The language learning apparatus 1 comprises a microphone unit 10 that acquires the speaker's voice and video of the speaker's mouth, a processing device 20 that processes the audio and video data acquired by the microphone unit 10, and a display 30 that displays images of the speaker's mouth and the like. The microphone unit 10 exchanges signals with the processing device 20 via a cable 40.

FIG. 2 is a perspective view showing the structure of the microphone unit 10. The camera 11 is a small imaging device such as a CCD (Charge Coupled Device) camera or a CMOS (Complementary Metal Oxide Semiconductor) camera; it captures video of the speaker (user) and outputs a video signal representing that video via the cable 40. The light 12 is a light source that illuminates the mouth area of the speaker, who is the subject. The microphone 13 picks up the speaker's voice and outputs an audio signal representing it via the cable 40. To use the microphone unit 10, the speaker holds the grip 14 by hand and rests the tip of the positioning member 15 just below his or her nose. As shown in FIG. 2, for example, the positioning member 15 may be made adjustable in length by using two pipes of different diameters (pipe 15a and pipe 15b): the smaller-diameter pipe 15a is housed inside the larger-diameter pipe 15b and fixed in place with a screw 15c. Alternatively, the positioning member 15 may consist only of a member of fixed length.
The grip 14 is also provided with a switch 16. The switch 16 outputs an operation signal corresponding to the speaker's operation via the cable 40. The function of the switch 16 is described in detail later.

FIG. 3 is a block diagram showing the hardware configuration of the processing device 20. A CPU (Central Processing Unit) 21 uses a RAM (Random Access Memory) 22 as a work area and reads and executes programs stored in a ROM (Read Only Memory) 23 or an HDD (Hard Disk Drive) 24. The HDD 24 is a storage device that stores various application programs and data. In this embodiment in particular, the HDD 24 stores a language learning program, the model video data used by that program, and model audio data as the audio track of the model video data. The model video data is a recording of the video and audio of the mouth area of a teacher (model speaker) uttering an example sentence. The model audio is recorded in the same way as the user video described later.

When video data is input, the image processing unit 25 outputs a control signal that controls the display 30 in accordance with that video data. The display 30 is a display device such as a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display) and displays video in accordance with the control signal from the image processing unit 25. Audio data, including the audio track of video data, is converted into an analog audio signal by a DAC 26 and played back from a speaker 27.
The speaker (user) can input instructions to the processing device 20 by operating a keyboard 28. The processing device 20 is connected to the cable 40 via an I/F 29 and can exchange signals with the microphone unit 10.
The components above are connected to one another via a bus 99.

Next, the operation of the language learning apparatus 1 is described. When the speaker (user) instructs execution of the language learning program, for example by operating the keyboard 28, the CPU 21 reads the language learning program from the HDD 24 and executes it. By executing this program, the language learning apparatus 1 provides the functions of this embodiment.

On executing the language learning program, the CPU 21 selects and reads one item of model video data from the plurality of items stored in the HDD 24. The item may be selected in either of two ways: the CPU 21 may display a list of the model video data on the display 30 so that the speaker can choose any item from it by operating the keyboard 28, or the order in which the model video data is to be read may be stored in the language learning program in advance so that items are selected automatically in that order.

The CPU 21 outputs the model video data it has read to the image processing unit 25. The CPU 21 also outputs the audio data recorded on the audio track of the model video data to the DAC 26. As a result, the model video is played on the display 30 and the model audio is played from the speaker 27. The model video data may be in a compressed or an uncompressed format. Compressed video data must be decompressed. In this embodiment the CPU 21 performs the decompression, but a separate processing circuit or the like may be provided for that purpose.

The speaker watches the model video played on the display 30 while listening to the model audio played from the speaker 27, and speaks in imitation of both the model audio and the mouth shape and movement shown in the model video. While speaking, the speaker holds the grip 14 of the microphone unit 10 and rests the tip of the positioning member 15 against a predetermined position on his or her face (between the nose and the mouth in FIG. 1). This keeps the distance between the camera 11 and the speaker's mouth constant, and therefore keeps the position of the mouth in the image constant.

The camera 11 outputs a video signal of the captured image of the speaker's mouth to the processing device 20. The speaker keeps the switch 16 pressed for as long as he or she is speaking. While pressed, the switch 16 outputs to the processing device 20 a press signal indicating that the switch 16 is being pressed. While it receives the press signal, the CPU 21 of the processing device 20 converts the received video signal into digital video data and saves it to the HDD 24. When the speaker releases the switch 16 and the press signal is no longer output, the CPU 21 stops recording video data. In other words, video of the speaker's mouth area is recorded only while the speaker is pressing the switch 16 (the video recorded in this way is referred to as "user video data"). The same applies to the speaker's voice: it is recorded only while the speaker is pressing the switch 16.
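
The press-to-record behavior described above can be sketched as follows. This is a minimal illustration only, not the implementation in the publication: the function name `record_while_pressed`, the `Frame` type, and the representation of the input as (switch state, frame) pairs are all assumptions made for the example, as is the choice to stop at the first release so that one call captures one utterance.

```python
from typing import Iterable, List, Tuple

Frame = bytes  # hypothetical stand-in for one digitized video frame


def record_while_pressed(samples: Iterable[Tuple[bool, Frame]]) -> List[Frame]:
    """Keep only the frames that arrive while the switch is pressed.

    Each sample pairs the switch state (True = pressed) with the frame
    captured at that instant. Recording starts at the first pressed sample
    and stops at the first release after it (an assumption of this sketch).
    """
    recorded: List[Frame] = []
    started = False
    for pressed, frame in samples:
        if pressed:
            started = True
            recorded.append(frame)
        elif started:
            break  # switch released: stop recording
    return recorded


# Frames before the press and after the release are discarded.
samples = [(False, b"a"), (True, b"b"), (True, b"c"), (False, b"d"), (True, b"e")]
user_video = record_while_pressed(samples)
```

The same gating would apply to audio samples, since the text states that the voice is also recorded only while the switch is held.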

When the speaker finishes speaking, that is, when the speaker releases the switch 16, the CPU 21 generates, from the model video data and the user video data, mixed video data in which the model video and the user video are mixed. The mixed video data is generated, for example, as follows. The RAM 22 stores a parameter A that designates the method of mixing the videos and a parameter B that designates the form of mixing within the designated method. FIG. 4 illustrates video mixing methods. These include, for example, superimposing the model video M and the user video U (FIG. 4(A)), placing the model video M and the user video U side by side (FIG. 4(B)), inserting the user video U into part of the model video M (FIG. 4(C)), and inserting the model video M into part of the user video U (FIG. 4(D)). An example of a parameter designating the form of mixing is, for the superimposing method, the ratio at which the two videos are superimposed.
When they are superimposed at a ratio of model video : user video = 1:1, the model video and the user video are displayed on the screen at the same density. When they are superimposed at a ratio of model video : user video = 1:3, for example, the user video is displayed on the screen at three times the density of the model video.
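
As a rough illustration of these mixing methods, the sketch below operates on single grayscale frames stored as nested lists. Everything in it is an assumption made for the example: the publication does not say how "density" maps to pixel arithmetic, so the 1:B ratio is interpreted here as a weighted average, and the function names `superimpose`, `side_by_side`, and `inset` are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # grayscale frame: rows of pixel values 0..255 (assumed)


def superimpose(model: Frame, user: Frame, b: float = 1.0) -> Frame:
    """Blend two equally sized frames at a model:user ratio of 1:b.

    The ratio is interpreted as a weighted average, so b = 3 gives the
    user frame three times the weight of the model frame.
    """
    if b < 0:
        raise ValueError("b must be non-negative")
    total = 1.0 + b
    return [
        [round((m + b * u) / total) for m, u in zip(m_row, u_row)]
        for m_row, u_row in zip(model, user)
    ]


def side_by_side(model: Frame, user: Frame) -> Frame:
    """Place the model frame on the left and the user frame on the right."""
    return [m_row + u_row for m_row, u_row in zip(model, user)]


def inset(base: Frame, small: Frame, top: int, left: int) -> Frame:
    """Copy `small` into `base` with its upper-left corner at (top, left).

    Assumes `small` fits entirely inside `base` at that position.
    """
    out = [row[:] for row in base]
    for r, small_row in enumerate(small):
        out[top + r][left:left + len(small_row)] = small_row
    return out


model_frame = [[0, 100], [200, 40]]
user_frame = [[100, 100], [0, 240]]
equal_mix = superimpose(model_frame, user_frame, b=1)  # 1:1 ratio
user_heavy = superimpose(model_frame, user_frame, b=3)  # 1:3 ratio
```

Applying one of these functions frame by frame to two videos of equal length would yield the mixed video; choosing among the functions corresponds to parameter A, and the value `b` corresponds to parameter B for the superimposing method.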

The CPU 21 generates the mixed video data using the method and form designated by parameters A and B above. The CPU 21 stores the generated mixed video data in the HDD 24 and also outputs it to the image processing unit 25. The image processing unit 25 outputs a control signal corresponding to the mixed video data to the display 30. The display 30 thus shows a mixed video in which the teacher's model video and the speaker's (student's) user video are superimposed. The composite video may be generated and played back in real time, or the composite video data may be generated from beginning to end, stored in the RAM 22 or the HDD 24, and then played back.

<Modification 1>
The present invention is not limited to the embodiment described above, and various modifications are possible.
In the embodiment described above, the form in which the composite video is generated may be changed by an instruction input from the user. A specific example follows. When playing back the composite video, the CPU 21 displays a menu screen for instruction input on the display 30. The menu screen carries a message stating, for example, that pressing the switch 16 once increases the value of parameter B by 1 and pressing the switch 16 twice decreases it by 1. Consider the case where the model video and the user video are mixed by superimposition. Parameter B indicates the density of the user video relative to that of the model video. Its initial value is 1, that is, the model video M and the user video U are superimposed at a density ratio of 1:1 (FIG. 5(A)). The mixed video data is generated in real time, simultaneously with playback. When the user presses the switch 16 once in accordance with the menu screen, the value of parameter B increases by 1. In accordance with parameter B, the CPU 21 processes the data so that the user video has twice the density of the model video and superimposes the two. In this way the user's instruction input makes the user video U denser, that is, the form of mixed video generation changes (FIG. 5(B)). The method of mixed video generation may be changed in the same way.
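
The switch-driven adjustment of parameter B can be sketched as a small update function. This is an illustrative assumption, not the publication's implementation: the name `adjust_b` is hypothetical, and the lower bound of 0 is added only so that the ratio never becomes negative; the source does not state any bound.

```python
def adjust_b(b: int, presses: int, minimum: int = 0) -> int:
    """Return the new value of parameter B after a switch gesture.

    One press raises B by 1 and two presses lower it by 1, as in the menu
    message described above. Any other press count leaves B unchanged.
    The lower bound `minimum` is an assumption of this sketch.
    """
    if presses == 1:
        return b + 1
    if presses == 2:
        return max(minimum, b - 1)
    return b


b = 1                # initial value: model and user video at a 1:1 density
b = adjust_b(b, 1)   # single press: the user video becomes twice as dense
```

Because the mixed video is generated in real time in this modification, the updated value would take effect from the next frame that is blended.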

<Modification 2>
The shape of the positioning member 15 is not limited to that shown in FIG. 2. That is, it need not be designed to rest between the user's nose and mouth. For example, it may be a structure such as headgear or a headband worn on the user's head (FIG. 6). Alternatively, it may be attached to a part of the body other than the face or head, such as an arm, shoulder, or chest. In short, any structure may be used as long as it fixes the relative position of the camera, which is the photographing means, and the mouth, which is the subject.

<Other modifications>
The configuration described in Modification 1 and the configuration described in Modification 2 may be used in combination. The language learning apparatus 1 may also omit the display 30 and simply store the user video or the mixed video in the HDD 24 or the RAM 22. The user can read the data stored in the HDD 24 or the RAM 22 via the I/F 29 and play it back on another device such as a personal computer. Signals may also be exchanged between the microphone unit 10 and the processing device 20 by wireless communication.
Alternatively, the user video or the composite video need not be stored as data in the HDD 24; only real-time playback may be performed.
In the embodiment described above, the functions of the language learning apparatus are realized by the CPU 21 executing the language learning program, but processing such as video composition may instead be realized with hardware such as a dedicated electronic circuit.
In the embodiment described above, the user video, the model video, and the mixed video are all stored as digital data, but some or all of them may be stored as analog signals. In that case, a magnetic tape and a magnetic tape recorder may be used as storage means in place of the HDD 24.

FIG. 1 is a diagram showing the configuration of the language learning apparatus 1 according to an embodiment of the present invention.
FIG. 2 is a perspective view showing the structure of the microphone unit 10.
FIG. 3 is a block diagram showing the hardware configuration of the processing device 20.
FIG. 4 is a diagram illustrating video mixing methods.
FIG. 5 is a diagram illustrating a video mixing method according to another embodiment.
FIG. 6 is a diagram showing the structure of the microphone unit 10 according to another embodiment.

Explanation of symbols

1 ... Language learning apparatus, 10 ... Microphone unit, 11 ... Camera, 12 ... Light, 13 ... Microphone, 14 ... Grip, 15 ... Positioning member, 16 ... Switch, 20 ... Processing device, 21 ... CPU, 22 ... RAM, 24 ... HDD, 25 ... Image processing unit, 26 ... DAC, 27 ... Speaker, 28 ... Keyboard, 29 ... I/F, 30 ... Display, 40 ... Cable, 99 ... Bus

Claims

A language learning apparatus comprising:
photographing means for capturing a user video showing the shape and movement of a speaker's mouth;
first storage means for storing the shape and movement of a model speaker's mouth as a model video;
generating means for generating a composite video from the model video stored in the first storage means and the user video captured by the photographing means; and
second storage means for storing the composite video generated by the generating means.

The language learning apparatus according to claim 1, further comprising display means for displaying the composite video generated by the generating means.

The language learning apparatus according to claim 1, further comprising parameter designating means for designating a parameter that determines the form of the composite video generated by the generating means,
wherein the generating means generates the composite video in the form determined by the parameter designated by the parameter designating means.