JP2011109292A

JP2011109292A - Imaging apparatus, control method and program thereof, and storage medium

Info

Publication number: JP2011109292A
Application number: JP2009260653A
Authority: JP
Inventors: Masashi Kawakami (川上 雅司)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2009-11-16
Filing date: 2009-11-16
Publication date: 2011-06-02

Abstract

PROBLEM TO BE SOLVED: To add text suited to the photographic scene to a recorded image.

SOLUTION: A user sets a photographic scene in a scene setting part (115). The scene setting part (115) stores a plurality of typical texts for each photographic scene. A voice recognition part (106) recognizes voice input through a voice input part (102) during photographing. A text making part (107) compares the text recognized by the voice recognition part (106) with the typical texts stored in the scene setting part (115) for the set photographic scene, calculates similarities, and determines the typical text having the highest similarity to be the text data to be recorded on a recording medium (116) together with the recorded image.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an imaging apparatus, a control method and program therefor, and a storage medium.

Patent Document 1: Japanese Patent Laid-Open No. 11-289517
Patent Document 2: Japanese Patent Laid-Open No. 9-130736

Imaging apparatuses such as digital video cameras and digital still cameras use randomly accessible recording media such as optical disks, hard disk drives (hereinafter, HDDs), and semiconductor memories. These recording media have large capacities and can store many images. So-called thumbnail display, which shows a list of reduced versions of the recorded images, is an effective way to find a desired image among many recorded images. Patent Document 1 describes an imaging apparatus that uses thumbnail display.

Patent Document 2 describes a technique for searching that uses, as a keyword, the photographer's voice captured at the time of shooting. Specifically, the photographer's voice is recognized and converted to text during shooting, and the text is recorded in association with the captured image. A desired image is then retrieved by entering text corresponding to the voice that was input during shooting.

In a list display of thumbnail images, the number of thumbnails that can be shown at one time is limited. As the number of recorded images grows, the user has to page through the list screens one after another, and finding a desired image becomes difficult. For moving images, thumbnails may be created per scene or at fixed time intervals. In that case the total number of thumbnails for all recorded images can become enormous, which makes it even harder to find the desired scene of the desired moving image.

Moreover, when screens resemble one another, their content cannot be determined without playing them back. That is, it is difficult to distinguish individual scenes from similar thumbnails, and the only option is to play back each candidate image. Efficient image retrieval is difficult with thumbnail display alone.

The technique described in Patent Document 2 captures the photographer's voice indiscriminately, converts it to text data, and records it, so voice text that does not represent the characteristics of the moving image is recorded as well. This makes effective retrieval difficult, and unwanted moving images are returned by a search.

An object of the present invention is to provide an imaging apparatus, a control method and program therefor, and a storage medium that allow a desired image to be retrieved quickly and appropriately from a large number of recorded images.

An imaging apparatus according to the present invention comprises: video input means; audio input means; voice recognition means for recognizing voice input by the audio input means and outputting text represented by the input voice; scene setting means for storing typical text for each shooting scene; text conversion means for calculating a similarity between the text from the voice recognition means and the typical text stored in the scene setting means, and determining, in accordance with the similarity, either the text or the typical text as the text to be recorded together with the video input by the video input means; and recording means for recording, on a recording medium, the video input by the video input means and the text to be recorded determined by the text conversion means.

A control method for an imaging apparatus according to the present invention is a method for controlling an imaging apparatus having video input means, audio input means, and storage means for storing typical text for each shooting scene. The method comprises: a step of setting a shooting scene; a voice recognition step of recognizing voice input by the audio input means during shooting and outputting text represented by the input voice; a text conversion step of calculating a similarity between the text from the voice recognition step and the typical text corresponding to the set shooting scene, and determining, in accordance with the similarity, either the text or the typical text as the text to be recorded together with the video input by the video input means; and a recording step of recording, on a recording medium, the video input by the video input means and the text to be recorded determined in the text conversion step.

A control program for an imaging apparatus according to the present invention is a program for controlling an imaging apparatus having video input means, audio input means, and storage means for storing typical text for each shooting scene. The program comprises: a function of setting a shooting scene in the imaging apparatus; a voice recognition function of causing the imaging apparatus to recognize voice input by the audio input means during shooting and to output text represented by the input voice; a text conversion function of causing the imaging apparatus to calculate a similarity between the text from the voice recognition function and the typical text corresponding to the set shooting scene, and to determine, in accordance with the similarity, either the text or the typical text as the text to be recorded together with the video input by the video input means; and a recording function of causing the imaging apparatus to record, on a recording medium, the video input by the video input means and the text to be recorded determined by the text conversion function.

A storage medium according to the present invention stores a program for controlling an imaging apparatus having video input means, audio input means, and storage means for storing typical text for each shooting scene. The program comprises: a function of setting a shooting scene in the imaging apparatus; a voice recognition function of causing the imaging apparatus to recognize voice input by the audio input means during shooting and to output text represented by the input voice; a text conversion function of causing the imaging apparatus to calculate a similarity between the text from the voice recognition function and the typical text corresponding to the set shooting scene, and to determine, in accordance with the similarity, either the text or the typical text as the text to be recorded together with the video input by the video input means; and a recording function of causing the imaging apparatus to record, on a recording medium, the video input by the video input means and the text to be recorded determined by the text conversion function.

According to the present invention, text that indicates the content of an image and suits the shooting scene can be added automatically from the audio captured during shooting. A desired image can therefore be found efficiently from a list of recorded images or from a list extracted under a specific keyword condition. No voice recognition function is needed on the playback device side.

FIG. 1 is a schematic block diagram of one embodiment of the present invention.
FIG. 2 is a flowchart of the process that converts input voice into text in this embodiment.
FIG. 3 is an example of the format of the text data.
FIG. 4 is a flowchart of the operation during playback in this embodiment.
FIG. 5 is an example of a list display screen that uses text.
FIG. 6 is a schematic diagram showing the relationship between thumbnails and text for a moving image.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a schematic block diagram of one embodiment of an imaging apparatus according to the present invention. The control unit 113 of the imaging apparatus 100 shown in FIG. 1 controls the overall operation of the imaging apparatus 100 in accordance with user operations on the operation unit 109 and the operating state. The control unit 113 comprises, for example, a CPU (Central Processing Unit).

The operation unit 109 is an input device, typified by buttons and a joystick, provided on the imaging apparatus 100, and a signal indicating each operation is supplied to the control unit 113. The operation unit 109 may include a touch panel installed on the front surface of the display unit 108. In this case, buttons or the like representing operation targets are displayed on the display unit 108, and a user operation on such a button counts as an operation of the operation unit 109. During shooting, the operation unit 109 is used to start and stop shooting, change the shooting mode, zoom, and so on. During playback, it is used to select the image to be played back, start and stop playback, switch the playback image, and so on.

The display unit 108 is, for example, a liquid crystal display. It displays the subject image during shooting and the playback image during playback. The display unit 108 also displays various setting screens for setting the operation mode, operation parameters, and the like of the imaging apparatus 100.

The video input unit 101 comprises a photographing lens, an image sensor, and a camera signal processing circuit that converts the image signal from the image sensor into video data of a predetermined format. The image sensor may be of the CCD (Charge Coupled Device) type or the CMOS (Complementary Metal-Oxide Semiconductor) type. In the recording (shooting) mode, the video input unit 101 outputs video data obtained by imaging the subject. The video data output from the video input unit 101 is temporarily stored in the video signal area of the memory 103.

The audio input unit 102 is an audio input device typified by a microphone and, in the recording mode, converts ambient sound into an electrical signal. The audio input unit 102 converts the captured audio signal into a digital signal and temporarily stores it in the audio signal area of the memory 103.

The encoding unit 104 compresses and encodes the video data and audio data in the memory 103 by a predetermined method and writes the compressed data back to the compressed data area of the memory 103. Known video coding schemes include MPEG (Moving Picture Experts Group) and H.264. The control unit 113 reads the compressed video data and compressed audio data from the memory 103, multiplexes them in a predetermined format, and records them as moving image data on the recording medium 116 via the media I/F 105. The recording medium 116 is a random access medium such as an optical disk, a semiconductor memory, or a hard disk.

The imaging apparatus 100 uses the audio input unit 102, the voice recognition unit 106, the text conversion unit 107, and the scene setting unit 115 to create text data that can be used for searching during playback. FIG. 2 shows a flowchart of this process.

The user can register a shooting scene in the imaging apparatus 100 before or during shooting. The control unit 113 stores text indicating the shooting scene, entered through the operation unit 109, in the scene setting unit 115. For example, when shooting a wedding, the user either types in or selects the text "wedding" (結婚式) with the operation unit 109, and the control unit 113 sets it in the scene setting unit 115. Setting becomes easier if scene names for other typical events, such as "sports day" (運動会), "travel" (旅行), and "birthday" (誕生日), are also provided as templates.

The scene setting unit 115 holds, in an internal ROM (Read Only Memory), text corresponding to utterances that occur frequently in each representative scene (typical text). For example, typical texts such as "congratulations" (おめでとう), "entrance" (入場), and "toast" (乾杯) are registered in advance for a wedding. In this respect, the scene setting unit 115 functions as typical text storage means.
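As a rough illustration, the per-scene typical texts could be modeled as a simple lookup table. The sketch below is an assumption made for clarity, not the format actually held in the ROM: the names `TYPICAL_TEXTS` and `typical_texts_for` are hypothetical, and only the wedding entries come from the description.

```python
# Hypothetical stand-in for the typical texts held by the scene setting unit 115.
# Only the wedding examples appear in the description; other scenes are omitted
# because the description gives no typical texts for them.
TYPICAL_TEXTS = {
    "結婚式": ["おめでとう", "入場", "乾杯"],  # wedding: congratulations, entrance, toast
}


def typical_texts_for(scene):
    """Return the typical texts registered for `scene`, or an empty list."""
    if scene is None:  # no shooting scene has been set
        return []
    return list(TYPICAL_TEXTS.get(scene, []))
```

Returning a copy keeps callers from accidentally modifying the registered table, which mirrors the read-only nature of the ROM.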

In the recording mode, the audio input unit 102 captures ambient sound (S1). The captured audio data is temporarily stored in the audio signal area of the memory 103. Under certain conditions, the voice recognition unit 106 reads the audio data temporarily stored in the audio signal area of the memory 103 and performs voice recognition on it (S2). The audio subjected to voice recognition is, for example, audio that is input at or above a certain level. Other candidates are the audio for several seconds before and after laughter at or above a certain level, audio that follows a silence lasting at least a certain period, and the voice of a user registered in advance. Whether a voice belongs to a registered user can be determined by matching it against a separately registered voiceprint or the like. The voice recognition unit 106 supplies the text obtained by voice recognition to the text conversion unit 107.

When a shooting scene has been set (S3), the text conversion unit 107 compares the text recognized by the voice recognition unit 106 with the texts stored in the scene setting unit 115 for the set scene and calculates a similarity (S4). For example, if the voice recognition result and a text registered in the scene setting unit 115 are exactly the same, the similarity is highest. If no text matching the voice recognition result is registered in the scene setting unit 115, the similarity is lowest. Suppose, for example, that the voice recognition result is "おめでとさん" (omedetosan, six characters) while the text registered in the scene setting unit 115 is "おめでとう" (omedetou, five characters). Comparing the two from the first character onward, four characters match. Since four of the six characters match, the similarity is set to 65% (strictly, 4/6 is approximately 67%). Alternatively, the similarity may be calculated relative to the five characters of the text registered in the scene setting unit 115. In that case, four of the five characters ("おめでと") match, so the similarity is 80%.
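The character-by-character comparison described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names and the `relative_to` parameter are hypothetical, and the sketch assumes the similarity is simply the length of the matching leading run divided by the length of one of the two strings.

```python
def matching_prefix_length(a, b):
    """Count how many leading characters of `a` and `b` are identical."""
    count = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        count += 1
    return count


def prefix_similarity(recognized, typical, relative_to="recognized"):
    """Ratio of matching leading characters to the length of the chosen string.

    relative_to="recognized" divides by the length of the recognition result;
    relative_to="typical" divides by the length of the registered text.
    """
    base = recognized if relative_to == "recognized" else typical
    if not base:
        return 0.0
    return matching_prefix_length(recognized, typical) / len(base)
```

For the example in the text, the two strings share the four leading characters おめでと, so the ratio is 4/6 (about 0.67) relative to the six-character recognition result and 4/5 (0.80) relative to the five-character registered text.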

When a similarity at or above a certain level is obtained (S5), the recognition result of the voice recognition unit 106 is replaced with the text stored in the scene setting unit 115 (S6). This removes variation in the voice recognition results and allows uniform wording to be attached to the captured image as text. When the similarity is low (S5), either the voice recognition result from the voice recognition unit 106 alone, or both that result and the most similar text from the scene setting unit 115, is selected for recording.

When no shooting scene has been set (S3), the text conversion unit 107 selects the text recognized by the voice recognition unit 106 for recording. The similarity is set to 0.
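Steps S3 to S6 can be summarized in a short Python sketch. The threshold value, the function names, and the `keep_both` flag are assumptions made for illustration; the description only says that a similarity at or above a certain level triggers replacement, and it does not fix how the similarity is computed.

```python
def prefix_similarity(recognized, typical):
    """Share of the recognized text covered by the leading run that matches `typical`."""
    if not recognized:
        return 0.0
    matched = 0
    for a, b in zip(recognized, typical):
        if a != b:
            break
        matched += 1
    return matched / len(recognized)


def decide_text(recognized, typical_texts, threshold=0.6, keep_both=False):
    """Return (texts_to_record, similarity) following steps S3 to S6.

    `threshold` (0.6 here) and `keep_both` are illustrative assumptions.
    """
    if not typical_texts:
        # S3: no scene is set, so record the recognition result with similarity 0.
        return [recognized], 0.0
    # S4: find the typical text most similar to the recognition result.
    best = max(typical_texts, key=lambda t: prefix_similarity(recognized, t))
    similarity = prefix_similarity(recognized, best)
    if similarity >= threshold:
        # S5/S6: replace the recognition result with the typical text.
        return [best], similarity
    if keep_both:
        # Low similarity: record the recognition result plus the closest typical text.
        return [recognized, best], similarity
    # Low similarity: record the recognition result only.
    return [recognized], similarity
```

With the wedding texts, `decide_text("おめでとさん", ["おめでとう", "入場", "乾杯"])` selects "おめでとう" because 4/6 exceeds the assumed threshold of 0.6.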

The text conversion unit 107 adds shooting time information from the control unit 113, as a time stamp, to the text selected for recording and its similarity, and arranges the result into a data structure such as that shown in FIG. 3. In this specification, data in which a time stamp is added to the text information of a voice recognition result is called voice recognition text data.
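One way to picture voice recognition text data is as a small record holding the three items named above. The sketch below is an assumption: the actual layout of FIG. 3 is not reproduced in this text, and the class name, field names, and the 0-to-1 similarity range are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VoiceRecognitionTextData:
    """Hypothetical record: time stamp, text selected for recording, similarity."""
    timestamp: datetime   # shooting time supplied by the control unit
    text: str             # recognition result or typical text
    similarity: float     # assumed to be expressed as a ratio from 0.0 to 1.0


def make_record(text, similarity, shot_at):
    """Bundle the selected text and its similarity with the shooting time."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0.0 and 1.0")
    return VoiceRecognitionTextData(timestamp=shot_at, text=text, similarity=similarity)
```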

The text conversion unit 107 records the text data generated in this way in the text data area of the recording medium 116 via the media I/F 105. On the recording medium 116, the voice recognition text data is associated with the moving image data recorded on the recording medium 116 during the same shooting. Providing the scene setting unit 115 makes it possible to add appropriate text to the captured image and record it on the recording medium 116 even in situations where voice recognition is difficult or does not yield an appropriate result.

The text conversion unit 107 may also generate text data containing a keyword that indicates silence when the recording time is at least a predetermined length and a silent state continues for at least a certain period.

In the playback mode, the decoding unit 111 reads the moving image data specified by the user from the recording medium 116 and decodes the compressed video data and compressed audio data. The memory 103 is used to temporarily hold the compressed data before decoding and the playback video data and playback audio data after decoding. The playback video data can be displayed as an image on the display unit 108, and the playback audio data can be output as sound from the audio output unit 117.

For the thumbnail list display of recorded images in the playback mode, the decoding unit 111 and the thumbnail creation unit 110 work together. Specifically, the decoding unit 111 reads a predetermined number of image data items from the recording medium 116, decodes them, and supplies them to the thumbnail creation unit 110. For moving image data, the image of a specific frame, such as the first frame of the moving image, is used to create the thumbnail, and the control unit 113 specifies that frame. The thumbnail creation unit 110 creates a thumbnail image by reducing the size of the image data decoded by the decoding unit 111. A thumbnail may be created at or around the time the original image data is recorded on the recording medium 116, or when it is needed, for example for a list display.

In this embodiment, either a text-based index (list) screen or a thumbnail-based index (list) screen can be selected in the playback mode. FIG. 4 shows a flowchart of this operation. The user sets whether the list screen is a text list or a thumbnail list, either in advance or on entering the playback mode.

The control unit 113 checks whether a text list or a thumbnail list is selected as the index screen (S11). For a thumbnail list (S11), the index creation unit 112 reads the thumbnails for the recorded images to be listed from the recording medium 116 (S12). If the thumbnails have not been created in advance, the decoding unit 111 and the thumbnail creation unit 110 generate the thumbnails of the required recorded images as described above. The index creation unit 112 then generates the index screen for the list display from the predetermined number of thumbnails it has read (S13).

For a text list (S11), the index creation unit 112 reads, from the recording medium 116, the text data attached to each recorded image to be listed (S14). It then generates the index screen for the list display from the predetermined number of text data items it has read (S15).

The control unit 113 supplies the index screen generated by the index creation unit 112 to the display unit 108 and causes it to be displayed (S16). FIG. 5 shows an example of an index screen based on text data. For each recorded image, the date and the text generated from the input voice are shown side by side.
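A text-based index screen of this kind amounts to formatting one line per recorded image and showing a fixed number of lines per page. The Python sketch below illustrates that idea under stated assumptions: the function name, the `(datetime, text)` input shape, the date format, and the page size are hypothetical and are not taken from FIG. 5.

```python
from datetime import datetime


def build_text_index(records, page=0, per_page=6):
    """Return the rows of one index page as 'date  text' strings.

    `records` is a list of (shooting datetime, text) pairs in display order.
    """
    start = page * per_page
    rows = []
    for shot_at, text in records[start:start + per_page]:
        rows.append(f"{shot_at:%Y-%m-%d}  {text}")
    return rows
```

Requesting the next page (for example `page=1`) corresponds to the user paging forward through the list.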

When the user selects a specific recorded image on the index screen (S17), the control unit 113 instructs the decoding unit 111, as described above, to play back the selected recorded image (and its audio) (S18). The playback image signal is displayed on the display unit 108 or an external video display device, and the playback audio signal is output from a speaker (not shown). When playback is stopped or ends, the display returns to the index screen.

When the user instructs screen paging through the operation unit 109 while the index screen is displayed (S19), an index screen is created and displayed for the next group of recorded images (S11 to S16).

When the user instructs a change in how the index screen is created (S20), the index screen is regenerated as a thumbnail list if it was a text list, or as a text list if it was a thumbnail list (S11 to S16).

FIG. 6 shows an example of the correspondence between thumbnails and text data for a moving image. For the recorded moving image 50, thumbnails 52 are created at fixed time intervals, and text 54 derived from voice input is attached, as in the example shown in FIG. 5.

When a screen partway through a continuous moving image such as that shown in FIG. 6 is selected for playback, the control unit 113 can choose the playback start point from among the selected position (or frame), a point a fixed time (for example, several seconds) before the selected position, and the beginning of the moving image. The playback start point may be set in the control unit 113 in advance through the operation unit 109, or it may be specified each time. If the point a fixed time before the selected position would fall before the beginning, playback naturally starts from the beginning. Because the highlight of a scene usually begins just before the voice input, it is preferable to make playback from a fixed time before the selected position the default. This allows the scene the user wants to be played back without being missed. When the recording time of the moving image 50 is short, playback may always start from the beginning.
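The choice of playback start point can be expressed as a small function. This is a sketch under stated assumptions: the mode names, the 5-second lead time, and the 30-second cutoff for a "short" recording are hypothetical values chosen for illustration, since the description gives only "several seconds" and "short".

```python
def playback_start(selected_sec, mode="before", lead_sec=5.0,
                   recording_sec=None, short_sec=30.0):
    """Return the playback start position in seconds from the beginning.

    mode: "selected" (start at the selected position), "before" (start
    `lead_sec` earlier, the suggested default), or "head" (start at the beginning).
    """
    if recording_sec is not None and recording_sec < short_sec:
        # Short recordings are always played from the beginning.
        return 0.0
    if mode == "head":
        return 0.0
    if mode == "selected":
        return float(selected_sec)
    if mode == "before":
        # Clamp to the beginning if the lead time reaches past it.
        return max(0.0, selected_sec - lead_sec)
    raise ValueError(f"unknown mode: {mode}")
```

For instance, selecting a thumbnail at 12 seconds starts playback at 7 seconds by default, while selecting one at 3 seconds starts from the beginning.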

In this embodiment, text derived from voice input during shooting is used, so a desired image or scene can be found efficiently.

Recording the similarity between the voice recognition result and the text registered in advance in the scene setting unit 115 has the following advantages. When a large amount of video is recorded on the recording medium 116, displaying an index for each scene improves searchability. For example, the recorded images of the scene "wedding" are extracted and listed. A narrowing search using the fixed phrases registered in advance in the scene setting unit 115 then becomes possible, which improves searchability. Displaying the list in order of similarity improves searchability further. All images recorded on the recording medium 116 can of course also be searched with the same text, so that images from various scenes to which the text "congratulations" (おめでとう) has been added can be listed together.
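The narrowing search and similarity ordering described here can be sketched as a filter followed by a sort. The record layout below (dictionaries with `scene`, `text`, and `similarity` keys) and the function name are assumptions for illustration; the description does not specify how the scene is stored alongside each text.

```python
def search_records(records, scene=None, keyword=None):
    """Filter records by scene and/or keyword, then order by similarity (highest first).

    Each record is assumed to be a dict with "scene", "text", and "similarity" keys.
    """
    hits = [
        r for r in records
        if (scene is None or r["scene"] == scene)
        and (keyword is None or keyword in r["text"])
    ]
    return sorted(hits, key=lambda r: r["similarity"], reverse=True)
```

Passing only `keyword` searches across all scenes, which corresponds to listing every image tagged "おめでとう" regardless of the scene in which it was shot.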

The control performed by the control unit 113 may be carried out by a single piece of hardware, or the entire apparatus may be controlled by several pieces of hardware that share the processing. For example, a voice recognition function corresponding to the voice recognition unit 106, a text conversion function corresponding to the text conversion unit 107, a recording function for recording various data on the recording medium 116, and the like can also be implemented in software as a control program.

Although the present invention has been described in detail on the basis of its preferred embodiments, the present invention is not limited to these specific embodiments, and various forms that do not depart from the gist of the invention are also included in it. Furthermore, each of the embodiments described above is merely one embodiment of the present invention, and the embodiments may be combined as appropriate.

The above embodiment describes an example in which voice is recognized and keywords are assigned at the time of imaging by the imaging apparatus. If the playback device has a voice recognition function, however, the various keywords described in the above embodiment may instead be assigned by the playback device as it plays back the moving image.

The present invention is also realized by executing the following processing. That is, software (a program) that realizes the functions of the above-described embodiments is supplied to a system or apparatus via a network or various storage media, and a computer (or a CPU, MPU, or the like) of the system or apparatus executes the program. In this case, the program and the storage medium storing the program constitute the present invention.

Claims

An imaging apparatus comprising:
video input means;
voice input means;
voice recognition means for performing voice recognition on a voice input by the voice input means and outputting text indicated by the input voice;
scene setting means for storing typical text for each shooting scene;
text conversion means for calculating a similarity between the text from the voice recognition means and the typical text stored in the scene setting means, and determining, in accordance with the similarity, the text or the typical text as text to be recorded together with video input by the video input means; and
recording means for recording, on a recording medium, the video input by the video input means and the text to be recorded determined by the text conversion means.

The imaging apparatus according to claim 1, further comprising:
means for generating a list screen using the text recorded on the recording medium; and
playback means for playing back an image selected on the list screen.

The imaging apparatus according to claim 2, wherein, when the selected image is a moving image, the playback means changes the recording start position in accordance with the recording time of the moving image.

A control method for an imaging apparatus having video input means, voice input means, and storage means for storing typical text for each shooting scene, the method comprising:
a step of setting a shooting scene;
a voice recognition step of performing voice recognition on a voice input by the voice input means at the time of shooting and outputting text indicated by the input voice;
a text conversion step of calculating a similarity between the text from the voice recognition step and the typical text corresponding to the set shooting scene, and determining, in accordance with the similarity, the text or the typical text as text to be recorded together with video input by the video input means; and
a recording step of recording, on a recording medium, the video input by the video input means and the text to be recorded determined in the text conversion step.

A control program for an imaging apparatus having video input means, voice input means, and storage means for storing typical text for each shooting scene, the program comprising:
a function of setting a shooting scene in the imaging apparatus;
a voice recognition function of causing the imaging apparatus to perform voice recognition on a voice input by the voice input means at the time of shooting and to output text indicated by the input voice;
a text conversion function of causing the imaging apparatus to calculate a similarity between the text from the voice recognition function and the typical text corresponding to the set shooting scene, and to determine, in accordance with the similarity, the text or the typical text as text to be recorded together with video input by the video input means; and
a recording function of causing the imaging apparatus to record, on a recording medium, the video input by the video input means and the text to be recorded determined by the text conversion function.

A storage medium storing a program for controlling an imaging apparatus having video input means, voice input means, and storage means for storing typical text for each shooting scene, the program comprising:
a function of setting a shooting scene in the imaging apparatus;
a voice recognition function of causing the imaging apparatus to perform voice recognition on a voice input by the voice input means at the time of shooting and to output text indicated by the input voice;
a text conversion function of causing the imaging apparatus to calculate a similarity between the text from the voice recognition function and the typical text corresponding to the set shooting scene, and to determine, in accordance with the similarity, the text or the typical text as text to be recorded together with video input by the video input means; and
a recording function of causing the imaging apparatus to record, on a recording medium, the video input by the video input means and the text to be recorded determined by the text conversion function.