JP2009259167A - Video search digest generator and generation method, and program

Info

Publication number: JP2009259167A
Application number: JP2008110400A
Authority: JP
Inventors: Kota Hidaka; 浩太日高; Takashi Sato; 隆佐藤; Takeshi Irie; 豪入江; Uwe Kowalik; ウーヴェコヴァリク; Yosuke Torii; 陽介鳥井; Yukinobu Taniguchi; 行信谷口; Hidenobu Osada; 秀信長田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-04-21
Filing date: 2008-04-21
Publication date: 2009-11-05

Abstract

PROBLEM TO BE SOLVED: To identify a digest that satisfies a search request and present the digest to a user.

SOLUTION: A video search digest generation device receives a video and stores it in a video content storage means; reads the video from the video content storage means, analyzes it, extracts keywords, and stores them in a search digest storage means; and reads the video from the video content storage means and stores, in the search digest storage means, summary information, which is information for generating a digest from sections into which the video is divided. The device acquires, from the search digest storage means, the keyword corresponding to a search term specified by the user and the summary information corresponding to that keyword, then acquires the video of the sections indicated by the summary information from the video content storage means as the digest and outputs it to the user's display means.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a video search digest generation device, method, and program, and more particularly to a video search digest generation device, method, and program that search video containing speech and music and generate a digest.

As the number of videos grows, efficient ways of viewing video are needed. Viewing a digest of each video is one effective approach. For example, there is a method that generates a digest of an arbitrary, user-specified length based on emphasized speech sections (see, for example, Patent Document 1).

There is also a method that generates a digest according to the emotional state of the speech (see, for example, Patent Document 2).

There is also a technique that can create video sections in which a moving object appears large and can provide the user with an index for digest-style browsing of video sections (see, for example, Patent Document 3).

Patent Document 1: Japanese Patent No. 3803311
Patent Document 2: JP 2005-245496 A
Patent Document 3: JP 2006-244074 A

However, in the conventionally proposed methods a device or program merely generates a digest. Digest viewing shortens the viewing time per video, but it does not address the growth in the number of videos. For example, viewing 10,000 items of content takes 100,000 seconds, or more than about 27 hours, even if each video is viewed as a 10-second digest. Solving this problem requires search technology in addition to digest viewing technology.
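The arithmetic behind this estimate can be checked with a short sketch. The function name `total_viewing_hours` is hypothetical and not part of the patent; only the figures of 10,000 videos and 10 seconds per digest come from the text.

```python
def total_viewing_hours(num_videos: int, seconds_per_digest: float) -> float:
    """Return the hours needed to watch one digest for every video."""
    total_seconds = num_videos * seconds_per_digest
    return total_seconds / 3600.0


# Figures from the text: 10,000 videos, each viewed as a 10-second digest.
hours = total_viewing_hours(10_000, 10)
print(f"{hours:.1f} hours")  # 100,000 s / 3,600 s per hour
```

The result is a little under 28 hours, consistent with the "more than about 27 hours" stated above.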

The present invention has been made in view of the above points, and its object is to provide a video search digest generation device, method, and program that identify a digest satisfying a search request and present that digest to the user.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) is a video search digest generation device that searches video containing audio data and generates and provides a digest, comprising:
video input means 131 for receiving a video and storing it in storage means 14;
keyword extraction means 132 for reading the video from the storage means 14, analyzing it, extracting keywords, and storing them in the storage means 14;
digest generation means 133 for reading the video from the storage means 14 and storing, in the storage means 14, summary information, which is information for generating a digest from sections into which the video is divided; and
search presentation means 134 for acquiring, from the storage means 14, the keyword corresponding to a search term specified by a user, and either acquiring the summary information corresponding to that keyword from the storage means 14 or acquiring the video of the sections indicated by the summary information from the storage means 14 as a digest and outputting it.

In the present invention (claim 2), the keyword extraction means 132 includes means for extracting keywords of the video from at least one of: metadata bundled with the video; text near the position where the video is displayed on the site that published the video; and a phoneme string obtained by analyzing the audio accompanying the video.

In the present invention (claim 3), the keyword extraction means 132 includes means for extracting keywords of the video by associating text on a network with a phoneme string obtained by analyzing the audio accompanying the video.

In the present invention (claim 4), the digest generation means 133 includes means for dividing the video into one or more sections, assigning each section a priority for use in the digest based on at least one of audio features and image features, and using the priorities to generate summary information for a plurality of digest lengths.

FIG. 2 is a diagram for explaining the principle of the present invention.

The present invention (claim 5) is a video search digest generation method for searching video containing audio data and generating and providing a digest, in which:
video input means performs a video input step (step 1) of receiving a video and storing it in storage means;
keyword extraction means performs a keyword extraction step (step 2) of reading the video from the storage means, analyzing it, extracting keywords, and storing them in the storage means;
digest generation means performs a digest generation step (step 3) of reading the video from the storage means and storing, in the storage means, summary information, which is information for generating a digest from sections into which the video is divided; and
search presentation means performs a search presentation step (step 4) of acquiring, from the storage means, the keyword corresponding to a search term specified by a user, and either acquiring the summary information corresponding to that keyword from the storage means or acquiring the video of the sections indicated by the summary information from the storage means as a digest and outputting it.

In the present invention (claim 6), the keyword extraction step (step 2) includes a step of extracting keywords of the video from at least one of: metadata bundled with the video; text near the position where the video is displayed on the site that published the video; and a phoneme string obtained by analyzing the audio accompanying the video.

In the present invention (claim 7), the keyword extraction step (step 2) includes a step of extracting keywords of the video by associating text on a network with a phoneme string obtained by analyzing the audio accompanying the video.

In the present invention (claim 8), the digest generation step (step 3) includes a step of dividing the video into one or more sections, assigning each section a priority for use in the digest based on at least one of audio features and image features, and using the priorities to generate summary information for a plurality of digest lengths.

The present invention (claim 9) is a video search digest presentation program for causing a computer to function as each of the means constituting the video search digest generation device according to any one of claims 1 to 4.

According to the present invention, extracting and storing the digest and keywords of each video in advance makes it possible to present, at high speed, a digest that satisfies the search term specified by the user.

It also becomes possible to extract keywords of a video.

Furthermore, it becomes possible to assign a priority for use in the digest to each of the one or more sections used for the digest, and to generate digests of a plurality of lengths.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 shows the configuration of a video search digest generation device according to an embodiment of the present invention.

The video search digest generation device shown in the figure includes a central processing unit (CPU) 11. A program memory 13, a data memory 14, and a communication interface (communication I/F) 15 are each connected to the CPU 11 via a bus 12.

The program memory 13 stores a video input unit 131, a keyword extraction unit 132, a digest generation unit 133, and a search request search presentation unit (hereinafter, "search presentation unit") 134.

The data memory 14 is provided with a content storage unit 141 and a search digest storage unit 142.

Under the control of the CPU 11, the communication I/F 15 communicates with servers and sites on the Internet according to the communication protocol defined by the communication network. The protocol used is, for example, TCP/IP (Transmission Control Protocol/Internet Protocol).

When a video file is input, the video input unit 131 in the program memory 13 stores it in the content storage unit 141.

The keyword extraction unit 132 reads a video file from the content storage unit 141 and analyzes the video to extract keywords. When extracting keywords, it uses any one or more of the metadata bundled with the video, the text near where the video is displayed on the site that published it, and the phoneme string obtained by analyzing the audio accompanying the video, and stores the extracted keywords in the search digest storage unit 142. Alternatively, keywords may be extracted by associating text on the network with the phoneme string obtained by analyzing the accompanying audio.

The digest generation unit 133 divides the video into one or more sections and assigns each section a priority for use in the digest based on one or more of audio features and image features. Using these priorities, it generates digest summary information for producing digests of a plurality of lengths, and stores that information in the search digest storage unit 142 in association with the keywords extracted by the keyword extraction unit 132.

The search presentation unit 134 searches the search digest storage unit 142 for the digest summary information corresponding to the search term specified by the user, acquires the corresponding content from the content storage unit 141 as a digest, and displays it on the user's display means (not shown) via the communication I/F 15.

FIG. 4 is a flowchart of the operation of the video search digest generation device according to an embodiment of the present invention.

Step 101) When a video is input, the video input unit 131 stores it in the content storage unit 141 of the data memory 14.

Step 102) Next, the keyword extraction unit 132 reads the video from the content storage unit 141, extracts keywords of the video from any one or more of the metadata bundled with the video, the text near where the video is displayed on the site that published it, and the phoneme string obtained by analyzing the audio accompanying the video, and stores the keywords in the search digest storage unit 142.

Step 103) The digest generation unit 133 reads the video from the content storage unit 141 and divides it into a plurality of sections. For each section, it stores summary information, including the priority for presenting the section in a digest and the duration of the section, in the search digest storage unit 142 in association with the keywords extracted in step 102.

Step 104) When the user enters a search term, the search presentation unit 134 searches the search digest storage unit 142 with that term, acquires the keyword corresponding to the term, and acquires the digest summary information corresponding to the keyword. Based on the priorities in the summary information, it reads the video of N sections from the content storage unit 141 as the digest video and outputs it to the user's display means (not shown) via the communication I/F 15.
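Steps 102 to 104 can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the patented implementation: the class and field names (`Segment`, `SearchDigestStore`, `keyword_index`, `summaries`) are hypothetical, the search term is matched to a keyword by exact string equality, and the selected sections are returned in chronological order, none of which the text specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Segment:
    start: float   # section start time in seconds
    end: float     # section end time in seconds
    priority: int  # 1 is the highest priority


@dataclass
class SearchDigestStore:
    """Hypothetical stand-in for the search digest storage unit 142."""
    keyword_index: Dict[str, Set[str]] = field(default_factory=dict)
    summaries: Dict[str, List[Segment]] = field(default_factory=dict)

    def register(self, video_id: str, keywords: Iterable[str],
                 segments: Iterable[Segment]) -> None:
        # Steps 102 and 103: store keywords and summary information together.
        for keyword in keywords:
            self.keyword_index.setdefault(keyword, set()).add(video_id)
        self.summaries[video_id] = sorted(segments, key=lambda s: s.priority)

    def search(self, term: str, n: int) -> Dict[str, List[Segment]]:
        # Step 104: look up the keyword, then take the N highest-priority sections.
        digests = {}
        for video_id in sorted(self.keyword_index.get(term, set())):
            top = self.summaries[video_id][:n]
            # Assumption: the chosen sections are played in chronological order.
            digests[video_id] = sorted(top, key=lambda s: s.start)
        return digests


store = SearchDigestStore()
store.register("video-1", ["soccer", "goal"],
               [Segment(0, 5, 2), Segment(30, 40, 1), Segment(50, 55, 3)])
print(store.search("goal", n=2))
```

Because keywords and summary information are stored ahead of time, a search only needs two dictionary lookups before the section video is read from content storage, which is the reason the text gives for fast presentation.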

Each element of the above configuration is described in detail below.

<Keyword extraction unit 132>
First, the keyword extraction unit 132 is described.

If a description of the content is packaged in the video file along with the input video data, the keyword extraction unit 132 uses it to extract keywords. One example is a document written in the mpeg7 (http://www.itscj.ipsj.or.jp/mpeg7/) format. Alternatively, a video file has an area called the header, in which information such as the video compression format is recorded. If that area contains a description of the video, it may be used.

Alternatively, keywords related to the video may be extracted by analyzing, for example, the HTML of the network site on which the video is published. For example, JP 2005-115721 A (植松幸生, 竹野浩, 小長井俊介, "Image search method, image search device, and image search program") describes a method that takes as surrounding text the character strings before and after the tag describing the link to an image, and associates those strings with the image. Applying this method to video links instead of image links makes it possible to associate a video with such character strings. Keywords can then be extracted from the character strings using, for example, Japanese Patent No. 3575242 (別所克人, 岩瀬成人, "Keyword extraction method and device, and storage medium storing a keyword extraction program").
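The idea of collecting the text before and after a video link can be sketched with Python's standard `html.parser` module. This is only an illustration and not the method of JP 2005-115721 A: the class `NearbyTextParser`, the function `text_near_video`, and the `window` parameter are hypothetical, and the video link is assumed to be an `<a>` tag whose `href` equals the video URL exactly.

```python
from html.parser import HTMLParser


class NearbyTextParser(HTMLParser):
    """Collect text chunks and remember where the video link occurs."""

    def __init__(self, video_url):
        super().__init__()
        self.video_url = video_url
        self.chunks = []        # non-empty text chunks in document order
        self.link_index = None  # number of chunks seen before the link

    def handle_starttag(self, tag, attrs):
        if (tag == "a" and self.link_index is None
                and dict(attrs).get("href") == self.video_url):
            self.link_index = len(self.chunks)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def text_near_video(html, video_url, window=1):
    """Return up to `window` chunks before the link and `window + 1` from it onward."""
    parser = NearbyTextParser(video_url)
    parser.feed(html)
    parser.close()
    if parser.link_index is None:
        return []
    i = parser.link_index
    return parser.chunks[max(0, i - window): i + window + 1]


page = ('<p>Final match highlights</p><a href="match.mp4">Watch</a>'
        '<p>Goal in the 90th minute</p><p>Unrelated footer</p>')
print(text_near_video(page, "match.mp4"))
```

The returned strings would then be passed to a keyword extractor; that stage is outside this sketch.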

Alternatively, the audio accompanying the video may be analyzed to obtain a phoneme string, which is then used. The phoneme string may be extracted using, for example, the technique of Japanese Patent No. 3368989 (野田喜昭, 嵯峨山茂樹, "Speech recognition method") or JP 2000-89791 A (宮崎昇, 川端豪, "Speech recognition response method, device therefor, and program recording medium"). Speech recognition is performed on this phoneme string using the technique of Japanese Patent No. 3368989, and keywords are extracted from the recognition result using the technique of Japanese Patent No. 3575242.

Alternatively, audio is extracted from the video file and a phoneme string is obtained from it, yielding a symbol string consisting of phoneme symbols and silent-section information. This is compared with symbol strings, consisting of phoneme strings and phrase information, of article information acquired from the Web in order to compute a similarity to the set of speech symbols, and the text of the article whose related article ID gives the maximum similarity to the video file is acquired. Keywords may then be extracted from the text of that related article by the method of Japanese Patent No. 3575242.

The keywords described so far need not be a single keyword; for example, the technique of Japanese Patent No. 3575242 can extract a plurality of keywords. The method of storing these one or more keywords in the search digest storage unit 142 is described later.

<Digest generation unit 133>
Next, the digest generation unit 133 is described in detail.

The digest generation unit 133 reads the video from the content storage unit 141, divides it into one or more sections, and assigns priorities to the sections that can be used in a digest.

One way to assign priorities is the technique described in Japanese Patent No. 3803311 (日高浩太, 水野理, 中嶌信弥, "Speech processing method, device using the method, and program therefor"). That technique extracts the emphasized state of speech probabilistically, that is, as a degree of emphasis, so section priorities can be assigned by ordering the sections in descending order of their degree of emphasis.
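The following sketch shows how priorities derived from a per-section degree of emphasis could yield digests of several lengths. It assumes the degree of emphasis is already available as one number per section; the function names `rank_segments` and `select_for_length`, and the greedy rule that skips a section when it would exceed the time budget, are illustrative assumptions rather than details from the patent.

```python
def rank_segments(emphasis):
    """Return a priority per section; 1 goes to the highest degree of emphasis."""
    order = sorted(range(len(emphasis)), key=lambda i: emphasis[i], reverse=True)
    priority = [0] * len(emphasis)
    for rank, index in enumerate(order, start=1):
        priority[index] = rank
    return priority


def select_for_length(durations, priority, max_seconds):
    """Pick sections in priority order within the budget; return playback order."""
    chosen, total = [], 0.0
    for index in sorted(range(len(priority)), key=lambda i: priority[i]):
        if total + durations[index] <= max_seconds:
            chosen.append(index)
            total += durations[index]
    return sorted(chosen)


emphasis = [0.2, 0.9, 0.5, 0.7]   # hypothetical degree of emphasis per section
durations = [5.0, 4.0, 6.0, 3.0]  # section durations in seconds
priority = rank_segments(emphasis)
for budget in (8, 20):
    print(budget, select_for_length(durations, priority, budget))
```

Storing only the priorities and durations, rather than rendered clips, lets one set of summary information serve any requested digest length.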

Alternatively, a technique may be used that obtains the degree of emotion of the speech in each section and assigns priorities in descending order of that degree. For example, the technique described in WO 2008/032787 A1 can be used. By setting one or more emotions in the learning process, a degree of emotion can be extracted for each of those emotions. The final degree of emotion for a section may also be defined as the maximum, sum, product, or average of the one or more degrees of emotion, and priorities assigned accordingly.
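The four ways of combining per-emotion degrees into a final degree can be written compactly as follows. The emotion labels, the numeric values, and the names `AGGREGATORS`, `final_emotion_degree`, and `prioritize` are hypothetical; how the individual degrees are estimated is left to the cited technique.

```python
import math
from statistics import mean

# The four combination rules named in the text.
AGGREGATORS = {"max": max, "sum": sum, "product": math.prod, "mean": mean}


def final_emotion_degree(degrees, how="max"):
    """Combine the per-emotion degrees of one section into a single value."""
    return AGGREGATORS[how](degrees)


def prioritize(sections, how="max"):
    """Return section indices ordered from highest to lowest final degree."""
    scores = [final_emotion_degree(list(s.values()), how) for s in sections]
    return sorted(range(len(sections)), key=lambda i: scores[i], reverse=True)


sections = [
    {"joy": 0.2, "anger": 0.1},
    {"joy": 0.9, "anger": 0.05},
    {"joy": 0.5, "anger": 0.5},
]
print(prioritize(sections, "max"))
print(prioritize(sections, "sum"))
```

The choice of rule changes the ranking: the maximum favors a section with one strong emotion, while the sum or product favors a section where several emotions are present at once.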

Alternatively, the facial expression detection device described below may be used. It detects a person's laughing state from the image information of a section, and priorities are assigned in descending order of the degree of emotion.

FIG. 5 shows the configuration of a facial expression detection device according to an embodiment of the present invention, and FIG. 6 is a flowchart of the basic facial expression detection process according to an embodiment of the present invention.

The facial expression detection device shown in the figure consists of a moving image input unit 10, a face image region extraction unit 20, a feature point extraction unit 30, a feature amount extraction unit 40, a laughing state detection unit 50, a feature point storage unit 35, and a feature amount storage unit 45.

Step 1) The moving image input unit 10 receives a moving image.

Step 2) The face image region extraction unit 20 extracts a person's face image region from the input moving image using a classifier based on Haar-like features trained with AdaBoost. A large number of weak classifiers are arranged in a cascade, and the cascade classifier is applied while varying the size and position of the region being classified in order to locate the face image region. This is described, for example, in "Paul Viola, Michael J. Jones. Robust Real-Time Face Detection. International Journal of Computer Vision, Vol. 57, No. 2, pp. 137-154 (2004)".

Step 3) The feature point extraction unit 30 extracts the positions of the tip of the nose and the left and right corners of the mouth from the face image region extracted by the face image region extraction unit 20 as feature points, and stores them in the feature point storage unit 35. As preprocessing for feature point extraction, the 25 feature points indicated by black circles in FIG. 7 are extracted. These feature points are assigned to the facial outline, eyes, eyebrows, nose, and mouth. They are extracted by a known method such as those described in "Lades M., Vorbruggen J., Buhmann J., Lange J., Konen W., von der Malsburg C., Wurtz R. Distortion Invariant Object Recognition in the Dynamic Link Architecture. IEEE Trans. Computers, Vol. 42, No. 3, pp. 300-311 (1993)" and "Wiskott L., Fellous J.-M., Kruger N., von der Malsburg C. Face Recognition by Elastic Bunch Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, Issue 7, pp. 775-779 (1997)". With these known techniques, the 25 feature points shown in FIG. 7 can be extracted stably even when, for example, the person moves their face. Of these 25 points, only the three double-circled points (h, i, j) shown in FIG. 8 are used; the remaining points are not needed. These three points correspond to the tip of the nose and the left and right corners of the mouth. Alternatively, only the three required points may be extracted directly without extracting all 25, which avoids extracting unnecessary points.

Step 4) The feature amount extraction unit 40 measures the angle formed with the left and right mouth corners when the tip of the nose is taken as the reference, and uses it as the feature. In the example of FIG. 9, the angle α formed by the left and right mouth corner positions i and j with respect to the nose tip h is measured and stored in the feature amount storage unit 45 as the feature amount.
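A minimal sketch of the angle measurement in step 4 follows. It assumes α is the angle at the nose tip h between the lines to the mouth corners i and j, which is one reading of FIG. 9 (the figure itself is not reproduced here), and that feature points are given as (x, y) pixel coordinates. The function name `mouth_angle` is hypothetical.

```python
import math


def mouth_angle(nose, left_corner, right_corner):
    """Angle in degrees at the nose tip between the two mouth corners."""
    v1 = (left_corner[0] - nose[0], left_corner[1] - nose[1])
    v2 = (right_corner[0] - nose[0], right_corner[1] - nose[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("a mouth corner coincides with the nose tip")
    cosine = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Image coordinates with y pointing down; the mouth lies below the nose tip.
neutral = mouth_angle((0, 0), (-1, 1), (1, 1))
smiling = mouth_angle((0, 0), (-2, 1), (2, 1))
print(neutral, smiling)
```

Under this reading, the angle grows as the mouth corners spread apart, which is what makes its change over time usable as a cue for laughter.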

Step 5) The laughing state detection unit 50 reads the feature amount (angle α) from the feature amount storage unit 45 and obtains the change of the angle α over time. From this change it divides the signal into three states, namely the rise from the equilibrium state, the maximum-angle state, and the fall back to the equilibrium state, and thereby captures the continuous change from the start to the end of a laughing state. Specifically, the time history of the angle feature α is measured as shown in FIG. 10 and, as shown in that figure, divided into the three states of rise from equilibrium, maximum angle, and fall to equilibrium. In practice, a person's mouth is not necessarily closed in the equilibrium state, and the mouth also opens and closes during ordinary conversation. To judge whether the person is laughing even in such cases, the change of the feature α over time can be observed. Specifically, as shown in FIG. 11, the time derivative dα/dt of the feature and two thresholds are used. The two thresholds are called the high threshold th_upper and the low threshold th_lower. These thresholds may be set statically or dynamically by the method described later.

The method by which the laughing state detection unit 50 divides the signal into the three states is described below.

For the rise from the equilibrium state, the start time t0 is found by starting from the time at which the time derivative dα/dt exceeds the high threshold th_upper and looking toward earlier times for the nearest time at which dα/dt = 0. The end time t1 is found by looking from the same point toward later times for the nearest time at which dα/dt = 0. Time t1 is also the start time of the maximum-angle state.

For the fall to the equilibrium state, the start time t2 is found by starting from the time, at or after the maximum-angle state, at which dα/dt falls below the low threshold th_lower and looking toward earlier times for the nearest time at which dα/dt = 0. Time t2 is also the end time of the maximum-angle state. The end time t3 is found by looking from the same point toward later times for the nearest time at which dα/dt = 0.

As described above, the interval from the start time of the rise to the end time of the fall to equilibrium is identified as one continuous laughing state.
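The search for t0 to t3 can be sketched on uniformly sampled angles as follows. This is an illustration under assumptions the text does not state: the derivative is approximated by forward differences, "dα/dt = 0" is approximated by the first sample at which the derivative stops being positive (or negative), and only the first laughing interval is returned. The names `derivative` and `find_laugh_interval` are hypothetical.

```python
def derivative(alpha, dt):
    """Forward-difference approximation of d(alpha)/dt between samples."""
    return [(alpha[k + 1] - alpha[k]) / dt for k in range(len(alpha) - 1)]


def find_laugh_interval(alpha, dt, th_upper, th_lower):
    """Return (t0, t1, t2, t3) for the first laughing state, or None."""
    d = derivative(alpha, dt)
    n = len(d)
    up = next((k for k in range(n) if d[k] > th_upper), None)
    if up is None:
        return None
    t0 = up
    while t0 > 0 and d[t0 - 1] > 0:   # walk toward earlier times
        t0 -= 1
    t1 = up
    while t1 < n and d[t1] > 0:       # walk toward later times
        t1 += 1
    down = next((k for k in range(t1, n) if d[k] < th_lower), None)
    if down is None:
        return None
    t2 = down
    while t2 > t1 and d[t2 - 1] < 0:  # walk toward earlier times
        t2 -= 1
    t3 = down
    while t3 < n and d[t3] < 0:       # walk toward later times
        t3 += 1
    return tuple(k * dt for k in (t0, t1, t2, t3))


alpha = [10, 10, 12, 16, 18, 18, 18, 18, 15, 11, 10, 10]  # hypothetical angles
print(find_laugh_interval(alpha, dt=1.0, th_upper=3, th_lower=-3))
```

In this example the rise spans t0 to t1, the maximum-angle state spans t1 to t2, and the fall spans t2 to t3, so the whole laughing state is the interval from t0 to t3.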

次に、前述の高閾値thupperと低閾値thlowerを動的に設定する方法について述べる。 Next, a method for dynamically setting the above-described high threshold thupper and low threshold thlower will be described.

For example, let σupper and μupper be the standard deviation and mean for the high threshold, and σlower and μlower be the standard deviation and mean for the low threshold. The thresholds may then be set as
thupper = a · σupper + b · μupper   (1)
thlower = c · σlower + d · μlower   (2)
Here a, b, c, and d are coefficients that may take arbitrary values. For example, test video may be prepared in advance and the coefficients set through a statistical learning process. Specifically, a ground-truth set of start and end times for the three states of the laughing state may be labeled by hand, and a, b, c, and d may be set so as to minimize the time difference between those labels and the start and end times of the three states extracted by the present invention.
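Equations (1) and (2) can be illustrated with a short sketch. The text does not say which samples the standard deviation and mean are taken over, so the code below assumes, purely for illustration, that the positive dα/dt samples feed the high threshold and the negative samples feed the low threshold. The function name `dynamic_thresholds` and the default coefficients are hypothetical.

```python
from statistics import mean, pstdev


def dynamic_thresholds(derivs, a=1.0, b=1.0, c=-1.0, d=1.0):
    """Compute (th_upper, th_lower) per Equations (1) and (2).

    Assumption: sigma/mu for the high threshold come from positive
    derivative samples, and for the low threshold from negative ones.
    """
    rising = [v for v in derivs if v > 0]
    falling = [v for v in derivs if v < 0]
    if not rising or not falling:
        raise ValueError("need both positive and negative derivative samples")
    th_upper = a * pstdev(rising) + b * mean(rising)    # Equation (1)
    th_lower = c * pstdev(falling) + d * mean(falling)  # Equation (2)
    return th_upper, th_lower
```

In practice the coefficients would be fitted against hand-labeled start and end times as the text describes; the defaults here are placeholders only.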

Even when a person is not speaking at all, the shape of the mouth can be expected to change slightly. This is easy to imagine from actions such as biting the lips or swallowing saliva. These small changes affect the angle α. In vocalization, including laughter, a person likewise does not open and close the mouth regularly but with a certain degree of irregularity. To reduce the influence of such noise, a median filter may, for example, be applied to the detected angle.
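As one possible form of the noise reduction mentioned above, a sliding-window median filter over the detected angle sequence might look like the sketch below. The window size of 3 and the shrinking window at the sequence edges are assumptions made for illustration; the text does not specify them.

```python
from statistics import median


def median_filter(values, window=3):
    """Return a median-filtered copy of `values`.

    `window` must be a positive odd number. Near the ends of the
    sequence the window is truncated to the available samples.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(median(values[lo:hi]))
    return out
```

A single-sample spike in the angle α is removed by the filter, while a sustained change such as a held smile is preserved.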

With the time derivative dα/dt used in the present invention, a laughing state may be indistinguishable from ordinary speech. For example, when the word "illegal" is uttered, dα/dt increases during "ille" and decreases during "gal", which can resemble the behavior of a laughing state. In such a case, the problem can be avoided by focusing on the duration of the maximum angle state and setting a temporal threshold ttime, for example by requiring t2 - t1 > ttime.

Through the above processing, the laughing state detection unit 50 outputs information consisting of time, the angle α, and the time derivative dα/dt, or information on the times that divide the three states.

The basic example of the present invention has been described so far. However, when only the angle α is considered, a state that is left-right asymmetric with respect to the ridge line of the nose, which occurs frequently in, for example, a forced smile or when making a sarcastic remark, may also be determined to be a laughing state. To address this problem, as shown in FIG. 12, the line segment connecting the tip of the nose h to the midpoint of the line segment joining the left and right mouth corner positions i and j is taken as the reference line, the angles between the reference line and the left and right mouth corner positions are denoted α1 and α2, and whether the state is a target is determined by considering the difference between these values.

For example, the time histories of the time derivatives dα1/dt and dα2/dt may be measured and their correlation coefficient computed, and the state may be treated as a target when the coefficient is, for example, 0.5 or more. Alternatively, a threshold may be set on |ts1 - ts2|, where ts1 and ts2 are the times at which dα1/dt > 0 and dα2/dt > 0, respectively.
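The correlation-based symmetry check can be sketched as below. The text only says "correlation coefficient"; Pearson's coefficient is assumed here, and the function names `pearson` and `is_symmetric_laugh` are hypothetical. The 0.5 default follows the example value in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.
    Returns 0.0 when either sequence is constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def is_symmetric_laugh(d_alpha1, d_alpha2, min_corr=0.5):
    """Treat the state as a target when the left and right derivative
    histories are correlated at or above `min_corr`."""
    return pearson(d_alpha1, d_alpha2) >= min_corr
```

When both mouth corners move together the coefficient is close to 1 and the state is accepted; when only one side moves, or the sides move in opposite directions, it is rejected.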

Alternatively, the following method may be used. A degree of dominance indicating whether the face region is dominant in the image is obtained, and sections are prioritized in descending order of their degree of dominance. For this purpose, the technique described in, for example, JP 2006-244074 A (Yosuke Torii, Seiichi Konya, Masashi Morimoto, "Moving object close-up frame detection method, program, and storage medium storing the program, and moving object close-up shot detection method, moving object close-up frame or shot detection method, program, and storage medium storing the program") can be used.

The degree of emphasis, degree of emotion, degree of smile, and degree of dominance described above may each be expressed in the range 0 to 1, and priorities may be assigned in descending order of the sum, product, average, or maximum of any one or more of them.
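The score combination described above might be sketched as follows. The data layout (a dictionary from section number to a list of scores in the range 0 to 1) and the names `COMBINERS` and `rank_sections` are assumptions made for this illustration.

```python
from math import prod
from statistics import mean

# The four combination rules named in the text.
COMBINERS = {"sum": sum, "product": prod, "average": mean, "max": max}


def rank_sections(scores, how="average"):
    """Map each section to a priority, where 1 is the highest.

    `scores` maps a section number to its list of 0..1 scores
    (emphasis, emotion, smile, dominance, or any subset of them).
    """
    combine = COMBINERS[how]
    ordered = sorted(scores, key=lambda sec: combine(scores[sec]), reverse=True)
    return {sec: rank for rank, sec in enumerate(ordered, start=1)}
```

Sections with equal combined scores keep their input order, since Python's sort is stable; the text does not specify a tie-breaking rule.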

The per-section priority results are stored in the search digest storage unit 142. They may be stored, for example, in the summary management information format described in JP 2007-140951 A (Kota Hidaka, Takashi Sato, "Data editing device and program therefor"). Keywords may also be recorded alongside, for example as shown in FIG. 13. In the present invention, a description format that lists at least the (a) keyword information description part and the (b) summary management information description part shown in FIG. 4 is called a "search digest description document". The start time and end time in FIG. 13(b) correspond to the section start and end times described for the digest generation unit 133, and the time length can be obtained from them. The likelihood is the degree of emphasis, emotion, smile, or dominance described above, expressed in the range 0 to 1. In FIG. 13, the (b) summary management information part of the search digest description document is described as a time series of sections, but it may instead be described per digest time length, as shown in FIG. 14. In FIG. 14, sections are joined in descending order of priority until, for example, the digest time length is exceeded. The example shows only section number "1" selected for a digest time length of 5 seconds, and section numbers "1" and "3" selected for 15 seconds. This makes it possible to describe how to generate multiple digests. Alternatively, when each digest has been generated in advance, its storage location may be recorded in association with the video file name and the digest time length. In FIG. 14, the video file name is "C1", and the storage location of the digest with a digest time length of 5 seconds is given as "http://www.abc.d.e.jp/C1/d05.mpg".
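The rule of joining sections by priority until the digest time length is reached or exceeded can be sketched as follows. The `Section` record and `select_for_digest` are hypothetical names, and the section times in the usage note are invented for illustration; the actual values of FIG. 13 and FIG. 14 are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Section:
    number: int
    start: float   # seconds
    end: float     # seconds
    priority: int  # 1 is the highest priority


def select_for_digest(sections, target_seconds):
    """Add sections from highest priority downward until the accumulated
    length reaches or exceeds `target_seconds`, then return the chosen
    sections in playback (chronological) order."""
    chosen = []
    total = 0.0
    for sec in sorted(sections, key=lambda s: s.priority):
        if total >= target_seconds:
            break
        chosen.append(sec)
        total += sec.end - sec.start
    return sorted(chosen, key=lambda s: s.start)
```

With three invented sections of 6, 14, and 10 seconds whose priorities are 1, 3, and 2, a 5-second target selects only section 1, and a 15-second target selects sections 1 and 3, which mirrors the pattern of the example in the text.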

Next, the search presentation unit 134 is described.

When a search condition is input by the user, the search presentation unit 134 matches the search condition against the search digest description documents stored in the search digest storage unit 142 and presents to the user video that satisfies the search condition. The matching may use, for example, the technique described in Japanese Patent No. 3371983 (Hideaki Ozawa, Toru Nakagawa, "Method and apparatus for matching an incomplete character string with a character string").

For example, suppose the user's search condition "administrative reform" is found in section number "1" of the (b) summary information description part of FIG. 13 held in the search digest storage unit 142. In that case, the digest presented is formed by joining sections "3" and "2" in consideration of their priorities.

Alternatively, if the user's search condition "House of Representatives" matches section number "3" of the (b) summary information description part of FIG. 13, the digest is created from section number 3 only.

Suppose further that the user's search condition "administrative reform" is found in multiple contents, for example also in content such as that shown in FIG. 15. If "administrative reform" appears in section number "3" of the (b) summary information description part of FIG. 15 and that section has priority "1", the priorities in FIG. 15 and FIG. 13 are compared, and a digest of the content of FIG. 15 may be created and presented to the user in preference to the content of FIG. 13.

When multiple contents are compared in this way, there may be multiple contents in which the search condition matches a section of priority "1". In that case, the contents may be compared using the video creation, publication, or revision date and time attached to each content, or such dates and times recorded in advance in the search digest storage unit 142, and the more recent content may be presented to the user first.
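The comparison across contents, with recency as the tie-breaker, might be sketched as below. The tuple layout (content identifier, priority of the matched section, date) and the name `pick_content` are assumptions made for this illustration.

```python
from datetime import date


def pick_content(candidates):
    """Choose the content to present first.

    `candidates` is a list of (content_id, matched_priority, content_date)
    tuples. The smallest priority number wins; among equal priorities the
    most recent date wins.
    """
    best = min(candidates, key=lambda c: (c[1], -c[2].toordinal()))
    return best[0]
```

For example, if two contents both match the search condition in a priority "1" section, the one with the later creation, publication, or revision date is returned.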

In the above, priorities were assigned to video sections during digest generation using one or more of the degrees of emphasis, emotion, smile, and dominance. In addition, it is also possible to raise the priority of sections in which a keyword appears.

For example, with the method that obtains a phoneme string by analyzing the audio accompanying the video, the time at which a keyword occurs can be determined. That time may be matched against the (b) summary information description part of FIG. 13 and, for example, because section number "2" contains the keyword "cabinet reshuffle", the priority of that section may be changed to "1".

Alternatively, priorities may be assigned according to how many keywords from the (a) keyword information description part of FIG. 13 a section contains. For example, a section containing the three keywords "cabinet reshuffle", "prime minister", and "dismissal" may be given priority "1", and a section containing the two keywords "prime minister" and "cabinet formation" may be given priority "2".
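Ranking sections by the number of keywords they contain can be sketched as follows. The input layout (a dictionary from section label to the keywords spoken in it) and the name `rank_by_keyword_count` are hypothetical; ties keep their input order because the text does not define a tie-breaking rule.

```python
def rank_by_keyword_count(section_keywords, keywords):
    """Map each section to a priority (1 is highest) according to how many
    of the listed `keywords` occur in that section."""
    wanted = set(keywords)
    counts = {sec: len(wanted & set(kws)) for sec, kws in section_keywords.items()}
    ordered = sorted(counts, key=lambda sec: counts[sec], reverse=True)
    return {sec: rank for rank, sec in enumerate(ordered, start=1)}
```

Applied to the example in the text, a section containing three of the listed keywords is ranked ahead of a section containing two.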

Alternatively, the search condition may be taken into account; for example, when "manifesto" is input, a section containing that keyword may be given priority "1".

In addition, for a section that matches the user's search condition, not only the section listed in the (b) summary information description part of FIG. 13 or FIG. 15 but also preceding and following sections may be added, according to its priority, and used for digest generation. This extends the section in time so that the user can better understand the section being deliberately searched for. For example, when the search request "administrative reform" is found in section number "3" of the (b) summary information description part of FIG. 13, the priorities of the sections before and after it are compared, and the preceding or following section with the higher priority is added. Assuming, for example, that at most the same length as the section itself is added before and after it, the amount to add may be preset as 2d × 1/y, where y is the priority and d is the time length of the section. For example, suppose the search request "administrative reform" is found in section number "3" of the (b) summary information description part of FIG. 13, which has priority "1". The target is then 2d, so the likelihoods of "section 2" and "section 4" are compared first and the larger, "section 4", is added. Then, until 2d is reached or exceeded, "section 2" is added next, followed by whichever of "section 1" and "section 5" wins the comparison, and so on.
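The neighbor-extension procedure can be sketched as below. It follows the order given in the text (the two adjacent sections first, the one with the larger likelihood before the other, then the next pair outward) and stops once the added length reaches or exceeds 2d × 1/y. The dictionary keys and the name `extend_section` are assumptions made for this illustration.

```python
def extend_section(sections, k):
    """Return the sorted indices of sections to include in the digest.

    `sections` is a chronological list of dicts with keys 'duration',
    'priority', and 'likelihood'; `k` is the index of the section that
    matched the search condition.
    """
    budget = 2 * sections[k]["duration"] / sections[k]["priority"]  # 2d * 1/y
    chosen = [k]
    added = 0.0
    for dist in range(1, len(sections)):
        # Sections at the same distance before and after the match.
        ring = [i for i in (k - dist, k + dist) if 0 <= i < len(sections)]
        ring.sort(key=lambda i: sections[i]["likelihood"], reverse=True)
        for i in ring:
            if added >= budget:
                return sorted(chosen)
            chosen.append(i)
            added += sections[i]["duration"]
    return sorted(chosen)
```

With five equal-length sections and a match in the third section at priority 1, the two adjacent sections are added and the outer pair is left out. At priority 2 the budget is halved and only the adjacent section with the larger likelihood is added.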

Furthermore, by preparing in advance the keyword information description part shown in FIG. 13 and the digests shown in FIG. 14, digests can be generated quickly even when the number of videos increases rapidly.

The operations of the components of the video search digest generation device shown in FIG. 3 can be implemented as a program, which can be installed and executed on a computer used as the video search digest generation device, or distributed via a network.

The program so constructed can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and installed on a computer or distributed.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for searching video and providing digests.

FIG. 1 is a diagram of the principle configuration of the present invention.
FIG. 2 is a diagram for explaining the principle of the present invention.
FIG. 3 is a configuration diagram of the video search digest generation device in an embodiment of the present invention.
FIG. 4 is a flowchart of the operation of the video search digest generation device in an embodiment of the present invention.
FIG. 5 is a configuration diagram of the facial expression detection device in an embodiment of the present invention.
FIG. 6 is a flowchart of the basic facial expression detection processing in an embodiment of the present invention.
FIG. 7 is an example of feature points extracted as advance preparation for feature point extraction in an embodiment of the present invention.
FIG. 8 is an example of feature points in an embodiment of the present invention.
FIG. 9 is an example of a feature amount in an embodiment of the present invention.
FIG. 10 is a schematic diagram of the laughing state divided into three states in an embodiment of the present invention.
FIG. 11 is a schematic diagram showing the method of dividing the laughing state into three states in an embodiment of the present invention.
FIG. 12 shows the feature amounts used to extract the laughing state in consideration of left-right symmetry in an embodiment of the present invention.
FIG. 13 is an example (1) of a search digest description document stored in the search digest storage unit in an embodiment of the present invention.
FIG. 14 is an example in which the summary information description part is described per digest time length in an embodiment of the present invention.
FIG. 15 is an example (2) of a search digest description document stored in the search digest storage unit in an embodiment of the present invention.

Explanation of symbols

11 CPU
12 Bus
13 Program memory
14 Data memory
10 Input means, input unit
20 Face image region extraction unit
30 Feature point extraction unit
35 Feature point storage unit
40 Feature amount extraction unit
45 Feature amount storage unit
50 Laughing state detection unit
131 Video input means, video input unit
132 Keyword extraction means, keyword extraction unit
133 Digest generation means, digest generation unit
134 Search presentation means, search presentation unit
141 Video content storage means, content storage unit
142 Search digest storage means, search digest storage unit

Claims

A video search digest generation device that searches video including audio data and generates and provides a digest, the device comprising:
video input means for inputting a video and storing it in storage means;
keyword extraction means for reading the video from the storage means, analyzing it, extracting keywords, and storing the keywords in the storage means;
digest generation means for reading the video from the storage means and storing, in the storage means, summary information, which is information for generating a digest from sections into which the video is divided; and
search presentation means for acquiring from the storage means a keyword corresponding to a search term specified by a user, and acquiring from the storage means the summary information corresponding to the keyword, or acquiring from the storage means, as a digest, the video of the sections indicated by the summary information and outputting it.

The video search digest generation device according to claim 1, wherein the keyword extraction means includes means for extracting keywords of the video from at least one of: metadata bundled with the video; text near the display position of the video on a site where the video is published; and a phoneme string obtained by analyzing audio accompanying the video.

The video search digest generation device according to claim 1, wherein the keyword extraction means includes means for extracting keywords of the video by associating text on a network with a phoneme string obtained by analyzing audio accompanying the video.

The video search digest generation device according to claim 1, wherein the digest generation means includes means for dividing the video into one or more sections, assigning to each section a priority for use in a digest using at least one of an audio feature amount and an image feature amount, and generating summary information of a plurality of lengths using the priorities.

A video search digest generation method for searching video including audio data and generating and providing a digest, the method comprising:
a video input step in which video input means inputs a video and stores it in storage means;
a keyword extraction step in which keyword extraction means reads the video from the storage means, analyzes it, extracts keywords, and stores the keywords in the storage means;
a digest generation step in which digest generation means reads the video from the storage means and stores, in the storage means, summary information, which is information for generating a digest from sections into which the video is divided; and
a search presentation step in which search presentation means acquires from the storage means a keyword corresponding to a search term specified by a user, and acquires from the storage means the summary information corresponding to the keyword, or acquires from the storage means, as a digest, the video of the sections indicated by the summary information and outputs it.

The video search digest generation method according to claim 5, wherein the keyword extraction step includes a step of extracting keywords of the video from at least one of: metadata bundled with the video; text near the display position of the video on a site where the video is published; and a phoneme string obtained by analyzing audio accompanying the video.

The video search digest generation method according to claim 5, wherein the keyword extraction step includes a step of extracting keywords of the video by associating text on a network with a phoneme string obtained by analyzing audio accompanying the video.

The video search digest generation method according to claim 5, wherein the digest generation step includes a step of dividing the video into one or more sections, assigning to each section a priority for use in a digest using at least one of an audio feature amount and an image feature amount, and generating summary information of a plurality of lengths using the priorities.

A video search digest generation program for causing a computer to function as each of the means constituting the video search digest generation device according to any one of claims 1 to 4.