JP2000069420A

JP2000069420A - Video image processor

Info

Publication number: JP2000069420A
Application number: JP10239859A
Authority: JP
Inventors: Katsuhiko Sato; 克彦佐藤; Hiroyuki Akagi; 宏之赤木; Mitsuaki Nakamura; 三津明中村
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1998-08-26
Filing date: 1998-08-26
Publication date: 2000-03-03
Anticipated expiration: 2018-08-26
Also published as: JP3558886B2

Abstract

PROBLEM TO BE SOLVED: To integrate a long video image into a short shot group consisting of meaningful units. SOLUTION: The processing unit has a cut detection means 102 that detects cuts among video image shots, a scene detection means 103 that integrates the shots, based on the similarity of images among the shots, a relative time calculation means 106 that calculates a relative time of the video image from a head frame, and an integrating means 104 that integrates shot groups detected and integrated by the scene detection means 103, based on the relative time detected by a relative time calculation means 106.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a video processing apparatus that automatically consolidates video and facilitates editing and searching of the consolidated video.

[0002]

[Prior Art] In recent years, the number of television channels has been growing through satellite broadcasting, cable television, and the like. In addition, as home video cameras have become widespread, opportunities to record video keep increasing. It has therefore become necessary to grasp the content of a huge number of videos quickly and to extract only the parts one wants to see. Conventionally, a desired scene is searched for while fast-forwarding through the video, but this requires viewing the video in order from the beginning. Even at 10 times normal speed, viewing an entire 2-hour video takes 12 minutes, so searching is very time-consuming.

[0003] To address this problem, dividing a video into segments and displaying each segment by a representative image is very convenient for understanding the video as a whole and for editing and searching. Division at cuts is often used as the division method. A cut is a point at which the picture changes greatly because the camera is switched, and the interval between one cut and the next consists of a series of image frames with continuous content. This series of frames is called a shot and can be handled as one unit of video. However, dividing a video at every cut by hand requires a great deal of labor, so automating it is a major challenge, and many techniques have been developed.

[0004] As a method of integrating shots in order to organize a video into meaningful units of content, Japanese Unexamined Patent Publication No. H9-93588 discloses a method that determines the similarity between frame groups of a moving image, links them, and integrates them into the same set.

[0005]

[Problems to Be Solved by the Invention] However, with a method that divides a video by cut detection alone, a large number of shots are detected, because cuts in most videos are said to occur on average about once every 5 to 10 seconds, and a long video is not condensed into a short one. For example, assuming that a cut occurs once every 10 seconds, cut detection on a 30-minute video finds 180 cuts; shown on a display device that can present 30 images per screen, they fill six screens. This makes it difficult to grasp the video as a whole.
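The arithmetic in this example can be checked with a short sketch. The function name and its parameters are illustrative only and do not appear in the patent.

```python
import math

def screens_needed(video_seconds, seconds_per_cut, thumbnails_per_screen):
    """Return (number of detected cuts, number of screens needed to show them)."""
    cuts = video_seconds // seconds_per_cut
    screens = math.ceil(cuts / thumbnails_per_screen)
    return cuts, screens

# 30-minute video, one cut every 10 seconds, 30 thumbnails per screen.
print(screens_needed(30 * 60, 10, 30))  # (180, 6)
```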

[0006] Furthermore, commercials (CMs) and the like contain many extremely short shots, so many shots without important content are displayed, which makes it difficult to grasp the overall content.

[0007] With a method that integrates shots based only on image similarity, the number of shots merged is at most a few to a few dozen. Even after integration the number of shots remains large, and it is difficult to condense a long video into a short one.

[0008] In addition, some shots in a video carry important content and others do not, so when a video is condensed, the shots with important content must be retained.

[0009] There are also demands to search for and extract, or to edit, only specific shots or only shots with the same background from a video. It is therefore necessary to be able to display desired shots, and actually search for them, based on various features of the video.

[0010] In view of the above problems, an object of the present invention is to provide a video processing apparatus with which an entire video can be easily grasped, condensed, and so on.

[0011]

[Means for Solving the Problems] The video processing apparatus according to claim 1 is a video processing apparatus that integrates frames of sequentially input video, comprising: cut detection means for detecting cuts between shots based on changes in a feature quantity between frames; scene detection means for detecting scene changes based on the similarity between frames of the shots delimited by the detected cuts; relative time calculation means for obtaining the relative time from a reference frame for integration to an input frame; and integration means for integrating scenes, using the detected scene changes as boundaries, each time the relative time reaches a specified fixed time.

[0012] The video processing apparatus according to claim 2 is a video processing apparatus that integrates frames of sequentially input video, comprising: cut detection means for detecting cuts between shots based on changes in a feature quantity between frames; relative time calculation means for obtaining the relative time of an input frame from a reference frame for integration; CM detection means for detecting a CM when the relative time of a detected cut from the reference frame is an integer multiple of a specified fixed time; and integration means for integrating shots using the detected CMs as boundaries.

[0013] The video processing apparatus according to claim 3 is a video processing apparatus that integrates frames of sequentially input video based on audio input in synchronization with the video, comprising: cut detection means for detecting cuts between shots based on changes in a feature quantity between frames; and integration means for integrating scenes using, as boundaries, scene changes detected based on the similarity of the input audio between the shots delimited by the detected cuts.

[0014] The video processing apparatus according to claim 4 is the video processing apparatus according to any one of claims 1 to 3, further comprising: feature detection means for extracting features of the sequentially input video; storage means for storing the extracted features for each shot delimited by the cuts detected by the cut detection means; and division means for dividing the shots in the storage means, based on the extracted features, down to no more than a specified number of shots.

[0015] The video processing apparatus according to claim 5 is the video processing apparatus according to any one of claims 1 to 3, further comprising: feature detection means for extracting features of the sequentially input video; storage means for storing the extracted features for each shot delimited by the cuts detected by the cut detection means; and search means for searching the storage means for shots based on features set as search conditions.

[0016] The video processing apparatus according to claim 6 is the video processing apparatus according to claim 4 or 5, wherein the feature extracted by the feature detection means is camera motion information in the video obtained from motion vectors between frames.

[0017] The video processing apparatus according to claim 7 is the video processing apparatus according to claim 4 or 5, wherein the feature extracted by the feature detection means is volume information.

[0018] The video processing apparatus according to claim 8 is the video processing apparatus according to claim 4 or 5, wherein the feature extracted by the feature detection means is a similarity obtained by comparison with an image model for a certain specific environment.

[0019] The video processing apparatus according to claim 9 is the video processing apparatus according to claim 4 or 5, wherein the feature extracted by the feature detection means is a similarity obtained by comparison with an audio model for a certain specific environment.

[0020]

[Embodiments of the Invention] First, the structure of video is described with reference to FIG. 2. The video input in the present embodiments may be any images that are ordered at input time in some time-series or numerical manner, for example a moving image. It may be a signal input from an external source such as a television, a VTR (digital tape recorder), or a portable video camera, or an encoded signal held on a recording medium such as a computer hard disk, a magneto-optical disk, or a CD-ROM (compact disc read-only memory). Taking shooting with a video camera as an example, each individual image making up the video is called a "frame", a group of frames of continuous video shot without interruption is called a "shot", and the boundary between shots is called a "cut". A group of shots with continuous content is called a "scene", and the boundary between scenes is called a "scene change". Furthermore, "scenes" are integrated based on the sameness of consecutive scenes to form new "scenes". These "scenes" may have any number of hierarchical levels.

[0021] (First Embodiment) The present embodiment is described below with reference to the drawings. FIG. 1 shows the configuration of the video processing apparatus of the first embodiment. Reference numeral 101 denotes an image input terminal; 102, cut detection means that detects cuts using the input images; 103, scene detection means that detects scene changes; 104, integration means that integrates scenes according to the relative time from 106; 105, a storage unit that serves as a work area and also stores the video processed by each part of the apparatus; and 106, relative time calculation means that obtains the relative time from the head frame. The relative time calculation means 106 can calculate the relative time from the difference between the frame number at the head time and the input frame number. For example, for a 30 frames/second video signal, a frame number difference of 120 corresponds to 4 seconds.
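As a minimal sketch of the relative time calculation, assuming frame numbers are consecutive integers and the frame rate is constant (the function name and the default of 30 frames/second are illustrative choices, not taken from the patent):

```python
def relative_time_seconds(reference_frame, input_frame, fps=30.0):
    """Relative time of input_frame measured from reference_frame, in seconds."""
    return (input_frame - reference_frame) / fps

# A frame number difference of 120 at 30 frames/second is 4 seconds.
print(relative_time_seconds(0, 120))  # 4.0
```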

[0022] Next, the processing of the cut detection means 102 is described. The cut detection means 102 detects, as a cut, the boundary between frames with a large change in feature quantity among the consecutive frames of the input images. Groups of frames with continuously small changes are thereby gathered into shots. Methods of detecting the feature quantity change include, for example, using the pixel change area, comparing luminance histograms between frames, and determining the change in color distribution, but the method is not particularly limited.

[0023] A method of using the pixel change area to detect this feature quantity change is described with reference to FIG. 3. An image is represented by a collection of small shapes (often rectangles) called pixels. Suppose FIG. 3(a) shows part of the image at time t. If the image at the same location at time t+1 is FIG. 3(b), the difference image representing the change between times t and t+1 is FIG. 3(d). Similarly, if the image at time t+1 is FIG. 3(c), the difference image is FIG. 3(e). When a cut occurs, the images are discontinuous, so the pixel change area of the difference image is large and spreads over the entire image. The case of FIG. 3(e) is therefore likely to be a cut. By contrast, the change caused by a camera pan is small, and movement of a subject changes the image only locally. A zoom causes change over a wide area, but the change is continuous along the time axis, so it can be distinguished from a cut. FIG. 4 shows the pixel change area over time. A point where the amount of change exceeds a threshold and the large change occurs within a single frame is judged to be a cut and detected as such.
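The pixel change area test can be sketched as follows. This is only one possible reading of the description: frames are assumed to be small grayscale images given as lists of rows, the thresholds are arbitrary, and "a large change within a single frame" is interpreted as an isolated spike whose neighbouring frame differences stay below the threshold. None of these names or values come from the patent.

```python
def changed_area_ratio(prev, cur, pixel_threshold=30):
    """Fraction of pixels whose absolute difference exceeds pixel_threshold."""
    total = 0
    changed = 0
    for row_prev, row_cur in zip(prev, cur):
        for a, b in zip(row_prev, row_cur):
            total += 1
            if abs(a - b) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0

def detect_cuts(frames, pixel_threshold=30, area_threshold=0.5):
    """Return indices i such that a cut lies between frames[i-1] and frames[i]."""
    ratios = [changed_area_ratio(frames[i - 1], frames[i], pixel_threshold)
              for i in range(1, len(frames))]
    cuts = []
    for k, ratio in enumerate(ratios):
        if ratio <= area_threshold:
            continue
        before = ratios[k - 1] if k > 0 else 0.0
        after = ratios[k + 1] if k + 1 < len(ratios) else 0.0
        # Keep only isolated spikes; sustained change (e.g. a zoom) is skipped.
        if before <= area_threshold and after <= area_threshold:
            cuts.append(k + 1)
    return cuts

dark = [[10, 10], [10, 10]]
bright = [[200, 200], [200, 200]]
print(detect_cuts([dark, dark, bright, bright]))  # [2]
```

A sequence in which every frame differs strongly from the previous one produces no cuts under this reading, which is how the sketch separates continuous change from a single discontinuity.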

[0024] Next, the processing of the scene detection means 103 is described. The scene detection means 103 obtains the similarity between the shots gathered by the cut detection means 102 and groups them into scenes based on this similarity. A boundary between shots whose similarity is lower than a predetermined condition is thereby treated as a scene change, and the shots can be gathered into scenes. One example of the similarity is based on color distribution: a frame is divided into regions, the inter-image distance between frames is obtained from the color of each region and the distribution ratio of those colors, and if the distance is larger than a specific threshold, a scene change is declared. Methods using luminance or motion are also applicable.

[0025] The method using color distribution is described with reference to FIG. 5. First, one image is selected from the shot as a representative image, and that image is divided into regions by clustering on color. The color distribution graphs at the bottom of FIG. 5 plot the areas of the clustered regions as the distribution ratio of their colors. The similarity is measured by obtaining the distance between images from the kinds of colors and their distribution ratios. FIG. 5(a) and FIG. 5(b) have the same number of kinds of colors, and the distribution ratio of each color is within a predetermined range, so the similarity is high and it is judged that there is no scene change. FIG. 5(b) and FIG. 5(c) differ in the number of kinds of colors, and the color distribution ratios are outside the predetermined range, so the similarity is low and it is judged that there is a scene change.
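A rough sketch of the color distribution comparison follows. It assumes the clustering step has already assigned each pixel of a representative image a color label, and it uses the sum of absolute differences between distribution ratios as the inter-image distance. The patent does not specify the distance measure or the threshold, so both are assumptions made for illustration.

```python
from collections import Counter

def color_distribution(pixel_labels):
    """Map each color label to the fraction of the image it covers."""
    total = len(pixel_labels)
    if total == 0:
        return {}
    return {color: n / total for color, n in Counter(pixel_labels).items()}

def distribution_distance(p, q):
    """Sum of absolute differences in distribution ratio over all colors."""
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))

def is_scene_change(rep_a, rep_b, threshold=0.5):
    """True when two representative images are farther apart than threshold."""
    distance = distribution_distance(color_distribution(rep_a),
                                     color_distribution(rep_b))
    return distance > threshold

shot_a = ["sky"] * 6 + ["grass"] * 4
shot_b = ["sky"] * 5 + ["grass"] * 5
shot_c = ["red"] * 7 + ["black"] * 3
print(is_scene_change(shot_a, shot_b), is_scene_change(shot_b, shot_c))
```

Colors present in only one image contribute their full ratio to the distance, so a change in the set of colors, as between FIG. 5(b) and FIG. 5(c), pushes the distance up quickly.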

[0026] Next, the processing of the integration means 104 is described with reference to FIG. 6. FIG. 6 shows continuous video in units of scenes. Each scene consists of a group of frames as illustrated. Scenes 1, 2, 3, 4, and 5 were detected by the scene detection means 103, and t1, t2, t3, t4, and t5 denote the head times of the scenes. The integration means 104 places a boundary at every predetermined time T according to the time information from the relative time calculation means 106, determines the scene number at the boundary time, and integrates the scenes from the next scene number onward as a new scene. That is, the integration means 104 integrates them into new scenes (scene 10 and scene 11). At this time, scenes 1, 2, ... are stored hierarchically in the storage unit 105 as lower-level scenes of the upper-level scenes 10 and 11. By performing scene integration in the same way at every time T thereafter, similar video is integrated into scenes of appropriate rather than extreme length, which facilitates searching and editing.
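The time-based integration can be sketched as below, assuming the head times of the lower-level scenes are already known in seconds and sorted. The sketch closes an upper-level scene at the first scene change whose relative time exceeds T and then restarts the clock there; the function name, the example times, and the value of T are made up for illustration.

```python
def integrate_scenes(scene_head_times, interval):
    """Group lower-level scene indices into upper-level scenes of roughly `interval` seconds."""
    groups = []
    current = []
    reference = None
    for index, head in enumerate(scene_head_times):
        if reference is None:
            reference = head
        elif head - reference > interval:
            # The scene that contained the boundary time has ended here,
            # so the next upper-level scene starts with this scene.
            groups.append(current)
            current = []
            reference = head
        current.append(index)
    if current:
        groups.append(current)
    return groups

# Five scenes starting at t1..t5, integrated with T = 120 seconds.
print(integrate_scenes([0, 40, 90, 130, 200], 120))  # [[0, 1, 2], [3, 4]]
```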

[0027] Next, the storage unit 105 is described. The storage unit 105 is a storage device such as a hard disk. An example of the data structure stored in the storage unit 105 is shown in FIG. 8. It consists of an area that stores the head frame numbers of the shots detected by the cut detection means 102, an area that stores the head frame numbers of the lower-level scenes detected by the scene detection means 103, and an area that stores the head frame numbers of the upper-level scenes detected by the integration means 104. In the example of FIG. 8, the upper-level scene area is allocated from address 0000, the lower-level scene area from 0010, and the shot area from 0100.
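One way to model the three areas in memory is sketched below. The class and method names are hypothetical, and the fixed addresses of FIG. 8 are replaced by three sorted lists of head frame numbers; the frame numbers in the usage example are invented.

```python
from dataclasses import dataclass, field

@dataclass
class SceneIndex:
    """Head frame numbers for each level of the hierarchy, in ascending order."""
    upper_scene_heads: list = field(default_factory=list)
    lower_scene_heads: list = field(default_factory=list)
    shot_heads: list = field(default_factory=list)

    @staticmethod
    def _heads_within(parent_heads, child_heads, k):
        # Children of parent k start at or after its head and before the next parent's head.
        start = parent_heads[k]
        end = parent_heads[k + 1] if k + 1 < len(parent_heads) else float("inf")
        return [h for h in child_heads if start <= h < end]

    def lower_scenes_in_upper(self, k):
        return self._heads_within(self.upper_scene_heads, self.lower_scene_heads, k)

    def shots_in_lower(self, k):
        return self._heads_within(self.lower_scene_heads, self.shot_heads, k)

index = SceneIndex(
    upper_scene_heads=[0, 900],
    lower_scene_heads=[0, 300, 900, 1500],
    shot_heads=[0, 120, 300, 450, 900, 1200, 1500],
)
print(index.lower_scenes_in_upper(0))  # [0, 300]
print(index.shots_in_lower(1))         # [300, 450]
```

Because every level stores only head frame numbers, the hierarchy is implicit: a shot belongs to whichever scene has the largest head frame number not exceeding its own.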

[0028] The above processing is described with reference to the flowchart of FIG. 7. First, a storage area is allocated and initialized (S100). The relative time of the head frame that serves as the reference for integration is set (S101). Next, an image is loaded into memory in the processing apparatus; the frame number of the image is read at the same time (S102). Whether there is a cut is determined by comparison with the previous frame (S103). If a cut is found, the head frame number of the shot is written to the shot area of the storage unit 105 (S104). It is then determined whether this cut is a scene change (S105), and if so, the head frame number of the shot is written to the lower-level scene area of the storage unit 105 (S106). At the same time, the relative time of the input frame is obtained from its position in sequence from the head frame (S107). This relative time is compared with the specified time T (S108); if it exceeds T, the video up to that cut is integrated as one scene, and the head frame number is stored in the upper-level scene area of the storage unit 105 (S109). The relative time of the head frame at the point where the specified time T was exceeded is set to 0 (S110). If the input has ended (S111), processing ends; otherwise the next image is read (S102).
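The flow of FIG. 7 can be sketched as a single pass over the input. The cut and scene change tests are passed in as callbacks because the patent leaves their implementation open; treating the very first frame as the head of the first shot and scenes, and measuring T in frames rather than seconds, are simplifying assumptions of this sketch.

```python
def process_stream(frames, is_cut, is_scene_change, interval_frames):
    """frames yields (frame_number, image); returns head frame numbers per level."""
    index = {"shots": [], "lower": [], "upper": []}      # S100
    reference = None                                      # S101
    previous = None
    for number, image in frames:                          # S102
        if reference is None:
            reference = number
            for heads in index.values():
                heads.append(number)
        elif is_cut(previous, image):                     # S103
            index["shots"].append(number)                 # S104
            if is_scene_change(previous, image):          # S105
                index["lower"].append(number)             # S106
                relative = number - reference             # S107
                if relative > interval_frames:            # S108
                    index["upper"].append(number)         # S109
                    reference = number                    # S110
        previous = image
    return index                                          # S111

# Toy input: each character stands for an image; a change of character is a cut.
result = process_stream(
    enumerate("aaabbbcccddd"),
    is_cut=lambda prev, cur: prev != cur,
    is_scene_change=lambda prev, cur: True,
    interval_frames=5,
)
print(result)
```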

[0029] (Second Embodiment) FIG. 9 shows the configuration of the video processing apparatus of the second embodiment. It differs from the first embodiment in that CM detection means 201 for detecting CMs is provided. The rest of the configuration is the same as in the first embodiment, so its description is omitted, and only the parts relating to the CM detection means 201 are described.

[0030] The CM detection means 201 detects CMs by calculating the time differences between cuts, exploiting the following properties: most CMs are 60 seconds or shorter and last a small multiple (1, 2, 4 times, etc.) of a predetermined time of 15 seconds; there is always a scene change at the junction between one CM and the next; several CMs are often broadcast back to back; and cuts are frequent within CMs.
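A hedged sketch of the cut-interval heuristic follows. It compares every pair of cut times and reports pairs whose spacing is close to a multiple of 15 seconds and no longer than 60 seconds. The pairwise search, the tolerance, and the function name are assumptions made for illustration; the flowchart of FIG. 11 instead tracks a single running relative time.

```python
def find_cm_candidates(cut_times, unit=15.0, max_length=60.0, tolerance=0.1):
    """Return (start, end) pairs of cut times that could delimit a CM.

    cut_times must be sorted in ascending order (seconds).
    """
    candidates = []
    for i, start in enumerate(cut_times):
        for end in cut_times[i + 1:]:
            length = end - start
            if length > max_length + tolerance:
                break  # later cuts are even farther away
            multiple = round(length / unit)
            if multiple >= 1 and abs(length - multiple * unit) <= tolerance:
                candidates.append((start, end))
    return candidates

cuts = [0.0, 37.0, 83.0, 100.0, 104.0, 115.0, 121.0, 130.0, 171.0]
print(find_cm_candidates(cuts))
```

In this made-up example the cuts at 100, 115, and 130 seconds form two back-to-back 15-second candidates, while the irregularly spaced program cuts produce none. Ordinary program cuts can still line up on a 15-second grid by chance, so a practical detector would likely combine this test with the other properties listed above.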

[0031] FIG. 10 shows continuous video divided into the CM scenes obtained by the CM detection means 201 and the other shot groups. The start and end times of CM scenes 1 and 2 are detected using the relative time calculation means 106; in FIG. 10, t1 and t2 are the CM start times and t1' and t2' are the end times. The integration means 104 integrates shot groups 1, 2, and 3, each running from a detected CM end time to the next CM start time, as scenes 30, 31, and 32.

[0032] Next, the processing of the present embodiment is described with reference to the flowchart of FIG. 11. First, an image is loaded into memory in the processing apparatus (S201). Whether there is a cut is determined by comparison with the previous frame (S202). At the same time, the relative time from the head frame is obtained from the position of the input frame in sequence (S203). If a cut is found, the head frame number of the shot is written to the shot area of the storage unit 105 (S204). If the relative time is an integer multiple of 15 seconds, the interval is judged to have been a CM, and the head frame number of the CM scene is written to the CM scene area of the storage unit 105 (S207). The video between one CM and the next is then treated as one scene, and its head frame number is written to the lower-level scene area of the storage unit 105 (S208). If the relative time is not a multiple of 15 seconds, it is determined whether the relative time exceeds a set time (S206); for example, CMs are assumed to last at most 60 seconds, and it is determined whether this time has elapsed. If it has, the relative time is reset to 0 (S209). If the input has ended (S210), processing ends; otherwise the next image is read (S201).

[0033] An example of the data structure stored in the storage unit 105 is shown in FIG. 12. It consists of an area that stores the head frame numbers of the shots detected by the cut detection means 102, an area that stores the head frame numbers of the lower-level scenes from the scene detection means 103, an area that stores the head frame numbers of the upper-level scenes from the integration means 104, and an area that stores the head frame numbers of the CMs from the CM detection means 201. Separating the areas in this way yields, at the upper-level scenes, video from which CMs have been removed, and allows CMs and other video to be shown separately on display means, which facilitates searching and editing.

[0034] In the present embodiment, all shot groups between detected CMs are integrated using the CMs as boundaries, but this may be combined with integration using scene changes as boundaries, as in the first embodiment.

[0035] (Third Embodiment) FIG. 13 shows the configuration of the video processing apparatus of the third embodiment. Components 101 to 103 are the same as in the first embodiment, so their description is omitted. In FIG. 13, 301 denotes an audio input terminal to which audio is input in synchronization with the video input from the image input terminal 101; 302, pattern extraction means that extracts a reference pattern characterizing the audio; and 303, audio similarity measurement means that compares the input audio with the reference pattern. The integration means 104 integrates scenes using the audio similarity obtained by the audio similarity measurement means 303 and the scenes gathered by the scene detection means 103. An example of the data structure of the storage unit 105 is not illustrated, but it is the same as FIG. 8 except that an area for storing the audio reference pattern is additionally provided.

[0036] Next, the audio processing of the present embodiment is described with reference to FIG. 14. Video and audio are input in synchronization. When a cut in the video is first detected, the pattern extraction means 302 extracts the audio data within that shot as a reference pattern and stores it in the storage unit 105. The audio similarity measurement means 303 compares subsequently input audio with the stored reference pattern to obtain the audio similarity. If the similarity before and after a cut is low, a scene change is declared, and the integration means 104 integrates scenes according to that result. The similarity of the audio data can be measured by converting the audio data (pattern) of a fixed period into the frequency domain and performing pattern matching along the frequency axis. When a scene change is declared, a new reference pattern is stored in the storage unit 105 and used for subsequent comparisons; a new pattern is stored each time a scene change is declared.
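The frequency-domain comparison can be sketched with a plain discrete Fourier transform and a cosine similarity between magnitude spectra. The patent does not name a transform, a matching measure, or a threshold, so all three are assumptions here, and the audio excerpts are assumed to have equal length.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Magnitudes of the DFT bins from 0 up to the Nyquist bin."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_similarity(a, b):
    """Cosine similarity of two magnitude spectra (1.0 means identical shape)."""
    sa, sb = magnitude_spectrum(a), magnitude_spectrum(b)
    dot = sum(x * y for x, y in zip(sa, sb))
    norm = math.sqrt(sum(x * x for x in sa)) * math.sqrt(sum(y * y for y in sb))
    return dot / norm if norm else 0.0

def audio_scene_changes(shot_audio, threshold=0.5):
    """Indices of shots whose audio differs from the current reference pattern."""
    changes = []
    reference = None
    for i, samples in enumerate(shot_audio):
        if reference is None:
            reference = samples
        elif spectral_similarity(reference, samples) < threshold:
            changes.append(i)
            reference = samples  # store a new reference pattern after each change
    return changes

def tone(cycles, n=64):
    return [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

print(audio_scene_changes([tone(4), tone(4), tone(20), tone(20)]))  # [2]
```

Comparing magnitudes only makes the match insensitive to where in the waveform the excerpt starts, which suits shots that begin at arbitrary points in a continuous soundtrack.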

[0037] The processing method of the present embodiment will be described with reference to the flowchart of FIG. 15. First, the image and the audio are loaded into a memory in the processing apparatus (S301). Whether a cut has occurred is determined by comparison with the previous frame (S302). If a cut is determined, the head frame number of the shot is written at the shot area address of the storage unit 105 (S303), and a reference pattern is extracted from the audio data corresponding to the shot and stored in the storage unit 105 (S304). The reference pattern stored in the storage unit 105 is compared with the patterns that are subsequently input and detected, the similarity is obtained, and whether they are similar is determined (S305). If a scene change is determined, the frame number is stored at the scene area address (S306). If the input has not ended (S307), the next image and audio are read (S301) and the same processing is repeated.
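
The loop of FIG. 15 (S301 to S307) might be sketched as follows. This is a simplified illustration, not the patent's implementation: it assumes that cut detection has already produced an `is_cut` flag per frame, that one audio pattern is available per frame, that the first detected cut starts the first scene, and that the comparison is made only at cuts. The function name, the tuple layout, and the threshold are hypothetical.

```python
def detect_audio_scene_changes(frames, similarity, threshold=0.5):
    """frames: iterable of (frame_number, is_cut, audio_pattern) tuples.

    Returns (shot_starts, scene_starts), standing in for the shot area and
    the scene area of the storage unit.
    """
    shot_starts = []    # head frame number of each shot (S303)
    scene_starts = []   # frame number of each scene change (S306)
    reference = None    # stored reference pattern (S304)

    for frame_number, is_cut, pattern in frames:           # S301: read input
        if not is_cut:                                      # S302: cut?
            continue
        shot_starts.append(frame_number)                    # S303
        if reference is None:
            reference = pattern                             # S304: first reference
            scene_starts.append(frame_number)               # assumption: first scene
        elif similarity(reference, pattern) < threshold:    # S305: compare
            scene_starts.append(frame_number)               # S306
            reference = pattern                             # store new reference
    return shot_starts, scene_starts                        # S307: input ended
```

The `similarity` argument can be any function that returns a larger value for more similar patterns, for example a spectral comparison.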

[0038] In this way, scene changes can be detected from the features of the audio and the scenes can be integrated accordingly. Scene integration is therefore more appropriate than when it is based on image similarity alone, which makes searching and editing more convenient.

[0039] (Fourth Embodiment) FIG. 16 shows the configuration of a video processing apparatus according to a fourth embodiment. The components denoted by reference numerals 101 to 103 are the same as in the first embodiment, and their description is omitted. In FIG. 16, reference numeral 401 denotes per-layer shot number determination means which, through instruction means (not shown), specifies the number of shots in each layer of the video stored in the storage unit 105; 402 denotes hierarchical structure construction means for organizing the contents of the storage unit 105 into a hierarchy with the numbers of shots specified by the per-layer shot number determination means 401; 403 denotes feature amount detection means for detecting feature amounts of the input audio and images; and 404 denotes display/playback means for displaying head frames according to the layers and numbers of shots created by the hierarchical structure construction means 402.

[0040] In the present embodiment, in order to grasp an entire long video quickly, the head frames of the upper scenes in the storage unit 105 are first displayed on the display/playback means 404 to represent the whole video. When the user specifies a range to examine in more detail, that is, one of those head frames, the video of the lower scene layer corresponding to the specified frame is displayed on the display/playback means 404 with the specified number of shots. For example, when four images can be displayed on the display/playback means 404, the whole video is divided as upper scenes into four shot groups based on the feature amounts detected by the feature amount detection means 403, and this is repeated until the number of shots is four or fewer. Similarly, the video in each lower scene area is divided until the number of shots is four or fewer. Although the number of shots is four in the present embodiment, it is not limited to this number, and the number of shots may vary between layers, for example by increasing it toward the lower layers.

[0041] Next, the processing of the present embodiment, from capturing the shots and their feature amounts to reconstructing the hierarchical structure, will be described with reference to the flowchart of FIG. 17. First, the image and the audio are loaded into a memory in the processing apparatus. The frame number of the image is also read at this time (S401). The feature amounts of the image and the audio are detected (S402). Whether a cut has occurred is determined by comparison with the previous frame. If a cut is determined, the head frame number of the shot is written at the shot area address of the storage unit 105 (S403), and the feature amount is stored for each shot in the feature amount storage area of the storage unit 105 (S404). These steps are repeated until the input ends (S405). This completes the storage of the shots and their corresponding feature amounts.

[0042] After the input ends, the actual display operation for obtaining the desired video begins. The number of shots in each layer is set by the per-layer shot number determination means 401 according to the number of frames that can be displayed on the screen (S406). Here, the number of shots is set to four for every layer. At the first display, the shots in the storage unit 105 that have similar feature amounts are grouped so that the video is divided into the set number of groups, which are displayed as upper scenes. Next, lower scene data areas for the set number are created in the storage unit 105 (S407). The feature amount data created in S404 is retrieved from the storage unit 105 (S408). All shots in the specified range are examined (S409), and the video is divided into the set number of groups based on the retrieved feature amounts (S410). The head frame numbers are stored at the lower scene area addresses of the storage unit 105 (S411). If the number of shots in every shot group after the division is smaller than the set number, the processing ends (S412); otherwise, S407 and the subsequent steps are repeated within that shot group. Division continues in the same way until the number of shots in each layer is four. Dividing and reorganizing the storage unit 105 in this manner makes display, searching, and editing more convenient.
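
The recursive division of paragraph [0042] can be sketched as below. The patent states only that the division is "based on the feature amount", so the splitting rule here (cut the ordered shot list at the largest differences between adjacent scalar feature values) is an assumption made for illustration, as are the names `Scene`, `split_by_feature`, and `build_hierarchy`.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    head_frame: int                      # head frame shown for this node
    shots: list                          # (head_frame, feature) pairs it covers
    children: list = field(default_factory=list)


def split_by_feature(shots, max_groups):
    """Cut an ordered shot list at the largest adjacent feature differences."""
    cut_points = sorted(
        range(1, len(shots)),
        key=lambda i: abs(shots[i][1] - shots[i - 1][1]),
        reverse=True,
    )[: max_groups - 1]
    bounds = [0] + sorted(cut_points) + [len(shots)]
    return [shots[a:b] for a, b in zip(bounds, bounds[1:])]


def build_hierarchy(shots, max_groups=4):
    """Divide a non-empty shot list until every group holds at most max_groups shots."""
    if max_groups < 2:
        raise ValueError("max_groups must be at least 2")
    node = Scene(head_frame=shots[0][0], shots=shots)
    if len(shots) > max_groups:          # corresponds to repeating S407 onward
        node.children = [
            build_hierarchy(group, max_groups)
            for group in split_by_feature(shots, max_groups)
        ]
    return node
```

With `max_groups=4`, the root's children correspond to the four upper scenes shown first, and selecting one of them would display that child's own children.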

[0043] An example of the data structure stored in the storage unit 105 is shown in FIG. 18. As in FIG. 8, it consists of an area for storing the head frame numbers of the shots, an area for storing the head frame numbers of the lower scenes, and an area for storing the head frame numbers of the upper scenes. In each layer, the set number of shot groups (four in FIG. 18) are integrated and stored. In addition, a feature amount storage area is provided for storing the feature amount of each shot.
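
As a rough illustration of the storage layout described for FIG. 18, the sketch below models the three head-frame areas and the per-shot feature amount area. The figure itself is not reproduced here, so the class name, field names, and the dictionary used for the feature area are assumptions, not the patent's actual memory layout.

```python
from dataclasses import dataclass, field


@dataclass
class StorageUnit:
    shot_heads: list = field(default_factory=list)          # head frame of each shot
    lower_scene_heads: list = field(default_factory=list)   # head frame of each lower scene
    upper_scene_heads: list = field(default_factory=list)   # head frame of each upper scene
    features: dict = field(default_factory=dict)            # shot head frame -> feature values

    def add_shot(self, head_frame, **feature_values):
        """Record a shot and the feature amounts detected for it."""
        self.shot_heads.append(head_frame)
        self.features[head_frame] = dict(feature_values)

    def shots_in_scene(self, scene_heads, index):
        """Head frames of the shots that fall inside scene `index` of a scene area."""
        start = scene_heads[index]
        end = scene_heads[index + 1] if index + 1 < len(scene_heads) else float("inf")
        return [head for head in self.shot_heads if start <= head < end]
```

A scene is represented only by its head frame number, so the shots it contains are recovered by comparing frame numbers against the next scene's head frame.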

[0044] (Fifth Embodiment) FIG. 19 shows the configuration of a video processing apparatus according to a fifth embodiment. The present embodiment extracts shots having a specific feature amount from a specified layer of the video. The components denoted by reference numerals 101 and 102 are the same as in the first embodiment, and those denoted by 301 and 403 are the same as in the fourth embodiment, so their description is omitted. Reference numeral 501 denotes feature amount setting means for setting the target feature amount, and 502 denotes similarity detection means for detecting a similarity by comparing the feature amount set by the feature amount setting means 501 with the feature amounts retrieved from the storage unit 105.

[0045] Next, the processing of the present embodiment, from capturing the shots and their feature amounts to extracting shots that have a desired feature amount, will be described with reference to the flowchart of FIG. 20. First, the image and the audio are loaded into a memory in the processing apparatus. The frame number of the image is also read at this time (S501). The feature amounts of the image and the audio are detected (S502). Whether a cut has occurred is determined by comparison with the previous frame. If a cut is determined, the head frame number of the shot is written at the shot area address of the storage unit 105 (S503), and the feature amount is stored in the feature amount storage area reserved for each shot (S504). These steps are repeated until the input ends (S505). This completes the storage of the shots and their corresponding feature amounts, and the apparatus is ready for searching.

[0046] After the input ends, the conditions on the feature to be searched are set in order to obtain the desired video (shot) (S506). The feature amount data is then retrieved from the storage unit 105 (S507). If the retrieved feature amount data satisfies the search condition (S508), the head frame number is stored at the lower or upper scene area address for display (S509). All shots are examined in this way (S510). If a condition is to be set for another feature as well (S511), the processing returns to S506 and is repeated.
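
The search loop of S506 to S511 might look like the following sketch. The representation of the feature amount storage area as a dictionary keyed by shot head frame, the feature names, and the use of a predicate function as the "search condition" are all assumptions for illustration.

```python
def search_shots(features_by_shot, feature_name, condition):
    """Return head frame numbers of shots whose feature satisfies `condition`.

    features_by_shot: {head_frame: {feature_name: value, ...}} (assumed layout).
    condition: predicate on a single feature value, set by the user (S506).
    """
    hits = []
    for head_frame in sorted(features_by_shot):                  # S510: every shot
        value = features_by_shot[head_frame].get(feature_name)   # S507: retrieve
        if value is not None and condition(value):               # S508: check
            hits.append(head_frame)                              # S509: keep for display
    return hits


# Hypothetical per-shot feature amounts used to demonstrate the search.
example_features = {
    0: {"volume": 0.2, "zoom": 0.0},
    120: {"volume": 0.9, "zoom": 0.3},
    300: {"volume": 0.8, "zoom": 0.0},
}
loud = search_shots(example_features, "volume", lambda v: v > 0.5)
zooming = search_shots(example_features, "zoom", lambda a: abs(a) > 0.1)
```

Setting a condition on another feature (S511) corresponds to calling the function again with a different feature name; the two result lists can then be intersected or merged as needed.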

[0047] An example of the data structure stored in the storage unit 105 is shown in FIG. 21. As in FIG. 8, it consists of an area for storing the head frame numbers of the shots, an area for storing the head frame numbers of the lower scenes, and an area for storing the head frame numbers of the upper scenes. In addition, a feature amount storage area is provided for storing the feature amount of each shot. As shown in FIG. 21, a separate feature amount storage area may be provided for each feature.

[0048] FIG. 22 shows a first specific example of the feature amount detection means 403. In FIG. 22, the optical flow of the image input from the image input terminal 101 is measured across a plurality of frames in the flow detection processing 601. In the camera direction measurement processing 602, the movement of the camera is determined from the flow obtained in the flow detection processing 601.

[0049] Optical flow is a velocity vector field that represents the movement of the luminance distribution caused by the movement of objects in the screen. There are broadly two kinds of methods for detecting optical flow. One is the matching method, which finds feature points in the image, searches for the corresponding points between image frames, and determines the velocity vectors. The other is the gradient method, which exploits the fact that the temporal change in luminance of a moving object in a moving image obeys a fixed relationship.

[0050] FIG. 23 shows the relationship between the distribution of the optical flow measured as described above and the movement of the camera. When the camera pans or tilts, vectors in a single direction, horizontal or vertical, are obtained (FIGS. 23(a) and 23(b)). When the camera zooms in or out (wide), vectors radiating from the center of the image are obtained (FIG. 23(c)). When pan and tilt are combined, parallel vectors are obtained (FIG. 23(d)). When pan, tilt, and zoom or wide are combined, the vectors radiate from a point that is not the center of the image (FIG. 23(e)).

[0051] The relationship between the camera motion and the vectors on the image is expressed by the following equation.

[0052]

(Equation 1)

[0053] Here, (u, v) is the motion vector, x and y are the screen coordinates, a is the zoom component, px is the pan component, and py is the coefficient of the tilt component. The above equation relates to FIG. 23 as follows. FIG. 23(a) satisfies a = 0, px ≠ 0, py = 0. FIG. 23(b) satisfies a = 0, px = 0, py ≠ 0. FIG. 23(c) satisfies a ≠ 0, px = 0, py = 0. FIG. 23(d) satisfies a = 0, px ≠ 0, py ≠ 0. FIG. 23(e) satisfies a ≠ 0, px ≠ 0, py ≠ 0.
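
The image of Equation 1 is not reproduced in this text. One form that is consistent with the definitions in paragraph [0053] and with the five cases of FIG. 23 is the linear model below; it is a reconstruction offered as an assumption, not a copy of the original equation, and the signs and scaling in the published figure may differ.

```latex
\begin{aligned}
u &= a\,x + p_x \\
v &= a\,y + p_y
\end{aligned}
```

Under this form, a = 0 gives the same vector (p_x, p_y) at every point, which matches the parallel flows of FIGS. 23(a), (b), and (d). With a ≠ 0 the vectors vanish at (x, y) = (-p_x / a, -p_y / a) and radiate from that point, which is the image center when p_x = p_y = 0 (FIG. 23(c)) and an off-center point otherwise (FIG. 23(e)).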

[0054] To determine the movement of the camera, it is first determined whether the vectors are parallel. If they are parallel, a = 0 is established, and px and py are obtained directly from the vectors. If they are not parallel, the center of the camera (EOF) is found, and the zoom component of the camera is obtained from the relationship between the distance from that center and the magnitude of the vectors. This gives a. Then px and py are obtained by substituting a into the above equation. Once the three components a, px, and py are determined, the movement of the camera is determined, and these components are used as feature amounts of the video. Shots can then be displayed or searched by camera movement.
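
A hedged sketch of recovering the three components from measured flow vectors follows. It assumes the linear model u = a·x + px, v = a·y + py (a reconstruction, since the equation image is not available) with coordinates measured from the image center. Instead of the parallel test and center search described in paragraph [0054], it uses an ordinary least-squares fit, which is a swapped-in technique and not the patent's procedure. The function names and the tolerance `eps` are hypothetical.

```python
def estimate_camera_motion(flow):
    """Least-squares fit of (a, px, py) to flow samples.

    flow: list of ((x, y), (u, v)), with (x, y) measured from the image center.
    """
    n = len(flow)
    mx = sum(x for (x, _), _ in flow) / n
    my = sum(y for (_, y), _ in flow) / n
    mu = sum(u for _, (u, _) in flow) / n
    mv = sum(v for _, (_, v) in flow) / n
    num = sum((x - mx) * (u - mu) + (y - my) * (v - mv) for (x, y), (u, v) in flow)
    den = sum((x - mx) ** 2 + (y - my) ** 2 for (x, y), _ in flow)
    a = num / den if den > 0 else 0.0
    return a, mu - a * mx, mv - a * my


def classify_motion(a, px, py, eps=1e-6):
    """Name the non-zero components, following the cases of FIG. 23."""
    active = []
    if abs(px) > eps:
        active.append("pan")
    if abs(py) > eps:
        active.append("tilt")
    if abs(a) > eps:
        active.append("zoom")
    return active


def synthetic_flow(a, px, py):
    """Flow on a 5x5 grid generated from the assumed linear model."""
    return [((x, y), (a * x + px, a * y + py))
            for x in range(-2, 3) for y in range(-2, 3)]
```

For noisy real flow, `eps` would need to be chosen relative to the noise level; the default is only suitable for the synthetic data above.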

[0055] FIG. 24 shows a second specific example of the feature amount detection means 403. In FIG. 24, the volume detection processing unit 701 measures the loudness of the audio input from the audio input terminal 301 and stores it in the volume storage means 702, and changes relative to subsequently input audio are detected in sequence. With this method, a large change in volume, or a large volume itself, is used as the feature amount. This makes it possible to display or search for shots with a large change in volume or shots with a large volume.
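
The volume feature of paragraph [0055] can be sketched as below. The patent does not say how loudness is measured, so the root-mean-square measure and the per-shot granularity are assumptions; the function names are hypothetical.

```python
import math


def rms_volume(samples):
    """Root-mean-square level of a list of audio samples (assumed loudness measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def volume_features(shot_audio):
    """For each shot, return (volume, change from the previously stored volume)."""
    features = []
    previous = None                      # plays the role of the volume storage means
    for samples in shot_audio:
        volume = rms_volume(samples)
        change = 0.0 if previous is None else abs(volume - previous)
        features.append((volume, change))
        previous = volume
    return features
```

Shots can then be selected by thresholding either element of the pair, depending on whether loud shots or shots with a sudden change in volume are wanted.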

[0056] FIG. 25 shows a third specific example of the feature amount detection means 403. In FIG. 25, typical image examples, such as a specific shape under a specific environment, are held as an object model (image model) 801. The similarity measuring means 802 measures the similarity between that model and the image input from the image input terminal 101, and the resulting similarity is used as the feature amount. Various object models can be used, for example a human face, a car, or a landscape. This makes it possible to display or search for shots that are close to the object model 801.
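
The patent leaves the similarity measure between an input image and the object model unspecified. As one possible illustration only, the sketch below compares normalized gray-level histograms by histogram intersection; the choice of histograms, the bin count, and the function names are assumptions and not part of the source.

```python
def gray_histogram(pixels, bins=8, max_value=255):
    """Normalized histogram of integer gray levels in the range 0..max_value."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // (max_value + 1), bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist] if total else hist


def histogram_intersection(hist_a, hist_b):
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


# Hypothetical model built from a typical example image (flattened gray levels).
model_hist = gray_histogram([10, 20, 200, 210])
```

The resulting value would be stored per shot as its feature amount, and a search condition such as "similarity above 0.8" would then select shots close to the model. The same pattern would apply to an audio model if the histogram were replaced by an audio descriptor.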

[0057] FIG. 26 shows a fourth specific example of the feature amount detection means 403. In FIG. 26, typical examples of sound, such as noise under a specific environment, are held as an environment model (audio model) 901. The similarity measuring means 902 measures the similarity between that model and the audio input from the audio input terminal 301, and the resulting similarity is used as the feature amount. Various environment models can be used, for example noise inside a car or the sound of a train. This makes it possible to display or search for shots that are close to the environment model 901.

[0058] Various embodiments have been described above, and these embodiments may be combined as appropriate. Furthermore, a program for executing the processing described above may be recorded in advance on a computer-readable recording medium such as a floppy disk or a CD-ROM, or supplied to a computer-readable recording medium via a communication line, and then installed on a computer for use as appropriate.

[0059]

[Effects of the Invention] According to the invention of claim 1, a long video can be organized into short, meaningful scenes, so the video content is easier to understand.

[0060] According to the invention of claim 2, commercials (CMs) and the video content can be distinguished and organized clearly, so the video content is easier to understand.

[0061] According to the invention of claim 3, scene integration is semantically more appropriate than when scenes are integrated using the video alone, so the video content is easier to understand.

[0062] According to the invention of claim 4, in addition to the above effects, dividing and reorganizing the shots in the storage means allows them to be arranged in a form suitable for display and allows only the important parts of the video to be extracted, which makes searching and editing more convenient.

[0063] According to the invention of claim 5, in addition to the above effects, a desired shot close to the feature set as a search condition can be searched for and retrieved from the storage means.

[0064] According to the invention of claim 6, it is possible to display or search for shots close to given camera motion information in the video.

[0065] According to the invention of claim 7, it is possible to display or search for shots with a large change in volume or a large volume.

[0066] According to the invention of claim 8, it is possible to display or search for shots close to a specific image model.

[0067] According to the invention of claim 9, it is possible to display or search for shots close to a specific audio model.

[Brief description of the drawings]

FIG. 1 is a diagram showing the configuration of the video processing apparatus of the first embodiment.

FIG. 2 is a conceptual diagram showing the structure of video in the present invention.

FIG. 3 is an explanatory diagram of the case where the pixel change area is used for cut detection in the first embodiment.

FIG. 4 is a diagram showing the pixel change area over time in the first embodiment.

FIG. 5 is an explanatory diagram of the case where the color distribution is used for scene detection in the first embodiment.

FIG. 6 is an explanatory diagram of scene integration in the first embodiment.

FIG. 7 is a flowchart of the first embodiment.

FIG. 8 is a configuration example of the data structure of the storage unit of the first embodiment.

FIG. 9 is a diagram showing the configuration of the video processing apparatus of the second embodiment.

FIG. 10 is an explanatory diagram of scene integration in the second embodiment.

FIG. 11 is a flowchart of the second embodiment.

FIG. 12 is a configuration example of the data structure of the storage unit of the second embodiment.

FIG. 13 is a diagram showing the configuration of the video processing apparatus of the third embodiment.

FIG. 14 is an explanatory diagram of detecting a scene change from audio data in the third embodiment.

FIG. 15 is a flowchart of the third embodiment.

FIG. 16 is a diagram showing the configuration of the video processing apparatus of the fourth embodiment.

FIG. 17 is a flowchart of the fourth embodiment.

FIG. 18 is a configuration example of the data structure of the storage unit of the fourth embodiment.

FIG. 19 is a diagram showing the configuration of the video processing apparatus of the fifth embodiment.

FIG. 20 is a flowchart of the fifth embodiment.

FIG. 21 is a configuration example of the data structure of the storage unit of the fifth embodiment.

FIG. 22 is a specific example of a first configuration of the feature amount detection means.

FIG. 23 is an explanatory diagram relating to the first specific example of the feature amount detection means.

FIG. 24 is a specific example of a second configuration of the feature amount detection means.

FIG. 25 is a specific example of a third configuration of the feature amount detection means.

FIG. 26 is a specific example of a fourth configuration of the feature amount detection means.

[Explanation of symbols]

102 cut detection means
103 scene detection means
104 integrating means
105 storage unit
106 relative time calculation means

Continuation of the front page
(72) Inventor: Mitsuaki Nakamura, c/o Sharp Corporation, 22-22 Nagaike-cho, Abeno-ku, Osaka-shi, Osaka
F-terms (reference): 5C023 AA14 AA34 AA37 AA38 BA02 CA01 CA04 DA04 EA13; 5C052 AA01 AA17 AC08 CC20 DD06; 5C053 FA05 FA14 FA21 FA23 GB09 GB19 HA29 HA30 HA40 JA12 JA24 JA30 KA05

Claims

[Claims]

1. A video processing apparatus for integrating frames of sequentially input video, comprising: cut detection means for detecting cuts between shots based on a change in a feature amount between frames; scene detection means for detecting scene changes based on the similarity between frames of the shots delimited by the detected cuts; relative time calculation means for obtaining a relative time from a reference frame for integration to an input frame; and integrating means for integrating scenes, each time the relative time reaches a designated fixed time, using the detected scene changes as boundaries.

2. A video processing apparatus for integrating frames of sequentially input video, comprising: cut detection means for detecting cuts between shots based on a change in a feature amount between frames; relative time calculation means for obtaining a relative time of an input frame from a reference frame for integration; CM detection means for detecting a commercial (CM) when the relative time of a detected cut from the reference frame is an integer multiple of a designated fixed time; and integrating means for integrating shots using the detected CM as a boundary.

3. A video processing apparatus for integrating frames of sequentially input video based on audio input in synchronization with the video, comprising: cut detection means for detecting cuts between shots based on a change in a feature amount between frames; and integrating means for integrating scenes using, as boundaries, scene changes detected based on the similarity of the input audio between the shots delimited by the detected cuts.

4. The video processing apparatus according to any one of claims 1 to 3, further comprising: feature detection means for extracting features of the sequentially input video; storage means for storing the extracted features for each shot delimited by the cuts detected by the cut detection means; and dividing means for dividing the shots in the storage means, based on the extracted features, until the number of shots is at most a designated number.

5. The video processing apparatus according to any one of claims 1 to 3, further comprising: feature detection means for extracting features of the sequentially input video; storage means for storing the extracted features for each shot delimited by the cuts detected by the cut detection means; and search means for searching the storage means for shots based on a feature set as a search condition.

6. The video processing apparatus according to claim 4 or 5, wherein the feature extracted by the feature detection means is motion information of the camera in the video, obtained from motion vectors between frames.

7. The video processing apparatus according to claim 4 or 5, wherein the feature extracted by the feature detection means is volume information.

8. The video processing apparatus according to claim 4 or 5, wherein the feature extracted by the feature detection means is a similarity obtained by comparison with an image model for a specific environment.

9. The video processing apparatus according to claim 4 or 5, wherein the feature extracted by the feature detection means is a similarity obtained by comparison with an audio model for a specific environment.