JP2022169009A

JP2022169009A - Program, information processing method, and information processing device

Info

Publication number: JP2022169009A
Application number: JP2021074755A
Authority: JP
Inventors: 俊彦山崎; Toshihiko Yamazaki; 賢亮張; Xianliang Zhang
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2021-04-27
Filing date: 2021-04-27
Publication date: 2022-11-09
Also published as: WO2022230777A1

Abstract

To provide an information processing apparatus, or the like, configured to generate an appropriate summary video with unsupervised learning. SOLUTION: An information processing device executes: converting a video that includes a plurality of frames into a plurality of shots, the number of which is smaller than the plurality of frames; generating a second shot group by applying, to a plurality of shots included in a first shot group, a first process that is related to preservation of the relevance to the video, and generating a third shot group by applying, to the plurality of shots, a second process that reduces the relevance to the video as compared to the first process; calculating a score of each shot by a trained model that is generated by self-supervised contrastive learning; selecting, on the basis of scores of the shots in the first shot group that are calculated with the trained model optimized with use of a loss function, whether or not to include each of the shots in a summary video; and generating a summary video on the basis of the selected shots. SELECTED DRAWING: Figure 1

Description

The present invention relates to a program, an information processing method, and an information processing device for video summarization.

In recent years, deep learning has been applied to generating shorter summary videos from videos. For example, Non-Patent Document 1 below describes research on generating, by supervised learning, a neural network that produces summary videos. In that approach, the training videos used for supervised learning carry a label for every frame indicating whether the frame is to be included in the summary video. Non-Patent Document 2 below describes research on generating summary videos without supervision by using deep reinforcement learning.

Non-Patent Document 1: Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman, "Video summarization with long short-term memory," ECCV, 2016.
Non-Patent Document 2: Zhou Kaiyang, Qiao Yu, and Xiang Tao, "Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward," AAAI, 2018.

However, when a learning model that summarizes videos is generated by supervised learning as in Non-Patent Document 1, every frame of each video must be labeled, so the annotation cost becomes enormous.

Non-Patent Document 2 requires no labeling. However, it computes the reinforcement-learning reward for the summary video as a whole and distributes that reward among the individual actions that select whether to include each frame in the summary video. As a result, the rewards differ little from one action to another, and it can be difficult to generate an appropriate summary video.

Accordingly, one object of the present invention is to provide a program, an information processing method, and an information processing device that apply contrastive learning to video summarization and can generate an appropriate summary video even with unsupervised learning.

A program according to one aspect of the present invention causes an information processing device to execute: converting a video including a plurality of frames into a plurality of shots fewer in number than the plurality of frames; for a first shot group including the plurality of shots, generating a second shot group by applying to the plurality of shots a first process related to maintaining relevance to the video, and generating a third shot group by applying to the plurality of shots a second process that reduces relevance to the video more than the first process does; with the first shot group as an anchor, the second shot group as a positive example, and the third shot group as a negative example, calculating, for each of the first shot group, the second shot group, and the third shot group, a score of each shot by a learning model generated by self-supervised contrastive learning regarding whether to include each shot in the video summary; selecting whether to include each shot in a summary video based on the score of each shot in the first shot group calculated by the learning model optimized using a loss function, the loss function including a first function and another, second function, the first function using a first similarity based on the score of each shot in the first shot group and the score of each shot in the second shot group and a second similarity based on the score of each shot in the first shot group and the score of each shot in the third shot group; and generating a summary video based on the selected shots.

According to the present invention, it is possible to provide a program, an information processing method, and an information processing device that apply contrastive learning to video summarization and can generate an appropriate summary video even with unsupervised learning.

FIG. 1 is a block diagram showing an example of the processing configuration of an information processing device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the physical configuration of the information processing device according to the embodiment.
FIG. 3 is a diagram showing an overview of the processing executed by the information processing device according to the embodiment.
FIG. 4 is a diagram showing the F-scores of summary videos generated by the information processing device according to the embodiment and the F-scores of the summary videos of Comparative Examples 1 and 2.
FIG. 5 is a diagram showing the datasets used to evaluate the embodiment.
FIG. 6 is a diagram showing the F-scores of summary videos generated by the information processing device according to the embodiment and the F-scores of the summary videos of Comparative Example 3.
FIG. 7 is a diagram showing the τ and ρ values of summary videos generated by the information processing device according to the embodiment and the τ and ρ values of summary videos generated by comparative examples.
FIG. 8 is a diagram showing the SUM-GAN model for unsupervised learning.
FIG. 9 is a diagram comparing an existing learning model with and without the method of the embodiment applied.
FIG. 10 is a diagram showing the convergence speed of each method according to the embodiment.
FIG. 11 is a diagram showing the frames selected by the method of the embodiment and by a comparison method.
FIG. 12 is a flowchart showing an example of the video summarization processing executed by the information processing device according to the embodiment.
FIG. 13 is a flowchart showing an example of the learning processing executed by the information processing device according to the embodiment.

Embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, elements denoted by the same reference numerals have the same or similar configurations.

<Configuration>
FIG. 1 is a block diagram showing an example of the processing configuration of an information processing device 10 according to an embodiment of the present invention. The information processing device 10 includes an acquisition unit 11, a conversion unit 12, a first generation unit 13, a calculation unit 14, a selection unit 15, and a second generation unit 16.

The acquisition unit 11 acquires a video from a video database DB. The video database DB stores arbitrary videos and includes, for example, publicly available video datasets. The video database DB may include, for example, the SumMe dataset (Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc Van Gool, "Creating Summaries from User Videos," ECCV 2014), the TVSum dataset (Song, Yale, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes, "TVSum: Summarizing web videos using titles," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5179-5187, 2015), OVP (Open Video Project) (https://open-video.org/), a YouTube (registered trademark) dataset, or predetermined videos shot by users or others.

The conversion unit 12 converts the plurality of frames included in a video into N shots, where N is an arbitrary natural number smaller than the number of frames. For example, the conversion unit 12 may convert the frames into image feature amounts and extract the N shots from those image feature amounts based on their similarity. Here, the image feature amount of a frame may be a feature map of a CNN (Convolutional Neural Network).

The conversion unit 12 may also convert the frames included in the video into the N shots using, for example, the technique described in D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, "Category-specific video summarization," ECCV 2014. By extracting representative shots in this way, the conversion unit 12 makes it possible to generate an appropriate summary video.
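The similarity-based conversion of frames into shots can be illustrated with a minimal sketch. It is not the technique of Potapov et al. cited above; it simply opens a new shot whenever the cosine similarity between consecutive frame features drops below a threshold. The function names, the `threshold` parameter, and the toy two-dimensional features are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0

def split_into_shots(features, threshold=0.9):
    """Group consecutive frame indices into shots.

    A new shot starts whenever a frame's feature vector is less similar
    to the previous frame's than `threshold` (a hypothetical parameter).
    """
    if not features:
        return []
    shots = [[0]]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            shots.append([i])      # similarity dropped: open a new shot
        else:
            shots[-1].append(i)    # still the same shot
    return shots

# Five toy frame features: two near [1, 0], then three near [0, 1].
frame_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
shots = split_into_shots(frame_features)
```

With these toy features the frames fall into two shots, so the number of shots is smaller than the number of frames, as the conversion unit 12 requires.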

For a first shot group including the converted shots, the first generation unit 13 generates a second shot group by applying to the shots a first process related to maintaining relevance to the original video. The first generation unit 13 also generates a third shot group by applying to the shots of the first shot group a second process that reduces relevance to the original video more than the first process does.

For example, the first generation unit 13 may generate the second shot group by reversing the order of the shots in the first shot group, and may generate the third shot group by randomizing the order of the shots in the first shot group.

In this embodiment, self-supervised contrastive learning, which is classified as unsupervised learning, is used. The first shot group is therefore set as the anchor, the second shot group as the positive example (positive sample), and the third shot group as the negative example (negative sample). With the samples set in this way, each sample has the same number of shots as the anchor, which allows efficient learning, for example when calculating the shot-by-shot similarities used in the loss function.

The calculation unit 14 uses a predetermined learning model 14a to calculate, for each of the N shots, a score indicating whether the shot is to be included in the summary video. The learning model 14a is generated by self-supervised contrastive learning (Contrastive Self-Supervised Learning) regarding whether each shot is to be included in the video summary. As described above, the calculation unit 14 sets the first shot group as the anchor, the second shot group as the positive example, and the third shot group as the negative example, and trains the learning model 14a by updating its parameters so as to minimize the value of the loss function described later.

Conventionally, when summary videos are generated by supervised learning, training videos are used in which each frame or each shot is labeled as to whether it is to be included in the summary video. Such training videos are costly to annotate, which has made it difficult to increase the amount of data. In contrast, the learning model 14a of the information processing device 10 according to this embodiment requires no annotation, so the annotation cost is eliminated. Moreover, because the contrastive learning here generates the positive and negative examples from the anchor, there is no need to prepare, for example, a separate video as the negative example. This embodiment can therefore be applied simply by preparing the video to be summarized, which is a significant practical advantage.

Furthermore, because this embodiment requires no annotation, large-scale training with arbitrary external data is possible. Such external data includes, for example, videos posted to various SNSs (Social Networking Services), as typified by the YFCC100M dataset, and videos used in television broadcasting. It has been confirmed experimentally that training the learning model in advance on such large-scale training data improves accuracy (see FIG. 4).

The selection unit 15 selects whether to include each shot in the summary video based on the scores of the shots in the first shot group calculated by the learning model whose parameters have been optimized using a predetermined loss function. The predetermined loss function includes, for example, a first function and another, second function. The first function uses a first similarity based on the score of each shot in the first shot group and the score of each shot in the second shot group, and a second similarity based on the score of each shot in the first shot group and the score of each shot in the third shot group.

For example, the selection unit 15 may select whether to include each of the N shots in the summary video by solving a knapsack problem on the importance scores so that the summary video has a predetermined length. Any algorithm may be used to solve the knapsack problem; for example, a greedy method or dynamic programming may be used.
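The knapsack-based selection can be sketched as follows, under the assumption that each shot has an integer length in frames and the summary has a frame budget. This is a standard 0/1 knapsack solved by dynamic programming, one of the two options the text mentions; the function name `select_shots` and the example numbers are hypothetical.

```python
def select_shots(scores, lengths, budget):
    """0/1 knapsack by dynamic programming.

    scores  : importance score of each shot (assumed positive)
    lengths : length of each shot in frames (positive integers)
    budget  : maximum total length of the summary in frames
    Returns the sorted indices of the selected shots.
    """
    n = len(scores)
    # best[i][c] = highest total score using the first i shots within capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v
    # Trace back which shots were taken.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)

scores = [0.9, 0.2, 0.7, 0.4]
lengths = [15, 15, 30, 15]
selected = select_shots(scores, lengths, budget=45)
```

In this toy case the two highest-scoring shots (indices 0 and 2) exactly fill the 45-frame budget. A greedy method would be faster but does not guarantee the optimal total score.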

The second generation unit 16 generates a summary video based on the selected shots. Because the information processing device 10 according to this embodiment uses a learning model trained by self-supervised contrastive learning, no annotation cost is incurred, and, as the experimental results described later show, an appropriate summary video can be generated.

FIG. 2 is a diagram showing an example of the physical configuration of the information processing device 10 according to this embodiment. The information processing device 10 has a CPU (Central Processing Unit) 10a corresponding to an arithmetic unit, a RAM (Random Access Memory) 10b corresponding to a storage unit, a ROM (Read Only Memory) 10c corresponding to a storage unit, a communication unit 10d, an input unit 10e, and a display unit 10f. These components are connected via a bus so that they can exchange data with one another. This example describes the case where the information processing device 10 consists of a single computer, but it may be realized by combining multiple computers. The configuration shown in FIG. 2 is only an example; the information processing device 10 may have other components or may lack some of these. The CPU 10a may be a GPU (Graphics Processing Unit).

The CPU 10a is a control unit that controls execution of the programs stored in the RAM 10b or the ROM 10c and that computes and processes data. The CPU 10a is an arithmetic unit that executes a program (summary generation program) that extracts some of the frames constituting a video and generates a summary video. The CPU 10a receives various data from the input unit 10e and the communication unit 10d, and displays the results of its computations on the display unit 10f or stores them in the RAM 10b.

The RAM 10b is the rewritable part of the storage unit and may be composed of, for example, a semiconductor memory element. The RAM 10b may store data such as the programs executed by the CPU 10a and the videos to be summarized. These are examples; the RAM 10b may store other data, or may not store some of these.

The ROM 10c is the readable part of the storage unit and may be composed of, for example, a semiconductor memory element. The ROM 10c may store, for example, the summary generation program and data that is not rewritten.

The communication unit 10d is an interface that connects the information processing device 10 to other devices. The communication unit 10d may be connected to a communication network such as the Internet.

The input unit 10e receives data input from the user and may include, for example, a keyboard and a touch panel.

The display unit 10f visually displays the results of computations by the CPU 10a and may be composed of, for example, an LCD (Liquid Crystal Display). The display unit 10f may display the video to be summarized and the summary video.

The summary generation program may be provided stored on a computer-readable non-transitory storage medium such as the RAM 10b or the ROM 10c, or may be provided via a communication network connected through the communication unit 10d. In the information processing device 10, the CPU 10a executes the summary generation program to realize the various operations described with reference to FIG. 1. These physical components are examples and need not be independent of one another. For example, the information processing device 10 may include an LSI (Large-Scale Integration) in which the CPU 10a, the RAM 10b, and the ROM 10c are integrated. The information processing device 10 may also include a GPU, in which case the various operations described with reference to FIG. 1 may be realized by the GPU and the CPU 10a executing the summary generation program.

<Processing example>
FIG. 3 is a diagram showing an overview of the processing executed by the information processing device 10 according to this embodiment. The processing according to this embodiment is divided into three main parts: (1) pre-processing, (2) the summarization network, and (3) post-processing.

(1) Pre-processing
The conversion unit 12 of the information processing device 10 converts the frames included in a video V0 into image feature amounts and, based on those image feature amounts, converts the frames into N shots.

For example, the conversion unit 12 may convert a video into shots using a known technique. As one example, it uses GoogLeNet (Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, "Going deeper with convolutions," CVPR, pp. 1-9, 2015) and converts the video into N shots based on the keyframes $v$ of the downsampled shots and their features $x$:
$v = \{v_i\},\ i \in [1, 2, \ldots, N]$
$x = \{x_i\},\ i \in [1, 2, \ldots, N]$
$x_i = F(v_i)$, where $F(\cdot)$ is a function that computes the feature amount.
Here, $N$ is the number of downsampled frames, that is, the number of shots. A shot may contain any number of frames equal to or greater than one, preferably about 15.
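The notation above can be made concrete with a small sketch, assuming fixed-length shots of 15 frames and taking the first frame of each shot as its keyframe. The feature function here is a deliberately trivial stand-in (mean pixel intensity) and is not GoogLeNet; `downsample_keyframes`, `extract_feature`, and the toy frames are hypothetical.

```python
def downsample_keyframes(frames, shot_length=15):
    """Take the first frame of every `shot_length`-frame block as keyframe v_i."""
    return [frames[i] for i in range(0, len(frames), shot_length)]

def extract_feature(frame):
    """Hypothetical stand-in for F(): mean pixel intensity as a 1-D feature."""
    return [sum(frame) / len(frame)]

# 45 toy "frames", each a flat list of 4 pixel values.
frames = [[float(t)] * 4 for t in range(45)]
v = downsample_keyframes(frames)          # keyframes v_1 .. v_N
x = [extract_feature(v_i) for v_i in v]   # features x_i = F(v_i)
```

Here 45 frames yield N = 3 keyframes and three feature vectors, one per shot. In practice F would be a CNN producing a high-dimensional vector.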

(2) Summarization network
The summarization network uses self-supervised contrastive learning to calculate, for each shot, a score regarding whether to include the shot in the summary video. First, from the first shot group (Anchor) consisting of N shots, the first generation unit 13 generates the second shot group (Positive), used as the positive example in contrastive learning, and the third shot group (Intra-negative, also written simply as Negative), used as the negative example.

The first generation unit 13 generates the second shot group by applying to each shot of the first shot group a first process that maintains the relevance to the original video. The first process is one after which a user viewing the second shot group can still recognize it as the same video as the original first shot group. For example, the first process includes a process that maintains a predetermined temporal or spatial relationship with the original video.

As an example of a process that maintains a predetermined temporal relationship, the first process may include reversing the order of the shots in the first shot group. The second shot group in this case is expressed as $x^{pos}$ in equation (1):
$x^{pos} = \mathrm{reversed}(x) = \{x^{pos}_j\},\ j \in [1, 2, \ldots, N]$ (1)
where $x^{pos}_j = x_{N+1-j},\ j \in [1, 2, \ldots, N]$.
The first process may also be a process that preserves the original temporal relationship of the shots to some extent, such as dividing the shots of the first shot group into several groups and permuting the order of the groups.

As an example of a process that maintains a predetermined spatial relationship, the first process may include horizontally flipping each shot in the first shot group. The first process may also be an image transformation that does not destroy the features of the original image, such as rotating each shot or converting it to grayscale.

The first generation unit 13 also generates the third shot group by applying to each shot of the first shot group a second process that destroys the relevance to the original video. The second process includes a process that reduces relevance to the video more than the first process does. For example, the second process may include a process that removes a predetermined temporal or spatial relationship with the original video, or a process that replaces arbitrary frames of each shot with other frames.

As an example of a process that removes a predetermined temporal relationship, the second process includes shuffling the shots of the first shot group so that their order is random. The third shot group in this case is expressed as $x^{neg}$ in equation (2-1):
$x^{neg} = \mathrm{shuffle}(x)$ (2-1)
where $x^{neg} \neq x$.
The second process may also include replacing all frames in each shot with a specific frame (for example, the first frame). For example, the third shot group may be expressed as $x^{neg}$ in equation (2-2):

[Equation (2-2)]

Here, $m$ is the size of the repetition interval, $k$ is the index of each interval, and $x_i$ is the feature vector of the $i$-th shot.
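Equations (1) and (2-1) can be sketched in a few lines, assuming each shot is represented by a feature vector stored in a Python list. The retry loop is one simple way to enforce the condition that the shuffled sequence differs from the original; the function names and the optional `rng` argument are hypothetical, and equation (2-2) is not covered.

```python
import random

def make_positive(x):
    """Equation (1): reverse the shot order, x_pos[j] = x[N - 1 - j] (0-based)."""
    return list(reversed(x))

def make_negative(x, rng=random):
    """Equation (2-1): shuffle the shot order, retrying until it differs from x."""
    if all(shot == x[0] for shot in x):
        # Covers empty input and inputs whose shots are all identical,
        # for which no reordering can differ from the original.
        raise ValueError("need at least two distinct shots to shuffle")
    x_neg = list(x)
    while x_neg == list(x):
        rng.shuffle(x_neg)
    return x_neg

anchor = [[0.1], [0.5], [0.9], [0.3]]                 # first shot group
positive = make_positive(anchor)                      # second shot group
negative = make_negative(anchor, random.Random(0))    # third shot group
```

Both generated groups contain exactly the same shots as the anchor, so all three groups have N shots, which is what lets their scores be compared position by position.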

Next, the calculation unit 14 uses the predetermined learning model 14a to calculate, for each shot, a score regarding whether to include the shot in the summary video. In the example shown in FIG. 3, an LSTM (Long Short-Term Memory) is used as the learning model 14a. As a specific example, a bidirectional LSTM (Bi-LSTM) is used. Denoting the function of the learning model 14a by $f(\cdot)$, the calculation unit 14 calculates the score $s$ of the first shot group, the score $s^{pos}$ of the second shot group, and the score $s^{neg}$ of the third shot group using equations (3) to (5):
$s = f(x)$ (3)
$s^{pos} = \mathrm{reversed}(f(x^{pos}))$ (4)
$s^{neg} = f(x^{neg})$ (5)
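The way the three score sequences are formed can be illustrated with a short sketch. The scorer below is a hypothetical, order-sensitive stand-in for the Bi-LSTM and has no trainable parameters; each shot is reduced to a single number for simplicity. The point is only the alignment step: the positive scores are computed on the reversed sequence and then put back into the original order so that they can be compared with $s$ shot by shot.

```python
import math

def toy_scorer(x):
    """Hypothetical stand-in for f(): one score in (0, 1) per shot.

    Each score depends on the shot and on its left neighbour, so the
    output is sensitive to shot order, as a sequence model's would be.
    """
    scores, prev = [], 0.0
    for value in x:
        scores.append(1.0 / (1.0 + math.exp(-(value - 0.5 * prev))))
        prev = value
    return scores

def triplet_scores(x, x_neg, f=toy_scorer):
    s = f(x)                                      # equation (3)
    x_pos = list(reversed(x))                     # equation (1)
    s_pos = list(reversed(f(x_pos)))              # equation (4): back to anchor order
    s_neg = f(x_neg)                              # equation (5)
    return s, s_pos, s_neg

x = [0.2, 0.8, 0.5, 0.1]        # toy 1-D shot features (anchor)
x_neg = [0.5, 0.1, 0.8, 0.2]    # a shuffled copy (negative)
s, s_pos, s_neg = triplet_scores(x, x_neg)
```

After this step, `s[i]` and `s_pos[i]` both refer to the same shot of the original video, whereas `s_neg[i]` refers to whichever shot the shuffle placed at position i.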

In this embodiment, the scores $s = \{s_i\}$, $s^{pos} = \{s^{pos}_j\}$, and $s^{neg} = \{s^{neg}_k\}$ are also the importance values obtained from $x = \{x_i\}$, $x^{pos} = \{x^{pos}_j\}$, and $x^{neg} = \{x^{neg}_k\}$, respectively, with $i, j, k \in [1, 2, \ldots, N]$.

Consider the importance $s^{pos} = \{s^{pos}_j\}$, which indicates how important it is to include each shot in the summary video. Because the second shot group does not break the temporal dependence of the original video, its importance values, once rearranged into reverse order, should be similar to the importance $s = \{s_i\}$ of the first shot group. By contrast, because the third shot group breaks the temporal dependence of the original video, its importance $s^{neg} = \{s^{neg}_k\}$ should not be similar to the importance $s = \{s_i\}$ of the first shot group.

上述した重要度（スコア）の関係を用いて損失関数が設定される。本実施形態では、算出部１４は、第１ショット群のスコアｓと第２ショット群のスコアｓ^posとに基づく第１類似度と、第１ショット群のスコアｓと第３ショット群のスコアｓ^negとに基づく第２類似度とを用いる第１関数と、他の第２関数とを含む損失関数を用いる。 A loss function is set using the relationship between the importance values (scores) described above. In this embodiment, the calculation unit 14 uses a loss function that includes a first function and another, second function. The first function uses a first similarity based on the score s of the first shot group and the score s^pos of the second shot group, and a second similarity based on the score s of the first shot group and the score s^neg of the third shot group.

まず、第２関数について説明する。第２関数は、要約動画が、元の動画のうち所定の箇所（時間帯）から集中して選択されることを避けるべく、なるべく様々な時間帯から選択されるようにするための損失関数である。例えば、第２関数は、第１ショット群の各ショットのスコアｓと所定値σとの差、及び第２ショット群の各ショットのスコアｓ^posと所定値σとの差を用いる損失関数Ｌ_percentageであり、以下の式（６）で表される。

σは、所定のハイパーパラメータである。
なお、第２関数は、上記例に限られるものではなく、後述するように、再構成損失関数などでも適切に実装可能であることが、発明者らの実験により分かっている。 First, the second function will be described. The second function is a loss function that encourages the summary video to be selected from as many different time periods as possible, so as to avoid selection concentrated in a particular portion (time period) of the original video. For example, the second function is a loss function L_percentage that uses the difference between the score s of each shot in the first shot group and a predetermined value σ, and the difference between the score s^pos of each shot in the second shot group and the predetermined value σ, and is represented by the following equation (6).

σ is a predetermined hyperparameter.
Note that the second function is not limited to the above example; experiments by the inventors have shown that it can also be implemented appropriately with, for example, a reconstruction loss function, as described later.
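Equation (6) itself is not reproduced in this text, so the following is only one plausible reading of the description, offered as a sketch. The squared per-shot gap, the averaging over shots, and the function name `percentage_loss` are all assumptions; a variant that compares the mean score of each group with σ would fit the description equally well.

```python
def percentage_loss(s, s_pos, sigma=0.5):
    """Assumed form of L_percentage: mean squared gap between each score and sigma,
    added up over the first and second shot groups."""
    def mean_squared_gap(scores):
        return sum((v - sigma) ** 2 for v in scores) / len(scores)
    return mean_squared_gap(s) + mean_squared_gap(s_pos)

# Scores that sit exactly at sigma incur no penalty.
no_penalty = percentage_loss([0.5, 0.5], [0.5, 0.5])
# Scores pushed to the extremes are penalised.
some_penalty = percentage_loss([1.0, 0.0], [0.5, 0.5])
```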

次に、第１関数について説明する。例えば、第１関数は、第２ショット群のスコアｓ^pos＝｛ｓ_j ^pos｝と、第１ショット群のスコアｓ＝｛ｓ_i｝が類似するように、他方、第３ショット群のスコアｓ^neg＝｛ｓ_k ^neg｝と、第１ショット群のスコアｓ＝｛ｓ_i｝が類似しないようにするための損失関数Ｌ_contrastiveである。各類似度は、例えば式（９）を用いて、以下の式（７）（８）により算出される。

Next, the first function will be described. For example, the first function is a loss function L_contrastive that makes the score s^pos={s_j^pos} of the second shot group similar to the score s={s_i} of the first shot group, while making the score s^neg={s_k^neg} of the third shot group dissimilar to the score s={s_i} of the first shot group. Each similarity is calculated by the following equations (7) and (8), using, for example, equation (9).

また、算出部１４は、対照学習における損失関数として、雑音対照推定（ＮＣＥ：Noise Contrastive Estimation）損失を適用し、第１関数Ｌ_contrastiveを次の式（１０）で定義する。

算出部１４は、最終的な損失関数として、次の式（１１）で定義される関数Ｌ_pを用いる。

The calculation unit 14 also applies a noise contrastive estimation (NCE) loss as the loss function in contrastive learning, and defines the first function L_contrastive by the following equation (10).

The calculation unit 14 uses the function L_p defined by the following equation (11) as the final loss function.
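Equations (7) to (11) are not reproduced in this text, so the sketch below shows a common NCE-style contrastive loss rather than the patent's exact formulas. Cosine similarity, the `temperature` parameter, the unweighted sum in `total_loss`, and all function names are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length score sequences (assumed similarity measure)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def contrastive_nce_loss(s, s_pos, s_neg, temperature=0.1):
    """NCE-style loss: small when s is close to s_pos and far from s_neg."""
    pos = math.exp(cosine_similarity(s, s_pos) / temperature)  # first similarity
    neg = math.exp(cosine_similarity(s, s_neg) / temperature)  # second similarity
    return -math.log(pos / (pos + neg))

def total_loss(first_term, second_term, weight=1.0):
    """Assumed combination of the first and second functions (weight is hypothetical)."""
    return first_term + weight * second_term

s = [0.2, 0.8, 0.5]
loss_good = contrastive_nce_loss(s, s_pos=[0.2, 0.8, 0.5], s_neg=[0.8, 0.2, 0.5])
loss_tied = contrastive_nce_loss(s, s_pos=s, s_neg=s)
```

When the positive and negative sequences are equally similar to the anchor, the loss equals log 2; it drops below that value as the positive becomes more similar than the negative.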

算出部１４は、損失関数Ｌ_pが最小となるように、誤差逆伝搬法を用いて学習モデル１４ａのパラメータを更新し、学習モデルの最適化を図る。算出部１４は、パラメータが最適化された学習モデルを用いて最終的なスコアを算出する。 The calculation unit 14 updates the parameters of the learning model 14a using the error back propagation method so that the loss function L _p is minimized, thereby optimizing the learning model. The calculation unit 14 calculates the final score using the learning model whose parameters are optimized.

次に、第２関数として、再構成損失関数を用いる例について説明する。再構成損失関数Ｌ_reconは、次の式（１２）で表される。

算出部１４は、最終的な損失関数として、次の式（１３）で定義される関数Ｌ_rを用いてもよい。

なお、関数Ｌ_p又はＬ_rは、関数Ｌ_totalと表記してもよい。 Next, an example using a reconstruction loss function as the second function will be described. The reconstruction loss function L _recon is represented by the following equation (12).

The calculation unit 14 may use the function L_r defined by the following equation (13) as the final loss function.

Note that the function L_p or L_r may be written as the function L_total.

（３）事後処理
選択部１５は、例えば、要約動画が所定の長さになるように、スコアに関するナップサック問題を解くことで、各ショットを要約動画に含めるか否かを選択してよい。なお、ナップサック問題を解くためのアルゴリズムは任意であるが、例えば貪欲法を用いたり、動的計画法を用いたりしてよい。 (3) Post-processing
The selection unit 15 may select whether or not to include each shot in the summary video by, for example, solving a knapsack problem on the scores so that the summary video has a predetermined length. Any algorithm may be used to solve the knapsack problem; for example, a greedy method or dynamic programming may be used.
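The selection step can be sketched as a standard 0/1 knapsack solved by dynamic programming. The shot lengths in frames, the integer `budget`, and the name `select_shots` are assumptions for illustration; the source does not specify how lengths or the target duration are expressed.

```python
def select_shots(scores, lengths, budget):
    """0/1 knapsack by dynamic programming.

    scores:  importance score per shot
    lengths: length of each shot (integer, e.g. frames)
    budget:  maximum total length of the summary (integer)
    Returns the indices of the selected shots in temporal order.
    """
    n = len(scores)
    # best[i][c]: highest total score using only the first i shots within capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, value = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip shot i-1
            if length <= c and best[i - 1][c - length] + value > best[i][c]:
                best[i][c] = best[i - 1][c - length] + value  # option 2: take it
    # Walk back through the table to recover which shots were taken.
    selected, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)

scores = [0.9, 0.2, 0.8, 0.4]
lengths = [40, 30, 50, 20]
picked = select_shots(scores, lengths, budget=100)  # -> [0, 2]
```

Returning the indices in ascending order keeps the selected shots in their original temporal order, which suits concatenating them into the summary video.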

第２生成部１６は、選択されたショットに基づいて、要約動画Ｖ１を生成する。例えば、第２生成部１６は、選択されたショットを順番に連結して要約動画Ｖ１を生成する。本実施形態に係る情報処理装置１０によれば、対照学習を用いて自己教師ありの学習モデルを用いることで、アノテーションコストが不要であり、後述する実験結果が示すように適切な要約動画を生成することができる。 The second generation unit 16 generates a summary video V1 based on the selected shots. For example, the second generation unit 16 concatenates the selected shots in order to generate the summary video V1. According to the information processing apparatus 10 of the present embodiment, using a self-supervised learning model trained with contrastive learning eliminates annotation cost and, as the experimental results described later show, makes it possible to generate an appropriate summary video.

＜評価＞
図４は、本実施形態に係る情報処理装置１０により生成される要約動画のＦ値と比較例１及び２の要約動画のＦ値を示す図である。ここで、Ｆ値は、PrecisionとRecallの調和平均である２×Precision×Recall／（Precision＋Recall）で定義される値であり、Precision＝Ａ∩Ｂ／Ａ、Recall＝Ａ∩Ｂ／Ｂで定義される値であり、Ａは人が作成した要約動画であり、Ｂは本実施形態に係る情報処理装置１０（又は比較例）によって生成された要約動画である。Ｆ値は、１に近いほど正確かつ漏れの少ない要約ができていることを表す。 <Evaluation>
FIG. 4 is a diagram showing the F value of the summary video generated by the information processing apparatus 10 according to the present embodiment and the F values of the summary videos of Comparative Examples 1 and 2. Here, the F value is the harmonic mean of Precision and Recall, defined as 2×Precision×Recall/(Precision+Recall), where Precision=A∩B/A and Recall=A∩B/B, A is a summary video created by a person, and B is a summary video generated by the information processing apparatus 10 according to the present embodiment (or a comparative example). The closer the F value is to 1, the more accurate the summary is and the fewer omissions it has.
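The F value defined above can be computed as in the following sketch, which assumes that each summary is represented as a set of frame indices and that A∩B, A, and B are measured by the number of frames. The function name `f_value` is hypothetical. Precision and Recall follow the definitions in the text as written; since the F value is symmetric in the two, exchanging their roles would not change the result.

```python
def f_value(human_frames, generated_frames):
    """F value between a human summary A and a generated summary B (sets of frame indices)."""
    a, b = set(human_frames), set(generated_frames)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)   # A∩B / A, as defined in the text
    recall = overlap / len(b)      # A∩B / B, as defined in the text
    return 2 * precision * recall / (precision + recall)

human = range(0, 10)       # frames 0..9 chosen by a person
generated = range(5, 15)   # frames 5..14 chosen by the model
f = f_value(human, generated)  # -> 0.5
```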

図５は、本実施形態の評価に用いられる各データセットを示す図である。図４に示す例では、図５に示す各データセットが用いられる。 FIG. 5 is a diagram showing each data set used for evaluation in this embodiment. In the example shown in FIG. 4, each data set shown in FIG. 5 is used.

図４に示す比較例は、以下の（１）教師なし学習（unsupervised）と、（２）弱教師あり学習（weakly supervised）との手法が用いられる。
（比較例１）教師なし学習
ＳＵＭ－ＧＡＮ（Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic. Unsupervised video summarization with adversarial lstm networks. CVPR, pages 2982-2991, 2017.）
ＤＲ－ＤＳＮ（Kaiyang Zhou, Yu Qiao, and Tao Xiang. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. AAAI, page 7582-7589, 2018.）
ＳＵＭ－ＧＡＮ－ｓｌ（Evlampios Apostolidis, Alexandros I. Metsai, Eleni Adamantidou, Vasileios Mezaris, and Ioannis Patras. Stepwise, label-based approach for improving the adversarial training in unsupervised video summarization. AI4TV, page 17-25, 2019.）
Ｃｙｃｌｅ－ＳＵＭ（Li Yuan, Francis EH Tay, Ping Li, Li Zhou, and Jiashi Feng.
Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization. AAAI, pages 2711-2722, 2019.）
ＡＣＧＡＮ（Xufeng He, Yang Hua, Tao Song, Zongpu Zhang, Zhengui Xue, Ruhui Ma, Neil Robertson, and Haibing Guan. Unsupervised video summarization with attentive conditional generative adversarial networks. ACMMM, page 2296-2304,
2019.）
ＳＵＭ－ＧＡＮ－ＡＡＥ（Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, Vasileios Mezaris, and Ioannis Patras. Unsupervised video summarization via attention-driven adversarial learning. International Conference on Multimedia Modeling, pages 492-504, 2020.）
（比較例２）弱教師あり学習
ＭＷＳｕｍ（Yiyan Chen, Li Tao, Xueting Wang, and Toshihiko Yamasaki. Weakly supervised video summarization by hierarchical reinforcement learning. ACMMMAsia, page 1-6, 2019.） The comparative example shown in FIG. 4 uses the following methods of (1) unsupervised learning and (2) weakly supervised learning.
(Comparative Example 1) Unsupervised learning
SUM-GAN (Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic. Unsupervised video summarization with adversarial lstm networks. CVPR, pages 2982-2991, 2017.)
DR-DSN (Kaiyang Zhou, Yu Qiao, and Tao Xiang. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. AAAI, page 7582-7589, 2018.)
SUM-GAN-sl (Evlampios Apostolidis, Alexandros I. Metsai, Eleni Adamantidou, Vasileios Mezaris, and Ioannis Patras. Stepwise, label-based approach for improving the adversarial training in unsupervised video summarization. AI4TV, page 17-25, 2019.)
Cycle-SUM (Li Yuan, Francis EH Tay, Ping Li, Li Zhou, and Jiashi Feng. Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization. AAAI, pages 2711-2722, 2019.)
ACGAN (Xufeng He, Yang Hua, Tao Song, Zongpu Zhang, Zhengui Xue, Ruhui Ma, Neil Robertson, and Haibing Guan. Unsupervised video summarization with attentive conditional generative adversarial networks. ACMMM, page 2296-2304, 2019.)
SUM-GAN-AAE (Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, Vasileios Mezaris, and Ioannis Patras. Unsupervised video summarization via attention-driven adversarial learning. International Conference on Multimedia Modeling, pages 492-504, 2020.)
(Comparative Example 2) Weakly supervised learning
MWSum (Yiyan Chen, Li Tao, Xueting Wang, and Toshihiko Yamasaki. Weakly supervised video summarization by hierarchical reinforcement learning. ACMMMAsia, page 1-6, 2019.)

図４に示す例では、本実施形態に記載の手法（以下、「実施手法」とも表記する。）は、Ｐｒｏｐｏｓａｌとして表記され、ｐはＬ_p、ｒはＬ_rを表し、ｓｈは式（２－１）の第３ショット群を表し、ｒｅは式（２－２）の第３ショット群を表す（インターバルサイズは２０）。また、ｐｒｅ－ｔｒａｉｎｅｄは、アノテーションなしのＹＦＣＣ１００Ｍ内の９９２本のビデオを用いて、本実施形態の学習モデルを事前訓練した手法を表す。 In the example shown in FIG. 4, the method described in this embodiment (hereinafter also referred to as the "implementation method") is labeled Proposal, where p represents L_p, r represents L_r, sh represents the third shot group of equation (2-1), and re represents the third shot group of equation (2-2) (the interval size is 20). Also, pre-trained represents a method in which the learning model of this embodiment is pre-trained using 992 unannotated videos in YFCC100M.

図４に示すとおり、本実施形態に記載の各実施手法（各Proposal）は、同じ教師なし学習の比較例に比べて、ほぼ全てにおいて適切な要約動画を生成することができている。また、本実施形態に記載の各実施手法は、弱教師あり学習の比較例と比べても、ほぼ全てにおいて適切な要約動画を生成することができている。なお、図４に示す本実施形態に記載の手法は、図３に示すモデルに基づいている。 As shown in FIG. 4, each implementation method (each Proposal) described in this embodiment generates more appropriate summary videos than the comparative examples that likewise use unsupervised learning, in almost all cases. Each implementation method also generates more appropriate summary videos than the weakly supervised comparative example in almost all cases. Note that the methods of this embodiment shown in FIG. 4 are based on the model shown in FIG. 3.

図６は、本実施形態に係る情報処理装置１０により生成される要約動画のＦ値と比較例３の要約動画のＦ値を示す図である。比較例３は、以下の教師あり学習（Supervised）の手法が用いられる。
（比較例３）教師あり学習
ｖｓＬＳＴＭ（Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. ECCV pages 766-782, 2016）
ｄｐｐＬＳＴＭ（Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. ECCV pages 766-782, 2016）
ＳＵＭ－ＧＡＮｓｕｐ（Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic. Unsupervised video summarization with adversarial lstm networks. CVPR, pages 2982-2991, 2017.）
ＤＲ－ＤＳＮｓｕｐ（Kaiyang Zhou, Yu Qiao, and Tao Xiang. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. AAAI, page 7582-7589, 2018.）
ＶＡＳＮｅｔ（Jiri Fajtl, Hajar Sadeghi Sokeh, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. Summarizing videos with attention. ACCV, pages 39-54, 2018.）
ＤＭＡＳｕｍ（Li Yuan, Francis EH Tay, Ping Li, Li Zhou, and Jiashi Feng. Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization. AAAI, pages 2711-2722, 2019.） FIG. 6 is a diagram showing the F value of the summary video generated by the information processing apparatus 10 according to the present embodiment and the F value of the summary video of Comparative Example 3. Comparative Example 3 uses the following supervised learning methods.
(Comparative Example 3) Supervised learning
vsLSTM (Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. ECCV pages 766-782, 2016)
dppLSTM (Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. ECCV pages 766-782, 2016)
SUM-GANsup (Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic. Unsupervised video summarization with adversarial lstm networks. CVPR, pages 2982-2991, 2017.)
DR-DSNsup (Kaiyang Zhou, Yu Qiao, and Tao Xiang. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. AAAI, page 7582-7589, 2018.)
VASNet (Jiri Fajtl, Hajar Sadeghi Sokeh, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. Summarizing videos with attention. ACCV, pages 39-54, 2018.)
DMASum (Li Yuan, Francis EH Tay, Ping Li, Li Zhou, and Jiashi Feng. Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization. AAAI, pages 2711-2722, 2019.)

図６に示すラベルフリー（label-free）の「Ｘ」は、人手によるアノテーションが必須であることを示し、「Ｙ」は、ラベルが要求されないことを示す。また、「＋」は、実施手法よりも良いことを示し、「－」は、実施手法の方が改善できていることを示す。図６に示すとおり、教師なし学習の各実施手法は、ほとんどのケースにおいて、教師あり学習の手法よりも改善できている。これにより、実施手法はアノテーションがないにも関わらず実用性が高いと言える。 In the label-free column shown in FIG. 6, "X" indicates that manual annotation is required, and "Y" indicates that no labels are required. Also, "+" indicates that the comparative method is better than the implementation method, and "-" indicates that the implementation method is better. As shown in FIG. 6, each unsupervised implementation method improves on the supervised learning methods in most cases. From this, it can be said that the implementation method is highly practical even though it uses no annotations.

なお、実施手法の損失関数における第２関数Ｌ_percentageで用いられるσについて、０．１から１．０までの間で変動させ、ＳｕｍＭｅとＴＶＳｕｍとのデータセットについてＦ値が調べられたところ、０．５が双方で良い結果であったので、本実施形態では、σ＝０．５が使用される。しかしながら、所定値σの０．５は一例であって、動画の特徴に応じて適宜変更されてもよい。 The value of σ used in the second function L_percentage in the loss function of the implementation method was varied between 0.1 and 1.0, and the F value was examined on the SumMe and TVSum data sets. Since 0.5 gave good results on both, σ=0.5 is used in this embodiment. However, the value 0.5 for σ is only an example, and may be changed as appropriate according to the characteristics of the video.

以上、実施手法は、教師あり学習、弱教師あり学習、その他の教師なし学習の比較手法に比べて、より適切かつ漏れの少ない要約動画を生成することができていると言える。 As described above, it can be said that the implementation method is able to generate a more appropriate summary video with fewer omissions than the comparison methods of supervised learning, weakly supervised learning, and other unsupervised learning.

図７は、本実施形態に係る情報処理装置１０により生成される要約動画のτ値及びρ値と比較例により生成される要約動画のτ値及びρ値を示す図である。τ値は、ケンドールの順位相関係数であり、正答の要約動画と情報処理装置１０（又は比較例）によって生成された要約動画との関連性を表す。また、ρ値は、スピアマンの順位相関係数であり、正答の要約動画と情報処理装置１０（又は比較例）によって生成された要約動画との関連性を表す。いずれの値も、１に近いほど正答との関連性が強いことを表す。なお、図７では、参考のため、人（Human）が要約動画を作成した場合のτ値及びρ値を記載している。 FIG. 7 is a diagram showing the τ and ρ values of the digest video generated by the information processing apparatus 10 according to the present embodiment and the τ and ρ values of the digest video generated by the comparative example. The τ value is Kendall's rank correlation coefficient, and represents the relevance between the summarized video of the correct answer and the summarized video generated by the information processing device 10 (or the comparative example). The ρ value is Spearman's rank correlation coefficient, and represents the relationship between the summarized video of the correct answer and the summarized video generated by the information processing device 10 (or the comparative example). For any value, the closer to 1, the stronger the relationship with the correct answer. For reference, FIG. 7 shows the τ value and the ρ value when a human creates a digest video.
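The two rank correlation coefficients can be sketched in plain Python as follows. The sketch assumes that the comparison is made between two equal-length sequences of per-shot importance scores (reference versus generated) and that neither sequence contains tied values, so it implements the tie-free forms of Kendall's τ and Spearman's ρ. The function names are hypothetical.

```python
def kendall_tau(a, b):
    """Kendall's tau for sequences without ties: (concordant - discordant) / number of pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (a[i] - a[j]) * (b[i] - b[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def ranks(values):
    """Rank of each value, 1 for the smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(a, b):
    """Spearman's rho for sequences without ties, from squared rank differences."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

reference = [0.1, 0.4, 0.3, 0.9]   # e.g. human importance scores
predicted = [0.2, 0.5, 0.1, 0.8]   # e.g. model scores
tau = kendall_tau(reference, predicted)
rho = spearman_rho(reference, predicted)
```

Both coefficients are 1 when the two sequences rank the shots identically and -1 when the rankings are exactly reversed.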

図７に示す例では、データセットとしてＴＶＳｕｍが用いられる。また、比較例として、教師あり学習は、ＤＰＰ－ＬＳＴＭ、ＤＭＡＳｕｍ、教師なし学習は、ＳＵＭ－ＧＡＮ、ＤＲ－ＤＳＮ、弱教師あり学習は、ＭＷＳｕｍがそれぞれ使用される。 In the example shown in FIG. 7, TVSum is used as the data set. As comparative examples, DPP-LSTM and DMASum are used for supervised learning, SUM-GAN and DR-DSN are used for unsupervised learning, and MWSum is used for weakly supervised learning.

まず、Ｐｒｏｐｏｓａｌ（ｐｒｅ）の実施手法は、事前の訓練の効果が表れ、各提案手法の中で一番よい結果となっている。また、各実施手段は、τ値及びρ値について、いずれもよい結果を表しているが、特に、Ｐｒｏｐｏｓａｌ（ｐ＋ｓｈ）、Ｐｒｏｐｏｓａｌ（ｒ＋ｒｅ）、Ｐｒｏｐｏｓａｌ（ｐｒｅ）がＤＭＡＳｕｍ以外の比較例よりも良い結果となっている。 First, the Proposal (pre) implementation method shows the effect of pre-training and gives the best result among the proposed methods. All implementation methods show good results for both the τ value and the ρ value; in particular, Proposal (p+sh), Proposal (r+re), and Proposal (pre) give better results than every comparative example other than DMASum.

このように、Ｆ値以外の指標によって比較しても、本実施形態に係る情報処理装置１０は、従来の比較例より適切な要約動画を生成できていることが確認できる。 As described above, it can be confirmed that the information processing apparatus 10 according to the present embodiment can generate a more appropriate digest video than the conventional comparative example, even when compared with indexes other than the F value.

次に、実施手法の一般性・汎用性について説明する。図８は、教師なし学習のＳＵＭ－ＧＡＮのモデルを示す図である。図８に示すＳＵＭ－ＧＡＮのモデルのｓＬＳＴＭ部分に、実施手法を適用することが可能である。すなわち、実施手法は、既存の学習モデルにも適用可能であり、汎用性が高い。 Next, the generality and versatility of the implementation method will be explained. FIG. 8 is a diagram showing a model of SUM-GAN for unsupervised learning. The implementation approach can be applied to the sLSTM part of the SUM-GAN model shown in FIG. In other words, the implementation method is applicable to existing learning models and has high versatility.

図９は、既存の学習モデルに対して実施手法の適用有無を比較するための図である。図９に示す例では、データセットとして、ＳｕｍＭｅとＴＶＳｕｍとが使用される。また、図８に示すＳＵＭ－ＧＡＮの学習モデルに対して、実施手法の適用有無によるＦ値の違いを示し、単純なＬＳＴＭのＦ値と、図３に示す実施手法とのＦ値の違いを示す。 FIG. 9 is a diagram comparing an existing learning model with and without the implementation method applied. In the example shown in FIG. 9, SumMe and TVSum are used as data sets. The figure shows the difference in F value for the SUM-GAN learning model shown in FIG. 8 depending on whether the implementation method is applied, and the difference in F value between a simple LSTM and the implementation method shown in FIG. 3.

図９に示すとおり、既存のＳＵＭ－ＧＡＮよりも、対照学習を用いる図３に示す要約ネットワークを適用したＳＵＭ－ＧＡＮの方が、Ｆ値が高い。また、単純なＬＳＴＭよりも、図３に示す要約ネットワークを適用したＬＳＴＭ（実施手法）の方が、Ｆ値が高い。 As shown in FIG. 9, the SUM-GAN applying the summary network shown in FIG. 3 with contrast learning has a higher F-measure than the existing SUM-GAN. Also, the LSTM (implementation method) to which the summary network shown in FIG. 3 is applied has a higher F value than the simple LSTM.

次に、実施手法の収束速度について説明する。図１０は、図９に示す各手法の収束速度を示す図である。ＬＳＴＭについて、（ｂ）に表される実施手法のエポック数は、（ａ）に表される単純ＬＳＴＭのエポック数よりも少ない。したがって、実施手法は、単純ＬＳＴＭよりも学習速度が速いことを示す。また、ＳＵＭ－ＧＡＮについて、（ｄ）に表される実施手法を適用したＳＵＭ－ＧＡＮのエポック数は、（ｃ）に表される実施手法を適用していないＳＵＭ－ＧＡＮのエポック数よりも少ない。したがって、実施手法は、既存のＳＵＭ－ＧＡＮに適用されることで、性能も学習速度も速くなることを示す。 Next, the convergence speed of the implementation method will be described. FIG. 10 is a diagram showing the convergence speed of each technique shown in FIG. For the LSTM, the number of epochs of the implementation scheme depicted in (b) is less than that of the simple LSTM depicted in (a). Therefore, the implementation approach shows faster learning speed than the naive LSTM. Also, for SUM-GAN, the number of epochs of SUM-GAN to which the implementation method represented in (d) is applied is less than the number of epochs of SUM-GAN to which the implementation method represented in (c) is not applied. . Therefore, the implementation approach is shown to be applied to existing SUM-GANs to improve both performance and learning speed.

実施手法の適用により学習速度（収束速度）が速くなる理由としては、第１関数Ｌ_contrastiveを損失関数に含めることで、動画に対する表現能力が高くなり、学習の反復回数を減らすことができるからと考えられる。 The learning speed (convergence speed) is thought to increase when the implementation method is applied because including the first function L_contrastive in the loss function improves the model's ability to represent videos, which reduces the number of training iterations needed.

図１１は、本実施形態に係る実施手法と比較手法により選択されたフレームを示す図である。図１１に示す例では、ＴＶＳｕｍに含まれる、犬の耳を掃除する動画について要約動画が生成される。（ａ）は、オリジナルの動画を示し、（ｂ）は、比較手法の一つ、教師なし学習のＤＲ－ＤＳＮにより生成される要約動画を示し、（ｃ）は、比較手法の一つ、弱教師あり学習のＭＷＳｕｍにより生成される要約動画を示し、（ｄ）は、図３に示すＬ_percentageを用いる実施手法により生成される要約動画を示す。 FIG. 11 is a diagram showing frames selected by the implementation method according to this embodiment and by comparison methods. In the example shown in FIG. 11, summary videos are generated for a video of cleaning a dog's ears included in TVSum. (a) shows the original video, (b) shows the summary video generated by DR-DSN, an unsupervised comparison method, (c) shows the summary video generated by MWSum, a weakly supervised comparison method, and (d) shows the summary video generated by the implementation method using L_percentage shown in FIG. 3.

また、図１１に示す（ｂ）～（ｄ）のバーの高さは、アノテーションにより得られた要約動画に含められるか否かを示すスコアであり、バーが高いほど、そのフレームは要約動画に含められるべきであることを示す。（ｄ）の要約動画は、（ｂ）の要約動画よりも、冒頭部分の重要ではないフレームが選択されておらず、要約動画として選択されるべき、バーの高さが高い中間部分から多くのフレームが選択されている。また、（ｄ）の要約動画は、（ｃ）の要約動画よりも、要約動画として選択されるべき、バーの高さが高い中間部分から多くのフレームが選択されている。これにより、（ｄ）の要約動画のＦ値（７５．２）が、他の従来技術の手法のＦ値よりも大きくなることが分かる。なお、図３に示すＬ_reconを用いる場合、Ｆ値は７１．９であることが確認されており、いずれの従来技術の手法のＦ値よりも大きい。 The height of the bars in (b) to (d) of FIG. 11 is an annotation-derived score indicating whether a frame should be included in the summary video; the taller the bar, the more the frame should be included. Compared with the summary video of (b), the summary video of (d) does not select the unimportant frames at the beginning and selects more frames from the middle portion, where the bars are tall and which should be selected for the summary video. Likewise, compared with the summary video of (c), the summary video of (d) selects more frames from the middle portion where the bars are tall. This explains why the F value of the summary video in (d) (75.2) is larger than the F values of the other prior-art methods. When L_recon shown in FIG. 3 is used, the F value was confirmed to be 71.9, which is also larger than the F value of any of the prior-art methods.

さらに、実施手法の損失関数は、選択されるショット（又はフレーム）が同じ場面に偏るのを防ぐための第２関数を含めているため、中間部分だけではなく、冒頭部分などのショット（又はフレーム）も要約動画として選択されている。 Furthermore, because the loss function of the implementation method includes the second function, which prevents the selected shots (or frames) from being biased toward the same scene, shots (or frames) from portions such as the beginning, and not only from the middle portion, are also selected for the summary video.

＜動作手順＞
図１２は、本実施形態に係る情報処理装置１０により実行される動画要約処理の一例を示すフローチャートである。 <Operation procedure>
FIG. 12 is a flowchart showing an example of the video summarization processing executed by the information processing apparatus 10 according to this embodiment.

ステップＳ１０２において、情報処理装置１０の変換部１２は、取得部１１により取得された複数のフレームを含む動画を、複数のフレームより数が少ない複数のショットに変換する。 In step S102, the conversion unit 12 of the information processing apparatus 10 converts the video including a plurality of frames acquired by the acquisition unit 11 into a plurality of shots that are fewer in number than the plurality of frames.

ステップＳ１０４において、第１生成部１３は、複数のショットを含む第１ショット群に対し、オリジナルの動画との関連性維持に関する第１処理を複数のショットに加えて第２ショット群を生成する。 In step S104, the first generating unit 13 generates a second shot group by applying a first process related to maintaining the relationship with the original moving image to the first shot group including the multiple shots.

ステップＳ１０６において、第１生成部１３は、第１ショット群に対し、第１処理よりもオリジナルの動画との関連性をなくす第２処理を複数のショットに加えて第３ショット群を生成する。ステップＳ１０４とＳ１０６との順序は不問であり、同時に処理されてよい。 In step S106, the first generation unit 13 generates a third shot group by applying, to the plurality of shots of the first shot group, a second process that reduces the relevance to the original video more than the first process does. Steps S104 and S106 may be performed in either order, or simultaneously.

ステップＳ１０８において、算出部１４は、第１ショット群をアンカー、第２ショット群を正例、第３ショット群を負例とし、第１ショット群、第２ショット群、及び第３ショット群ごとに、各ショットを動画要約に含めるか否かに関する自己教師ありの対照学習により生成される学習モデル１４ａによって、各ショットのスコアを算出する。 In step S108, the calculation unit 14 treats the first shot group as an anchor, the second shot group as a positive example, and the third shot group as a negative example. , a score for each shot is calculated by a learning model 14a generated by self-supervised contrastive learning on whether to include each shot in the video summary.

ステップＳ１１０において、選択部１５は、第１ショット群の各ショットのスコアと第２ショット群の各ショットのスコアとに基づく第１類似度と、第１ショット群の各ショットのスコアと第３ショット群の各ショットのスコアとに基づく第２類似度とを用いる第１関数と、他の第２関数とを含む損失関数を用いて最適化された学習モデル１４ａにより算出される第１ショット群の各ショットのスコアに基づいて、各ショットそれぞれを要約動画に含めるか否かを選択する。 In step S110, the selection unit 15 calculates a first similarity based on the score of each shot in the first shot group and the score of each shot in the second shot group, the score of each shot in the first shot group, and the third shot of the first shot group calculated by the learning model 14a optimized using a loss function including a first function using a second similarity based on the score of each shot in the group and another second function Based on the score of each shot, select whether or not to include each shot in the summary video.

ステップＳ１１２において、第２生成部１６は、選択されたショットに基づいて、要約動画を生成する。 In step S112, the second generator 16 generates a digest video based on the selected shots.

図１３は、本実施形態に係る情報処理装置１０により実行される学習処理の一例を示すフローチャートである。図１３に示す学習処理は、図１２に示すステップＳ１０８の学習処理の一例を示す。 FIG. 13 is a flowchart showing an example of the learning processing executed by the information processing apparatus 10 according to this embodiment. The learning processing shown in FIG. 13 is an example of the learning processing in step S108 shown in FIG. 12.

ステップＳ２０２において、算出部１４は、例えば式（３）～（５）により、第１～第３の各ショット群に対し、学習モデル１４ａによって、各ショットを動画要約に含めるか否かに関するスコアを算出する。 In step S202, the calculation unit 14 uses the learning model 14a for each of the first to third shot groups, for example, by formulas (3) to (5) to obtain a score regarding whether or not each shot is included in the video summary. calculate.

ステップＳ２０４において、算出部１４は、例えば式（７）により、第１ショット群のスコアｓと、第２ショット群のスコアｓ^posとの第１類似度を算出する。 In step S204, the calculation unit 14 calculates a first similarity between the score s of the first shot group and the score s ^pos of the second shot group, for example, using Equation (7).

ステップＳ２０６において、算出部１４は、例えば式（８）により、第１ショット群のスコアｓと第３ショット群のスコアｓ^negとの第２類似度を算出する。ステップＳ２０４とＳ２０６の順序は不問であり、同時に処理されてもよい。 In step S206, the calculation unit 14 calculates a second similarity between the score s of the first shot group and the score s^neg of the third shot group, for example, using equation (8). Steps S204 and S206 may be performed in either order, or simultaneously.

ステップＳ２０８において、算出部１４は、例えば式（１１）により、第１及び第２類似度を用いる第１関数（例えば式（１０））と、他の第２関数（例えば式（６））とを含む損失関数の値を算出する。 In step S208, the calculation unit 14 calculates, for example by equation (11), the value of a loss function that includes a first function using the first and second similarities (for example, equation (10)) and another, second function (for example, equation (6)).

ステップＳ２１０において、算出部１４は、損失関数の値が最小化されるように、所定の学習条件が満たされたか否かを判定する。所定の学習条件は、例えば、所定数のエポック数を超えることでもよい。学習条件が満たされれば（ステップＳ２１０－ＹＥＳ）、処理は終了し、学習条件が満たされていなければ（ステップＳ２１０－ＮＯ）、処理はステップＳ２１２に進む。 In step S210, the calculator 14 determines whether or not a predetermined learning condition is satisfied so that the value of the loss function is minimized. A predetermined learning condition may be, for example, exceeding a predetermined number of epochs. If the learning condition is satisfied (step S210-YES), the process ends, and if the learning condition is not satisfied (step S210-NO), the process proceeds to step S212.

ステップＳ２１２において、算出部１４は、誤差逆伝搬法により学習モデル１４ａのパラメータを更新する。その後、処理はステップＳ２０２に戻り、更新されたパラメータを用いて学習が続行される。 In step S212, the calculation unit 14 updates the parameters of the learning model 14a by backpropagation. The process then returns to step S202, and learning continues using the updated parameters.
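The loop of steps S202 to S212 can be sketched end to end as below. This is a dependency-free illustration under heavy assumptions, not the patented training procedure: `score` is a three-parameter toy model rather than a Bi-LSTM, the similarity is cosine similarity with a hypothetical `temperature`, the second function is an assumed squared gap to `sigma`, and gradients are estimated by central finite differences instead of backpropagation so that no deep learning library is needed.

```python
import math
import random

def score(shots, w):
    """Toy stand-in for learning model 14a (hypothetical): sigmoid of a weighted sum
    of each shot's feature and the previous shot's feature."""
    out = []
    for i, x in enumerate(shots):
        prev = shots[i - 1] if i > 0 else 0.0
        z = w[0] * x + w[1] * prev + w[2]
        z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def loss(w, x, x_pos, x_neg, sigma=0.5, temperature=0.1):
    s = score(x, w)                               # S202: scores of the three groups
    s_pos = list(reversed(score(x_pos, w)))
    s_neg = score(x_neg, w)
    sim_pos = cosine(s, s_pos) / temperature      # S204: first similarity
    sim_neg = cosine(s, s_neg) / temperature      # S206: second similarity
    first = math.log(math.exp(sim_pos) + math.exp(sim_neg)) - sim_pos
    second = sum((v - sigma) ** 2 for v in s + s_pos) / len(s)
    return first + second                         # S208: total loss

def train(x, max_epochs=20, lr=0.05, eps=1e-5, seed=0):
    x_pos = list(reversed(x))                     # positive group: reversed order
    x_neg = list(x)
    random.Random(seed).shuffle(x_neg)            # negative group: shuffled order
    w = [0.5, -0.5, 0.0]
    history = []
    for _ in range(max_epochs):                   # S210: stop after a fixed number of epochs
        history.append(loss(w, x, x_pos, x_neg))
        grad = []
        for k in range(len(w)):                   # finite differences, NOT backpropagation
            hi = list(w)
            hi[k] += eps
            lo = list(w)
            lo[k] -= eps
            grad.append((loss(hi, x, x_pos, x_neg) - loss(lo, x, x_pos, x_neg)) / (2 * eps))
        w = [wk - lr * gk for wk, gk in zip(w, grad)]  # S212: parameter update
    return w, history

weights, history = train([0.1, 0.9, 0.2, 0.8, 0.3, 0.7])
```

In a practical implementation the finite-difference step would be replaced by automatic differentiation, and the stopping condition could be any predetermined learning condition rather than a fixed epoch count.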

以上説明した実施形態は、本発明の理解を容易にするためのものであり、本発明を限定して解釈するためのものではない。実施形態が備える各要素並びにその配置、材料、条件、形状及びサイズ等は、例示したものに限定されるわけではなく適宜変更することができる。また、異なる実施形態で示した構成同士を部分的に置換し又は組み合わせることが可能である。なお、本実施形態は、スポーツを撮影した動画の要約や、結婚式の様子を撮影した動画の要約など、様々な動画の要約生成に利用することが可能である。 The embodiments described above are for facilitating understanding of the present invention, and are not intended to limit and interpret the present invention. Each element included in the embodiment and its arrangement, materials, conditions, shape, size, etc. are not limited to those illustrated and can be changed as appropriate. Also, it is possible to partially replace or combine the configurations shown in different embodiments. Note that this embodiment can be used to generate a summary of various moving images, such as a summary of a moving image of sports or a moving image of a wedding ceremony.

１０…情報処理装置、１０ａ…ＣＰＵ、１０ｂ…ＲＡＭ、１０ｃ…ＲＯＭ、１０ｄ…通信部、１０ｅ…入力部、１０ｆ…表示部、１１…取得部、１２…変換部、１３…第１生成部、１４…算出部、１４ａ…学習モデル、１５…選択部、１６…第２生成部 DESCRIPTION OF SYMBOLS 10... Information processing apparatus 10a...CPU, 10b...RAM, 10c...ROM, 10d...Communication part, 10e...Input part, 10f...Display part, 11... Acquisition part, 12... Conversion part, 13...First generation part, 14... calculator, 14a... learning model, 15... selector, 16... second generator

Claims

情報処理装置に、
複数のフレームを含む動画を、前記複数のフレームより数が少ない複数のショットに変換すること、
前記複数のショットを含む第１ショット群に対し、前記動画との関連性維持に関する第１処理を前記複数のショットに加えて第２ショット群を生成し、前記第１処理よりも前記動画との関連性をなくす第２処理を前記複数のショットに加えて第３ショット群を生成すること、
前記第１ショット群をアンカー、前記第２ショット群を正例、前記第３ショット群を負例とし、前記第１ショット群、前記第２ショット群、及び前記第３ショット群ごとに、各ショットを動画要約に含めるか否かに関する自己教師ありの対照学習により生成される学習モデルによって、各ショットのスコアを算出すること、
前記第１ショット群の各ショットのスコアと前記第２ショット群の各ショットのスコアとに基づく第１類似度と、前記第１ショット群の各ショットのスコアと前記第３ショット群の各ショットのスコアとに基づく第２類似度とを用いる第１関数と、他の第２関数とを含む損失関数を用いて最適化された前記学習モデルにより算出される前記第１ショット群の各ショットのスコアに基づいて、前記各ショットそれぞれを要約動画に含めるか否かを選択すること、
選択されたショットに基づいて、要約動画を生成することと、
を実行させる、プログラム。 A program causing an information processing device to execute:
converting a video including a plurality of frames into a plurality of shots fewer in number than the plurality of frames;
generating a second shot group by applying, to the plurality of shots of a first shot group including the plurality of shots, a first process related to maintaining relevance to the video, and generating a third shot group by applying, to the plurality of shots, a second process that reduces relevance to the video more than the first process;
calculating, with the first shot group as an anchor, the second shot group as a positive example, and the third shot group as a negative example, a score of each shot for each of the first shot group, the second shot group, and the third shot group, using a learning model generated by self-supervised contrastive learning regarding whether or not to include each shot in a video summary;
selecting whether or not to include each of the shots in a summary video, based on the score of each shot in the first shot group calculated by the learning model optimized using a loss function that includes a first function and another, second function, the first function using a first similarity based on the score of each shot in the first shot group and the score of each shot in the second shot group and a second similarity based on the score of each shot in the first shot group and the score of each shot in the third shot group; and
generating a summary video based on the selected shots.

前記第２関数は、前記第１ショット群の各ショットのスコアと所定値との差、及び前記第２ショット群の各ショットのスコアと前記所定値との差を用いる関数を含む、請求項１に記載のプログラム。 2. The program according to claim 1, wherein the second function includes a function that uses a difference between the score of each shot in the first shot group and a predetermined value, and a difference between the score of each shot in the second shot group and the predetermined value.

前記第１処理は、前記動画に対し、所定の時間的関係又は空間的関係を維持する処理を含む、請求項１又は２に記載のプログラム。 3. The program according to claim 1 or 2, wherein the first process includes a process of maintaining a predetermined temporal relationship or spatial relationship with respect to the video.

前記所定の時間的関係を維持する処理は、前記第１ショット群の各ショットの順番を逆順にする処理を含む、請求項３に記載のプログラム。 4. The program according to claim 3, wherein said processing for maintaining said predetermined temporal relationship includes processing for reversing the order of said shots of said first shot group.

前記第２処理は、前記動画に対し、所定の時間的関係又は空間的関係をなくす処理、あるいは各ショットの任意のフレームを他のフレームに置換する処理を含む、請求項１から４のいずれか一項に記載のプログラム。 5. The program according to any one of claims 1 to 4, wherein the second process includes a process of eliminating a predetermined temporal relationship or spatial relationship with respect to the video, or a process of replacing an arbitrary frame of each shot with another frame.

6. The program according to claim 5, wherein the process that eliminates the predetermined temporal relationship includes a process that randomizes the order of the shots in the first shot group.
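For illustration only, the order-reversing process of claim 4 and the order-randomizing process of claim 6 can be sketched as follows. Representing a shot group as a Python list and the names `make_positive` and `make_negative` are assumptions made for this sketch.

```python
import random


def make_positive(first_shot_group):
    """Second shot group: the same shots in reverse order (claim 4)."""
    return list(reversed(first_shot_group))


def make_negative(first_shot_group, rng):
    """Third shot group: the same shots in random order (claim 6).

    Note: a random permutation can occasionally reproduce the original or the
    reversed order; a practical implementation might resample in that case.
    """
    shuffled = list(first_shot_group)  # copy so the anchor is left unchanged
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical shot identifiers standing in for per-shot features.
anchor = ["shot0", "shot1", "shot2", "shot3", "shot4"]
positive = make_positive(anchor)
negative = make_negative(anchor, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible. Both functions return new lists, so the first shot group remains available as the anchor.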

An information processing method in which an information processing device executes:
converting a video including a plurality of frames into a plurality of shots that are fewer in number than the plurality of frames;
generating, for a first shot group including the plurality of shots, a second shot group by applying to the plurality of shots a first process related to maintaining relevance to the video, and a third shot group by applying to the plurality of shots a second process that reduces the relevance to the video as compared to the first process;
calculating, with the first shot group as an anchor, the second shot group as a positive example, and the third shot group as a negative example, a score for each shot of each of the first shot group, the second shot group, and the third shot group, using a learning model generated by self-supervised contrastive learning on whether to include each shot in the video summary;
selecting whether to include each of the shots in a summary video based on the score of each shot in the first shot group calculated by the learning model, the learning model having been optimized using a loss function that includes a first function and another, second function, the first function using a first similarity based on the score of each shot in the first shot group and the score of each shot in the second shot group, and a second similarity based on the score of each shot in the first shot group and the score of each shot in the third shot group; and
generating a summary video based on the selected shots.
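For illustration only, the selecting step of the method can be sketched as follows. The claim states only that the selection is based on the score of each shot in the first shot group. The greedy strategy, the length budget, and the name `select_shots` are assumptions made for this sketch.

```python
def select_shots(scores, lengths, budget):
    """Hypothetical selection: take shots in descending score order while the
    total length stays within `budget`, then restore chronological order."""
    by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    used = 0
    for i in by_score:
        if used + lengths[i] <= budget:
            chosen.append(i)
            used += lengths[i]
    return sorted(chosen)  # indices of the shots to include in the summary


# Hypothetical scores and shot lengths (in frames) for four shots.
selected = select_shots(
    scores=[0.9, 0.2, 0.7, 0.4],
    lengths=[30, 20, 40, 10],
    budget=80,
)
```

The summary video would then be generated by concatenating the selected shots in the returned order. Other selection rules, such as a fixed score threshold, would also be consistent with the wording of the claim.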

An information processing device comprising:
a conversion unit that converts a video including a plurality of frames into a plurality of shots that are fewer in number than the plurality of frames;
a first generation unit that, for a first shot group including the plurality of shots, generates a second shot group by applying to the plurality of shots a first process related to maintaining relevance to the video, and generates a third shot group by applying to the plurality of shots a second process that reduces the relevance to the video as compared to the first process;
a calculation unit that, with the first shot group as an anchor, the second shot group as a positive example, and the third shot group as a negative example, calculates a score for each shot of each of the first shot group, the second shot group, and the third shot group, using a learning model generated by self-supervised contrastive learning on whether to include each shot in the video summary;
a selection unit that selects whether to include each of the shots in a summary video based on the score of each shot in the first shot group calculated by the learning model, the learning model having been optimized using a loss function that includes a first function and another, second function, the first function using a first similarity based on the score of each shot in the first shot group and the score of each shot in the second shot group, and a second similarity based on the score of each shot in the first shot group and the score of each shot in the third shot group; and
a second generation unit that generates a summary video based on the selected shots.