JP6938698B2

JP6938698B2 - A framework that combines multi-global descriptors for image search

Info

Publication number: JP6938698B2
Application number: JP2020031803A
Authority: JP
Inventors: 秉秀高; 希宰全; 鍾澤金; 金　永俊; 永俊金; 仁植金
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2019-03-22
Filing date: 2020-02-27
Publication date: 2021-09-22
Anticipated expiration: 2040-02-27
Also published as: JP2020155111A

Description

The following description relates to a deep learning model framework for image retrieval.

Image descriptors based on convolutional neural networks (CNNs) are widely used as general-purpose descriptors in computer vision tasks, including classification, object detection, and semantic segmentation. They are also used in other significant research areas such as image captioning and visual question answering.

Recent studies that leverage CNN-based image descriptors have applied them to instance-level image retrieval, following traditional methods that rely on local descriptor matching and re-rank results by spatial verification.

In the field of image retrieval, the features produced by pooling (average pooling, max pooling, generalized-mean pooling, etc.) after the CNN may be used as a global descriptor. Alternatively, fully connected (FC) layers may be added after the convolution layers, and the features output by the FC layers may be used as a global descriptor. Because the FC layer is used here to reduce dimensionality, it may be omitted when dimensionality reduction is not needed.

As an example, Patent Document 1 (registered on November 5, 2018) discloses a video search technique using a convolutional neural network.

Representative global descriptors produced by global pooling methods include sum pooling of convolutions (SPoC), maximum activation of convolutions (MAC), and generalized-mean pooling (GeM). Because each global descriptor has different properties, its performance varies by dataset. For example, SPoC activates larger regions in the image representation, whereas MAC activates more focused regions. To improve performance, variants of these representative global descriptors exist, such as weighted sum pooling, weighted GeM, and regional MAC (R-MAC).

Recent research has focused on ensemble techniques for image retrieval. Conventional ensemble techniques, which train multiple learners individually and combine the trained models to improve performance, used to be the mainstream; more recently, many approaches improve retrieval performance by combining a variety of individually trained global descriptors. In other words, different CNN backbone models and multiple global descriptors are currently combined (ensembled) to improve retrieval performance in the field of image retrieval. Examples of this conventional approach are described in the non-patent literature listed below.

However, explicitly training different learners (CNN backbone models or global descriptors) for an ensemble increases both training time and memory consumption. In addition, specially designed strategies or losses are needed to control the diversity among learners, which makes the training process demanding and difficult.

Patent Document 1: Korean Registered Patent No. 10-1917369

W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon. Attention-based ensemble for deep metric learning. In The European Conference on Computer Vision (ECCV), September 2018.
Z. Lin, Z. Yang, F. Huang, and J. Chen. Regional maximum activations of convolutions with attention for cross-domain beauty and personal care product retrieval. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 2073-2077. ACM, 2018.
M. Opitz, G. Waltner, H. Possegger, and H. Bischof. BIER - boosting independent embeddings robustly. In International Conference on Computer Vision (ICCV), 2017.
H. Xuan, R. Souvenir, and R. Pless. Deep randomized ensembles for metric learning. In The European Conference on Computer Vision (ECCV), September 2018.

Provided is a deep learning model framework in which different global descriptors can be learned and used at once by a single model.

Provided is a method that obtains an effect similar to that of an ensemble by utilizing multiple global descriptors, without explicitly training multiple learners or controlling the diversity among learners.

Provided is a framework for image retrieval implemented by a computer system, the framework including: a main module that concatenates and learns a plurality of different global descriptors extracted from a convolutional neural network (CNN); and an auxiliary module that additionally learns one specific global descriptor among the plurality of global descriptors.

According to one aspect, the main module is a learning module for a ranking loss of the image representation, the auxiliary module is a learning module for a classification loss of the image representation, and the framework for image retrieval is trained in an end-to-end manner with a final loss that is the sum of the ranking loss and the classification loss.

According to another aspect, the CNN serves as a backbone network that provides a feature map of a given image, and downsampling before the last stage of the backbone network is disabled.

According to still another aspect, the main module may normalize the plurality of global descriptors, concatenate them to form one final global descriptor, and learn the final global descriptor by a ranking loss.

According to still another aspect, the main module includes a plurality of branches that output respective image representations using the plurality of global descriptors, and the number of branches may be changed according to the global descriptors to be used.

According to still another aspect, the auxiliary module may learn, by a classification loss, the specific global descriptor, which is determined among the plurality of global descriptors on the basis of learning performance.

According to still another aspect, the auxiliary module may use at least one of label smoothing and temperature scaling when learning by the classification loss.
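As an illustration of how label smoothing and temperature scaling can enter a classification loss, the following is a minimal sketch in plain Python. The function names, the smoothing value `eps`, the temperature `tau`, and the specific smoothing formula are assumptions chosen for illustration; the text does not specify them.

```python
import math

def smoothed_targets(label, num_classes, eps=0.1):
    """Label smoothing: move eps of the probability mass to a uniform distribution."""
    off = eps / num_classes
    return [1.0 - eps + off if k == label else off for k in range(num_classes)]

def softmax_with_temperature(logits, tau=0.5):
    """Temperature scaling: divide the logits by tau before the softmax."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(logits, label, eps=0.1, tau=0.5):
    """Cross-entropy between smoothed targets and the temperature-scaled softmax."""
    probs = softmax_with_temperature(logits, tau)
    targets = smoothed_targets(label, len(logits), eps)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

logits = [2.0, 0.5, -1.0]  # hypothetical class scores for one image
print(classification_loss(logits, label=0))
```

With `tau` below 1 the softmax becomes sharper, and with `eps` above 0 the target is no longer a one-hot vector; either technique can be switched off by setting `tau=1.0` or `eps=0.0`.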

Provided is a descriptor learning method performed by a computer system, wherein the computer system includes at least one processor configured to execute computer-readable instructions contained in a memory, the descriptor learning method including: a main learning step of concatenating a plurality of different global descriptors extracted from a CNN and learning them by a ranking loss; and an auxiliary learning step of additionally learning, by a classification loss, one specific global descriptor among the plurality of global descriptors.

Provided is a computer program for causing the computer system to execute the descriptor learning method.

According to embodiments of the present invention, applying a new framework that combines a plurality of global descriptors, namely CGD (combination of multiple global descriptors), which can be trained in an end-to-end manner, makes it possible to achieve an effect similar to that of an ensemble without an explicit ensemble model for each global descriptor and without diversity control. The framework is flexible and extensible with respect to global descriptors, CNN backbones, losses, and datasets, and because the combined descriptor can draw on different types of features, it outperforms a single global descriptor and can improve image retrieval performance.

A block diagram illustrating an example of the internal configuration of a computer system according to an embodiment of the present invention.
A diagram showing a CGD (combination of multiple global descriptors) framework for image retrieval according to an embodiment of the present invention.
A table illustrating the performance of the CGD framework using both a classification loss and a ranking loss according to an embodiment of the present invention.
A table illustrating the performance of the CGD framework using label smoothing and temperature scaling according to an embodiment of the present invention.
A diagram showing an example of another type of architecture for training multiple global descriptors.
A diagram showing an example of another type of architecture for training multiple global descriptors.
A table comparing the performance of the CGD framework according to the present invention with that of other types of architecture.
A table illustrating the performance of the CGD framework in which multiple global descriptors are combined by concatenation according to an embodiment of the present invention.
A graph and table illustrating the performance of configurations in which a plurality of global descriptors are combined according to an embodiment of the present invention.
A graph and table illustrating the performance of configurations in which a plurality of global descriptors are combined according to an embodiment of the present invention.
A graph and table illustrating the performance of configurations in which a plurality of global descriptors are combined according to an embodiment of the present invention.
A graph and table illustrating the performance of configurations in which a plurality of global descriptors are combined according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

Embodiments of the present invention relate to a deep learning model framework for image retrieval and, more specifically, to a technique for combining multiple global descriptors for image retrieval.

Embodiments including the matters specifically disclosed in this specification propose a framework that obtains an effect similar to that of an ensemble by utilizing a plurality of global descriptors trainable in an end-to-end manner, and thereby achieve considerable advantages in terms of flexibility, extensibility, time savings, cost savings, and retrieval performance.

FIG. 1 is a block diagram illustrating an example of the internal configuration of a computer system according to an embodiment of the present invention. For example, a descriptor learning system according to embodiments of the present invention may be implemented by the computer system 100 of FIG. 1. As shown in FIG. 1, the computer system 100 may include, as components for executing the descriptor learning method, a processor 110, a memory 120, a persistent recording device 130, a bus 140, an input/output interface 150, and a network interface 160.

The processor 110 may include, or be part of, any device capable of processing a sequence of instructions as a component for descriptor learning. The processor 110 may include, for example, a computer processor, or a processor and/or digital processor in a mobile device or other electronic device. The processor 110 may be included in, for example, a server computing device, a server computer, a series of server computers, a server farm, a cloud computer, or a content platform. The processor 110 may be connected to the memory 120 via the bus 140.

The memory 120 may include volatile memory, persistent memory, virtual memory, or other memory for storing information used by or output from the computer system 100. The memory 120 may include, for example, random access memory (RAM) and/or dynamic RAM (DRAM). The memory 120 may be used to store arbitrary information, such as state information of the computer system 100. The memory 120 may also be used to store instructions of the computer system 100, including, for example, instructions for descriptor learning. The computer system 100 may include one or more processors 110 as needed or where appropriate.

The bus 140 may include a communication infrastructure that enables interaction between the various components of the computer system 100. The bus 140 may carry data between components of the computer system 100, for example, between the processor 110 and the memory 120. The bus 140 may include wireless and/or wired communication media between the components of the computer system 100, and may include parallel, serial, or other topology arrangements.

The persistent recording device 130 may include components, such as memory or other persistent recording devices, used by the computer system 100 to store data for a certain extended period (for example, compared to the memory 120). The persistent recording device 130 may include non-volatile main memory such as that used by the processor 110 in the computer system 100. The persistent recording device 130 may include, for example, flash memory, a hard disk, an optical disc, or another computer-readable medium.

The input/output interface 150 may include interfaces to a keyboard, a mouse, voice command input, a display, or other input or output devices. Configuration instructions and/or input for descriptor learning may be received through the input/output interface 150.

The network interface 160 may include one or more interfaces to networks such as a local area network or the Internet. The network interface 160 may include interfaces for wired or wireless connections. Configuration instructions and/or input for descriptor learning may be received through the network interface 160.

In other embodiments, the computer system 100 may include more components than those shown in FIG. 1. However, most conventional components need not be explicitly illustrated. For example, the computer system 100 may be implemented to include at least some of the input/output devices connected to the input/output interface 150 described above, or may further include other components such as a transceiver, a Global Positioning System (GPS) module, a camera, various sensors, and a database.

Embodiments of the present invention relate to a deep learning model framework in which different global descriptors can be learned and used at once by a single model.

In recent image retrieval research, global descriptors based on deep CNNs provide more complete features than conventional techniques such as SIFT (Scale Invariant Feature Transform). SPoC (sum pooling of convolutions) applies sum pooling to the last feature map of the CNN. MAC (maximum activation of convolutions) is another powerful descriptor, whereas R-MAC (regional MAC) performs max pooling within regions and then sums the regional MAC descriptors at the end. GeM (generalized-mean pooling) generalizes max and average pooling through a pooling parameter. Other global descriptor methods include weighted sum pooling, weighted GeM, and multiscale R-MAC.

Some studies have tried additional strategies or attention mechanisms to maximize the activation of important features in the feature map, or have proposed a strategy called BFE that forces the network to optimize the feature representation of other regions. Others apply models with soft pixel attention and hard regional attention while simultaneously optimizing the feature representation. These techniques have the drawbacks of increasing network size and training time and of requiring additional parameters for training.

In other words, recent research on image retrieval combines multiple global descriptors by combining different models, but training different models for such an ensemble is not only difficult but also inefficient in terms of time and memory.

The present embodiment proposes a new framework that obtains an effect similar to that of an ensemble by utilizing a plurality of global descriptors while being trained in an end-to-end manner. The framework according to the present invention is flexible and extensible with respect to global descriptors, CNN backbones, losses, and datasets. In addition, the framework requires only a few additional parameters for training and needs no additional strategy or attention mechanism.

An ensemble is a well-known technique that improves performance by training several learners and combining the results of the trained learners, and it has been widely used in image retrieval over the past several decades. However, conventional ensemble techniques have the drawbacks that increased model complexity leads to increased computational cost and that additional control is needed to produce diversity among learners.

The framework according to the present invention can exploit the idea of ensemble techniques when trained in an end-to-end manner, without diversity control.

FIG. 2 is a diagram showing a CGD (combination of multiple global descriptors) framework for image retrieval according to an embodiment of the present invention.

The CGD framework 200 according to the present invention may be implemented by the computer system 100 described above, and may be included in the processor 110 as a component for descriptor learning.

Referring to FIG. 2, the CGD framework 200 may consist of a CNN backbone network 201 and two modules: a main module 210 and an auxiliary module 220.

Here, the main module 210 learns the image representation and consists of a combination of a plurality of global descriptors for a ranking loss. The auxiliary module 220 fine-tunes the CNN by a classification loss.

The CGD framework 200 may be trained in an end-to-end manner with a final loss that is the sum of the ranking loss from the main module 210 and the classification loss from the auxiliary module 220.

1. CNN backbone network 201
Any CNN model may be used as the CNN backbone network 201. The CGD framework 200 may use a CNN backbone such as BN-Inception, ShuffleNet-v2, ResNet, or variants of these; for example, as shown in FIG. 2, ResNet-50 may be used as the CNN backbone network 201.

As an example, the CNN backbone network 201 may be a network consisting of four stages. To preserve more information in the last feature map, the network may be modified by disabling the downsampling between stage 3 and stage 4. This yields a 14×14 feature map for a 224×224 input, which improves the performance of the individual global descriptors. In other words, to improve global descriptor performance, no downsampling is performed after stage 3 and before the last stage (stage 4) of ResNet-50, so that more information is retained.
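The 14×14 figure can be checked with simple stride arithmetic. The sketch below assumes the commonly used ResNet-50 stride layout (a stride-2 stem convolution, a stride-2 max pool, then stages with strides 1, 2, 2, 2); that layout and the helper name `feature_map_size` are assumptions for illustration and are not stated in the text.

```python
def feature_map_size(input_size, strides):
    """Spatial size after applying each stride in turn (assumes sizes divide evenly)."""
    size = input_size
    for stride in strides:
        size //= stride
    return size

# Assumed layout: stem conv, max pool, stage 1, stage 2, stage 3, stage 4.
standard_strides = [2, 2, 1, 2, 2, 2]
# Same network with the downsampling between stage 3 and stage 4 disabled.
modified_strides = [2, 2, 1, 2, 2, 1]

print(feature_map_size(224, standard_strides))  # 7
print(feature_map_size(224, modified_strides))  # 14
```

Under these assumptions, removing the last downsampling step doubles the side length of the final feature map, so each pooling method aggregates over four times as many spatial positions.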

2. Main module 210: multiple global descriptors
The main module 210 extracts global descriptors from the last feature map of the CNN backbone network 201 by a plurality of feature aggregation methods, and passes them through an FC layer and normalization.

The global descriptors extracted by the main module 210 may be concatenated and normalized to form one final global descriptor. The final global descriptor is learned at the instance level by a ranking loss. Here, the ranking loss may be replaced by any loss for metric learning; a triplet loss is a representative choice.

More specifically, the main module 210 includes a plurality of branches, each of which outputs an image representation by applying a different global descriptor to the last convolution layer. As an example, the main module 210 uses the three most representative types of global descriptors across its branches: SPoC (sum pooling of convolutions), MAC (maximum activation of convolutions), and GeM (generalized-mean pooling).

The number of branches in the main module 210 can be increased or decreased, and the global descriptors to be used may be modified or combined to suit the user's needs.

Given an image I, the output of the last convolution layer is a 3D tensor x of dimensions C×H×W, where C is the number of feature maps. Let x_c denote the set of H×W activations of feature map c ∈ {1, ..., C}. The network output thus consists of C channels of 2D feature maps. A global descriptor takes x as input and produces a vector f as output through a pooling process. Such pooling methods may be generalized as in Equation (1).
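Equation (1) itself is not reproduced in this text. A generalized-mean form that is consistent with the surrounding definitions (a tensor x with channels x_c, an output vector f, and a per-channel pooling parameter p_c) would read as follows; the element symbol a and the exact layout are assumptions made for this sketch.

```latex
f = [f_1, \ldots, f_c, \ldots, f_C]^{\top}, \qquad
f_c = \left( \frac{1}{|x_c|} \sum_{a \in x_c} a^{p_c} \right)^{1/p_c}
\tag{1}
```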

The descriptor is defined as SPoC, denoted f^(s), when p_c = 1; as MAC, denoted f^(m), when p_c → ∞; and as GeM, denoted f^(g), in the remaining cases. For GeM, the parameter p_c may be fixed to 3 as determined by experiment; depending on the embodiment, p_c may be set manually by the user or may itself be learned.
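The three cases of p_c can be illustrated with a small plain-Python sketch of generalized-mean pooling. The function names and the toy tensor are hypothetical, activations are assumed to be non-negative (for example, after a ReLU), and the p_c = 1 case is written as an average, which differs from a sum only by the constant factor H×W.

```python
def generalized_mean(feature_map, p):
    """Pool one H x W feature map into a scalar with pooling parameter p."""
    values = [v for row in feature_map for v in row]
    if p == float("inf"):
        return max(values)  # limit p -> infinity: max pooling (MAC)
    mean_of_powers = sum(v ** p for v in values) / len(values)
    return mean_of_powers ** (1.0 / p)

def global_descriptor(x, p):
    """Pool a C x H x W tensor (nested lists) into a length-C vector."""
    return [generalized_mean(feature_map, p) for feature_map in x]

# Toy tensor with C = 2 feature maps of size 2 x 2.
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]

f_s = global_descriptor(x, 1)             # SPoC-style (average) pooling
f_m = global_descriptor(x, float("inf"))  # MAC
f_g = global_descriptor(x, 3)             # GeM with p fixed to 3

print(f_s, f_m, f_g)
```

For each channel the GeM value lies between the average and the maximum, which is why a single parameter can move the descriptor between SPoC-like and MAC-like behavior.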

The output feature vector of the i-th branch is generated by dimensionality reduction through an FC layer and normalization through an l_2-normalization layer. Here, i ∈ {1, ..., n}, n is the number of branches, W^i is the weight of the FC layer, and the global descriptor f^(a_i) is SPoC when a_i = s, MAC when a_i = m, and GeM when a_i = g.

The final feature vector of the combined descriptor ψ_CGD of the CGD framework 200 according to the present invention is obtained by concatenating the output feature vectors of the branches and then applying l2-normalization. Here, a_i ∈ {s, m, g}, and the operator joining the branch outputs denotes concatenation.
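The per-branch processing and the final combination described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation of the specification: the helper names, the random weights, and the sizes `C`, `k`, and `n` are all hypothetical, and the FC layer is modeled as a plain matrix product without bias.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit l2 norm (eps guards against division by zero)."""
    return v / (np.linalg.norm(v) + eps)

def branch_output(f, w):
    """FC dimension reduction followed by l2-normalization for one branch."""
    return l2_normalize(w @ f)

def combine(descriptors, weights):
    """Concatenate the branch outputs, then l2-normalize the result."""
    outputs = [branch_output(f, w) for f, w in zip(descriptors, weights)]
    return l2_normalize(np.concatenate(outputs))

rng = np.random.default_rng(0)
C, k, n = 8, 6, 2                            # hypothetical: channels, final dim, branches
descs = [rng.random(C) for _ in range(n)]    # stand-ins for pooled global descriptors
ws = [rng.standard_normal((k // n, C)) for _ in range(n)]
psi = combine(descs, ws)                     # final combined descriptor of length k
```

Because every branch output has unit norm before concatenation, each branch contributes equally to the norm of the combined vector in this sketch.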

Such a combined descriptor can be trained with any type of ranking loss; as an example, the batch-hard triplet loss is used as a representative choice.
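As an illustration of one such ranking loss, the following NumPy sketch computes a batch-hard triplet loss: for each anchor it takes the farthest sample with the same label and the closest sample with a different label. The use of Euclidean distance and the function name are assumptions made for this example, and the sketch assumes every batch contains at least two classes.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.1):
    """Mean hinge loss over anchors using the hardest positive and negative."""
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (N, N)
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    # Hardest positive: farthest sample sharing the anchor's label
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    # Hardest negative: closest sample with a different label
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

A batch whose classes are already well separated yields a loss of zero, so only anchors that violate the margin contribute gradient.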

The CGD framework 200 gains two advantages by combining multiple global descriptors. First, an ensemble-like effect is obtained with only a few additional parameters. An ensemble effect is obtained as in the studies mentioned above, but so that it can be trained end-to-end, the CGD framework 200 extracts the multiple global descriptors from a single CNN backbone network 201. Second, the output of each branch automatically acquires different properties without any diversity control. Recent studies have proposed specially designed losses to encourage diversity among learners, whereas the CGD framework 200 requires no specially designed loss to control diversity between branches.

Comparing the performance of multiple combinations of global descriptors by experiment makes it possible to find a suitable descriptor combination. Depending on the data, however, the performance difference caused by the output feature dimension may be small. For example, if the performance difference between 1536-dimensional SPoC and 768-dimensional SPoC is not large, using the combination of 768-dimensional SPoC + 768-dimensional GeM (multiple global descriptors) can yield better performance than 1536-dimensional SPoC (a single global descriptor).

3. Auxiliary module 220: Classification loss
To learn the embedding at the categorical level, the auxiliary module 220 may train the image representation output from the first global descriptor of the main module 210 with a classification loss. Label smoothing and temperature scaling may be applied to improve performance when training with the classification loss.

In other words, the auxiliary module 220 fine-tunes the CNN backbone on the basis of the first global descriptor of the main module 210 by using an auxiliary classification loss. The auxiliary module 220 may train, with the classification loss, the image representation produced by the first of the global descriptors included in the main module 210. This derives from a two-stage approach, which first fine-tunes the CNN backbone with a classification loss to improve the convolution filters and then fine-tunes the network to improve the performance of the global descriptor.

The CGD framework 200 modifies this approach so that end-to-end training takes a single stage. Training with the auxiliary classification loss yields image representations with inter-class separability, and helps train the network faster and more stably than using the ranking loss alone.

Temperature scaling and label smoothing in the softmax cross-entropy loss (softmax loss) support training with the classification loss. The softmax loss is defined as in equation (4).

Here, N, M, and y_i denote the batch size, the number of classes, and the identity label of the i-th input, respectively. W and b are the trainable weight and bias, respectively. Further, f is the global descriptor of the first branch, and T is a temperature parameter whose default value is 1.

Temperature scaling with the temperature parameter T in equation (4) assigns larger gradients to more difficult examples, which is useful for obtaining embeddings that are compact within a class and spread out between classes. Label smoothing makes the model more robust and improves generalization by estimating the marginalized effect of label dropout during training. Therefore, label smoothing and temperature scaling are added to the auxiliary classification loss to prevent overfitting and to learn better embeddings.
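A classification loss with both techniques can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the formula of equation (4): the logits are divided by the temperature T before the softmax, and label smoothing is modeled in its common uniform form, mixing the one-hot target with a uniform distribution over the M classes. The function name and the smoothing value 0.1 are hypothetical.

```python
import numpy as np

def softmax_loss(features, labels, W, b, temperature=0.5, smoothing=0.1):
    """Temperature-scaled softmax cross-entropy with uniform label smoothing.

    features: (N, D) descriptors, labels: (N,) integer class ids,
    W: (D, M) weights, b: (M,) biases.
    """
    logits = (np.asarray(features) @ W + b) / temperature   # (N, M)
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n, m = log_probs.shape
    # Smoothed targets: (1 - smoothing) on the true class plus smoothing / M everywhere
    targets = np.full((n, m), smoothing / m)
    targets[np.arange(n), labels] += 1.0 - smoothing
    return -(targets * log_probs).sum(axis=1).mean()
```

With `smoothing=0` and `temperature=1` the function reduces to the ordinary softmax cross-entropy, which makes the effect of each technique easy to isolate in an experiment.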

The first global descriptor, which is used to compute the classification loss, may be chosen in view of the performance of each global descriptor. As an example, each global descriptor intended for the combination may first be trained as a single branch, and the best-performing one may then be used as the first global descriptor for the classification loss. For example, if training SPoC, MAC, and GeM individually gives the performance order GeM > SPoC > MAC, the combination GeM + MAC tends to perform better than the combination MAC + GeM, so GeM may be used as the global descriptor for the classification loss.

4. Framework configuration
The CGD framework 200 may be extended by the number of global descriptor branches, and allows different types of networks depending on the configuration of the global descriptors. For example, using three global descriptors (SPoC, MAC, GeM), with only the first global descriptor used for the auxiliary classification loss, yields 12 possible configurations.

For convenience of explanation, SPoC is abbreviated as S, MAC as M, and GeM as G, and the first letter of each notation denotes the first global descriptor, which is used for the auxiliary classification loss. The CGD framework 200 may extract the three types of global descriptors S, M, and G from one CNN backbone network 201, in which case 12 configurations are possible (S, M, G, SM, MS, SG, GS, MG, GM, SMG, MSG, GSM). Every combination of global descriptors is trained with the ranking loss, and only the first global descriptor is additionally trained with the classification loss. For example, in the case of SMG, only the global descriptor S is additionally trained with the classification loss, while all combinations of S, M, and G (SM, MS, SG, GS, MG, GM, SMG, MSG, GSM) are trained with the ranking loss.
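The count of 12 follows from choosing which descriptor comes first (3 choices) and then any subset of the remaining two (4 choices), with the order of the remaining letters not treated as significant. A small sketch that enumerates the configurations in this way, with a hypothetical function name:

```python
from itertools import combinations

def cgd_configurations(descriptors="SMG"):
    """List configurations: a first (auxiliary-loss) descriptor plus any subset of the rest."""
    configs = []
    for first in descriptors:
        rest = [d for d in descriptors if d != first]
        for r in range(len(rest) + 1):
            for extra in combinations(rest, r):
                configs.append(first + "".join(extra))
    return configs

configs = cgd_configurations()   # 3 first letters x 4 subsets = 12 entries
```

The same function applied to a longer descriptor string shows how the number of configurations grows when further global descriptors are added as branches.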

Therefore, unlike conventional methods that train several models separately in order to ensemble multiple global descriptors, the present invention obtains an ensemble-like effect by training only one model end-to-end. Whereas conventional methods control diversity through a loss designed separately for the ensemble, the method of the present application obtains an ensemble-like effect without diversity control. According to the present invention, the final global descriptor may be used for image retrieval and, if a smaller dimension is needed, the image representation immediately before concatenation may be used instead. Various global descriptors can be used according to the needs of the user, and the model can be scaled up or down by adjusting the number of global descriptors.

Experimental examples of the CGD framework 200 described above are as follows.

The CGD framework 200 according to the present invention is evaluated on two image retrieval datasets: the dataset (CUB200) used in "C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011." and the dataset (CARS196) used in "J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554-561, 2013." For CUB200 and CARS196, images cropped using the bounding box information are used.

All experiments are run using MXNet on a Tesla P40 GPU with 24 GB of memory. BN-Inception, ShuffleNet-v2, ResNet-50, and SE-ResNet-50 are used with ImageNet ILSVRC pretrained weights from MXNet GluonCV. All experiments use an input size of 224 × 224 and a 1536-dimensional embedding. In the training stage, the input image is resized to 252 × 252, randomly cropped to 224 × 224, and then randomly flipped horizontally. The Adam optimizer is used with a learning rate of 1e-4, and step decay is used to schedule the learning rate. In all experiments, the margin m of the triplet loss is 0.1 and the temperature of the softmax loss is 0.5. A batch size of 128 is used for all datasets, 64 instances per class are used for CARS196 and CUB200, and the image size is adjusted only to the default input size of 224 × 224.
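The training-time augmentation described above (random crop to 224 × 224 followed by a random horizontal flip) can be sketched in NumPy as follows. The sketch assumes the image has already been resized to 252 × 252 and is stored as a height × width × channels array; the function name and the flip probability of 0.5 are assumptions, since the text does not state them.

```python
import numpy as np

def random_crop_and_flip(image, crop=224, rng=None):
    """Randomly crop a (H, W, C) image to crop x crop and flip it horizontally half the time."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)      # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                   # assumed flip probability
        patch = patch[:, ::-1]               # reverse the width axis
    return patch

resized = np.zeros((252, 252, 3), dtype=np.uint8)   # stand-in for a resized input image
augmented = random_crop_and_flip(resized, rng=np.random.default_rng(0))
```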

1. Architecture design experiments
(1) Training with ranking and classification losses
Classification loss
The CGD framework 200 is trained with the ranking loss together with the classification loss of the first global descriptor. The table in FIG. 3 compares, on CARS196, the results of using only the ranking loss (Rank) with those of using both the auxiliary classification loss and the ranking loss (Both). In this experiment, label smoothing and temperature scaling are not applied to the classification loss in any case. The results show that using both losses gives higher performance than using the ranking loss alone. The classification loss focuses on clustering each class into a tight region of the embedding space at the categorical level. The ranking loss focuses on gathering samples of the same class and separating samples of different classes at the instance level. Therefore, training the ranking loss together with the auxiliary classification loss improves the optimization of both categorical and fine-grained feature embeddings.

Label smoothing and temperature scaling
The table in FIG. 4 compares, on CARS196, the results of using neither label smoothing nor temperature scaling (no tricks, None), using label smoothing (LS), using temperature scaling (TS), and using both label smoothing and temperature scaling (both tricks, Both). The experiment is run on a ResNet-50 backbone with the global descriptor SM, and shows that using either label smoothing or temperature scaling improves performance compared with "no tricks". It also shows that applying label smoothing and temperature scaling together builds on the improvement of each and gives the best performance.

(2) Combination of multiple global descriptors
Position of combination
Because the CGD framework 200 uses multiple global descriptors, experiments are run with different positions for combining the global descriptors in order to select the best architecture.

FIG. 5 shows a first type of architecture for training multiple global descriptors, and FIG. 6 shows a second type of architecture for training multiple global descriptors.

As shown in FIG. 5, the first type of architecture trains each global descriptor with its own ranking loss and then combines them in the inference stage; the same global descriptor is used for each branch and no classification loss is used.

In contrast, the second type of architecture shown in FIG. 6 combines the raw outputs of the global descriptors and trains them with a single ranking loss, but does not use multiple global descriptors.

The CGD framework 200 according to the present invention, on the other hand, combines the multiple global descriptors after the FC layer and l2-normalization, as shown in FIG. 2.

The table in FIG. 7 compares, on CUB200 with the global descriptor SM, the performance of the CGD framework with that of architecture A of the first type and architecture B of the second type. The CGD framework shows the highest performance.

Architecture B of the second type has multi-branch characteristics and diversity among its output feature vectors. In contrast to the CGD framework, however, the final embedding of architecture A of the first type in the training stage differs from that in the inference stage, and the final embedding of architecture B of the second type loses the properties of each global descriptor because of the FC layer applied after concatenation.

Combination method
With regard to the combination method, both concatenation and summation of multiple global descriptors can improve model performance. The CGD framework according to the present invention may therefore compare the two combination methods and select the better one.

The table in FIG. 8 compares, on CUB200 with the global descriptor SM, the results of the two combination methods, summation (Sum) and concatenation (Concat). Concatenating the multiple global descriptors (Concat) gives better performance than summing them (Sum). Summation may lose the characteristics of each global descriptor because the activations of the global descriptors are mixed with one another, whereas concatenation preserves the properties of each global descriptor and thus retains diversity.
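The loss of information under summation can be seen in a toy example. In the following NumPy sketch, the four vectors are made-up values standing in for branch outputs: two different pairs of branch outputs produce the same sum, yet remain distinguishable after concatenation.

```python
import numpy as np

# Hypothetical branch outputs for two different images
a1, b1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # image 1: branch A, branch B
a2, b2 = np.array([0.0, 1.0]), np.array([1.0, 0.0])   # image 2: the branches swap roles

sum1, sum2 = a1 + b1, a2 + b2
cat1, cat2 = np.concatenate([a1, b1]), np.concatenate([a2, b2])

same_after_sum = np.array_equal(sum1, sum2)       # summation mixes the activations
same_after_concat = np.array_equal(cat1, cat2)    # concatenation keeps each branch separate
```

Here `same_after_sum` is true while `same_after_concat` is false, which mirrors the qualitative explanation given for the result in FIG. 8.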

2. Effect of combined descriptors
(1) Quantitative analysis
The core of the CGD framework according to the present invention is the use of multiple global descriptors. The 12 possible configurations are tested on each image retrieval dataset, with the CGD framework using temperature scaling for the auxiliary classification loss.

FIG. 9 compares the performance of the various configurations of the CGD framework on CARS196, and FIG. 10 compares the performance of the various configurations of the CGD framework on CUB200. These experiments use a test set in which 100 instances are sampled per class. Because of the uncertainty of deep learning models, the results of more than 10 runs are shown as box plots.

Referring to FIGS. 9 and 10, the combined descriptors (SG, GSM, SMG, SM, GM, GS, MS, MSG, MG) outperform the single global descriptors (S, M, G). For CUB200, the single global descriptors G and M show relatively high performance, but the best-performing configuration is the combined descriptor MG. Performance varies with the properties of the dataset, the feature used for the classification loss, the input size, the output dimension, and so on. The key point is that using multiple global descriptors improves performance compared with a single global descriptor.

The table in FIG. 11 compares the performance of the combined descriptors (SG, GSM, SMG, SM, GM, GS, MS, MSG, MG) and the single global descriptors (S, M, G) on CARS196. An individual descriptor refers to the output feature vector of each branch. The combined descriptor is the final feature vector of the CGD framework.

FIG. 11 shows the performance of the individual global descriptors before combination and the degree of performance improvement obtained after combination. All combined descriptors have 1536-dimensional embedding vectors, whereas the individual descriptors have 768-dimensional embedding vectors for SM, MS, SG, GS, MG, and GM, and 512-dimensional embedding vectors for SMG, MSG, and GSM. In most cases a larger embedding vector gives better performance. However, when the performance difference between a large embedding and a small embedding is not large, it may be preferable to use several small embeddings from different global descriptors. For example, the individual GeM descriptor of the 768-dimensional branch in SG performs similarly to the single descriptor G with a 1536-dimensional embedding, so SG obtains a large performance improvement by combining the different features of SPoC and GeM.
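The relation between the combined dimension and the per-branch dimension can be made explicit with a small helper. The sketch below assumes that the 1536-dimensional combined embedding is split evenly among the branches, which is consistent with the 768-dimensional and 512-dimensional figures mentioned in the text; the function name is hypothetical.

```python
def branch_dimensions(config, total_dim=1536):
    """Return the embedding dimension of each branch, assuming an even split."""
    n = len(config)
    if total_dim % n != 0:
        raise ValueError("total_dim must be divisible by the number of branches")
    return {descriptor: total_dim // n for descriptor in config}

two_branch = branch_dimensions("SG")     # two branches share 1536 dimensions
three_branch = branch_dimensions("SMG")  # three branches share 1536 dimensions
```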

3. Flexibility of the CGD framework
FIG. 12 shows that the CGD framework according to the present invention can use various ranking losses (batch-hard triplet loss, HAP2S loss, weighted sampling margin loss, and so on). When the performance of the single global descriptor S and the multi-global descriptor SM is compared, SM outperforms S in every case, which shows that the framework is flexible and that a variety of losses can be applied.

Besides the ranking loss, the CGD framework according to the present invention may be applied to various kinds of CNN backbone networks as well as to various image retrieval datasets. The CGD framework with multiple global descriptors gives higher performance than conventional models on most backbones and datasets.

Thus, according to embodiments of the present invention, applying a new framework that combines a plurality of global descriptors, that is, CGD, which combines multiple global descriptors trainable by the classification loss method, achieves an ensemble-like effect without an explicit ensemble model or diversity control for each global descriptor. The CGD framework according to the present invention is flexible and extensible with respect to global descriptors, CNN backbones, losses, and datasets, and because the combined descriptor makes different types of features available, it outperforms a single global descriptor and can improve image retrieval performance.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an ALU (arithmetic logic unit), a digital signal processor, a microcomputer, an FPGA (field programmable gate array), a PLU (programmable logic unit), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but those skilled in the art will understand that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual device, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over computer systems connected by a network, and may be stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. Here, the medium may continuously store a computer-executable program or may temporarily store it for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by application stores that distribute applications, or by sites, servers, and the like that supply or distribute various other software.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those skilled in the art will be able to make various modifications and variations based on the above description. For example, appropriate results can be achieved even if the described techniques are performed in an order different from the described method, and/or components such as the described systems, structures, devices, and circuits are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other embodiments that are equivalent to the claims also fall within the scope of the appended claims.

100: Computer system
110: Processor
120: Memory
130: Persistent storage device
140: Bus
150: Input/output interface
160: Network interface

Claims

A framework for image retrieval implemented by a computer system, the framework comprising:
a main module that concatenates and trains a plurality of mutually different global descriptors extracted from a convolutional neural network (CNN); and
an auxiliary module that additionally trains one specific global descriptor among the plurality of global descriptors.

The framework for image retrieval according to claim 1, wherein
the main module is a learning module for a ranking loss of an image representation,
the auxiliary module is a learning module for a classification loss of the image representation, and
the framework for image retrieval is trained end-to-end with a final loss that is the sum of the ranking loss and the classification loss.

The framework for image retrieval according to claim 1, wherein
the CNN, as a backbone network that provides a feature map of a given image, does not perform downsampling before the last stage of the backbone network.

The framework for image retrieval according to claim 1, wherein
the main module normalizes the plurality of global descriptors, concatenates them to form one final global descriptor, and learns the final global descriptor by a ranking loss.
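
By way of illustration only, normalizing several global descriptors and concatenating them into one final global descriptor can be sketched as follows. The three pooling operations are the ones named in the description (average, max, and generalized-mean pooling); the use of L2 normalization, the exponent `p=3.0`, the toy feature map, and the function names are assumptions made for this sketch, and the optional FC layer is omitted.

```python
import math

def gem_pool(channel, p):
    """Generalized-mean pooling of one channel of non-negative activations."""
    return (sum(x ** p for x in channel) / len(channel)) ** (1.0 / p)

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def global_descriptors(feature_map, p=3.0):
    """Return one normalized descriptor per pooling branch.

    feature_map is a list of channels, each a flat list of H*W activations.
    """
    avg = [sum(ch) / len(ch) for ch in feature_map]   # average pooling
    mx = [max(ch) for ch in feature_map]              # max pooling
    gem = [gem_pool(ch, p) for ch in feature_map]     # generalized-mean pooling
    return [l2_normalize(d) for d in (avg, mx, gem)]

def combine(descriptors):
    """Concatenate the normalized branch outputs into one final descriptor."""
    return [v for d in descriptors for v in d]

# Hypothetical feature map with 2 channels and 4 spatial positions.
feature_map = [[0.0, 1.0, 2.0, 3.0], [4.0, 4.0, 4.0, 4.0]]
final = combine(global_descriptors(feature_map))
print(len(final))  # 3 branches x 2 channels = 6 values
```

Normalizing each branch before concatenation keeps any one pooling method from dominating the final descriptor merely because its values are larger in magnitude.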

The framework for image retrieval according to claim 1, wherein
the main module includes a plurality of branches, each of which outputs an image representation using a respective one of the plurality of global descriptors, and
the number of branches varies according to the global descriptors to be used.
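
By way of illustration only, a main module whose number of branches follows the chosen global descriptors can be sketched as follows. The registry `POOLERS`, the branch names `"avg"`, `"max"`, and `"gem"`, and the helper functions are hypothetical names introduced for this sketch.

```python
# Hypothetical registry mapping a descriptor name to a per-channel pooling function.
POOLERS = {
    "avg": lambda ch: sum(ch) / len(ch),
    "max": max,
    "gem": lambda ch, p=3.0: (sum(x ** p for x in ch) / len(ch)) ** (1.0 / p),
}

def build_branches(names):
    """Create one branch per requested global descriptor."""
    return [POOLERS[name] for name in names]

def run_branches(branches, feature_map):
    """Apply every branch to the shared feature map, one output per branch."""
    return [[pool(ch) for ch in feature_map] for pool in branches]

# Hypothetical feature map with 2 channels and 4 spatial positions.
feature_map = [[0.0, 1.0, 2.0, 3.0], [4.0, 4.0, 4.0, 4.0]]

two_branch = run_branches(build_branches(["avg", "gem"]), feature_map)
one_branch = run_branches(build_branches(["max"]), feature_map)
print(len(two_branch), len(one_branch))
```

Changing the list of names passed to `build_branches` is the only step needed to add or remove a branch in this sketch; the shared feature map is left untouched.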

The framework for image retrieval according to claim 1, wherein
the auxiliary module learns, by a classification loss, the specific global descriptor, which is determined from among the plurality of global descriptors based on learning performance.

The framework for image retrieval according to claim 6, wherein
the auxiliary module uses at least one of label smoothing and temperature scaling when learning by the classification loss.
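
By way of illustration only, a classification loss that applies both label smoothing and temperature scaling can be sketched as follows. The claim names the two techniques without fixing their form; the uniform smoothing scheme, the division of logits by `temperature`, the default values of `epsilon` and `temperature`, and the function names are assumptions made for this sketch.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of the logits after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def smoothed_targets(label, num_classes, epsilon=0.1):
    """Assumed uniform label smoothing: (1 - epsilon) * one-hot + epsilon / K."""
    base = epsilon / num_classes
    return [base + (1.0 - epsilon if k == label else 0.0) for k in range(num_classes)]

def smoothed_classification_loss(logits, label, epsilon=0.1, temperature=0.5):
    """Cross-entropy between smoothed targets and temperature-scaled predictions."""
    log_probs = log_softmax(logits, temperature)
    targets = smoothed_targets(label, len(logits), epsilon)
    return -sum(t * lp for t, lp in zip(targets, log_probs))

# Hypothetical logits over three classes, with class 0 as the true label.
print(round(smoothed_classification_loss([2.0, 0.5, 0.1], 0), 4))
```

Setting `epsilon=0.0` and `temperature=1.0` reduces this sketch to plain softmax cross-entropy, so either technique can be enabled on its own.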

A descriptor learning method executed by a computer system, wherein
the computer system includes at least one processor configured to execute computer-readable instructions contained in a memory,
the descriptor learning method comprising:
a main learning step of concatenating a plurality of mutually different global descriptors extracted from a CNN and learning them by a ranking loss; and
an auxiliary learning step of additionally learning, by a classification loss, a specific global descriptor that is any one of the plurality of global descriptors.

The descriptor learning method according to claim 8, wherein
the plurality of global descriptors are trained in an end-to-end manner with a final loss that is the sum of the ranking loss and the classification loss.

The descriptor learning method according to claim 8, wherein
the CNN serves as a backbone network that provides a feature map of a given image, and does not perform downsampling before the last stage of the backbone network.

The descriptor learning method according to claim 8, wherein
the main learning step normalizes the plurality of global descriptors, concatenates them to form one final global descriptor, and learns the final global descriptor by the ranking loss.

The descriptor learning method according to claim 8, wherein
the auxiliary learning step learns, by the classification loss, the specific global descriptor, which is determined from among the plurality of global descriptors based on learning performance.

The descriptor learning method according to claim 12, wherein
the auxiliary learning step uses at least one of label smoothing and temperature scaling when learning by the classification loss.

A computer program for causing a computer system to execute the descriptor learning method according to any one of claims 8 to 13.