JP2022022139A - Image identification device, method of performing semantic segmentation, and program

Info

Publication number: JP2022022139A
Application number: JP2021118014A
Authority: JP
Inventors: 淳樹長内; Atsuki Osanai
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2020-07-22
Filing date: 2021-07-16
Publication date: 2022-02-03

Abstract

To provide an image identification device capable of calculating importance more accurately, a method of performing semantic segmentation, and a program. SOLUTION: An image identification device is provided, comprising an image acquisition unit for acquiring an image, a feature extraction unit configured to extract multiple feature quantities of the acquired image, a feature map creation unit configured to create a feature map for each of the multiple feature quantities, and a multiplication unit configured to multiply each feature map by a weight coefficient, which is an arbitrary positive value representing the importance of the feature. SELECTED DRAWING: Figure 1

Description

The present invention relates to an image identification device, a method for performing semantic segmentation, and a program.

Semantic segmentation is a fundamental yet difficult problem whose goal is to identify the category of each pixel, and higher accuracy is required in order to build systems such as autonomous mobile robots and automated driving. In real environments, a system needs robustness to factors such as object scale, lighting conditions, and occlusion, as well as the ability to distinguish different categories that have similar appearances. Achieving highly accurate recognition therefore requires acquiring and selecting more discriminative features (see, for example, Patent Documents 1 and 2).

In addition, semantic segmentation based on deep learning has been greatly improved by incorporating context information. In recent years, context-capturing techniques have been proposed that modify the features obtained from a feature extractor (backbone) using pixel-level or category-level similarity.

Japanese Unexamined Patent Application Publication No. 2019-128804
Republished WO 2008/129881

However, in the conventional techniques, the features input to the network responsible for the final classification are treated equally across feature maps, which makes it difficult to distinguish between feature maps. In addition, conventional techniques that enhance feature maps using a residual structure permit only enhancement, which limits discriminability.

The present invention has been made in view of the above problems, and an object thereof is to provide an image identification device, a method for performing semantic segmentation, and a program that can calculate importance more accurately than before.

(1) In order to achieve the above object, an image identification device according to one aspect of the present invention includes: an image acquisition unit that acquires an image (X); a feature extraction unit that extracts a plurality of feature quantities from the acquired image; a feature map creation unit that creates a feature map (X_i) for each of the plurality of feature quantities; and a multiplication unit that multiplies each feature map by a weight coefficient (a_i), which is an arbitrary positive value expressing the importance of the feature.

(2) In the image identification device according to one aspect of the present invention, the weight coefficient (a_i) may be calculated by a process of convolving the image (X) to create a convolution layer, a process of applying a ReLU function to the convolution layer to calculate a feature F, and a process of applying a Global Average Pooling (GAP) layer to the feature F.

(3) In order to achieve the above object, an image identification device according to one aspect of the present invention includes: an image acquisition unit that acquires an image; a feature extraction unit that extracts a plurality of feature quantities from the acquired image; a creation unit that creates a feature map by convolution processing for each of the plurality of feature quantities; and a weighted feature generation unit that calculates a modified feature by convolution processing on the feature map, aggregates context by performing global average pooling on the calculated modified feature, generates an attention that is a weight coefficient for each channel, and multiplies the feature map by the generated attention, thereby weighting the plurality of feature maps by enhancement and attenuation to generate weighted features.

(4) The image identification device according to one aspect of the present invention may further include: a first loss calculation unit that calculates an output by performing convolution and upsampling on the weighted features and calculates a first loss by comparing the calculated output with teacher data; and a second loss calculation unit that calculates an output by performing convolution and upsampling on the feature map and calculates a second loss by comparing the calculated output with the teacher data, wherein an overall loss function is calculated from the first loss and the second loss, and the weight coefficient is learned using the calculated loss function.

(5) In order to achieve the above object, a method of performing semantic segmentation according to one aspect of the present invention is a method of performing semantic segmentation of an image (X) using a neural network system, the method including: a process of inputting the image; a process of extracting a plurality of feature quantities from the acquired image; a process of creating a feature map (X_i) for each of the plurality of feature quantities of the image; and a process of multiplying each feature map by a weight coefficient (a_i), which is an arbitrary positive value expressing the importance of the feature.

(6) In order to achieve the above object, a program according to one aspect of the present invention causes a computer to: acquire an image; extract a plurality of feature quantities from the acquired image; create a feature map (X_i) for each of the plurality of feature quantities of the image; and multiply each feature map by a weight coefficient (a_i), which is an arbitrary positive value expressing the importance of the feature.

According to (1) to (6), importance can be calculated more accurately than before.

A block diagram showing the configuration of an image identification device that includes a semantic segmentation device according to an embodiment.
A diagram showing examples of network structures that incorporate context.
A schematic configuration diagram of CFANet according to the embodiment.
A simplified computation graph of CFANet according to the embodiment.
A diagram showing the evaluation results on the PASCAL VOC 2012 validation set.
A diagram showing the evaluation results on the PASCAL VOC 2012 test set.
A diagram showing an example of visualization results of cosine similarity.
A flowchart of an example processing procedure of the image identification device according to the embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

[Outline of Embodiment]
By using features multiplied by the importance of each feature map, a mechanism is provided that increases or attenuates the influence of each feature map, making it easier to distinguish the feature maps that contribute to the output. A Global Average Pooling layer, which captures the context of the entire image, is used to calculate the importance of the feature maps. In addition, a Head network that performs auxiliary inference is attached to the branch that calculates the importance, giving a structure that calculates the importance more accurately.

To solve the problem, the present embodiment uses a Context-aware Feature Attention Network (CFANet). CFANet aggregates context using Global Average Pooling (GAP) and generates an attention for each channel. The resulting attention is multiplied directly into the feature maps, so that each feature map is weighted in both directions, enhancement and attenuation. This makes it possible to obtain greater discriminability than conventional methods.
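
The mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the embodiment's implementation: the function names and toy values are hypothetical, the learned convolution, BatchNorm, and scale coefficient are omitted, and the attention is pooled directly from the same feature map it is applied to.

```python
def global_average_pool(feature):
    """feature: list of C channels, each an H x W list of lists. Returns C means."""
    pooled = []
    for channel in feature:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled


def apply_channel_attention(feature, attention):
    """Multiply every element of channel c by attention[c]."""
    return [[[a * value for value in row] for row in channel]
            for channel, a in zip(feature, attention)]


# Toy feature map with C=2 channels of size 2x2 (illustrative values only).
feature = [
    [[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
    [[0.0, 0.5], [0.5, 1.0]],   # channel 1, mean 0.5
]
attention = global_average_pool(feature)                # one weight per channel
weighted = apply_channel_attention(feature, attention)  # channel-wise product
```

In this toy case the attention of channel 0 is greater than 1, so that channel is enhanced, while the attention of channel 1 is less than 1, so that channel is attenuated. That two-way behavior is what a purely additive (residual) correction cannot provide.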

[Configuration example of image identification device 1]
FIG. 1 is a block diagram showing the configuration of an image identification device 1 that includes a semantic segmentation device 10 according to the present embodiment. As shown in FIG. 1, the image identification device 1 includes an image acquisition unit 11, a feature extraction unit 12, the semantic segmentation device 10, and a visualization unit 30. The semantic segmentation device 10 includes a feature acquisition unit 21, a multiplication unit 22 (feature map creation unit, weighted feature generation unit), a first convolution layer 23 (first loss calculation unit), a second convolution layer 24 (feature map creation unit, creation unit), a third convolution layer 25 (feature map creation unit, weighted feature generation unit), a GAP unit 26 (feature map creation unit, weighted feature generation unit), and a fourth convolution layer 27 (second loss calculation unit). The visualization unit 30 includes a Head 31 (first loss calculation unit), an auxiliary Head 32 (second loss calculation unit), a teacher label providing unit 33, and a similarity map creation unit 34.

[Example of network structure that incorporates context]
Here, examples of network structures that incorporate context are described. FIG. 2 is a diagram showing examples of network structures that incorporate context.
Image g110 in FIG. 2 is an example of a structure that concatenates features in the channel direction. Image g120 in FIG. 2 is a structure that takes in context as a residual feature. Image g130 in FIG. 2 is an example of the structure of the present embodiment, which modulates the feature maps in both directions, enhancement and attenuation, in consideration of context.

[Network structure]
FIG. 3 is a schematic configuration diagram of CFANet according to the present embodiment. In the embodiment, ResNet is used as the backbone. As in Reference 1 (Zhao, H., Shi, J., Qi, X., Wang, X. and Jia, J., “Pyramid Scene Parsing Network”, CVPR, 2017), dilated convolution is applied to the last two blocks of ResNet, which limits the reduction in resolution to 1/8 of the input image. The feature F_0 obtained from the backbone propagates to the CFA module, shown by the broken line. F_0 is converted into F_1 through a convolution layer and then propagates in two directions. The first is a network for generating channel-level attention.
In this way, the feature map obtained by the backbone propagates to the CFA module, is weighted channel by channel, and is then fed into the Head.

After passing through a Convolution-BatchNorm-ReLU layer, F_1 is converted into a per-channel attention a by aggregating the global context with GAP (Global Average Pooling). Specifically, the attention a_c for a given channel c is expressed by the following equation (1).
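
The equation itself does not survive in this text. Judging from the symbols defined immediately afterwards (H, W, α, F_1', and a sum over all positions of the feature map), equation (1) presumably has the form of a scaled global average. The exact placement of α is an assumption:

```latex
a_c = \frac{\alpha}{HW} \sum_{v,u} F_1'(c, v, u) \tag{1}
```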

Here, H is the height, W is the width, α ∈ R is a scale coefficient fitted by learning, and F_1' is the feature after ReLU (the modified feature). In addition, u = (c, v, u), where v denotes the position in the row direction and u the position in the column direction, and the sum is taken over the entire feature map. The feature F_2, weighted in consideration of the context, is expressed by the following equation (2).
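
Equation (2) is likewise missing from this text. Since the attention is described as being multiplied directly into the feature map, channel by channel, it presumably reads as follows (a reconstruction, using the same index convention as above):

```latex
F_2(c, v, u) = a_c \, F_1(c, v, u) \tag{2}
```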

The obtained F_2 is input to the Head to obtain the output Y.
The other propagation direction of F_1 is the auxiliary Head, which yields the auxiliary output Y'. Each output is compared with the teacher label T to obtain the losses L_main (first loss) and L_aux (second loss). The overall loss function is defined by the following equation (3).

[Properties of the CFA module]
Next, the equivalence between CFANet and the mixture-of-experts model, and the effect of the auxiliary Head branching from the CFA module, are described. To keep the explanation concise, consider the simplified computation graph of CFANet shown in FIG. 4. In FIG. 4, all activation functions are assumed to be linear. In addition, the convolution layer denoted by (c) in FIG. 3 is omitted. These assumptions do not cause any loss of generality in the discussion. In FIG. 4, each F denotes a feature and each W denotes a convolution weight matrix. Further, a is the per-channel attention, and X, Y, and T are the input image, the estimation result, and the ground-truth label corresponding to the input, respectively. Variables for the auxiliary Head are marked with a prime (').

[Mixture-of-experts model]
A mixture-of-experts model consists of C experts (E_0, ..., E_{C-1}) and a gating network G that generates C-dimensional weights. For an input x, the output y is given by the following equation (4).
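
Equation (4) is not reproduced in this text. The standard mixture-of-experts output, which matches the description of C experts weighted by a C-dimensional gate, is:

```latex
y = \sum_{i=0}^{C-1} G(x)_i \, E_i(x) \tag{4}
```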

Here, G(x)_i is the weight assigned to the expert E_i with index i.
To confirm the equivalence between the mixture-of-experts model and CFANet, first consider the weight matrix of the following equation (5), which appears in the Head of FIG. 4.

In equation (5), k_2 is the kernel size, C is the number of input channels, and C_out is the number of output channels. Using equation (2), the final output Y can be rewritten as the following equation (6).
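
Neither equation survives in this text. From the definitions just given and the linear simplification of FIG. 4, they presumably read as follows. The index conventions, in particular the kernel offset k = (k_v, k_u), are assumptions:

```latex
W_2 \in \mathbb{R}^{C_{out} \times C \times k_2 \times k_2} \tag{5}

Y(c_{out}, v, u)
  = \sum_{c=0}^{C-1} \sum_{k} W_2(c_{out}, c, k) \, F_2(c, v + k_v, u + k_u)
  = \sum_{c=0}^{C-1} a_c \sum_{k} W_2(c_{out}, c, k) \, F_1(c, v + k_v, u + k_u) \tag{6}
```

Written this way, a_c plays the role of the gate G(x)_i and the inner sum over k plays the role of the expert E_i(x).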

Here, the position dependence within the kernel is written as k = (k_v, k_u).
Comparing equation (4) with equation (6) shows that CFANet is equivalent to a mixture-of-experts model. This allows CFANet to perform identification with an emphasis on the features specific to the objects contained in the input image.
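
The equivalence can be checked numerically on a toy case. The sketch below is an illustration under simplifying assumptions, not the embodiment's code: it uses a 1 × 1 kernel at a single pixel, linear activations as in FIG. 4, and hypothetical names and values throughout.

```python
def head_on_weighted_features(w2, attention, f1):
    """Apply a 1x1 Head to F2 = a * F1 at a single pixel."""
    f2 = [a * x for a, x in zip(attention, f1)]
    return [sum(w * x for w, x in zip(row, f2)) for row in w2]


def mixture_of_experts(w2, attention, f1):
    """Sum over channels c of gate a_c times expert E_c(F1) = W2[:, c] * F1[c]."""
    n_out = len(w2)
    output = [0.0] * n_out
    for c, gate in enumerate(attention):
        expert_out = [w2[o][c] * f1[c] for o in range(n_out)]
        output = [y + gate * e for y, e in zip(output, expert_out)]
    return output


w2 = [[1.0, 2.0, -1.0],      # C_out = 2 rows, C = 3 input channels
      [0.5, 0.0, 3.0]]
attention = [2.0, 0.5, 0.0]  # per-channel attention (acts as the gate)
f1 = [1.0, 4.0, 2.0]         # feature vector at one pixel

y_head = head_on_weighted_features(w2, attention, f1)
y_moe = mixture_of_experts(w2, attention, f1)
```

Both routes give the same output vector, because scaling channel c before the Head is the same as scaling the contribution of the c-th column of the Head weights.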

[Effect of the auxiliary Head]
This section shows that the presence of the auxiliary Head provided in CFANet promotes learning of the weight W_2 in the CFA module. Let G_F2 denote the gradient back-propagated from the output Y to the node F_2, and let G_F1 denote the gradient back-propagated from the auxiliary output Y' to the node F_1 (broken-line arrows in FIG. 4).
The total gradient propagating to the node F_1 is given by the following equation (7).

Because the per-channel attention is generated using GAP, the dependence on u = (c, u, v) appears only in the second term. Using the chain rule to obtain the gradient with respect to the weight W_1 gives the following equation (8).

The first term of equation (8) can be written approximately as the following equation (9).

In other words, if the auxiliary Head were not used (G_F1 = 0), the dependence on k = (k_v, k_u) would disappear completely from equation (8), hindering learning of the weight W_1. Installing the auxiliary Head restores that dependence and leads to more discriminative features.

[Experimental results]
The following describes the results of evaluating CFANet on the PASCAL VOC 2012 dataset (Reference 2; Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A., “The Pascal Visual Object Classes (VOC) Challenge”, International Journal of Computer Vision, 2010, p303-338). The evaluation metric is the mean of the per-class IoU values (mIoU).
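
For reference, mIoU can be computed as in the following sketch. It is a simplified illustration with hypothetical names and data: it works on flat lists of class indices for a single image, skips classes that appear in neither the prediction nor the target, and ignores details of the official benchmark such as accumulation over the whole dataset and void labels.

```python
def mean_iou(prediction, target, num_classes):
    """prediction, target: flat lists of class indices of equal length."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(prediction, target)
                           if p == cls and t == cls)
        union = sum(1 for p, t in zip(prediction, target)
                    if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Six pixels, three classes (illustrative values only).
target = [0, 0, 1, 1, 2, 2]
prediction = [0, 1, 1, 1, 2, 0]
score = mean_iou(prediction, target, num_classes=3)
```

Here the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5.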

PASCAL VOC 2012 is a dataset consisting of 1,464 training images, 1,449 validation images, and 1,456 test images, and it contains 21 classes including the background class. In addition to the PASCAL VOC 2012 dataset, the experiments also used for training the SBD dataset (Reference 3; Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S. and Malik, J., “Semantic Contours from Inverse Detectors”, ICCV, 2011), in which 10,582 images extracted from the PASCAL VOC 2011 dataset are annotated.

SGD was used for optimization, with the momentum set to 0.9 and the weight decay set to 0.0001. For learning rate scheduling, following Reference 1, the initial learning rate was multiplied by (1 - iter/total_iter)^0.9. As pre-training, 50 epochs of training were performed on the SBD dataset, and with the resulting weights as initial values, 50 epochs of fine-tuning were performed on the PASCAL VOC 2012 dataset. The learning rates for pre-training and fine-tuning were 0.0015 and 0.00015, respectively.
For data augmentation, random horizontal flipping, random scaling in the range [0.5, 2.0], and random cropping to a size of 513 × 513 were applied. In addition to single-scale evaluation, results obtained from horizontally flipped (Flip) and multi-scale (MS) input images were also evaluated.
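
The learning rate schedule described above can be sketched as follows. The function name and the total of 1,000 iterations are hypothetical, since the text does not state the iteration count. Only the base learning rate 0.0015 and the exponent 0.9 come from the description.

```python
def poly_learning_rate(base_lr, iteration, total_iterations, power=0.9):
    """Polynomial decay: base_lr * (1 - iter / total_iter) ** power."""
    return base_lr * (1.0 - iteration / total_iterations) ** power


# Pre-training learning rate from the text; the iteration count is illustrative.
lr_start = poly_learning_rate(0.0015, 0, 1000)
lr_mid = poly_learning_rate(0.0015, 500, 1000)
lr_end = poly_learning_rate(0.0015, 1000, 1000)
```

The rate starts at the base value, decreases monotonically, and reaches zero at the final iteration.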

The contributions of the three elements incorporated into CFANet were isolated using the PASCAL VOC 2012 validation set. First, as a baseline, an FCN with ResNet50 as the backbone was evaluated, giving an accuracy of 71.38%. Applying Multi-Grid (Reference 4; Chen, L.-C., Papandreou, G., Schroff, F. and Adam, H., “Rethinking Atrous Convolution for Semantic Image Segmentation”, arXiv:1706.05587, 2017), in which the dilations of the three 3 × 3 convolution layers in the final block of ResNet are set to (4, 8, 16), improved the accuracy to 77.90%. This can be interpreted as an effect of the enlarged receptive field.
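
To give a sense of how much dilation widens a single layer's footprint, the sketch below applies the usual relation k + (k - 1)(d - 1) for the span covered by a k × k kernel with dilation d. The function name is hypothetical, and the calculation covers each layer in isolation, not the receptive field of the whole network.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span in pixels covered by one dilated convolution kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


multi_grid_dilations = (4, 8, 16)  # dilations quoted in the text
spans = [effective_kernel_size(3, d) for d in multi_grid_dilations]
```

With these dilations the three 3 × 3 kernels span 9, 17, and 33 pixels respectively, compared with 3 pixels for an undilated kernel.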

Adding the CFA module, which is the method of the present embodiment, raises the performance to 78.90%. This is considered to result from the CFA module appropriately estimating the importance of each channel based on the global context. Adding the auxiliary Head described above allows the weights in the CFA module to be learned effectively, and the performance improved to 79.46%. For further improvement, changing the backbone to ResNet101 raised the performance to 81.54%, and performing inference with Flip and MS finally achieved 82.33%.

FIG. 5 is a diagram showing the evaluation results on the PASCAL VOC 2012 validation set. In FIG. 5, MG denotes Multi-Grid, CFA denotes the CFA module, Aux denotes the auxiliary Head, and MS + Flip denotes multi-scale and horizontally flipped inputs.

[Visualization of feature similarity]
To make the effect of the CFA module easier to understand, the visualization unit 30 visualizes the cosine similarity, in feature space, between a target pixel and the other pixels. The visualization focuses on the features input to the Head (F_2 in FIG. 3). For comparison, the similarity for the features immediately before the Head in the FCN is also visualized.
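
A similarity map of this kind can be computed as in the following sketch. It is an illustration with hypothetical names and toy data, not the visualization unit's implementation. It assumes the features are given as an H × W grid of C-dimensional vectors and that no vector is all zeros, since a zero vector would cause a division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_map(features, target_row, target_col):
    """Similarity of every pixel's feature vector to that of the target pixel."""
    target = features[target_row][target_col]
    return [[cosine_similarity(target, vec) for vec in row] for row in features]


# 2 x 2 grid of 2-dimensional feature vectors (illustrative values only).
features = [
    [[1.0, 0.0], [2.0, 0.0]],
    [[0.0, 3.0], [1.0, 1.0]],
]
sim = similarity_map(features, 0, 0)
```

Pixels whose feature vectors point in the same direction as the target score 1 regardless of magnitude, and orthogonal vectors score 0. Mapping these values to a color scale gives a map of the kind described here.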

FIG. 6 is a diagram showing an example of visualization results of cosine similarity.
The squares g600 in FIG. 6 (g601 to g603 from the top) represent input images, and the pixel for which similarity is calculated is marked with a cross (g641 to g643).
The squares g610 in FIG. 6 (g611 to g613 from the top) are the ground-truth labels corresponding to the input images.
The squares g620 (g621 to g623 from the top) and the squares g640 (g631 to g633 from the top) in FIG. 6 show the similarity maps of the baseline FCN and of CFANet, the method of the embodiment, respectively.

In the similarity maps, pixels close to red (g651 to g656) have high similarity, and areas close to blue (g661 to g663) have low similarity. Although the FCN shows high similarity in the same object region as the target pixel, it also shows relatively high similarity (green to yellow) (g671 to g673) in regions other than that object, such as the background.
In contrast, with CFANet of the embodiment, regions unrelated to the target region are more clearly discriminated. This shows that, whereas the FCN treats unimportant feature maps with the same contribution as the others, CFANet lowers the contribution of such feature maps and raises the contribution of important channels. This effect leads to improved performance and improved discriminability.

[Performance comparison with existing methods]
The performance of CFANet was also compared with other existing methods. The comparison was made on the PASCAL VOC 2012 test set. For this test set only the input images are provided, and results inferred by one's own model are evaluated by sending them to an evaluation server, which makes for a fair evaluation procedure. For evaluation on the test set, the model trained on the SBD dataset was fine-tuned on the PASCAL VOC 2012 train + validation set. The results are shown in FIG. 7, which is a diagram showing the evaluation results on the PASCAL VOC 2012 test set. CFANet of the embodiment achieved an accuracy of 84.5%, exceeding the existing methods.

As described above, the present embodiment uses CFANet, which has a mechanism for weighting feature maps using channel-level attention. Visualization of the feature maps confirmed that CFANet handles feature maps more discriminatively than conventional methods. In addition to the improved discriminability, a large improvement in performance was achieved, with accuracy exceeding existing methods on the PASCAL VOC 2012 test set.

Here, the examples of network structures that incorporate context are described further with reference to FIG. 2. The focus here is on dependencies in the channel dimension.
In the structure of image g110, which concatenates features in the channel direction, the features extracted from the image by the backbone g111 are input to the context module g112. In this structure, features are concatenated in the channel direction regardless of their importance.

The backbone is, for example, a ResNet (Reference 1) pre-trained on ImageNet (Reference 5; Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., “Imagenet: A large-scale hierarchical image database”, In: CVPR09, 2009). In the present embodiment, for example, following PSPNet, the first 7 × 7 convolution is replaced with 3 × 3 convolution, and dilated convolution is used in the last two blocks of ResNet, so the output stride of the feature map is 8.

In the structure of image g120, which takes in context as a residual feature, the features extracted from the image by the backbone g121 are input to the context module g122 and the arithmetic unit g123. In this structure, important features are emphasized and combined in the channel direction. The arithmetic unit g123 (element-wise summation) adds, element by element, the features obtained by the backbone g121 and the context module g122. That is, the residual form enhances the representation of each pixel by aggregating pixel-level context.

In the structure of image g130, which modulates the feature maps in both directions (amplification and attenuation) in consideration of the context, the features extracted from the image by the backbone g131 are input to the context module g132 and the arithmetic unit g133. This structure emphasizes features of high importance, attenuates features of low importance, and combines them in the channel direction. The arithmetic unit g133 (channel-wise multiplication) computes the product for each channel. By attending to features, this configuration can adjust the weight of each feature map, which makes the relevant features easier to discriminate.
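The three ways of combining backbone and context information in FIG. 2 can be contrasted with a minimal sketch. Feature maps are plain nested lists indexed as [channel][row][column] purely for readability; the function names and the toy values are assumptions for illustration, not part of the described device.

```python
def concat_channels(backbone, context):
    """g110 style: stack backbone and context maps along the channel axis."""
    return backbone + context


def residual_sum(backbone, context):
    """g120 style: element-wise summation of backbone and context features."""
    return [[[b + c for b, c in zip(b_row, c_row)]
             for b_row, c_row in zip(b_map, c_map)]
            for b_map, c_map in zip(backbone, context)]


def channel_scale(backbone, weights):
    """g130 style: multiply every value of channel c by the scalar weights[c]."""
    return [[[w * v for v in row] for row in fmap]
            for fmap, w in zip(backbone, weights)]


backbone = [[[1.0, 2.0]], [[3.0, 4.0]]]   # C=2, H=1, W=2
context = [[[0.5, 0.5]], [[1.0, 1.0]]]

print(len(concat_channels(backbone, context)))  # 4
print(residual_sum(backbone, context))          # [[[1.5, 2.5]], [[4.0, 5.0]]]
print(channel_scale(backbone, [2.0, 0.5]))      # [[[2.0, 4.0]], [[1.5, 2.0]]]
```

In the last call, a weight above 1 amplifies the first channel and a weight below 1 attenuates the second, which is the two-way modulation that the summation form cannot express.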

The purpose of semantic segmentation is to assign a semantic category to each pixel. In semantic segmentation, the classes become more ambiguous as the number of categories increases. The model therefore needs to be trained to select more discriminative features in order to achieve higher-quality image segmentation.
However, in the structure (g110) that concatenates features in the channel direction, the head network treats these aggregated features as equally important, which makes it difficult to pick out the specific features that are more distinctive.

Recent context modeling approaches use pixel-level similarity maps to refine the backbone features. The structure (g120) that takes in the context as a residual feature amount adopts the residual form. In this structure, however, features can only be reinforced by the selected features, which limits their discriminability.

For this reason, this embodiment uses the CFA module (Context-aware Feature Attention Network (CFANet)) configured as shown in FIG. 3.
CFANet introduces a CFA (Context-aware Feature Attention) module that adaptively adjusts the importance of individual features in a context-aware manner. Using global context is essential for accurate segmentation. In this embodiment, therefore, global average pooling (GAP) is used to aggregate global features and to generate channel-wise attention directly.
With this configuration, as shown in image g130 of FIG. 2, each feature map can be strengthened or weakened by its corresponding attention weight. Because the attention weights can take positive values, this embodiment can treat the individual features more distinctly than other methods.

Next, the schematic configuration of CFANet is described further with reference to FIG. 3.
Each feature map of the backbone network represents a certain kind of characteristic of the objects and stuff present in the input image. To distinguish the characteristic patterns that correspond to the category of the target object, the corresponding features must be given more weight based on the context of the scene. This embodiment therefore introduces the CFA module to perform this re-prioritization.

The acquired image X (g210) has dimensions 3 × H_0 × W_0, where 3 is the number of channels, H is the height of the feature map, and W is the width of the feature map.
As described above, ResNet, for example, is used for the backbone g211. The backbone g211 extracts the backbone feature amount F_0 ∈ ℝ^{C×H×W} (g221), where ℝ is the set of all real numbers and C is the number of channels.

The CFA module g220 converts the backbone feature amount F_0 into the feature map F_1 ∈ ℝ^{C×H×W} (g223) with a convolution layer.
Next, the CFA module g220 applies a 1×1 convolution (g224) to the feature map F_1 to calculate the modified feature amount F_1'.

Next, the CFA module g220 applies global average pooling (GAP) to the modified feature amount F_1' (g225) to generate the per-channel attention a (C × 1 × 1) (g226).

Next, the CFA module g220 uses the backbone feature amount F_0 and the attention a to aggregate global features and generate channel-wise attention (g227). In this process, the attention a and the backbone feature amount F_0 are multiplied channel by channel to generate the weighted feature amount F_2 ∈ ℝ^{C×H×W} (g228).
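The data flow through the CFA module (F_0 to F_1, F_1 to F_1', GAP to a, then a times F_0) can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: both convolutions are reduced to 1×1 channel mixing, a ReLU is placed before the pooling so that the attention is non-negative (the claims recite a ReLU before the GAP layer, but its exact position here is assumed), and all function names and toy weights are hypothetical.

```python
def conv1x1(features, weight):
    """Mix channels at every pixel. features: [C_in][H][W]; weight: [C_out][C_in]."""
    height, width = len(features[0]), len(features[0][0])
    return [[[sum(w_c * features[c][i][j] for c, w_c in enumerate(row_w))
              for j in range(width)] for i in range(height)]
            for row_w in weight]


def relu(features):
    """Clamp negatives to zero so that the pooled attention is non-negative."""
    return [[[max(0.0, v) for v in row] for row in fmap] for fmap in features]


def global_average_pool(features):
    """Average each channel over all H x W positions: one scalar per channel."""
    return [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])) for fmap in features]


def cfa_module(f0, w1, w1_prime):
    f1 = conv1x1(f0, w1)                       # feature map F_1 (kernel size assumed 1x1)
    f1_prime = relu(conv1x1(f1, w1_prime))     # modified feature amount F_1'
    attention = global_average_pool(f1_prime)  # attention a, one weight per channel
    f2 = [[[a_c * v for v in row] for row in fmap]
          for fmap, a_c in zip(f0, attention)]  # weighted feature amount F_2
    return f1_prime, attention, f2


f0 = [[[1.0, 3.0]], [[2.0, 2.0]]]              # C=2, H=1, W=2
identity = [[1.0, 0.0], [0.0, 1.0]]
f1_prime, attention, f2 = cfa_module(f0, identity, [[0.5, 0.0], [0.0, 2.0]])
print(attention)  # [1.0, 4.0]
print(f2)         # [[[1.0, 3.0]], [[8.0, 8.0]]]
```

Note that the attention is computed from F_1' but applied to F_0, so the spatial layout of the backbone features is preserved and only the per-channel scale changes.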

The auxiliary Head 32 (g232) applies, for example, convolution and upsampling to the modified feature amount F_1' to calculate and output the output Y' (C_out × H_0 × W_0) (g244).

The Head 31 (g231) applies, for example, convolution and upsampling to the weighted feature amount F_2 to calculate and output the output Y (C_out × H_0 × W_0) (g241).

The image identification device 1 compares the teacher label T (C_out × H_0 × W_0) (g242) with the output Y (g241) to calculate the loss L_main. The image identification device 1 also compares the teacher label T (C_out × H_0 × W_0) (g242) with the output Y' (g244) to calculate the loss L_aux.
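The text does not state which loss function compares the outputs with the teacher label. As one plausible choice, the sketch below uses a mean pixel-wise cross-entropy, which is common for semantic segmentation; the choice of loss, the one-hot encoding of T, and the function name are all assumptions of this example.

```python
import math


def pixelwise_cross_entropy(logits, target):
    """Mean cross-entropy over pixels.

    logits, target: nested lists shaped [C_out][H][W];
    target is assumed to be one-hot along the channel axis.
    """
    n_cls, height, width = len(logits), len(logits[0]), len(logits[0][0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            scores = [logits[c][i][j] for c in range(n_cls)]
            peak = max(scores)  # subtract the maximum for numerical stability
            log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
            total -= sum(target[c][i][j] * (scores[c] - log_z) for c in range(n_cls))
    return total / (height * width)


y = [[[0.0]], [[0.0]]]   # output Y: two classes, one pixel, equal scores
t = [[[1.0]], [[0.0]]]   # teacher label T: the pixel belongs to class 0
loss_main = pixelwise_cross_entropy(y, t)
print(loss_main)
```

With equal scores for two classes the loss equals log 2. The same function would be applied to Y' to obtain L_aux.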

Next, a simplified computational graph of CFANet is described further with reference to FIG. 4, which represents CFANet in a simple form. FIG. 4 assumes that all activation functions are linear and, for simplicity, omits the convolution layers of FIG. 3 (g224 and others); the explanation loses no generality as a result.

The input image X is input to the backbone g310. The backbone g310 performs the convolution g341 using the weights W_0 to calculate the backbone feature amount F_0.

The CFA module g320 performs the convolution g344 on the feature map F_0 using the weights W_1 to calculate the modified feature amount F_1.
The CFA module g320 calculates the attention a by GAP processing (g344).
The CFA module g320 takes the channel-wise product of the attention a and the backbone feature amount F_0 to calculate the weighted feature amount F_2 (g345).
The CFA module g320 does not perform downsampling, so that the resolution of the feature map is maintained.

The Head 31 (g330) performs a convolution (g346) on the weighted feature amount F_2 using the weights W_2 to calculate and output the output Y.
The auxiliary Head 32 (g340) performs a convolution (g347) on the modified feature amount F_1 using the weights W_2' to calculate and output the output Y'.
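Under the linear-activation assumption of FIG. 4, the simplified graph can be written compactly as below. Here * denotes convolution and c, i, j index channel, row, and column. Applying the GAP to F_1 is an assumption of this summary (in the simplified graph F_1 takes the role that F_1' has in FIG. 3).

```latex
\begin{aligned}
F_0 &= W_0 * X, \\
F_1 &= W_1 * F_0, \\
a_c &= \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} F_1[c, i, j], \\
F_2[c, i, j] &= a_c \, F_0[c, i, j], \\
Y &= W_2 * F_2, \\
Y' &= W_2' * F_1.
\end{aligned}
```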

The formulas and calculation methods used to calculate the feature amounts, the attention, and so on described with reference to FIGS. 3 and 4 are as described above.

In this embodiment, a network consisting of a convolution layer and a dropout layer is adopted for the Head 31 and the auxiliary Head 32. In addition, to facilitate optimization, the deep supervision head proposed in PSPNet, for example, is attached to the penultimate block of ResNet. In this embodiment, the three losses L_main, L_aux, and L_ds are thus calculated, and the net loss is calculated, for example, by the following equation (1). The loss L_ds is the loss on the segmentation output obtained from the feature amount of an intermediate layer of the backbone, as proposed in PSPNet.

In equation (10), the weight of the loss L_ds is set to 0.4 following PSPNet, but the weight is not limited to this value.
The image identification device 1 uses the loss function or the total loss to learn the attention, which is the weight coefficient. The image identification device 1 may also use the loss function or the total loss to learn the weights used by the second convolution layer 24, the weights used by the third convolution layer 25, the weights used by the first convolution layer 23 and the Head 31, and the weights used by the fourth convolution layer 27 and the auxiliary Head 32.
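As a rough illustration of how the three losses might be combined, the sketch below forms a weighted sum. The text fixes only the 0.4 weight on L_ds, so the unit weights on L_main and L_aux, the function name, and the sample values are assumptions of this example rather than the embodiment's formula.

```python
def total_loss(l_main, l_aux, l_ds, ds_weight=0.4):
    """Weighted sum of the main, auxiliary, and deep-supervision losses.

    Unit weights on l_main and l_aux are assumed; only ds_weight = 0.4
    is taken from the description.
    """
    return l_main + l_aux + ds_weight * l_ds


loss = total_loss(1.0, 0.5, 2.0)
print(round(loss, 6))  # 2.3
```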

[Processing procedure]
Next, an example of the processing procedure of the image identification device 1 is described.
FIG. 8 is a flowchart of an example of the processing procedure of the image identification device 1 according to this embodiment.

(Step S1) The image acquisition unit 11 acquires an image.
(Step S2) The feature amount extraction unit 12 convolves the acquired image using the weights W_0 to extract the backbone feature amount F_0.

(Step S3) The second convolution layer 24 convolves the backbone feature amount F_0 using the weights W_1 to calculate the feature map F_1.
(Step S4) The third convolution layer 25 convolves the feature map F_1 to calculate the modified feature amount F_1'.

(Step S5) The GAP unit 26 applies global average pooling to the modified feature amount F_1' to calculate the attention a for each channel.

(Step S6) The multiplication unit 22 multiplies the attention a and the backbone feature amount F_0 channel by channel to calculate the weighted feature amount F_2.

(Step S7) The first convolution layer 23 and the Head 31 apply, for example, convolution and upsampling to the weighted feature amount F_2 to calculate and output the output Y. The Head 31 calculates the loss L_main using the output Y.

(Step S8) The fourth convolution layer 27 and the auxiliary Head 32 apply, for example, convolution and upsampling to the modified feature amount F_1' to calculate and output the output Y'. The auxiliary Head 32 then calculates the loss L_aux using the output Y'.

(Step S9) The image identification device 1 calculates the overall loss function from the loss L_main and the loss L_aux, and learns the attention using the calculated loss function L.

The processing procedure described above is an example, and the procedure is not limited to it. For example, some of the processes may be performed in parallel, and the processing order may be reversed. When training has already been completed, the calculation of the losses and the loss function and the learning process need not be performed.

As described above, this embodiment examines the idea of feature attention for semantic segmentation and provides a CFA (Context-aware Feature Attention) module that adjusts the importance of the corresponding feature maps based on the global context. CFANet is constructed by combining an FCN with the CFA module.
As a result, according to this embodiment, using the CFA module improves the discriminability between feature maps.

Further, according to this embodiment, the accuracy of semantic segmentation can be improved, and it was shown experimentally that the estimated weights indicate the importance of the feature maps. In addition, the separability between feature maps can be improved, and when the similarity of feature amounts between pixels is compared, regions are distinguished more clearly than with conventional methods.

Further, according to this embodiment, the improved separability between pixels makes visual judgment easier. This embodiment can be applied to understanding the basis of a decision, analyzing the causes of false detections, and the like. Because the importance of each feature map is known, the computation time can be reduced by extracting only the maps of high importance. Making the importance sparse could also allow application to channel pruning. In this embodiment, the distribution of importance tends to sharpen as the amount of training data increases (the more data, the higher the confidence in the importance). Using this tendency, one can input new data and check the distribution to determine whether that data should be annotated as teacher data. This makes it possible to reduce the cost of annotation.
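The idea of keeping only the feature maps of high importance can be sketched as a top-k selection on the attention weights. The selection rule (a fixed k), the function name, and the toy values are assumptions for illustration; the description does not specify how the important maps are chosen.

```python
def keep_top_channels(features, attention, k):
    """Keep the k channels with the largest attention weights.

    Returns (kept_indices, pruned_features); the kept channels stay in
    their original order so downstream layers see a consistent layout.
    """
    ranked = sorted(range(len(attention)), key=lambda c: attention[c], reverse=True)
    kept = sorted(ranked[:k])
    return kept, [features[c] for c in kept]


attention = [0.1, 2.0, 0.7, 1.5]
features = [[[float(c)]] for c in range(4)]   # C=4, H=1, W=1
kept, pruned = keep_top_channels(features, attention, k=2)
print(kept)    # [1, 3]
print(pruned)  # [[[1.0]], [[3.0]]]
```

Any layer that consumes the pruned features would need its input weights reduced to the same channel subset.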

The network structure for calculating the importance of each feature map (channel) is not limited to the configuration described above and may be another configuration. In the example described above, the auxiliary classification network is trained on exactly the same task as the main task, but appropriate importance can also be calculated when it is combined with other tasks (scene classification, edge detection, caption generation, and so on). The positions at which the CFA module and its functional units are inserted are not limited to those described above and may be other positions. Also, in the example described above, the importance is limited to positive values to improve interpretability, but it may be allowed to take negative values.

A program for realizing all or some of the functions of the image identification device 1 of the present invention may be recorded on a computer-readable recording medium, and all or some of the processing performed by the image identification device 1 may be carried out by having a computer system read and execute the program recorded on the recording medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system equipped with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" further includes a medium that holds the program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores the program in a storage device or the like to another computer system via a transmission medium or by transmission waves in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may realize only some of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

1 ... image identification device, 11 ... image acquisition unit, 12 ... feature amount extraction unit, 10 ... semantic segmentation device, 30 ... visualization unit, 21 ... feature amount acquisition unit, 22 ... multiplication unit, 23 ... first convolution layer, 24 ... second convolution layer, 25 ... third convolution layer, 26 ... GAP unit, 27 ... fourth convolution layer, 31 ... Head, 32 ... auxiliary Head, 33 ... teacher label providing unit, 34 ... similarity map creation unit

Claims

An image identification device comprising:
an image acquisition unit that acquires an image (X);
a feature amount extraction unit that extracts a plurality of feature amounts of the acquired image;
a feature map creation unit that creates a feature map (X_i) for each of the plurality of feature amounts; and
a multiplication unit that multiplies each feature map by a weight coefficient (a_i), which is an arbitrary positive value expressing the importance of the feature.

The image identification device according to claim 1, wherein the weight coefficient (a_i) is calculated by:
a process of convolving the image (X) to create a convolution layer;
a process of applying a ReLU function to the convolution layer to calculate a feature amount F; and
a process of applying a Global Average Pooling (GAP) layer to the feature amount F.

An image identification device comprising:
an image acquisition unit that acquires an image;
a feature amount extraction unit that extracts a plurality of feature amounts of the acquired image;
a creation unit that creates a feature map by convolution processing for each of the plurality of feature amounts; and
a weighted feature amount generation unit that calculates a modified feature amount by applying convolution processing to the feature map, aggregates context by applying global average pooling to the calculated modified feature amount to generate attention, which is a weight coefficient for each channel, and multiplies the feature map by the generated attention, thereby weighting the plurality of feature maps with amplification and attenuation to generate a weighted feature amount.

The image identification device according to claim 3, further comprising:
a first loss calculation unit that applies convolution and upsampling processing to the weighted feature amount to calculate an output, and compares the calculated output with teacher data to calculate a first loss; and
a second loss calculation unit that applies convolution and upsampling processing to the feature map to calculate an output, and compares the calculated output with teacher data to calculate a second loss,
wherein an overall loss function is calculated from the first loss and the second loss, and the weight coefficient is learned using the calculated loss function.

A method of performing semantic segmentation of an image (X) using a neural network system, the method comprising:
a process of inputting the image;
a process of extracting a plurality of feature amounts of the acquired image;
a process of creating a feature map (X_i) for each of the plurality of feature amounts of the image; and
a process of multiplying each feature map by a weight coefficient (a_i), which is an arbitrary positive value expressing the importance of the feature.

A program that causes a computer to:
acquire an image;
extract a plurality of feature amounts of the acquired image;
create a feature map (X_i) for each of the plurality of feature amounts of the image; and
multiply each feature map by a weight coefficient (a_i), which is an arbitrary positive value expressing the importance of the feature.