JP6713162B2

JP6713162B2 - Image recognition device, image recognition method, and image recognition program

Info

Publication number: JP6713162B2
Application number: JP2016008273A
Authority: JP
Inventors: 立間 淳司 (Atsushi Tatsuma); 青野 雅樹 (Masaki Aono)
Original assignee: Toyohashi University of Technology NUC
Current assignee: Toyohashi University of Technology NUC
Priority date: 2016-01-19
Filing date: 2016-01-19
Publication date: 2020-06-24
Anticipated expiration: 2036-01-19
Also published as: JP2017129990A

Description

The present invention relates to an image recognition device, an image recognition method, and an image recognition program. In particular, it realizes image recognition using a neural network.

In recent years, convolutional neural networks (hereinafter, CNNs; see, for example, Non-Patent Document 1) have achieved excellent recognition performance in image recognition.

Image recognition techniques based on neural networks include, for example, Patent Documents 1 and 2. In Patent Document 1, image recognition performance is improved by switching, according to the learning result or the classification result, between linear classification based on covariance and nonlinear classification by a neural network or the like.

Patent Document 2 discloses a device that, in order to improve the accuracy of CNN-based image recognition, reduces the computational cost of the CNN and appropriately sets the multiple weights of the convolutional layers.

Meanwhile, the recognition of food images such as photographs of plated dishes (hereinafter sometimes referred to as food image recognition) has become an important research topic for many applications related to dietary habits. In a food image recognition benchmark (Non-Patent Document 2) as well, CNNs achieve better recognition performance than the conventional Bag-of-Visual-Words histogram (hereinafter, BoVW; see, for example, Non-Patent Document 3) and the Fisher Vector (see, for example, Non-Patent Document 4).

Patent Document 1: Japanese Patent No. 4121061
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2015-052832

Non-Patent Document 1: A. Krizhevsky, I. Sutskever, and G.E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems (NIPS'12), vol.25, pp.1097-1105, 2012.
Non-Patent Document 2: L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 - Mining discriminative components with random forests," Proc. of the 13th European Conference on Computer Vision (ECCV'14), pp.446-461, 2014.
Non-Patent Document 3: S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," Proc. of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'06), vol.2, pp.2169-2178, 2006.
Non-Patent Document 4: J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek, "Image classification with the fisher vector: Theory and practice," International Journal of Computer Vision, vol.105, no.3, pp.222-245, 2013.
Non-Patent Document 5: X. Pennec, P. Fillard, and N. Ayache, "A riemannian framework for tensor computing," International Journal of Computer Vision, 66 (1), pp.41-66, 2006.
Non-Patent Document 6: D. Tosato, M. Spera, M. Cristani, and V. Murino, "Characterizing humans on Riemannian manifolds," IEEE Trans. Pattern Analysis and Machine Intelligence, 35 (8), pp.1972-1984, 2013.
Non-Patent Document 7: H. Jegou and O. Chum, "Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening," Proc. of the 12th European Conference on Computer Vision (ECCV'12), 2, pp.774-787, 2012.
Non-Patent Document 8: S. Singh, A. Gupta, and A.A. Efros, "Unsupervised discovery of mid-level discriminative patches," Proc. of the 12th European Conference on Computer Vision (ECCV'12), vol.2, pp.73-86, 2012.
Non-Patent Document 9: A.S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," Proc. of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'14), pp.512-519, 2014.

However, obtaining high recognition performance with a CNN requires training the neural network on a large-scale image data set. Furthermore, performing the training in a realistic amount of time requires substantial processing power, such as a parallel processing system using GPUs (graphics processing units).

As a solution to this problem, a method has been proposed in which a classifier such as a Support Vector Machine (hereinafter, SVM) is trained using, as features, the output of the fully connected layer of a CNN that has already been trained on an image data set of miscellaneous images. While this method does not require training a neural network on a large-scale image data set, it does not achieve sufficient recognition performance in domain-specific image recognition tasks such as food image recognition. To perform image recognition in a specific domain, it is necessary to prepare an image data set of the target domain and retrain the neural network, so the preparation for image recognition takes time.

The present invention has been made in view of the above problems of the prior art.

A first image recognition device according to the present invention is an image recognition device that uses a convolutional neural network having a plurality of convolutional layers, and comprises: an arithmetic unit that extracts a covariance descriptor from an image; a storage unit that stores the covariance descriptors extracted by the arithmetic unit from the images included in an image data set; and an identification processing unit that compares the covariance descriptor extracted by the arithmetic unit from a newly input image with the covariance descriptors stored in the storage unit. The arithmetic unit comprises: local feature computation means for computing local features from the feature map output from any one of the convolutional layers; covariance matrix derivation means for deriving a covariance matrix of the local features; vector computation means for performing a computation that vectorizes the covariance matrix; and normalization means for normalizing the resulting vector.

A second image recognition device according to the present invention is the first image recognition device according to the present invention, wherein the local feature computation means computes local features from the feature maps output from all of the convolutional layers of the convolutional neural network.

A third image recognition device according to the present invention is the first or second image recognition device according to the present invention, wherein, when a feature map of size w×h with d channels is obtained from one convolutional layer, the local feature computation means treats the feature map as d-dimensional local features at n = w×h points.

A fourth image recognition device according to the present invention is the third image recognition device according to the present invention, wherein, when the size of the covariance matrix is d×d, the dimension of the covariance descriptor extracted by the arithmetic unit is (d² + d)/2.

A first image recognition method according to the present invention is an image recognition method that uses a convolutional neural network having a plurality of convolutional layers, and comprises: an extraction step of extracting a covariance descriptor from an image; a storage step of storing the covariance descriptors extracted by the extraction step from the images included in an image data set; and an identification processing step of comparing the covariance descriptor extracted by the extraction step from a newly input image with the covariance descriptors stored by the storage step. The extraction step includes: a local feature computation step of computing local features from the feature map output from any one of the convolutional layers; a covariance matrix derivation step of deriving a covariance matrix of the local features; a vector computation step of performing a computation that vectorizes the covariance matrix; and a normalization step of normalizing the resulting vector.

A second image recognition method according to the present invention is the first image recognition method according to the present invention, wherein the local feature computation step computes local features from the feature maps output from all of the convolutional layers of the convolutional neural network.

A third image recognition method according to the present invention is the first or second image recognition method according to the present invention, wherein, when a feature map of size w×h with d channels is obtained from one convolutional layer, the local feature computation step treats the feature map as d-dimensional local features at n = w×h points.

A fourth image recognition method according to the present invention is the third image recognition method according to the present invention, wherein, when the size of the covariance matrix is d×d, the dimension of the covariance descriptor extracted by the extraction step is (d² + d)/2.

An image recognition program according to the present invention is a computer program for image recognition that uses a convolutional neural network having a plurality of convolutional layers. The program causes a computer to function as: arithmetic means for extracting a covariance descriptor from an image; storage means for storing the covariance descriptors extracted by the arithmetic means from the images included in an image data set; and identification processing means for comparing the covariance descriptor extracted by the arithmetic means from a newly input image with the covariance descriptors stored by the storage means. Within the arithmetic means, the program further causes the computer to function as: local feature computation means for computing local features from the feature map output from any one of the convolutional layers; covariance matrix derivation means for deriving a covariance matrix of the local features; vector computation means for performing a computation that vectorizes the covariance matrix; and normalization means for normalizing the resulting vector.

Among the multiple convolutional layers of a CNN, lower layers such as the first and second layers capture low-level image features such as edges and corners. These features are less influenced by the training image data set of a specific domain than those of the fully connected layers of the neural network. According to the present invention, abstract features of an image can be extracted by using these low-level features.

Furthermore, in the present invention, the covariance of the output of a convolutional layer of the CNN (the feature maps) is used as the image feature. This realizes image recognition in a specific domain without retraining the CNN on images of that domain, and improves the accuracy of image recognition in that domain.

FIG. 1 is a schematic diagram of the feature map covariance descriptor according to the present invention.
FIG. 2 is a schematic diagram of an image recognition device using the feature map covariance descriptor according to the present invention.
FIG. 3 is a flowchart of the extraction process of the feature map covariance descriptor according to the present invention.

Image recognition according to the present invention is realized accurately by the features and the classifier according to the present invention, without retraining the convolutional neural network on an image data set of the target domain and without a training image data set for such retraining. The convolutional neural network has already been trained on an image data set consisting of miscellaneous images. The classifier, on the other hand, is trained using a training image data set of the target domain.

(Covariance of convolutional-layer feature maps)
As shown in FIG. 1, to obtain effective features from a trained CNN, features are extracted from the feature maps of the convolutional layers that precede the fully connected layers. These convolutional layers contain basic image features such as edges and corners, which are less affected by the content of the training image data set than the fully connected layers are. The covariance matrix of a convolutional layer's feature map is therefore computed as the feature. Note that in this description, "learning" and "training" of a neural network or classifier are used interchangeably.

When an image is input to the trained CNN, the units arranged in each layer output values. The output of a convolutional layer is called a feature map. A feature map consists of multiple two-dimensional planes on which the values output by the units are arranged; in general, each such plane is counted as one channel. Suppose that a feature map of size w×h with d channels is obtained from the l-th convolutional layer. In the present invention, this feature map is regarded as a set F of d-dimensional local features at n = w×h points:

$$F = [f_1, \ldots, f_n] \in \mathbb{R}^{d \times n} \tag{1}$$

The covariance matrix C of the local features F is then given by

$$C = \frac{1}{n-1} \sum_{i=1}^{n} (f_i - m)(f_i - m)^{T} \tag{2}$$

where m is the mean vector of the local features F. Because the covariance matrix represents how the local features are scattered in the multidimensional space, it can capture tendencies of features such as edges and corners in the image.
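The covariance computation described above can be sketched in NumPy as follows. This is a minimal illustration, not the patent's implementation: the channel-first `(d, h, w)` array layout, the function name `feature_map_covariance`, and the unbiased `1/(n - 1)` normalization are assumptions made for the example.

```python
import numpy as np


def feature_map_covariance(feature_map: np.ndarray) -> np.ndarray:
    """Covariance of a feature map treated as n = w*h local features of d dims.

    Assumes a channel-first array of shape (d, h, w); this layout is an
    assumption for the sketch, not something the source specifies.
    """
    d = feature_map.shape[0]
    F = feature_map.reshape(d, -1)        # d x n matrix of local features
    n = F.shape[1]
    m = F.mean(axis=1, keepdims=True)     # mean vector of the local features
    centered = F - m
    return centered @ centered.T / (n - 1)  # d x d covariance matrix


# Hypothetical feature map: 8 channels, 6 x 6 units.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 6, 6))
C = feature_map_covariance(fmap)
```

The result is a symmetric d×d matrix regardless of the spatial size of the feature map, which is what lets feature maps of any resolution be compared through a fixed-length descriptor.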

The covariance matrix C lies not in Euclidean space but on the Riemannian manifold of positive semidefinite matrices. Many machine learning algorithms assume vectors in Euclidean space as input, so a classifier cannot be trained on C as it is. The covariance matrix C is therefore mapped to Euclidean space: by matrix operations and vector manipulation, the positive definite matrix is mapped to Euclidean space and converted into vector form. For example, the method proposed by Pennec et al. in Non-Patent Document 5 can be used. Converting to a vector in Euclidean space makes training with a general classifier such as an SVM possible.

First, the covariance matrix C is projected onto the Euclidean space tangent to the Riemannian manifold at the point P. The projection Y of the covariance matrix C is given by Equation 3.

$$Y = \log_P(C) = P^{1/2} \log\!\left(P^{-1/2} C P^{-1/2}\right) P^{1/2} \tag{3}$$

Here, log(·) is the matrix logarithm. Given the eigenvalue decomposition A = UΛU^T (where T denotes the transpose), it can be computed by Equation 4.

$$\log(A) = U \log(\Lambda) U^{T} \tag{4}$$

The matrix logarithm of a diagonal matrix such as Λ is obtained, as in Equation 5, by taking the logarithm of its diagonal elements λ_1, ..., λ_d.

$$\log(\Lambda) = \operatorname{diag}(\log \lambda_1, \ldots, \log \lambda_d) \tag{5}$$

The orthogonal coordinates of the projected vector are then obtained by the vector operation of Equation 6.

$$\operatorname{vec}_P(Y) = \operatorname{vec}_I\!\left(P^{-1/2} Y P^{-1/2}\right) \tag{6}$$

Here, vec_I is the vector operation on the tangent space at the identity matrix, defined by

$$\operatorname{vec}_I(Y) = [\, y_{1,1},\ \sqrt{2}\,y_{1,2},\ \sqrt{2}\,y_{1,3},\ \ldots,\ y_{2,2},\ \sqrt{2}\,y_{2,3},\ \ldots,\ y_{d,d} \,]^{T} \tag{7}$$

That is, the upper-triangular elements of Y are arranged into a vector. The off-diagonal elements of Y (for example, y_{1,2} and y_{1,3}) are multiplied by the square root of 2 so that the norm of the vector matches the norm of Y: because Y is symmetric, each off-diagonal element appears twice in Y but only once in the vector.

From the viewpoint of computational cost, the identity matrix is used for P. As a result, the vectorized covariance matrix C is given by Equation 8.

$$c = \operatorname{vec}_I(\log(C)) \tag{8}$$

In other words, the matrix logarithm of the covariance matrix C is computed, and its upper-triangular elements are arranged into a vector.

When the size of the covariance matrix C is d×d, the dimensionality of the feature (the number of vector elements) is (d² + d)/2.
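The mapping with P set to the identity matrix (matrix logarithm via eigenvalue decomposition, then upper-triangular vectorization with off-diagonal elements scaled by the square root of 2) might look like the sketch below. The function name `log_euclidean_vector` and the `eps` floor on the eigenvalues are assumptions added for illustration; the source does not say how zero eigenvalues of a rank-deficient covariance matrix are handled.

```python
import numpy as np


def log_euclidean_vector(C: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Vectorize a symmetric positive (semi)definite matrix as vec_I(log(C))."""
    d = C.shape[0]
    eigvals, U = np.linalg.eigh(C)            # C = U diag(eigvals) U^T
    # Assumption: floor the eigenvalues so the logarithm stays finite.
    eigvals = np.clip(eigvals, eps, None)
    log_C = (U * np.log(eigvals)) @ U.T       # U diag(log eigvals) U^T
    rows, cols = np.triu_indices(d)           # upper triangle, row by row
    weights = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return weights * log_C[rows, cols]        # length (d*d + d) / 2


# Hypothetical 5 x 5 covariance matrix built from 20 random samples.
samples = np.random.default_rng(0).standard_normal((5, 20))
C = np.cov(samples)
c = log_euclidean_vector(C)
```

With d = 5 the vector has (25 + 5)/2 = 15 elements, and its Euclidean norm equals the Frobenius norm of log(C), which is the property the square-root-of-2 scaling is meant to preserve.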

The final feature is obtained by applying signed square-root normalization and ℓ2 normalization to the vector c. Signed square-root normalization applies the following operation to each element x of the vector:

$$x \leftarrow \operatorname{sign}(x)\sqrt{|x|}$$

Here, sign(x) is a function that returns the sign of x. This normalization mitigates the sparsity of the vector (it is unnecessary when the vector c is not sparse, and is not a mandatory step). ℓ2 normalization divides each element of the vector by the Euclidean norm of the vector, which makes the magnitude of the vector constant.
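The two normalization steps can be illustrated with a short NumPy sketch. The function name `normalize_descriptor` and the guard against a zero-norm vector are assumptions for the example, not details taken from the source.

```python
import numpy as np


def normalize_descriptor(c: np.ndarray) -> np.ndarray:
    """Signed square-root normalization followed by l2 normalization."""
    c = np.sign(c) * np.sqrt(np.abs(c))   # x <- sign(x) * sqrt(|x|)
    norm = np.linalg.norm(c)              # Euclidean norm
    # Assumption: leave an all-zero vector unchanged instead of dividing by 0.
    return c / norm if norm > 0 else c


example = np.array([4.0, -9.0, 0.0])
normalized = normalize_descriptor(example)
```

For the example input, the signed square root gives [2, -3, 0], and dividing by its norm (the square root of 13) yields a unit-length vector that keeps the original signs.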

(Construction of an image recognition system using feature map covariance descriptors)
FIG. 2 is a schematic diagram of an image recognition system that uses feature map covariance descriptors. First, two things are prepared: a training image data set consisting of labeled images of the target recognition domain, used to train the classifier, and a trained CNN, used to extract feature map covariance descriptors.

Next, feature map covariance descriptors are extracted from all images in the training image data set and input, together with the label information, to the classifier training unit. In the present invention, a Support Vector Machine is used as the classifier. The classifier training unit learns a classification model from the feature map covariance descriptors and label information of the given training image data set. The resulting classification model is stored in a storage device.

In the identification stage, when an image to be identified is given, a feature map covariance descriptor is extracted from the image in the same way as during training and input to the identification processing unit. The identification processing unit computes and outputs the identification result for the image from the input descriptor and the classification model stored in the storage device.

FIG. 3 is a flowchart of the feature map covariance descriptor extraction process. Given an image, the feature map covariance descriptor extraction unit resizes it to match the network configuration of the CNN (for example, 221×221 for the OverFeat accurate network). If necessary, preprocessing such as subtracting the mean pixel value is applied to the resized image. Preprocessing may also be performed inside the trained CNN.

Next, the image, preprocessed (including resizing), is input to the CNN, and the output (feature map) of an arbitrary l-th convolutional layer is obtained. The feature map is then regarded as local features and converted into the form of a sample matrix, and the covariance matrix is computed using a matrix computation library (for example, Eigen for C++ or NumPy for Python). The matrix logarithm of the covariance matrix is likewise computed using a matrix computation library.

The elements in the upper-triangular part of the matrix logarithm of the covariance matrix are arranged into a vector. If necessary, signed square-root normalization and ℓ2 normalization are applied to this vector.

The vector obtained in this way is the feature map covariance descriptor.

(Experimental environment)
Recognition accuracy was evaluated using the food image data set ETHZ Food-101 (Food-101; see Non-Patent Document 2). Food-101 contains 101,000 food images classified into 101 classes, with 750 training images and 250 test images per class. Food image recognition was chosen as the task because publicly available trained CNNs are trained on miscellaneous images obtained from ImageNet. Restricting the recognition target to food images makes it possible to check whether good recognition accuracy is obtained even when the domain of the images used for training differs from that of the images to be recognized.

OverFeat, provided by New York University (see http://cilvr.nyu.edu/doku.php?id=software:overfeat:start), was used as the trained CNN. OverFeat provides two networks, fast and accurate; the accurate network was used in this experiment. An SVM was used as the classifier, implemented with liblinear (see https://www.csie.ntu.edu.tw/~cjlin/liblinear/).

The computer used in the experiments had a dual-core processor, an Intel Xeon (registered trademark) E5-2630 at 2.3 GHz, as its CPU and 64 GB of memory. The OS was Debian GNU/Linux (registered trademark) 8.2.

The conventional methods compared were: the Bag-of-Visual-Words histogram (BoVW) method (Non-Patent Document 3); the Improved Fisher Vector (IFV) method (Non-Patent Document 4); the Mid-Level Discriminative Superpixels (MLDS) method (Non-Patent Document 8); the Random Forest Discriminant Components (RFDC) method (Non-Patent Document 2); a CNN trained on Food-101 (Non-Patent Document 1); and a method that classifies the output of OverFeat's fully connected layer with an SVM (CNN-SVM) (Non-Patent Document 9). The evaluation values for the conventional methods other than CNN-SVM are quoted from Non-Patent Document 2, which proposed Food-101 (upper part of Table 1). For CNN-SVM, the SVM classifier was trained using the ℓ2-normalized output vector of OverFeat's fully connected layer as the image feature. For every method, the experimental data and conditions are the same as for the present invention.

Accuracy was used as the evaluation measure of recognition performance. With N the total number of data items and R the number of correctly recognized data items, accuracy is defined as follows.

$$\mathrm{Accuracy} = \frac{R}{N}$$
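As a small illustration of this measure, the following sketch computes the fraction of correctly recognized items, R divided by N. The function name `accuracy` and the example labels are hypothetical.

```python
def accuracy(predicted, truth):
    """Fraction of items whose predicted label matches the true label (R / N)."""
    if len(predicted) != len(truth) or len(truth) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(truth)                                        # N: total items
    r = sum(p == t for p, t in zip(predicted, truth))     # R: correct items
    return r / n


# Hypothetical labels: three of the four predictions are correct.
score = accuracy(["sushi", "ramen", "pizza", "sushi"],
                 ["sushi", "ramen", "sushi", "sushi"])
```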

(Experimental results)
For the present invention, the following variants were evaluated: the descriptor extracted from OverFeat's first-layer feature map (FMCD-L1); the descriptor extracted from the second-layer feature map (FMCD-L2); their concatenation, formed by arranging their elements into a single vector (FMCD-L1+FMCD-L2); each descriptor concatenated with the fully connected layer output (FMCD-L1+FUL and FMCD-L2+FUL); and the concatenation of all three (FMCD-L1+FMCD-L2+FUL). A linear SVM was used as the classifier in all cases.

In the first layer of OverFeat, a feature map of 36×36 units with 96 channels is obtained. Treating this as 1,296 (= 36×36) samples of 96-dimensional local features, the feature map covariance descriptor is computed, yielding a 4,656-dimensional (= (96² + 96)/2) descriptor. Similarly, in the second layer, a feature map of 15×15 units with 256 channels is obtained.
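The descriptor sizes follow directly from the channel counts. The sketch below checks the first-layer figure and applies the same formula to the 256-channel second layer, then concatenates two placeholder vectors in the way the combined variants are described. The function name `descriptor_dim` and the zero-filled placeholder descriptors are assumptions for illustration only.

```python
import numpy as np


def descriptor_dim(d: int) -> int:
    """Number of upper-triangular elements of a d x d matrix: (d^2 + d) / 2."""
    return (d * d + d) // 2


dim_l1 = descriptor_dim(96)     # first layer, 96 channels
dim_l2 = descriptor_dim(256)    # second layer, 256 channels

# Placeholder descriptors standing in for FMCD-L1 and FMCD-L2.
fmcd_l1 = np.zeros(dim_l1)
fmcd_l2 = np.zeros(dim_l2)
combined = np.concatenate([fmcd_l1, fmcd_l2])   # FMCD-L1+FMCD-L2
```

Under these assumptions the second-layer descriptor has (256² + 256)/2 = 32,896 dimensions, and the concatenated FMCD-L1+FMCD-L2 vector has 4,656 + 32,896 = 37,552.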

Table 1 summarizes the classification performance of each method on the Food-101 dataset, evaluated by accuracy.

Table 1 shows that FMCD-L1 and FMCD-L2 outperform CNN-SVM, the method that uses the fully connected layer as the feature amount. FMCD-L1+FMCD-L2 achieves classification performance equivalent to that of the CNN. FMCD-L1+FUL and FMCD-L2+FUL, in which the present invention is concatenated with the output of the fully connected layer, achieve similar classification performance. Furthermore, FMCD-L1+FMCD-L2+FUL, in which all are concatenated, exceeds the CNN trained on Food-101, which demonstrates the effectiveness of the present invention.

Claims

An image recognition device using a convolutional neural network comprising a plurality of convolutional layers, the device comprising:
an arithmetic unit that extracts a covariance descriptor from an image; a storage unit that stores the covariance descriptors extracted through processing by the arithmetic unit for images included in an image data set; and an identification processing unit that compares the covariance descriptor extracted through processing by the arithmetic unit for a newly input image with the covariance descriptors stored in the storage unit,
wherein the arithmetic unit comprises: local feature amount calculation means for calculating a local feature amount from a feature map output from any one of the convolutional layers; covariance matrix derivation means for deriving a covariance matrix for the local feature amount; vector operation means for performing an operation to vectorize the covariance matrix; and normalization means for obtaining a final feature amount by applying, to each element of the vector, signed square root normalization according to the following formula and l₂ normalization, and
wherein the vector operation means performs processing such that the vector and the norm are made to match when mapping to Euclidean space.

The image recognition device according to claim 1, wherein the local feature amount calculation means calculates local feature amounts for the feature maps output from all of the convolutional layers of the convolutional neural network.

The image recognition device according to claim 1 or 2, wherein, when a feature map of size w×h with d channels is obtained from one convolutional layer, the local feature amount calculation means calculates, from the feature map, d-dimensional local feature amounts at n=w×h points.

The image recognition device according to claim 3, wherein, when the size of the covariance matrix is d×d, the dimension of the covariance descriptor extracted by the arithmetic unit is (d²+d)/2.

An image recognition method using a convolutional neural network comprising a plurality of convolutional layers, the method comprising:
an extraction step of extracting a covariance descriptor from an image; a storage step of storing the covariance descriptors extracted through processing by the extraction step for images included in an image data set; and an identification processing step of comparing the covariance descriptor extracted through processing by the extraction step for a newly input image with the covariance descriptors stored by the storage step,
wherein the extraction step includes: a local feature amount calculation step of calculating a local feature amount from a feature map output from any one of the convolutional layers; a covariance matrix derivation step of deriving a covariance matrix for the local feature amount; a vector operation step of performing an operation to vectorize the covariance matrix; and a normalization step of obtaining a final feature amount by applying, to each element of the vector, signed square root normalization according to the following formula and l₂ normalization, and
wherein the vector operation step performs processing such that the vector and the norm are made to match when mapping to Euclidean space.

The image recognition method according to claim 5, wherein the local feature amount calculation step calculates local feature amounts for the feature maps output from all of the convolutional layers of the convolutional neural network.

The image recognition method according to claim 5 or 6, wherein, when a feature map of size w×h with d channels is obtained from one convolutional layer, the local feature amount calculation step calculates, from the feature map, d-dimensional local feature amounts at n=w×h points.

The image recognition method according to claim 7, wherein, when the size of the covariance matrix is d×d, the dimension of the covariance descriptor extracted by the extraction step is (d²+d)/2.

A computer program for image recognition using a convolutional neural network comprising a plurality of convolutional layers, the computer program causing a computer to function as:
arithmetic means for extracting a covariance descriptor from an image; storage means for storing the covariance descriptors extracted through processing by the arithmetic unit for images included in an image data set; and identification processing means for comparing the covariance descriptor extracted through processing by the arithmetic unit for a newly input image with the covariance descriptors stored in the storage unit,
and further causing the computer to function, in the arithmetic means, as: local feature amount calculation means for calculating a local feature amount from a feature map output from any one of the convolutional layers; covariance matrix derivation means for deriving a covariance matrix for the local feature amount; vector operation means for performing an operation to vectorize the covariance matrix; and normalization means for obtaining a final feature amount by applying, to each element of the vector, signed square root normalization according to the following formula and l₂ normalization,
wherein the vector operation means performs processing such that the vector and the norm are made to match when mapping to Euclidean space.