JP2019200772A

JP2019200772A - Identification system, identification device, method for identification, and program

Info

Publication number: JP2019200772A
Application number: JP2018195848A
Authority: JP
Inventors: Satoshi Sato; Yasunori Ishii; Takeo Azuma; Kazuo Nobori; Nobuhiko Wakai; Lasang Pongsak; Subramaniam Karthik; Mei Shen Shen
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2018-05-14
Filing date: 2018-10-17
Publication date: 2019-11-21

Abstract

To provide an identification system that can increase both the accuracy and the speed of identifying an object using an image.

SOLUTION: An identification system 1 includes: a camera (imaging unit 11) that captures a calculated captured image including an imaging target and the surrounding environment of the imaging target; and a processing circuit (image identification device 10) that identifies the imaging target in the calculated captured image using an identification model. The identification model is a network having no pooling layers.

SELECTED DRAWING: Figure 1

Description

The present disclosure relates to an identification system, an identification device, an identification method, and a program.

In autonomous vehicles and robots, technology for identifying surrounding objects and recognizing the environment is important. In recent years, a technique called deep learning has attracted attention for object identification in, for example, autonomous vehicles and robots. Deep learning is machine learning that uses a multi-layer neural network and is trained on a large amount of training data. Deep learning makes it possible to achieve more accurate identification than conventional methods. Image information is particularly effective for such object identification. Non-Patent Document 1 discloses a technique that greatly improves on conventional object identification capability through deep learning that takes image information as input. To identify objects with high accuracy, however, the input image needs to have a high resolution. A low-resolution image does not capture, for example, a distant subject at sufficient resolution, so identification performance deteriorates when the input image has a low resolution.

Non-Patent Document 2, on the other hand, discloses a technique that further improves the identification capability of deep learning by using depth information from a three-dimensional range finder as input in addition to image information. Depth information makes it possible to separate near subjects from distant ones, so using it can improve identification performance even for distant subjects. In addition, a technique called compressed sensing, as disclosed in Non-Patent Document 3 for example, is known for restoring a high-resolution image while capturing a low-resolution image.

[Non-Patent Document 1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012, pp. 1097-1105
[Non-Patent Document 2] Andreas Eitel et al., "Multimodal Deep Learning for Robust RGB-D Object Recognition", 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015
[Non-Patent Document 3] Y. Oike and A. E. Gamal, "A 256 × 256 CMOS Image Sensor with ΔΣ-Based Single-Shot Compressed Sensing", 2012 IEEE International Solid-State Circuits Conference (ISSCC) Dig. of Tech. Papers, 2012, pp. 386-387
[Non-Patent Document 4] M. Salman Asif, Ali Ayremlou, Ashok Veeraraghavan, Richard Baraniuk and Aswin Sankaranarayanan, "FlatCam: Replacing Lenses with Masks and Computation", International Conference on Computer Vision Workshop (ICCVW), 2015, pp. 663-666
[Non-Patent Document 5] Yusuke Nakamura, Takeshi Shimano, Kazuyuki Tajima, Mayu Sao and Taku Hoshizawa, "Lensless Light-field Imaging with Fresnel Zone Aperture", 3rd International Workshop on Image Sensors and Imaging Systems (IWISS2016) ITE-IST2016-51, 2016, no. 40, pp. 7-8

However, the techniques disclosed in Non-Patent Documents 1 to 3 have the problem that it is difficult to improve both the accuracy and the processing speed of object identification using an image.

The present disclosure therefore provides an identification system and the like that improve both the accuracy and the processing speed of object identification using an image.

To solve the above problem, an identification system according to one aspect of the present disclosure includes: a camera that captures a calculated captured image including an imaging target and the surrounding environment of the imaging target; and a processing circuit that identifies the imaging target in the calculated captured image using an identification model. The identification model is a network having no pooling layer.

Note that these general or specific aspects may be implemented as a system, a device, a method, an integrated circuit, a computer program, or a recording medium such as a computer-readable recording disc, or as any combination of a system, a device, a method, an integrated circuit, a computer program, and a recording medium. Computer-readable recording media include non-volatile recording media such as a CD-ROM (Compact Disc-Read Only Memory).

The identification system and the like of the present disclosure make it possible to improve both the accuracy and the processing speed of object identification using an image.

Additional benefits and advantages of one aspect of the present disclosure will become apparent from the specification and drawings. These benefits and/or advantages may be provided individually by the various aspects and features disclosed in the specification and drawings, and not all of them are required to obtain one or more of the benefits and/or advantages.

FIG. 1 is a schematic diagram illustrating an example of the functional configuration of an identification system including an image identification device according to an embodiment.
FIG. 2 is a schematic diagram of a conventional network configuration.
FIG. 3 is a diagram illustrating an example of the parameters of a conventional network.
FIGS. 4A to 4F are first to sixth schematic diagrams for explaining that combining a convolution layer with a kernel size of 3 × 3 and a stride of 1 with a pooling layer with a kernel size of 2 × 2 and a stride of 2 achieves robustness to position on the image.
FIG. 5 is a schematic diagram of the network configuration in the embodiment.
FIG. 6 is a diagram illustrating an example of the network parameters in the embodiment.
FIGS. 7A to 7F are first to sixth schematic diagrams for explaining that a network that uses a convolution layer with a kernel size of 3 × 3 and a stride of 3 and has no pooling layer is sensitive to position on the image.
FIGS. 8A to 8E are first to fifth schematic diagrams for explaining that a network that uses a convolution layer with a kernel size of 3 × 3 and a stride of 2 and has no pooling layer is sensitive to position on the image.
FIG. 9 is a schematic diagram illustrating an example of the functional configuration of an identification system according to a modification of the embodiment.
FIG. 10 is a schematic diagram illustrating an example of the hardware configuration of the identification system according to the modification of the embodiment.
FIG. 11 is a flowchart illustrating an example of the main processing flow of a learning device according to the modification of the embodiment.
FIG. 12 is a diagram illustrating an example of a light field camera that uses multiple pinholes.
FIG. 13 is a schematic diagram illustrating an example of an image of a subject captured in the normal way (captured image).
FIG. 14 is a schematic diagram illustrating an example of an image of a subject captured using a light field camera including a multi-pinhole mask (calculated captured image).
FIG. 15A is a schematic diagram illustrating a captured image on which identification area frames are superimposed.
FIG. 15B is a schematic diagram showing only the identification area frames.
FIG. 16 is a schematic diagram illustrating an example of a correct identification result given as a mask on an image.
FIG. 17 is a flowchart illustrating an example of the operation flow of the image identification device according to the embodiment.
FIG. 18 is a schematic diagram illustrating an example of the functional configuration of the identification unit.
FIG. 19 is a schematic diagram of an example of a coded aperture mask that uses a random mask as the coded aperture.
FIG. 20 is a schematic diagram illustrating another example of the functional configuration of the identification unit.
FIG. 21A is a schematic diagram showing that the optical axis of the second image acquisition unit and the optical axis of the first image acquisition unit approximately coincide.
FIG. 21B is a schematic diagram showing that each optical axis of the stereo camera constituting the second image acquisition unit and the optical axis of the first image acquisition unit approximately coincide.
FIG. 22 is a schematic diagram showing that a beam splitter is used to make the optical axis of the first image acquisition unit coincide with the optical axis of the second image acquisition unit.
FIG. 23 is a schematic diagram illustrating an example of a multi-pinhole mask having three pinholes.
FIG. 24 is a schematic diagram illustrating an example of a calculated captured image captured using an imaging unit having the multi-pinhole mask of FIG. 23.
FIG. 25 is a schematic diagram showing the identification result of the identification unit for the calculated captured image of FIG. 24.
FIG. 26 is a schematic diagram for explaining the relationship between the distance of the centroid positions of the identification area frames and the distance from the imaging unit to the subject.
FIG. 27A is a schematic diagram illustrating an example of deep learning used as the classifier of the identification unit.
FIG. 27B is a schematic diagram illustrating another example of deep learning used as the classifier of the identification unit.
FIG. 28 is a schematic diagram for explaining processing for setting identification detection frame candidates in the scene corresponding to the calculated captured image of FIG. 24.
FIG. 29 is a schematic diagram illustrating an example of the functional configuration of an identification system having a depth estimation device, which is a modification of the embodiment.
FIG. 30 is a schematic diagram for explaining the distance between the centroids of the identification detection frames that correspond to the multiple pinholes for the same subject, and the distance of that subject from the imaging unit.

As described in the "Background Art" section, machine learning such as deep learning has made highly accurate identification by machines possible. Attempts are being made to apply such identification technology to the autonomous driving of vehicles and the operation of robots. Because vehicles and robots are moving bodies, they need to recognize surrounding objects from camera images while moving. A high identification processing speed is therefore required.

The technique disclosed in Non-Patent Document 1 requires high-resolution images to obtain high identification accuracy. Acquiring high-resolution image information requires an expensive camera, which makes the object identification system itself expensive. Beyond the cost of the camera, a high-resolution image also takes more processing, which may introduce processing delays.

Non-Patent Document 2 discloses a technique for a highly accurate identification system that uses depth information. Such a system requires an expensive three-dimensional range finder to acquire the depth information, which increases cost. Furthermore, this technique has to process the captured image and the depth information in association with each other, which increases the amount of processing. This is because the depth information from a three-dimensional range finder includes point cloud information consisting of many points obtained by, for example, radar scanning, and therefore has a large data size. In other words, using depth information from a three-dimensional range finder or the like as input in addition to image information increases the size of the neural network and lowers the identification processing speed.

Moreover, with the technique disclosed in Non-Patent Document 3, the amount of processing needed to restore a high-resolution image from a low-resolution image is enormous. The present inventors found the above problems in the techniques of Non-Patent Documents 1 to 3, studied techniques for improving identification processing speed while also improving identification accuracy, and devised the technology described below.

An identification system according to one aspect of the present disclosure includes: a camera that captures a calculated captured image including an imaging target and the surrounding environment of the imaging target; and a processing circuit that identifies the imaging target in the calculated captured image using an identification model. The identification model is a network having no pooling layer.

Because a calculated captured image can carry other information, such as depth information, in the image itself, object identification needs only the image itself as input and does not need large inputs such as point cloud information from a three-dimensional range finder. This keeps the neural network from growing in size and improves identification processing speed. Processing to restore a high-resolution image from a low-resolution image is not needed either, which also improves identification processing speed. In addition, because the calculated captured image makes other information such as depth information available, identification accuracy can be improved. It is thus possible to improve both the accuracy and the processing speed of object identification using an image.

In a calculated captured image, however, the imaging target and its surrounding environment are each superimposed multiple times, and other information such as depth information is encoded in the positional offsets between these copies on the image. If an identification model using a conventional neural network is used to identify the imaging target and its surrounding environment in such a calculated captured image, the positional offsets that were deliberately introduced to encode depth and other information are coarsened (in other words, absorbed), and that information is lost. This is because a conventional neural network is configured to acquire robustness to position on the image by repeating convolution layers and pooling layers.

In contrast, the identification model in this aspect is a network having no pooling layer, so the output of each convolution layer is used as is, which suppresses the loss of depth and other information carried by the positional offsets. Because a network without pooling layers can use the positional offsets in an image in which objects are superimposed with shifts, such as a calculated captured image, without coarsening them, it is a network that is sensitive to position information (in other words, a network that can retain position information). Such a network configuration makes it possible to acquire other information such as depth information efficiently and to improve object identification accuracy.
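The contrast described above can be illustrated with a toy sketch. This is an illustration only, under stated assumptions, and is not taken from the disclosure: the 9 × 9 single-pixel test images, the uniform 3 × 3 kernel, and the function names `conv2d`, `max_pool`, and `impulse` are all hypothetical. The sketch compares a stride-1 convolution followed by 2 × 2 max pooling with a stride-3 convolution that has no pooling, for one bright pixel shifted by one column.

```python
def conv2d(image, kernel, stride):
    """Valid 2-D convolution (cross-correlation) of a square kernel at the given stride."""
    k = len(kernel)
    rows = range(0, len(image) - k + 1, stride)
    cols = range(0, len(image[0]) - k + 1, stride)
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in cols] for r in rows]


def max_pool(feature, size=2, stride=2):
    """Max pooling over size x size windows at the given stride."""
    rows = range(0, len(feature) - size + 1, stride)
    cols = range(0, len(feature[0]) - size + 1, stride)
    return [[max(feature[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols] for r in rows]


def impulse(size, row, col):
    """size x size image that is zero except for a single 1 at (row, col)."""
    image = [[0] * size for _ in range(size)]
    image[row][col] = 1
    return image


# Assumed toy setup: a uniform 3x3 kernel and a pixel shifted by one column.
KERNEL = [[1] * 3 for _ in range(3)]
image_a = impulse(9, 3, 2)
image_b = impulse(9, 3, 3)

# Conventional style: 3x3 convolution at stride 1, then 2x2 pooling at stride 2.
pooled_a = max_pool(conv2d(image_a, KERNEL, stride=1))
pooled_b = max_pool(conv2d(image_b, KERNEL, stride=1))

# Pooling-free style: 3x3 convolution at stride 3 (non-overlapping inputs).
strided_a = conv2d(image_a, KERNEL, stride=3)
strided_b = conv2d(image_b, KERNEL, stride=3)

print(pooled_a == pooled_b)    # True: the one-pixel shift is absorbed
print(strided_a == strided_b)  # False: the shift changes the output
```

For this particular shift the pooled outputs are identical while the pooling-free outputs differ. With a uniform kernel, a shift that stays inside a single 3 × 3 block would not change the stride-3 output either; the weights of a trained network are not uniform, so this toy case only indicates the general tendency.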

For example, in the network of the identification model, the inputs to a convolution layer need not overlap.

With this configuration, because the inputs to the convolution layer do not overlap, growth in network size can be suppressed even for a network without pooling layers.

For example, in the network of the identification model, the stride of the convolution layer input may be 2 or more. Specifically, in the network of the identification model, the stride of the convolution layer input may be at least half the kernel size.

In this way, even when the inputs to the convolution layer partially overlap, a network without pooling layers can retain position information.
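The effect of the stride on overlap and on feature-map size can be sketched as follows. This is a minimal illustration that assumes unpadded ("valid") convolutions with square kernels; the 243-pixel input width and the function names are hypothetical and do not come from the disclosure.

```python
def conv_output_size(input_size, kernel, stride):
    """Output width of an unpadded convolution along one dimension."""
    return (input_size - kernel) // stride + 1


def window_overlap(kernel, stride):
    """Number of input pixels shared by two neighbouring kernel windows."""
    return max(kernel - stride, 0)


def stride_is_at_least_half_kernel(kernel, stride):
    """Check the condition 'stride >= kernel size / 2' without using floats."""
    return 2 * stride >= kernel


def feature_map_sizes(input_size, layers):
    """Widths after each (kernel, stride) layer, starting with the input width."""
    sizes = [input_size]
    for kernel, stride in layers:
        sizes.append(conv_output_size(sizes[-1], kernel, stride))
    return sizes


# Three 3x3 layers at stride 3 (non-overlapping inputs) shrink the map
# by a factor of 3 per layer without any pooling layer.
print(feature_map_sizes(243, [(3, 3)] * 3))  # [243, 81, 27, 9]

# At stride 2 neighbouring windows share one pixel column; at stride 1, two.
print(window_overlap(3, 3), window_overlap(3, 2), window_overlap(3, 1))  # 0 1 2
```

Under these assumptions a stride equal to the kernel size gives zero overlap, and a 3 × 3 kernel at stride 2 satisfies the half-kernel condition while still reducing the feature map at every layer.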

For example, the camera may capture, as the calculated captured image, an image containing parallax information in which the imaging target and its surrounding environment are each superimposed multiple times. Specifically, the camera may be a multi-pinhole camera, a coded aperture camera, a light field camera, or a lensless camera.

With this configuration, superimposing the imaging target and its surrounding environment multiple times adds depth information to the image itself.
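How superimposed copies can carry depth may be sketched with an idealized two-pinhole model. Everything in this sketch is an assumption made for illustration: the ideal pinhole geometry, the parameter names, and the one-dimensional image row are hypothetical and are not taken from the disclosure. For two pinholes separated by `pinhole_spacing` on a mask placed `mask_to_sensor` in front of the sensor, similar triangles give the separation on the sensor of the two projections of a point at distance `depth` from the mask.

```python
def image_separation(pinhole_spacing, mask_to_sensor, depth):
    """Sensor-plane separation of the two projections of one scene point.

    Idealized geometry: a point at distance `depth` in front of the mask is
    projected through each pinhole onto a sensor `mask_to_sensor` behind it.
    All three arguments use the same length unit.
    """
    return pinhole_spacing * (1.0 + mask_to_sensor / depth)


def superimpose(row, shift):
    """Toy two-pinhole image row: the row plus a copy shifted right by `shift` pixels."""
    out = list(row)
    for x, value in enumerate(row):
        if x + shift < len(row):
            out[x + shift] += value
    return out


# A nearer point produces copies that lie further apart than a distant point,
# so the offset between the superimposed copies encodes depth.
near = image_separation(pinhole_spacing=10.0, mask_to_sensor=2.0, depth=100.0)
far = image_separation(pinhole_spacing=10.0, mask_to_sensor=2.0, depth=1000.0)
print(near > far)  # True

# One bright pixel appears twice in the superimposed row.
print(superimpose([0, 5, 0, 0, 0, 0], shift=3))  # [0, 5, 0, 0, 5, 0]
```

In this model the separation approaches the pinhole spacing as depth grows, which is why a network that coarsens small positional offsets would discard the depth cue.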

An identification device according to one aspect of the present disclosure is an identification device including a memory and a processing circuit. The processing circuit acquires, from the memory, a calculated captured image including an imaging target and the surrounding environment of the imaging target, and identifies the imaging target in the calculated captured image using an identification model stored in the memory. The identification model is a network having no pooling layer.

This provides an identification device that improves both the accuracy and the processing speed of object identification using an image.

An identification method according to one aspect of the present disclosure acquires a calculated captured image including an imaging target and the surrounding environment of the imaging target, and identifies the imaging target in the calculated captured image using an identification model that is a network having no pooling layer. This provides an identification method that improves both the accuracy and the processing speed of object identification using an image.

A program according to one aspect of the present disclosure is a program for causing a computer to execute the above identification method.

This provides a program that improves both the accuracy and the processing speed of object identification using an image.

Note that these general or specific aspects may be implemented as a system, a device, a method, an integrated circuit, a computer program, or a recording medium such as a computer-readable recording disc, or as any combination of a system, a device, a method, an integrated circuit, a computer program, and a recording medium. Computer-readable recording media include non-volatile recording media such as a CD-ROM.

[Embodiment]
Hereinafter, an embodiment will be described with reference to the drawings. Each embodiment described below shows a general or specific example. The numerical values, shapes, components, arrangement and connection of components, steps (processes), order of steps, and the like shown in the following embodiment are examples and are not intended to limit the present disclosure. Among the components in the following embodiment, components not recited in the independent claims, which represent the broadest concept, are described as optional components. In the following description, expressions accompanied by "substantially", such as "substantially coincide", may be used. For example, "substantially coincide" means not only coinciding completely but also coinciding in substance, that is, including a difference of, for example, several percent. The same applies to other expressions accompanied by "substantially". Each figure is a schematic diagram and is not necessarily a strict illustration. Furthermore, in the figures, substantially identical components are given the same reference signs, and overlapping description may be omitted or simplified.

実施の形態に係る画像識別装置を説明する。 An image identification device according to an embodiment will be described.

図１は、実施の形態に係る画像識別装置１０を備える識別システム１の機能的な構成の一例を示す模式図である。 FIG. 1 is a schematic diagram illustrating an example of a functional configuration of an identification system 1 including an image identification device 10 according to an embodiment.

識別システム１は、撮像対象物及び撮像対象物の周辺環境を含む計算撮像画像を撮像するカメラと、識別モデルを用いて、計算撮像画像中の撮像対象物を識別する処理回路とを備える。当該識別モデル及び計算撮像画像については後述する。識別システム１は、当該処理回路を有する画像識別装置１０と、当該カメラとして撮像部１１とを備える。画像識別装置１０は、取得部１０１と、識別部１０２と、出力部１０３とを備える。識別システム１は、撮像部１１が取得する画像を用いて、当該画像に含まれる被写体を検出し、検出結果を出力する。画像における被写体の検出を、「識別」とも呼ぶ。 The identification system 1 includes a camera that captures a calculated captured image including an imaging object and the surrounding environment of the imaging object, and a processing circuit that identifies the imaging object in the calculated captured image using an identification model. The identification model and the calculated captured image will be described later. The identification system 1 includes an image identification device 10 having the processing circuit, and an imaging unit 11 serving as the camera. The image identification device 10 includes an acquisition unit 101, an identification unit 102, and an output unit 103. The identification system 1 uses the image acquired by the imaging unit 11 to detect a subject included in that image, and outputs the detection result. The detection of a subject in an image is also called “identification”.

識別システム１は、車両及びロボット等の移動体に搭載されてもよく、監視カメラシステム等の固定物に搭載されてもよい。本実施の形態では、識別システム１は、移動体の一例である自動車に搭載されるとして説明する。この場合、撮像部１１及び画像識別装置１０の両方が移動体に搭載されてもよい。又は、撮像部１１が移動体に搭載され、画像識別装置１０が移動体の外部に配置されてもよい。画像識別装置１０が配置される対象の例は、コンピュータ装置又は移動体の操作者の端末装置等である。端末装置の例は、移動体専用の操作用端末装置、又は、スマートフォン、スマートウォッチ及びタブレット等の汎用的な携帯端末装置等である。コンピュータ装置の例は、カーナビゲーションシステム、ＥＣＵ（Engine Control Unit）又はサーバ装置等である。 The identification system 1 may be mounted on a moving body such as a vehicle and a robot, or may be mounted on a fixed object such as a surveillance camera system. In the present embodiment, the identification system 1 will be described as being mounted on an automobile that is an example of a mobile object. In this case, both the imaging unit 11 and the image identification device 10 may be mounted on the moving body. Alternatively, the imaging unit 11 may be mounted on the moving body, and the image identification device 10 may be disposed outside the moving body. An example of a target on which the image identification device 10 is arranged is a computer device or a terminal device of an operator of a moving object. Examples of the terminal device are an operation terminal device dedicated to a mobile body, or a general-purpose portable terminal device such as a smartphone, a smart watch, and a tablet. Examples of the computer device are a car navigation system, an ECU (Engine Control Unit), a server device, or the like.

画像識別装置１０と撮像部１１とが離れて配置される場合、画像識別装置１０及び撮像部１１は、有線通信又は無線通信を介して通信してもよい。有線通信には、例えば、イーサネット（登録商標）規格に準拠したネットワーク等の有線ＬＡＮ（Local Area Network）及びその他のいかなる有線通信が適用されてもよい。無線通信には、第３世代移動通信システム（３Ｇ）、第４世代移動通信システム（４Ｇ）、又はＬＴＥ（登録商標）等のような移動通信システムで利用されるモバイル通信規格、Ｗｉ−Ｆｉ（登録商標）（Wireless Fidelity）などの無線ＬＡＮ、及び、Ｂｌｕｅｔｏｏｔｈ（登録商標）、ＺｉｇＢｅｅ（登録商標）等の近距離無線通信が適用されてもよい。 When the image identification device 10 and the imaging unit 11 are arranged apart from each other, the image identification device 10 and the imaging unit 11 may communicate via wired communication or wireless communication. For the wired communication, for example, a wired LAN (Local Area Network) such as a network conforming to the Ethernet (registered trademark) standard and any other wired communication may be applied. For wireless communication, a mobile communication standard used in a mobile communication system such as the third generation mobile communication system (3G), the fourth generation mobile communication system (4G), or LTE (registered trademark), Wi-Fi ( Wireless LAN such as registered trademark (Wireless Fidelity), and short-range wireless communication such as Bluetooth (registered trademark) and ZigBee (registered trademark) may be applied.

撮像部１１は、撮像対象物及び撮像対象物の周辺環境を含む計算撮像画像（computational imaging photography）を撮像する、つまり取得する。具体的には、撮像部１１は、計算撮像画像として、撮像対象物及び撮像対象物の周辺環境がそれぞれ複数重畳された視差情報を含んだ画像を撮像（取得）する。撮像部１１が取得する計算撮像画像を第２の計算撮像画像とも呼ぶ。第２の計算撮像画像は、物体の識別時に用いられる画像である。なお、計算撮像画像は、計算画像とも呼ばれる。例えば、撮像部１１は、所定の周期である第１の周期毎に第２の計算撮像画像を取得してもよいし、連続的に動画として第２の計算撮像画像を取得してもよい。撮像部１１は、時刻と対応付けられた第２の計算撮像画像を取得してもよい。撮像部１１のハードウェアの例はカメラであり、具体的にはマルチピンホールカメラ、ＣｏｄｅｄＡｐｅｒｔｕｒｅカメラ、ライトフィールドカメラ、又は、レンズレスカメラ等である。このようなカメラである場合、撮像部１１は、後述するように、１回の撮像動作で被写体についての複数の画像を同時に取得することができる。なお、撮像部１１は、例えば、撮像部１１が備える撮像素子の撮像領域、つまり受光領域を変化させることによって、上記の複数の画像を複数回の撮像動作で取得してもよい。撮像部１１は、取得した第２の計算撮像画像を、画像識別装置１０の取得部１０１に出力する。 The imaging unit 11 captures, that is, acquires, a calculated captured image (computational imaging photography) including an imaging object and the surrounding environment of the imaging object. Specifically, the imaging unit 11 captures (acquires), as the calculated captured image, an image containing parallax information in which multiple copies of the imaging object and of its surrounding environment are superimposed. The calculated captured image acquired by the imaging unit 11 is also referred to as a second calculated captured image. The second calculated captured image is the image used when identifying an object. Note that a calculated captured image is also called a calculated image. For example, the imaging unit 11 may acquire the second calculated captured image at every first cycle, which is a predetermined cycle, or may acquire second calculated captured images continuously as a moving image. The imaging unit 11 may acquire a second calculated captured image associated with a time. An example of the hardware of the imaging unit 11 is a camera, specifically a multi-pinhole camera, a coded aperture camera, a light field camera, a lensless camera, or the like. With such a camera, the imaging unit 11 can simultaneously acquire a plurality of images of the subject in a single imaging operation, as will be described later. Note that the imaging unit 11 may instead acquire the plurality of images through a plurality of imaging operations, for example by changing the imaging region, that is, the light receiving region, of the image sensor included in the imaging unit 11. The imaging unit 11 outputs the acquired second calculated captured image to the acquisition unit 101 of the image identification device 10.

なお、撮像部１１は、物体の識別時に用いられる第２の計算撮像画像だけでなく、後述する図９等で説明する学習時に用いられる第１の計算撮像画像を取得し、取得した第１の計算撮像画像を、学習装置１２の第一画像取得部１２１（図９参照）に出力してもよい。 Note that the imaging unit 11 may acquire not only the second calculated captured image used for object identification but also the first calculated captured image used for learning, which is described later with reference to FIG. 9 and other figures, and may output the acquired first calculated captured image to the first image acquisition unit 121 (see FIG. 9) of the learning device 12.

ここで、計算撮像画像と通常撮像画像とを説明する。通常撮像画像は、光学系を通して撮像される画像である。通常撮像画像は、通常、光学系により集光された物体からの光を結像（imaging）することによって、取得される。光学系の一例は、レンズである。物体と像内の像点（image point）とを入れ替えて、像点に物体を配置することにより、物体と像内の像点とを入れ替える前と同じ光学系で元の物体の位置に像点ができるような物体の点と像点との位置関係を共役（conjugate）と呼ぶ。本明細書において、このように共役関係にある状態で撮像された画像は、通常撮像画像（又は撮像画像）と表記する。物体が存在する環境下で、人が物体を直接見たとき、人は通常撮像画像とほぼ同様の状態で当該物体を知覚する。言い換えると、人は、通常のデジタルカメラで撮像された通常撮像画像を、実空間の状態と同様に視覚的に認識する。 Here, the calculated captured image and the normal captured image will be described. A normal captured image is an image captured through an optical system. A normal captured image is usually acquired by imaging light from an object that has been collected by the optical system. An example of the optical system is a lens. When an object and its image point are interchanged, so that the object is placed at the image point, and the same optical system then forms an image point at the original position of the object, the positional relationship between the object point and the image point is called conjugate. In this specification, an image captured in such a conjugate state is referred to as a normal captured image (or simply a captured image). When a person directly views an object in the environment where the object exists, the person perceives the object in almost the same way as it appears in a normal captured image. In other words, a person visually recognizes a normal captured image taken by an ordinary digital camera in the same way as the real space itself.

一方、計算撮像画像は、例えばマルチピンホールを用いることで複数の画像がずれて重畳されたものであり、人によって実空間の状態と同様に視覚的に認識できない画像である。ただし、計算撮像画像は、人が視覚的に認識できない画像であり得るが、コンピュータ処理を用いれば、撮像対象物及び周辺環境等の画像に含まれる情報の取得が可能である画像である。計算撮像画像は、画像を復元することによって人が認識できるように視覚化されることができる。計算撮像画像の例は、マルチピンホール又はマイクロレンズを用いて撮像されたライトフィールド画像、時空間で画素情報を重み付け加算して撮像された圧縮センシング画像、又は、符号化絞りとコード化されたマスクとを使用して撮像されたＣｏｄｅｄＡｐｅｒｔｕｒｅ画像（符号化開口画像）などの符号化画像である。例えば、非特許文献３には、圧縮センシング画像の例が示されている。また、計算撮像画像の他の例は、非特許文献４及び非特許文献５に示されるような、屈折による結像光学系を有しないレンズレスカメラを使用して撮像された画像である。上記のいずれの計算撮像画像も、既知な技術であるため、その詳細な説明を省略する。 On the other hand, a calculated captured image is one in which a plurality of images are shifted and superimposed, for example by using multiple pinholes, and is an image that a person cannot visually recognize in the same way as the real space. However, although a calculated captured image may be visually unrecognizable to a person, the information contained in the image, such as the imaging object and the surrounding environment, can be obtained by computer processing. A calculated captured image can be visualized in a form that a person can recognize by restoring the image. Examples of the calculated captured image include a light field image captured using multiple pinholes or microlenses, a compressed sensing image captured by weighted addition of pixel information in space-time, and a coded image such as a coded aperture image captured using a coded aperture and a coded mask. For example, Non-Patent Document 3 shows an example of a compressed sensing image. Another example of the calculated captured image is an image captured using a lensless camera that has no refractive imaging optical system, as shown in Non-Patent Document 4 and Non-Patent Document 5. Since all of the above calculated captured images are known techniques, a detailed description thereof is omitted.

例えば、ライトフィールド画像には、各画素に、画像値に加えて、奥行情報も含まれる。ライトフィールド画像は、撮像素子の前に配置された複数のピンホール又はマイクロレンズを介して、撮像素子によって取得された画像である。複数のピンホール及びマイクロレンズは、撮像素子の受光面に沿って平面的に配置され、例えば、格子状に配置される。撮像素子は、その全体での１回の撮像動作において、複数のピンホール又はマイクロレンズのそれぞれを通じて複数の像を同時に取得する。複数の像は、異なる視点から撮像された像である。このような複数の像と視点との位置関係から、被写体の奥行方向の距離の取得が可能である。撮像素子の例は、ＣＭＯＳ（Complementary Metal Oxide Semiconductor）イメージセンサ又はＣＣＤ（Charge-Coupled Device）イメージセンサ等のイメージセンサである。 For example, a light field image includes depth information in addition to an image value for each pixel. The light field image is an image acquired by the image sensor through a plurality of pinholes or microlenses arranged in front of the image sensor. The plurality of pinholes and microlenses are arranged in a plane along the light receiving surface of the image sensor, for example, arranged in a lattice shape. The imaging device simultaneously acquires a plurality of images through each of a plurality of pinholes or microlenses in one imaging operation as a whole. The plurality of images are images taken from different viewpoints. The distance in the depth direction of the subject can be acquired from the positional relationship between the plurality of images and the viewpoint. An example of the image sensor is an image sensor such as a complementary metal oxide semiconductor (CMOS) image sensor or a charge-coupled device (CCD) image sensor.

圧縮センシング画像は、圧縮センシングの対象画像である。圧縮センシングの対象画像の例は、レンズレスカメラで撮像された画像である。レンズレスカメラは、屈折による結像光学系を有さず、撮像素子の前に配置されたマスクを介して、画像を取得する。マスクは、透過率が異なる複数の領域を、例えば格子状に含む。このようなマスクを通して撮影することで、様々な方向からの光線（ライトフィールド画像）をマスクによってコード化して撮像することができる。圧縮センシングでは、このマスク情報を利用することで、コード化されたライトフィールド画像から、所望の方向の光線のみの画像、又は、すべての距離に焦点が合った全焦点画像を取得することができ、さらには奥行情報を取得することができる。 The compressed sensing image is a target image for compressed sensing. An example of a compression sensing target image is an image captured by a lensless camera. The lensless camera does not have an image forming optical system by refraction, and acquires an image via a mask arranged in front of the image sensor. The mask includes a plurality of regions having different transmittances, for example, in a lattice shape. By photographing through such a mask, light rays (light field images) from various directions can be coded and imaged by the mask. In compressed sensing, by using this mask information, it is possible to acquire an image of only the light beam in the desired direction or an omnifocal image focused at all distances from the coded light field image. Furthermore, depth information can be acquired.

また、このようなマスクをカメラの開口部に絞りとして設置して撮影した画像はＣｏｄｅｄＡｐｅｒｔｕｒｅ画像（符号化開口画像）と呼ばれる。 An image captured by setting such a mask as a diaphragm at the opening of the camera is called a coded aperture image (coded aperture image).

このように、計算撮像画像（第１の計算撮像画像及び第２の計算撮像画像）は、撮像対象物及び撮像対象物の周辺環境がそれぞれ複数重畳された視差情報を含んだ画像であり、具体的には、マルチピンホールカメラ、ＣｏｄｅｄＡｐｅｒｔｕｒｅカメラ、ライトフィールドカメラ、又は、レンズレスカメラによる撮像対象物及び撮像対象物の周辺環境の撮像により得られる画像である。 As described above, the calculated captured image (the first calculated captured image and the second calculated captured image) is an image including parallax information in which a plurality of imaging objects and a plurality of surrounding environments of the imaging objects are superimposed. Specifically, it is an image obtained by imaging the imaging object and the surrounding environment of the imaging object by a multi-pinhole camera, a coded aperture camera, a light field camera, or a lensless camera.
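As an illustration of "multiple shifted copies superimposed", the following is a minimal sketch of how a multi-pinhole observation could be simulated. It is not the imaging model of the disclosure: the function name `multi_pinhole_image`, the toy scene, and the shift list are assumptions made for this example, and a single depth plane is assumed so that every scene point receives the same shift. In a real multi-pinhole camera the shift depends on the distance to the subject, which is what makes the overlap carry parallax information.

```python
import numpy as np

def multi_pinhole_image(scene, shifts):
    """Average copies of `scene`, one per pinhole, each translated by (dy, dx).

    Assumption: all scene points lie at one depth, so each pinhole
    contributes a uniformly shifted copy. np.roll wraps around at the
    borders, which is acceptable only for this toy example.
    """
    acc = np.zeros_like(scene, dtype=float)
    for dy, dx in shifts:
        acc += np.roll(scene, shift=(dy, dx), axis=(0, 1))
    return acc / len(shifts)

# Toy scene: a single bright point.
scene = np.zeros((5, 7))
scene[2, 1] = 1.0

# Two pinholes: one without shift, one shifted by three columns.
observed = multi_pinhole_image(scene, shifts=[(0, 0), (0, 3)])
```

In the result the single point appears twice, at columns 1 and 4, each with half the intensity; the spacing between the two copies is the positional shift that later paragraphs describe as carrying depth information.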

画像識別装置１０の取得部１０１は、撮像部１１から第２の計算撮像画像を取得し、識別部１０２に出力する。また、取得部１０１は、識別部１０２が識別のために用いる識別器を取得してもよく、取得した識別器を識別部１０２に出力してもよい。画像識別装置１０が移動体に搭載される場合、取得部１０１は、移動体から、移動体の速度を取得してもよい。取得部１０１は、移動体の速度をリアルタイムに取得してもよく、定期的に取得してもよい。例えば、取得部１０１は、移動体が速度計を備える場合、速度計から速度を取得してもよく、また、移動体が備えるコンピュータであって、速度計から速度情報を受信するコンピュータから速度を取得してもよい。また、例えば、取得部１０１は、移動体が速度計を備えない場合、移動体が備えるＧＰＳ（Global Positioning System）装置、加速度計及び角速度計などの慣性計測装置等から速度に関連する情報を取得してもよい。 The acquisition unit 101 of the image identification device 10 acquires the second calculated captured image from the imaging unit 11 and outputs it to the identification unit 102. The acquisition unit 101 may also acquire the classifier that the identification unit 102 uses for identification, and may output the acquired classifier to the identification unit 102. When the image identification device 10 is mounted on a moving body, the acquisition unit 101 may acquire the speed of the moving body from the moving body. The acquisition unit 101 may acquire the speed of the moving body in real time or periodically. For example, when the moving body includes a speedometer, the acquisition unit 101 may acquire the speed from the speedometer, or may acquire the speed from a computer that is provided in the moving body and receives speed information from the speedometer. When the moving body does not include a speedometer, the acquisition unit 101 may, for example, acquire speed-related information from a GPS (Global Positioning System) device provided in the moving body, or from an inertial measurement device such as an accelerometer or an angular velocity meter.

識別部１０２は、取得部１０１から第２の計算撮像画像を取得する。識別部１０２は、例えば取得部１０１から取得した識別器を含む。識別器は、画像から対象物の情報を取得するための識別モデルであって、識別部１０２が識別のために用いるデータである。識別器は、機械学習を用いて構築される。計算撮像画像を学習用データとして用いて機械学習することによって、識別性能を向上した識別器の構築が可能である。なお、学習用データとして機械学習のために用いられる計算撮像画像を第１の計算撮像画像とも呼ぶ。本実施の形態では、識別器に適用される機械学習モデルは、ＤｅｅｐＬｅａｒｎｉｎｇ（深層学習）等のニューラルネットワークを用いた機械学習モデルであるが、他の学習モデルであってもよい。例えば、機械学習モデルは、ＲａｎｄｏｍＦｏｒｅｓｔ、又はＧｅｎｅｔｉｃＰｒｏｇｒａｍｍｉｎｇ等を用いた機械学習モデルであってもよい。 The identification unit 102 acquires the second calculated captured image from the acquisition unit 101. The identification unit 102 includes, for example, a classifier acquired from the acquisition unit 101. The discriminator is an identification model for acquiring object information from an image, and is data used by the discriminator 102 for discrimination. The classifier is constructed using machine learning. It is possible to construct a discriminator with improved discrimination performance by machine learning using the calculated captured image as learning data. The calculated captured image used for machine learning as the learning data is also referred to as a first calculated captured image. In the present embodiment, the machine learning model applied to the discriminator is a machine learning model using a neural network such as deep learning, but may be another learning model. For example, the machine learning model may be a machine learning model using Random Forest, Genetic Programming, or the like.

識別部１０２は、識別器を用いて、第２の計算撮像画像中の物体（撮像対象物及び撮像対象物の周辺環境）の情報を取得する。具体的には、識別部１０２は、第２の計算撮像画像に含まれる物体を識別し、且つ、第２の計算撮像画像中の物体の位置を取得する。つまり、物体の情報は、物体の存在の有無と、物体の位置とを含む。物体の位置は、画像上における平面的な位置と、画像の奥行方向の位置とを含んでもよい。例えば、識別部１０２は、識別器を用いて、第２の計算撮像画像の少なくとも１つの画素毎に、物体が存在するか否かを識別する。識別部１０２は、第２の計算撮像画像中の物体の位置として、物体が存在することが識別された少なくとも１つの画素の位置を取得する。ここで、本明細書における物体の識別とは、第２の計算撮像画像において、物体が存在する画素を検出することを含む。 The identification unit 102 acquires information on the object (the imaging target object and the surrounding environment of the imaging target object) in the second calculated captured image using the classifier. Specifically, the identification unit 102 identifies an object included in the second calculated captured image, and acquires the position of the object in the second calculated captured image. That is, the object information includes the presence / absence of the object and the position of the object. The position of the object may include a planar position on the image and a position in the depth direction of the image. For example, the identification unit 102 identifies whether or not an object exists for each at least one pixel of the second calculated captured image using a classifier. The identification unit 102 acquires the position of at least one pixel that is identified as the presence of the object as the position of the object in the second calculated captured image. Here, the identification of the object in this specification includes detecting a pixel in which the object exists in the second calculated captured image.
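The per-pixel identification described above can be sketched as follows. This is only an illustration under assumptions: the classifier is assumed to output one score per pixel, and the function name `object_pixel_positions` and the threshold of 0.5 are hypothetical choices that do not appear in the disclosure.

```python
import numpy as np

def object_pixel_positions(score_map, threshold=0.5):
    """Return (row, col) positions of pixels judged to contain an object.

    `score_map` is assumed to be a per-pixel score produced by a classifier;
    the threshold is an illustrative choice.
    """
    mask = score_map >= threshold
    return [tuple(int(v) for v in rc) for rc in np.argwhere(mask)]

# Hypothetical per-pixel scores for a 2 x 3 image.
scores = np.array([[0.1, 0.9, 0.2],
                   [0.0, 0.8, 0.7]])
positions = object_pixel_positions(scores)
```

Here the object is considered present if the returned list is non-empty, and the list itself serves as the planar position of the object in the image.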

例えば、識別システム１が自動車に搭載される場合、物体の例は、人物、自動車、自転車又は信号である。なお、識別部１０２は、第２の計算撮像画像を用いて、あらかじめ定められた１種類の物体を識別してもよく、複数の種類の物体を識別してもよい。また、識別部１０２は、人物、自動車又は自転車を含む移動体などのカテゴリ単位で、物体を識別してもよい。このとき、識別する物体の種類（カテゴリ）に応じた識別器が用いられてもよい。識別器は、例えば画像識別装置１０が有するメモリ（例えば後述する第一メモリ２０３）に記録される。 For example, when the identification system 1 is mounted on a car, examples of the object are a person, a car, a bicycle, or a signal. Note that the identifying unit 102 may identify a predetermined type of object or a plurality of types of objects using the second calculated captured image. The identification unit 102 may identify an object in units of categories such as a moving object including a person, a car, or a bicycle. At this time, a discriminator corresponding to the type (category) of the object to be identified may be used. The discriminator is recorded in, for example, a memory (for example, a first memory 203 described later) included in the image discrimination device 10.

例えば、ライトフィールド画像には、画像値に加えて、各画素の被写体の奥行情報も含まれる。また、非特許文献２にも記載されるように、被写体の奥行情報を学習データに用いることは、識別器の識別能力向上に有効である。例えば、画像において小さく写っている物体が、遠方に存在する被写体であることを認識でき、ゴミとして認識されない（つまり無視されてしまう）ことを抑制できる。このため、ライトフィールド画像を使用した機械学習により構築された識別器は、その識別性能を向上することができる。同様に、圧縮センシング画像及び符号化開口画像を用いた機械学習も、識別器の識別性能の向上に有効である。 For example, the light field image includes depth information of the subject of each pixel in addition to the image value. Further, as described in Non-Patent Document 2, the use of subject depth information as learning data is effective in improving the discrimination capability of the discriminator. For example, it is possible to recognize that an object that is small in the image is a subject that exists in the distance, and it is possible to prevent the object from being recognized as dust (that is, ignored). For this reason, a discriminator constructed by machine learning using a light field image can improve its discrimination performance. Similarly, machine learning using a compressed sensing image and a coded aperture image is also effective in improving the identification performance of the classifier.

本実施の形態の認識器に適用される機械学習モデルは、撮像部１１の特性を利用したＤｅｅｐＬｅａｒｎｉｎｇ（深層学習）等のニューラルネットワークを用いた機械学習モデルで実現される。このネットワークに関して説明する。 The machine learning model applied to the recognizer of the present embodiment is realized by a machine learning model using a neural network such as deep learning using the characteristics of the imaging unit 11. This network will be described.

従来のニューラルネットワークでは、畳み込み層とプーリング層を繰り返すことで画像上の位置に対するロバスト性を取得する。畳み込み層は、元の画像からフィルタにより特徴を抽出する層であり、プーリング層は、特徴として重要な情報を残しつつ元の画像を縮小する層である。例えば、画像から物体を識別するネットワークの場合、識別対象の被写体が画像上で位置を変えたとしても、その物体の位置ずれをプーリング層によって吸収することができる。 In a conventional neural network, robustness with respect to a position on an image is acquired by repeating a convolution layer and a pooling layer. The convolution layer is a layer that extracts features from the original image using a filter, and the pooling layer is a layer that reduces the original image while leaving important information as features. For example, in the case of a network that identifies an object from an image, even if the subject to be identified changes its position on the image, the displacement of the object can be absorbed by the pooling layer.

図２は、従来のネットワーク構成の模式図であり、図３は、従来のネットワークのパラメータの一例を示す図である。 FIG. 2 is a schematic diagram of a conventional network configuration, and FIG. 3 is a diagram illustrating an example of parameters of a conventional network.

このネットワークは、焦点位置を変更して撮影したＦｏｃａｌＳｔａｃｋ画像から奥行を抽出するネットワークである。このネットワークは、例えば、図２に示すように２６層の畳み込み層とプーリング層などで構成される。なお、詳細な説明は省略するが、このネットワークでは、ネットワークが大きくなり過ぎることを抑制するために、部分的に画像を切りだしてリシェイプをしており、また、畳み込み層及びプーリング層によって抽出した特徴量を利用して、元のサイズと同じ解像度での識別結果を出力するためにアップサンプリングをしている。例えば、畳み込み層はすべてカーネルサイズ３×３、ストライド１であり、また、プーリング層はすべてカーネルサイズ２×２、ストライド２である。これにより、このネットワークは画像上の位置に対するロバスト性を実現する。これについて、図４Ａ〜図４Ｆを用いて説明する。 This network is a network that extracts a depth from a Focal Stack image captured by changing the focal position. This network is composed of, for example, 26 convolutional layers and pooling layers as shown in FIG. Although detailed explanation is omitted, in this network, in order to prevent the network from becoming too large, the image is partially cut out and reshaped, and extracted by the convolution layer and the pooling layer. Upsampling is performed to output the identification result at the same resolution as the original size using the feature amount. For example, the convolution layers are all kernel size 3 × 3 and stride 1, and the pooling layers are all kernel size 2 × 2 and stride 2. Thereby, this network realizes robustness with respect to the position on the image. This will be described with reference to FIGS. 4A to 4F.

図４Ａ〜図４Ｆは、カーネルサイズ３×３、ストライド１の畳み込み層とカーネルサイズ２×２、ストライド２のプーリング層を組み合わせることで、画像上の位置に対するロバスト性が実現できることを説明するための模式図である。 4A to 4F are diagrams for explaining that the robustness to the position on the image can be realized by combining the convolutional layer of the kernel size 3 × 3 and the stride 1 and the pooling layer of the kernel size 2 × 2 and the stride 2. It is a schematic diagram.

図４Ａ〜図４Ｆにおいて、畳み込み層とプーリング層の処理におけるデータが格納されたメモリを模式的に正方形で示している。図４Ａには、６×８個のデータで表された入力層４００、入力層４００に対して畳み込み処理が行なわれて出力される３×４個のデータで表された中間出力層４２０、中間出力層４２０に対してプーリング処理が行なわれて出力される３×４個のデータで表された中間出力層４３０を示している。なお、以下では、データ及び当該データが格納されるメモリについて同じ符号を付けて説明する場合がある。 4A to 4F, the memory in which data in the processing of the convolution layer and the pooling layer is stored is schematically shown as a square. FIG. 4A shows an input layer 400 represented by 6 × 8 data, an intermediate output layer 420 represented by 3 × 4 data output by performing convolution processing on the input layer 400, An intermediate output layer 430 represented by 3 × 4 pieces of data output by performing a pooling process on the output layer 420 is shown. In the following description, data and a memory storing the data may be described with the same reference numerals.

まず、畳み込み処理を説明する。上述したように、従来のネットワークでは畳み込み層のカーネルサイズを３×３としているため、図４Ｂにおいて、入力層４００の例えば左上の３×３のデータ４０１に対して畳み込み処理が行なわれ、その結果がメモリ４２１に格納される。次に、上述したように、従来のネットワークでは畳み込み層の入力のストライドを１としているため、図４Ｃにおいて、入力層４００のデータ４０１から右に１列ずらした３×３のデータ４０２に対して畳み込み処理が行なわれ、その結果がメモリ４２２に格納される。同じように、図４Ｄにおいて、入力層４００のデータ４０１から下に１行ずらした３×３のデータ４０３に対して畳み込み処理が行なわれ、その結果がメモリ４２３に格納される。同じように、図４Ｅにおいて、入力層４００のデータ４０２から下に１行ずらした３×３のデータ４０４に対して畳み込み処理が行なわれ、その結果がメモリ４２４に格納される。 First, the convolution process will be described. As described above, since the kernel size of the convolution layer is 3 × 3 in the conventional network, the convolution processing is performed on, for example, 3 × 3 data 401 in the upper left of the input layer 400 in FIG. Is stored in the memory 421. Next, as described above, in the conventional network, the input stride of the convolution layer is set to 1, so in FIG. 4C, the 3 × 3 data 402 shifted by one column to the right from the data 401 of the input layer 400 is used. A convolution process is performed, and the result is stored in the memory 422. Similarly, in FIG. 4D, a convolution process is performed on 3 × 3 data 403 shifted one row downward from data 401 in input layer 400, and the result is stored in memory 423. Similarly, in FIG. 4E, a convolution process is performed on 3 × 3 data 404 shifted one row downward from data 402 of input layer 400, and the result is stored in memory 424.

次に、プーリング処理について説明する。上述したように、従来のネットワークはプーリング層を有しプーリング層のカーネルサイズを２×２としているため、図４Ｆにおいて、中間出力層４２０の２×２のデータ４２５に対してプーリング処理が実施され、その出力が中間出力層４３０のメモリ４３１に格納される。つまり、中間出力層４３０のデータ４３１では、データ４０５で示した４×４のデータに対して、位置ずれを許した３×３の特徴量を計算しており、このことが画像上の位置に対するロバスト性を生みだしている。位置ずれがあってもプーリング層によって当該位置ずれが荒く整理され吸収されるためである。さらに、ネットワークにおいて、畳み込み層及びプーリング層を重ねることにより、より広い範囲での位置に対するロバスト性が実現される。 Next, the pooling process will be described. As described above, since the conventional network has a pooling layer and the kernel size of the pooling layer is 2 × 2, the pooling process is performed on the 2 × 2 data 425 of the intermediate output layer 420 in FIG. 4F. The output is stored in the memory 431 of the intermediate output layer 430. That is, in the data 431 of the intermediate output layer 430, the 3 × 3 feature amount allowing the positional deviation is calculated with respect to the 4 × 4 data indicated by the data 405, and this corresponds to the position on the image. Produces robustness. This is because even if there is a misalignment, the misalignment is roughly organized and absorbed by the pooling layer. Furthermore, in the network, robustness with respect to a position in a wider range is realized by overlapping the convolution layer and the pooling layer.
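The statement that one pooled value summarizes a 4 × 4 input region can be checked with a standard receptive-field calculation. The sketch below assumes layers without dilation; the function name `receptive_field` and the list-of-tuples layer description are conventions chosen for this example only.

```python
def receptive_field(layers):
    """Side length of the input region seen by one output value.

    `layers` lists (kernel, stride) pairs from the input side onward.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # widen by the kernel extent at current spacing
        jump *= stride              # spacing between adjacent outputs, in input pixels
    return rf

# Conventional block: 3x3 convolution (stride 1) followed by 2x2 pooling (stride 2).
conventional_rf = receptive_field([(3, 1), (2, 2)])
```

With these parameters `conventional_rf` is 4, matching the 4 × 4 data 405 that feeds the single pooled value 431.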

位置に対するロバスト性は、通常のテクスチャベースの識別処理には有効である。例えば、識別対象の被写体が画像上での位置が変わったとしても、当該ロバスト性により識別処理を効果的に行うことができるためである。これに対して、本実施の形態では、撮像部１１が計算撮像画像を取得しており、通常のテクスチャベースの識別処理とは異なる。上述したように、計算撮像画像では、位置ずれの情報に、識別で有効となる奥行情報が含まれる。具体的には、計算撮像画像では、撮像対象物及び撮像対象物の周辺環境がそれぞれ複数重畳され、画像上のこれらの位置ずれに奥行情報等を付加している。しかし、この情報は、位置に対するロバスト性が高いネットワークでは取得することが難しい。従来のニューラルネットワークを用いた識別モデルを用いてこのような計算撮像画像中の撮像対象物及び撮像対象物の周辺環境を識別しようとすると、奥行情報等を付加するために敢えて生じさせていた位置ずれが荒く整理されてしまい（言い換えると位置ずれが吸収されてしまい）、奥行情報等が失われるためである。このように、撮像部１１が計算撮像画像を取得する場合には、大きな問題となる。 Robustness with respect to position is useful for normal texture-based identification processing. For example, even if the position of the subject to be identified changes on the image, the identification process can be performed effectively due to the robustness. On the other hand, in the present embodiment, the imaging unit 11 acquires a calculated captured image, which is different from normal texture-based identification processing. As described above, in the calculated captured image, the positional deviation information includes depth information that is effective for identification. Specifically, in the calculated captured image, a plurality of imaging objects and surrounding environments of the imaging object are respectively superimposed, and depth information or the like is added to these positional shifts on the image. However, it is difficult to acquire this information in a network with high robustness to the position. Positions that were generated intentionally to add depth information etc. when trying to identify the imaging object in such a calculated captured image and the surrounding environment of the imaging object using an identification model using a conventional neural network This is because the displacement is roughly arranged (in other words, the displacement is absorbed) and the depth information is lost. As described above, when the imaging unit 11 acquires a calculated captured image, it becomes a big problem.
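The loss of positional shifts through pooling can be made concrete with a small sketch. It uses max pooling as a representative pooling operation, which is an assumption, since the text does not name the pooling type; the helper `max_pool_2x2` is written for this example only.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (odd trailing rows/columns are dropped)."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# The same feature at two positions, one column apart.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[0, 1] = 1.0

same_after_pooling = np.array_equal(max_pool_2x2(a), max_pool_2x2(b))
```

Both inputs pool to the same 2 × 2 map, so the one-column shift is no longer recoverable after pooling. For ordinary images this is the desired robustness; for a calculated captured image that shift is part of the signal that encodes depth. A shift that crosses a pooling-window boundary would still change the output, so the sketch shows the loss only for shifts within a window.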

そこで、本実施の形態における識別器は、プーリング層を有しないネットワークで構成される。当該ネットワークは、位置情報に敏感なネットワーク（言い換えると位置情報を保持可能なネットワーク）となる。 Therefore, the discriminator in the present embodiment is configured by a network that does not have a pooling layer. The network is a network sensitive to position information (in other words, a network capable of holding position information).

図５は、実施の形態におけるネットワーク構成の模式図であり、図６は、実施の形態におけるネットワークのパラメータの一例を示す図である。 FIG. 5 is a schematic diagram of a network configuration in the embodiment, and FIG. 6 is a diagram illustrating an example of network parameters in the embodiment.

実施の形態における識別器のネットワークは、例えば、図５に示すように１８層の畳み込み層などで構成され、プーリング層を有しない。なお、従来のネットワークと同じように、実施の形態におけるネットワークでは、リシェイプ及びアップサンプリングをしているが、しなくてもよい。本実施の形態における識別器のネットワークにおいて、畳み込み層の入力は重ならない。例えば、アップサンプリング前の畳み込み層はすべてカーネルサイズ３×３、ストライド３であり、畳み込み層の入力は重ならないようになっている。これにより、実施の形態におけるネットワークは、位置情報に敏感なネットワークとなる。これについて、図７Ａ〜図７Ｆを用いて説明する。 The network of discriminators in the embodiment is composed of, for example, 18 convolution layers as shown in FIG. 5, and does not have a pooling layer. As with the conventional network, the network in the embodiment performs reshaping and upsampling, but it may not be performed. In the classifier network in this embodiment, the convolutional layer inputs do not overlap. For example, the convolution layers before upsampling all have a kernel size of 3 × 3 and a stride of 3, so that the inputs of the convolution layers do not overlap. Thereby, the network in the embodiment is a network sensitive to position information. This will be described with reference to FIGS. 7A to 7F.

図７Ａ〜図７Ｆは、カーネルサイズ３×３、ストライド３の畳み込み層を利用し、プーリング層を持たないネットワークが、画像上の位置に対する敏感性を有することを説明するための模式図である。 7A to 7F are schematic diagrams for explaining that a network using a convolutional layer having a kernel size of 3 × 3 and stride 3 and having no pooling layer has sensitivity to a position on an image.

図７Ａ〜図７Ｆにおいて、畳み込み層の処理におけるデータが格納されたメモリを模式的に正方形で示している。図７Ａには、６×８個のデータ（メモリ）で表された入力層４００、入力層４００に対して畳み込み処理が行なわれて出力される３×４個のデータで表された中間出力層４４０を示している。 In FIGS. 7A to 7F, the memory in which data in the process of the convolution layer is stored is schematically shown as a square. FIG. 7A shows an input layer 400 represented by 6 × 8 data (memory), and an intermediate output layer represented by 3 × 4 data that is output by performing convolution processing on the input layer 400. 440 is shown.

上述したように、本実施の形態のネットワークでは畳み込み層のカーネルサイズを３×３としているため、図７Ｂにおいて、入力層４００の例えば左上の３×３のデータ４０１に対して畳み込み処理が行なわれ、その結果がメモリ４４１に格納される。次に、上述したように、本実施の形態のネットワークでは畳み込み層の入力のストライドを３としているため、図７Ｃにおいて、入力層４００のデータ４０１から右に３列ずらした、すなわちデータ４０１とは重ならない３×３のデータ４０６に対して畳み込み処理が行なわれ、その結果がメモリ４４２に格納される。同じように、図７Ｄにおいて、入力層４００のデータ４０１から下に３行ずらした、すなわちデータ４０１ともデータ４０６とも重ならない３×３のデータ４０７に対して畳み込み処理が行なわれ、その結果がメモリ４４３に格納される。同じように、図７Ｅにおいて、入力層４００のデータ４０６から下に３行ずらした、すなわち各畳み込み層の入力データであるデータ４０１ともデータ４０６ともデータ４０７とも重ならない３×３のデータ４０８に対して畳み込み処理が行なわれ、その結果がメモリ４４４に格納される。また、本実施の形態のネットワークはプーリング層を有しないため、プーリング処理を行なわない。プーリング処理が行われないことで、図７Ｆにおいて、中間出力層４４０のデータ４４１は入力層４００のデータ４０１から抽出した特徴を保持し、中間出力層４４０のデータ４４２は入力層４００のデータ４０６から抽出した特徴を保持し、中間出力層４４０のデータ４４３は入力層４００のデータ４０７から抽出した特徴を保持し、中間出力層４４０のデータ４４４は入力層４００のデータ４０８から抽出した特徴を保持する。また、入力層４００のデータ４０１、４０６、４０７及び４０８は互いに重ならないため、中間出力層４４０のデータ４４１、４４２、４４３及び４４４をそれぞれ重複のない別情報とすることができ、かつ、プーリング層を有しないネットワークであっても、ネットワークのサイズが大きくなることを抑制できる。 As described above, since the kernel size of the convolution layer is 3 × 3 in the network of the present embodiment, in FIG. 7B the convolution processing is performed on, for example, the 3 × 3 data 401 at the upper left of the input layer 400, and the result is stored in the memory 441. Next, as described above, since the input stride of the convolution layer is 3 in the network of the present embodiment, in FIG. 7C the convolution processing is performed on the 3 × 3 data 406, which is shifted three columns to the right from the data 401 of the input layer 400 and therefore does not overlap the data 401, and the result is stored in the memory 442. Similarly, in FIG. 7D, the convolution processing is performed on the 3 × 3 data 407, which is shifted three rows down from the data 401 of the input layer 400 and therefore overlaps neither the data 401 nor the data 406, and the result is stored in the memory 443. Similarly, in FIG. 7E, the convolution processing is performed on the 3 × 3 data 408, which is shifted three rows down from the data 406 of the input layer 400 and therefore overlaps none of the data 401, the data 406, and the data 407 that serve as the other inputs of the convolution layer, and the result is stored in the memory 444. Further, since the network of the present embodiment has no pooling layer, no pooling processing is performed. Because no pooling is performed, in FIG. 7F the data 441 of the intermediate output layer 440 retains the features extracted from the data 401 of the input layer 400, the data 442 retains the features extracted from the data 406, the data 443 retains the features extracted from the data 407, and the data 444 retains the features extracted from the data 408. In addition, since the data 401, 406, 407, and 408 of the input layer 400 do not overlap one another, the data 441, 442, 443, and 444 of the intermediate output layer 440 can each hold separate, non-duplicated information, and the growth of the network size can be suppressed even though the network has no pooling layer.

このように、本実施の形態のネットワークは位置情報に敏感なネットワークとなる。このようなネットワーク構成を有する本実施の形態の識別器を用いることで、抽出された特徴（位置情報）が保持され、つまり、奥行情報を付加するために敢えて生じさせていた位置ずれが失われず、奥行情報を効率的に取得することができる。 As described above, the network according to the present embodiment is a network sensitive to position information. By using the discriminator of the present embodiment having such a network configuration, the extracted features (position information) are retained; that is, the positional shift that was intentionally introduced to add depth information is not lost, so depth information can be acquired efficiently.
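The difference between a pooled and an unpooled path can be illustrated with a one-dimensional toy example. The sketch below is not the network of the embodiment: the helper names, the kernel weights, and the input signals are all assumptions chosen for illustration. It feeds the same unit feature at two positions one sample apart through non-overlapping max pooling and through a stride-3 convolution.

```python
def max_pool_1d(values, size):
    """Non-overlapping max pooling over a 1-D list."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


def strided_conv_1d(values, kernel, stride):
    """Unpadded 1-D convolution (cross-correlation) with the given stride."""
    k = len(kernel)
    return [sum(values[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(values) - k + 1, stride)]


# The same unit feature at two positions that differ by one sample.
signal_a = [0, 1, 0, 0, 0, 0]
signal_b = [0, 0, 1, 0, 0, 0]

# Hypothetical kernel weights; they are asymmetric so that position matters.
kernel = [1, 2, 3]

pooled_a = max_pool_1d(signal_a, 3)
pooled_b = max_pool_1d(signal_b, 3)
conv_a = strided_conv_1d(signal_a, kernel, 3)
conv_b = strided_conv_1d(signal_b, kernel, 3)

print(pooled_a == pooled_b)  # True: pooling hides the one-sample shift
print(conv_a, conv_b)        # [2, 0] [3, 0]: the strided convolution keeps it
```

In this toy case the pooled outputs are identical while the convolution outputs differ, which is the kind of small positional shift that the text says must survive for depth information to be recovered.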

また、本実施の形態のネットワークは、畳み込み層が一部重なってもかまわない。具体的には、本実施の形態の識別モデルのネットワークにおいて、畳み込み層の入力のストライドが２以上であってもよい。例えば、プーリング層を有しないネットワークであれば、畳み込み層のカーネルサイズが３×３、ストライドが２であってもよい。このように、本実施の形態の識別モデルのネットワークにおいて、畳み込み層のカーネルサイズが３×３の場合、畳み込み層の入力のストライドが当該カーネルサイズの半分以上であってもよい。この場合であっても、このネットワークによって、画像上の位置に対する敏感性を実現できる。これについて、図８Ａ〜図８Ｅを用いて説明する。 In the network of this embodiment, the convolution layers may partially overlap. Specifically, in the identification model network of the present embodiment, the input stride of the convolution layer may be two or more. For example, if the network does not have a pooling layer, the convolutional layer kernel size may be 3 × 3 and the stride may be 2. As described above, in the identification model network according to the present embodiment, when the kernel size of the convolution layer is 3 × 3, the stride of the input of the convolution layer may be half or more of the kernel size. Even in this case, the sensitivity to the position on the image can be realized by this network. This will be described with reference to FIGS. 8A to 8E.

図８Ａ〜図８Ｅは、カーネルサイズ３×３、ストライド２の畳み込み層を利用し、プーリング層を持たないネットワークが、画像上の位置に対する敏感性を有することを説明するための模式図である。図８Ａ〜図８Ｅにおいても、図７Ａ〜図７Ｆと同じように、畳み込み層の処理におけるデータが格納されたメモリを模式的に正方形で示しており、同じ構成要素には同じ番号を付与し、説明を省略する。 8A to 8E are schematic diagrams for explaining that a network using a convolutional layer having a kernel size of 3 × 3 and stride 2 and having no pooling layer has sensitivity to a position on an image. Also in FIGS. 8A to 8E, as in FIGS. 7A to 7F, the memory in which the data in the process of the convolution layer is stored is schematically shown as a square, and the same number is assigned to the same component, Description is omitted.

上述したように、本実施の形態の畳み込み層が一部重なっているネットワークでは畳み込み層のカーネルサイズを３×３としているため、図８Ａにおいて、入力層４００の例えば左上の３×３のデータ４０１に対して畳み込み処理が行なわれ、その結果が中間出力層４４０のメモリ４５１に格納される。次に、上述したように、本実施の形態の畳み込み層が一部重なっているネットワークでは畳み込み層の入力のストライドを２としているため、図８Ｂにおいて、入力層４００のデータ４０１から右に２列ずらした３×３のデータ４０９に対して畳み込み処理が行なわれ、その結果がメモリ４５２に格納される。同じように、図８Ｃにおいて、入力層４００のデータ４０１から下に２行ずらした３×３のデータ４１０に対して畳み込み処理が行なわれ、その結果がメモリ４５３に格納される。同じように、図８Ｄにおいて、入力層４００のデータ４０９から下に２行ずらした３×３のデータ４１１に対して畳み込み処理が行なわれ、その結果がメモリ４５４に格納される。また、本実施の形態の畳み込み層が一部重なっているネットワークは、プーリング層を有しないため、プーリング処理を行なわない。プーリング処理が行われないことで、図８Ｅにおいて、中間出力層４４０のデータ４５１は入力層４００のデータ４０１から抽出した特徴を保持し、中間出力層４４０のデータ４５２は入力層４００のデータ４０９から抽出した特徴を保持し、中間出力層４４０のデータ４５３は入力層４００のデータ４１０から抽出した特徴を保持し、中間出力層４４０のデータ４５４は入力層４００のデータ４１１から抽出した特徴を保持する。しかし、入力層４００のデータ４０１、４０９、４１０及び４１１は互いに一部重なっている。しかし、プーリング処理が行なわれないため、本実施の形態の畳み込み層が一部重なっているネットワークは位置情報に敏感なネットワークとなる。このような畳み込み層が一部重なっているネットワーク構成を有する識別器を用いた場合であっても、奥行情報を効率的に取得することができる。 As described above, in the network in which the convolution layers of the present embodiment partially overlap, the kernel size of the convolution layer is 3 × 3. Therefore, for example, the upper left 3 × 3 data 401 of the input layer 400 in FIG. Is subjected to convolution processing, and the result is stored in the memory 451 of the intermediate output layer 440. Next, as described above, in the network in which the convolution layers of the present embodiment are partially overlapped, the stride of the input of the convolution layer is 2, so in FIG. 8B, two columns to the right from the data 401 of the input layer 400 A convolution process is performed on the shifted 3 × 3 data 409, and the result is stored in the memory 452. Similarly, in FIG. 8C, a convolution process is performed on 3 × 3 data 410 shifted downward by two rows from data 401 of input layer 400, and the result is stored in memory 453. Similarly, in FIG. 8D, a convolution process is performed on 3 × 3 data 411 shifted by two rows downward from data 409 of input layer 400, and the result is stored in memory 454. 
In addition, the network of the present embodiment in which the convolution layers partially overlap has no pooling layer and therefore performs no pooling processing. Because no pooling is performed, in FIG. 8E the data 451 of the intermediate output layer 440 retains the features extracted from the data 401 of the input layer 400, the data 452 retains the features extracted from the data 409, the data 453 retains the features extracted from the data 410, and the data 454 retains the features extracted from the data 411. The data 401, 409, 410, and 411 of the input layer 400 do partially overlap one another. Nevertheless, since no pooling processing is performed, the network of the present embodiment in which the convolution layers partially overlap is still a network sensitive to position information. Even when a discriminator having such a partially overlapping network configuration is used, depth information can be acquired efficiently.
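How much two neighbouring input windows overlap follows directly from the kernel size and the stride. The following is a minimal sketch with hypothetical helper names that counts the input cells shared by two horizontally adjacent 3 × 3 windows for stride 3 and for stride 2.

```python
def window_cells(top, left, kernel):
    """Set of (row, col) input cells covered by a kernel x kernel window."""
    return {(top + r, left + c)
            for r in range(kernel)
            for c in range(kernel)}


def shared_cells(kernel, stride):
    """Number of input cells shared by two horizontally adjacent windows."""
    first = window_cells(0, 0, kernel)
    second = window_cells(0, stride, kernel)
    return len(first & second)


overlap_stride3 = shared_cells(3, 3)
overlap_stride2 = shared_cells(3, 2)

print(overlap_stride3)  # 0: windows such as data 401 and 406 are disjoint
print(overlap_stride2)  # 3: windows such as data 401 and 409 share one column
```

With stride 2, each window still contributes six cells that its horizontal neighbour does not see, so each output location remains tied to a distinct region of the input.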

また、識別システム１は、後述する図９に示すように、識別器を生成するための学習装置１２を備えてもよい。この場合、画像識別装置１０の識別部１０２は、学習装置１２で生成された、言い換えると学習が完了した識別器を使用する。 The identification system 1 may also include a learning device 12 for generating a discriminator, as shown in FIG. 9 described later. In this case, the identification unit 102 of the image identification device 10 uses the discriminator generated by the learning device 12, in other words, a discriminator for which learning has been completed.

出力部１０３は、識別部１０２の識別結果を出力する。出力部１０３は、識別システム１がさらにディスプレイを備える場合には、当該ディスプレイに、識別結果を出力する指示を出力する。又は、出力部１０３は、通信部を有し、通信部を介して、有線又は無線で、識別結果を出力してもよい。上述の通り、物体の情報は、物体の存在の有無と、物体の位置とを含み、物体の情報についての識別結果に応じて自動運転等が行われ、また、例えばディスプレイ等に物体の情報が出力されることで、ユーザは識別システム１が搭載された移動体の周辺の状況を認識できる。 The output unit 103 outputs the identification result of the identification unit 102. When the identification system 1 further includes a display, the output unit 103 outputs an instruction to output the identification result to the display. Alternatively, the output unit 103 may include a communication unit, and output the identification result via a communication unit in a wired or wireless manner. As described above, the object information includes the presence / absence of the object and the position of the object, and automatic operation or the like is performed according to the identification result of the object information. By outputting the information, the user can recognize the situation around the moving body on which the identification system 1 is mounted.

上述のような取得部１０１、識別部１０２及び出力部１０３からなる画像識別装置１０の構成要素は、ＣＰＵ（Central Processing Unit）又はＤＳＰ（Digital Signal Processor）等のプロセッサ、並びに、ＲＡＭ（Random Access Memory）及びＲＯＭ（Read−Only Memory）等のメモリなどからなる処理回路により構成されてもよい。上記構成要素の一部又は全部の機能は、ＣＰＵ又はＤＳＰがＲＡＭを作業用のメモリとして用いてＲＯＭに記録されたプログラムを実行することによって達成されてもよい。また、上記構成要素の一部又は全部の機能は、電子回路又は集積回路等の専用のハードウェア回路によって達成されてもよい。上記構成要素の一部又は全部の機能は、上記のソフトウェア機能とハードウェア回路との組み合わせによって構成されてもよい。 The components of the image identification apparatus 10 including the acquisition unit 101, the identification unit 102, and the output unit 103 as described above are a processor such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor), and a RAM (Random Access Memory). ) And a processing circuit including a memory such as a ROM (Read-Only Memory). Some or all of the functions of the above components may be achieved by the CPU or DSP executing a program recorded in the ROM using the RAM as a working memory. Further, some or all of the functions of the above components may be achieved by a dedicated hardware circuit such as an electronic circuit or an integrated circuit. A part or all of the functions of the constituent elements may be configured by a combination of the software function and a hardware circuit.

次に、識別システムが学習装置を含むケースとして、実施の形態に係る識別システム１の変形例を、図９を用いて説明する。 Next, as a case where the identification system includes a learning device, a modified example of the identification system 1 according to the embodiment will be described with reference to FIG. 9.

図９は、実施の形態の変形例に係る識別システム１Ａの機能的な構成の一例を示す模式図である。 FIG. 9 is a schematic diagram illustrating an example of a functional configuration of an identification system 1A according to a modification of the embodiment.

図９に示すように、変形例に係る識別システム１Ａは、画像識別装置１０と、撮像部１１と、学習装置１２とを備える。学習装置１２は、第一画像取得部１２１と、第二画像取得部１２２と、識別正解取得部１２３と、学習部１２４とを備える。画像識別装置１０、撮像部１１及び学習装置１２は、１つの装置に搭載されてもよく、複数の装置に分かれて搭載されてもよい。画像識別装置１０、撮像部１１及び学習装置１２が複数の装置に分かれて搭載される場合、有線通信又は無線通信を介して、装置間で情報が授受されてもよい。適用される有線通信及び無線通信は、上記で例示したもののいずれかであってもよい。 As illustrated in FIG. 9, the identification system 1 A according to the modification includes an image identification device 10, an imaging unit 11, and a learning device 12. The learning device 12 includes a first image acquisition unit 121, a second image acquisition unit 122, an identification correct answer acquisition unit 123, and a learning unit 124. The image identification device 10, the imaging unit 11, and the learning device 12 may be mounted on one device, or may be mounted on a plurality of devices. When the image identification device 10, the imaging unit 11, and the learning device 12 are separately installed in a plurality of devices, information may be exchanged between the devices via wired communication or wireless communication. The applied wired communication and wireless communication may be any of those exemplified above.

図１０は、実施の形態の変形例に係る識別システム１Ａのハードウェア構成の一例を示す模式図である。 FIG. 10 is a schematic diagram illustrating an example of a hardware configuration of an identification system 1A according to a modification of the embodiment.

図１０に示すように、学習装置１２は、第二入力回路２２１と、第三入力回路２２２と、第二演算回路２２３と、第二メモリ２２４とを備える。また、画像識別装置１０は、第一入力回路２０１と、第一演算回路２０２と、第一メモリ２０３と、出力回路２０４とを備える。 As illustrated in FIG. 10, the learning device 12 includes a second input circuit 221, a third input circuit 222, a second arithmetic circuit 223, and a second memory 224. Further, the image identification device 10 includes a first input circuit 201, a first arithmetic circuit 202, a first memory 203, and an output circuit 204.

第一入力回路２０１、第一演算回路２０２及び出力回路２０４は、画像識別装置１０が備える処理回路の一例であり、第一メモリ２０３は、画像識別装置１０が備えるメモリの一例である。図１及び図１０を参照すると、第一入力回路２０１は、取得部１０１に対応する。第一演算回路２０２は、識別部１０２に対応する。出力回路２０４は、出力部１０３に対応する。このように、取得部１０１、識別部１０２及び出力部１０３は、第一入力回路２０１、第一演算回路２０２及び出力回路２０４に対応していることから、取得部１０１、識別部１０２及び出力部１０３についても、画像識別装置１０が備える処理回路の一例といえる。第一メモリ２０３は、第一入力回路２０１、第一演算回路２０２及び出力回路２０４が処理を実行するためのコンピュータプログラム、取得部１０１が取得する第２の計算撮像画像、及び、識別部１０２が用いる識別器等を記憶する。第一メモリ２０３は、１つのメモリで構成されてもよく、同じ種類又は異なる種類の複数のメモリで構成されてもよい。第一入力回路２０１及び出力回路２０４は、通信回路を含んでもよい。 The first input circuit 201, the first arithmetic circuit 202, and the output circuit 204 are an example of a processing circuit included in the image identification device 10, and the first memory 203 is an example of a memory included in the image identification device 10. 1 and 10, the first input circuit 201 corresponds to the acquisition unit 101. The first arithmetic circuit 202 corresponds to the identification unit 102. The output circuit 204 corresponds to the output unit 103. Thus, since the acquisition unit 101, the identification unit 102, and the output unit 103 correspond to the first input circuit 201, the first arithmetic circuit 202, and the output circuit 204, the acquisition unit 101, the identification unit 102, and the output unit 103 may also be an example of a processing circuit included in the image identification device 10. The first memory 203 includes a computer program for the first input circuit 201, the first arithmetic circuit 202, and the output circuit 204 to execute processing, a second calculated captured image acquired by the acquisition unit 101, and an identification unit 102. The discriminator to be used is stored. The first memory 203 may be composed of a single memory, or may be composed of a plurality of memories of the same type or different types. The first input circuit 201 and the output circuit 204 may include a communication circuit.

第二入力回路２２１、第三入力回路２２２及び第二演算回路２２３は、学習装置１２が備える処理回路の一例であり、第二メモリ２２４は、学習装置１２が備えるメモリの一例である。図９及び図１０を参照すると、第二入力回路２２１は、第一画像取得部１２１に対応する。第二入力回路２２１は、通信回路を含んでもよい。第三入力回路２２２は、第二画像取得部１２２に対応する。第三入力回路２２２は、通信回路を含んでもよい。第二演算回路２２３は、識別正解取得部１２３及び学習部１２４に対応する。第二演算回路２２３は、通信回路を含んでもよい。このように、第一画像取得部１２１、第二画像取得部１２２、識別正解取得部１２３及び学習部１２４は、第二入力回路２２１、第三入力回路２２２及び第二演算回路２２３に対応していることから、第一画像取得部１２１、第二画像取得部１２２、識別正解取得部１２３及び学習部１２４についても、学習装置１２が備える処理回路の一例といえる。第二メモリ２２４は、第二入力回路２２１、第三入力回路２２２及び第二演算回路２２３が処理を実行するためのコンピュータプログラム、第一画像取得部１２１が取得する第１の計算撮像画像、第二画像取得部１２２が取得する撮像画像、識別正解取得部１２３が取得する識別正解、学習部１２４が生成した識別器等を記憶する。第二メモリ２２４は、１つのメモリで構成されてもよく、同じ種類又は異なる種類の複数のメモリで構成されてもよい。 The second input circuit 221, the third input circuit 222, and the second arithmetic circuit 223 are examples of processing circuits included in the learning device 12, and the second memory 224 is an example of memory included in the learning device 12. With reference to FIGS. 9 and 10, the second input circuit 221 corresponds to the first image acquisition unit 121. The second input circuit 221 may include a communication circuit. The third input circuit 222 corresponds to the second image acquisition unit 122. The third input circuit 222 may include a communication circuit. The second arithmetic circuit 223 corresponds to the identification correct answer acquisition unit 123 and the learning unit 124. The second arithmetic circuit 223 may include a communication circuit. Thus, the first image acquisition unit 121, the second image acquisition unit 122, the identification correct answer acquisition unit 123, and the learning unit 124 correspond to the second input circuit 221, the third input circuit 222, and the second arithmetic circuit 223. Therefore, the first image acquisition unit 121, the second image acquisition unit 122, the identification correct answer acquisition unit 123, and the learning unit 124 are also examples of processing circuits included in the learning device 12. 
The second memory 224 stores a computer program for the second input circuit 221, the third input circuit 222, and the second arithmetic circuit 223 to execute processing, the first calculated captured image acquired by the first image acquisition unit 121, the captured image acquired by the second image acquisition unit 122, the identification correct answer acquired by the identification correct answer acquisition unit 123, the discriminator generated by the learning unit 124, and the like. The second memory 224 may be composed of a single memory, or may be composed of a plurality of memories of the same type or different types.

第一入力回路２０１、第一演算回路２０２、出力回路２０４、第二入力回路２２１、第三入力回路２２２及び第二演算回路２２３は、ＣＰＵ又はＤＳＰ等のプロセッサを含む処理回路で構成され得る。第一メモリ２０３及び第二メモリ２２４は、例えば、ＲＯＭ、ＲＡＭ、フラッシュメモリなどの半導体メモリ、ハードディスクドライブ、又は、ＳＳＤ（Solid State Drive）等の記憶装置によって実現される。第一メモリ２０３及び第二メモリ２２４は、１つのメモリにまとめられてもよい。プロセッサは、メモリに展開されたコンピュータプログラムに記述された命令群を実行する。これにより、プロセッサは種々の機能を実現することができる。 The first input circuit 201, the first arithmetic circuit 202, the output circuit 204, the second input circuit 221, the third input circuit 222, and the second arithmetic circuit 223 may be configured by a processing circuit including a processor such as a CPU or a DSP. The first memory 203 and the second memory 224 are realized by a storage device such as a semiconductor memory such as a ROM, a RAM, and a flash memory, a hard disk drive, or an SSD (Solid State Drive), for example. The first memory 203 and the second memory 224 may be combined into one memory. The processor executes a group of instructions described in the computer program expanded in the memory. Thereby, the processor can realize various functions.

学習装置１２の第一画像取得部１２１及び第二画像取得部１２２は、機械学習のための第１の計算撮像画像及び撮像画像を取得する。第一画像取得部１２１のハードウェアの例は計算撮像画像を撮像するためのカメラであり、具体的にはマルチピンホールカメラ、ＣｏｄｅｄＡｐｅｒｔｕｒｅカメラ、ライトフィールドカメラ、又は、レンズレスカメラ等である。つまり、第一画像取得部１２１は、例えば、第二入力回路２２１と計算撮像画像を撮像するためのカメラとによって実現される。第二画像取得部１２２のハードウェアの例は撮像画像を撮像するためのカメラであり、具体的にはデジタルカメラ等である。つまり、第二画像取得部１２２は、例えば、第三入力回路２２２と撮像画像を撮像するためのカメラとによって実現される。 The first image acquisition unit 121 and the second image acquisition unit 122 of the learning device 12 acquire a first calculated captured image and a captured image for machine learning. An example of hardware of the first image acquisition unit 121 is a camera for capturing a calculated captured image, and specifically, a multi-pinhole camera, a coded aperture camera, a light field camera, a lensless camera, or the like. That is, the first image acquisition unit 121 is realized by, for example, the second input circuit 221 and a camera for capturing a calculated captured image. An example of hardware of the second image acquisition unit 122 is a camera for capturing a captured image, specifically, a digital camera or the like. That is, the second image acquisition unit 122 is realized by, for example, the third input circuit 222 and a camera for capturing a captured image.

例えば、計算撮像画像を撮像するためのカメラによって撮像された第１の計算撮像画像は第二メモリ２２４に記憶され、第二入力回路２２１が第二メモリ２２４から第１の計算撮像画像を取得することで、第一画像取得部１２１は、第１の計算撮像画像を取得する。なお、第一画像取得部１２１は、ハードウェアとして計算撮像画像を撮像するためのカメラを含んでいなくてもよい。この場合、第一画像取得部１２１（第二入力回路２２１）は、撮像部１１から第１の計算撮像画像を取得してもよく（具体的には、撮像部１１によって撮像された第１の計算撮像画像は第二メモリ２２４に記憶され、第二メモリ２２４から第１の計算撮像画像を取得してもよく）、識別システム１Ａの外部から有線通信又は無線通信を介して、第１の計算撮像画像を取得してもよい。適用される有線通信及び無線通信は、上記で例示したもののいずれかであってもよい。 For example, a first calculated captured image captured by a camera for capturing a calculated captured image is stored in the second memory 224, and the second input circuit 221 acquires the first calculated captured image from the second memory 224. Thus, the first image acquisition unit 121 acquires the first calculated captured image. Note that the first image acquisition unit 121 may not include a camera for capturing a calculated captured image as hardware. In this case, the first image acquisition unit 121 (second input circuit 221) may acquire the first calculated captured image from the imaging unit 11 (specifically, the first image captured by the imaging unit 11). The calculated captured image is stored in the second memory 224, and the first calculated captured image may be acquired from the second memory 224), and the first calculation is performed from outside the identification system 1A via wired communication or wireless communication. A captured image may be acquired. The applied wired communication and wireless communication may be any of those exemplified above.

また、例えば、撮像画像を撮像するためのカメラによって撮像された撮像画像は第二メモリ２２４に記憶され、第三入力回路２２２が第二メモリ２２４から撮像画像を取得することで、第二画像取得部１２２は、撮像画像を取得する。なお、第二画像取得部１２２は、ハードウェアとして撮像画像を撮像するためのカメラを含んでいなくてもよい。この場合、第二画像取得部１２２（第三入力回路２２２）は、識別システム１Ａの外部から有線通信又は無線通信を介して、撮像画像を取得してもよい。適用される有線通信及び無線通信は、上記で例示したもののいずれかであってもよい。 Further, for example, a captured image captured by a camera for capturing a captured image is stored in the second memory 224, and the third input circuit 222 acquires the captured image from the second memory 224, thereby acquiring the second image. The unit 122 acquires a captured image. Note that the second image acquisition unit 122 may not include a camera for capturing a captured image as hardware. In this case, the second image acquisition unit 122 (third input circuit 222) may acquire a captured image from outside the identification system 1A via wired communication or wireless communication. The applied wired communication and wireless communication may be any of those exemplified above.

識別正解取得部１２３は、第一画像取得部１２１が取得した第１の計算撮像画像を用いた機械学習のために、識別正解を取得する。識別正解は、第１の計算撮像画像と共に、識別システム１Ａの外部から与えられてもよく、ユーザが識別正解を手動等により入力することによって与えられてもよい。識別正解は、第１の計算撮像画像に含まれる被写体が属するカテゴリ情報と、被写体の位置情報とを含む。被写体のカテゴリの例は、人物、自動車、自転車又は信号等である。位置情報は、画像上の位置（具体的には、平面における位置又は奥行方向における位置）を含む。識別正解取得部１２３は、取得した識別正解を、第１の計算撮像画像と対応付けて、第二メモリ２２４に格納する。 The identification correct answer acquisition unit 123 acquires an identification correct answer for machine learning using the first calculated captured image acquired by the first image acquisition unit 121. The identification correct answer may be given together with the first calculated captured image from the outside of the identification system 1A, or may be given by the user manually inputting the identification correct answer. The identification correct answer includes category information to which the subject included in the first calculated captured image belongs, and position information of the subject. Examples of subject categories are people, cars, bicycles or signals. The position information includes a position on the image (specifically, a position on a plane or a position in the depth direction). The identification correct answer acquisition unit 123 stores the acquired identification correct answer in the second memory 224 in association with the first calculated captured image.
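One way to picture an identification correct answer is as a small record that pairs a category with a position. The sketch below is an assumption throughout: the class name, the field names, the pixel-box representation, and the image identifier used as a key are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CorrectAnswer:
    """One labelled subject; all field names are hypothetical."""
    category: str                    # e.g. "person", "car", "bicycle", "signal"
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in image pixels
    depth_m: Optional[float] = None  # position in the depth direction, if known


# Correct answers stored in association with an image (hypothetical identifier).
answers_by_image = {
    "image_0001": [
        CorrectAnswer("person", (120, 80, 160, 200), depth_m=12.5),
        CorrectAnswer("car", (300, 110, 420, 190)),
    ],
}

categories = [answer.category for answer in answers_by_image["image_0001"]]
print(categories)  # ['person', 'car']
```

Keying the records by image mirrors the statement that each correct answer is stored in association with its first calculated captured image.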

ただし、前述のように、取得部１０１及び第一画像取得部１２１が取得する計算撮像画像は、人によって実空間の状態と同様に視覚的に認識できない画像である。そのため、第一画像取得部１２１が取得した第１の計算撮像画像に識別正解を入力することは困難である。そこで、本実施の形態の識別システム１Ａは、第二画像取得部１２２を有し、第一画像取得部１２１が取得した第１の計算撮像画像ではなく、第二画像取得部１２２が取得した、人によって実空間の状態と同様に視覚的に認識できる撮像画像に対して識別正解を入力する。詳細は、後述する。 However, as described above, the calculated captured images acquired by the acquisition unit 101 and the first image acquisition unit 121 are images that a person cannot visually recognize in the same way as the state of the real space. It is therefore difficult to input an identification correct answer for the first calculated captured image acquired by the first image acquisition unit 121. For this reason, the identification system 1A of the present embodiment includes the second image acquisition unit 122, and the identification correct answer is input not for the first calculated captured image acquired by the first image acquisition unit 121 but for the captured image acquired by the second image acquisition unit 122, which a person can visually recognize in the same way as the state of the real space. Details will be described later.

学習部１２４は、第一画像取得部１２１が取得した第１の計算撮像画像と、識別正解取得部１２３が取得した、第二画像取得部１２２が取得した撮像画像に対する識別正解とを用いて、識別部１０２の識別器の学習を行う。なお、本実施の形態では、上述したように、識別器はプーリング層を有しないネットワークで構成されている。学習部１２４は、第二メモリ２２４に格納された識別器に機械学習をさせ、学習後の最新の識別器を第二メモリ２２４に格納する。識別部１０２は、第二メモリ２２４に格納された最新の識別器を取得し、第一メモリ２０３に格納しつつ、識別処理に使用する。上記機械学習は、例えば、ディープラーニングなどにおける誤差逆伝播法（ＢＰ：BackPropagation）などによって実現される。具体的には、学習部１２４は、識別器に第１の計算撮像画像を入力し、識別器が出力する識別結果を取得する。そして、学習部１２４は、識別結果が識別正解となるように識別器を調整する。学習部１２４は、このような調整をそれぞれ異なる複数の（例えば数千組の）第１の計算撮像画像及びこれに対応する識別正解について繰り返すことによって、識別器の識別精度を向上させる。 The learning unit 124 trains the discriminator of the identification unit 102 using the first calculated captured image acquired by the first image acquisition unit 121 and the identification correct answer, acquired by the identification correct answer acquisition unit 123, for the captured image acquired by the second image acquisition unit 122. In the present embodiment, as described above, the discriminator is configured by a network that has no pooling layer. The learning unit 124 causes the discriminator stored in the second memory 224 to perform machine learning, and stores the latest trained discriminator in the second memory 224. The identification unit 102 acquires the latest discriminator stored in the second memory 224, stores it in the first memory 203, and uses it for the identification process. The machine learning is realized by, for example, error backpropagation (BP) as used in deep learning. Specifically, the learning unit 124 inputs the first calculated captured image to the discriminator and acquires the identification result output by the discriminator. The learning unit 124 then adjusts the discriminator so that the identification result matches the identification correct answer.
The learning unit 124 improves the identification accuracy of the discriminator by repeating this adjustment for many different pairs (for example, several thousand pairs) of a first calculated captured image and its corresponding identification correct answer.
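The adjust-and-repeat loop described above can be sketched with a deliberately tiny stand-in model. The code below is an assumption, not the discriminator of the embodiment: it trains a single-layer logistic model on four hypothetical two-feature samples, and for a single layer the backpropagated gradient reduces to the `output - target` term used here. The learning rate and epoch count are also arbitrary choices.

```python
import math


def predict(weights, bias, features):
    """Output of a single-layer logistic model (stand-in for the discriminator)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=200, learning_rate=0.5):
    """Repeatedly adjust the model so that its output approaches the correct answer."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(samples, labels):
            output = predict(weights, bias, features)
            error = output - target  # gradient of the cross-entropy loss w.r.t. z
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical training pairs: the correct answer equals the first feature.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]

weights, bias = train(samples, labels)
predictions = [round(predict(weights, bias, s)) for s in samples]
print(predictions == labels)  # True once the model has been adjusted enough
```

A real discriminator would have many layers and would be trained on thousands of image and correct-answer pairs, but the structure of the loop (forward pass, comparison with the correct answer, parameter update) is the same.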

次に、図９〜図１１を参照しつつ、学習装置１２について説明する。 Next, the learning device 12 will be described with reference to FIGS. 9 to 11.

図１１は、学習装置１２の主要な処理の流れの一例を示すフローチャートである。 FIG. 11 is a flowchart illustrating an example of the main processing flow of the learning device 12.

まず、ステップＳ１において、学習部１２４は、第一画像取得部１２１が取得する第１の計算撮像画像と、第二画像取得部１２２が取得する撮像画像の画像上での位置（画素）の対応関係を取得する。具体的には、学習部１２４は、第１の計算撮像画像が有する複数の第１の画素及び撮像画像が有する複数の第２の画素の対応関係を取得する。これは、第１の計算撮像画像及び撮像画像に対して幾何学的キャリブレーションが行なわれることで実現される。幾何学的キャリブレーションは、３次元位置が既知の点が第１の計算撮像画像及び撮像画像のどこに撮像されるかを事前に取得し、その情報を元に被写体の３次元位置と第１の計算撮像画像及び撮像画像との関係を求めるものである。これは、例えばＴｓａｉのキャリブレーションとして知られている手法を利用することで実現できる。通常、撮像画像からは被写体の３次元位置を求めることができないが、前述のように、計算撮像画像であるライトフィールド画像では１枚の画像から３次元位置を求めることができる。また、第一画像取得部１２１が取得する第１の計算撮像画像と、第二画像取得部１２２が取得する撮像画像の画像上での対応点（画素）を取得することで、キャリブレーションを実現することができる。例えば、第１の計算撮像画像と撮像画像との対応関係が取得されることで、第１の計算撮像画像と撮像画像との原点合わせをすることができる。なお、第１の計算撮像画像を撮像するカメラと、撮像画像を撮像するカメラとの位置関係が変わらなければ、このようなキャリブレーションは一度行うだけでよい。なお、以下の説明では、計算撮像画像がライトフィールド画像であるとして説明する。 First, in step S1, the learning unit 124 acquires the correspondence between positions (pixels) on the first calculated captured image acquired by the first image acquisition unit 121 and positions (pixels) on the captured image acquired by the second image acquisition unit 122. Specifically, the learning unit 124 acquires the correspondence between the plurality of first pixels of the first calculated captured image and the plurality of second pixels of the captured image. This is realized by performing geometric calibration on the first calculated captured image and the captured image. Geometric calibration obtains in advance where points whose three-dimensional positions are known appear in the first calculated captured image and in the captured image, and, based on that information, determines the relationship between the three-dimensional position of a subject and the first calculated captured image and the captured image. This can be realized, for example, by using the technique known as Tsai's calibration.
Normally, the three-dimensional position of a subject cannot be obtained from a captured image, but, as described above, with a light field image, which is a calculated captured image, the three-dimensional position can be obtained from a single image. Calibration can also be realized by acquiring corresponding points (pixels) between the first calculated captured image acquired by the first image acquisition unit 121 and the captured image acquired by the second image acquisition unit 122. For example, once the correspondence between the first calculated captured image and the captured image has been acquired, the origins of the two images can be aligned. If the positional relationship between the camera that captures the first calculated captured image and the camera that captures the captured image does not change, this calibration needs to be performed only once. In the following description, the calculated captured image is assumed to be a light field image.
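The origin alignment mentioned above can be illustrated under a strong simplifying assumption: that the two images differ only by a constant pixel offset. The sketch below is not Tsai's calibration, which estimates full camera parameters; the function names, the corresponding points, and the label box are hypothetical.

```python
def estimate_offset(points_captured, points_computational):
    """Mean (dx, dy) that maps captured-image pixels onto computational-image pixels.

    Assumes a pure translation between the two images.
    """
    count = len(points_captured)
    pairs = list(zip(points_captured, points_computational))
    dx = sum(q[0] - p[0] for p, q in pairs) / count
    dy = sum(q[1] - p[1] for p, q in pairs) / count
    return dx, dy


def transfer_box(box, offset):
    """Shift a (left, top, right, bottom) label box by the estimated offset."""
    left, top, right, bottom = box
    dx, dy = offset
    return (left + dx, top + dy, right + dx, bottom + dy)


# Hypothetical corresponding pixels found in both images.
captured_points = [(10, 20), (50, 20), (10, 80)]
computational_points = [(14, 17), (54, 17), (14, 77)]

offset = estimate_offset(captured_points, computational_points)
label_box = transfer_box((30, 40, 60, 90), offset)

print(offset)     # (4.0, -3.0)
print(label_box)  # (34.0, 37.0, 64.0, 87.0)
```

Once such a mapping is known, a correct answer entered on the ordinary captured image can be carried over to the matching position in the first calculated captured image, and the mapping can be reused as long as the two cameras do not move relative to each other.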

ライトフィールド画像は、画素値と奥行情報との両方の情報を有する。ライトフィールド画像は、ライトフィールドカメラによって取得される。ライトフィールドカメラの具体例は、マルチピンホール又はマイクロレンズを使用したカメラである。撮像部１１がライトフィールドカメラであり、第一画像取得部１２１は、撮像部１１が撮像したライトフィールド画像を取得してもよい。又は、第一画像取得部１２１は、識別システム１Ａの外部から有線通信又は無線通信を介してライトフィールド画像を取得してもよい。 The light field image has both information of pixel values and depth information. The light field image is acquired by a light field camera. A specific example of the light field camera is a camera using a multi-pinhole or a microlens. The imaging unit 11 may be a light field camera, and the first image acquisition unit 121 may acquire a light field image captured by the imaging unit 11. Alternatively, the first image acquisition unit 121 may acquire a light field image from outside the identification system 1A via wired communication or wireless communication.

図１２は、マルチピンホールを使用したライトフィールドカメラの例を示す図である。 FIG. 12 is a diagram illustrating an example of a light field camera using a multi-pinhole.

図１２に示すライトフィールドカメラ２１１は、マルチピンホールマスク２１１ａと、イメージセンサ２１１ｂとを有する。マルチピンホールマスク２１１ａは、イメージセンサ２１１ｂから一定距離離れて配置されている。マルチピンホールマスク２１１ａは、ランダム又は等間隔に配置された複数のピンホール２１１ａａを有している。複数のピンホール２１１ａａのことを、マルチピンホールとも呼ぶ。イメージセンサ２１１ｂは、各ピンホール２１１ａａを通じて被写体の画像を取得する。ピンホールを通じて取得される画像を、ピンホール画像と呼ぶ。各ピンホール２１１ａａの位置及び大きさによって、被写体のピンホール画像は異なるため、イメージセンサ２１１ｂは、複数のピンホール画像の重畳画像を取得する。ピンホール２１１ａａの位置は、イメージセンサ２１１ｂ上に投影される被写体の位置に影響を与え、ピンホール２１１ａａの大きさは、ピンホール画像のボケに影響を与える。マルチピンホールマスク２１１ａを用いることによって、位置及びボケの程度が異なる複数のピンホール画像を重畳して取得することが可能である。被写体がピンホール２１１ａａから離れている場合、複数のピンホール画像はほぼ同じ位置に投影される。一方、被写体がピンホール２１１ａａに近い場合、複数のピンホール画像は離れた位置に投影される。このように、重畳された複数のピンホール画像のずれ量と、被写体とマルチピンホールマスク２１１ａ間の距離とは対応しているため、重畳画像には当該ずれ量に応じた被写体の奥行情報が含まれている。 A light field camera 211 shown in FIG. 12 includes a multi-pinhole mask 211a and an image sensor 211b. The multi-pinhole mask 211a is arranged at a certain distance from the image sensor 211b. The multi-pinhole mask 211a has a plurality of pinholes 211aa arranged at random or at equal intervals. The plurality of pinholes 211aa is also referred to as a multipinhole. The image sensor 211b acquires an image of a subject through each pinhole 211aa. An image acquired through a pinhole is called a pinhole image. Since the pinhole image of the subject differs depending on the position and size of each pinhole 211aa, the image sensor 211b acquires a superimposed image of a plurality of pinhole images. The position of the pinhole 211aa affects the position of the subject projected on the image sensor 211b, and the size of the pinhole 211aa affects the blur of the pinhole image. By using the multi-pinhole mask 211a, it is possible to superimpose and acquire a plurality of pinhole images having different positions and degrees of blur. When the subject is away from the pinhole 211aa, the plurality of pinhole images are projected at substantially the same position. On the other hand, when the subject is close to the pinhole 211aa, a plurality of pinhole images are projected at distant positions. 
As described above, the shift amount among the superimposed pinhole images corresponds to the distance between the subject and the multi-pinhole mask 211a, so the superimposed image contains depth information of the subject according to that shift amount.
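The relation between subject distance and image shift can be sketched with idealized one-dimensional pinhole geometry. Everything in the code below is an assumption for illustration: the function names, the units, and the numeric values are hypothetical, and blur caused by pinhole size is ignored. A ray from the subject through a pinhole at `pinhole_x` is extended to a sensor placed `sensor_distance` behind the mask.

```python
def sensor_position(pinhole_x, subject_x, subject_distance, sensor_distance):
    """Where a ray from the subject through one pinhole meets the sensor (1-D)."""
    return pinhole_x + (pinhole_x - subject_x) * sensor_distance / subject_distance


def image_separation(pinhole_spacing, subject_distance, sensor_distance):
    """Distance on the sensor between the images formed by two pinholes."""
    first = sensor_position(0.0, 0.0, subject_distance, sensor_distance)
    second = sensor_position(pinhole_spacing, 0.0, subject_distance, sensor_distance)
    return second - first


# Hypothetical values in millimetres: 1 mm pinhole spacing, sensor 10 mm behind the mask.
near_separation = image_separation(1.0, 100.0, 10.0)
far_separation = image_separation(1.0, 10000.0, 10.0)

print(near_separation > far_separation)  # True: a nearer subject gives a larger shift
```

In this model the separation equals `pinhole_spacing * (1 + sensor_distance / subject_distance)`, so the depth-dependent part of the shift shrinks toward zero as the subject moves away, which is why the shift amount in the superimposed image encodes depth.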

例えば、図１３及び図１４にはそれぞれ、通常撮像画像の例と、マルチピンホールを使用したライトフィールドカメラによるライトフィールド画像（計算撮像画像）の例とが、示されている。 For example, FIG. 13 and FIG. 14 show an example of a normal captured image and an example of a light field image (calculated captured image) by a light field camera using a multi-pinhole, respectively.

図１３は、通常撮像された被写体の画像（撮像画像）の例を示す模式図であり、図１４は、マルチピンホールマスクを含むライトフィールドカメラを使用して撮像された被写体の画像（計算撮像画像）の例を示す模式図である。 FIG. 13 is a schematic diagram illustrating an example of an image of subjects captured normally (captured image), and FIG. 14 is a schematic diagram illustrating an example of an image of the subjects captured using a light field camera including a multi-pinhole mask (calculated captured image).

As shown in FIG. 13, the normal captured image shows, as subjects, a person A and automobiles B and C on a road. When these subjects are imaged by a light field camera having, for example, four pinholes, the images of the person A and the automobiles B and C are each acquired as a plurality of superimposed images, as shown in FIG. 14. Specifically, the image of the person A is acquired as persons A1, A2, and A3, the image of the automobile B is acquired as automobiles B1, B2, B3, and B4, and the image of the automobile C is acquired as automobiles C1, C2, C3, and C4. Although no reference signs are given in FIGS. 13 and 14, the image of the road on which the automobiles B and C travel in FIG. 13 is likewise acquired as a plurality of superimposed images, as shown in FIG. 14. In this way, the calculated captured image is an image containing parallax information, in which the imaging target (for example, the person A and the automobiles B and C) and the surrounding environment of the imaging target (for example, the road) are each superimposed multiple times.
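How several pinhole images combine into one superimposed image can be sketched as follows. This is an illustration under simplifying assumptions, not the patent's imaging model: the whole scene is treated as lying at a single depth, so every pinhole contributes the same image displaced by a fixed offset, and blur is ignored. The function name `superimpose` and the `shifts` parameter are hypothetical.

```python
def superimpose(scene, shifts):
    """Average copies of `scene`, each displaced by a (dy, dx) offset.

    `scene` is a list of rows of pixel values. Pixels shifted outside the
    frame are dropped. One offset per pinhole (single-depth assumption).
    """
    height, width = len(scene), len(scene[0])
    out = [[0.0] * width for _ in range(height)]
    for dy, dx in shifts:
        for y in range(height):
            for x in range(width):
                ty, tx = y + dy, x + dx
                if 0 <= ty < height and 0 <= tx < width:
                    out[ty][tx] += scene[y][x] / len(shifts)
    return out
```

With two offsets, a single bright pixel appears twice at half intensity, in the same way that the person A appears as A1, A2, and A3 in FIG. 14. In a real scene the offsets would differ per subject according to its distance.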

As shown in FIG. 11, in step S2 the first image acquisition unit 121 acquires, from the second memory 224, a first calculated captured image including an imaging target and the surrounding environment of the imaging target, and in step S3 the second image acquisition unit 122 acquires, from the second memory 224, a captured image including the same imaging target and surrounding environment. The first image acquisition unit 121 acquires a calculated captured image, which is an image that a person cannot visually recognize in the same way as the state of the real space, whereas the second image acquisition unit 122 acquires a normal captured image, which is an image that a person can visually recognize in the same way as the state of the real space.

As shown in FIG. 11, in step S4 the identification correct answer acquisition unit 123 acquires the identification result (identification correct answer) for the imaging target and the surrounding environment of the imaging target included in the captured image acquired by the second image acquisition unit 122. The identification correct answer includes, for example, information on the category to which the imaging target and its surrounding environment (a subject such as a person, an automobile, a bicycle, or a traffic signal) belong, and the position and region, in the image plane, of the imaging target and its surrounding environment. The identification correct answer may also include the position, in the depth direction, of the imaging target and its surrounding environment. The identification correct answer is either supplied from outside the identification system 1A together with the first calculated captured image, or given by a user for the captured image acquired by the second image acquisition unit 122. In the captured image, the identification correct answer acquisition unit 123 identifies each subject on the basis of its position and associates the identified subject with a category. As a result, the identification correct answer acquisition unit 123 acquires the region of the subject, the category of the subject, and the position information of the subject in the captured image acquired by the second image acquisition unit 122 in association with one another, and uses this information as the identification correct answer.

The identification correct answer acquisition unit 123 uses an index when determining the planar position and region of a subject on the captured image. For example, the identification correct answer acquisition unit 123 uses, as the index, a frame surrounding the subject. Hereinafter, a frame surrounding a subject is also referred to as an identification area frame. An identification area frame can indicate the position and region of the subject. Examples of identification area frames are shown in FIGS. 15A and 15B.

FIG. 15A is a schematic diagram showing a captured image on which identification area frames are superimposed. FIG. 15B is a schematic diagram showing only the identification area frames.

In the example shown in FIGS. 15A and 15B, the identification correct answer acquisition unit 123 sets, for each subject, a rectangular identification area frame that encloses the subject from outside and circumscribes it. The shape of the identification area frame is not limited to the example of FIGS. 15A and 15B.

In FIGS. 15A and 15B, the identification correct answer acquisition unit 123 sets, for example, an identification area frame FA for the person A, an identification area frame FB for the automobile B, and an identification area frame FC for the automobile C. As information indicating the shape and position of each identification area frame, the identification correct answer acquisition unit 123 may calculate the line shape and coordinates of the entire frame, the coordinates of each vertex of the frame, or the coordinates of one vertex, such as the upper-left vertex, together with the length of each side. The coordinates are, for example, coordinates relative to the common origin obtained when the origins of the first calculated captured image and the captured image are aligned, as described above. In this way, the identification correct answer acquisition unit 123 outputs, as the identification correct answer, information including the planar position (coordinates) and shape of the region of each identification area frame. The identification correct answer may include the captured image in addition to the planar position and shape of the region of the identification area frame. Although no identification area frame is set for the road here, an identification area frame may also be set for the surrounding environment, such as the road.
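Two of the frame descriptions mentioned above, the coordinates of every vertex and the coordinates of one vertex plus the side lengths, carry the same information for an axis-aligned rectangle. The sketch below shows the conversion in both directions. The class name `AreaFrame` and its field names are hypothetical, and the frame is assumed to be axis-aligned with the origin at the upper left of the image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AreaFrame:
    """Axis-aligned frame stored as upper-left vertex plus side lengths."""
    left: float
    top: float
    width: float
    height: float

    def vertices(self):
        """Return the four vertices, clockwise from the upper-left corner."""
        right, bottom = self.left + self.width, self.top + self.height
        return [(self.left, self.top), (right, self.top),
                (right, bottom), (self.left, bottom)]

    @classmethod
    def from_vertices(cls, points):
        """Rebuild the frame from its vertices, given in any order."""
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        return cls(min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))
```

Storing one vertex and the side lengths is the more compact form, while the vertex list generalizes more easily to frames that are not rectangles.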

The identification correct answer acquisition unit 123 may also acquire an identification correct answer for each pixel instead of acquiring information on identification area frames. The per-pixel identification correct answer may be given, for example, as a mask on the image, as indicated by dot hatching in FIG. 16.

FIG. 16 is a schematic diagram showing an example of identification correct answers given as masks on an image.

In the example of FIG. 16, as identification correct answers, a mask Aa is given to the person A, and masks Ba and Ca are given to the automobiles B and C, respectively. In this way, the identification correct answer acquisition unit 123 outputs an identification correct answer for each pixel. Although no mask is given to the road here, a mask may also be given to the surrounding environment, such as the road.
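A per-pixel identification correct answer and an identification area frame are related: the circumscribing rectangle of a mask is a frame for the same subject. The following sketch assumes the mask is stored as a grid of integer labels with `0` for unlabeled pixels. That encoding and the function name `circumscribed_frame` are illustrative assumptions, not something the patent specifies.

```python
def circumscribed_frame(mask, label):
    """Tightest axis-aligned rectangle around all pixels equal to `label`.

    Returns (left, top, width, height) in pixels, or None if the label
    does not occur in the mask.
    """
    coords = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row)
              if value == label]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
```

The reverse conversion is lossy, since a frame records only the extent of a subject and not its outline.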

As shown in FIG. 11, in step S5 the learning unit 124 refers to the correspondence between the plurality of first pixels and the plurality of second pixels acquired in step S1 and, on the basis of the identification result of the captured image, generates an identification model (classifier) for identifying the first calculated captured image. For example, by referring to the correspondence between the plurality of second pixels of the captured image shown in FIG. 13 and the plurality of first pixels of the first calculated captured image shown in FIG. 14, it is possible to recognize which position (pixel) in the first calculated captured image each position (pixel) in the captured image corresponds to. Then, for example, machine learning is performed so that the identification correct answer for the persons A1, A2, and A3 included in the first calculated captured image shown in FIG. 14 is the position of the identification area frame FA shown in FIG. 15B or the position of the mask Aa shown in FIG. 16, each of which is an identification result of the captured image shown in FIG. 13, and so that the category is person. The classifier is generated in this manner. Similarly, machine learning is performed so that the identification correct answer for the automobiles B1, B2, B3, and B4 is the position of the identification area frame FB or the position of the mask Ba with the category automobile, and so that the identification correct answer for the automobiles C1, C2, C3, and C4 is the position of the identification area frame FC or the position of the mask Ca with the category automobile. At this time, machine learning may also be performed on the positions, in the depth direction, of the imaging target and its surrounding environment. As will be described in detail later, using a multi-view stereo camera or the like as the camera that captures the normal captured image makes it easy to acquire positions in the depth direction, and machine learning can be performed on the basis of the acquired positions.

A large number of pairs (for example, several thousand pairs) of a captured image as shown in FIG. 13 and a first calculated captured image as shown in FIG. 14 are prepared. The learning unit 124 acquires the classifier stored in the second memory 224, inputs the first calculated captured images to the classifier, acquires the output results, and adjusts the classifier so that each output result matches the identification correct answer that was given using the captured image corresponding to the respective first calculated captured image. The learning unit 124 then updates the classifier in the second memory 224 by storing the adjusted classifier in the second memory 224.
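The adjustment loop described above (input, compare the output with the identification correct answer, adjust, store) can be sketched with a deliberately small stand-in model. The code below substitutes a single-layer logistic classifier trained by gradient descent for the multi-layer neural network of the disclosure. The feature vectors, the function names, and the hyperparameters are all hypothetical and serve only to show the shape of the loop.

```python
import math


def predict(weights, bias, features):
    """Output of the stand-in classifier: a probability between 0 and 1."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def adjust_classifier(weights, bias, samples, learning_rate=0.5, epochs=200):
    """Adjust the parameters so outputs approach the correct answers.

    `samples` is a list of (features, correct_answer) pairs with
    correct_answer in {0, 1}. Returns the updated (weights, bias).
    """
    weights = list(weights)  # work on a copy, like reading from memory
    for _ in range(epochs):
        for features, correct in samples:
            error = predict(weights, bias, features) - correct
            weights = [w - learning_rate * error * f
                       for w, f in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias
```

Returning the adjusted parameters instead of mutating them in place mirrors the description, in which the learning unit 124 reads the classifier from the second memory 224 and writes the adjusted classifier back.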

In step S6, the learning unit 124 outputs the identification model (classifier) to the image identification device 10, which identifies a second calculated captured image. The image identification device 10 can thereby use the classifier generated by the learning device 12 to identify the imaging target and the surrounding environment of the imaging target included in the second calculated captured image, which a person cannot visually recognize in the same way as the state of the real space. This will be described with reference to FIGS. 1 and 17.

FIG. 17 is a flowchart showing an example of the flow of operation of the image identification device 10 according to the embodiment. The following description assumes that the imaging unit 11 is a light field camera.

In step S101, the acquisition unit 101 acquires, from the first memory 203 (see FIG. 10), a second calculated captured image that was captured by the imaging unit 11 and includes an imaging target and the surrounding environment of the imaging target. Specifically, the acquisition unit 101 acquires the second calculated captured image by having the first input circuit 201 acquire it from the first memory 203. For example, the imaging unit 11 captures (acquires) a light field image as the second calculated captured image in every first period, which is a predetermined period, and the image is stored in the first memory 203. The acquisition unit 101 acquires the light field image captured by the imaging unit 11 and outputs it to the identification unit 102. The acquisition unit 101 may instead acquire a light field image from outside the identification system 1 (specifically, a light field image from outside may be stored in the first memory 203, and the acquisition unit 101 may acquire it from the first memory 203).

Next, in step S102, the identification unit 102 identifies the imaging target in the second calculated captured image using the identification model (classifier) stored in the first memory 203, which is configured as a network having no pooling layer. That is, the identification unit 102 detects objects to be identified in the light field image. The objects to be identified may be set in the classifier in advance. For example, when the identification system 1 is mounted on an automobile, examples of objects to be identified are persons, automobiles, bicycles, and traffic signals. By inputting the light field image to the classifier, the identification unit 102 acquires from the classifier, as an output result, the identification result for the objects to be identified. Details of the identification processing by the identification unit 102 will be described later. The identification unit 102 may store the light field image that has undergone the identification processing in the first memory 203.
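One generally known consequence of building a network without pooling layers is that the spatial size of the feature maps is not reduced between layers. The sketch below only illustrates that arithmetic, using the standard output-size formula for convolution and pooling windows. The layer lists are hypothetical and are not the architecture of the disclosed identification model.

```python
def output_size(input_size, layers):
    """Spatial size (one dimension) after a stack of layers.

    Each layer is (kind, kernel, stride, padding). The same formula,
    floor((n + 2p - k) / s) + 1, applies to convolution and pooling.
    """
    size = input_size
    for _kind, kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


# Hypothetical stacks: 3x3 convolutions, with and without 2x2 pooling.
WITH_POOLING = [("conv", 3, 1, 1), ("pool", 2, 2, 0),
                ("conv", 3, 1, 1), ("pool", 2, 2, 0)]
WITHOUT_POOLING = [("conv", 3, 1, 1), ("conv", 3, 1, 1)]
```

For a 64-pixel-wide input, the stack with pooling ends at 16 pixels, while the stack without pooling stays at 64.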

Next, in step S103, the output unit 103 outputs the result of the identification processing by the identification unit 102 (identification result). For example, the output unit 103 may output image information that includes the light field image or image information that does not include it. The image information may include at least information on the objects detected by the identification unit 102. The information on an object includes the position of the object (its position in the plane or in the depth direction), its region, and the like. The output unit 103 may output the image information to at least one of a display included in the identification system 1 and an external device.

The identification processing of step S102 in FIG. 17 will now be described further. Image information and depth information can be acquired simultaneously from the light field image captured by the imaging unit 11, which is a light field camera. The identification unit 102 performs identification processing on the light field image using the classifier trained by the learning device 12. As described above, this training is realized by machine learning using a neural network, such as deep learning.

The identification unit 102 may be configured to identify texture information and depth information and to identify objects included in the image in an integrated manner using the identified texture information and depth information. Such a configuration is shown in FIG. 18.

FIG. 18 is a schematic diagram showing an example of the functional configuration of the identification unit 102.

As shown in FIG. 18, such an identification unit 102 includes a texture information identification unit 1021, a depth information identification unit 1022, and an integrated identification unit 1023. The texture information identification unit 1021 and the depth information identification unit 1022 are, for example, connected in parallel to the integrated identification unit 1023.

The texture information identification unit 1021 detects subjects in the light field image using texture information. Specifically, the texture information identification unit 1021 identifies the region (position in the plane) and the category of each subject in the light field image by using, as a classifier, a neural network such as that described in Non-Patent Document 1. The input to the texture information identification unit 1021 is the light field image, and its identification result is, as in the case of the learning device 12, the region and category of each subject on the light field image. In a normal captured image, the values for the directions of the incident light rays, that is, the depth values, are integrated into each pixel value, so the depth information is lost. Compared with such a normal captured image, a light field image contains more information about the subject in the image itself. Therefore, using a light field image obtained with a multi-pinhole or the like as the input to the classifier enables identification with higher accuracy than when a normal captured image is used as the input.

The depth information identification unit 1022 detects depth information of the subject from the light field image. Specifically, the depth information identification unit 1022 learns in advance, in the learning device 12, the depth information of subjects corresponding to light field images. As described later, the depth information of a subject may be calculated by acquiring multi-view stereo images from the second image acquisition unit 122, or may be acquired from the identification correct answer acquisition unit 123.

The integrated identification unit 1023 integrates the identification result of the texture information identification unit 1021 with the identification result of the depth information identification unit 1022 and outputs a final identification result. The classifier used by the integrated identification unit 1023 takes as input the texture information of the texture information identification unit 1021, or its identification result, together with the depth information that is the identification result of the depth information identification unit 1022, and outputs the final identification result. The final identification result includes the region of each object included in the light field image, the planar position of that region on the image, the depth position of that region, and the like.
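The parallel arrangement of FIG. 18 can be sketched as two independent functions applied to the same light field image, followed by a combining step. In this sketch the three units are passed in as plain callables, and the result formats (a list of dictionaries with `category` and `region`, and a mapping from region to depth) are hypothetical. The disclosure describes the integration as performed by a classifier, whereas `integrate_by_region` below is a simple lookup used only to show the data flow.

```python
def identify_parallel(light_field_image, texture_identifier,
                      depth_identifier, integrate):
    """Run texture and depth identification on the same input, then merge."""
    texture_result = texture_identifier(light_field_image)
    depth_result = depth_identifier(light_field_image)
    return integrate(texture_result, depth_result)


def integrate_by_region(texture_result, depth_result):
    """Attach a depth to each detected object by looking up its region.

    `texture_result` is a list of dicts with "category" and "region";
    `depth_result` maps a region to a depth value (None if unknown).
    """
    return [dict(obj, depth=depth_result.get(obj["region"]))
            for obj in texture_result]
```

Because neither identifier consumes the other's output in this arrangement, the two could run concurrently.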

A neural network for the texture information identification unit 1021 and a neural network for the depth information identification unit 1022 may be generated separately. That is, a neural network for identifying the position in the plane and the category may be used for the position in the plane and the category, and a separately generated neural network for identifying the position in the depth direction may be used for the position in the depth direction. Alternatively, the neural network for the texture information identification unit 1021 and the neural network for the depth information identification unit 1022 may be generated together. That is, a single neural network that identifies the position in the plane, the position in the depth direction, and the category all together may be used.

In the above description, the imaging unit 11 is a light field camera using a multi-pinhole or microlenses, but the imaging unit 11 is not limited to this. For example, the imaging unit 11 may be configured to capture a coded aperture image. Such a configuration is also a kind of multi-pinhole camera.

FIG. 19 is a schematic diagram of an example of a coded aperture mask that uses a random mask as the coded aperture.

As shown in FIG. 19, the coded aperture mask 311 has light-transmitting regions, shown as uncolored regions, and light-blocking regions, shown as blacked-out regions, and it can be seen that the light-transmitting and light-blocking regions are arranged randomly. Such a coded aperture mask 311 is produced by vapor-depositing chromium on glass. When such a coded aperture mask 311 is placed on the optical path between a main lens and an image sensor, part of the light rays is blocked. A camera that captures coded aperture images can thereby be realized.

The second image acquisition unit 122 may acquire, instead of a normal image, an image from which depth information can be obtained in addition to image information. For example, the second image acquisition unit 122 may be configured with a multi-view stereo camera. By acquiring multi-view stereo images, the second image acquisition unit 122 can also acquire three-dimensional information on the subject. Therefore, by calibrating in advance the images acquired by the first image acquisition unit 121 and the second image acquisition unit 122, the correspondence between the image acquired by the first image acquisition unit 121 and the image acquired by the second image acquisition unit 122 can be obtained. This calibration determines the correspondence between the three-dimensional coordinates acquired by the second image acquisition unit 122 and the image coordinates acquired by the first image acquisition unit 121. The identification correct answer for the captured image acquired by the second image acquisition unit 122 can thereby be converted into an identification correct answer for the first calculated captured image acquired by the first image acquisition unit 121. In this way, the captured image may be an image obtained by imaging the imaging target and the surrounding environment of the imaging target with a multi-view stereo camera.
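Converting an identification correct answer from the captured image to the first calculated captured image can be sketched as a lookup through the pixel correspondence. The data layout below is an assumption made for illustration: the correct answer is a per-pixel label grid, and `correspondence` maps each pixel of the captured image to the list of pixels it lands on in the calculated captured image (several, because each subject appears once per pinhole). The function name `transfer_labels` is hypothetical.

```python
def transfer_labels(label_mask, correspondence, out_height, out_width,
                    background=0):
    """Map per-pixel labels from the captured image to the calculated image.

    `correspondence` maps (y2, x2) in the captured image to a list of
    (y1, x1) positions in the calculated captured image. Where several
    labels land on the same output pixel, the last one written wins.
    """
    out = [[background] * out_width for _ in range(out_height)]
    for (y2, x2), targets in correspondence.items():
        label = label_mask[y2][x2]
        if label == background:
            continue  # unlabeled pixels carry no correct answer
        for y1, x1 in targets:
            out[y1][x1] = label
    return out
```

Overlapping subjects are resolved arbitrarily here. A fuller treatment would have to decide how to label pixels where superimposed images of different subjects coincide.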

In the above description, the identification correct answer includes, for example, information on the category to which a subject such as a person, an automobile, a bicycle, or a traffic signal belongs, the planar position and region of the subject on the image, and the position of the subject in the depth direction. For the identification system 1 to identify the position in the depth direction (depth information), for example, it suffices to give, as the identification correct answer, the position in the depth direction (depth information) obtained from the multi-view stereo images acquired by the second image acquisition unit 122.

Instead of a configuration in which the texture information identification unit 1021 and the depth information identification unit 1022 are in a parallel relationship, the identification unit 102 may be configured so that identification by the texture information identification unit 1021 is performed after the depth information identification unit 1022 has extracted the depth information.

FIG. 20 is a schematic diagram showing another example of the functional configuration of the identification unit 102.

As shown in FIG. 20, in the identification unit 102, the depth information identification unit 1022, the texture information identification unit 1021, and the integrated identification unit 1023 may be in a serial relationship. The depth information identification unit 1022 generates a depth image from the light field image. The texture information identification unit 1021 takes the depth image generated by the depth information identification unit 1022 as input and identifies the position, region, and category of each subject by using, for example, a neural network such as that described in Non-Patent Document 1. The integrated identification unit 1023 outputs the identification result of the texture information identification unit 1021. As in the case where the texture information identification unit 1021 and the integrated identification unit 1023 are in a parallel relationship, the final identification result includes the region of each object included in the light field image, the planar position of that region on the image, the depth position of that region, and the like.
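The serial arrangement of FIG. 20 amounts to function composition: the depth stage produces a depth image, the texture stage identifies subjects in that depth image, and the integration stage passes the result on. The sketch below uses plain callables as stand-ins for the three units. The function name `identify_serial` and the default pass-through integration are illustrative assumptions.

```python
def identify_serial(light_field_image, depth_identifier, texture_identifier,
                    integrate=lambda result: result):
    """Depth stage first, then texture stage on the depth image, then output."""
    depth_image = depth_identifier(light_field_image)
    texture_result = texture_identifier(depth_image)
    return integrate(texture_result)
```

Unlike the parallel arrangement, the texture stage here never sees the light field image directly, so everything it identifies must be recoverable from the depth image.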

The identification unit 102 may also change the configuration of its neural network according to the imaging unit 11. When the imaging unit 11 is a light field camera, the depth image is generated using, for example, the positions and sizes of the multi-pinhole of the imaging unit 11. If the positions and sizes of the multi-pinhole differ from one imaging unit 11 to another because of, for example, the type of the imaging unit 11 or manufacturing variation, the identification accuracy of the identification unit 102 can be improved by configuring a neural network for each imaging unit 11 (in other words, by performing machine learning individually for each imaging unit 11). Information on the positions and sizes of the multi-pinhole can be acquired by performing camera calibration in advance.

As described above, the identification unit 102 takes a light field image as input and performs identification processing using the texture information and depth information of the light field image. Compared with conventional identification processing based only on a texture image using a normal captured image, the identification unit 102 can thereby also identify, for example, how far away a subject is, which enables identification processing with higher accuracy.

As described above, the identification system 1 according to the embodiment, which includes the image identification device 10 having the identification unit 102, and the identification system 1A according to the modification of the embodiment, which includes the image identification device 10 and the learning device 12, have been illustrated. However, the identification unit 102 may, for example, include the learning device 12, in which case the identification system 1 includes the learning device 12. In that case, the identification system 1 has functions equivalent to those of the identification system 1A.

As described above, in the identification systems 1 and 1A according to the embodiment and the modification, the image identification device 10 identifies a subject in a second calculated captured image, such as a light field image, using that image. Furthermore, over the course of the identification processing, the image identification device 10 does not restore the second calculated captured image to a normal captured image. It identifies the subject in the second calculated captured image on the basis of the texture information contained in the second calculated captured image and the depth information contained in the calculated captured image. The image identification device 10 can therefore reduce the amount of processing needed to identify the subject. In particular, compared with a technique that restores the second calculated captured image to a normal captured image during identification, the image identification device 10 enables a significant speed-up of the identification processing. In addition, because depth information can be acquired without a three-dimensional range finder or the like, cost can be reduced.

The optical axis of the camera used to capture the first calculated captured image (for example, the first image acquisition unit 121) and the optical axis of the camera used to capture the captured image (for example, the second image acquisition unit 122) may be made to substantially coincide. FIG. 21A is a schematic diagram for explaining this.

図２１Ａは、第二画像取得部１２２の光軸と第一画像取得部１２１の光軸とがおおよそ一致することを示す模式図である。 FIG. 21A is a schematic diagram showing that the optical axis of the second image acquisition unit 122 and the optical axis of the first image acquisition unit 121 are approximately the same.

この図において、第一画像取得部１２１及び第二画像取得部１２２として、それぞれ、そのハードウェアの例であるカメラを模式的に示している。また、光軸２３１は第一画像取得部１２１の光軸を示し、光軸２３２は第二画像取得部１２２の光軸を示している。各光軸をおおよそ一致させるためには、第一画像取得部１２１と第二画像取得部１２２とを接近させ、かつ、各光軸がほぼ平行になるように配置すればよい。 In this figure, as the first image acquisition unit 121 and the second image acquisition unit 122, cameras that are examples of the hardware are schematically shown. An optical axis 231 indicates the optical axis of the first image acquisition unit 121, and an optical axis 232 indicates the optical axis of the second image acquisition unit 122. In order to make the optical axes approximately coincide with each other, the first image acquisition unit 121 and the second image acquisition unit 122 may be brought close to each other and arranged so that the optical axes are substantially parallel to each other.

When the second image acquisition unit 122 is configured as a stereo camera, the optical axis of each of the two cameras constituting the second image acquisition unit 122 need only approximately coincide with the optical axis of the first image acquisition unit 121. FIG. 21B is a schematic diagram for explaining this.

FIG. 21B is a schematic diagram showing that each optical axis of the stereo camera constituting the second image acquisition unit 122 approximately coincides with the optical axis of the first image acquisition unit 121.

In this figure, the same components as in FIG. 21A are given the same reference signs, and their description is omitted. The optical axes 232a and 232b are the optical axes of the respective cameras of the stereo camera constituting the second image acquisition unit 122. As described above, the identification system 1 or 1A of the present embodiment converts the identification correct answer for the captured image acquired by the second image acquisition unit 122 into an identification correct answer for the first calculated captured image acquired by the first image acquisition unit 121. Making the optical axes approximately coincide reduces the error introduced by this conversion and enables more accurate identification.

また、第一画像取得部１２１と第二画像取得部１２２の光軸を一致させるために、ビームスプリッタ、プリズム又はハーフミラーなどを利用してもかまわない。 Further, in order to make the optical axes of the first image acquisition unit 121 and the second image acquisition unit 122 coincide with each other, a beam splitter, a prism, a half mirror, or the like may be used.

図２２は、第一画像取得部１２１の光軸と第二画像取得部１２２の光軸とを一致させるために、ビームスプリッタが利用されることを示す模式図である。 FIG. 22 is a schematic diagram showing that a beam splitter is used to match the optical axis of the first image acquisition unit 121 and the optical axis of the second image acquisition unit 122.

In this figure, the same components as in FIG. 21A are given the same reference numbers, and their description is omitted. The beam splitter 240 can split the light from the subject into two beams. Aligning one of the split beams with the optical axis 231 of the first image acquisition unit 121 and the other with the optical axis 232 of the second image acquisition unit 122 makes the optical axis of the first image acquisition unit 121 coincide with that of the second image acquisition unit 122. In this way, the optical axis of the camera used to capture the first calculated captured image (for example, the first image acquisition unit 121) and the optical axis of the camera used to capture the captured image (for example, the second image acquisition unit 122) coincide by way of a beam splitter, a prism, or a half mirror. As described above, the identification system 1 or 1A of the present embodiment converts the identification correct answer for the captured image acquired by the second image acquisition unit 122 into an identification correct answer for the first calculated captured image acquired by the first image acquisition unit 121. Making the optical axes coincide reduces the error introduced by this conversion and enables more accurate identification.

The identification system 1 and the image identification device 10 of the present disclosure have been described above on the basis of the embodiment, but the present disclosure is not limited to that embodiment. Forms obtained by applying to the embodiment various modifications conceived by those skilled in the art, and forms constructed by combining components of different embodiments, are also included in the scope of the present disclosure as long as they do not depart from its gist.

For example, in the above embodiment, the position in the plane, the position in the depth direction, and the category information of the imaging target and its surrounding environment included in the second calculated captured image are identified, but the present disclosure is not limited to this. Only one or two of the position in the plane, the position in the depth direction, and the category information of the imaging target and its surrounding environment may be identified. In other words, an identification model may be generated by machine learning on only one or two of these items.

In addition, in the above embodiment the position in the depth direction is also machine-learned, but this is not required. For example, at the stage where the acquisition unit 101 has acquired the second calculated captured image, the position of the subject in the depth direction may be calculated using the image in which the imaging target and its surrounding environment included in the second calculated captured image each appear superimposed multiple times. In other words, the position in the depth direction may be calculated directly from the second calculated captured image itself, without using the identification model.
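As an illustration of computing a depth cue directly from the superimposed image, the following is a minimal sketch and not a method stated in this disclosure. It assumes a one-dimensional signal and a two-pinhole capture, and it substitutes a plain autocorrelation peak search for whatever calculation an actual implementation would use. The function names `superimpose`, `autocorrelation`, and `estimate_shift` are hypothetical.

```python
from typing import List, Sequence


def superimpose(scene: Sequence[float], shift: int) -> List[float]:
    """Simulate a two-pinhole capture: the scene plus a copy shifted by `shift` samples."""
    return [
        scene[i] + (scene[i - shift] if i >= shift else 0.0)
        for i in range(len(scene))
    ]


def autocorrelation(signal: Sequence[float], lag: int) -> float:
    """Raw (unnormalized) autocorrelation of `signal` at the given lag."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))


def estimate_shift(signal: Sequence[float], min_lag: int, max_lag: int) -> int:
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation.

    Assumption: the superimposed copies dominate the self-similarity of the
    signal, so the strongest non-zero lag is the offset between the copies.
    """
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(signal, lag))
```

The estimated offset between the superimposed copies grows as the subject gets closer to the camera, so it can serve as a depth cue without running the identification model. Real images would need a two-dimensional search and some handling of repetitive texture.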

In addition, the identification correct answer for the captured image acquired by the second image acquisition unit 122 is, for example, given manually by a person, but the present disclosure is not limited to this. For example, a learning model for giving the identification correct answer for the captured image acquired by the second image acquisition unit 122 may be prepared in advance, and the identification correct answer may be given using that learning model.

また、例えば、本開示は、画像識別装置１０として実現できるだけでなく、画像識別装置１０を構成する各構成要素が行うステップ（処理）を含む識別方法として実現できる。 Further, for example, the present disclosure can be realized not only as the image identification device 10 but also as an identification method including steps (processes) performed by each component constituting the image identification device 10.

Specifically, as shown in FIG. 17, the identification method acquires a calculated captured image including an imaging target and the surrounding environment of the imaging target (step S101), and identifies the imaging target in the calculated captured image using an identification model composed of a network having no pooling layers (step S102).

また、例えば、それらのステップは、コンピュータ（コンピュータシステム）によって実行されてもよい。そして、本開示は、それらの方法に含まれるステップを、コンピュータに実行させるためのプログラムとして実現できる。さらに、本開示は、そのプログラムを記録したＣＤ−ＲＯＭ等である非一時的なコンピュータ読み取り可能な記録媒体として実現できる。 Further, for example, these steps may be executed by a computer (computer system). The present disclosure can be realized as a program for causing a computer to execute the steps included in these methods. Furthermore, the present disclosure can be realized as a non-transitory computer-readable recording medium such as a CD-ROM or the like on which the program is recorded.

In the present disclosure, all or part of a system, device, member, or unit, or all or part of the functional blocks in the block diagrams shown in the drawings, may be implemented by one or more electronic circuits including a semiconductor device, a semiconductor integrated circuit (IC), or an LSI (large scale integration).

ＬＳＩ又はＩＣは、一つのチップに集積されてもよいし、複数のチップを組み合わせて構成されてもよい。例えば、記憶素子以外の機能ブロックは、一つのチップに集積されてもよい。ここでは、ＬＳＩやＩＣと呼んでいるが、集積の度合いによって呼び方が変わり、システムＬＳＩ、ＶＬＳＩ（very large scale integration）、若しくはＵＬＳＩ（ultra large scale integration）と呼ばれるものであってもよい。ＬＳＩの製造後にプログラムされる、ＦｉｅｌｄＰｒｏｇｒａｍｍａｂｌｅＧａｔｅＡｒｒａｙ（FPGA）、又はＬＳＩ内部の接合関係の再構成又はＬＳＩ内部の回路区画のセットアップができるｒｅｃｏｎｆｉｇｕｒａｂｌｅｌｏｇｉｃｄｅｖｉｃｅも同じ目的で使うことができる。 The LSI or IC may be integrated on a single chip, or may be configured by combining a plurality of chips. For example, the functional blocks other than the memory element may be integrated on one chip. Here, the term “LSI” or “IC” is used, but the term changes depending on the degree of integration, and it may be called system LSI, VLSI (very large scale integration), or ULSI (ultra large scale integration). A field programmable gate array (FPGA), which is programmed after the manufacture of the LSI, or a reconfigurable logic device capable of reconfiguring the junction relationship inside the LSI or setting up a circuit partition inside the LSI can be used for the same purpose.

The identification unit 102 may also acquire from the acquisition unit 101, together with the calculated captured image, the calculated imaging parameters used when the imaging unit 11 captured the calculated captured image (second calculated captured image), and may obtain information on objects in the calculated captured image by using the calculated imaging parameters in addition to the classifier and the calculated captured image. This will be described with reference to FIGS. 23 to 30.

図２３は、３つのピンホール２１１ａａ１、２１１ａａ２及び２１１ａａ３を有するマルチピンホールマスク２１１ａの一例を示す模式図である。 FIG. 23 is a schematic diagram showing an example of a multi-pinhole mask 211a having three pinholes 211aa1, 211aa2, and 211aa3.

マルチピンホールカメラで撮像された画像は、複数の画像が重畳されたものとなる。このとき、重畳された画像の位置関係は、マルチピンホールマスク２１１ａにおけるピンホールの位置関係に依存する。ここで、３つのピンホールを有するマルチピンホールマスクとして、図２３のようなマルチピンホールマスク２１１ａを利用するとする。このマルチピンホールマスク２１１ａは、例えば、３個のピンホール２１１ａａ１、２１１ａａ２及び２１１ａａ３を有している。この図において、ｄ１２はピンホール２１１ａａ１とピンホール２１１ａａ２の距離、ｄ１３はピンホール２１１ａａ１とピンホール２１１ａａ３の距離、ｄ２３はピンホール２１１ａａ２とピンホール２１１ａａ３の距離を示している。 An image picked up by the multi-pinhole camera is obtained by superimposing a plurality of images. At this time, the positional relationship between the superimposed images depends on the positional relationship between the pinholes in the multi-pinhole mask 211a. Here, it is assumed that a multi-pinhole mask 211a as shown in FIG. 23 is used as a multi-pinhole mask having three pinholes. The multi-pinhole mask 211a has, for example, three pinholes 211aa1, 211aa2, and 211aa3. In this figure, d12 indicates the distance between the pinhole 211aa1 and the pinhole 211aa2, d13 indicates the distance between the pinhole 211aa1 and the pinhole 211aa3, and d23 indicates the distance between the pinhole 211aa2 and the pinhole 211aa3.

FIG. 24 is a schematic diagram showing an example of a calculated captured image 500 captured using the imaging unit 11 having the multi-pinhole mask 211a of FIG. 23. The following description focuses on the person A and the automobile B shown in FIG. 13. In FIG. 24, the person A is imaged as A1, A2, and A3, and the automobile B is imaged as B1, B2, and B3.

FIG. 25 is a schematic diagram showing the identification result of the identification unit 102 for the calculated captured image 500 of FIG. 24. The figure shows the identification area frames TA1, TA2, and TA3 of A1, A2, and A3, which correspond to the person A, and the identification area frames TB1, TB2, and TB3 of B1, B2, and B3, which correspond to the automobile B.

図２６は、識別領域枠の重心位置の距離と、撮像部１１から被写体までの距離の関係を説明するための模式図である。この図において、ＤＡ１２は識別領域枠ＴＡ１と識別領域枠ＴＡ２の重心間の距離、ＤＡ１３は識別領域枠ＴＡ１と識別領域枠ＴＡ３のそれぞれの重心間の距離、ＤＡ２３は識別領域枠ＴＡ２と識別領域枠ＴＡ３のそれぞれの重心間の距離、ＤＢ１２は識別領域枠ＴＢ１と識別領域枠ＴＢ２のそれぞれの重心間の距離、ＤＢ１３は識別領域枠ＴＢ１と識別領域枠ＴＢ３のそれぞれの重心間の距離、ＤＢ２３は識別領域枠ＴＢ２と識別領域枠ＴＢ３のそれぞれの重心間の距離を示している。 FIG. 26 is a schematic diagram for explaining the relationship between the distance of the center of gravity position of the identification area frame and the distance from the imaging unit 11 to the subject. In this figure, DA12 is the distance between the centroids of the identification area frame TA1 and the identification area frame TA2, DA13 is the distance between the centroids of the identification area frame TA1 and the identification area frame TA3, and DA23 is the identification area frame TA2 and the identification area frame. The distance between the centroids of TA3, DB12 is the distance between the centroids of identification area frame TB1 and identification area frame TB2, DB13 is the distance between the centroids of identification area frame TB1 and identification area frame TB3, and DB23 is the identification The distance between the center of gravity of each of the area frame TB2 and the identification area frame TB3 is shown.

At this time, the positional relationship on the image among the identification area frames TA1, TA2, and TA3, and that among the identification area frames TB1, TB2, and TB3, are each geometrically similar to the positional relationship among the pinholes 211aa1, 211aa2, and 211aa3 on the multi-pinhole mask 211a. The following relationship therefore holds.

d12 : d13 : d23 = DA12 : DA13 : DA23 = DB12 : DB13 : DB23

このことから、計算撮像パラメータとして、マルチピンホールマスク２１１ａ上のピンホール２１１ａａ１、２１１ａａ２及び２１１ａａ３の位置情報を利用することで、計算撮像画像５００上の被写体の位置関係を推定できることがわかる。 From this, it is understood that the positional relationship of the subject on the calculated captured image 500 can be estimated by using the positional information of the pinholes 211aa1, 211aa2, and 211aa3 on the multi-pinhole mask 211a as the calculated imaging parameter.
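The similarity relation above can be checked numerically. The following is a minimal sketch under the ideal-optics assumption stated in this description. The function names, the tolerance value, and the coordinate tuples are hypothetical and are introduced only for illustration.

```python
from math import dist, isclose
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pairwise_distances(points: Sequence[Point]) -> Tuple[float, float, float]:
    """Distances (1-2, 1-3, 2-3) among three 2-D points, e.g. (d12, d13, d23)."""
    p1, p2, p3 = points
    return dist(p1, p2), dist(p1, p3), dist(p2, p3)


def matches_pinhole_layout(
    pinholes: Sequence[Point],
    centroids: Sequence[Point],
    rel_tol: float = 0.05,  # hypothetical tolerance, not from the source
) -> bool:
    """True if d12 : d13 : d23 equals D12 : D13 : D23 within `rel_tol`.

    `pinholes` are the pinhole positions on the mask and `centroids` are the
    centroids of the three identification area frames, listed in the same order.
    """
    d = pairwise_distances(pinholes)
    D = pairwise_distances(centroids)
    scale = D[0] / d[0]  # common ratio implied by the first pair
    return all(isclose(Di, di * scale, rel_tol=rel_tol) for di, Di in zip(d, D))
```

A check of this kind could be used to decide whether three detected frames plausibly belong to the same subject, since frames from one subject should reproduce the pinhole layout up to a common scale.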

同様に、Ａ１、Ａ２、Ａ３の識別領域枠ＴＡ１、ＴＡ２及びＴＡ３それぞれの重心間の距離は、その被写体である人物Ａの撮像部１１からの距離に依存する。ここで、被写体である人物Ａは自動車Ｂより撮像部１１に近い位置に存在するため、以下の関係が成り立つ。 Similarly, the distance between the centroids of the identification area frames TA1, TA2, and TA3 of A1, A2, and A3 depends on the distance from the imaging unit 11 of the person A that is the subject. Here, since the person A who is the subject exists at a position closer to the imaging unit 11 than the car B, the following relationship is established.

DA12 > DB12
DA13 > DB13
DA23 > DB23

ただし、これらの関係は、撮像部１１が利用するレンズなどの光学系が理想的な場合にのみ成り立つ。通常のレンズは、収差などの歪みを有するため、事前にカメラキャリブレーションを実施し、歪みの影響を取り除く必要がある。このようなカメラキャリブレーションは、例えば、Ｔｓａｉのキャリブレーションとして知られている手法を利用することで実現できる。 However, these relationships hold only when an optical system such as a lens used by the imaging unit 11 is ideal. Since normal lenses have distortions such as aberrations, it is necessary to perform camera calibration in advance to remove the influence of distortions. Such camera calibration can be realized, for example, by using a method known as Tsai calibration.

また、学習部１２４は、第一画像取得部１２１が取得した計算撮像画像（第１の計算撮像画像）と、第二画像取得部１２２が取得した撮像画像と、識別正解取得部１２３が取得した識別正解に加え、撮像部１１が計算撮像画像を撮像する際に利用した計算撮像パラメータとを用いて、識別部１０２における識別器の学習を行うようにしてもかまわない。前述のように、マルチピンホールカメラにおいて、マルチピンホールマスクのピンホール位置は、撮像される計算撮像画像に大きく影響する。そこで、この情報を利用した学習をすることで、識別精度を向上させることができる。この処理について詳述する。ここで、識別器として、ＳＳＤ（Single Shot MultiBox Detector）などのＤｅｅｐＬｅａｒｎｉｎｇを想定する。 In addition, the learning unit 124 acquires the calculated captured image (first calculated captured image) acquired by the first image acquisition unit 121, the captured image acquired by the second image acquisition unit 122, and the identification correct answer acquisition unit 123. In addition to the correct identification, the discriminator in the discriminating unit 102 may be learned using the calculated imaging parameters used when the imaging unit 11 captures the calculated captured image. As described above, in the multi-pinhole camera, the pinhole position of the multi-pinhole mask greatly affects the calculated captured image. Therefore, identification accuracy can be improved by learning using this information. This process will be described in detail. Here, Deep Learning such as SSD (Single Shot MultiBox Detector) is assumed as the discriminator.

FIG. 27A is a schematic diagram showing an example of the deep learning network used as the classifier of the identification unit 102. Such a classifier is a network in which a plurality of feature extraction layers that extract feature quantities from the image are connected to a recognition layer that outputs the recognition result. FIG. 27B is a schematic diagram showing another example of the deep learning network used as the classifier of the identification unit. Specifically, FIG. 27B shows a network in which the feature extraction layers are not simply connected in series. Instead, the output of an earlier feature extraction layer is also connected directly to the recognition layer, like a bypass. Such a configuration enables identification that is robust to subject size, as well as stable learning. To output identification detection frames, such a classifier sets identification detection frame candidates and performs identification by checking whether the information inside each frame matches an identification target.

そこで、本実施の形態の識別器は、撮像部１１が計算撮像画像を撮像する際に利用した計算撮像パラメータを利用することで、この識別検出枠候補を設定する。 Therefore, the discriminator according to the present embodiment sets the discrimination detection frame candidate by using the calculated imaging parameter used when the imaging unit 11 captures the calculated captured image.
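One way to derive linked candidates from the calculated imaging parameters can be sketched as follows. This is an illustrative assumption, not the procedure of this disclosure: it assumes an ideal pinhole model in which the copies of a subject at distance L from the image sensor, formed by two pinholes separated by d on a mask at distance f from the sensor, are offset on the sensor by d * L / (L - f). The function name `linked_candidates`, the box layout `(cx, cy, w, h)`, and the parameter `pixels_per_unit` are hypothetical, and the sign of the offset depends on the image coordinate convention.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height)
Point = Tuple[float, float]


def linked_candidates(
    box: Box,
    pinholes: Sequence[Point],
    L: float,
    f: float,
    pixels_per_unit: float = 1.0,
) -> List[Box]:
    """Candidate frames for every pinhole, given a candidate for the first one.

    `box` is the candidate frame for pinholes[0]. Each other frame is the same
    frame shifted by the pinhole offset scaled by L / (L - f), where L is a
    hypothesized subject-to-sensor distance and f the mask-to-sensor distance.
    """
    if L <= f:
        raise ValueError("the subject must lie beyond the mask (L > f)")
    scale = L / (L - f) * pixels_per_unit
    cx, cy, w, h = box
    x0, y0 = pinholes[0]
    return [(cx + scale * (x - x0), cy + scale * (y - y0), w, h) for x, y in pinholes]
```

Sweeping L over a set of hypothesized distances would yield one linked triple of candidates per distance, so the candidates for the second and third pinholes never have to be searched independently.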

FIG. 28 is a schematic diagram for explaining the process of setting identification detection frame candidates in a scene corresponding to the calculated captured image 500 of FIG. 24. When identification detection frames with various aspect ratios for a subject, corresponding to the pinhole 211aa1, exist at TA1, the identification detection frames TA2 and TA3 corresponding to the pinholes 211aa2 and 211aa3 exist as shown in FIG. 28. Accordingly, the loss function Loss(v, c, p, g) used during learning is defined as the sum, over the pinholes, of a confidence loss function and a localization loss function, as follows.

[Equation images: the expression for Loss(v, c, p, g) and its accompanying definitions are not reproduced in this text.]

Here, n is the number of pinholes, v is the identification category, and c is the confidence for the identification category v. g is the correct identification detection frame, and p is the predicted identification detection frame calculated from the calculated imaging parameters. k is the index of each pinhole. Training the classifier with such a loss function enables highly accurate identification.

The identification system of the present embodiment may further include a depth estimation device 13 and may output, using the calculated imaging parameters, the distance between the imaging unit 11 and a subject detected by the image identification device 10. FIG. 29 is a schematic diagram showing an example of the functional configuration of an identification system 1B, a modification of the present embodiment, that has the depth estimation device 13. In FIG. 29, components identical to those in FIG. 9 are given the same reference signs, and their description is omitted. As described above, the distance D between the centroids of the identification detection frames that correspond to different pinholes for the same subject depends on the distance L of that subject from the imaging unit 11 (see FIG. 30).

図３０は、同一被写体に対する複数のピンホールに対応する識別検出枠それぞれの重心間の距離Ｄと、その被写体の撮像部１１からの距離Ｌを説明するための模式図である。説明を簡略化するため、ピンホール数が２個の場合について説明する。この図において、Ｌは被写体と撮像部１１のイメージセンサ２１１ｂの距離、ｆはマルチピンホールマスク２１１ａとイメージセンサ２１１ｂの距離、Ｄは識別検出枠それぞれの重心間の距離、ｄはピンホール間の距離を示している。計算撮像パラメータは、マルチピンホールマスク２１１ａとイメージセンサ２１１ｂの距離ｆとピンホール間の距離ｄである。このとき以下の関係が成り立つ。 FIG. 30 is a schematic diagram for explaining the distance D between the centroids of each of the identification detection frames corresponding to a plurality of pinholes for the same subject, and the distance L from the imaging unit 11 of the subject. In order to simplify the description, a case where the number of pinholes is two will be described. In this figure, L is the distance between the subject and the image sensor 211b of the imaging unit 11, f is the distance between the multi-pinhole mask 211a and the image sensor 211b, D is the distance between the centers of gravity of the identification detection frames, and d is between the pinholes. Shows the distance. The calculated imaging parameters are the distance f between the multi-pinhole mask 211a and the image sensor 211b and the distance d between the pinholes. At this time, the following relationship holds.

(L − f) : d = L : D

Therefore, L = fD / (D − d).

つまり、画像識別装置１０が検出した識別検出枠間の距離Ｄと、計算撮像パラメータである距離ｆ及びｄから、被写体と撮像部１１の距離Ｌを計算することができる。 That is, the distance L between the subject and the imaging unit 11 can be calculated from the distance D between the identification detection frames detected by the image identification device 10 and the distances f and d that are calculated imaging parameters.
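The calculation performed by the depth estimation device 13 can be sketched directly from the relation (L − f) : d = L : D. The following is a minimal sketch. The function names are hypothetical, and it assumes that D, d, and f are expressed in the same length unit, so a centroid distance measured in pixels would first have to be converted using the sensor's pixel pitch.

```python
def subject_distance(D: float, d: float, f: float) -> float:
    """Distance L between the subject and the image sensor.

    D: distance between the centroids of the identification detection frames
    d: distance between the two pinholes
    f: distance between the multi-pinhole mask and the image sensor
    Solves (L - f) : d = L : D for L, giving L = f * D / (D - d).
    """
    if D <= d:
        raise ValueError("D must exceed d for a subject at a finite distance")
    return f * D / (D - d)


def centroid_separation(L: float, d: float, f: float) -> float:
    """Inverse relation: D = d * L / (L - f) for a subject at distance L > f."""
    if L <= f:
        raise ValueError("the subject must lie beyond the mask (L > f)")
    return d * L / (L - f)
```

The inverse relation also shows why a nearer subject yields a larger centroid separation, which agrees with DA12 > DB12 for the person A standing closer to the imaging unit 11 than the automobile B.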

さらに、システム、装置、部材又は部の全部又は一部の機能又は操作は、上述したように、ソフトウェア処理によって実行することが可能である。この場合、ソフトウェアは少なくとも１つのＲＯＭ、光学ディスク、又はハードディスクドライブなどの非一時的記録媒体に記録され、ソフトウェアが処理装置（processor）によって実行されたときに、そのソフトウェアで特定された機能が処理装置（processor）及び周辺装置によって実行される。 Furthermore, the functions or operations of all or part of the system, apparatus, member, or unit can be executed by software processing as described above. In this case, the software is recorded on a non-transitory recording medium such as at least one ROM, optical disk, or hard disk drive, and when the software is executed by a processor, the function specified by the software is processed. It is executed by a processor and peripheral devices.

システム又は装置は、ソフトウェアが記録されている一つ又は複数の非一時的記録媒体、処理装置（processor）、及びハードウェアデバイスを備えていてもよい。 The system or apparatus may include one or more non-transitory recording media in which software is recorded, a processor, and a hardware device.

また、上記で用いた序数、数量等の数字は、全て本開示の技術を具体的に説明するために例示するものであり、本開示は例示された数字に制限されない。また、構成要素間の接続関係は、本開示の技術を具体的に説明するために例示するものであり、本開示の機能を実現する接続関係はこれに限定されない。 Further, the numbers such as the ordinal numbers and the quantities used in the above are examples for specifically explaining the technology of the present disclosure, and the present disclosure is not limited to the illustrated numbers. In addition, the connection relationship between the constituent elements is exemplified for specifically explaining the technology of the present disclosure, and the connection relationship for realizing the functions of the present disclosure is not limited thereto.

また、ブロック図における機能ブロックの分割は一例であり、複数の機能ブロックを１つの機能ブロックとして実現したり、１つの機能ブロックを複数に分割したり、一部の機能を他の機能ブロックに移してもよい。また、単一のハードウェア又はソフトウェアが、類似する機能を有する複数の機能ブロックの機能を並列又は時分割に処理してもよい。 In addition, division of functional blocks in the block diagram is an example, and a plurality of functional blocks are realized as one functional block, one functional block is divided into a plurality of parts, or some functions are transferred to other functional blocks. May be. A single piece of hardware or software may process the functions of a plurality of functional blocks having similar functions in parallel or in time division.

An identification system 1A according to one aspect of the present disclosure includes: an imaging unit 11 that captures a second calculated captured image including information on the surrounding environment of an imaging target; an image identification device 10 that detects a subject included in the second calculated captured image captured by the imaging unit 11, using a classifier composed of a network having no pooling layers, and outputs the detection result; and a learning device 12 that generates the classifier. The classifier constitutes a network that is sensitive to position information on the image.

撮像部１１及び第一画像取得部は、マルチピンホールカメラ、ＣｏｄｅｄＡｐｅｒｔｕｒｅカメラ、ライトフィールドカメラ、又は、レンズレスカメラから構成される。 The imaging unit 11 and the first image acquisition unit are configured by a multi-pinhole camera, a coded aperture camera, a light field camera, or a lensless camera.

撮像部１１及び第一画像取得部１２１は、計算撮像画像として、人が見ても視覚的に認識できない画像を取得する。 The imaging unit 11 and the first image acquisition unit 121 acquire an image that cannot be visually recognized even when viewed by a person as a calculated captured image.

プーリング層を有しないネットワークで構成されている識別器のネットワークにおいて、畳み込み層の入力が重ならない。 In a network of discriminators configured by a network having no pooling layer, convolution layer inputs do not overlap.

当該識別器のネットワークにおいて、畳み込み層の入力ストライドが２以上である。 In the classifier network, the convolutional layer has an input stride of 2 or more.
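The two conditions above, non-overlapping convolution inputs and an input stride of 2 or more, can be illustrated with a strided convolution whose stride equals the kernel size. The following is a minimal pure-Python sketch for a single channel without padding. It is an illustration only, not the network of this disclosure, and the function name `conv2d_strided` is hypothetical. As in most deep learning frameworks, the operation is implemented as a cross-correlation.

```python
from typing import List, Sequence


def conv2d_strided(
    image: Sequence[Sequence[float]],
    kernel: Sequence[Sequence[float]],
    stride: int,
) -> List[List[float]]:
    """Single-channel 2-D convolution (cross-correlation) without padding.

    When `stride` equals the kernel size, each input pixel feeds exactly one
    output value, so the receptive fields do not overlap and the feature map
    is downsampled without any pooling layer.
    """
    kh, kw = len(kernel), len(kernel[0])
    height, width = len(image), len(image[0])
    output = []
    for i in range(0, height - kh + 1, stride):
        row = []
        for j in range(0, width - kw + 1, stride):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        output.append(row)
    return output
```

With a 2 x 2 kernel and a stride of 2, a 4 x 4 input is reduced to a 2 x 2 output in which every output value depends on a distinct block of the input.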

識別部１０２は、画像情報に加え、奥行情報も取得できる画像を取得する。 The identification unit 102 acquires an image that can acquire depth information in addition to the image information.

本開示の一態様に係る学習装置１２は、第１の計算撮像画像を取得する第一画像取得部１２１と、撮像画像を取得する第二画像取得部１２２と、第二画像取得部１２２が取得した撮像画像に関する識別正解を取得する識別正解取得部１２３と、撮像画像に関する識別正解を利用して、第一画像取得部１２１が取得した第１の計算撮像画像に対する機械学習を行なうことで、プーリング層を有しない識別器を取得する学習部１２４とを備え、当該識別器は、プーリング層を有しないネットワークであって、当該画像上の位置情報に敏感なネットワークを利用する識別器を有する。 The learning device 12 according to an aspect of the present disclosure is acquired by the first image acquisition unit 121 that acquires the first calculated captured image, the second image acquisition unit 122 that acquires the captured image, and the second image acquisition unit 122. Pooling by performing machine learning on the first calculated captured image acquired by the first image acquiring unit 121 using the identified correct answer acquiring unit 123 that acquires the identified correct answer related to the captured image and the identified correct answer related to the captured image. And a learning unit 124 that acquires a classifier that does not have a layer, and the classifier has a classifier that uses a network that does not have a pooling layer and is sensitive to position information on the image.

In the identification method according to an aspect of the present disclosure, a subject included in a second calculated captured image is detected using a classifier configured as a network having no pooling layer, and the detection result is output. The classifier is generated by acquiring a first calculated captured image and a captured image, acquiring the identification correct answer for the captured image, and performing machine learning on the first calculated captured image using that identification correct answer. The classifier uses a network that is sensitive to position information on the image.

The technique of the present disclosure is widely applicable to techniques for recognizing objects in a calculated captured image. It is also widely applicable when an imaging device that captures calculated captured images is mounted on a moving body that requires a high identification processing speed, for example in automated driving technology for automobiles, robots, and perimeter monitoring camera systems.

Description of symbols
1, 1A, 1B Identification system
10 Image identification device (identification device)
11 Imaging unit
12 Learning device
13 Depth estimation device
101 Acquisition unit
102 Identification unit
103 Output unit
121 First image acquisition unit
122 Second image acquisition unit
123 Identification correct answer acquisition unit
124 Learning unit
201 First input circuit
202 First arithmetic circuit
203 First memory
204 Output circuit
211 Light field camera
211a Multi-pinhole mask
211aa, 211aa1, 211aa2, 211aa3 Pinhole
211b Image sensor
221 Second input circuit
222 Third input circuit
223 Second arithmetic circuit
224 Second memory
231, 232, 232a, 232b Optical axis
240 Beam splitter
311 Coded aperture mask
400 Input layer
420, 430, 440 Intermediate output layer
401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 421, 422, 423, 424, 425, 431, 441, 442, 443, 444, 451, 452, 453, 454 Data (memory)
500 Calculated captured image
1021 Texture information identification unit
1022 Depth information identification unit
1023 Integrated identification unit

Claims

An identification system comprising:
a camera that captures a calculated captured image including an imaging object and a surrounding environment of the imaging object; and
a processing circuit that identifies the imaging object in the calculated captured image using an identification model,
wherein the identification model is configured as a network having no pooling layer.

The identification system according to claim 1,
wherein, in the network of the identification model, the inputs to a convolutional layer do not overlap.

The identification system according to claim 1,
wherein, in the network of the identification model, the input stride of a convolutional layer is 2 or more.

The identification system according to claim 3,
wherein, in the network of the identification model, the input stride of the convolutional layer is at least half the kernel size of that layer.

The identification system according to any one of claims 1 to 4,
wherein the camera captures, as the calculated captured image, an image including parallax information in which a plurality of images of the imaging object and of the surrounding environment of the imaging object are superimposed.

The identification system according to claim 5,
wherein the camera is a multi-pinhole camera, a coded aperture camera, a light field camera, or a lensless camera.

An identification device comprising a memory and a processing circuit,
wherein the processing circuit:
acquires, from the memory, a calculated captured image including an imaging object and a surrounding environment of the imaging object; and
identifies the imaging object in the calculated captured image using an identification model stored in the memory, and
wherein the identification model is configured as a network having no pooling layer.

An identification method comprising:
acquiring a calculated captured image including an imaging object and a surrounding environment of the imaging object; and
identifying the imaging object in the calculated captured image using an identification model configured as a network having no pooling layer.

A program for causing a computer to execute the identification method according to claim 8.