JP2021056785A - Image recognition system, imaging apparatus, recognition device and image recognition method

Info

Publication number: JP2021056785A
Application number: JP2019179524A
Authority: JP
Inventors: Ryusuke Nosaka; Takaharu Kurokawa; Hidenori Ujiie
Original assignee: Secom Co Ltd
Current assignee: Secom Co Ltd
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2021-04-08
Anticipated expiration: 2039-09-30
Also published as: JP7368995B2

Abstract

To provide an image recognition system that stably reduces the data capacity of an image to be recognized while maintaining the accuracy of image recognition.

SOLUTION: An image recognition method comprises the steps of: imaging a space in which a predetermined object can be imaged to generate a first image consisting of pixels having tonal values within a first tonal range; converting the first image into a second image by a converter that, when the first image is input, outputs a second image consisting of pixels having tonal values within a second tonal range smaller than the first tonal range; and generating a recognition result for the second image by a recognizer that, when the second image is input, outputs a recognition result for the predetermined object in the second image.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an image recognition system, an imaging device, a recognition device, and an image recognition method.

In recent years, the number of installed surveillance cameras has been increasing because of heightened crime-prevention awareness. As a result, it has become difficult for an observer to view the images captured by imaging devices such as surveillance cameras and recognize objects such as suspicious persons or suspicious items. There is therefore a growing demand for executing image recognition processing on such images to recognize objects automatically.

To execute image recognition processing, captured images must be temporarily stored and/or transmitted, and storing and/or transmitting the images captured by a large number of imaging devices requires a large storage and/or transmission capacity. It is therefore preferable to keep down the data capacity of the images subjected to image recognition processing while maintaining the accuracy of image recognition. Patent Document 1 discloses a surveillance camera that estimates the edge level of each of a plurality of blocks into which a captured image is divided, based on the strength of the edges contained in the block, and replaces each block with a low-resolution image according to the estimated edge level. Patent Document 2 discloses a surveillance camera that detects fine edges whose thickness is at or below a reference value, together with the fine-structure region in their vicinity, and replaces the area outside the detected fine-structure region with a low-resolution image.

Patent Document 1: Japanese Unexamined Patent Publication No. 2015-088817
Patent Document 2: Japanese Unexamined Patent Publication No. 2015-088818

However, with the methods of Patent Documents 1 and 2, the data capacity of the image after replacement varies with the edge strength and other properties of the image, which makes it difficult to predict the storage and/or transmission capacity required for image recognition processing. It is therefore desirable to stably reduce the data capacity of the images subjected to image recognition while maintaining the accuracy of image recognition.

The present invention has been made to solve the above problem, and its object is to provide an image recognition system, an imaging device, a recognition device, and an image recognition method that make it possible to stably reduce the data capacity of an image subjected to image recognition while maintaining the accuracy of image recognition.

An image recognition system according to the present invention comprises: imaging means for imaging a space in which a predetermined object can be imaged and generating a first image composed of pixels having gradation values within a first gradation range; conversion means for converting the first image into a second image by a converter that, when the first image is input, outputs a second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and recognition means for generating a recognition result for the second image by a recognizer that, when the second image is input, performs processing for recognizing the object on the second image and outputs a recognition result.

In the image recognition system according to the present invention, the converter and the recognizer are preferably trained neural networks obtained by training a neural network, in which the two are coupled so that the output of the converter becomes the input of the recognizer, such that the recognition result output when a first learning image composed of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result that should be output for that first learning image.

In the image recognition system according to the present invention, the converter and the recognizer are preferably trained neural networks trained such that, when the first learning image is input to the coupled neural network, the image having the second gradation range output by the converter approaches an edge image generated from the first learning image, and the recognition result output by the coupled neural network approaches the learning recognition result.
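The two training targets described above can be pictured as a single objective with two terms. The following is only a minimal sketch under stated assumptions: the edge operator (thresholded gradient magnitude), the binary cross-entropy and squared-error terms, the bounding-box form of the recognition result, and the weight `edge_weight` are all hypothetical choices made for illustration and are not taken from this publication.

```python
import numpy as np

def edge_image(gray, threshold=32.0):
    """Binary edge map from gradient magnitude (assumed edge operator)."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return (np.hypot(gx, gy) >= threshold).astype(np.float64)

def combined_loss(converter_prob, edge_target, predicted_box, target_box,
                  edge_weight=0.5):
    """Recognition term plus a weighted edge term (both terms are assumptions)."""
    # Edge term: binary cross-entropy between converter output and edge image.
    p = np.clip(converter_prob, 1e-7, 1.0 - 1e-7)
    edge_term = -np.mean(edge_target * np.log(p)
                         + (1.0 - edge_target) * np.log(1.0 - p))
    # Recognition term: mean squared error between predicted and labelled boxes.
    diff = (np.asarray(predicted_box, dtype=np.float64)
            - np.asarray(target_box, dtype=np.float64))
    recognition_term = np.mean(diff ** 2)
    return float(recognition_term + edge_weight * edge_term)

# Toy learning image: a vertical step edge in an 8x8 grayscale frame.
gray = np.zeros((8, 8))
gray[:, 4:] = 255.0
edges = edge_image(gray)
box = (3, 2, 5, 4)  # hypothetical labelled rectangle (left, top, right, bottom)

loss_good = combined_loss(edges, edges, box, box)
loss_bad = combined_loss(1.0 - edges, edges, (0, 0, 7, 7), box)
```

In this sketch, a converter output that matches the edge image and a recognition result that matches the label give a loss near zero, while a mismatch in either term increases it.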

Preferably, the image recognition system according to the present invention further comprises output means for outputting the converted second image to a predetermined transmission network and acquisition means for acquiring the second image from the transmission network, and the recognition means generates a recognition result for the acquired second image.

An imaging device according to the present invention comprises: imaging means for imaging a space in which a predetermined object can be imaged and generating a first image composed of pixels having gradation values within a first gradation range; conversion means for converting the first image into a second image by a converter that, when the first image is input, outputs a second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and output means for outputting the second image.

In the imaging device according to the present invention, the converter is preferably a trained neural network obtained by training a neural network, in which the output of the converter is coupled to the input of a recognizer that, when the second image is input, performs processing for recognizing the object on the second image and outputs a recognition result, such that the recognition result output when a first learning image composed of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result that should be output for that first learning image.

A recognition device according to the present invention comprises: acquisition means for acquiring a second image obtained by converting a first image generated by imaging, using a converter that, when a first image composed of pixels having gradation values within a first gradation range is input, outputs a second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and recognition means for generating a recognition result for the second image by a recognizer that, when the second image is input, performs processing for recognizing an object on the second image and outputs a recognition result.

In the recognition device according to the present invention, the recognizer is preferably a trained neural network obtained by training a neural network, coupled so that the output of the converter becomes the input of the recognizer, such that the recognition result output when a first learning image composed of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result that should be output for that first learning image.

An image recognition method according to the present invention includes: imaging a space in which a predetermined object can be imaged and generating a first image composed of pixels having gradation values within a first gradation range; converting the first image into a second image by a converter that, when the first image is input, outputs a second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and generating a recognition result for the second image by a recognizer that, when the second image is input, performs processing for recognizing the object on the second image and outputs a recognition result.

The image recognition system, imaging device, recognition device, and image recognition method according to the present invention make it possible to stably reduce the data capacity of an image subjected to image recognition while maintaining the accuracy of image recognition.

Schematic diagram for explaining an overview of the present invention.
Diagram showing an example of the schematic configuration of the image recognition system 1.
Diagram showing an example of the schematic configuration of the learning device 2.
Diagram showing an example of the schematic configuration of the imaging device 3.
Diagram showing an example of the schematic configuration of the recognition device 4.
Schematic diagram for explaining an overview of the converter.
Schematic diagram for explaining an overview of the classifier.
Diagram showing an example of the data structure of the learning data 211.
Flow diagram showing an example of the flow of the learning process.
Sequence diagram showing an example of the flow of the image recognition process.
Diagram showing an example of the recognition result screen 700.

Various embodiments of the present invention are described below with reference to the drawings. Note, however, that the technical scope of the present invention is not limited to these embodiments and extends to the inventions described in the claims and their equivalents.

(Overview of the present invention)
FIG. 1 is a schematic diagram for explaining an overview of the present invention. The image recognition system according to the present invention includes imaging means, conversion means, and recognition means.

The imaging means images a space in which a predetermined object can be imaged and generates a first image composed of pixels having gradation values within a first gradation range. The predetermined object is, for example, a person, and the first image is, for example, an image composed of pixels having a gradation value of 0 to 255 for each of the three RGB channels. The conversion means uses the converter to convert the first image generated by the imaging means into a second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range. The second image is, for example, an image composed of pixels having a gradation value of 0 or 1. The recognition means uses the recognizer to generate a recognition result for the predetermined object in the second image produced by the conversion means. The recognition result is, for example, information indicating a rectangular region circumscribing the image of a person appearing in the second image.
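As an illustration of what a "rectangular region circumscribing the image of a person" looks like as data, the sketch below computes the tightest axis-aligned rectangle around the nonzero pixels of a mask. This is not the recognizer described here, which is a neural network operating on the second image; the function name, the inclusive `(left, top, right, bottom)` convention, and the toy mask are assumptions made for illustration.

```python
import numpy as np

def circumscribed_rectangle(mask):
    """Return (left, top, right, bottom), inclusive, around nonzero pixels.

    Returns None when the mask contains no nonzero pixel.
    """
    rows = np.flatnonzero(mask.any(axis=1))  # row indices containing the object
    cols = np.flatnonzero(mask.any(axis=0))  # column indices containing the object
    if rows.size == 0:
        return None
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])

# Toy 6x8 mask with a 3x3 blob standing in for a person region.
mask = np.zeros((6, 8), dtype=np.uint8)
mask[2:5, 3:6] = 1
box = circumscribed_rectangle(mask)
```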

The converter and the recognizer are trained neural networks generated by training a neural network in which the two are coupled so that the output of the converter becomes the input of the recognizer. Training is performed so that the recognition result output from the neural network when a first learning image composed of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result that should be output for that first learning image.

In this way, in the image recognition system, the conversion means converts the first image into the second image using a converter that outputs the second image when the first image is input. This allows the image recognition system to stably reduce the data capacity of the image subjected to image recognition while maintaining the accuracy of image recognition. Specifically, the data capacity of the second image obtained by converting the first image is determined by the second gradation range and the number of pixels, and does not depend on the content of the first image. By converting the first image into the second image, the image recognition system can therefore stably reduce the data capacity of the image regardless of the content of the first image.
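The content-independence of the data capacity can be checked with simple arithmetic. The sketch below assumes uncompressed storage, a hypothetical 640x480 frame, and a single-channel second image; none of these specifics are stated in the description, which gives only the example gradation ranges (0 to 255 per RGB channel, and 0 or 1).

```python
def image_bits(width, height, channels, levels):
    """Uncompressed size in bits when each channel value takes `levels` values."""
    bits_per_value = max(1, (levels - 1).bit_length())
    return width * height * channels * bits_per_value

# Hypothetical resolution chosen only for illustration.
first_bits = image_bits(640, 480, channels=3, levels=256)  # RGB, 0 to 255
second_bits = image_bits(640, 480, channels=1, levels=2)   # 0 or 1
ratio = first_bits // second_bits
```

Under these assumptions the second image needs 1/24 of the bits of the first image, and the figure depends only on the pixel count and the gradation range, not on what the camera sees.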

In addition, the converter that outputs the second image when the first image is input is generated by training a neural network, coupled so that the output of the converter becomes the input of the recognizer, using the first learning images and the learning recognition results. This allows the image recognition system to have the converter output a second image that enables the recognizer to perform image recognition with high accuracy.

Note that the above description of FIG. 1 is given only to aid understanding of the present invention. Specifically, the present invention is carried out in each of the embodiments described below, and may also be carried out in various modifications that do not substantially depart from the principles of the present invention. All such modifications are included within the scope of the present invention and the disclosure of this specification.

(Schematic configuration of the system)
FIG. 2 is a diagram showing an example of the schematic configuration of the image recognition system 1. The image recognition system 1 includes a learning device 2, an imaging device 3, a recognition device 4, and a display device 5. The learning device 2, the imaging device 3, the recognition device 4, and the display device 5 are connected to one another via a transmission network 6 such as the Internet or an intranet.

The learning device 2 is an information processing device such as a server or a PC (Personal Computer). The learning device 2 generates the converter and the recognizer, which are trained neural networks, by simultaneous learning. The converter is a neural network that outputs a binary image when a multi-valued image is input. The recognizer is a neural network that outputs the region of a person within the binary image when the binary image is input. Simultaneous learning of the converter and the recognizer means training a neural network in which the converter and the recognizer are coupled. The multi-valued image is an example of the first image, and the binary image is an example of the second image.

The imaging device 3 is, for example, a surveillance camera. The imaging device 3 generates a multi-valued image of, for example, a room in a building by imaging that room. The imaging device 3 converts the multi-valued image into a binary image using the converter generated by the learning device 2. The imaging device 3 outputs the converted binary image to the transmission network 6. The room is an example of a space in which the object can be imaged.

The recognition device 4 is an information processing device such as a server or a PC. The recognition device 4 acquires the binary image output by the imaging device 3 from the transmission network 6. The recognition device 4 generates a recognition result for the binary image using the recognizer generated by the learning device 2. The recognition device 4 outputs the generated recognition result to the transmission network 6. The data capacity of the acquired binary image is smaller than that of the original multi-valued image and is a fixed value. This reduces the transmission capacity required by the recognition device 4 and the transmission network 6, and also reduces the storage capacity the recognition device 4 needs for temporary storage or archiving. Moreover, because the binary image is generated by a converter trained simultaneously with the recognizer, transmission and storage capacity can be reduced while recognition accuracy after conversion remains high.
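The fixed payload size can be illustrated with a round trip between a sender and a receiver. The following is a minimal sketch assuming the binary image is bit-packed at 8 pixels per byte with NumPy; the actual wire format used between the imaging device 3 and the recognition device 4 is not specified here, and the function names and the 64x48 frame size are hypothetical.

```python
import numpy as np

def pack_binary_image(img):
    """Pack a 2-D array of 0/1 values into bytes, 8 pixels per byte."""
    return np.packbits(img.astype(np.uint8), axis=None).tobytes()

def unpack_binary_image(payload, height, width):
    """Restore the 2-D 0/1 array from a payload made by pack_binary_image."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    return bits[: height * width].reshape(height, width)

rng = np.random.default_rng(0)
h, w = 48, 64
blank = np.zeros((h, w), dtype=np.uint8)                # empty scene
busy = rng.integers(0, 2, size=(h, w), dtype=np.uint8)  # cluttered scene

payload_blank = pack_binary_image(blank)
payload_busy = pack_binary_image(busy)
restored = unpack_binary_image(payload_busy, h, w)
```

Both payloads have the same length regardless of scene content, which is what makes the required transmission and storage capacity predictable in this sketch.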

The display device 5 is an information processing device such as a server or a PC. The display device 5 acquires the recognition result output by the recognition device 4 from the transmission network 6. The display device 5 displays the acquired recognition result on its display unit, such as a liquid crystal display.

FIG. 3 is a diagram showing an example of the schematic configuration of the learning device 2. The learning device 2 includes a first storage unit 21, a first communication unit 22, and a first processing unit 23.

The first storage unit 21 is a device for storing programs or data and includes, for example, a semiconductor memory device. The first storage unit 21 stores the operating system program, driver programs, application programs, data, and the like used for processing by the first processing unit 23. The programs are installed in the first storage unit 21 from a computer-readable, non-transitory portable storage medium such as a CD (Compact Disc)-ROM (Read Only Memory) or a DVD (Digital Versatile Disc)-ROM, using a known setup program or the like.

The first storage unit 21 also stores the learning data 211 and the learning model 212.

The first communication unit 22 includes a communication interface circuit that enables the learning device 2 to communicate with other devices. The communication interface circuit of the first communication unit 22 is, for example, a wired LAN (Local Area Network) or wireless LAN interface circuit. The first communication unit 22 receives data transmitted from other devices and supplies it to the first processing unit 23, and transmits data supplied from the first processing unit 23 to other devices.

The first processing unit 23 includes one or more processors and their peripheral circuits. The first processing unit 23 is, for example, a CPU (Central Processing Unit), and centrally controls the operation of the learning device 2. The first processing unit 23 may instead be a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), an LSI (Large-Scaled IC), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or the like. The first processing unit 23 controls the operation of the first communication unit 22 and executes various processes so that the processes of the learning device 2 are carried out in an appropriate order based on the programs stored in the first storage unit 21. The first processing unit 23 can also execute a plurality of programs in parallel.

The first processing unit 23 includes learning model acquisition means 231, learning data acquisition means 232, edge image generation means 233, learning means 234, and output means 235. Each of these means is a functional module realized by a program executed by the first processing unit 23. Each of these means may instead be implemented in the learning device 2 as firmware.

FIG. 4 is a diagram showing an example of the schematic configuration of the imaging device 3. The imaging device 3 includes a second storage unit 31, a second communication unit 32, an imaging unit 33, and a second processing unit 34.

The second storage unit 31 is a device for storing programs or data and includes, for example, a semiconductor memory device. The second storage unit 31 stores the operating system program, driver programs, application programs, data, and the like used for processing by the second processing unit 34. The programs are installed in the second storage unit 31 from a computer-readable, non-transitory portable storage medium such as a CD-ROM or a DVD-ROM, using a known setup program or the like.

The second communication unit 32 includes a communication interface circuit that enables the imaging device 3 to communicate with other devices. The communication interface circuit of the second communication unit 32 is, for example, a wired LAN or wireless LAN interface circuit. The second communication unit 32 receives data transmitted from other devices and supplies it to the second processing unit 34, and transmits data supplied from the second processing unit 34 to other devices.

The imaging unit 33 includes an imaging optical system, an image sensor, an image processing unit, and the like. The imaging optical system is, for example, an optical lens, and focuses the light flux from the subject onto the imaging surface of the image sensor. The image sensor is, for example, a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor) sensor, and outputs an image signal of the subject image formed on the imaging surface. The image processing unit generates image data in a predetermined format from the image signal generated by the image sensor and supplies it to the second processing unit 34.

The second processing unit 34 includes one or more processors and their peripheral circuits. The second processing unit 34 is, for example, a CPU, and centrally controls the operation of the imaging device 3. The second processing unit 34 may instead be a GPU, DSP, LSI, ASIC, FPGA, or the like. The second processing unit 34 controls the operation of the second communication unit 32 and the imaging unit 33 and executes various processes so that the processes of the imaging device 3 are carried out in an appropriate order based on the programs stored in the second storage unit 31. The second processing unit 34 can also execute a plurality of programs in parallel.

The second processing unit 34 includes imaging means 341, conversion means 342, and binary image output means 343. Each of these means is a functional module realized by a program executed by the second processing unit 34. Each of these means may instead be implemented in the imaging device 3 as firmware.

FIG. 5 is a diagram showing an example of the schematic configuration of the recognition device 4. The recognition device 4 includes a third storage unit 41, a third communication unit 42, and a third processing unit 43.

The third storage unit 41 is a device for storing programs or data and includes, for example, a semiconductor memory device. The third storage unit 41 stores the operating system program, driver programs, application programs, data, and the like used for processing by the third processing unit 43. The programs are installed in the third storage unit 41 from a computer-readable, non-transitory portable storage medium such as a CD-ROM or a DVD-ROM, using a known setup program or the like.

The third communication unit 42 includes a communication interface circuit that enables the recognition device 4 to communicate with other devices. The communication interface circuit of the third communication unit 42 is, for example, a wired LAN or wireless LAN interface circuit. The third communication unit 42 receives data transmitted from other devices and supplies it to the third processing unit 43, and transmits data supplied from the third processing unit 43 to other devices.

第３処理部４３は、一又は複数個のプロセッサ及びその周辺回路を備える。第３処理部４３は、例えばＣＰＵであり、認識装置４の動作を統括的に制御する。第３処理部４３は、ＧＰＵ、ＤＳＰ、ＬＳＩ、ＡＳＩＣ、ＦＰＧＡ等でもよい。第３処理部４３は、第３記憶部４１に記憶されているプログラムに基づいて認識装置４の各種処理が適切な手順で実行されるように、第３通信部４２の動作を制御するとともに、各種の処理を実行する。また、第３処理部４３は、複数のプログラムを並列に実行することができる。 The third processing unit 43 includes one or more processors and their peripheral circuits. The third processing unit 43 is, for example, a CPU, and performs overall control of the operation of the recognition device 4. The third processing unit 43 may be a GPU, DSP, LSI, ASIC, FPGA, or the like. Based on the programs stored in the third storage unit 41, the third processing unit 43 controls the operation of the third communication unit 42 and executes various processes so that the various processes of the recognition device 4 are carried out in an appropriate procedure. In addition, the third processing unit 43 can execute a plurality of programs in parallel.

第３処理部４３は、二値画像取得手段４３１、認識手段４３２及び認識結果出力手段４３３を備える。これらの各手段は、第３処理部４３によって実行されるプログラムによって実現される機能モジュールである。これらの各手段は、ファームウェアとして認識装置４に実装されてもよい。 The third processing unit 43 includes a binary image acquisition unit 431, a recognition unit 432, and a recognition result output unit 433. Each of these means is a functional module realized by a program executed by the third processing unit 43. Each of these means may be implemented in the recognition device 4 as firmware.

（変換器及び識別器の概要） (Overview of converter and classifier)
図６は、変換器の概要について説明するための模式図である。変換器は、多値画像が入力された場合に二値画像を出力する畳み込みニューラルネットワーク（Convolutional Neural Network；ＣＮＮ）であり、入力層、隠れ層及び出力層を有する。隠れ層は、畳み込み層、プーリング層及びアンプーリング層等である。 FIG. 6 is a schematic diagram for explaining the outline of the converter. The converter is a convolutional neural network (CNN) that outputs a binary image when a multi-valued image is input, and has an input layer, hidden layers, and an output layer. The hidden layers are convolutional layers, pooling layers, unpooling layers, and the like.

変換器の入力層は、複数の多値画像Ｄ１を入力として受け付ける。多値画像Ｄ１は、例えば、ＲＧＢの３チャネルのそれぞれについて０〜２５５の階調範囲内の階調値を有する画素からなる画像である。 The input layer of the converter accepts a plurality of multi-valued images D1 as inputs. The multi-valued image D1 is, for example, an image composed of pixels having gradation values within the gradation range of 0 to 255 for each of the three RGB channels.

変換器の畳み込み層Ｐ１０１は、入力層に入力された複数の多値画像Ｄ１に対して、所定のサイズ及び係数を有する複数のフィルタによる畳み込み処理を実行し、特徴マップを生成する。生成される特徴マップは、多値画像Ｄ１と同一のサイズ及びフィルタの数と同数のチャネル数を有する（フィルタ数が２５６個なら２５６チャネル）。畳み込み層Ｐ１０１は、生成された特徴マップに対してバッチ正規化（Batch Normalization）処理を実行し、生成された特徴マップの特徴量がチャネルごとに所定の平均値及び分散値を有するように、各特徴量を補正する。畳み込み層Ｐ１０１は、バッチ正規化処理により補正された各特徴量に対して活性化関数（Activation Function）を適用する活性化処理を実行する。活性化関数は、例えば、ＲｅＬＵ（Rectified Linear Unit）関数である。活性化関数は、双曲線正接（Hyperbolic Tangent）関数でもよく、シグモイド（Sigmoid）関数でもよい。畳み込み層Ｐ１０１は、活性化関数を適用する前に、各特徴量に対して所定のバイアス値を加えてもよい。 The convolutional layer P101 of the converter executes convolution processing on the plurality of multi-valued images D1 input to the input layer, using a plurality of filters each having a predetermined size and predetermined coefficients, and generates a feature map. The generated feature map has the same size as the multi-valued image D1 and as many channels as there are filters (256 channels if there are 256 filters). The convolutional layer P101 executes batch normalization processing on the generated feature map, correcting each feature amount so that the feature amounts of the generated feature map have a predetermined mean and variance for each channel. The convolutional layer P101 then executes activation processing, which applies an activation function to each feature amount corrected by the batch normalization processing. The activation function is, for example, the ReLU (Rectified Linear Unit) function. The activation function may instead be the hyperbolic tangent function or the sigmoid function. The convolutional layer P101 may add a predetermined bias value to each feature amount before applying the activation function.
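
The batch normalization and activation steps described for the convolutional layer P101 can be illustrated for a single channel as follows. This is a minimal sketch, not the patent's implementation: the function name `batch_norm_relu` and the parameters `target_std`, `target_mean`, and `eps` are assumptions introduced here, standing in for the "predetermined mean and variance" of the text.

```python
import math


def batch_norm_relu(channel_values, target_std=1.0, target_mean=0.0, eps=1e-5):
    """Normalize one channel's feature amounts, then apply ReLU.

    target_std / target_mean play the role of the predetermined variance
    and mean; eps avoids division by zero (all names are hypothetical).
    """
    n = len(channel_values)
    mean = sum(channel_values) / n
    var = sum((v - mean) ** 2 for v in channel_values) / n
    normalized = [
        target_std * (v - mean) / math.sqrt(var + eps) + target_mean
        for v in channel_values
    ]
    # ReLU: negative feature amounts become 0, others pass through.
    return [max(0.0, v) for v in normalized]


activated = batch_norm_relu([1.0, 2.0, 3.0])
```

With the defaults, values below the channel mean are clipped to 0 by the ReLU, and values above it survive in normalized form.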

プーリング層Ｐ１０２は、畳み込み層Ｐ１０１の出力データである特徴マップに対してプーリング（Pooling）処理を実行する。プーリング処理は、特徴マップのサイズを減少させる処理であり、例えば、特徴マップ内の所定のサイズ（例えば、２×２）の領域に含まれる特徴量のうち最大の特徴量を抽出する最大値プーリング（Max Pooling）処理である。プーリング処理は、平均値プーリング（Average Pooling）処理でもよい。プーリング層Ｐ１０２は、プーリング処理により生成された特徴マップを出力する。プーリング層Ｐ１０２の出力データである特徴マップのサイズは、プーリング層Ｐ１０２の入力データである特徴マップのサイズより小さく、例えば、縦方向、横方向のそれぞれについて入力データのサイズの２分の１である。 The pooling layer P102 executes pooling processing on the feature map that is the output data of the convolutional layer P101. The pooling processing reduces the size of the feature map and is, for example, max pooling processing, which extracts the largest feature amount among the feature amounts contained in each region of a predetermined size (for example, 2 × 2) in the feature map. The pooling processing may instead be average pooling processing. The pooling layer P102 outputs the feature map generated by the pooling processing. The size of the feature map output by the pooling layer P102 is smaller than the size of the feature map input to the pooling layer P102 and is, for example, half the size of the input data in each of the vertical and horizontal directions.
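
As an illustration of the 2 × 2 max pooling described above, the following minimal sketch halves a single-channel feature map in each direction. The function name `max_pool_2x2` and the nested-list representation are assumptions made for the example, and the sketch assumes an even height and width.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2-D feature map by keeping the max of each 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    # Assumes even height and width; odd sizes would need padding or cropping.
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 3],
]
pooled = max_pool_2x2(fmap)  # 4x4 input becomes [[4, 5], [6, 7]]
```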

畳み込み層Ｐ１０３は、プーリング層Ｐ１０２の出力データに対して畳み込み処理、バッチ正規化処理及び活性化処理を実行する。プーリング層Ｐ１０４は、畳み込み層Ｐ１０３の出力データに対してプーリング処理を実行する。プーリング層Ｐ１０４の出力データのサイズは、例えば、縦方向、横方向のそれぞれについてプーリング層Ｐ１０４の入力データのサイズの２分の１である。 The convolutional layer P103 executes a convolutional processing, a batch normalization processing, and an activation processing on the output data of the pooling layer P102. The pooling layer P104 executes a pooling process on the output data of the convolution layer P103. The size of the output data of the pooling layer P104 is, for example, half the size of the input data of the pooling layer P104 in each of the vertical direction and the horizontal direction.

畳み込み層Ｐ１０５は、プーリング層Ｐ１０４の出力データに対して畳み込み処理、バッチ正規化処理及び活性化処理を実行する。アンプーリング層Ｐ１０６は、畳み込み層Ｐ１０５の出力データに対してアンプーリング（Unpooling）処理を実行する。アンプーリング処理は、特徴マップのサイズを増大させるアップサンプリング処理である。アンプーリング層Ｐ１０６の出力データのサイズは、アンプーリング層Ｐ１０６の入力データのサイズより大きく、例えば、縦方向、横方向のそれぞれについて入力データのサイズの２倍である。 The convolutional layer P105 executes convolution processing, batch normalization processing, and activation processing on the output data of the pooling layer P104. The unpooling layer P106 executes unpooling processing on the output data of the convolutional layer P105. The unpooling processing is upsampling processing that increases the size of the feature map. The size of the output data of the unpooling layer P106 is larger than the size of its input data and is, for example, twice the size of the input data in each of the vertical and horizontal directions.
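
The text specifies only that unpooling doubles the feature map size and does not say how the new values are filled in. The sketch below assumes nearest-neighbor repetition, which is one common choice; the function name `unpool_2x` is hypothetical.

```python
def unpool_2x(feature_map):
    """Upsample a 2-D feature map by repeating each value in a 2x2 block.

    Nearest-neighbor repetition is an assumption; the source only states
    that the output is twice the input size in each direction.
    """
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(2)]  # double the width
        out.append(wide)
        out.append(list(wide))                     # double the height
    return out


upsampled = unpool_2x([[4, 5], [6, 7]])
```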

畳み込み層Ｐ１０７は、アンプーリング層Ｐ１０６の出力データに対して畳み込み処理、バッチ正規化処理及び活性化処理を実行する。加算層Ｐ１０８は、畳み込み層Ｐ１０７の出力データと畳み込み層Ｐ１０３の出力データとを加算する。加算層Ｐ１０８を設けることにより、後述する誤差逆伝播法の適用時において算出される勾配の絶対値が大きくなり、学習速度が向上される。アンプーリング層Ｐ１０９は、加算層Ｐ１０８の出力データに対してアンプーリング処理を実行する。アンプーリング層Ｐ１０９の出力データのサイズは、例えば、縦方向、横方向のそれぞれについてアンプーリング層Ｐ１０９の入力データのサイズの２倍である。 The convolutional layer P107 executes convolution processing, batch normalization processing, and activation processing on the output data of the unpooling layer P106. The addition layer P108 adds the output data of the convolutional layer P107 and the output data of the convolutional layer P103. Providing the addition layer P108 increases the absolute value of the gradient calculated when the error backpropagation method described later is applied, which improves the learning speed. The unpooling layer P109 executes unpooling processing on the output data of the addition layer P108. The size of the output data of the unpooling layer P109 is, for example, twice the size of the input data of the unpooling layer P109 in each of the vertical and horizontal directions.

畳み込み層Ｐ１１０は、アンプーリング層Ｐ１０９の出力データに対して畳み込み処理、バッチ正規化処理及び活性化処理を実行する。加算層Ｐ１１１は、畳み込み層Ｐ１１０の出力データと畳み込み層Ｐ１０１の出力データとを加算する。 The convolutional layer P110 executes convolution processing, batch normalization processing, and activation processing on the output data of the unpooling layer P109. The addition layer P111 adds the output data of the convolutional layer P110 and the output data of the convolutional layer P101.

変換層Ｐ１１２は、加算層Ｐ１１１の出力データに対してチャネル変換処理を実行する。変換層Ｐ１１２は、各画素についての複数チャネルの特徴量に基づいて、１チャネルの特徴マップを生成して出力する。例えば、加算層Ｐ１１１の出力データがＮチャネルの特徴マップであるとすると、変換層Ｐ１１２は、加算層Ｐ１１１の出力データをＮチャネルのフィルタ１個だけで畳み込んで１チャネルの特徴マップを生成する。これにより、変換層Ｐ１１２は、特徴マップのデータ容量を削減する。 The conversion layer P112 executes channel conversion processing on the output data of the addition layer P111. The conversion layer P112 generates and outputs a one-channel feature map based on the feature amounts of the plurality of channels for each pixel. For example, if the output data of the addition layer P111 is an N-channel feature map, the conversion layer P112 convolves the output data of the addition layer P111 with a single N-channel filter to generate a one-channel feature map. In this way, the conversion layer P112 reduces the data capacity of the feature map.
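
The channel conversion can be pictured as a weighted sum over the N channels at each pixel. The sketch below assumes the single N-channel filter has a spatial size of 1 × 1, which the text does not state; the function name `channels_to_one` and the example weights are likewise illustrative only.

```python
def channels_to_one(feature_map, weights):
    """Collapse an N-channel map (H x W x N nested lists) into one channel.

    `weights` holds one coefficient per input channel, i.e. a single
    filter spanning all N channels (its 1x1 spatial size is an assumption).
    """
    return [
        [sum(w * v for w, v in zip(weights, pixel)) for pixel in row]
        for row in feature_map
    ]


fmap = [
    [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
    [[2.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
]
single = channels_to_one(fmap, weights=[0.5, -1.0, 2.0])  # 2x2x3 -> 2x2
```

The output keeps the spatial size of the input but stores one value per pixel instead of N, which is where the reduction in data capacity comes from.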

活性層Ｐ１１３は、変換層Ｐ１１２の出力データに対して活性化関数を適用する活性化処理を実行する。活性化関数は、例えば、シグモイド関数である。活性化層Ｐ１１３は、活性化関数を適用する前に、各特徴量に対して所定のバイアス値を加えてもよい。 The activation layer P113 executes activation processing that applies an activation function to the output data of the conversion layer P112. The activation function is, for example, the sigmoid function. The activation layer P113 may add a predetermined bias value to each feature amount before applying the activation function.

閾値処理層Ｐ１１４は、活性層Ｐ１１３の出力データに対して所定の閾値を有する階段関数を適用する閾値処理を実行する。階段関数は、活性層Ｐ１１３の出力である特徴マップに含まれる特徴量が閾値以上であればその特徴量を１に変換し、閾値未満であればその特徴量を０に変換する関数である。これにより、閾値処理層Ｐ１１４は、各画素に対応する特徴量が０又は１である特徴マップを出力する。 The threshold processing layer P114 executes threshold processing that applies a step function having a predetermined threshold value to the output data of the activation layer P113. The step function converts each feature amount in the feature map output by the activation layer P113 to 1 if it is equal to or greater than the threshold value, and to 0 if it is less than the threshold value. As a result, the threshold processing layer P114 outputs a feature map in which the feature amount corresponding to each pixel is 0 or 1.

変換器の出力層は、閾値処理層Ｐ１１４の出力である特徴マップの特徴量を各画素の階調値とする二値画像Ｄ２を出力する。二値画像Ｄ２は、多値画像Ｄ１と同一のサイズを有し、各画素の階調値が０又は１である画像である。このようにして、変換器は、多値画像Ｄ１が入力された場合に二値画像Ｄ２を出力する。 The output layer of the converter outputs a binary image D2 in which the feature amount of the feature map, which is the output of the threshold processing layer P114, is the gradation value of each pixel. The binary image D2 is an image having the same size as the multivalued image D1 and having a gradation value of 0 or 1 for each pixel. In this way, the converter outputs the binary image D2 when the multi-valued image D1 is input.
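
The last two stages of the converter, the sigmoid activation and the step function, can be sketched as follows. The function names and the threshold of 0.5 are assumptions for the example; the text says only that the threshold is predetermined.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binarize(feature_map, threshold=0.5):
    """Sigmoid activation followed by a step function at `threshold`.

    Values at or above the threshold become 1, all others become 0.
    """
    return [
        [1 if sigmoid(v) >= threshold else 0 for v in row]
        for row in feature_map
    ]


binary = binarize([[-2.0, 0.0], [0.3, -0.1]])  # -> [[0, 1], [1, 0]]
```

Note that an input of exactly 0.0 maps to a sigmoid output of 0.5, which counts as "equal to or greater than the threshold" and therefore becomes 1.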

なお、閾値処理層Ｐ１１４は、学習時、階段関数を適用する前に、活性層Ｐ１１３の出力である特徴マップにノイズを重畳してもよい（認識時は重畳しない）。例えば、閾値処理層Ｐ１１４は、特徴マップの各特徴量に、所定の分散値を有する、正規分布等の分布に基づいて生成された乱数を加算する。これにより、変換器は、活性層Ｐ１１３の出力の全ての特徴量が閾値未満、又は全ての特徴量が閾値以上である場合でも、二値画像Ｄ２の全ての画素の階調値が０又は１の何れかのみとなる確率を低減させる。二値画像Ｄ２の全ての画素の階調値が０又は１となってしまった場合、後述する認識器にその二値画像Ｄ２が入力されたとしても学習が行えなくなるため、学習速度が低下する。変換器は、そのような二値画像Ｄ２を出力する可能性を低減させることにより、学習速度を向上させることができる。 Note that, during learning, the threshold processing layer P114 may superimpose noise on the feature map output by the activation layer P113 before applying the step function (no noise is superimposed during recognition). For example, the threshold processing layer P114 adds, to each feature amount of the feature map, a random number generated from a distribution, such as a normal distribution, having a predetermined variance. As a result, even when all the feature amounts output by the activation layer P113 are less than the threshold value, or all are equal to or greater than the threshold value, the converter reduces the probability that the gradation values of all pixels of the binary image D2 are uniformly 0 or uniformly 1. If all pixels of the binary image D2 have the gradation value 0, or all have the gradation value 1, learning cannot be performed even when that binary image D2 is input to the recognizer described later, so the learning speed decreases. By reducing the possibility of outputting such a binary image D2, the converter can improve the learning speed.
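
The learning-time noise can be sketched as below. This is only an illustration: the function name `threshold_with_noise`, the default `sigma`, and the use of Python's `random.Random.gauss` are assumptions, and the threshold of 0.5 is a placeholder for the predetermined threshold.

```python
import random


def threshold_with_noise(features, threshold=0.5, sigma=0.1,
                         training=True, rng=None):
    """Apply the step function, adding Gaussian noise only when training."""
    rng = rng or random.Random()
    out = []
    for row in features:
        out_row = []
        for v in row:
            if training:
                # Zero-mean normal noise with standard deviation sigma.
                v = v + rng.gauss(0.0, sigma)
            out_row.append(1 if v >= threshold else 0)
        out.append(out_row)
    return out


# Recognition: deterministic thresholding, no noise.
clean = threshold_with_noise([[0.2, 0.8]], training=False)
# Learning: values near the threshold may flip from run to run.
noisy = threshold_with_noise([[0.2, 0.8], [0.5, 0.5]], rng=random.Random(0))
```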

また、この場合において、閾値処理層Ｐ１１４は、特徴マップの特徴量に応じた大きさのノイズを重畳してもよい。例えば、閾値処理層Ｐ１１４は、各特徴量について、各特徴量に乱数を加算した場合に閾値との関係が変化する確率が所定確率（例えば、１０００分の１）となる乱数の分布を決定する。閾値との関係が変化するとは、閾値未満である特徴量に乱数を加算した場合に閾値以上となること、又は、閾値以上である特徴量に乱数を加算した場合に閾値未満となることである。閾値処理層Ｐ１１４は、各特徴量について決定された分布に基づいて乱数をそれぞれ生成し、生成された乱数を各特徴量に加算する。これにより、変換器は、二値画像Ｄ２の全ての画素の階調値が０又は１となる確率を低下させつつ、ノイズによって多値画像Ｄ１との相関がない二値画像Ｄ２が出力される確率を低減させることができる。 Further, in this case, the threshold processing layer P114 may superimpose noise whose magnitude depends on the feature amounts of the feature map. For example, the threshold processing layer P114 determines, for each feature amount, a random number distribution such that the probability that adding a random number to the feature amount changes its relationship with the threshold value equals a predetermined probability (for example, 1/1000). The relationship with the threshold value changes when a feature amount that is less than the threshold value becomes equal to or greater than the threshold value after the random number is added, or when a feature amount that is equal to or greater than the threshold value becomes less than the threshold value after the random number is added. The threshold processing layer P114 generates a random number for each feature amount from the distribution determined for it, and adds the generated random number to that feature amount. As a result, the converter can lower the probability that all pixels of the binary image D2 have the gradation value 0 or all have the gradation value 1, while also reducing the probability that the noise causes a binary image D2 uncorrelated with the multi-valued image D1 to be output.
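
One way to pick such a per-feature distribution, assuming zero-mean normal noise, is to solve for the standard deviation directly. For a feature amount at distance d from the threshold, noise with standard deviation σ crosses the threshold with probability 1 − Φ(d/σ), so setting this equal to the target probability p gives σ = d / Φ⁻¹(1 − p). The function name `noise_sigma` and the normal-noise assumption are illustrative and not taken from the source.

```python
from statistics import NormalDist


def noise_sigma(feature, threshold=0.5, flip_prob=0.001):
    """Standard deviation of zero-mean normal noise that moves `feature`
    across `threshold` with probability `flip_prob`."""
    if not 0.0 < flip_prob < 0.5:
        raise ValueError("flip_prob must be between 0 and 0.5")
    z = NormalDist().inv_cdf(1.0 - flip_prob)  # quantile of the standard normal
    return abs(feature - threshold) / z


sigma_far = noise_sigma(0.9)   # far from the threshold: larger noise allowed
sigma_near = noise_sigma(0.6)  # near the threshold: smaller noise
# Check: probability that the noise pushes 0.9 below the threshold 0.5.
flip = NormalDist(0.0, sigma_far).cdf(0.5 - 0.9)
```

A feature amount exactly at the threshold yields a standard deviation of 0 under this formula, so a real implementation would need to handle that case separately.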

また、閾値処理層Ｐ１１４は、各特徴量の平均値、中央値等の統計値に基づいて一つの分布を決定し、決定された一つの分布に基づいて生成された乱数を各特徴量に加算してもよい。これにより、変換器は、少ない計算負荷でノイズを重畳することができる。 Alternatively, the threshold processing layer P114 may determine a single distribution based on a statistic of the feature amounts, such as their mean or median, and add random numbers generated from that single distribution to each feature amount. This allows the converter to superimpose noise with a small computational load.

なお、変換器において、加算層Ｐ１０８及びＰ１１１は設けられなくてもよい。 In the converter, the addition layers P108 and P111 may not be provided.

図７は、認識器の概要について説明するための模式図である。認識器は、二値画像が入力された場合に対象の領域及び対象の種別を出力するＣＮＮであり、例えば、ＳＳＤ（Single Shot Multibox Detector）である。対象の領域は、入力された二値画像において対象の像に外接する矩形領域を示す情報である。対象の種別は、矩形領域に含まれる対象が、あらかじめ設定された複数の対象の種別の何れに該当するかを示す情報である。対象の種別は、例えば、「人」、「車両」又は「椅子」等である。対象の種別は、「人の上半身」等でもよい。なお、認識すべき対象の種別が一種類（例えば、「人」のみ）である場合、認識器は、対象の種別を出力しなくてもよい。 FIG. 7 is a schematic diagram for explaining the outline of the recognizer. The recognizer is a CNN that outputs a target area and a target type when a binary image is input, and is, for example, an SSD (Single Shot Multibox Detector). The target area is information indicating a rectangular area circumscribing the target image in the input binary image. The target type is information indicating which of the plurality of preset target types the target included in the rectangular area corresponds to. The type of object is, for example, "person", "vehicle", "chair", or the like. The type of target may be "human upper body" or the like. When there is only one type of target to be recognized (for example, only "person"), the recognizer does not have to output the type of target.

認識器の入力層は、二値画像Ｄ３を入力として受け付ける。二値画像Ｄ３は、変換器から出力された二値画像Ｄ２である。 The input layer of the recognizer accepts the binary image D3 as an input. The binary image D3 is a binary image D2 output from the converter.

ベースネットワーク（Base Network）Ｐ２０１は、複数の畳み込み層及び全結合層を有するＣＮＮである。ベースネットワークＰ２０１は、画像分類のために用いられる任意のＣＮＮであってよく、例えば、ＶＧＧ−１６等である。ベースネットワークＰ２０１は、二値画像Ｄ３を入力された場合に、特徴マップを出力する。 Base Network P201 is a CNN having a plurality of convolutional layers and fully connected layers. The base network P201 may be any CNN used for image classification, such as VGG-16. The base network P201 outputs a feature map when the binary image D3 is input.

特徴層Ｐ２０２は、ベースネットワークＰ２０１の出力データを入力として受け付ける。特徴層Ｐ２０２は、入力された特徴マップに畳み込み処理を実行し、入力データよりも小さいサイズの特徴マップを出力する。また、特徴層Ｐ２０２は、出力される特徴マップの各画素の特徴量から推定される矩形領域を示す領域情報を出力するとともに、複数の対象の種別のそれぞれについて、その矩形領域に各種別の対象が含まれる可能性を示す信頼度情報を出力する。領域情報は、例えば、矩形領域の中心座標並びに矩形領域の幅及び高さの情報である。信頼度情報は、例えば、対象の各種別に対応する、０以上１以下の値で示される複数の変数からなるベクトルであり、各変数は、その値が１に近いほど対応する種別の対象が含まれる可能性が高いことを示す。 The feature layer P202 receives the output data of the base network P201 as input. The feature layer P202 executes convolution processing on the input feature map and outputs a feature map smaller in size than the input data. The feature layer P202 also outputs area information indicating a rectangular area estimated from the feature amount of each pixel of the output feature map, and, for each of the plurality of target types, outputs reliability information indicating the likelihood that a target of that type is contained in the rectangular area. The area information is, for example, the center coordinates of the rectangular area and its width and height. The reliability information is, for example, a vector of variables, one for each target type, each taking a value from 0 to 1; the closer a variable is to 1, the more likely it is that a target of the corresponding type is contained.

特徴層Ｐ２０３は、特徴層Ｐ２０２の出力データである特徴マップを入力として受け付ける。特徴層Ｐ２０３は、特徴層Ｐ２０２と同様に、畳み込み処理を実行し、入力データよりも小さいサイズの特徴マップ、並びに、その特徴マップについての領域情報及び信頼度情報を出力する。 The feature layer P203 accepts a feature map, which is output data of the feature layer P202, as an input. Similar to the feature layer P202, the feature layer P203 executes a convolution process and outputs a feature map having a size smaller than the input data, and area information and reliability information about the feature map.

特徴層Ｐ２０３の次に、さらに任意の数の特徴層が設けられてもよい。 An arbitrary number of feature layers may be further provided after the feature layer P203.

後処理部Ｐ２０４は、各特徴層から出力された領域情報と信頼度情報とを入力として受け付ける。後処理部Ｐ２０４は、入力された信頼度情報に基づいて、各領域情報に示される矩形領域に何れかの種別の対象が含まれるか否か、及び、含まれる場合には何れの種別の対象が含まれるかを判定する。判定は、例えば、信頼度情報に含まれる各変数の値が所定値以上であるか否か、及び、所定値以上である変数が複数である場合には、何れの変数の値が最も大きいかに基づいて行われる。後処理部Ｐ２０４は、同一の種別の対象が含まれると判定され、且つ、領域が所定比率以上重複している複数の矩形領域を統合する。矩形領域の統合には、例えば、Non-Maximum Suppression等の方法が用いられる。これにより、一の対象に対して一の矩形領域が生成される。後処理部Ｐ２０４は、出力層を介して、生成された矩形領域の領域情報を対象の領域Ｄ４として出力するとともに、その矩形領域に対応する信頼度情報を対象の種別Ｄ５として出力する。 The post-processing unit P204 receives as input the area information and the reliability information output from each feature layer. Based on the input reliability information, the post-processing unit P204 determines whether the rectangular area indicated by each piece of area information contains a target of any type and, if so, which type of target it contains. The determination is made, for example, based on whether the value of each variable in the reliability information is equal to or greater than a predetermined value and, when more than one variable is equal to or greater than the predetermined value, which variable has the largest value. The post-processing unit P204 merges rectangular areas that are determined to contain targets of the same type and that overlap by a predetermined ratio or more. A method such as Non-Maximum Suppression is used to merge the rectangular areas. As a result, one rectangular area is generated for each target. The post-processing unit P204 outputs, via the output layer, the area information of each generated rectangular area as the target area D4, and outputs the reliability information corresponding to that rectangular area as the target type D5.
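
The post-processing steps above (thresholding the reliability values, picking the most likely type, and merging overlapping rectangles) can be sketched as a simple greedy Non-Maximum Suppression. The names `iou` and `postprocess`, the thresholds `min_score` and `overlap`, and the use of intersection-over-union as the overlap ratio are assumptions for illustration; the source does not specify them.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def postprocess(candidates, min_score=0.5, overlap=0.5):
    """candidates: list of (box, reliability_vector). Returns kept detections
    as (score, type_index, box), highest score first."""
    scored = []
    for box, conf in candidates:
        best = max(range(len(conf)), key=lambda i: conf[i])
        if conf[best] >= min_score:  # drop boxes with no confident type
            scored.append((conf[best], best, box))
    scored.sort(key=lambda d: d[0], reverse=True)
    kept = []
    for score, cls, box in scored:
        # Keep a box only if no higher-scoring box of the same type overlaps it.
        if all(k_cls != cls or iou(box, k_box) < overlap
               for _, k_cls, k_box in kept):
            kept.append((score, cls, box))
    return kept


detections = postprocess([
    ((50, 50, 40, 80), [0.9, 0.1]),
    ((52, 50, 40, 80), [0.6, 0.2]),    # overlaps the first box, same type
    ((200, 50, 40, 80), [0.1, 0.7]),
    ((120, 120, 30, 30), [0.2, 0.3]),  # below min_score for every type
])
```

In this example the second rectangle is suppressed by the first and the fourth is discarded for low reliability, leaving one rectangle per target.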

（各種データのデータ構造） (Data structure of various data)
図８は、学習装置２の第１記憶部２１に記憶される学習用データ２１１のデータ構造の一例を示す図である。学習用データ２１１は、データＩＤと、学習用多値画像と、学習用認識結果とが関連付けられたデータである。なお、学習用多値画像は、学習用第１画像の一例である。 FIG. 8 is a diagram showing an example of the data structure of the learning data 211 stored in the first storage unit 21 of the learning device 2. The learning data 211 is data in which the data ID, the learning multi-valued image, and the learning recognition result are associated with each other. The learning multi-valued image is an example of the learning first image.

データＩＤは、学習用多値画像と学習用認識結果との組み合わせを識別するための識別情報である。学習用多値画像には、画像を構成する各画素の階調値の情報が含まれる。図８に示す例では、各画素について、ＲＧＢの３チャネルのそれぞれについて０〜２５５の階調値が記憶されている。学習用認識結果は、学習用多値画像に対して出力されるべきものとして予め設定された認識結果であり、対象の領域と対象の種別とを含む。対象の領域は、学習用多値画像において対象の像に外接する矩形領域を示す情報であり、例えば、矩形領域の中心座標並びに矩形領域の幅及び高さの情報である。対象の種別の情報は、対象の領域によって示される矩形領域に含まれる対象が、あらかじめ設定された複数の対象の種別の何れに該当するかを示す情報である。対象の種別は、例えば、該当する種別に対応する変数の値が１で、他の種別に対応する変数の値が０である、所謂one-hotベクトルである。なお、認識すべき対象の種別が一種類である場合、学習用認識結果は、対象の種別を含まなくてもよい。また、学習用多値画像に複数の対象が含まれる場合、各対象に対応する複数の対象の領域及び対象の種別の情報が含まれてもよい。 The data ID is identification information for identifying a combination of the learning multi-valued image and the learning recognition result. The learning multi-valued image includes information on the gradation value of each pixel constituting the image. In the example shown in FIG. 8, gradation values of 0 to 255 are stored for each of the three RGB channels for each pixel. The learning recognition result is a recognition result preset as what should be output for the learning multi-valued image, and includes a target area and a target type. The target area is information indicating a rectangular area circumscribing the target image in the learning multi-valued image, and is, for example, information on the center coordinates of the rectangular area and the width and height of the rectangular area. The target type information is information indicating which of the plurality of preset target types the target included in the rectangular area indicated by the target area corresponds to. The target type is, for example, a so-called one-hot vector in which the value of the variable corresponding to the applicable type is 1 and the values of the variables corresponding to the other types are 0. When there is only one type of target to be recognized, the learning recognition result does not have to include the target type. Further, when a plurality of targets are included in the learning multi-valued image, information on a plurality of target areas and target types corresponding to each target may be included.
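
One record of the learning data 211 could be modeled as follows. This is a hypothetical representation for illustration only: the class names `Target` and `TrainingRecord`, the field names, and the `CLASSES` list (taken from the example types "person", "vehicle", "chair" mentioned earlier) are not defined in the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

CLASSES = ["person", "vehicle", "chair"]  # example target types


def one_hot(label):
    """1 for the applicable type, 0 for every other type."""
    return [1 if name == label else 0 for name in CLASSES]


@dataclass
class Target:
    center: Tuple[float, float]  # center coordinates of the rectangle
    width: float
    height: float
    category: List[int]          # one-hot vector over CLASSES


@dataclass
class TrainingRecord:
    data_id: str
    image: List[List[Tuple[int, int, int]]]  # rows of (R, G, B), each 0-255
    targets: List[Target] = field(default_factory=list)  # one per target


record = TrainingRecord(
    data_id="0001",
    image=[[(0, 0, 0), (255, 128, 64)]],
    targets=[Target(center=(0.5, 0.0), width=1.0, height=1.0,
                    category=one_hot("vehicle"))],
)
```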

学習用データ２１１は、あらかじめ学習装置２の管理者によって設定され、第１記憶部２１に記憶される。 The learning data 211 is set in advance by the administrator of the learning device 2 and stored in the first storage unit 21.

（処理の流れ） (Processing flow)
図９は、学習装置２によって実行される学習処理の流れの一例を示すフロー図である。学習処理は、第１記憶部２１に記憶されたプログラムに従って、第１処理部２３が学習装置２の各構成要素と協働することにより実現される。 FIG. 9 is a flow chart showing an example of the flow of the learning process executed by the learning device 2. The learning process is realized by the first processing unit 23 cooperating with each component of the learning device 2 according to the program stored in the first storage unit 21.

まず、学習用モデル取得手段２３１は、第１記憶部２１から学習用モデルを取得する（Ｓ１０１）。学習用モデルは、変換器の出力が認識器の入力となるように結合されたＣＮＮである。学習用モデル取得手段２３１は、取得された学習用モデルに含まれるフィルタの係数等のパラメータを、乱数等により初期化してもよい。 First, the learning model acquisition means 231 acquires the learning model from the first storage unit 21 (S101). The learning model is a CNN in which the converter and the recognizer are coupled so that the output of the converter becomes the input of the recognizer. The learning model acquisition means 231 may initialize parameters such as the filter coefficients included in the acquired learning model with random numbers or the like.

続いて、学習用データ取得手段２３２は、第１記憶部２１から学習用データ２１１を取得する（Ｓ１０２）。 Subsequently, the learning data acquisition means 232 acquires the learning data 211 from the first storage unit 21 (S102).

続いて、エッジ画像生成手段２３３は、学習用データ２１１に含まれる学習用多値画像からエッジ画像を生成する（Ｓ１０３）。エッジ画像は、エッジ画素の階調値と他の画素の階調値とが互いに異なる二値画像である。エッジ画像生成手段２３３は、学習用多値画像に対してＣａｎｎｙのエッジ検出方法を適用し、学習用多値画像からエッジ画素を検出する。エッジ画像生成手段２３３は、学習用多値画像において、検出されたエッジ画素の階調値を１に、他の画素の階調値を０に設定した画像をエッジ画像として生成する。 Subsequently, the edge image generation means 233 generates an edge image from the learning multi-valued image included in the learning data 211 (S103). An edge image is a binary image in which the gradation value of an edge pixel and the gradation value of another pixel are different from each other. The edge image generation means 233 applies Canny's edge detection method to the learning multi-valued image, and detects edge pixels from the learning multi-valued image. The edge image generation means 233 generates an image in which the gradation value of the detected edge pixel is set to 1 and the gradation value of the other pixel is set to 0 in the learning multi-valued image as the edge image.

なお、エッジ画像生成手段２３３は、ソーベルフィルタ等の公知のエッジ検出フィルタを用いてエッジ画像を生成してもよい。 The edge image generation means 233 may generate an edge image by using a known edge detection filter such as a Sobel filter.
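
A Sobel-based edge image of the kind mentioned above can be sketched as follows. The sketch assumes a single-channel (grayscale) input and a gradient-magnitude threshold of 100; both are assumptions, as is the function name `sobel_edge_image`. It is not an implementation of the Canny method, which adds smoothing, non-maximum suppression, and hysteresis thresholding.

```python
def sobel_edge_image(gray, threshold=100):
    """Return a 0/1 edge image from a 2-D grayscale image (list of lists).

    Border pixels are left at 0 because the 3x3 filter does not fit there.
    """
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # Horizontal and vertical Sobel responses.
            gx = (gray[r - 1][c + 1] + 2 * gray[r][c + 1] + gray[r + 1][c + 1]
                  - gray[r - 1][c - 1] - 2 * gray[r][c - 1] - gray[r + 1][c - 1])
            gy = (gray[r + 1][c - 1] + 2 * gray[r + 1][c] + gray[r + 1][c + 1]
                  - gray[r - 1][c - 1] - 2 * gray[r - 1][c] - gray[r - 1][c + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[r][c] = 1  # edge pixel
    return edges


# Dark left half, bright right half: the edge runs down the middle.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edge_image = sobel_edge_image(image)
```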

続いて、学習手段２３４は、学習用モデルに学習用多値画像を入力することにより、認識結果を生成する（Ｓ１０４）。認識結果は、学習用モデルから出力された対象物の領域及び対象物の種別である。認識結果は、学習用モデルのうちの変換器から出力された二値画像を含んでもよい。 Subsequently, the learning means 234 generates a recognition result by inputting a learning multi-valued image into the learning model (S104). The recognition result is the area of the object and the type of the object output from the learning model. The recognition result may include a binary image output from the converter of the training model.

なお、学習手段２３４は、学習用モデルに、学習用多値画像にノイズを付加した画像を入力してもよい。これにより、学習装置２は、入力される多値画像にノイズが含まれていても適切に認識結果が出力されるように学習用モデルを学習させることができる。ただしこの場合、エッジ画像生成手段２３３は、ノイズを付加する前の学習用多値画像からエッジ画像を生成するのが良い。 The learning means 234 may input an image in which noise is added to the learning multi-valued image into the learning model. As a result, the learning device 2 can train the learning model so that the recognition result is appropriately output even if the input multi-valued image contains noise. However, in this case, the edge image generation means 233 preferably generates an edge image from the learning multi-valued image before adding noise.

続いて、学習手段２３４は、生成された認識結果と学習用認識結果とに基づいて、誤差を算出する（Ｓ１０５）。誤差は、変換器の学習に用いられる、生成された認識結果と学習用認識結果との間の差の程度を示す指標であり、対象の領域に関する誤差と、対象の種別に関する誤差との重み付け和である誤差関数により算出される。対象の領域に関する誤差は、例えば、生成された認識結果の矩形領域と、学習用認識結果の矩形領域との間の中心座標、幅及び高さの二乗誤差又は対数二乗誤差等である。対象の種別に関する誤差は、例えば、生成された認識結果の対象の種別と、学習用認識結果の対象の種別との間の交差エントロピー誤差である。 Subsequently, the learning means 234 calculates an error based on the generated recognition result and the learning recognition result (S105). The error is an index, used for learning of the converter, that indicates the degree of difference between the generated recognition result and the learning recognition result. It is calculated by an error function that is a weighted sum of an error related to the target area and an error related to the target type. The error related to the target area is, for example, the squared error or the logarithmic squared error of the center coordinates, width, and height between the rectangular area of the generated recognition result and the rectangular area of the learning recognition result. The error related to the target type is, for example, the cross-entropy error between the target type of the generated recognition result and the target type of the learning recognition result.
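
For a single predicted rectangle already paired with its ground-truth rectangle, the weighted-sum error function can be sketched as below. The function names, the weights `w_region` and `w_category`, and the small constant `eps` are assumptions; how predictions are matched to ground-truth rectangles is not covered here.

```python
import math


def region_error(pred_box, true_box):
    """Squared error over (center x, center y, width, height)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))


def category_error(pred_conf, true_one_hot, eps=1e-12):
    """Cross-entropy between predicted reliabilities and a one-hot label."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_conf, true_one_hot))


def recognition_error(pred_box, pred_conf, true_box, true_one_hot,
                      w_region=1.0, w_category=1.0):
    """Weighted sum of the region error and the category error."""
    return (w_region * region_error(pred_box, true_box)
            + w_category * category_error(pred_conf, true_one_hot))


truth_box, truth_type = (50.0, 50.0, 40.0, 80.0), [1, 0]
perfect = recognition_error(truth_box, [1.0, 0.0], truth_box, truth_type)
worse = recognition_error((55.0, 50.0, 40.0, 80.0), [0.6, 0.4],
                          truth_box, truth_type)
```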

誤差関数には、さらに二値画像に関する誤差が含まれてもよい。二値画像に関する誤差は、認識結果に含まれる、学習用モデルの変換器から出力された二値画像と、学習用多値画像から生成されたエッジ画像との二乗誤差である。これにより、学習装置２は、変換器によって出力される二値画像をエッジ画像に近づけ、且つ、学習用モデルによって出力される認識結果を学習用認識結果に近づけるように学習用モデルを学習させる。学習装置２は、変換器によって出力される二値画像をエッジ画像に近づけることにより、画像認識システム１のユーザが二値画像における対象の像を視認しやすくする。 The error function may further include an error related to the binary image. The error related to the binary image is the squared error between the binary image that is included in the recognition result, having been output from the converter of the learning model, and the edge image generated from the learning multi-valued image. With this, the learning device 2 trains the learning model so that the binary image output by the converter approaches the edge image and the recognition result output by the learning model approaches the learning recognition result. By bringing the binary image output by the converter closer to the edge image, the learning device 2 makes it easier for the user of the image recognition system 1 to visually recognize the image of the target in the binary image.

二値画像に関する誤差は、二値画像と、エッジ画像をぼかした画像との二乗誤差でもよい。エッジ画像をぼかした画像は、エッジ画像に所定のフィルタ（例えば、ガウシアンフィルタ）を適用した画像である。また、二値画像に関する誤差は、二値画像のヒストグラムと、エッジ画像のヒストグラムとの二乗誤差でもよい。ヒストグラムは、例えば、各画像を所定のサイズの領域に区分した場合に、各領域に含まれる階調値が０である画素（又は、１である画素）の数を階級とし、各階級に対応する領域の数を度数とする度数分布である。ヒストグラムは、各画像における階調値の勾配の頻度を示すＨＯＧ（Histogram of Oriented Gradients）でもよい。 The error related to the binary image may be the squared error between the binary image and a blurred version of the edge image. The blurred edge image is obtained by applying a predetermined filter (for example, a Gaussian filter) to the edge image. The error related to the binary image may also be the squared error between the histogram of the binary image and the histogram of the edge image. The histogram is, for example, a frequency distribution obtained by dividing each image into regions of a predetermined size, in which each class is the number of pixels in a region whose gradation value is 0 (or whose gradation value is 1) and the frequency is the number of regions falling into that class. The histogram may instead be a HOG (Histogram of Oriented Gradients), which indicates the frequency of gradients of gradation values in each image.

二値画像とエッジ画像との間にエッジの位置や形状の微差があったとしても、そのような微差はユーザが二値画像における対象の像を視認する際には問題となりにくい。学習装置２は、エッジ画像をぼかした画像を用いることで、このようなエッジの位置や形状の微差を誤差関数に反映されにくくし、変換器の学習を容易にする。 Even if there is a slight difference in the position or shape of the edge between the binary image and the edge image, such a slight difference is unlikely to be a problem when the user visually recognizes the target image in the binary image. By using an image in which the edge image is blurred, the learning device 2 makes it difficult for such a fine difference in the position and shape of the edge to be reflected in the error function, and facilitates learning of the converter.
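
The effect of blurring can be seen in a small example comparing a binary image that matches the edge image exactly with one whose edge is shifted by one pixel. The 3 × 3 kernel below is a common integer approximation of a Gaussian filter and, like the zero padding and the function names, is an assumption made for this sketch.

```python
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # 3x3 Gaussian approximation (sum 16)


def blur(image):
    """Apply the 3x3 kernel with zero padding outside the image."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += KERNEL[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = acc / 16.0
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


edge = [[0, 1, 0] for _ in range(3)]      # vertical edge in the middle column
shifted = [[0, 0, 1] for _ in range(3)]   # same edge, one pixel to the right

# Extra penalty the shifted image receives compared with an exact match.
gap_raw = squared_error(shifted, edge) - squared_error(edge, edge)
gap_blurred = (squared_error(shifted, blur(edge))
               - squared_error(edge, blur(edge)))
```

With the blurred target, the one-pixel shift is penalized far less than with the raw edge image, at the cost that even an exact match no longer gives an error of zero.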

また、畳み込み層と、畳み込み層の出力に基づく入力に対して活性化関数を適用する活性化層とが含まれる変換器の学習に用いられる誤差関数には、畳み込み層において適用されるフィルタの係数のノルムが含まれてもよい。フィルタの係数のノルムは、例えば、係数の二乗和（Ｌ２ノルム）又はフィルタのスペクトルノルムである。 Further, the error function used for learning of a converter that includes a convolutional layer and an activation layer, which applies an activation function to an input based on the output of the convolutional layer, may include the norm of the coefficients of the filters applied in the convolutional layer. The norm of the filter coefficients is, for example, the sum of squares of the coefficients (L2 norm) or the spectral norm of the filter.

フィルタの係数のＬ２ノルムが大きい場合、変換器の畳み込み層において適用されるフィルタの係数の絶対値が大きいため、変換器の活性化層Ｐ１１３に入力される特徴マップの特徴量の絶対値も大きくなりやすい。この場合、活性化層Ｐ１１３により適用される活性化関数がシグモイド関数であれば、活性化層Ｐ１１３の出力の特徴量の多くは０に近い値又は１に近い値を有し、中間である０．５に近い値を有しない。このような特徴量を有する特徴マップが閾値処理層Ｐ１１４に入力された場合、閾値処理層Ｐ１１４から出力される画像の全ての画素の階調値が０又は１となる可能性が高くなり、認識器の学習が行われず、学習速度が低下する。 When the L2 norm of the filter coefficients is large, the absolute values of the coefficients of the filters applied in the convolutional layers of the converter are large, so the absolute values of the feature amounts of the feature map input to the activation layer P113 of the converter also tend to be large. In this case, if the activation function applied by the activation layer P113 is the sigmoid function, most of the feature amounts output by the activation layer P113 have values close to 0 or close to 1 rather than values close to the midpoint 0.5. When a feature map with such feature amounts is input to the threshold processing layer P114, it becomes more likely that all pixels of the image output from the threshold processing layer P114 have the gradation value 0 or all have the gradation value 1, in which case the recognizer is not trained and the learning speed decreases.
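
The saturation effect can be checked numerically. In the sketch below, the `scale` factor is a stand-in for the magnitude of the filter coefficients (an assumption for illustration): the same inputs are passed through the sigmoid function at two scales, and the distance of each output from the nearer of 0 and 1 is measured.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def distance_from_extremes(inputs, scale):
    """For each input, how far sigmoid(scale * x) is from the nearer of 0 and 1."""
    outputs = [sigmoid(scale * x) for x in inputs]
    return [min(o, 1.0 - o) for o in outputs]


inputs = [-1.0, -0.5, 0.5, 1.0]
small_weights = distance_from_extremes(inputs, scale=1.0)
large_weights = distance_from_extremes(inputs, scale=10.0)
```

With the larger scale every output lies within 0.01 of 0 or 1, whereas with the smaller scale the outputs remain spread around 0.5.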

The spectral norm is the maximum, over the feature maps input to the convolutional layer, of the ratio of the L2 norm of the output feature map corresponding to each input to the L2 norm of that input. When the spectral norm is large, the absolute values of the feature amounts in the output data of the convolutional layer are large, so, likewise, the output of the threshold processing layer P114 is more likely to be an image in which the gradation values of all pixels are 0 or 1, and the learning speed decreases.

By adding the L2 norm or the spectral norm to the error function, the learning device 2 trains the CNN so as to reduce the value of the L2 norm or the spectral norm. This reduces the possibility that the gradation values of all pixels of the binary image output from the converter become 0 or 1, and improves the learning speed. Instead of adding the spectral norm of the convolutional-layer filter to the error function, filter coefficients normalized so that the spectral norm is 1 may be used in the convolution.
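As a rough sketch of the two options described above, the following code estimates a spectral norm by power iteration, adds it to an error value with a weight, and alternatively normalizes the coefficients so that the spectral norm becomes 1. Treating the layer as a small matrix, using power iteration, and the weight value are assumptions for illustration; the description does not state how the spectral norm is computed.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def l2(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(m, iterations=100):
    """Estimate max ||m v|| / ||v|| by power iteration on (m^T m).

    Assumes the all-ones start vector is not orthogonal to the dominant
    right singular vector of m.
    """
    mt = transpose(m)
    v = [1.0] * len(m[0])
    for _ in range(iterations):
        w = matvec(mt, matvec(m, v))
        norm = l2(w)
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    return l2(matvec(m, v))


def regularized_error(task_error, m, weight=0.01):
    """Option 1: add the weighted spectral norm to the error function."""
    return task_error + weight * spectral_norm(m)


def spectrally_normalized(m):
    """Option 2: scale the coefficients so that the spectral norm becomes 1.

    Assumes m is not the zero matrix.
    """
    sigma = spectral_norm(m)
    return [[w / sigma for w in row] for row in m]


# Hypothetical layer weights, written as a matrix.
weights = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(weights)
normalized = spectrally_normalized(weights)
```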

The error function may also include the proportion of pixels whose gradation value is 1 (or the proportion of pixels whose gradation value is 0) among the pixels constituting the binary image output from the converter. The error function may also include the squared error of the gradation values between each pixel of the binary image output from the converter and the pixels adjacent to it. This improves compression efficiency when the binary image output from the converter is compressed.
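The two additional terms can be sketched as follows. The function names and the sample images are hypothetical, and counting each horizontally or vertically adjacent pair once is an assumption; the description only states that the squared error between each pixel and its adjacent pixels may be included.

```python
def ones_ratio(image):
    """Proportion of pixels whose gradation value is 1."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def neighbor_squared_error(image):
    """Sum of squared differences between each pixel and its right and lower neighbors."""
    height, width = len(image), len(image[0])
    total = 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += (image[y][x] - image[y][x + 1]) ** 2
            if y + 1 < height:
                total += (image[y][x] - image[y + 1][x]) ** 2
    return total


# Hypothetical 4x4 binary images: one compact blob and one checkerboard.
blob = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
checkerboard = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
```

The checkerboard has the larger neighbor term, so penalizing that term favors images with long uniform runs, which lossless coders generally compress well.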

Subsequently, the learning means 234 updates the parameters of the CNN (S106). The learning means 234 calculates the gradient of each layer of the CNN by error backpropagation and, by a stochastic gradient method based on the calculated gradients, updates the parameters so that the error becomes smaller. The updated parameters are the coefficients of the filters applied in the convolutional layers and the mean and variance of each feature amount corrected by the batch normalization process in the convolutional layers. The updated parameters may include the bias value added to each feature amount before the activation function is applied in the convolutional layers and the activation layers. The updated parameters may also include, for example, the threshold of the step function applied in the threshold processing layer.

When applying error backpropagation to update the parameters of the converter, the learning means 234 may use the gradient of another function, different from the step function, as the gradient of the threshold processing layer P114, which is included in the converter and applies the step function to its input. The other function is one whose zero-gradient interval is smaller than that of the step function, for example the identity function or the sigmoid function. In this way, the learning device 2 can update the parameters so as to make the error smaller and improve the learning speed. That is, error backpropagation computes the gradient of each preceding layer from the gradient of the layer after it, and updates the parameters by identifying those that contribute most to the error. Therefore, when there is a layer that applies a function in which zero-gradient intervals dominate, such as a step function, it becomes difficult to identify the parameters responsible for the error in the layers before that layer. By using, as the gradient of the threshold processing layer P114, the gradient of another function whose zero-gradient interval is smaller than that of the step function, the learning device 2 makes it easier to identify the parameters responsible for the error.
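The gradient substitution described above can be sketched for a single value passing through the threshold processing layer. The threshold of 0.5, the squared-error loss, and all function names are assumptions for illustration. The forward pass always uses the step function; only the gradient used in the backward pass is swapped.

```python
import math


def step(x, threshold=0.5):
    """Forward pass of the threshold layer: a step function."""
    return 1.0 if x >= threshold else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def step_gradient(x):
    """True gradient of the step function (zero wherever it is defined)."""
    return 0.0


def identity_gradient(x):
    """Gradient of the identity function, used as a substitute."""
    return 1.0


def sigmoid_gradient(x):
    """Gradient of the sigmoid function, used as a substitute."""
    s = sigmoid(x)
    return s * (1.0 - s)


def backward_through_threshold(upstream_gradient, x, local_gradient):
    """Chain rule: gradient passed to the layers before the threshold layer."""
    return upstream_gradient * local_gradient(x)


x = 0.7                         # hypothetical input to the threshold layer
y = step(x)                     # forward output
target = 0.0                    # hypothetical desired output
upstream = 2.0 * (y - target)   # derivative of (y - target)^2 with respect to y

blocked = backward_through_threshold(upstream, x, step_gradient)
via_identity = backward_through_threshold(upstream, x, identity_gradient)
via_sigmoid = backward_through_threshold(upstream, x, sigmoid_gradient)
```

With the true step gradient, nothing reaches the earlier layers; with either substitute, a nonzero gradient is propagated and the earlier parameters can be updated.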

Subsequently, the learning means 234 determines whether the learning end condition is satisfied (S107). The end condition is, for example, that the parameters have been updated a predetermined number of times or more, or that the amount of change of the updated parameters relative to the parameters before the update is equal to or less than a predetermined value.
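A minimal sketch of the update step (S106) followed by the end-condition check (S107) is given below. The toy error function, the learning rate, the limits, and the use of the largest absolute parameter change as the "amount of change" are all assumptions for illustration; in the described system the gradients would come from backpropagation through the CNN.

```python
def sgd_step(params, grads, learning_rate=0.1):
    """Move each parameter against its gradient so that the error decreases."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def should_stop(update_count, old_params, new_params,
                max_updates=1000, min_change=1e-6):
    """End condition: enough updates were made, or the parameters barely changed."""
    change = max(abs(n - o) for n, o in zip(new_params, old_params))
    return update_count >= max_updates or change <= min_change


def toy_gradient(params):
    """Gradient of the toy error sum((p - 3)^2), standing in for backpropagation."""
    return [2.0 * (p - 3.0) for p in params]


def train(params):
    """Repeat update and end-condition check until the condition is satisfied."""
    update_count = 0
    while True:
        new_params = sgd_step(params, toy_gradient(params))
        update_count += 1
        if should_stop(update_count, params, new_params):
            return new_params, update_count
        params = new_params


trained, updates = train([0.0])
```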

If it is determined that the end condition is not satisfied (S107-No), the learning means 234 returns the process to S102. If it is determined that the end condition is satisfied (S107-Yes), the learning means 234 stores the CNN in the first storage unit 21 as a trained model (S108) and ends the series of processes.

In this way, the learning device 2 generates the converter and the recognizer by simultaneous training. This allows the learning device 2 to train the converter to output binary images from which the recognizer can recognize the target with high accuracy.

FIG. 10 is a sequence diagram showing an example of the flow of the image recognition process executed by the image recognition system 1. The image recognition process is realized by the first processing unit 23, the second processing unit 34, and the third processing unit 43 cooperating with the components of their respective devices, based on the programs stored in the first storage unit 21, the second storage unit 31, and the third storage unit 41.

First, the output means 235 of the learning device 2 outputs the converter and the recognizer to the imaging device 3 and the recognition device 4, respectively, via the first communication unit 22 (S201). The output means 235 generates the converter and the recognizer by separating the CNN, which is the trained model stored in the first storage unit 21. The output means 235 transmits the converter to the imaging device 3 and the recognizer to the recognition device 4 via the first communication unit 22. The imaging device 3 receives the converter and stores it in the second storage unit 31. The recognition device 4 receives the recognizer and stores it in the third storage unit 41.

Subsequently, the imaging means 341 of the imaging device 3 controls the imaging unit 33 to capture a room in a building and generate a multi-valued image (S202).

Subsequently, the conversion means 342 converts the generated multi-valued image into a binary image (S203). The conversion means 342 performs the conversion by inputting the multi-valued image into the converter stored in the second storage unit 31 and causing it to output the binary image.

Subsequently, the binary image output means 343 outputs the binary image to the transmission network 6 via the second communication unit 32 (S204). The binary image output means 343 may apply a predetermined lossless compression technique to the binary image before outputting it. This allows the imaging device 3 to reduce the amount of data transmitted for the binary image.
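One possible form of the lossless compression step is sketched below: the binary image is packed at one bit per pixel and then compressed with DEFLATE via Python's standard `zlib` module. The choice of `zlib`, the bit-packing layout, and the sample image are assumptions for illustration; the description only refers to "a predetermined lossless compression technique".

```python
import zlib


def pack_bits(image):
    """Pack a row-major binary image (lists of 0/1) into bytes, 8 pixels per byte."""
    bits = [p for row in image for p in row]
    bits += [0] * (-len(bits) % 8)  # pad the last byte with zeros
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[k:k + 8]))
        for k in range(0, len(bits), 8)
    )


def unpack_bits(data, width, height):
    """Inverse of pack_bits for an image of the given size."""
    bits = [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]
    return [bits[y * width:(y + 1) * width] for y in range(height)]


def compress_binary_image(image):
    """Imaging-device side: pack, then compress losslessly."""
    return zlib.compress(pack_bits(image), 9)


def decompress_binary_image(payload, width, height):
    """Recognition-device side: decompress, then unpack."""
    return unpack_bits(zlib.decompress(payload), width, height)


# Hypothetical 64x64 binary image containing one filled square.
image = [[1 if 16 <= x < 48 and 16 <= y < 48 else 0 for x in range(64)]
         for y in range(64)]
payload = compress_binary_image(image)
restored = decompress_binary_image(payload, 64, 64)
```

Because the compression is lossless, the recognition device recovers exactly the binary image that the converter produced.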

Subsequently, the binary image acquisition means 431 of the recognition device 4 acquires the binary image from the transmission network 6 via the third communication unit 42 (S205).

Subsequently, the recognition means 432 generates a recognition result for the binary image (S206). The recognition means 432 generates the recognition result by inputting the binary image into the recognizer stored in the third storage unit 41 and causing it to output the recognition result.

Subsequently, the recognition result output means 433 outputs the generated recognition result to the display device 5 via the third communication unit 42 (S207) and ends the series of processes. For example, the recognition result output means 433 transmits to the display device 5 the display data with which the display device 5 displays the recognition result screen 700 based on the recognition result.

FIG. 11 is a diagram showing an example of the recognition result screen 700 displayed on the display device 5. The recognition result screen 700 includes a binary image 710, an image 711 of the target, a circumscribed rectangle 720, and a type display object 721.

The binary image 710 is the binary image generated by the imaging device 3. In the example shown in FIG. 11, pixels with gradation values 1 and 0 are shown in black and white, respectively. In the example shown in FIG. 11, the image 711 of the target is a full-body image of a person. The circumscribed rectangle 720 is a rectangular object circumscribing the image 711 of the target, displayed based on the region of the target generated by the recognition device 4. The type display object 721 is an object that indicates the type of the target using text or the like, displayed based on the type of the target generated by the recognition device 4. For example, the type display object 721 indicates the type corresponding to the variable with the largest value among the variables included in the type of the target generated by the recognition device 4.
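The selection of the displayed type and the derivation of a circumscribed rectangle can be sketched as follows. The type names, the scores, and the representation of the target region as a 0/1 mask are hypothetical; the description does not specify the format in which the recognition device 4 outputs the region.

```python
def most_likely_type(type_scores):
    """Return the type whose variable has the largest value."""
    return max(type_scores, key=type_scores.get)


def circumscribed_rectangle(region_mask):
    """Return (left, top, right, bottom) of the smallest rectangle enclosing all 1-pixels."""
    points = [(x, y)
              for y, row in enumerate(region_mask)
              for x, value in enumerate(row) if value]
    if not points:
        return None  # no target region
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical recognition result for one target.
type_scores = {"person": 0.91, "vehicle": 0.06, "other": 0.03}
region_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

label = most_likely_type(type_scores)
rectangle = circumscribed_rectangle(region_mask)
```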

As described above, in the image recognition system 1, the learning device 2 trains a CNN in which the converter and the recognizer are coupled so that the output of the converter becomes the input of the recognizer. The imaging device 3 then converts a multi-valued image into a binary image using the converter, which is a trained model, and the recognition device 4 generates a recognition result for the binary image using the recognizer, which is a trained model. In this way, the image recognition system 1 can stably reduce the data size of the image subject to image recognition while maintaining the accuracy of image recognition.

In the above description, the input of the converter is an image having a plurality of channels, but the input of the converter may be a one-channel image (for example, a grayscale image).

In the above description, the output of the converter is a one-channel binary image, but this is not a limitation. The converter may output a plurality of binary images, one for each channel of the input multi-valued image. For example, when the input multi-valued image has the three RGB channels, the converter generates a binary image corresponding to the R channel, a binary image corresponding to the G channel, and a binary image corresponding to the B channel.

In this case, the recognizer accepts the plurality of binary images as input. The edge image generation means 233 generates an edge image for each channel based on the gradation values of that channel of the learning multi-valued image. This improves the recognition accuracy of the recognizer.

The output of the converter may also be a multi-valued image with a gradation range smaller than that of the multi-valued image input to the converter. This improves the recognition accuracy of the recognizer.

In the above description, the image recognition system 1 has one imaging device 3 and one recognition device 4, but this is not a limitation. The image recognition system 1 may have a plurality of imaging devices 3 or a plurality of recognition devices 4. In this case, the learning device 2 outputs the converter to each of the imaging devices 3, or outputs the recognizer to each of the recognition devices 4.

The functions of the learning device 2 or the display device 5 may be realized by the imaging device 3 or the recognition device 4.

In the above description, a recognizer that recognizes the region of an object, or the region and type of an object, and its corresponding converter were given as examples. However, the recognizer and its corresponding converter may recognize attributes of a person such as age or gender, or may recognize states such as the degree of congestion or the posture of people or vehicles; the invention can be applied to image recognition of various targets. In those cases, training is performed with learning recognition results set according to the target.

Those skilled in the art will appreciate that various changes, substitutions, and modifications can be made without departing from the spirit and scope of the present invention. For example, the processing of each unit described above may be executed in a different order as appropriate within the scope of the present invention. The embodiments and modifications described above may also be combined as appropriate within the scope of the present invention.

1 Image recognition system
2 Learning device
231 Learning model acquisition means
232 Learning data acquisition means
233 Edge image generation means
234 Learning means
235 Output means
3 Imaging device
341 Imaging means
342 Conversion means
343 Binary image output means
4 Recognition device
431 Binary image acquisition means
432 Recognition means
433 Recognition result output means

Claims

An image recognition system comprising:
an imaging means that images a space in which a predetermined target can be imaged and generates a first image composed of pixels having gradation values within a first gradation range;
a conversion means that converts the first image into a second image by a converter that, when the first image is input, outputs the second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and
a recognition means that generates a recognition result for the second image by a recognizer that, when the second image is input, performs processing for recognizing the target on the second image and outputs a recognition result.

The image recognition system according to claim 1,
wherein the converter and the recognizer are trained neural networks obtained by training a neural network, in which the converter and the recognizer are coupled so that the output of the converter becomes the input of the recognizer, such that the recognition result output when a first learning image composed of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result to be output for the first learning image.

The image recognition system according to claim 2,
wherein the converter and the recognizer are trained neural networks trained such that, when the first learning image is input to the coupled neural network, the image composed of pixels having gradation values within the second gradation range output by the converter approaches an edge image generated from the first learning image, and the recognition result output by the coupled neural network approaches the learning recognition result.

The image recognition system according to any one of claims 1 to 3, further comprising:
an output means that outputs the converted second image to a predetermined transmission network; and
an acquisition means that acquires the second image from the transmission network,
wherein the recognition means generates a recognition result for the acquired second image.

An imaging device comprising:
an imaging means that images a space in which a predetermined target can be imaged and generates a first image composed of pixels having gradation values within a first gradation range;
a conversion means that converts the first image into a second image by a converter that, when the first image is input, outputs the second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and
an output means that outputs the second image.

The imaging device according to claim 5,
wherein the converter is a trained neural network obtained by training a neural network in which the converter is coupled to a recognizer so that the output of the converter becomes the input of the recognizer, the recognizer being one that, when the second image is input, performs processing for recognizing the target on the second image and outputs a recognition result, the training being such that the recognition result output when a first learning image composed of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result to be output for the first learning image.

A recognition device comprising:
an acquisition means that acquires a second image obtained by converting a first image generated by imaging, using a converter that, when a first image composed of pixels having gradation values within a first gradation range is input, outputs a second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and
a recognition means that generates a recognition result for the second image by a recognizer that, when the second image is input, performs processing for recognizing the target on the second image and outputs a recognition result.

The recognition device according to claim 7,
wherein the recognizer is a trained neural network obtained by training a neural network, in which the converter and the recognizer are coupled so that the output of the converter becomes the input of the recognizer, such that the recognition result output when a first learning image composed of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result to be output for the first learning image.

An image recognition method comprising:
imaging a space in which a predetermined target can be imaged and generating a first image composed of pixels having gradation values within a first gradation range;
converting the first image into a second image by a converter that, when the first image is input, outputs the second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and
generating a recognition result for the second image by a recognizer that, when the second image is input, performs processing for recognizing the target on the second image and outputs a recognition result.