JP2021174289A

JP2021174289A - Learning device, method, and program

Info

Publication number: JP2021174289A
Application number: JP2020078227A
Authority: JP
Inventors: 賢史小森田; Masashi Komorida; 和之田坂; Kazuyuki Tasaka
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2020-04-27
Filing date: 2020-04-27
Publication date: 2021-11-01
Anticipated expiration: 2040-04-27
Also published as: JP7290602B2

Abstract

PROBLEM TO BE SOLVED: To provide a learning device for learning a model that automatically estimates landmarks from an image without the trouble of manual annotation or the like. SOLUTION: The learning device includes: a point cloud construction unit 12 that constructs a point cloud of a field by detecting feature points in each of a plurality of images of the field and obtaining the correspondence of the feature points between the images; a determination unit 13 that sets, as landmark candidate regions, regions of the point cloud determined to have a high point density and a wide range of shooting directions of the images corresponding to the points; and a learning unit 14 that assigns an identifier to each landmark candidate region, performs learning with learning data in which each landmark candidate region in the images is annotated as the object region designated by its identifier, and obtains an identification model for the landmark candidate regions determined, from the learning result, to have high object-region identification performance. SELECTED DRAWING: Figure 2

Description

The present invention relates to a learning device, a method, and a program for learning a model that automatically estimates landmarks from images.

Techniques concerning landmarks are among the technologies related to positioning and similar fields.

Patent Document 1 classifies and searches images based on meta information such as landmark names, making it more convenient for users to organize and search the images they have taken. Patent Document 2 reduces the processing load of a landmark search by searching first among landmarks with similar shooting conditions. Patent Document 3 provides a landmark selection device that can select robust landmarks by taking position variance and luminance variance into account.

Japanese Patent Application Laid-Open No. 2011-010171
Japanese Patent Application Laid-Open No. 2019-052904
Japanese Patent Application Laid-Open No. 2019-053462

However, the conventional techniques described above have the problem that using landmarks requires the effort of manually registering meta information and the like in advance.

Some techniques judge the quality of a landmark as an image when selecting landmarks, but the selected landmark may have many look-alikes or may be something that is hardly ever found. Landmark learning has been available only for general landmarks that were annotated manually (manually annotated open data include, for example, the Oxford and Paris datasets). Indoors, for example, each facility has its own landmarks, so such general data cannot cope, and annotations have to be created manually for each facility, which takes time and effort.

In view of the above problems of the prior art, an object of the present invention is to provide a learning device, a method, and a program that use a point cloud, which is commonly used in positioning and map creation, to learn a model that automatically estimates landmarks from images without incurring the trouble of manual annotation or the like.

To achieve the above object, the present invention is a learning device comprising: an image acquisition unit that acquires a plurality of images of a field; a point cloud construction unit that constructs a point cloud of the field by detecting feature points in each of the images and obtaining the correspondence of the feature points between the images; a determination unit that sets, as landmark candidate regions, regions of the point cloud in which the point density is determined to be high and the range of shooting directions of the images corresponding to the points is determined to be wide; and a learning unit that assigns an identifier to each landmark candidate region, performs learning with learning data in which each landmark candidate region in the images is annotated as the object region designated by its identifier, and obtains an identification model for the landmark candidate regions determined, from the result of the learning, to have high object-region identification performance. The invention is also a method and a program corresponding to the learning device.

According to the present invention, regions of a constructed point cloud that are automatically determined to have a high density and a wide range of corresponding shooting directions are set as landmark candidate regions; an identification model is trained on images in which the object regions formed by the landmark candidate regions are distinguished from one another, used as annotated learning data; and an identification model is obtained for those landmark candidate regions determined to have high object-region identification performance. This makes it possible to learn a model that automatically estimates landmarks from images without incurring the trouble of manual annotation or the like.

FIG. 1 is a configuration diagram of a point cloud processing system according to one embodiment.
FIG. 2 is a functional block diagram of the point cloud processing system according to one embodiment.
FIG. 3 is a flowchart of the operation of the point cloud processing system according to one embodiment.
FIG. 4 is a diagram showing an example of the relationship between a point cloud and landmarks, which is the basis on which the learning device can generate a landmark detection model without manual effort.
FIG. 5 is a diagram showing, as a two-dimensional cross section, a schematic example of the visual field range r(i,j,k) to be calculated.
FIG. 6 is a diagram schematically showing the mapping in procedure 2.
FIG. 7 is a diagram showing the hardware configuration of a general computer.

FIG. 1 is a configuration diagram of a point cloud processing system according to one embodiment. The point cloud processing system 100 includes at least one terminal 1 and at least one server 2 that can communicate with each other via a network NW such as the Internet. The terminal 1 is a mobile body that can be configured as a mobile device such as a smartphone, an in-vehicle device, or the like, and it acquires the point cloud to be processed by the point cloud processing system 100 from its surrounding environment, for example by capturing video. In one embodiment, the point cloud processing system 100 may include only the terminal 1 and no server 2, so that the terminal 1 alone performs the point cloud processing. In another embodiment, the point cloud processing system 100 may include both the terminal 1 and the server 2, which share the point cloud processing between them.

FIG. 2 is a functional block diagram of the point cloud processing system 100 according to one embodiment. The point cloud processing system 100 includes a learning device 10 and an estimation device 20. The learning device 10 includes an image acquisition unit 11, a point cloud construction unit 12, a determination unit 13, a learning unit 14, and a model DB (database) 15; the determination unit 13 further includes a voxel division unit 131, a density evaluation unit 132, a visual field evaluation unit 133, and a candidate determination unit 134. The estimation device 20 includes an image capturing unit 21 and an estimation unit 22.

Each of the learning device 10 and the estimation device 20 can be configured as the terminal 1 (and the server 2) shown in FIG. 1. The terminal 1 that constitutes the learning device 10 as a mobile body and the terminal 1 that constitutes the estimation device 20 as a mobile body may be different terminals.

The terminal 1 that constitutes the learning device 10 as a mobile body includes at least the image acquisition unit 11 and acquires video by photographing its own surrounding environment; the components of the learning device 10 other than the image acquisition unit 11 may be provided in the server 2 so that their processing is delegated to the server 2. Likewise, the terminal 1 that constitutes the estimation device 20 as a mobile body includes at least the image capturing unit 21 and acquires an image by photographing its own surrounding environment; the components of the estimation device 20 other than the image capturing unit 21 may be provided in the server 2 so that their processing is delegated to the server 2.

FIG. 3 is a flowchart of the operation of the point cloud processing system 100 according to one embodiment. As its overall operation, in steps S1 to S6 of FIG. 3 the learning device 10 builds a model (a model that detects landmarks from images) by learning and stores the built model in the model DB 15; then, in step S7 of FIG. 3, the estimation device 20 refers to and uses the model stored in the model DB 15 to perform position estimation.

FIG. 4 is a diagram showing an example of the relationship between a point cloud and landmarks, which is the basis on which the learning device 10 can generate a landmark detection model without manual effort. The point cloud PG in FIG. 4 was constructed, as a representation of the environment of an indoor space, from video captured by a camera moving through that space along a movement trajectory C; the figure shows it for one part of the field. In FIG. 4, example EX1 shows the point cloud PG in a top view of the field (looking from above toward the ground), and example EX2 shows it in a side view of the field (looking horizontally while standing upright on the ground). Within this point cloud PG, the two locations P1 and P2 are examples of locations well suited to being landmarks. Both P1 and P2 have the characteristics that the points of the point cloud are densely packed, and that the directions from which those points are seen (that is, the camera directions of the video frame images from which the points were extracted) cover a wide range. FIG. 4 also contains other, similar landmark candidates besides P1 and P2; the locations P1 and P2 are shown as illustrative examples.

That is, an object located at a place such as P1 or P2 is visible from a wide range of positions (without being blocked by fixed objects or the like) and also yields many points in the point cloud (that is, many points are detected as feature points in the images). These quantitative characteristics alone make it possible to estimate automatically that the object is a good landmark. The learning device 10 of the present embodiment uses these quantitative characteristics of the point cloud to learn a landmark detection model automatically, without incurring the trouble of manual annotation or the like.

The details of each step in FIG. 3 are described below, together with the details of the processing of each functional unit of the learning device 10 and the estimation device 20.

When the flow of FIG. 3 starts, in step S1 the image acquisition unit 11 acquires the video from which the learning device 10 (specifically, its point cloud construction unit 12) constructs the point cloud of the field, and the flow proceeds to step S2. In hardware terms the image acquisition unit 11 is a camera; being mounted on the terminal 1, which is a mobile body, it captures the video of step S1 while moving through the field together with the mobile body. The video acquired by the image acquisition unit 11 (the frame images F(t) at times t=1,2,3,...) is output to the point cloud construction unit 12 and the learning unit 14.

In step S2, the point cloud construction unit 12 constructs the point cloud PG from the frame images F(t) (t=1,2,3,...) obtained in step S1, associates each point of the point cloud PG (a point in three-dimensional world coordinates) with the corresponding feature points of the frame images F(t), and outputs the result to the determination unit 13 as the information of the point cloud PG. The flow then proceeds to step S3.

Any existing method, such as SfM (Structure from Motion), may be used to construct the point cloud PG from the frame images F(t) (t=1,2,3,...). In SfM, feature points and local feature descriptors are obtained from each image, for example with SIFT features, and the correspondences of feature points between images (pairs of feature points whose local descriptors are judged to match) are used to obtain the three-dimensional world coordinates of each feature point by a principle such as triangulation, thereby constructing the point cloud PG. The feature points of the frame images F(t) can be associated with the point cloud PG by recording, for each point of the point cloud PG (a point in three-dimensional world coordinates), which feature point (a point in two-dimensional image coordinates) and local descriptor of which frame image F(t) it corresponds to; this information is already available when the point cloud PG is constructed.
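The triangulation principle mentioned above can be illustrated with a minimal sketch. It assumes two viewing rays, each given by a camera center and a direction toward the matched feature point, and uses the midpoint method; this is only one of several possible triangulation schemes and is not stated in the embodiment. The function name `triangulate_midpoint` is hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Return the midpoint of the closest points on two rays c1 + s*d1 and c2 + t*d2.

    c1, c2: camera centers in world coordinates (assumed known from SfM).
    d1, d2: ray directions toward the matched feature point (need not be unit length).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    # Parameters that minimize the distance between the two rays.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]
```

With noise-free rays the two closest points coincide and the midpoint is the exact intersection; with real feature matches the midpoint serves as a simple estimate.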

In step S3, the voxel division unit 131 divides the point cloud PG obtained in step S2 into voxels, and the flow proceeds to step S4. The voxel division unit 131 can do this by partitioning, in advance, the three-dimensional world coordinate space in which the point cloud PG is defined into predetermined voxels (a grid of rectangular parallelepipeds or cubes). For explanation, of the result of the voxel division of the point cloud PG by the voxel division unit 131, the points belonging to the voxel V(i,j,k), which is the i-th, j-th, and k-th voxel (each an integer) in the X, Y, and Z directions of the three-dimensional world coordinates, are denoted as the point cloud PG(i,j,k).
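As a minimal sketch of this voxel division, assuming cubic voxels of a common edge length and a grid anchored at a chosen origin (both assumptions; the embodiment also allows rectangular voxels), each point can be assigned to its voxel index (i,j,k) by integer division of its coordinates. The function name `split_into_voxels` is hypothetical.

```python
import math
from collections import defaultdict


def split_into_voxels(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group 3D points by the index (i, j, k) of the cubic voxel that contains them.

    Returns a dict mapping (i, j, k) to the list of points PG(i, j, k).
    """
    voxels = defaultdict(list)
    for p in points:
        idx = tuple(math.floor((p[a] - origin[a]) / voxel_size) for a in range(3))
        voxels[idx].append(p)
    return dict(voxels)
```

Only voxels that contain at least one point appear in the result, which keeps the representation sparse for large fields.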

In step S4, for each voxel V(i,j,k), the density evaluation unit 132 evaluates the density d(i,j,k) of its point cloud PG(i,j,k), and the visual field evaluation unit 133 evaluates the visual field range r(i,j,k) of its point cloud PG(i,j,k). The flow then proceeds to step S5.

The density evaluation unit 132 can obtain the density as d(i,j,k)=num(i,j,k)/vol(i,j,k), where vol(i,j,k) is the volume of the voxel V(i,j,k) and num(i,j,k) is the number of points in the point cloud PG(i,j,k) belonging to the voxel V(i,j,k). When all voxels V(i,j,k) have the same size, so that the volume vol(i,j,k) is a constant, the number of points may be used directly as the density, that is, d(i,j,k)=num(i,j,k).
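The density formula above can be sketched as follows, assuming cubic voxels of a common edge length and a dictionary that maps each voxel index to its list of points (an assumed data layout). The function name `voxel_density` is hypothetical.

```python
def voxel_density(voxels, voxel_size):
    """Compute d(i,j,k) = num(i,j,k) / vol(i,j,k) for every occupied voxel.

    voxels: dict mapping (i, j, k) to the list of points in that voxel.
    voxel_size: edge length of the cubic voxels (assumed common to all voxels).
    """
    vol = voxel_size ** 3
    return {idx: len(pts) / vol for idx, pts in voxels.items()}
```

Because the volume is the same for every voxel in this sketch, ranking voxels by this density is equivalent to ranking them by the raw point count.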

The visual field evaluation unit 133 may, for example, proceed as follows. For each point p belonging to the point cloud PG(i,j,k) of the voxel (p∈PG(i,j,k)), let Cp1, Cp2, ..., CpN be the camera positions of the two or more frame images in which the feature points corresponding to p were acquired. (The number N=N(p) generally differs from point to point, but is written simply as N below.) Consider the cone Cone(p) that has the point p as its apex and a base B(p) enclosed by the corresponding N camera positions Cp1, Cp2, ..., CpN, and let s(p) be the region that this cone cuts out of the surface of the unit sphere centered on the center of the voxel V(i,j,k) (the full sphere of directions, whose area is taken to be 1). The visual field range r(i,j,k) may then be calculated as the area of the union ∪s(p) of the regions s(p) over all points p belonging to the voxel point cloud PG(i,j,k).

FIG. 5 is a diagram showing, as a two-dimensional cross section, a schematic example of the visual field range r(i,j,k) calculated as described above. The voxel V whose visual field range is evaluated contains three points p, q, and r. The cone Cone(p) formed by at least two camera positions Cp1 and Cp2 corresponding to the point p cuts the region s(p) out of the unit sphere E; the cone Cone(q) formed by at least two camera positions Cq1 and Cq2 corresponding to the point q cuts out the region s(q); and the cone Cone(r) formed by at least two camera positions Cr1 and Cr2 corresponding to the point r cuts out the region s(r). In the example of FIG. 5 the regions s(p), s(q), and s(r) do not overlap one another, so the visual field range r(i,j,k) is determined as the sum of their areas.

The visual field evaluation unit 133 is not limited to the above method. More generally, the visual field range of each point p belonging to the voxel point cloud PG(i,j,k) (p∈PG(i,j,k)) can be evaluated as wider the greater the dispersion of the directions of its corresponding camera positions Cp1, Cp2, ..., CpN (the directions as seen from the point p or from the center of the voxel V(i,j,k)), and the visual field range r(i,j,k) can be calculated so that its value becomes larger the wider the overall visual field range given by all points p of the voxel point cloud PG(i,j,k).
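One possible dispersion-based score, offered purely as an assumption and not as the measure used in the embodiment, is one minus the length of the mean unit direction vector, a quantity borrowed from directional statistics. It is 0 when all cameras lie in the same direction and approaches 1 as the directions spread out evenly. The function name `direction_spread` and the pooling of all (point, camera) pairs of a voxel into one score are hypothetical choices.

```python
import math


def direction_spread(observations):
    """Score the spread of viewing directions for one voxel.

    observations: list of (point, camera_position) pairs, each a 3D tuple.
    Returns 1 - |mean unit direction|, a value in [0, 1].
    """
    units = []
    for point, cam in observations:
        v = [c - p for c, p in zip(cam, point)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:  # skip degenerate pairs where camera and point coincide
            units.append([x / norm for x in v])
    if not units:
        return 0.0
    mean = [sum(u[a] for u in units) / len(units) for a in range(3)]
    return 1.0 - math.sqrt(sum(m * m for m in mean))
```

Unlike the cone-based area, this score does not measure a solid angle, so any threshold applied to it would need to be chosen separately.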

For example, without setting up a unit sphere, the visual field range r(i,j,k) may be calculated more simply from the six faces of the voxel V(i,j,k), which is a rectangular parallelepiped or a cube. For each of the six faces, it is determined whether at least one of the straight lines connecting a point p belonging to the voxel point cloud PG(i,j,k) with its camera positions Cp1, Cp2, ..., CpN passes through that face. A face is given an evaluation value of 1 if at least one such line passes through it and 0 otherwise, and the visual field range r(i,j,k) is the sum of the evaluation values over the six faces (an integer from 0 to 6).
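The six-face evaluation can be sketched as follows, under the assumptions that the voxel is an axis-aligned box with corners `lo` and `hi`, that each point lies inside it, and that "the line connecting p and the camera" is read as the segment from p to the camera position. The function names `exit_face` and `face_score` are hypothetical.

```python
def exit_face(p, cam, lo, hi):
    """Return the face (axis, sign) through which the segment p -> cam leaves the box.

    Returns None if the camera coincides with p or lies inside the box.
    """
    best_t, best_face = None, None
    for a in range(3):
        d = cam[a] - p[a]
        if d > 0:
            t, face = (hi[a] - p[a]) / d, (a, +1)
        elif d < 0:
            t, face = (lo[a] - p[a]) / d, (a, -1)
        else:
            continue
        # The face reached first along the segment is the exit face.
        if best_t is None or t < best_t:
            best_t, best_face = t, face
    if best_t is None or best_t > 1.0:
        return None
    return best_face


def face_score(observations, lo, hi):
    """Count how many of the six faces are crossed by at least one segment (0 to 6)."""
    faces = set()
    for p, cam in observations:
        f = exit_face(p, cam, lo, hi)
        if f is not None:
            faces.add(f)
    return len(faces)
```

A segment that leaves exactly through an edge or corner is attributed to a single face in this sketch, which is a simplification.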

The N camera positions Cp1, Cp2, ..., CpN corresponding to a point p are already known when the point cloud construction unit 12 constructs the point cloud PG. Because each point p of the point cloud PG is associated with the corresponding feature points of the frame images, it is also associated with the N frame images in question; and because the camera position at which each of these N frame images was captured is computed when the point cloud PG is calculated, the camera positions Cp1, Cp2, ..., CpN can likewise be held in association with the point cloud PG.

In step S5, the candidate determination unit 134 determines, for each voxel V(i,j,k), whether the voxel is a candidate voxel based on the density d(i,j,k) calculated by the density evaluation unit 132 and the visual field range r(i,j,k) calculated by the visual field evaluation unit 133 in step S4, and outputs the result to the learning unit 14. The flow then proceeds to step S6.

Specifically, the candidate determination unit 134 can determine a voxel V(i,j,k) to be a candidate voxel when, as expressed below, its density d(i,j,k) is greater than a predetermined density threshold THd and its visual field range r(i,j,k) is greater than a predetermined visual field range threshold THr.
d(i,j,k) > THd and r(i,j,k) > THr
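The threshold test above can be sketched as follows, assuming the density and the visual field range are held in dictionaries keyed by voxel index (an assumed data layout). The function name `select_candidate_voxels` and the parameter names are hypothetical.

```python
def select_candidate_voxels(density, view_range, th_d, th_r):
    """Return the set of voxel indices with d(i,j,k) > THd and r(i,j,k) > THr.

    density, view_range: dicts mapping (i, j, k) to d(i,j,k) and r(i,j,k).
    A voxel without a view-range entry is treated as having range 0.
    """
    return {
        idx
        for idx, d in density.items()
        if d > th_d and view_range.get(idx, 0.0) > th_r
    }
```

Both comparisons are strict, matching the inequalities in the text.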

A candidate voxel determined in this way is a voxel V(i,j,k) whose point cloud PG(i,j,k) is a candidate for having originated from a landmark with the characteristics illustrated schematically by the locations P1 and P2 in FIG. 4. In other words, a candidate voxel indicates that a landmark may exist within that voxel V(i,j,k).

In step S6, the learning unit 14 performs learning using the video acquired by the image acquisition unit 11 (the frame images F(t)) and the candidate voxel information determined by the candidate determination unit 134, thereby obtaining a deep learning model that automatically detects landmark regions in images, and stores this deep learning model in the model DB 15. The flow then proceeds to step S7.

Specifically, in step S6 the learning unit 14 can learn the model by the following procedures 1 to 3.

(Procedure 1) Connected-region labeling or the like is applied to the candidate voxels determined by the candidate determination unit 134, so that adjacent (connected) voxels, that is, voxels forming one connected component, are gathered into a single group and given a common ID. For explanation, suppose that K connected candidate voxel groups are obtained in total, denoted Vc1, Vc2, ..., VcK (={Vck|k=1,2,...,K}).
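Procedure 1 can be sketched with a breadth-first search over the candidate voxel indices. The sketch assumes 6-connectivity (voxels sharing a face count as adjacent), which the embodiment does not specify; 18- or 26-connectivity would only change the neighbour list. The function name `label_connected_voxels` is hypothetical.

```python
from collections import deque


def label_connected_voxels(candidates):
    """Assign a common ID (1, 2, ..., K) to each face-connected group of candidate voxels.

    candidates: set of (i, j, k) voxel indices.
    Returns a dict mapping each voxel index to its group ID.
    """
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    labels = {}
    next_id = 1
    for start in sorted(candidates):  # sorted for reproducible IDs
        if start in labels:
            continue
        labels[start] = next_id
        queue = deque([start])
        while queue:
            i, j, k = queue.popleft()
            for di, dj, dk in neighbours:
                n = (i + di, j + dj, k + dk)
                if n in candidates and n not in labels:
                    labels[n] = next_id
                    queue.append(n)
        next_id += 1
    return labels
```

The number of distinct IDs in the result corresponds to K, the number of connected candidate voxel groups.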

(Procedure 2) Each connected candidate voxel group Vck (k=1,2,...,K) is mapped onto the frame images F(t) (t=1,2,3,...) by perspective projection, giving the region R(t,k) that each connected candidate voxel group Vck occupies in each frame image F(t). (Depending on the combination of t and k, the group may not be projected into the image at all, in which case R(t,k) is the empty set.)

FIG. 6 is a diagram schematically showing the mapping in procedure 2: a connected candidate voxel group Vck in the three-dimensional world coordinate system is perspectively projected onto the two-dimensional image coordinates of a frame image F(t), and the projected area defines the region R(t,k). As already explained, the camera position (and orientation) of each frame image F(t), which corresponds to the extrinsic parameters of the camera, is determined in the three-dimensional world coordinate system when the point cloud PG is obtained. Using this information (together with the known intrinsic parameters of the camera), the perspective projection shown in FIG. 6 can be performed.
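The perspective projection of procedure 2 can be sketched with a pinhole camera model. The sketch assumes cubic voxels on a grid anchored at the world origin, a world-to-camera rotation matrix `R`, a camera center `C`, and intrinsics `fx, fy, cx, cy`; it approximates the region R(t,k) by the axis-aligned bounding box of the projected voxel corners and ignores occlusion and lens distortion. All function and parameter names are hypothetical.

```python
def project_point(X, R, C, fx, fy, cx, cy):
    """Project world point X to pixel coordinates; return None if it is behind the camera."""
    d = [X[a] - C[a] for a in range(3)]
    x, y, z = (sum(R[r][a] * d[a] for a in range(3)) for r in range(3))
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def project_voxel_group(voxels, voxel_size, R, C, fx, fy, cx, cy, width, height):
    """Return the bounding box (u0, v0, u1, v1) of a voxel group in the image, or None.

    None corresponds to an empty region R(t,k): nothing projects into the image.
    Corners behind the camera are simply dropped, which is an approximation.
    """
    pts = []
    for (i, j, k) in voxels:
        for di in (0, 1):
            for dj in (0, 1):
                for dk in (0, 1):
                    corner = ((i + di) * voxel_size, (j + dj) * voxel_size, (k + dk) * voxel_size)
                    uv = project_point(corner, R, C, fx, fy, cx, cy)
                    if uv is not None:
                        pts.append(uv)
    if not pts:
        return None
    # Clip the bounding box to the image rectangle.
    u0 = max(0.0, min(u for u, _ in pts))
    u1 = min(float(width), max(u for u, _ in pts))
    v0 = max(0.0, min(v for _, v in pts))
    v1 = min(float(height), max(v for _, v in pts))
    if u0 >= u1 or v0 >= v1:
        return None
    return (u0, v0, u1, v1)
```

A bounding box suits object detection; for instance segmentation, the projected voxel faces would have to be rasterized into a mask instead.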

(Procedure 3) The frame images F(t) onto which at least one region R(t,k) has been projected are used as automatically annotated learning data for deep learning of object detection, instance segmentation, or the like. Here, a region R(t,k) in a frame image F(t) can be used as automatically annotated learning data in the sense that it corresponds to the connected candidate voxel group Vck, that is, to the region of the object of the k-th of the K landmark candidates. In other words, the automatically annotated learning data does not explicitly state what the landmark object actually is (what object a human would recognize it as), but it does carry, as its annotation, the landmark identifier indicating which of the K landmark candidates (the k-th) the region belongs to.

Learning with this automatically annotated learning data may be carried out by the following procedures 3-1 to 3-3.

(Procedure 3-1) From the learning data (automatically annotated as described above), evaluation results for correctness and accuracy are obtained for each landmark candidate distinguished by the identifier k=1,2,...,K (denoted landmark candidate L(k) for explanation). That is, part of the learning data is used for training and the rest for validation, and while the model is trained on the training data, the learning state is checked periodically against the validation data to obtain the evaluation results. The accuracy may be evaluated with common metrics (recall, precision, F-measure, and so on); IoU (Intersection over Union) may be used to evaluate how accurately the overlap of regions is estimated, and mAP (mean Average Precision) may be used as a metric that evaluates multiple object detections together.
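The metrics named above can be sketched as follows, assuming regions are axis-aligned boxes given as (u0, v0, u1, v1) and that true-positive, false-positive, and false-negative counts have already been tallied per landmark candidate (for example, by counting a detection as correct when its IoU with the annotated region exceeds some chosen value). The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (u0, v0, u1, v1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value
```

Computing these values separately for each identifier k gives the per-candidate evaluation that the subsequent filtering relies on.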

(Procedure 3-2) Any landmark candidate L(k) that, in the evaluation results of Procedure 3-1, has low accuracy, or that produced or was the subject of an incorrect estimate, is regarded as having low uniqueness, that is, as unsuitable for use as a landmark, and is excluded from the candidates.

(Procedure 3-3) Each landmark candidate L(k) that was not excluded in Procedure 3-2 is confirmed as a landmark L(k) rather than a candidate, and the model already trained in Procedure 3-1 is stored in the model DB 15 in association with the landmark positions (positions in three-dimensional world coordinates, given as the positions of the original candidate voxel groups Vck). (Alternatively, the training data may be updated so that the landmark candidates excluded in Procedure 3-2 are treated as never having been automatically annotated, training may be performed again in the same way as in Procedure 3-1, and the resulting trained model and landmark positions may be stored in the model DB 15.)
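The evaluation and exclusion in Procedures 3-1 to 3-3 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the per-candidate validation counts (true positives, false positives, false negatives) are taken as given, the F-measure is used as the accuracy metric, and the threshold of 0.8 is an arbitrary placeholder; the embodiment does not prescribe a particular metric or threshold, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure written as 2TP / (2TP + FP + FN); 0.0 when nothing was counted."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def confirm_landmarks(validation_counts: Dict[int, Tuple[int, int, int]],
                      min_f: float = 0.8) -> List[int]:
    """Keep the identifiers k whose validation F-measure reaches min_f (assumed threshold)."""
    return sorted(
        k for k, (tp, fp, fn) in validation_counts.items()
        if f_measure(tp, fp, fn) >= min_f
    )
```

In such a sketch, iou would typically decide whether a predicted region counts as a correct detection of L(k) when the counts are accumulated, and the identifiers returned by confirm_landmarks would be the ones stored, together with their three-dimensional positions, alongside the trained model.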

As described above, when step S6 of FIG. 3 is completed, the process proceeds to step S7. In step S7, the estimation device 20 captures a new image and estimates the position of the camera that captured it (the camera constituting the image capturing unit 21) by using the model obtained in step S6 and stored in the model DB 15, whereupon the flow of FIG. 3 ends.

Specifically, in step S7, the image capturing unit 21 first captures an image and outputs it to the estimation unit 22. The estimation unit 22 then applies object recognition using the trained model stored in the model DB 15 to the image, obtains a recognition result indicating which landmarks L(k) (confirmed landmarks, not candidates) appear in which regions of the image, and obtains a position estimate for the image from the positions and sizes of the recognized landmarks. Here, the position and size of each landmark are first estimated in the image (as a position and size in two-dimensional image coordinates), and the position of the image is then estimated in three-dimensional world coordinates. Concretely, the estimate may be made according to cases (1) and (2) below, for example.

(1) When the captured image contains three or more landmarks, the three-dimensional position of the camera image can be calculated from their positional relationship. The detected sizes in two-dimensional image coordinates may also be taken into account to improve the accuracy of the three-dimensional position.

(1a) After this three-dimensional position has been estimated, accuracy may be improved further by extracting images near the estimated camera image position and estimating the position by point cloud matching.

(2) When the captured image contains two or fewer landmarks, the images in which those landmarks appear are used as the candidate images for ordinary VPS point cloud matching, and the three-dimensional position is estimated by ordinary point cloud matching.
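The case split in (1), (1a), and (2) can be expressed as the following Python sketch. The geometric solver and the point cloud matcher are passed in as callables because the embodiment does not fix a particular algorithm for either. The names Detection, choose_strategy, estimate_camera_position, solve_from_landmarks, and match_point_cloud, and the keyword arguments landmark_ids and near_position, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Detection:
    landmark_id: int                        # identifier k of a confirmed landmark L(k)
    box: Tuple[float, float, float, float]  # position and size in 2D image coordinates


def choose_strategy(detections: List[Detection],
                    landmark_positions: Dict[int, Vec3]) -> str:
    """Case (1) when three or more distinct known landmarks are visible, else case (2)."""
    ids = {d.landmark_id for d in detections if d.landmark_id in landmark_positions}
    return "landmark_geometry" if len(ids) >= 3 else "point_cloud_matching"


def estimate_camera_position(detections: List[Detection],
                             landmark_positions: Dict[int, Vec3],
                             solve_from_landmarks: Callable,
                             match_point_cloud: Callable,
                             refine: bool = True) -> Vec3:
    known = [d for d in detections if d.landmark_id in landmark_positions]
    ids = sorted({d.landmark_id for d in known})
    if len(ids) >= 3:
        # (1) Coarse 3D position from the landmarks' positional relationship.
        coarse = solve_from_landmarks(known, landmark_positions)
        if not refine:
            return coarse
        # (1a) Optional refinement: match against images near the coarse position.
        return match_point_cloud(landmark_ids=ids, near_position=coarse)
    # (2) Two or fewer landmarks: they only narrow down the candidate images.
    return match_point_cloud(landmark_ids=ids, near_position=None)
```

Keeping the two solvers behind callables leaves the choice of geometric method and of VPS matcher open, which matches the level of detail given in the description.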

As described above, the learning device of this embodiment makes it possible to train a model that automatically estimates landmarks from images, using point clouds of the kind commonly used in positioning and map creation, without the labor of manual annotation or the like. Various supplementary notes follow.

(A) In step S1 of FIG. 3, the image acquisition unit 11 was described as acquiring the frame images of video captured by the terminal 1, as a single moving body, while moving through the field. However, the image groups to which this embodiment can be applied are not limited to those acquired as video in this way. In step S1, the image acquisition unit 11 may acquire a group of images of the field for which the point cloud is to be constructed, captured at various positions and orientations by the cameras of a plurality of moving bodies. The images need not come from cameras on moving bodies; a group of images captured by fixed cameras at various positions may be acquired instead. In such cases, the image acquisition unit 11 may be configured, instead of or in addition to being configured in hardware as a camera, as a communication interface for acquiring images captured by various cameras over a network.

(B) FIG. 7 is a diagram showing an example of the hardware configuration of a general computer device 70. The terminal 1 and the server 2 that constitute the learning device 10 and the estimation device 20 in the point cloud processing system 100 can each be realized as one or more computer devices 70 having this configuration. When the learning device 10 or the estimation device 20 is realized by two or more computer devices 70, the information required for processing may be exchanged over a network. The computer device 70 comprises: a CPU (central processing unit) 71 that executes predetermined instructions; a GPU (graphics processing unit) 72 as a dedicated processor that executes some or all of the instructions of the CPU 71 in place of, or in cooperation with, the CPU 71; a RAM 73 as a main storage device that provides a work area for the CPU 71 (and the GPU 72); a ROM 74 as an auxiliary storage device; a communication interface 75; a display 76; an input interface 77 that accepts user input via a mouse, keyboard, touch panel, or the like; a camera 78 that constitutes the image acquisition unit 11 and the image capturing unit 21 as hardware; and a bus BS for exchanging data among these components. As noted above, the hardware constituting the image acquisition unit 11 may instead be the communication interface 75.

Each functional unit of the learning device 10 and the estimation device 20 can be realized by the CPU 71 and/or the GPU 72 reading from the ROM 74, and executing, a predetermined program corresponding to the function of that unit. The CPU 71 and the GPU 72 are both types of arithmetic unit (processor). When display-related processing is performed, the display 76 additionally operates in conjunction, and when communication-related processing for data transmission and reception is performed, the communication interface 75 additionally operates in conjunction. Processing results and the like of the point cloud processing system 100 may be displayed on the display 76 as output.

100 ... point cloud processing system, 1 ... terminal, 2 ... server
10 ... learning device, 11 ... image acquisition unit, 12 ... point cloud construction unit, 13 ... determination unit, 14 ... learning unit, 15 ... model DB, 131 ... voxel division unit, 132 ... density evaluation unit, 133 ... field-of-view evaluation unit, 134 ... candidate determination unit
20 ... estimation device, 21 ... image capturing unit, 22 ... estimation unit

Claims

A learning device comprising:
an image acquisition unit that acquires a plurality of images of a field;
a point cloud construction unit that constructs a point cloud of the field by detecting feature points in each of the images and obtaining correspondences between feature points across the images;
a determination unit that sets, as a landmark candidate region, a region of the point cloud in which the point density is determined to be high and the range of shooting directions of the images corresponding to the points is determined to be wide; and
a learning unit that assigns an identifier to each landmark candidate region, performs training with training data in which each landmark candidate region to which an identifier has been assigned is annotated in the images as the object region designated by that identifier, and obtains, from the result of the training, an identification model for the landmark candidate regions whose object regions are determined to be identified with high performance.

The learning device according to claim 1, wherein the determination unit divides the point cloud into predetermined voxels and sets the landmark candidate region on a per-voxel basis.

The learning device according to claim 2, wherein the learning unit assigns a landmark candidate region identifier to each connected component of the voxels set as landmark candidate regions.

The learning device according to any one of claims 1 to 3, wherein the learning unit determines the object region designated by an identifier by perspectively projecting the landmark candidate region to which that identifier has been assigned from three-dimensional world coordinates into the coordinate system of the image.

A learning method comprising:
an image acquisition step of acquiring a plurality of images of a field;
a point cloud construction step of constructing a point cloud of the field by detecting feature points in each of the images and obtaining correspondences between feature points across the images;
a determination step of setting, as a landmark candidate region, a region of the point cloud in which the point density is determined to be high and the range of shooting directions of the images corresponding to the points is determined to be wide; and
a learning step of assigning an identifier to each landmark candidate region, performing training with training data in which each landmark candidate region to which an identifier has been assigned is annotated in the images as the object region designated by that identifier, and obtaining, from the result of the training, an identification model for the landmark candidate regions whose object regions are determined to be identified with high performance.

A program that causes a computer to function as the learning device according to any one of claims 1 to 4.