JP7248345B2

JP7248345B2 - Image processing device, image processing method and program

Info

Publication number: JP7248345B2
Application number: JP2021505596A
Authority: JP
Inventors: 永記石寺
Original assignee: NEC Solution Innovators Ltd
Current assignee: NEC Solution Innovators Ltd
Priority date: 2019-03-11
Filing date: 2020-02-06
Publication date: 2023-03-29
Anticipated expiration: 2040-02-06
Also published as: JPWO2020184006A1; WO2020184006A1

Description

The present disclosure relates to an image processing device, an image processing method, and a program.

As one type of monitoring device for monitoring a predetermined area, a device is known that performs image processing to detect and track a moving object, such as a person, in images (including moving images) captured by a camera (see, for example, Patent Literature 1).

Patent Literature 1 discloses detecting, based on the positions of persons detected in past images, a region where a plurality of persons overlap in the current image to be processed, and determining the foremost person.

Patent Literature 1: JP 2017-027197 A

When a plurality of persons overlap in an image, the person region of each overlapping person must be identified accurately. Patent Literature 1 discloses estimating the positions of persons, but does not disclose identifying the person regions of overlapping persons.

An object of the present disclosure is to solve this problem by providing an image processing device, an image processing method, and a program capable of accurately identifying the person region of each person included in an image.

The image processing device according to the present disclosure includes:
an input unit that receives a first image captured by an imaging device; and
a generation unit that, based on a trained learning model, extracts, from an estimated region of the first image that is estimated to include a person, a first region estimated to be at an equal distance from the imaging device, and generates a second image including the first region.

The image processing method according to the present disclosure includes:
receiving a first image captured by an imaging device; and
based on a trained learning model, extracting, from an estimated region of the first image that is estimated to include a person, a first region estimated to be at an equal distance from the imaging device, and generating a second image including the first region.

The program according to the present disclosure causes a computer to execute:
receiving a first image captured by an imaging device; and
based on a trained learning model, extracting, from an estimated region of the first image that is estimated to include a person, a first region estimated to be at an equal distance from the imaging device, and generating a second image including the first region.

According to the present disclosure, it is possible to provide an image processing device, an image processing method, and a program capable of accurately identifying the person region of each person included in an image.

FIG. 1 is a diagram illustrating a configuration example of an image processing device according to a first embodiment.
FIG. 2 is a diagram illustrating a configuration example of an image processing device according to a second embodiment.
FIG. 3 is a diagram showing an example of an input image.
FIG. 4 is a diagram showing an example of an image containing equidistant regions.
FIG. 5 is a diagram showing an example of a composite image.
FIG. 6 is a diagram for explaining region patterns.
FIGS. 7 to 14 are diagrams for explaining the generation process.
FIGS. 15 and 16 are diagrams for explaining the determination process.
FIG. 17 is a diagram illustrating a configuration example of a learning device according to the second embodiment.
FIG. 18 is a diagram for explaining an operation example of the image processing device according to the second embodiment.
FIG. 19 is a diagram showing an operation example of the learning device according to the second embodiment.
FIG. 20 is a block diagram illustrating the hardware configuration of a computer (information processing device) capable of realizing the image processing device and the like according to each embodiment of the present disclosure.

(Embodiment 1)
Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a diagram illustrating a configuration example of an image processing device according to the first embodiment. The image processing device 1 may be, for example, a server device, a personal computer, or the like.

The image processing device 1 includes an input unit 2 and a generation unit 3.
The input unit 2 receives a first image captured by an imaging device. The imaging device may be, for example, a surveillance camera, a fixed-point camera, a digital camera, or the like.

Based on a trained learning model, the generation unit 3 extracts, from an estimated region of the first image that is estimated to include a person, a first region estimated to be at an equal distance from the imaging device, and generates a second image including the first region.

When the first image includes a region where a plurality of persons overlap, the person regions of the overlapping persons are presumed to lie at different distances from the imaging device. By extracting, from the estimated region presumed to contain a person, first regions that are each at an equal distance from the imaging device, the generation unit 3 can identify each extracted first region as the person region of one person in the first image. The image processing device 1 can therefore accurately identify the person region of each person included in an image.

(Embodiment 2)
Next, Embodiment 2 will be described. Embodiment 2 describes Embodiment 1 in more detail.

<Configuration example of the image processing device>
An image processing device 10 according to the second embodiment will be described with reference to FIG. 2. FIG. 2 is a diagram illustrating a configuration example of the image processing device according to the second embodiment. The image processing device 10 includes an input unit 11, a data storage unit 12, a generation unit 13, a model storage unit 14, and a determination unit 15.

The input unit 11 receives an image captured by the imaging device and stores it in the data storage unit 12. The input unit 11 may receive an image stored on a recording medium. Alternatively, the input unit 11 may receive an image captured by the imaging device from an external personal computer, server device, or the like connected to the image processing device 10.

The image input to the input unit 11 may be, for example, an image captured by an imaging device such as a surveillance camera, a fixed-point camera, or a digital camera. In the following description, it is assumed that the input unit 11 receives images captured by a surveillance camera from a server device connected to that camera. An image input to the input unit 11 is hereinafter referred to as an input image.

An example of an input image will now be described with reference to FIG. 3, which shows such an image. The input image is an image captured by a surveillance camera. As shown in FIG. 3, it contains a plurality of persons, including areas in which persons overlap one another.

Returning to FIG. 2, the data storage unit 12 will be described.
The data storage unit 12 stores input images. It also stores a background image corresponding to the images input to the input unit 11. The background image may also be input via the input unit 11, which then stores it in the data storage unit 12. The data storage unit 12 further stores the images generated by the generation unit 13.

Based on a trained learning model stored in the model storage unit 14 (described later), the generation unit 13 extracts equidistant regions, i.e., regions estimated to be at the same distance from the surveillance camera, from the estimated region of the input image that is presumed to contain a person. The generation unit 13 generates an image containing the equidistant regions and stores it in the data storage unit 12.

The generation unit 13 acquires the background image and the input image from the data storage unit 12. Using these two images, it estimates the estimated region, i.e., the region of the input image presumed to contain a person, by, for example, background subtraction.
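A minimal sketch of the background-subtraction step, assuming grayscale images stored as nested lists and a hypothetical fixed threshold `threshold` on the absolute pixel difference. The source names background subtraction only as an example and gives no parameters; practical systems usually add noise filtering and an adaptive background model.

```python
def estimate_person_region(image, background, threshold=30):
    """Return a binary mask (1 = pixel differs from the background).

    image, background: 2-D lists of grayscale intensities with the same shape.
    threshold: hypothetical absolute-difference cutoff (not specified in the source).
    """
    if len(image) != len(background) or any(
        len(r1) != len(r2) for r1, r2 in zip(image, background)
    ):
        raise ValueError("image and background must have the same shape")
    # A pixel belongs to the estimated region when it deviates enough from the background.
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(row_img, row_bg)]
        for row_img, row_bg in zip(image, background)
    ]
```

For example, with a uniform background of intensity 10, a pixel of 200 is marked as foreground while a pixel of 15 is not.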

The generation unit 13 acquires the trained learning model from the model storage unit 14 and, by feeding the input image to the model, extracts from the estimated region the equidistant regions, i.e., regions estimated to be at the same distance from the surveillance camera. It then generates an image containing the extracted equidistant regions and stores it in the data storage unit 12.

An example of an image containing equidistant regions generated by the generation unit 13 will now be described with reference to FIG. 4, which shows such an image. In FIG. 4, the white areas are the equidistant regions extracted by the generation unit 13. Region U1 is a part of this image, and the right half of region U1 corresponds to an area where persons overlap in the input image.

Region U1 contains 11 persons. Within region U1, adjacent equidistant regions are separated by black lines (black areas). The image processing device 10 can therefore determine that region U1 contains 11 persons by extracting, from the white equidistant regions, only the connected regions whose area is equal to or larger than a predetermined threshold. The same applies to areas outside region U1, so the image processing device 10 can accurately identify the person region of each person in the image.
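The counting step above can be sketched as connected-component labeling followed by an area filter. This assumes the equidistant image is a binary mask (1 = white equidistant pixel) and uses 4-connectivity; both choices, and the parameter `min_area` standing in for the "predetermined threshold", are assumptions rather than details from the source.

```python
from collections import deque


def count_persons(equidistant_mask, min_area):
    """Count 4-connected foreground components having at least min_area pixels.

    Returns (count, components); each component is a list of (y, x) pixels.
    """
    h = len(equidistant_mask)
    w = len(equidistant_mask[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if not equidistant_mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting at (y, x).
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and equidistant_mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Small fragments below the area threshold are discarded as noise.
            if len(pixels) >= min_area:
                components.append(pixels)
    return len(components), components
```

Because the black separating lines break adjacency, two overlapping persons end up in different components even when their silhouettes touch in the input image.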

Returning to FIG. 2, the description of the generation unit 13 continues.
Based on the acquired learning model, the generation unit 13 extracts a front region and a back region from the estimated region presumed to contain a person. The front region is the area closer to the imaging device than a boundary line across which the distance from the surveillance camera changes, and the back region is the area farther from the imaging device than that boundary line. The generation unit 13 feeds the input image to the learning model and extracts the front and back regions from the estimated region.

The generation unit 13 also extracts, from the estimated region, the boundary lines across which the distance from the surveillance camera changes. Extracting these boundary lines is optional.

Specifically, based on the learning model, the generation unit 13 extracts the boundary lines, front regions, and back regions from the peripheral regions of the equidistant regions in the input image. A peripheral region is one of the black lines (black areas) in region U1 of FIG. 4 that separate the equidistant regions; in other words, it is the black line lying between two adjacent equidistant regions.
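One way to make "the black line lying between two adjacent equidistant regions" concrete is sketched below. It is a hypothetical operationalization, not the source's definition: a non-equidistant pixel counts as peripheral when its (2·`radius`+1)² neighborhood touches two or more distinct equidistant regions. The helper `label_components` and the parameter `radius` are illustrative names.

```python
from collections import deque


def label_components(mask):
    """Assign 4-connected labels 1..n to foreground pixels; background stays 0."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    n = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not labels[y][x]:
                n += 1
                labels[y][x] = n
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = n
                            queue.append((ny, nx))
    return labels, n


def peripheral_region(equidistant_mask, radius=1):
    """Mark non-equidistant pixels whose window touches >= 2 distinct equidistant regions."""
    labels, _ = label_components(equidistant_mask)
    h, w = len(labels), len(labels[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                continue  # pixels inside an equidistant region are never peripheral
            nearby = {
                labels[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
                if labels[ny][nx]
            }
            if len(nearby) >= 2:
                out[y][x] = 1
    return out
```

The `radius` controls how wide a gap between two regions is still treated as a separating line; a wider gap needs a larger radius.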

After extracting the boundary lines, front regions, and back regions, the generation unit 13 combines them with the image containing the equidistant regions stored in the data storage unit 12 to generate a composite image, which it stores in the data storage unit 12. The learning model, and the generation process by which the generation unit 13 extracts each region and generates an image containing it, are described later. In the following description, to distinguish the two images, the image containing the equidistant regions is called the equidistant image, and the image combining the equidistant regions, boundary lines, front regions, and back regions is called the composite image.

An example of the composite image generated by the generation unit 13 will now be described with reference to FIG. 5, which shows such an image, using region U2. Region U2 is part of region U1 in FIG. 4 and is an area where two persons overlap.

Region U2 shows a dash-dotted line L1, a dotted line L2, and a solid line L3. The dash-dotted line L1 marks the edge of an equidistant region generated by the generation unit 13; the area inside L1 (on the side away from L2) is the equidistant region. The dotted line L2 is the boundary line extracted by the generation unit 13, and the area between L1 and L2 is the front region extracted by the generation unit 13. The solid line L3 marks the edge of the back region, and the area between L2 and L3 is the back region.

Put another way, the regions in U2 are distinguished by shading: the white area (between L2 and L3) is the back region, the black area (between L1 and L2) is the front region, and the gray area (on the side of L1 away from L2) is the equidistant region. In this way, the generation unit 13 extracts the boundary lines, front regions, and back regions from the peripheral regions of the equidistant regions and combines them with the equidistant image to generate the composite image.

Returning to FIG. 2, the model storage unit 14 will be described.
The model storage unit 14 stores the trained learning model used by the generation unit 13. This model is trained by a learning device 20, described later. For each predetermined pixel block in the estimated region, the learning model outputs the matching region pattern from among a plurality of region patterns. The predetermined pixel block is, for example, a block of pixels cut out as a 15×15 patch image. This size is only an example; any block size from 3×3 to 150×150 may be chosen.
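A minimal sketch of cutting the per-pixel blocks, assuming each block is a square patch centered on one pixel of the estimated region and that pixels beyond the image border are filled by replicating the nearest edge pixel. The centering and the padding rule are assumptions; the source specifies only the patch size.

```python
def extract_patches(image, region_mask, size=15):
    """Return [((y, x), patch), ...] for every pixel (y, x) inside region_mask.

    Each patch is a size x size nested list centered on (y, x). Coordinates outside
    the image are clamped to the nearest edge (replicate padding).
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the patch has a center pixel")
    h, w = len(image), len(image[0])
    r = size // 2
    patches = []
    for y in range(h):
        for x in range(w):
            if not region_mask[y][x]:
                continue  # only pixels in the estimated region are classified
            patch = [
                [image[min(max(yy, 0), h - 1)][min(max(xx, 0), w - 1)]
                 for xx in range(x - r, x + r + 1)]
                for yy in range(y - r, y + r + 1)
            ]
            patches.append(((y, x), patch))
    return patches
```

Returning a list rather than a generator makes the odd-size check fire immediately when the function is called.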

The learning model is, for example, a convolutional neural network (CNN) comprising an input layer, multiple hidden layers, and an output layer. When the input image is an RGB (red, green, blue) image, for example, the input layer size can be the patch size multiplied by the number of channels (R, G, and B). The learning model may have three hidden layers and use the ReLU function as its activation function. The learning model is trained by deep learning in the learning device 20, which learns the parameters, including the weights and thresholds applied in each layer. The learning model may instead use another algorithm, and since the input layer, hidden layers, and activation function above are only examples, a differently configured model may also be used.
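For a 15×15 RGB patch, the input-layer size described above is 15 × 15 × 3 = 675 values. The following untrained sketch shows a classifier with that input, three ReLU hidden layers, and nine outputs (one per region pattern). To stay dependency-free it uses fully connected layers rather than convolutions, so it is a stand-in for, not an implementation of, the CNN named in the text. The hidden width of 64, the He-style initialization, and all function names are assumptions.

```python
import math
import random

PATCH = 15                              # patch side length (example value from the text)
CHANNELS = 3                            # R, G, B
N_PATTERNS = 9                          # region patterns 1-9
INPUT_SIZE = PATCH * PATCH * CHANNELS   # 675 input units


def make_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in) and zero biases for one dense layer."""
    scale = math.sqrt(2.0 / n_in)       # He-style initialization, suited to ReLU
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(hidden=(64, 64, 64), seed=0):
    """Three hidden layers by default, as in the example configuration."""
    rng = random.Random(seed)
    sizes = (INPUT_SIZE, *hidden, N_PATTERNS)
    return [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(model, x):
    """Return softmax probabilities over the nine region patterns."""
    for i, (w, b) in enumerate(model):
        z = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        # ReLU on hidden layers; the output layer stays linear before softmax.
        x = z if i == len(model) - 1 else [max(0.0, v) for v in z]
    m = max(x)
    exps = [math.exp(v - m) for v in x]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_pattern(model, patch_rgb):
    """patch_rgb: PATCH x PATCH list of (r, g, b) in [0, 1]. Returns a pattern number 1-9."""
    flat = [c for row in patch_rgb for px in row for c in px]
    probs = forward(model, flat)
    return probs.index(max(probs)) + 1
```

In a trained system the weights would come from the learning device 20; here they are random, so the predicted pattern is meaningless and only the shapes and data flow are illustrated.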

The determination unit 15 determines the front-to-back order of the persons in the input image based on at least one of the equidistant regions, front regions, and back regions in the composite image.

The determination unit 15 may determine the front-to-back order of the persons in the input image based on the equidistant regions in the composite image. Alternatively, the determination unit 15 may identify the equidistant regions in the composite image as the person regions of the persons in the input image and, for each identified person region, determine its front-to-back relation to each adjacent person region based on at least one of the front and back regions; it may then determine the front-to-back order of the persons in the input image from these pairwise relations. As a further alternative, the determination unit 15 may combine two or more of the equidistant regions, front regions, and back regions in the composite image to determine the front-to-back order of the persons in the input image.

In the present embodiment, the determination unit 15 uses the equidistant regions, front regions, and back regions in the composite image to determine the front-to-back order of the persons in the input image. This determination process is described later.

<Learning model>
Next, the learning model stored in the model storage unit 14 will be described. As described above, for each predetermined pixel block in the estimated region, the learning model outputs the matching region pattern from among a plurality of predefined region patterns.

The region patterns will now be described with reference to FIG. 6, which is a diagram for explaining them. The numbers in FIG. 6 are the region pattern numbers, and the figure below each number is a conceptual diagram of that pattern.

The solid line in each conceptual diagram is a boundary line across which the distance from the imaging device changes. F (diagonally hatched) denotes the front region, and B (vertically hatched) denotes the back region. Region patterns 1 to 8 differ in the arrangement of the boundary line, front region, and back region, and each has a different depth-gradient direction, i.e., the direction in which the distance (depth) from the imaging device increases. In other words, patterns 1 to 8 differ in the gradient direction from the front region toward the back region.

As shown in FIG. 6, region patterns 1 to 8 correspond to eight different depth-gradient directions. Region patterns may also be provided for 16 directions, adding the directions midway between these eight.
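Choosing between eight and sixteen pattern directions amounts to quantizing a depth-gradient angle into equal angular bins. A minimal sketch follows, assuming image coordinates with y pointing down and bin 0 centered on the +x axis. The bin indices are not the patent's pattern numbers: from the descriptions of FIGS. 7 and 8, pattern 1 (back region above) would fall in the "up" bin (index 2 of 8) and pattern 2 (back region upper-left) in the "up-left" bin (index 3), while the numbering of patterns 3 to 8 is not stated.

```python
import math


def quantize_direction(dx, dy, n_bins=8):
    """Map a depth-gradient vector (dx, dy) to a bin index 0..n_bins-1.

    Image y points down, so (0, -1) means "depth increases toward the top".
    Angles are measured counter-clockwise on screen from the +x axis, and
    bin 0 is centered on that axis.
    """
    if dx == 0 and dy == 0:
        raise ValueError("a zero gradient has no direction")
    angle = math.atan2(-dy, dx) % (2 * math.pi)  # flip y to get on-screen CCW angles
    width = 2 * math.pi / n_bins
    # Shift by half a bin so each bin is centered on its nominal direction.
    return int((angle + width / 2) // width) % n_bins
```

With `n_bins=16` the same function yields the finer direction set mentioned in the text.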

Region pattern 9 indicates an area equidistant from the imaging device and is used to extract equidistant regions.

In this way, a plurality of region patterns (patterns 1 to 9) are defined, and for each predetermined pixel block in the estimated region presumed to contain a person, the learning model outputs the matching region pattern.

<Generation process>
Next, the region extraction and image generation performed by the generation unit 13 will be described with reference to FIGS. 4, 5, and 7 to 14. FIGS. 7 to 14 are diagrams for explaining the generation process.

Using the learning model stored in the model storage unit 14, the generation unit 13 shifts the pixel being processed across the estimated region of the input image and, for each predetermined pixel block, outputs the matching region pattern. It then applies the output region pattern to that pixel block to extract an equidistant region, or a boundary line, front region, and back region.

The generation unit 13 applies region pattern 9 to the pixel blocks for which the learning model outputs pattern 9, thereby extracting the equidistant regions and generating the image shown in FIG. 4.

The generation unit 13 applies region pattern 1 to the pixel blocks for which the learning model outputs pattern 1, extracts the boundary lines, front regions, and back regions, and generates the image shown in FIG. 7. In FIG. 7, the boundary line runs horizontally across the composite image, with the front region below it and the back region above it. That is, the generation unit 13 generates an image in which the regions matching pattern 1 are extracted.

The generation unit 13 applies region pattern 2 to the pixel blocks for which the learning model outputs pattern 2, extracts the boundary lines, front regions, and back regions, and generates the image shown in FIG. 8. In FIG. 8, the boundary line runs diagonally from the lower left to the upper right of the composite image, with the front region on its lower-right side and the back region on its upper-left side. That is, the generation unit 13 generates an image in which the regions matching pattern 2 are extracted.

Likewise, the generation unit 13 applies region patterns 3 to 8 to the pixel blocks for which the learning model outputs those patterns, extracts the boundary lines, front regions, and back regions, and generates the images shown in FIGS. 9 to 14, which correspond to patterns 3 to 8, respectively.

The generation unit 13 combines the images generated for the individual region patterns into the composite image shown in FIG. 5. In this way, the generation unit 13 extracts the equidistant regions, boundary lines, front regions, and back regions from the pixel blocks according to the region patterns output by the learning model, and combines them into a composite image.
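The per-pattern extraction and compositing described above can be sketched as follows, under several assumptions not stated in the source. Each per-pixel pattern output votes for labels in a small window around its pixel. Pattern 9 votes "equidistant"; patterns 1 to 8 split the window into front, boundary, and back by which side of a line through the center (perpendicular to the depth gradient) each pixel lies on. Overlapping votes are resolved by majority, with an arbitrary fixed tie-break. The `GRADIENT` table is hypothetical: only patterns 1 (depth increasing upward, FIG. 7) and 2 (toward the upper left, FIG. 8) follow from the text, and patterns 3 to 8 are assumed to continue counter-clockwise.

```python
from collections import Counter, defaultdict

EQUIDISTANT, FRONT, BACK, BOUNDARY = "E", "F", "B", "L"
PRIORITY = {BOUNDARY: 3, FRONT: 2, BACK: 1, EQUIDISTANT: 0}  # arbitrary tie-break order

# Hypothetical depth-gradient directions (gx, gy) for patterns 1-8; image y points down.
GRADIENT = {1: (0, -1), 2: (-1, -1), 3: (-1, 0), 4: (-1, 1),
            5: (0, 1), 6: (1, 1), 7: (1, 0), 8: (1, -1)}


def compose(pattern_map, height, width, radius=1):
    """Build a composite label map from per-pixel pattern outputs.

    pattern_map: dict {(y, x): pattern number 1-9} for pixels in the estimated region.
    Returns a height x width nested list of labels ("E", "F", "B", "L"), or None
    where no pattern voted.
    """
    votes = defaultdict(Counter)
    for (y, x), pattern in pattern_map.items():
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                py, px = y + dy, x + dx
                if not (0 <= py < height and 0 <= px < width):
                    continue
                if pattern == 9:
                    votes[(py, px)][EQUIDISTANT] += 1
                else:
                    gx, gy = GRADIENT[pattern]
                    s = dx * gx + dy * gy  # > 0: farther side of the boundary line
                    label = BACK if s > 0 else FRONT if s < 0 else BOUNDARY
                    votes[(py, px)][label] += 1
    out = [[None] * width for _ in range(height)]
    for (py, px), counter in votes.items():
        out[py][px] = max(counter, key=lambda k: (counter[k], PRIORITY[k]))
    return out
```

For a single pattern-1 output at the center of a 3×3 image, the top row becomes back, the middle row boundary, and the bottom row front, matching the layout described for FIG. 7.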

<Determination process>
Next, the determination process by which the determination unit 15 determines the front-to-back order of persons will be described with reference to FIGS. 15 and 16, which are diagrams for explaining this process.

First, an overview of the determination process will be given.
The determination unit 15 identifies person regions based on the equidistant regions in the composite image, and determines the front-to-back order of the persons in the input image based on the bottom line of each identified person region.

For each identified person area, the determination unit 15 determines the front-to-back relationship between that person area and each adjacent person area. This determination is based on the front area and the back area lying between the two. Of the two person areas, the person in the one closer to the front area is determined to be in front of the person in the other area. The person in the one closer to the back area is determined to be behind the person in the other area.

The determination unit 15 then determines the front-to-back relationship of the persons included in the input image using two results: the ordering based on the bottom line of each person area, and the pairwise relationship between each person area and its adjacent person areas.

Next, the determination process that uses the bottom line and the pixel count of each person area will be described with reference to FIG. 15. FIG. 15 is a schematic diagram of the composite image of FIG. 5 and shows the area corresponding to area U1 in FIG. 4. Areas enclosed by solid lines are equidistant areas. Dotted lines are boundary lines. Areas hatched with oblique lines are front areas, and areas hatched with vertical lines are back areas. Of the boundary lines, front areas, and back areas extracted by the generation unit 13, FIG. 15 shows only those in portions where equidistant areas adjoin each other.

The determination unit 15 sets coordinates on the composite image. For example, it places the origin at the lower-left corner of the composite image, with rightward as the positive X-axis direction and upward as the positive Y-axis direction. The determination unit 15 identifies each region delimited by an equidistant area as a person area. As shown in FIG. 15, it identifies person areas P1 to P11.

The determination unit 15 determines the bottom line of each of the person areas P1 to P11. A person area whose bottom line has a small Y coordinate is considered to belong to a person located close to the imaging device. The determination unit 15 therefore ranks the person areas in ascending order of bottom-line Y coordinate, treating a smaller Y coordinate as closer to the imaging device. From this ranking, it determines the front-to-back relationship of the persons included in the input image.

For example, assume that the bottom lines of person areas P1, P2, P3, P4, and P5 have Y coordinates Y1, Y2, Y3, Y4, and Y5, respectively, and that Y1 < Y2 < Y3 < Y4 < Y5. In this case, the determination unit 15 ranks P1, P2, P3, P4, and P5, in that order, from closest to farthest from the imaging device. It performs the same determination for person areas P6 to P11.
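A minimal sketch of the bottom-line ordering follows. The source does not specify how person areas are delimited, so this sketch assumes they are the 4-connected components of a binary mask of equidistant pixels. The mask is stored with row 0 at the top, and pixels are converted to the lower-left-origin, Y-up convention described above. The bottom line of an area is taken to be the Y coordinate of its lowest pixel. The function names are hypothetical.

```python
from collections import deque


def label_regions(mask):
    """4-connected components of truthy pixels: {region_id: [(x, y), ...]}.

    Coordinates use the convention above: origin at the lower left, Y up.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions, next_id = {}, 1
    for r0 in range(h):
        for c0 in range(w):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            pixels, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                pixels.append((c, h - 1 - r))  # (row, col) -> (x, y) with Y pointing up
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions[next_id] = pixels
            next_id += 1
    return regions


def order_by_bottom_line(regions):
    """Region IDs nearest-first: a smaller bottom-line Y means closer to the camera."""
    bottom = {rid: min(y for _, y in px) for rid, px in regions.items()}
    return sorted(regions, key=lambda rid: bottom[rid])
```

As the next paragraph explains, this ordering is only provisional for areas whose bottom line touches another person area.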

The bottom lines of person areas P6 to P11 adjoin other person areas, so each of these areas can be judged to overlap another person area. The bottom line alone may therefore not yield a correct front-to-back relationship for P6 to P11, and the determination unit 15 uses the bottom-line ordering only provisionally for them. For these areas, it then applies the result of the pairwise determination process described below to determine the front-to-back relationship of the persons included in the input image.

Alternatively, the determination unit 15 may judge which person areas belong to persons close to the imaging device based on the number of pixels in each person area. It may then determine the front-to-back relationship of the persons in the input image from that result.

A person area containing many pixels is considered to belong to a person located close to the imaging device. The determination unit 15 may therefore count the pixels in each person area and rank the areas in descending order of pixel count, treating a larger count as closer to the imaging device.

The determination unit 15 may also weight the bottom-line Y coordinate and the pixel count of each person area. It may then rank the person areas by proximity to the imaging device based on the weighted values.
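The source does not say how the two cues are weighted or normalized. The sketch below is one possible scheme, offered purely as an assumption. It normalizes the bottom-line Y by the image height and the pixel count by the largest count, then combines them linearly. The function name and default weights are hypothetical. Setting `w_y=0` reduces it to the pure pixel-count ranking, and setting `w_n=0` reduces it to the pure bottom-line ranking.

```python
def order_by_weighted_score(bottoms, counts, image_height, w_y=0.5, w_n=0.5):
    """Rank region IDs nearest-first from bottom-line Y and pixel count.

    bottoms, counts: {region_id: value}. The weights and the normalization
    are illustrative assumptions, not values taken from the source.
    """
    max_count = max(counts.values())

    def score(rid):
        # A lower bottom line means nearer; more pixels means nearer.
        # A lower score therefore means a nearer region.
        return w_y * bottoms[rid] / image_height - w_n * counts[rid] / max_count

    return sorted(bottoms, key=score)
```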

For a person area whose bottom line adjoins another person area, the determination unit 15 may instead judge its proximity to the imaging device from the top line of the person area.

A person area whose bottom line has a large Y coordinate is considered to belong to a person located far from the imaging device. For person areas whose bottom lines adjoin other person areas, the determination unit 15 may therefore rank the areas in descending order of top-line Y coordinate, treating a larger top-line Y coordinate as farther from the imaging device.

Next, FIG. 16 will be described. Like FIG. 15, FIG. 16 is a schematic diagram of the composite image of FIG. 5 and shows the area corresponding to area U1 in FIG. 4. Areas enclosed by solid lines are equidistant areas. Dotted lines are boundary lines. Areas hatched with oblique lines are front areas, and areas hatched with vertical lines are back areas. As in FIG. 15, only the boundary lines, front areas, and back areas in portions where equidistant areas adjoin each other are shown.

For example, the determination unit 15 treats a person area that lies within a predetermined distance of a given person area in the composite image as adjacent to it. The determination unit 15 identifies person areas P3 to P11 as adjoining other person areas. For each of P3 to P11, it determines the front-to-back relationship with each adjacent person area based on the front area and the back area. It then determines the front-to-back relationship of the persons included in the input image from these pairwise relationships.
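The adjacency test can be sketched as a minimum pixel-to-pixel distance compared against a threshold. The source says only "less than a predetermined distance", so the Euclidean metric, the brute-force search, and the names `min_distance`, `adjacent_pairs`, and `max_gap` are assumptions. Areas are given as lists of `(x, y)` pixels.

```python
import math


def min_distance(a, b):
    """Smallest Euclidean distance between any pixel of area a and any pixel of b."""
    return min(math.dist(p, q) for p in a for q in b)


def adjacent_pairs(regions, max_gap):
    """Pairs (i, j), i < j, of person areas closer than max_gap (hypothetical threshold)."""
    ids = sorted(regions)
    return [
        (i, j)
        for k, i in enumerate(ids)
        for j in ids[k + 1:]
        if min_distance(regions[i], regions[j]) < max_gap
    ]
```

The brute-force search is quadratic in the number of pixels. A practical version would probably use a distance transform or a spatial index instead.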

For example, person areas P3 and P5 are adjacent. To determine the front-to-back relationship between P3 and P5 with P3 as the reference, attention is paid to area U3, which lies between P3 and P5.

Area U3 contains a front area and a back area. The front area in U3 adjoins person area P3. In other words, P3 is closer than P5 to the front area in U3.

The back area in U3, on the other hand, adjoins person area P5. In other words, P5 is closer than P3 to the back area in U3. The determination unit 15 therefore determines that the person in P3, which adjoins and is closer to the front area, is in front of the person in P5. It also determines that the person in P5, which adjoins and is closer to the back area, is behind the person in P3.
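The pairwise rule for areas such as P3 and P5 can be sketched as below. The inputs are the pixel lists of the two person areas and of the front and back areas lying between them. Either zone may be omitted, which mirrors the modification in which only one of them is extracted. How to reconcile the front-area cue and the back-area cue when they disagree is not stated in the source. This sketch assumes a simple vote and returns `None` (undecided) on a tie. The function names are hypothetical.

```python
import math


def _gap(area, zone):
    """Smallest distance between a person area and a front/back zone (pixel lists)."""
    return min(math.dist(p, q) for p in area for q in zone)


def front_of_pair(a, b, front_zone=None, back_zone=None):
    """Return "a" or "b" for the person area judged to be in front, or None.

    front_zone / back_zone: pixels of the front / back area lying between a and b.
    """
    votes = 0  # positive favours a in front, negative favours b
    if front_zone:
        da, db = _gap(a, front_zone), _gap(b, front_zone)
        votes += (da < db) - (db < da)  # closer to the front area -> in front
    if back_zone:
        da, db = _gap(a, back_zone), _gap(b, back_zone)
        votes += (db < da) - (da < db)  # closer to the back area -> behind
    if votes > 0:
        return "a"
    if votes < 0:
        return "b"
    return None
```

For P3 at x = 0, P5 at x = 4, a front area at x = 1, and a back area at x = 3, both cues agree and the function returns `"a"` (P3 in front).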

Similarly, consider P3 and P8 with P3 as the reference. The front area between P3 and P8 adjoins and is closer to P3, and the back area between them adjoins and is closer to P8. The determination unit 15 therefore determines that the person in P3 is in front of the person in P8, and that the person in P8 is behind the person in P3. It determines the front-to-back relationship between each of the other person areas and its adjacent person areas in the same way.

The determination unit 15 determines the front-to-back relationship of person areas P1 to P11 using both the ordering derived from their bottom lines and the pairwise relationships between each person area and its adjacent person areas.

After determining the front-to-back relationship of P1 to P11, the determination unit 15 assigns label numbers that indicate the order of distance from the imaging device. Numbering starts from the person area at the bottom of the composite image, which is the person area of the person closest to the imaging device. The determination unit 15 then determines the front-to-back relationship of the persons included in the input image from the assigned label numbers.

If the front-to-back relationship of some of the person areas P1 to P11 cannot be determined accurately, the determination unit 15 may set a flag on each such person area indicating that its front-to-back relationship could not be determined correctly. Alternatively, the determination unit 15 may assign the same label number to those person areas to indicate the same thing.
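The source does not specify how the bottom-line ordering and the pairwise relationships are merged. The sketch below shows one plausible approach, which is an assumption and not the patented procedure. It runs a topological sort (Kahn's algorithm) over the "in front of" relations and uses the bottom-line Y coordinate to break ties. Labels are numbered from 1 for the nearest area. Areas that cannot be ordered because of contradictory relations receive one shared label and are returned as flagged, corresponding to the two options described above. The function name and data shapes are hypothetical.

```python
import heapq


def assign_depth_labels(bottoms, front_pairs):
    """Assign nearest-first label numbers to person areas.

    bottoms: {region_id: bottom-line Y}; front_pairs: [(front_id, back_id), ...].
    Returns (labels, flagged): {region_id: label} with label 1 = nearest, and
    the set of areas whose order could not be resolved.
    """
    succ = {r: set() for r in bottoms}
    indeg = {r: 0 for r in bottoms}
    for f, b in front_pairs:
        if b not in succ[f]:
            succ[f].add(b)
            indeg[b] += 1
    # Among areas with no unresolved "in front" constraint, take the lowest bottom line first.
    heap = [(bottoms[r], r) for r in bottoms if indeg[r] == 0]
    heapq.heapify(heap)
    labels, n = {}, 1
    while heap:
        _, r = heapq.heappop(heap)
        labels[r] = n
        n += 1
        for s in succ[r]:
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(heap, (bottoms[s], s))
    # Areas caught in, or ordered behind, contradictory relations share one label.
    flagged = set(bottoms) - set(labels)
    for r in flagged:
        labels[r] = n
    return labels, flagged
```

In this scheme, a pairwise relation overrides the bottom-line order, which matches the provisional role the text gives to bottom lines of overlapped areas.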

<Configuration example of learning device>
Next, a configuration example of the learning device 20 will be described with reference to FIG. 17. FIG. 17 is a diagram illustrating a configuration example of the learning device according to the second embodiment. The learning device 20 trains an untrained learning model to generate a trained learning model. The learning device 20 includes an input unit 21, a data storage unit 22, a model storage unit 23, and a learning unit 24.

The input unit 21 receives pairs of training images and teacher data as learning data, and stores the input images in the data storage unit 22.

The data storage unit 22 stores the learning data input to the input unit 21.
The model storage unit 23 stores at least one of an untrained learning model (including a learning model still being trained) and a trained learning model.

The learning unit 24 trains the untrained learning model using the learning data stored in the data storage unit 22. It uses deep learning to learn and update parameters, including the weights and thresholds applied to each layer. When training is complete, the learning unit 24 generates a trained learning model and stores it in the model storage unit 23. The learning unit 24 may update the untrained learning model with the trained learning model. An administrator, operator, or other user of the image processing device 10 stores the trained learning model generated by the learning unit 24 in the model storage unit 14 of the image processing device 10.

<Example of operation of image processing device>
Next, an operation example of the image processing device 10 will be described with reference to FIG. 18. FIG. 18 is a diagram for explaining an operation example of the image processing device according to the second embodiment.

First, the input unit 11 receives the image to be processed (input image) (step S1). The input unit 11 receives an image captured by a surveillance camera from a server device connected to that camera, and stores it in the data storage unit 12.

The generation unit 13 estimates the area of the input image that is presumed to contain a person, referred to as the estimated area (step S2). The generation unit 13 acquires the input image and a background image from the data storage unit 12. Using the two images, it estimates the estimated area by, for example, background subtraction.
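Background subtraction in its simplest form can be sketched as a per-pixel absolute difference against the background image, followed by a threshold. This is only an illustration of the basic technique named in step S2. The grayscale input, the threshold value, and the function name are assumptions. A deployed system would typically add noise filtering and background model updates.

```python
def estimate_person_region(frame, background, threshold=30):
    """Binary mask of pixels that differ from the background by more than threshold.

    frame, background: equally sized 2-D lists of grayscale intensities.
    The threshold of 30 is an illustrative assumption.
    """
    return [
        [abs(p - q) > threshold for p, q in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]
```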

The generation unit 13 extracts equidistant areas from the estimated area based on the trained learning model (step S3). The generation unit 13 acquires the trained learning model stored in the model storage unit 14. Based on that model, it extracts from the estimated area the equidistant areas, that is, areas presumed to be at the same distance from the surveillance camera. It then generates an equidistant image containing the extracted areas.

The generation unit 13 extracts boundary lines, front areas, and back areas from the estimated area based on the trained learning model (step S4). Based on the learning model, it extracts these from the areas surrounding the equidistant areas in the input image. It then combines the equidistant image stored in the data storage unit 12 with the boundary lines, front areas, and back areas to generate a composite image, and stores the composite image in the data storage unit 12.

Steps S3 and S4 may be executed simultaneously. For each predetermined pixel block in the estimated area, the learning model stored in the model storage unit 14 outputs the matching pattern among a plurality of area patterns. The generation unit 13 can therefore use the learning model to extract the equidistant areas, boundary lines, front areas, and back areas in a single pass. It may then generate the composite image directly, without first generating an equidistant image.
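The single-pass, per-block inference described above can be sketched as a loop over fixed-size tiles. Each tile that overlaps the estimated area is passed to a model, which returns one area-pattern ID. The `model` argument is a stand-in for the trained learning model, whose interface the source does not define. The block size, the use of 0 for "no pattern", and the dropping of incomplete edge tiles are all assumptions. The resulting grid of pattern IDs is what a composition step would consume.

```python
def classify_blocks(image, estimated_mask, block, model):
    """Return a grid of area-pattern IDs, one per block x block tile.

    model: any callable mapping a tile (2-D list) to a pattern ID; a stand-in
    for the trained learning model. Tiles that do not overlap the estimated
    area get pattern 0, and incomplete tiles at the right/bottom edges are skipped.
    """
    h, w = len(image), len(image[0])
    grid = []
    for top in range(0, h - block + 1, block):
        row = []
        for left in range(0, w - block + 1, block):
            tile = [r[left:left + block] for r in image[top:top + block]]
            inside = any(
                estimated_mask[r][c]
                for r in range(top, top + block)
                for c in range(left, left + block)
            )
            row.append(model(tile) if inside else 0)
        grid.append(row)
    return grid
```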

The determination unit 15 identifies the equidistant areas in the composite image as person areas (step S5). It then determines the front-to-back relationship of the person areas from the bottom line of each person area (step S6). To do so, it sets an XY coordinate system on the composite image, calculates the bottom line of each person area, and ranks the areas by the Y coordinates of their bottom lines.

The determination unit 15 determines, for each person area, the front-to-back relationship with its adjacent person areas (step S7). It identifies the person areas adjacent to each person area in the composite image. For each pair, it determines the relationship from the front area and the back area lying between them. Of the two person areas being compared, the person in the one adjoining the front area is determined to be in front of the person in the other area. The person in the one adjoining the back area is determined to be behind the person in the other area.

The determination unit 15 determines the front-to-back relationship of the persons included in the input image (step S8). It first determines the front-to-back relationship of the person areas based on the results of steps S6 and S7. It then assigns label numbers that indicate the order of distance from the imaging device, starting from the person area at the bottom of the composite image (the person area of the person closest to the imaging device). Finally, it determines the front-to-back relationship of the persons included in the input image from the assigned label numbers.

<Example of operation of learning device>
Next, an operation example of the learning device 20 will be described with reference to FIG. 19. FIG. 19 is a diagram illustrating an operation example of the learning device according to the second embodiment.

The input unit 21 receives learning data (step S11). The input unit 21 receives pairs of training images and teacher data as learning data, and stores the input images in the data storage unit 22.

The learning unit 24 generates a trained learning model (step S12). The learning unit 24 trains the untrained learning model using the learning data stored in the data storage unit 22. It uses deep learning to learn and update parameters, including the weights and thresholds applied to each layer. When training is complete, it generates a trained learning model and stores it in the model storage unit 23.

As described above, the generation unit 13 extracts equidistant areas, which are presumed to be at the same distance from the imaging device, from the area of the input image presumed to contain a person. When several persons overlap in the input image, the person areas of the overlapping persons are presumed to be at different distances from the imaging device. Extracting the equidistant areas therefore makes it possible to identify each equidistant area as the person area of an individual person in the input image. Accordingly, the image processing device 10 can accurately identify the person areas of persons included in an image.

In addition, because extracting equidistant areas identifies the person areas in the input image, the image processing device 10 can be used to determine the number of persons in the input image, their positions, and similar information.

In addition to the equidistant areas, the generation unit 13 extracts front areas and back areas from the estimated area. The determination unit 15 identifies the person areas in the composite image based on the equidistant areas. It determines the front-to-back relationship of the person areas from the equidistant areas, front areas, and back areas, and from that, the front-to-back relationship of the persons included in the input image. The image processing device 10 can therefore identify the front-to-back relationship of each person included in the input image.

Because the image processing device 10 can identify the front-to-back relationship of each person in the input image, it can, for example, receive images captured at successive times and locate a specific person in each of them. The image processing device 10 can therefore be used, for example, to track a specific person.

(Modification)
In the second embodiment, the generation unit 13 extracts both the front area and the back area, but it may extract only one of them. In that case, the determination unit 15 uses whichever area was extracted when determining the front-to-back relationship between each person area and its adjacent person areas.

If the generation unit 13 extracts only the front area, the determination unit 15 compares each person area with its adjacent person area. The person in whichever of the two is closer to the front area between them is determined to be in front of the person in the other area.

If the generation unit 13 extracts only the back area, the determination unit 15 compares each person area with its adjacent person area. The person in whichever of the two is closer to the back area between them is determined to be behind the person in the other area. This configuration also provides the same effects as the second embodiment.

(Other embodiments)
The image processing devices 1 and 10 and the learning device 20 described in the above embodiments (hereinafter, the image processing device 1 and the like) may have the following hardware configuration. FIG. 20 is a block diagram illustrating the hardware configuration of a computer (information processing device) that can implement the image processing device and the like according to each embodiment of the present disclosure.

Referring to FIG. 20, the image processing device 1 and the like include a processor 1201 and a memory 1202. The processor 1201 reads software (a computer program) from the memory 1202 and executes it to perform the processing of the image processing device 1 and the like described with reference to the flowcharts in the above embodiments. The processor 1201 may be, for example, a microprocessor, an MPU (Micro Processing Unit), or a CPU (Central Processing Unit), and may include a plurality of processors.

The memory 1202 is a combination of volatile memory and non-volatile memory. It may include storage located remotely from the processor 1201, in which case the processor 1201 may access the memory 1202 via an I/O interface (not shown).

In the example of FIG. 20, the memory 1202 stores a group of software modules. The processor 1201 reads these software modules from the memory 1202 and executes them, thereby performing the processing of the image processing device 1 and the like described in the above embodiments.

As described with reference to FIG. 20, each processor of the image processing device 1 and the like executes one or more programs containing instructions that cause a computer to execute the algorithms described with reference to the drawings.

In the above examples, the programs can be stored on various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples include:
- magnetic recording media (e.g., flexible disks, magnetic tapes, and hard disk drives);
- magneto-optical recording media (e.g., magneto-optical disks);
- CD-ROM (Read Only Memory), CD-R, and CD-R/W;
- semiconductor memories (e.g., mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (Random Access Memory)).

The programs may also be supplied to a computer on various types of transitory computer readable media, such as electrical signals, optical signals, and electromagnetic waves. Transitory computer readable media can supply the programs to a computer via a wired communication path, such as an electric wire or an optical fiber, or via a wireless communication path.

Although the present invention has been described with reference to the embodiments, the present invention is not limited to the above. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the invention. The present disclosure may also be implemented by combining the embodiments as appropriate.

Some or all of the above-described embodiments can also be described as in the following supplementary notes, but are not limited to the following.
(Appendix 1)
An image processing device comprising:
an input unit that inputs a first image captured by an imaging device; and
a generation unit that, based on a trained learning model, extracts, from an estimated region of the first image that is estimated to include a person, a first region estimated to be at an equal distance from the imaging device, and generates a second image including the first region.
(Appendix 2)
The image processing device according to Appendix 1, wherein
the generation unit, based on the learning model, extracts from the estimated region at least one of a second region that is closer to the imaging device than a boundary line across which the distance from the imaging device changes, and a third region that is farther from the imaging device than the boundary line, and generates the second image including at least one of the second region and the third region, and
the image processing device further comprises a determination unit that determines the front-to-back order of the persons included in the first image based on the first region and at least one of the second region and the third region.
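The appendices describe a second image built from first, second and third regions, and the claims further require the individual first regions to be distinguishable from one another. One way to picture this, assuming the second image is stored as a grid of hypothetical per-pixel codes (0 background, 1 first region, 2 second region, 3 third region; the source does not specify an encoding), is ordinary 4-connected component labeling of the first-region pixels. This is an illustrative sketch, not the applicant's implementation.

```python
from collections import deque

# Hypothetical per-pixel codes for the second image (not defined in the source).
BACKGROUND, FIRST, SECOND, THIRD = 0, 1, 2, 3


def label_first_regions(codes):
    """Give each 4-connected component of FIRST pixels its own integer ID.

    Returns (ids, count): ids[y][x] is 0 for non-FIRST pixels, else 1..count.
    """
    h, w = len(codes), len(codes[0])
    ids = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if codes[y][x] != FIRST or ids[y][x]:
                continue
            # Start a new component and flood-fill it breadth-first.
            count += 1
            ids[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and codes[ny][nx] == FIRST and not ids[ny][nx]:
                        ids[ny][nx] = count
                        queue.append((ny, nx))
    return ids, count
```

For example, two equal-distance areas separated by a column of second-region pixels and a column of third-region pixels come out as two separately numbered regions, which is one way of making them mutually distinguishable in the second image.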
(Appendix 3)
The image processing device according to Appendix 2, wherein the determination unit identifies person regions based on the first region, and determines the front-to-back order of the persons included in the first image based on at least one of the bottom edge line of each identified person region, the top edge line of each identified person region, and the number of pixels included in each identified person region.
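Appendix 3 lists the features (bottom edge line, top edge line, pixel count) but not how they are combined. A minimal sketch, assuming a camera looking down at a flat floor so that a person whose region reaches lower in the image is nearer, might rank labeled person regions as follows. The ranking rule, the tie-breakers, and the function names are all assumptions made for illustration.

```python
def region_stats(ids, region_id):
    """Return (top_row, bottom_row, pixel_count) for one labeled person region.

    `ids` is a 2-D grid of region IDs (0 = no person region).
    """
    rows = [y for y, row in enumerate(ids) for v in row if v == region_id]
    return min(rows), max(rows), len(rows)


def order_front_to_back(ids, region_ids):
    """Sort person-region IDs from assumed nearest to assumed farthest.

    Heuristic (an assumption, not taken from the source):
    1. a lower bottom edge (larger row index) is nearer;
    2. ties are broken by a larger pixel count;
    3. remaining ties by a higher top edge (smaller row index).
    """
    def key(rid):
        top, bottom, count = region_stats(ids, rid)
        return (-bottom, -count, top)

    return sorted(region_ids, key=key)
```

Sorting by a tuple key keeps each criterion separate, so any one of the three features named in Appendix 3 can be dropped or reordered without restructuring the code.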
(Appendix 4)
The image processing device according to Appendix 2 or 3, wherein the determination unit identifies person regions based on the first region; determines, for each identified person region, its front-to-back relationship with an adjacent person region based on at least one of the second region and the third region; and determines the front-to-back order of the persons included in the first image based on the relationships between each person region and its adjacent person regions.
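Appendix 4 builds the overall order from per-pair relationships between adjacent person regions. Assuming those relationships are available as hypothetical (front_id, back_id) pairs, a topological sort is one natural way to merge them. Python's standard `graphlib` module (3.9+) provides `TopologicalSorter`, which raises `CycleError` when the pairwise judgements contradict each other. This is an illustrative choice; the source does not say how the pairwise results are combined.

```python
from graphlib import TopologicalSorter


def global_order(region_ids, in_front_of):
    """Merge pairwise (front_id, back_id) judgements into one front-to-back list.

    Raises graphlib.CycleError if the pairwise judgements are inconsistent.
    """
    # Register every region, even one with no known relationships.
    ts = TopologicalSorter({rid: set() for rid in region_ids})
    for front, back in in_front_of:
        ts.add(back, front)  # `front` is a predecessor, so it is emitted before `back`
    return list(ts.static_order())
```

When the pairs do not fully determine the order (for example, two regions with no adjacent neighbor), `static_order` still returns a valid order, but the relative position of unrelated regions is arbitrary.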
(Appendix 5)
The image processing device according to Appendix 4, wherein the determination unit determines the front-to-back relationship between the person in each identified person region and the person in the adjacent person region based on the distance between that person region and at least one of the second region and the third region lying between it and the adjacent person region.
(Appendix 6)
The image processing device according to Appendix 5, wherein, when the generation unit extracts the second region, the determination unit determines that, of each identified person region and its adjacent person region, the person in whichever person region is closer to the second region lying between them is positioned in front of the person in the other person region.
(Appendix 7)
The image processing device according to Appendix 5 or 6, wherein, when the generation unit extracts the third region, the determination unit determines that, of each identified person region and its adjacent person region, the person in whichever person region is closer to the third region lying between them is positioned behind the person in the other person region.
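Appendices 5 to 7 decide each pair by which person region lies closer to the second or third region between them. The sketch below assumes that regions are given as lists of (row, column) pixel coordinates and that "distance" means the smallest Euclidean distance between pixels of the two sets. Both choices, and the function names, are hypothetical.

```python
import math


def min_distance(region, boundary):
    """Smallest Euclidean distance between any pixel of `region` and of `boundary`."""
    return min(math.dist(p, q) for p in region for q in boundary)


def front_back(region_a, region_b, boundary, boundary_kind):
    """Return ("a", "b") if region_a is in front, else ("b", "a").

    boundary_kind is "near" for a second region (closer than the boundary line)
    or "far" for a third region (farther than the boundary line).
    Ties in distance are resolved in favor of region_a being the closer one.
    """
    a_closer = min_distance(region_a, boundary) <= min_distance(region_b, boundary)
    if boundary_kind == "near":
        # Appendix 6: the region nearer the second region is in front.
        return ("a", "b") if a_closer else ("b", "a")
    if boundary_kind == "far":
        # Appendix 7: the region nearer the third region is behind.
        return ("b", "a") if a_closer else ("a", "b")
    raise ValueError(f"unknown boundary kind: {boundary_kind!r}")
```

The two rules are mirror images: a near-side strip hugs the occluding person, while a far-side strip hugs the occluded one, so the same distance test yields opposite conclusions.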
(Appendix 8)
The image processing device according to any one of Appendices 2 to 7, wherein
the learning model is a learning model that outputs, for each predetermined pixel block included in the estimated region, the matching region pattern among a plurality of region patterns, and
the generation unit extracts, from the estimated region, the first region and at least one of the second region and the third region based on the output region patterns.
(Appendix 9)
The image processing device according to Appendix 8, wherein the plurality of region patterns include a first pattern for extracting the first region and a plurality of second patterns for extracting at least one of the second region and the third region.
(Appendix 10)
The image processing device according to Appendix 9, wherein the plurality of second patterns differ from one another in depth gradient direction, the depth gradient direction indicating the direction of the gradient of the distance from the imaging device.
(Appendix 11)
The image processing device according to Appendix 10, wherein the plurality of second patterns correspond to 8 or 16 depth gradient directions, respectively.
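Appendices 10 and 11 describe second patterns that differ by depth gradient direction, quantized into 8 or 16 directions. Assuming the direction is expressed as a 2-D vector (dx, dy) and that the directions are equal angular sectors, the quantization could look like the sketch below. The pattern numbering (0 for the first pattern, 1 to n for the second patterns) and the flatness threshold `flat_eps` are hypothetical.

```python
import math


def direction_bin(dx, dy, n_dirs=8):
    """Quantize a depth-gradient vector into one of n_dirs equal angular bins.

    Bin 0 is centered on the +x axis; bins advance counter-clockwise in the
    (dx, dy) plane.
    """
    if n_dirs not in (8, 16):
        raise ValueError("Appendix 11 mentions 8 or 16 directions")
    angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
    width = 2 * math.pi / n_dirs
    # Shift by half a bin so each bin is centered on its nominal direction.
    return int((angle + width / 2) // width) % n_dirs


def block_pattern(dx, dy, n_dirs=8, flat_eps=1e-3):
    """Hypothetical pattern index for one pixel block.

    0 -> first pattern (no depth change inside the block),
    1..n_dirs -> second pattern for the quantized depth-gradient direction.
    """
    if math.hypot(dx, dy) < flat_eps:
        return 0
    return 1 + direction_bin(dx, dy, n_dirs)
```

Framed this way, the learning model of Appendix 8 becomes a per-block classifier with 1 + n_dirs classes, i.e. 9 or 17 classes for the two cases named in Appendix 11.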
(Appendix 12)
An image processing method comprising:
inputting a first image captured by an imaging device; and
based on a trained learning model, extracting, from an estimated region of the first image that is estimated to include a person, a first region estimated to be at an equal distance from the imaging device, and generating a second image including the first region.
(Appendix 13)
A program that causes a computer to execute:
inputting a first image captured by an imaging device; and
based on a trained learning model, extracting, from an estimated region of the first image that is estimated to include a person, a first region estimated to be at an equal distance from the imaging device, and generating a second image including the first region.

This application claims priority based on Japanese Patent Application No. 2019-044273 filed on March 11, 2019, the entire disclosure of which is incorporated herein.

Reference Signs List
1, 10 image processing device
2, 11, 21 input unit
3, 13 generation unit
12, 22 data storage unit
14, 23 model storage unit
15 determination unit
20 learning device
24 learning unit

Claims

An image processing device comprising:
input means for inputting a first image captured by an imaging device; and
generation means for extracting, based on a trained learning model, a plurality of first regions, each being a group of pixels estimated to be at an equal distance from the imaging device, from an estimated region of the first image that is estimated to include a plurality of persons, and for generating a second image in which the plurality of extracted first regions are included so as to be distinguishable from one another,
wherein the generation means, based on the learning model, extracts from the estimated region at least one of a second region that is closer to the imaging device than a boundary line across which the distance from the imaging device changes, and a third region that is farther from the imaging device than the boundary line, and generates the second image including at least one of the second region and the third region, and
the image processing device further comprises determination means for determining the front-to-back order of the persons included in the first image based on the first regions and at least one of the second region and the third region.

The image processing device according to claim 1, wherein the determination means identifies person regions based on the first regions, and determines the front-to-back order of the persons included in the first image based on at least one of the bottom edge line of each identified person region, the top edge line of each identified person region, and the number of pixels included in each identified person region.

The image processing device according to claim 1 or 2, wherein the determination means identifies person regions based on the first regions; determines, for each identified person region, its front-to-back relationship with an adjacent person region based on at least one of the second region and the third region; and determines the front-to-back order of the persons included in the first image based on the relationships between each person region and its adjacent person regions.

The image processing device according to claim 3, wherein the determination means determines the front-to-back relationship between the person in each identified person region and the person in the adjacent person region based on the distance between that person region and at least one of the second region and the third region lying between it and the adjacent person region.

The image processing device according to any one of claims 1 to 4, wherein
the learning model is a learning model that outputs, for each predetermined pixel block included in the estimated region, the matching region pattern among a plurality of region patterns, and
the generation means extracts, from the estimated region, the first regions and at least one of the second region and the third region based on the output region patterns.

The image processing device according to claim 5, wherein the plurality of region patterns include a first pattern for extracting the first regions and a plurality of second patterns for extracting at least one of the second region and the third region.

The image processing device according to claim 6, wherein the plurality of second patterns differ from one another in depth gradient direction, the depth gradient direction indicating the direction of the gradient of the distance from the imaging device.

An image processing method comprising:
inputting a first image captured by an imaging device;
extracting, based on a trained learning model, first regions, each being a group of pixels estimated to be at an equal distance from the imaging device, from an estimated region of the first image that is estimated to include a plurality of persons, and generating a second image in which the plurality of extracted first regions are included so as to be distinguishable from one another, wherein generating the second image includes extracting from the estimated region, based on the learning model, at least one of a second region that is closer to the imaging device than a boundary line across which the distance from the imaging device changes, and a third region that is farther from the imaging device than the boundary line, and generating the second image including at least one of the second region and the third region; and
determining the front-to-back order of the persons included in the first image based on the first regions and at least one of the second region and the third region.

A program that causes a computer to execute:
inputting a first image captured by an imaging device;
extracting, based on a trained learning model, first regions, each being a group of pixels estimated to be at an equal distance from the imaging device, from an estimated region of the first image that is estimated to include a plurality of persons, and generating a second image in which the plurality of extracted first regions are included so as to be distinguishable from one another, wherein generating the second image includes extracting from the estimated region, based on the learning model, at least one of a second region that is closer to the imaging device than a boundary line across which the distance from the imaging device changes, and a third region that is farther from the imaging device than the boundary line, and generating the second image including at least one of the second region and the third region; and
determining the front-to-back order of the persons included in the first image based on the first regions and at least one of the second region and the third region.