JP2018124939A

JP2018124939A - Image synthesizer, image synthesizing method, and image synthesizing program

Info

Publication number: JP2018124939A
Application number: JP2017019074A
Authority: JP
Inventors: Kota Takeuchi; Hideaki Kimata; Kazuki Okami
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-02-03
Filing date: 2017-02-03
Publication date: 2018-08-09

Abstract

PROBLEM TO BE SOLVED: To provide an image synthesizer, an image synthesizing method, and an image synthesizing program that can synthesize a depth image accurately even in shadow areas, robustly against background variation under an arbitrary background, and with high precision even at the contour of a subject, while significantly reducing the cost of generating teacher data.

SOLUTION: An image synthesizer includes: a teacher data generation unit that superimposes a teacher multi-viewpoint background image and CG data containing an image representing a subject based on a 3D shape model to generate multi-viewpoint teacher image data; a learning parameter calculation unit that performs machine learning using the teacher image data to calculate a learning parameter; and an image synthesizing unit that generates, by the machine learning, a depth image for a multi-viewpoint captured image using the multi-viewpoint captured image and the learning parameter.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an image composition device, an image composition method, and an image composition program that generate a desired image by image composition processing.

There has long been a need to estimate the three-dimensional shape of an arbitrary subject from video. Beyond the presentation of 3D video, three-dimensional shape estimation is now required in many fields, such as the fusion of CG (computer graphics) space and real space known as augmented reality (AR) or mixed reality (MR).

When the subject is a human being, the ultimate goal is to estimate a full-body 3D (three-dimensional) model of the person from video. However, video from a single camera shows only part of the subject's surface. The surfaces that are not shown must therefore be inferred using some other information.

Meanwhile, there is active research on estimating depth images, which hold, for each pixel of a captured image, information about the distance from the camera to the subject. Estimating a depth image makes it possible to grasp the relief of the subject, although only for the part of its surface that was captured. A depth image can therefore be used, for example, to present 3D video whose parallax is not very large.

Attempts have also been made to reconstruct a full-body 3D model of a subject by estimating a depth image for each viewpoint of multi-viewpoint cameras installed so as to surround the subject and then combining those depth images. Most methods for synthesizing depth images fall into two broad categories: methods that estimate depth from camera images captured by multiple stereo cameras or the like, and methods that estimate depth using a special device such as a depth sensor together with a camera.

In methods using stereo cameras or the like (see, for example, Non-Patent Document 1), a depth image is estimated by comparing pixel values between images under constraints based on the geometric relationship between the cameras, using what are called camera parameters: the position and orientation of each camera, its focal length, and so on. In methods using a depth sensor or the like (see, for example, Non-Patent Document 2), the subject is irradiated at high frequency with invisible light such as near-infrared light, the reflected wave is observed, and a depth image is estimated from the phase-difference signal.

Non-Patent Document 1: Andreas Klaus, Mario Sormann, and Konrad Karner, "Segment-Based Stereo Matching Using Belief Propagation and a Self-Adapting Dissimilarity Measure," 18th International Conference on Pattern Recognition (ICPR'06), Vol. 3, IEEE, 2006.
Non-Patent Document 2: Zhengyou Zhang, "Microsoft Kinect Sensor and Its Effect," IEEE MultiMedia 19.2, pp. 4-10, 2012.
Non-Patent Document 3: Wenjie Luo, Alexander G. Schwing, and Raquel Urtasun, "Efficient Deep Learning for Stereo Matching," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

However, methods that measure scene depth using a depth sensor and a color image, as in Non-Patent Document 2, have two failure cases. At actual shooting venues, the emitted light and the reflected light of the depth sensor can interfere with each other so that depth information cannot be observed properly. In addition, when the subject is too far from the depth sensor, the emitted light does not reach the subject and depth information cannot be measured.

On the other hand, methods based on comparing pixel values between stereo images, as in Non-Patent Document 1, cannot estimate depth accurately where a region is filled with a uniform color, such as a shadow region, precisely because they rely on comparing pixel values. Likewise, when occlusion (one object hiding another) occurs, pixels cannot be compared with each other, so it is difficult to measure correct depth information. Methods that attempt to solve these problems using a deep neural network (DNN) have therefore been proposed (see, for example, Non-Patent Document 3). In Non-Patent Document 3, stereo matching is performed using a DNN and teacher data to calculate a depth image. However, this method requires a large amount of teacher data. It also has the problem that the contours in the depth image are blurred.

The present invention has been made in view of these circumstances. Its object is to provide an image composition device, an image composition method, and an image composition program that can synthesize a depth image accurately even in shadow regions, robustly against background variation under an arbitrary background, and with high accuracy even at the contour of the subject, while significantly reducing the cost of generating teacher data.

One aspect of the present invention is an image composition device comprising: a teacher data generation unit that generates multi-viewpoint teacher image data by superimposing a teacher multi-viewpoint background image and CG data including an image showing a subject based on a 3D shape model; a learning parameter calculation unit that performs machine learning using the teacher image data and calculates a learning parameter; and an image composition unit that generates, by the machine learning, a depth image for a multi-viewpoint captured image using the multi-viewpoint captured image and the learning parameter.

One aspect of the present invention is the image composition device described above, wherein the machine learning is machine learning using a deep neural network.

One aspect of the present invention is the image composition device described above, wherein the CG data is data including information specifying at least one of the posture of the subject, the position of the subject, and the state of illumination.

One aspect of the present invention is the image composition device described above, wherein the deep neural network has input layers that respectively receive a projection image group based on the multi-viewpoint captured image and a background projection image group based on a multi-viewpoint background image captured from the same viewpoints as the multi-viewpoint captured image.

One aspect of the present invention is an image composition method performed by a computer, comprising: a teacher data generation step in which a teacher data generation unit generates multi-viewpoint teacher image data by superimposing a teacher multi-viewpoint background image and CG data including an image showing a subject based on a 3D shape model; a learning parameter calculation step in which a learning parameter calculation unit performs machine learning using the teacher image data and calculates a learning parameter; and an image composition step in which an image composition unit generates, by the machine learning, a depth image for a multi-viewpoint captured image using the multi-viewpoint captured image and the learning parameter.

One aspect of the present invention is an image composition program for causing a computer to execute: a teacher data generation step of generating multi-viewpoint teacher image data by superimposing a teacher multi-viewpoint background image and CG data including an image showing a subject based on a 3D shape model; a learning parameter calculation step of performing machine learning using the teacher image data and calculating a learning parameter; and an image composition step of generating, by the machine learning, a depth image for a multi-viewpoint captured image using the multi-viewpoint captured image and the learning parameter.

According to the present invention, images obtained by superimposing multi-viewpoint images rendered by CG under intentionally varied conditions on multi-viewpoint background images captured in advance are converted into projection image groups and used as teacher data for the DNN. This reduces the cost of creating teacher data and, at the same time, makes depth image synthesis robust against background variation and accurate even at the shadow and boundary portions of the subject.

A block diagram showing the functional configuration of the image composition device according to an embodiment of the present invention.
A flowchart showing an outline of the operation of the image composition device according to the embodiment of the present invention.
A flowchart showing the operation of the teacher data generation unit of the image composition device according to the embodiment of the present invention.
A schematic diagram showing an outline of the generation of teacher data by the teacher data generation unit of the image composition device according to the embodiment of the present invention.
A flowchart showing the operation of the first image conversion unit of the image composition device according to the embodiment of the present invention.
A schematic diagram showing an outline of image conversion by the first image conversion unit of the image composition device according to the embodiment of the present invention.
A flowchart showing the operation of the learning parameter calculation unit of the image composition device according to the embodiment of the present invention.
A schematic diagram showing an outline of the calculation of learning parameters by the learning parameter calculation unit of the image composition device according to the embodiment of the present invention.
A flowchart showing the operation of the image composition unit of the image composition device according to the embodiment of the present invention.

Hereinafter, an image composition device according to an embodiment of the present invention will be described with reference to the drawings.
The following description covers processing of a single image (still image), but video (moving images) can also be processed by repeating the same processing for each of a plurality of consecutive images (frames). The processing need not be applied to every frame in the video; it may be applied to only some of the frames.

<Functional configuration of the image composition device>
Hereinafter, the functional configuration of the image composition device 1 will be described with reference to the drawings.
FIG. 1 is a block diagram showing the functional configuration of the image composition device 1 in the present embodiment. As illustrated, the image composition device 1 includes a teacher data generation unit 11, a learning parameter calculation unit 12, an image composition unit 13, a first image conversion unit 14, a second image conversion unit 15, a third image conversion unit 16, and a fourth image conversion unit 17. The image composition device 1 is implemented by a computer device such as a personal computer (PC) or a general-purpose computer.

Image data representing images from multiple viewpoints captured by multi-viewpoint cameras (hereinafter also called "multi-viewpoint images") is input to the image composition device 1 of the present embodiment. The image composition device 1 takes one viewpoint arbitrarily selected from the multiple viewpoints as a reference viewpoint, and synthesizes and outputs a depth image at the reference viewpoint.
The multi-viewpoint images are assumed to be temporally synchronized across viewpoints.

The teacher data generation unit 11 takes general CG data, a teacher multi-viewpoint background image group, and the camera parameters of the multi-viewpoint cameras as inputs. It generates a teacher depth image group, which the learning parameter calculation unit 12 (described later) uses as teacher data, and a teacher multi-viewpoint image group, which is the source data for the teacher projection image group that the learning parameter calculation unit 12 also uses as teacher data.

In the present embodiment, the case where the subject is a human being is described as an example. The CG data is therefore described as information indicating a general 3D shape model of the human body together with illumination information. When the subject is not a human but, for example, an animal, a 3D shape model of the animal may be used instead.
The 3D shape model here is information that includes the three-dimensional vertex coordinates on the model surface, mesh connectivity information, texture information, skinning parameters, and rig information.

The teacher multi-viewpoint background image group here is one or more frames of multi-viewpoint images captured from the same viewpoints as the multi-viewpoint captured images input to the image composition unit 13 described later. It consists of multi-viewpoint images in which only the background is captured and in which the subject that is ultimately to be extracted does not appear.

The depth image here is an image generated in a common depth image data format: a one-channel image with the same resolution as the background image. Each pixel of the depth image stores a value indicating the depth from the optical center of the camera to the subject.

The camera parameters here are general camera parameters: data containing the values of intrinsic parameters, including the focal length of the camera, and the values of extrinsic parameters representing its position and orientation.
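To illustrate how intrinsic and extrinsic parameters relate a 3D point to a pixel and a depth value, the following is a minimal sketch of the standard pinhole camera model. The class `CameraParameters`, its field names, and the function `project_point` are hypothetical names introduced for this example, and the depth is assumed here to be the z coordinate in the camera frame; the source does not specify its exact parameterization or depth convention.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class CameraParameters:
    """Hypothetical container for one viewpoint's camera parameters."""
    fx: float  # intrinsic: focal length in pixels (x)
    fy: float  # intrinsic: focal length in pixels (y)
    cx: float  # intrinsic: principal point (x)
    cy: float  # intrinsic: principal point (y)
    rotation: Tuple[Vec3, Vec3, Vec3]  # extrinsic: world-to-camera rotation, row-major
    translation: Vec3                  # extrinsic: world-to-camera translation


def project_point(cam: CameraParameters, point_world: Vec3) -> Tuple[float, float, float]:
    """Return (u, v, depth) for a world point; depth is the camera-frame z value (assumed)."""
    # Extrinsics: X_cam = R * X_world + t
    x_cam, y_cam, z_cam = (
        sum(r * p for r, p in zip(row, point_world)) + t
        for row, t in zip(cam.rotation, cam.translation)
    )
    if z_cam <= 0.0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective division followed by scaling and shifting to pixels
    u = cam.fx * x_cam / z_cam + cam.cx
    v = cam.fy * y_cam / z_cam + cam.cy
    return u, v, z_cam
```

Under these assumptions, a depth image for one viewpoint would store the third returned value at the pixel given by the first two.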

The first image conversion unit 14, the second image conversion unit 15, the third image conversion unit 16, and the fourth image conversion unit 17 each convert information indicating multi-viewpoint images and camera parameters into information indicating a projection image group.

The first image conversion unit 14 generates a teacher projection image group from the teacher multi-viewpoint image group input from the teacher data generation unit 11 and the camera parameters, and outputs it to the learning parameter calculation unit 12.
The second image conversion unit 15 generates a teacher background projection image group from the teacher multi-viewpoint background image group and the camera parameters, and outputs it to the learning parameter calculation unit 12.
The third image conversion unit 16 generates a projection image group from the multi-viewpoint captured images and the camera parameters, and outputs it to the image composition unit 13.
The fourth image conversion unit 17 generates a projection image group from the multi-viewpoint background captured images and the camera parameters, and outputs it to the image composition unit 13.

The learning parameter calculation unit 12 calculates learning parameters (machine learning) using, as inputs, network structure information, the teacher depth image group output from the teacher data generation unit 11, the teacher projection image group output from the first image conversion unit 14, and the teacher background projection image group output from the second image conversion unit 15. It then outputs learned network information based on the calculated learning parameters to the image composition unit 13 described later.

The learning parameter calculation unit 12 includes a deep neural network (DNN). The network structure information input to it is structural information that concretely describes the structure of the DNN. It includes, for example, information indicating the order and total number of convolution layers, their kernels and numbers of channels, the presence or absence of pooling layers, the type of activation function, and the definition of the loss function.
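As a rough illustration of what such network structure information might look like as data, the sketch below models it with two small dataclasses. All names (`ConvLayerSpec`, `NetworkStructure`, the field names) and all values in `example_structure` are hypothetical; the source lists only the kinds of information included, not a concrete format or architecture.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConvLayerSpec:
    """One convolution layer: kernel size, channel count, activation, optional pooling."""
    kernel_size: int
    out_channels: int
    activation: str = "relu"
    pooling: Optional[str] = None  # e.g. "max"; None means no pooling layer follows


@dataclass
class NetworkStructure:
    """Ordered list of convolution layers plus the loss definition."""
    conv_layers: List[ConvLayerSpec] = field(default_factory=list)
    loss: str = "l2"

    @property
    def num_conv_layers(self) -> int:
        # Total number of convolution layers; the list order is the layer order.
        return len(self.conv_layers)


# Illustrative values only; not the architecture used in the source.
example_structure = NetworkStructure(
    conv_layers=[
        ConvLayerSpec(kernel_size=3, out_channels=32, pooling="max"),
        ConvLayerSpec(kernel_size=3, out_channels=64),
        ConvLayerSpec(kernel_size=1, out_channels=1, activation="linear"),
    ],
    loss="l1",
)
```

Keeping the structure as plain data, separate from the learned parameters, mirrors the way the text treats network structure information as an input and learned network information as an output.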

The image composition unit 13 takes as inputs the projection image group input from the third image conversion unit 16 and the background projection image group input from the fourth image conversion unit 17, and uses the learned network information calculated by the learning parameter calculation unit 12 to synthesize and output a depth image at the reference viewpoint. In other words, the DNN has input layers that respectively receive a projection image group based on the multi-viewpoint captured images and a background projection image group based on multi-viewpoint background images captured from the same viewpoints as the multi-viewpoint captured images.

<Operation of the image composition device>
Hereinafter, an outline of the operation of the image composition device 1 will be described with reference to the drawings.
FIG. 2 is a flowchart showing an outline of the operation of the image composition device 1 according to the embodiment of the present invention.
This flowchart starts when CG data, information indicating the teacher multi-viewpoint background image group, and information indicating the camera parameter values are input to the teacher data generation unit 11.

(Step S001) The teacher data generation unit 11 generates a teacher multi-viewpoint image group and a teacher depth image group from the CG data, the teacher multi-viewpoint background image group, and the camera parameters. It then outputs the teacher multi-viewpoint image group to the first image conversion unit 14 and the teacher depth image group to the learning parameter calculation unit 12. The process then proceeds to step S002.
Details of the operation of the teacher data generation unit 11 in generating the teacher data are described later.

(Step S002) The first image conversion unit 14 converts the teacher multi-viewpoint image group input from the teacher data generation unit 11 into a teacher projection image group using the camera parameters. It then outputs the teacher projection image group to the learning parameter calculation unit 12. The process then proceeds to step S003.
Details of the operation of the first image conversion unit 14 in converting the teacher multi-viewpoint image group into the teacher projection image group are described later.

(Step S003) The second image conversion unit 15 converts the teacher multi-viewpoint background image group into a teacher background projection image group using the camera parameters. It then outputs the teacher background projection image group to the learning parameter calculation unit 12. The process then proceeds to step S004.

(Step S004) The learning parameter calculation unit 12 trains the DNN and calculates the learning parameters using the teacher depth image group input from the teacher data generation unit 11, the teacher projection image group input from the first image conversion unit 14, the teacher background projection image group input from the second image conversion unit 15, and the network structure information. It then outputs learned network information based on the calculated learning parameters to the image composition unit 13. The process then proceeds to step S005.

(Step S005) The third image conversion unit 16 converts the multi-viewpoint captured images into a projection image group using the camera parameters. It then outputs the projection image group to the image composition unit 13. The process then proceeds to step S006.

(Step S006) The fourth image conversion unit 17 converts the multi-viewpoint background images into a background projection image group using the camera parameters. It then outputs the background projection image group to the image composition unit 13. The process then proceeds to step S007.

(Step S007) The image composition unit 13 synthesizes a depth image from the projection image group input from the third image conversion unit 16 and the background projection image group input from the fourth image conversion unit 17, and outputs it.
This completes the processing of this flowchart.
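The data flow of steps S001 to S007 can be sketched as a single orchestration function. This is only an outline under stated assumptions: `run_pipeline` and the dictionary keys are hypothetical names, each unit is passed in as a plain callable, and the four image conversion units are represented by one shared `to_projection` callable because the text describes them as performing the same kind of conversion.

```python
from typing import Any, Callable, Dict

Unit = Callable[..., Any]


def run_pipeline(
    units: Dict[str, Unit],
    cg_data: Any,
    teacher_backgrounds: Any,
    camera_params: Any,
    network_structure: Any,
    captured_images: Any,
    background_images: Any,
) -> Any:
    """Chain the units in the order of steps S001 to S007 and return the depth image."""
    # S001: teacher multi-viewpoint images and teacher depth images from CG data
    teacher_images, teacher_depths = units["generate_teacher_data"](
        cg_data, teacher_backgrounds, camera_params
    )
    # S002: teacher multi-viewpoint images -> teacher projection image group
    teacher_proj = units["to_projection"](teacher_images, camera_params)
    # S003: teacher backgrounds -> teacher background projection image group
    teacher_bg_proj = units["to_projection"](teacher_backgrounds, camera_params)
    # S004: train the DNN and obtain the learned network information
    learned_network = units["learn"](
        network_structure, teacher_depths, teacher_proj, teacher_bg_proj
    )
    # S005: captured multi-viewpoint images -> projection image group
    proj = units["to_projection"](captured_images, camera_params)
    # S006: multi-viewpoint background images -> background projection image group
    bg_proj = units["to_projection"](background_images, camera_params)
    # S007: synthesize the depth image at the reference viewpoint
    return units["synthesize_depth"](learned_network, proj, bg_proj)
```

Note that steps S001 to S004 depend only on teacher data, so in practice they could be run once and the learned network reused for many inputs to steps S005 to S007; the source presents them as one sequential flowchart.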

<Operation of the teacher data generation unit>
Hereinafter, the operation of the teacher data generation unit 11 will be described with reference to the drawings.
FIG. 3 is a flowchart showing the operation of the teacher data generation unit 11 of the image composition device 1 according to the embodiment of the present invention.
This flowchart starts when CG data, information indicating the teacher multi-viewpoint background image group, and information indicating the camera parameter values are input to the teacher data generation unit 11.

(Step S101) The teacher data generation unit 11 acquires and holds the CG data of a person, the background multi-viewpoint image group, and the camera parameters. The process then proceeds to step S102.

(Step S102) The teacher data generation unit 11 changes the posture of the person, the position of the person, and the illumination condition in the CG data of the person. These changes are described below with reference to FIG. 4.
FIG. 4 is a schematic diagram showing an outline of the generation of teacher data by the teacher data generation unit 11 of the image composition device 1 according to the embodiment of the present invention.

The posture of the person can be changed using the rig information in the CG data. As illustrated in FIG. 4, the posture can be changed, for example, by randomly selecting one of three postures: the person standing upright, the person with both arms spread, and the person with legs apart.
The posture may be changed at random in this way, but it is desirable that the changes cover the postures that the subject in the projection image group input to the image composition unit 13 can take.

The position of the person may be changed to any position at which the 3D shape model of the person is rendered within the image. For example, as illustrated in FIG. 4, a floor surface can be assumed, and the position can be changed within the range in which the person can stand with their feet on the floor and in which the person is rendered. As illustrated, the floor surface may be divided into a grid so that the position of the person can be selected at random from the positions of the grid points.

照明条件は、任意に変更することができる。具体的には、例えば、図４に図示するように、人物の３Ｄ形状モデルの位置が決定された後、決定された位置を中心とする半天球面上の格子点（図示せず）の中からランダムに選択することができるようにしてもよい。なお、照明条件は、画像合成部１３に入力される投影画像群における照明条件と類似する照明条件に変更することが望ましい。 The illumination condition can be changed arbitrarily. Specifically, for example, as shown in FIG. 4, after the position of the 3D shape model of the person is determined, the illumination condition may be selected at random from among lattice points (not shown) on a hemisphere centered on the determined position. Note that it is desirable to change the illumination condition to one similar to the illumination condition in the projection image group input to the image composition unit 13.
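The random selection of posture, position, and illumination described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the floor is taken to be the plane z = 0 with z pointing up, each hemisphere lattice point is interpreted as a light position, and the names `floor_grid`, `hemisphere_lattice`, `sample_condition`, and `POSES` are hypothetical and do not appear in the source.

```python
import math
import random

# Hypothetical labels for the three example postures of FIG. 4.
POSES = ("standing", "arms_spread", "legs_apart")

def floor_grid(x_range, y_range, step):
    """Lattice points on an assumed floor plane z = 0."""
    nx = int((x_range[1] - x_range[0]) / step) + 1
    ny = int((y_range[1] - y_range[0]) / step) + 1
    return [(x_range[0] + i * step, y_range[0] + j * step, 0.0)
            for i in range(nx) for j in range(ny)]

def hemisphere_lattice(center, radius, n_elevation, n_azimuth):
    """Lattice points on the upper hemisphere centered on `center`.

    Elevations exclude the horizon and the zenith so that every
    (elevation, azimuth) pair yields a distinct point above the floor.
    """
    cx, cy, cz = center
    points = []
    for i in range(1, n_elevation + 1):
        elev = (math.pi / 2) * i / (n_elevation + 1)
        for j in range(n_azimuth):
            az = 2 * math.pi * j / n_azimuth
            points.append((cx + radius * math.cos(elev) * math.cos(az),
                           cy + radius * math.cos(elev) * math.sin(az),
                           cz + radius * math.sin(elev)))
    return points

def sample_condition(rng, poses, positions, light_radius):
    """Pick a posture and a position, then a light on the hemisphere around that position."""
    pose = rng.choice(poses)
    position = rng.choice(positions)
    light = rng.choice(hemisphere_lattice(position, light_radius, 3, 8))
    return pose, position, light
```

Passing a seeded `random.Random` instance as `rng` keeps the generated teacher data reproducible, which is a design choice of this sketch rather than something the source specifies.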

このように、本実施形態における画像合成装置１によれば、人物の姿勢、人物の位置、及び照明条件を変動させて、教師データとして利用することによって、ＤＮＮにおいて、数万枚という大量の教師データが必要であるという課題を解決しつつ、姿勢変動、位置変動、及び照明変動に対して、頑健にデプス画像の合成が可能となる。 As described above, according to the image composition device 1 of the present embodiment, varying the posture of the person, the position of the person, and the illumination condition and using the results as teacher data addresses the problem that a DNN requires a large amount of teacher data, on the order of tens of thousands of images, while making it possible to synthesize depth images robustly against posture, position, and illumination variations.

（ステップＳ１０３）教師データ生成部１１は、入力された教師多視点背景画像群の中から、視点ごとに背景画像を１枚選択し、背景画像に対する前景画像として、人物領域のレンダリングを実行する。また、教師データ生成部１１は、人物領域のデプス画像をレンダリングする。 (Step S103) The teacher data generation unit 11 selects one background image for each viewpoint from the input teacher multi-viewpoint background image group, and executes rendering of the person region as a foreground image with respect to the background image. The teacher data generation unit 11 renders a depth image of the person area.

図４に図示するように、具体的には、教師データ生成部１１は、ステップＳ１０２において指定された、人物の姿勢、人物の位置、及び照明条件に基づいて透過度を含むＲＧＢＡ（Red, Green, Blue, Alpha）の４チャネル画像として、カメラパラメータを用いて視点ごとにレンダリングし、多視点前景画像を生成する。 As illustrated in FIG. 4, specifically, the teacher data generation unit 11 renders, for each viewpoint using the camera parameters, a four-channel RGBA (Red, Green, Blue, Alpha) image including transparency, based on the posture of the person, the position of the person, and the illumination condition specified in step S102, and thereby generates a multi-viewpoint foreground image.

この際、人物の３Ｄ形状モデルがレンダリングされる領域の画素のアルファ値は０（不透明）、人物の影となる領域の画素のアルファ値は０より大きく１未満、及びそれ以外の領域の画素のアルファ値は１（透明）とする。なお、人物の影となる領域の画素のアルファ値は、例えば、照明の強度に応じて変更されるようにしてもよいし、または例えば、固定値（例えば、０．５）としてもよい。 At this time, the alpha value of the pixel in the region where the 3D shape model of the person is rendered is 0 (opaque), the alpha value of the pixel in the region that becomes the shadow of the person is greater than 0 and less than 1, and the pixels in the other regions The alpha value is 1 (transparent). It should be noted that the alpha value of the pixel in the region that becomes the shadow of the person may be changed according to, for example, the intensity of illumination, or may be a fixed value (for example, 0.5).

次に、教師データ生成部１１は、基準視点において、人物の３Ｄ形状モデルがレンダリングされた画素のみのデプス画像を、教師デプス画像としてレンダリングする。 Next, the teacher data generation unit 11 renders, as a teacher depth image, a depth image of only pixels in which the 3D shape model of the person is rendered at the reference viewpoint.

次に、教師データ生成部１１は、入力された教師多視点背景画像群の中から、各視点について１フレームの教師背景画像をランダムに選択する。そして、教師データ生成部１１は、選択したそれぞれの教師背景画像に対して、多視点前景画像のアルファ値を利用して上述した多視点前景画像を重畳させることによって、多視点教師画像を生成する。教師データ生成部１１は、得られた教師デプス画像、及び多視点教師画像を保存しておく。その後、ステップＳ１０４へ進む。 Next, the teacher data generation unit 11 randomly selects one frame of the teacher background image for each viewpoint from the input teacher multi-viewpoint background image group. Then, the teacher data generation unit 11 generates a multi-view teacher image by superimposing the above-described multi-view foreground image on each selected teacher background image using the alpha value of the multi-view foreground image. . The teacher data generation unit 11 stores the obtained teacher depth image and multi-viewpoint teacher image. Thereafter, the process proceeds to step S104.
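The superimposition of a foreground image on a background image can be sketched with NumPy as below. Note that the text uses an alpha convention that is inverted relative to common practice: 0 means opaque (the person), 1 means transparent (the background shows through), and values in between mark the shadow. The function name `composite` and the array layout (height, width, channels, with float values in [0, 1]) are assumptions made for illustration.

```python
import numpy as np

def composite(foreground_rgba, background_rgb):
    """Blend an RGBA foreground over an RGB background.

    Follows the convention in the text: alpha = 0 is opaque foreground,
    alpha = 1 is fully transparent, 0 < alpha < 1 is a shadow region.
    """
    alpha = foreground_rgba[..., 3:4]          # keep a trailing axis for broadcasting
    return (1.0 - alpha) * foreground_rgba[..., :3] + alpha * background_rgb
```

With a black shadow color and a fixed alpha of 0.5, as in the example given in the text, shadow pixels come out as the background darkened by half.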

このように、本実施形態における画像合成装置１によれば、教師データ生成部１１が、撮影画像と同一視点の背景画像に対して前景画像を重畳させることによって、背景変動に対する頑健性の向上、及び輪郭部分のデプス推定精度の向上が可能になる。 Thus, according to the image composition device 1 in the present embodiment, the teacher data generation unit 11 improves the robustness against the background fluctuation by superimposing the foreground image on the background image of the same viewpoint as the captured image. In addition, the depth estimation accuracy of the contour portion can be improved.

（ステップＳ１０４）教師データ生成部１１は、上述した人物の姿勢、人物の位置、及び照明条件の変更について、全ての変更パターンでのレンダリングが完了したか否かを判定する。なお、どのような変更パターンのレンダリングをするかについて、事前に設定がなされているものとする。
教師データ生成部１１が、全ての変更パターンでのレンダリングが完了したと判定された場合は、ステップＳ１０５へ進む。そうでない場合はステップＳ１０２に戻り、レンダリングが完了していない変更パターンに条件を変更して、レンダリングを実行する。 (Step S104) The teacher data generation unit 11 determines whether rendering has been completed for all change patterns of the above-described posture of the person, position of the person, and illumination condition. It is assumed that the change patterns to be rendered have been set in advance.
If the teacher data generation unit 11 determines that rendering has been completed for all the change patterns, the process proceeds to step S105. If not, the process returns to step S102, the condition is changed to a change pattern for which rendering has not been completed, and rendering is executed.
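The loop over steps S102 to S104 can be sketched as follows, assuming the preset change patterns are every combination of the candidate postures, positions, and illumination conditions. The source only states that the patterns are set in advance, so the use of a full Cartesian product, as well as the names `generate_teacher_frames` and `render`, are assumptions of this sketch.

```python
from itertools import product

def generate_teacher_frames(poses, positions, lights, render):
    """Render one frame per change pattern until none remain.

    `render` is a caller-supplied function standing in for step S103.
    """
    pending = list(product(poses, positions, lights))   # preset change patterns
    frames = []
    while pending:                                      # S104: patterns left to render?
        pose, position, light = pending.pop(0)          # S102: switch to the next pattern
        frames.append(render(pose, position, light))    # S103: render under that pattern
    return frames
```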

（ステップＳ１０５）教師データ生成部１１は、レンダリングされたすべての教師多視点画像群、及び教師デプス画像群を、上述した条件ごとにフレームとしてまとめる。そして、教師データ生成部１１は、教師多視点画像群を第１画像変換部１４へ、及び教師デプス画像群を学習パラメータ算出部１２へ出力する。
以上で、本フローチャートの処理が終了する。 (Step S105) The teacher data generation unit 11 collects all rendered teacher multi-viewpoint image groups and teacher depth image groups as frames according to the above-described conditions. Then, the teacher data generation unit 11 outputs the teacher multi-viewpoint image group to the first image conversion unit 14 and the teacher depth image group to the learning parameter calculation unit 12.
This completes the processing of this flowchart.

＜画像変換部の動作＞
以下、第１画像変換部１４の動作について、図面を参照しながら説明する。
図５は、本発明の実施形態における画像合成装置１の第１画像変換部１４の動作を示すフローチャートである。
本フローチャートは、第１画像変換部１４に、教師データ生成部１１から出力された教師多視点画像群と、カメラパラメータの値を示す情報とが入力された際に開始する。 <Operation of image converter>
Hereinafter, the operation of the first image conversion unit 14 will be described with reference to the drawings.
FIG. 5 is a flowchart showing the operation of the first image conversion unit 14 of the image composition device 1 according to the embodiment of the present invention.
This flowchart is started when a teacher multi-viewpoint image group output from the teacher data generation unit 11 and information indicating camera parameter values are input to the first image conversion unit 14.

（ステップＳ２０１）第１画像変換部１４は、教師データ生成部１１から出力された教師多視点画像群と、カメラパラメータと、を取得する。その後、ステップＳ２０２へ進む。 (Step S201) The first image conversion unit 14 acquires the teacher multi-viewpoint image group output from the teacher data generation unit 11 and the camera parameters. Thereafter, the process proceeds to step S202.

（ステップＳ２０２）第１画像変換部１４は、基準とするカメラの画像平面と平行する、仮想的な面である奥行き平面の、最小の奥行値、及び最大の奥行値を設定する。以下に、図６を参照しながら具体的に説明する。 (Step S202) The first image conversion unit 14 sets a minimum depth value and a maximum depth value of a depth plane that is a virtual plane parallel to the image plane of the reference camera. This will be specifically described below with reference to FIG.

図６は、本発明の実施形態における画像合成装置１の第１画像変換部１４による画像変換の概要を示す概略図である。ここでは、図６に図示するように、入力される多視点画像が３視点（カメラ０、カメラ１、及びカメラ２からの視点）であり、仮定する奥行き平面が５層（ｄ＝０〜４）であり、基準カメラをカメラ１とした場合における、投影画像群について説明する。 FIG. 6 is a schematic diagram showing an outline of image conversion by the first image conversion unit 14 of the image composition device 1 according to the embodiment of the present invention. Here, as illustrated in FIG. 6, the input multi-viewpoint image has three viewpoints (viewpoints from the camera 0, the camera 1, and the camera 2), and the assumed depth plane has five layers (d = 0 to 4). The projection image group when the reference camera is the camera 1 will be described.

図６において、各奥行き平面は、カメラ１の画像平面と平行する仮想的な面であるものとする。奥行き平面の数は、シーンに存在する物体の奥行きに応じて任意に設定することができる。例えば、基準カメラ（カメラ１）と平行な巨大な壁が背景であって、厚みの薄い被写体が１つだけ存在するようなシーンにおいては、奥行き平面を２層に設定することができる。一方、奥行き方向に向かって、多数の被写体が連続的に分布するようなシーンにおいては，数百層の奥行き平面を設定することが好ましい場合もある。 In FIG. 6, each depth plane is assumed to be a virtual plane parallel to the image plane of the camera 1. The number of depth planes can be arbitrarily set according to the depth of an object present in the scene. For example, in a scene where a huge wall parallel to the reference camera (camera 1) is the background and there is only one thin object, the depth plane can be set to two layers. On the other hand, in a scene where a large number of subjects are continuously distributed in the depth direction, it may be preferable to set several hundred depth planes.

第１画像変換部１４は、最小、及び最大の奥行値を設定する。ここでいう最小の奥行値とは、図６に示すｄ＝０の奥行き平面からカメラ１の光学中心点までの距離である。また、ここでいう最大の奥行値とは、図６に示すｄ＝４の奥行き平面からカメラ１の光学中心点までの距離である。最小、及び最大の奥行値は、撮影される前景として被写体が映っている範囲に基づいて、手動で大まかに設定されてもよい。
その後、ステップＳ２０３へ進む。 The first image conversion unit 14 sets the minimum and maximum depth values. The minimum depth value here is a distance from the depth plane of d = 0 shown in FIG. 6 to the optical center point of the camera 1. Further, the maximum depth value here is a distance from the depth plane of d = 4 shown in FIG. 6 to the optical center point of the camera 1. The minimum and maximum depth values may be manually set roughly based on the range in which the subject is shown as the foreground to be shot.
Thereafter, the process proceeds to step S203.

（ステップＳ２０３）第１画像変換部１４は、取得した教師多視点画像群から、１フレーム分の同一時刻の教師多視点画像を選択する。その後、ステップＳ２０４へ進む。 (Step S203) The first image conversion unit 14 selects a teacher multi-viewpoint image for one frame at the same time from the acquired teacher multi-viewpoint image group. Thereafter, the process proceeds to step S204.

（ステップＳ２０４）第１画像変換部１４は、ステップＳ２０３において選択した教師多視点画像に含まれる全ての画素を、全ての仮想面（奥行き平面）へと投影し、仮想面ごと、及びカメラごとに、画像として保存する。 (Step S204) The first image conversion unit 14 projects all the pixels included in the teacher multi-viewpoint image selected in Step S203 onto all virtual planes (depth planes), and for each virtual plane and each camera. Save as an image.

以下、図６を参照しながら、具体的に説明する。上述したように、図６は、入力される多視点画像が３視点（カメラ０、カメラ１、及びカメラ２からの視点）であり、仮定する奥行き平面が５層（ｄ＝０〜４）であり、基準カメラをカメラ１とした場合における、投影画像群について示したものである。ここで、ｄは、奥行き平面のインデックス番号であるため、図６は、ｄ＝３の場合における、奥行き平面の投影画像群を構築する手順について示したものである。 Hereinafter, a specific description will be given with reference to FIG. As described above, in FIG. 6, the input multi-viewpoint image has three viewpoints (viewpoints from the camera 0, the camera 1, and the camera 2), and the assumed depth plane has five layers (d = 0 to 4). The projection image group when the reference camera is the camera 1 is shown. Here, since d is the index number of the depth plane, FIG. 6 shows a procedure for constructing the depth plane projection image group in the case of d = 3.

ここで言う投影画像群とは、各ｄの値において、各カメラ（カメラ０、カメラ１、及びカメラ２）から投影される画像の集合である。投影画像群のそれぞれの投影画像は、ＲＧＢＡの４チャネルで構成されている。以下、Ｎ視点の入力でＤ層構成である投影画像群における、入力視点ｎ、及び奥行き平面ｄの投影画像を、Ｉ_ｄ，ｎと表す。 The projection image group referred to here is the set of images projected from each camera (camera 0, camera 1, and camera 2) for each value of d. Each projection image in the projection image group consists of the four RGBA channels. Hereinafter, in a projection image group with N input viewpoints and D layers, the projection image for input viewpoint n and depth plane d is denoted I_{d,n}.

また、カメラ０、カメラ１、及びカメラ２の画像を、それぞれＳ_０、Ｓ_１、及びＳ_２とする。また、基準カメラであるカメラ１の画像Ｓ_１の中のひとつの画素の画像座標ベクトルをｐ_１とする。また、画像Ｓ_１のｐ_１の画素値を、Ｓ_１（ｐ_１）とする。 The images of camera 0, camera 1, and camera 2 are denoted S_0, S_1, and S_2, respectively. The image coordinate vector of one pixel in image S_1 of camera 1, the reference camera, is denoted p_1, and the pixel value of image S_1 at p_1 is denoted S_1(p_1).

基準カメラからの視点で見た場合、奥行き平面は常に画像平面と平行である。そのため、投影画像群の画像座標ベクトルｑは、常にｑ＝ｐ_１が成りたつ。そのため、投影画像群のＩ_ｄ，１（ｑ）＝Ｓ_１（ｐ_１）となる。
次に、カメラ０、及びカメラ２からの投影画像群の画素値Ｉ_３，０（ｑ）、及びＩ_３，２（ｑ）については、Ｉ_３，０（ｑ）＝Ｓ_０（ｐ_０）、及びＩ_３，２（ｑ）＝Ｓ_２（ｐ_２）とする。ここで、ｐ_０＝Ｐ_３，０（ｑ）、ｐ_２＝Ｐ_３，２（ｑ）とし、Ｐ_ｄ，ｃは、奥行きインデックスｄ、投影先のカメラ番号ｃを仮定したときの、投影マッピング関数である。 When viewed from the reference camera, the depth plane is always parallel to the image plane. Therefore, the image coordinate vector q of the projection image group always satisfies q = p_1, and hence I_{d,1}(q) = S_1(p_1).
Next, the pixel values I_{3,0}(q) and I_{3,2}(q) of the projection images from camera 0 and camera 2 are given by I_{3,0}(q) = S_0(p_0) and I_{3,2}(q) = S_2(p_2). Here, p_0 = P_{3,0}(q) and p_2 = P_{3,2}(q), where P_{d,c} is the projection mapping function for depth index d and projection-destination camera number c.

Ｐ_ｄ，ｃは、それぞれのカメラパラメータ、奥行値Ｆ（ｄ）から一般的に計算することができる。Ｆは一般的な関数であれば、どのような関数を用いてもかまわないが、遠方ほど奥行き平面の間隔が疎になる方が、効率的に計算ができるため、例えば、Ｆ＝ａ／（ｄ＋ｂ）とすることができる。なお、ａ、及びｂは任意の定数であり、シーンの奥行きによって手動で設定されるものとする。 P_{d,c} can be calculated in the usual way from the respective camera parameters and the depth value F(d). Any general function may be used for F; however, because computation is more efficient when the spacing between depth planes becomes sparser with distance, F can be set to, for example, F = a / (d + b). Here, a and b are arbitrary constants that are set manually according to the depth of the scene.
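One way to compute F(d) and P_{d,c} is sketched below. It rests on several assumptions that the source does not state: a pinhole camera model with intrinsic matrix K, a rigid transform X_c = R X_ref + t from reference-camera coordinates to camera c, and constants a and b derived from the minimum and maximum depth values rather than set by hand. The names `depth_constants`, `plane_depth`, and `project_to_camera` are hypothetical.

```python
import numpy as np

def depth_constants(z_min, z_max, num_planes):
    """Solve F(0) = z_min and F(num_planes - 1) = z_max for F(d) = a / (d + b)."""
    b = -z_max * (num_planes - 1) / (z_max - z_min)
    a = z_min * b
    return a, b

def plane_depth(d, a, b):
    """Depth value F(d) = a / (d + b) of the depth plane with index d."""
    return a / (d + b)

def project_to_camera(q, depth, K_ref, K_c, R, t):
    """Map pixel q of the reference view to a pixel of camera c.

    q is back-projected onto the plane at `depth` in front of the reference
    camera, moved into camera c coordinates, and projected with K_c.
    """
    q_h = np.array([q[0], q[1], 1.0])
    X_ref = depth * (np.linalg.inv(K_ref) @ q_h)   # 3D point on the depth plane
    X_c = R @ X_ref + t                            # same point in camera c coordinates
    p_h = K_c @ X_c
    return p_h[:2] / p_h[2]
```

With these constants both a and b are negative, so F(d) increases with d and the gap between neighboring planes widens with distance; this amounts to sampling the planes uniformly in inverse depth.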

このように、全ての奥行き平面（ｄ＝０〜４）において、全てのカメラ（カメラ０、カメラ１、及びカメラ２）からの投影画像Ｉ_ｄ，ｃを算出するが、Ｉ_ｄ，ｃのうちの、Ｉ_ｄ，０、及びＩ_ｄ，２の一部の画素については、カメラからの画角外であるため、画素値が格納されない。この場合、これらの画素については、アルファ値を１とし、透過であるものとする。また、それ以外の画素が格納されている画素については、アルファ値を０とし、不透明であるものとする。 In this way, the projection images I_{d,c} from all cameras (camera 0, camera 1, and camera 2) are calculated for all depth planes (d = 0 to 4). For some pixels of I_{d,0} and I_{d,2}, however, no pixel value is stored because they fall outside the angle of view of the camera. These pixels are given an alpha value of 1 and treated as transparent. The remaining pixels, for which pixel values are stored, are given an alpha value of 0 and treated as opaque.
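The construction of one projection image I_{d,c}, including the alpha handling for pixels outside the angle of view, might look like the sketch below. It assumes a pinhole model with intrinsics K_ref and K_c, a transform X_c = R X_ref + t from the reference camera to camera c, and nearest-neighbor sampling; none of these details are given in the source, and `build_projection_image` is a hypothetical name.

```python
import numpy as np

def build_projection_image(src, depth, K_ref, K_c, R, t, out_shape):
    """Warp the RGB image `src` of camera c onto one depth plane of the reference view.

    Returns an RGBA image using the convention in the text:
    alpha = 0 where a pixel value was stored, alpha = 1 where the
    projected position falls outside camera c's image.
    """
    h, w = out_shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    q = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)]).astype(np.float64)
    # Back-project every reference pixel to the plane, then project into camera c.
    X_c = R @ (depth * (np.linalg.inv(K_ref) @ q)) + np.asarray(t, dtype=np.float64).reshape(3, 1)
    p = K_c @ X_c
    z = np.where(p[2] > 0, p[2], 1.0)              # avoid dividing by non-positive depth
    px = np.rint(p[0] / z).astype(int)
    py = np.rint(p[1] / z).astype(int)
    inside = (p[2] > 0) & (px >= 0) & (px < src.shape[1]) & (py >= 0) & (py < src.shape[0])
    rgba = np.zeros((h * w, 4), dtype=np.float64)
    rgba[:, 3] = 1.0                               # default: transparent, nothing stored
    rgba[inside, :3] = src[py[inside], px[inside]]
    rgba[inside, 3] = 0.0                          # stored pixel: opaque
    return rgba.reshape(h, w, 4)
```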

このようにして、第１画像変換部１４は、Ｉ_ｄ，ｃを算出し、保持する。そして、第１画像変換部１４は、別のフレームに対しても、同様に投影画像群の算出と保持を行う。第１画像変換部１４は、全てのフレームに対して同様の処理が完了するまで、当該処理を繰り返すものとする。 In this way, the first image conversion unit 14 calculates and holds I_{d,c}. The first image conversion unit 14 then calculates and holds the projection image group for the other frames in the same way, repeating the process until it has been completed for all frames.

（ステップＳ２０５）第１画像変換部１４は、全てのフレームに対して処理が完了したか否かを判定する。全てのフレームに対して処理が完了した場合、本フローチャートの処理が終了する。そうでない場合、ステップＳ２０３へ戻り、処理が完了していないフレームに対しての処理を実行する。 (Step S205) The first image conversion unit 14 determines whether or not processing has been completed for all frames. When the processing is completed for all the frames, the processing of this flowchart ends. Otherwise, the process returns to step S203, and the process for the frame that has not been completed is executed.

このように、第１画像変換部１４は、教師多視点画像群を、教師投影画像群に変換する。なお、第２画像変換部による、教師多視点背景画像群から教師背景投影画像群への変換における処理についても、上記と同様な処理によって行われる。また、第３画像変換部による、デプス推定の対象である多視点撮影画像から投影画像群への変換における処理についても、上記と同様な処理によって行われる。また、第４画像変換部による、多視点背景画像から背景投影画像群への変換における処理についても、上記と同様な処理によって行われる。 As described above, the first image conversion unit 14 converts the teacher multi-viewpoint image group into the teacher projection image group. Note that the processing in the conversion from the teacher multi-viewpoint background image group to the teacher background projection image group by the second image conversion unit is performed by the same process as described above. Further, the process in the conversion from the multi-viewpoint captured image, which is the depth estimation target, to the projection image group by the third image conversion unit is also performed by the same process as described above. Further, the process in the conversion from the multi-viewpoint background image to the background projection image group by the fourth image conversion unit is also performed by the same process as described above.

＜学習パラメータ算出部の動作＞
以下、学習パラメータ算出部１２の動作について、図面を参照しながら説明する。
図７は、本発明の実施形態における画像合成装置１の学習パラメータ算出部１２の動作を示すフローチャートである。
本フローチャートは、学習パラメータ算出部１２に、教師データ生成部１１から出力された教師デプス画像群と、第１画像変換部１４から出力された教師投影画像群と、第２画像変換部１５から出力された教師背景投影画像群と、ネットワーク構造情報とが入力された際に開始する。 <Operation of learning parameter calculation unit>
Hereinafter, the operation of the learning parameter calculation unit 12 will be described with reference to the drawings.
FIG. 7 is a flowchart showing the operation of the learning parameter calculation unit 12 of the image composition device 1 according to the embodiment of the present invention.
This flowchart starts when the teacher depth image group output from the teacher data generation unit 11, the teacher projection image group output from the first image conversion unit 14, the teacher background projection image group output from the second image conversion unit 15, and the network structure information are input to the learning parameter calculation unit 12.

（ステップＳ３０１）学習パラメータ算出部１２は、教師データ生成部１１から出力された教師デプス画像群、第１画像変換部１４から出力された教師投影画像群、第２画像変換部１５から出力された教師背景投影画像群、及びネットワーク構造情報を取得する。学習パラメータ算出部１２は、取得したネットワーク構造情報に基づいて、ディープニューラルネットワーク（ＤＮＮ）を構築する。その後、ステップＳ３０２へ進む。 (Step S301) The learning parameter calculation unit 12 acquires the teacher depth image group output from the teacher data generation unit 11, the teacher projection image group output from the first image conversion unit 14, the teacher background projection image group output from the second image conversion unit 15, and the network structure information. The learning parameter calculation unit 12 constructs a deep neural network (DNN) based on the acquired network structure information. Thereafter, the process proceeds to step S302.

（ステップＳ３０２）学習パラメータ算出部１２は、取得した教師投影画像群の中の教師投影画像をＤＮＮに入力する。その後、ステップＳ３０３へ進む。 (Step S302) The learning parameter calculation unit 12 inputs a teacher projection image in the acquired teacher projection image group to the DNN. Thereafter, the process proceeds to step S303.

（ステップＳ３０３）学習パラメータ算出部１２は、ＤＮＮから出力される教師デプス画像を取得する。その後、ステップＳ３０４へ進む。 (Step S303) The learning parameter calculation unit 12 acquires a teacher depth image output from the DNN. Thereafter, the process proceeds to step S304.

（ステップＳ３０４）学習パラメータ算出部１２は、各層のパラメータを更新する。その後、ステップＳ３０５へ進む。 (Step S304) The learning parameter calculation unit 12 updates the parameters of each layer. Thereafter, the process proceeds to step S305.

（ステップＳ３０５）学習パラメータ算出部１２は、取得した教師投影画像群に含まれる全ての教師投影画像に対して処理が完了したか否かを判定する。学習パラメータ算出部１２が、全ての教師投影画像に対して処理が完了したと判定した場合、ステップＳ３０６へ進む。そうでない場合、ステップＳ３０２へ戻り、処理が完了いない教師投影画像に対して処理を実行する。 (Step S305) The learning parameter calculation unit 12 determines whether or not processing has been completed for all teacher projection images included in the acquired teacher projection image group. If the learning parameter calculation unit 12 determines that the processing has been completed for all the teacher projection images, the process proceeds to step S306. Otherwise, the process returns to step S302, and the process is executed on the teacher projection image for which the process has not been completed.

（ステップＳ３０６）学習パラメータ算出部１２は、学習パラメータを保存する。
以上で、本フローチャートの処理が終了する。 (Step S306) The learning parameter calculation unit 12 stores the learning parameter.
This completes the processing of this flowchart.

上記の、学習パラメータ算出部１２の動作について、図８を参照して更に具体的に説明する。
まず、学習パラメータ算出部１２は、ネットワーク構造情報に基づいて、ディープニューラルネットワーク（ＤＮＮ）を構築する。なお、ＤＮＮの構造は、図８に示す構造に限られるものではなく、例えば、その層数などについては適宜変更してもかまわない。また、図８において「畳込層」と記載しているものについては、ＤＮＮにおける一般的な畳み込みと同義であり、畳み込み演算と活性化関数とを併用しているものとする。また、画像サイズを調整するためのZero Padding層については、表記を省略している。 The operation of the learning parameter calculation unit 12 will be described more specifically with reference to FIG.
First, the learning parameter calculation unit 12 constructs a deep neural network (DNN) based on the network structure information. Note that the DNN structure is not limited to the structure shown in FIG. 8. For example, the number of layers may be appropriately changed. Further, what is described as “convolution layer” in FIG. 8 is synonymous with general convolution in DNN, and it is assumed that a convolution operation and an activation function are used in combination. Further, the description of the Zero Padding layer for adjusting the image size is omitted.

図８において、入力層は、各教師投影画像、及び教師背景投影画像と同一解像度のＲＧＢ３チャネルを、視点数と奥行き平面数の積の数だけ積み重ねたものであり、出力層は、同一解像度の１チャネルとしている。図８において、Ｉ_ｄ，ｃは、上述した第１画像変換部１４の動作の説明における表記と同様の表記であり、投影画像群の中の一枚の投影画像を表している。同様に、Ｂ_ｄ，ｃは、教師背景投影画像群の中の一枚を表している。 In FIG. 8, the input layer is a stack of three-channel RGB images with the same resolution as each teacher projection image and teacher background projection image, one for each combination of viewpoint and depth plane (the number of viewpoints multiplied by the number of depth planes), and the output layer is a single channel with the same resolution. In FIG. 8, I_{d,c} uses the same notation as in the description of the first image conversion unit 14 above and denotes one projection image in the projection image group. Similarly, B_{d,c} denotes one image in the teacher background projection image group.

このようにして、視点数と奥行数の積の数だけ、Ｉ_ｄ，ｃ、及びＢ_ｄ，ｃを入力層に設定し、中間層で連結していくことによって、多視点画像中の幾何制約をかけた状態のデータを入力することができる。これにより、ＤＮＮが、中間層において幾何制約を学習する必要がなくなるため、より効率的に学習を行うことができるうえ、背景画像との差分情報を参照することができるようになる。これによって、本実施形態における画像合成装置１は、被写体の境界部分でのデプス推定精度を向上させることができる。 In this way, by setting I_{d,c} and B_{d,c} in the input layer, as many as the product of the number of viewpoints and the number of depth planes, and concatenating them in the intermediate layers, data to which the geometric constraints of the multi-viewpoint images have already been applied can be input. Because the DNN then does not need to learn the geometric constraints in its intermediate layers, learning becomes more efficient, and the DNN can also refer to difference information with respect to the background image. As a result, the image composition device 1 of the present embodiment can improve depth estimation accuracy at the boundary of the subject.
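The size of the input described above can be made concrete with a small NumPy sketch that arranges the RGB channels of I_{d,c} (or B_{d,c}) for all depth planes and viewpoints into one array. The dictionary keyed by (d, c), the channel-last layout, and the name `stack_planes` are assumptions for illustration; how the two stacks are actually concatenated inside the network is determined by the structure of FIG. 8, which is not reproduced here.

```python
import numpy as np

def stack_planes(images, num_planes, num_views):
    """Concatenate per-(d, c) RGB images along the channel axis.

    `images` maps (depth index d, camera index c) to an array of shape (H, W, 3).
    The result has shape (H, W, 3 * num_planes * num_views).
    """
    ordered = [images[(d, c)] for d in range(num_planes) for c in range(num_views)]
    return np.concatenate(ordered, axis=-1)
```

For the example of FIG. 6, with 3 viewpoints and 5 depth planes, each stack holds 15 RGB images, that is, 45 channels for the projection images and another 45 for the background projection images.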

次に、学習パラメータ算出部１２は、構築したＤＮＮに対し、入力層を教師投影画像群、及び教師背景投影画像群として設定し、出力層を教師デプス画像として設定して、ＤＮＮの学習を実行する。学習の方法は、一般的なＤＮＮの学習方法を用いればよいが、例えば、ＤＮＮの各層の学習パラメータの更新を、誤差逆伝搬法を利用して行う方法がある。そして、学習パラメータ算出部１２は、全ての教師画像とデプス画像の対に対して、学習パラメータの更新を行い、学習パラメータを保存する。そして、学習パラメータ算出部１２は、ネットワーク構造とともに、学習パラメータを、学習済みネットワーク情報として画像合成部１３へ出力する。 Next, for the constructed DNN, the learning parameter calculation unit 12 sets the teacher projection image group and the teacher background projection image group as the input layer and the teacher depth image as the output layer, and trains the DNN. Any general DNN learning method may be used; one example is to update the learning parameters of each layer of the DNN using error backpropagation. The learning parameter calculation unit 12 updates the learning parameters for all pairs of teacher images and depth images and stores the learning parameters. It then outputs the learning parameters, together with the network structure, to the image composition unit 13 as learned network information.

＜画像合成部の動作＞
以下、画像合成部１３の動作について、図面を参照しながら説明する。
図９は、本発明の実施形態における画像合成装置１の画像合成部１３の動作を示すフローチャートである。
本フローチャートは、学習パラメータ算出部１２に、画像合成部１３に、学習パラメータ算出部１２から出力された学習済みネットワーク情報と、第３画像変換部１６から出力された投影画像群と、第４画像変換部１７から出力された背景投影画像群とが入力された際に開始する。 <Operation of image composition unit>
Hereinafter, the operation of the image composition unit 13 will be described with reference to the drawings.
FIG. 9 is a flowchart showing the operation of the image composition unit 13 of the image composition apparatus 1 according to the embodiment of the present invention.
This flowchart starts when the learned network information output from the learning parameter calculation unit 12, the projection image group output from the third image conversion unit 16, and the background projection image group output from the fourth image conversion unit 17 are input to the image composition unit 13.

（ステップＳ４０１）画像合成部１３は、学習パラメータ算出部１２から出力された学習済みネットワーク情報と、第３画像変換部１６から出力された投影画像群と、第４画像変換部１７から出力された背景投影画像群と、を取得する。その後、ステップＳ４０２へ進む。 (Step S401) The image composition unit 13 acquires the learned network information output from the learning parameter calculation unit 12, the projection image group output from the third image conversion unit 16, and the background projection image group output from the fourth image conversion unit 17. Thereafter, the process proceeds to step S402.

（ステップＳ４０２）画像合成部１３は、ステップＳ４０１において取得した投影画像群、及び背景投影画像群を入力層として、学習済みネットワークを用いて出力層の計算を行う。そして、画像合成部１３は、出力層の計算に基づいて合成されたデプス画像を出力する。
以上で、本フローチャートの処理が終了する。 (Step S402) The image composition unit 13 calculates the output layer using the learned network using the projection image group and the background projection image group acquired in step S401 as the input layer. Then, the image synthesis unit 13 outputs a depth image synthesized based on the calculation of the output layer.
This completes the processing of this flowchart.

以上説明したように、本実施形態における画像合成装置１は、同一の被写体を注視して近接して設置された複数のカメラによって撮影された画像を用いて、当該複数のカメラの中の任意の１台のカメラによって撮影された画像のデプス画像を出力する。 As described above, the image synthesizing apparatus 1 according to the present embodiment uses an image photographed by a plurality of cameras installed close to each other while gazing at the same subject, and arbitrarily selects one of the plurality of cameras. Output a depth image of an image taken by one camera.

本実施形態における画像合成装置１は、事前にＣＧにより意図的に条件を変えてレンダリングした多視点画像と事前に撮影した多視点背景画像とを重畳した画像を投影画像群に変換して、当該投影画像群をＤＮＮの教師画像データとする。そして、画像合成装置１は、教師画像データをディープニューラルネットワーク（ＤＮＮ）に入力し学習させる。
そして、画像合成装置１は、学習がなされたＤＮＮに対して、デプス推定したい画像を含む多視点撮影画像とカメラパラメータとを入力することにより、デプス画像を合成し出力する。 The image composition device 1 of the present embodiment superimposes multi-viewpoint images rendered in advance by CG under intentionally varied conditions on multi-viewpoint background images captured in advance, converts the resulting images into a projection image group, and uses that projection image group as teacher image data for the DNN. The image composition device 1 then inputs the teacher image data to the deep neural network (DNN) to train it.
Then, the image composition apparatus 1 synthesizes and outputs the depth image by inputting the multi-viewpoint captured image including the image whose depth is to be estimated and the camera parameter to the learned DNN.

本実施形態における画像合成装置１は、背景画像を入力画像と併せて入力してＤＮＮの学習を行うため、デプス画像において被写体領域の輪郭部分の精度をより高精度にすることができる。
また、本実施形態における画像合成装置１は、ＣＧを用いて教師画像データを生成するため、実写では用意することが困難な大量の教師データであっても、より容易に生成することができる。 Since the image composition device 1 according to the present embodiment inputs the background image together with the input image and performs DNN learning, the accuracy of the contour portion of the subject area in the depth image can be made higher.
In addition, since the image composition device 1 according to the present embodiment generates teacher image data using CG, even a large amount of teacher data that is difficult to prepare in actual shooting can be generated more easily.

また、本実施形態における画像合成装置１は、照明の状態を意図的に変化させて生成したＣＧのレンダリング画像を教師画像データとするため、とくに推定精度が低い領域である被写体の影領域においても誤検出を防ぐことができ、影領域の推定精度をより向上させることができる。
また、本実施形態における画像合成装置１は、撮影会場の撮影位置と同位置における実写の背景画像に対してＣＧを重畳させる形でレンダリングを行って教師画像データを生成するため、撮影会場に特化した学習を行うことができ、より頑健なデプス画像の合成を行うことができる。 In addition, since the image composition device 1 according to the present embodiment uses the CG rendering image generated by intentionally changing the illumination state as the teacher image data, the image composition device 1 also in the shadow region of the subject, which is a region with low estimation accuracy. False detection can be prevented, and the shadow region estimation accuracy can be further improved.
In addition, because the image composition device 1 of the present embodiment generates the teacher image data by rendering the CG superimposed on live-action background images captured at the same positions as the shooting positions in the shooting venue, it can perform learning specialized to the shooting venue and can synthesize depth images more robustly.

以上説明したように、本実施形態における画像合成装置１は、教師データ生成のコストを大幅に削減しつつ、影領域でも精確に、かつ、任意背景下において背景変動に対しても頑健に、さらに、被写体の輪郭部分においても精度の高いデプス画像を合成することができる。
本実施形態における画像合成装置１は、静止画像または動画像から被写体の奥行情報を表すデプス画像を取得することが不可欠な用途に対して適用できる。 As described above, the image synthesizing apparatus 1 according to the present embodiment significantly reduces the cost of teacher data generation, is accurate even in shadow areas, and is robust against background fluctuations under an arbitrary background. Therefore, it is possible to synthesize a depth image with high accuracy even in the contour portion of the subject.
The image synthesizing apparatus 1 according to the present embodiment can be applied to a use in which it is indispensable to acquire a depth image representing depth information of a subject from a still image or a moving image.

上述した実施形態における画像合成装置１の少なくとも１部をコンピュータで実現するようにしてもよい。その場合、この機能を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することによって実現してもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ−ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、短時間の間、動的にプログラムを保持するもの、その場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含んでもよい。また上記プログラムは、前述した機能の一部を実現するためのものであってもよく、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであってもよく、ＦＰＧＡ（Field Programmable Gate Array）等のプログラマブルロジックデバイスを用いて実現されるものであってもよい。 At least a part of the image composition device 1 in the embodiment described above may be realized by a computer. In that case, a program for realizing this function may be recorded on a computer-readable recording medium, and the program recorded on this recording medium may be read into a computer system and executed. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" may also include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as a volatile memory inside a computer system serving as a server or a client in that case. The program may realize only a part of the functions described above, may realize those functions in combination with a program already recorded in the computer system, or may be realized using a programmable logic device such as an FPGA (Field Programmable Gate Array).

以上、この発明の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 The embodiment of the present invention has been described in detail with reference to the drawings. However, the specific configuration is not limited to this embodiment, and includes designs and the like that do not depart from the gist of the present invention.

１・・・画像合成装置、１１・・・教師データ生成部、１２・・・学習パラメータ算出部、１３・・・画像合成部、１４・・・第１画像変換部、１５・・・第２画像変換部、１６・・・第３画像変換部、１７・・・第４画像変換部 DESCRIPTION OF SYMBOLS 1 ... Image composition apparatus, 11 ... Teacher data generation part, 12 ... Learning parameter calculation part, 13 ... Image composition part, 14 ... 1st image conversion part, 15 ... 2nd Image conversion unit, 16 ... third image conversion unit, 17 ... fourth image conversion unit

Claims

教師多視点背景画像と３Ｄ形状モデルに基づく被写体を示す画像を含むＣＧデータとを重畳させて多視点の教師画像データを生成する教師データ生成部と、
前記教師画像データを用いて機械学習を行い、学習パラメータを算出する学習パラメータ算出部と、
多視点撮影画像と、前記学習パラメータとを用いて、前記機械学習により前記多視点撮影画像に対するデプス画像を生成する画像合成部と、
を備える画像合成装置。 An image synthesizer comprising:
a teacher data generation unit that generates multi-view teacher image data by superimposing a teacher multi-view background image and CG data including an image showing a subject based on a 3D shape model;
a learning parameter calculation unit that performs machine learning using the teacher image data and calculates a learning parameter; and
an image synthesizing unit that generates a depth image for a multi-viewpoint captured image by the machine learning, using the multi-viewpoint captured image and the learning parameter.
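The superimposition performed by the teacher data generation unit can be illustrated with a minimal sketch. It assumes, hypothetically, that rendering the 3D shape model for each viewpoint yields a color image, a subject mask, and a per-pixel depth; the names `CgRender`, `make_teacher_view`, and `make_teacher_sample`, and the use of infinity as the background depth, are illustrative choices and are not taken from the claims.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]


@dataclass
class CgRender:
    """Hypothetical result of rendering the 3D shape model from one viewpoint."""
    color: List[List[Pixel]]   # rendered subject colors
    mask: List[List[bool]]     # True where the subject covers the pixel
    depth: List[List[float]]   # subject depth at each pixel


def make_teacher_view(background, render, background_depth=float("inf")):
    """Superimpose one CG render on one background view.

    Returns (teacher image, depth map). Pixels outside the subject keep the
    background color and receive `background_depth` (an assumption).
    """
    image, depth = [], []
    rows = zip(background, render.color, render.mask, render.depth)
    for bg_row, color_row, mask_row, depth_row in rows:
        image.append([c if m else b
                      for b, c, m in zip(bg_row, color_row, mask_row)])
        depth.append([d if m else background_depth
                      for d, m in zip(depth_row, mask_row)])
    return image, depth


def make_teacher_sample(backgrounds, renders):
    """Build one multi-view teacher sample: one (image, depth) pair per view."""
    if len(backgrounds) != len(renders):
        raise ValueError("need exactly one CG render per background view")
    return [make_teacher_view(b, r) for b, r in zip(backgrounds, renders)]
```

Because the subject comes from CG data, the depth paired with each teacher image is available without manual annotation, which is one way to read the stated reduction in teacher data generation cost.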

前記機械学習は、ディープニューラルネットワークを用いた機械学習である
請求項１に記載の画像合成装置。 The image synthesizer according to claim 1, wherein the machine learning is machine learning using a deep neural network.

前記ＣＧデータは、被写体の姿勢、前記被写体の位置、及び照明の状態うち、少なくとも１つを指定する情報を含むデータである
請求項１に記載の画像合成装置。 The image synthesizer according to claim 1, wherein the CG data is data including information specifying at least one of a posture of the subject, a position of the subject, and a lighting state.

前記ディープニューラルネットワークは、前記多視点撮影画像に基づく投影画像群と、前記多視点撮影画像と同一の視点で撮影された多視点背景画像に基づく背景投影画像群と、をそれぞれ入力する入力層を有する
請求項２に記載の画像合成装置。 The image synthesizer according to claim 2, wherein the deep neural network has an input layer that receives each of a projection image group based on the multi-viewpoint captured image and a background projection image group based on a multi-viewpoint background image captured from the same viewpoints as the multi-viewpoint captured image.
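One simple way to present both image groups to a network is per-pixel channel concatenation, sketched below. This is only one possible reading: the claim could equally describe separate input branches. The function name `stack_input_groups`, the nested-list image layout, and the requirement that both groups contain the same number of images are assumptions made for illustration.

```python
from typing import List, Sequence

# An image is stored as nested lists: height x width x channels.
Image = List[List[Sequence[float]]]


def stack_input_groups(projections: List[Image],
                       background_projections: List[Image]) -> Image:
    """Concatenate, per pixel, the channels of every image in both groups.

    The projection images come first, followed by the background projection
    images, so the network sees both groups at every pixel location.
    """
    if not projections:
        raise ValueError("at least one projection image is required")
    if len(projections) != len(background_projections):
        raise ValueError("both groups must contain the same number of images")
    images = list(projections) + list(background_projections)
    height, width = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all images must share the same height and width")
    return [
        [[value for img in images for value in img[y][x]]
         for x in range(width)]
        for y in range(height)
    ]
```

Supplying the background projections alongside the captured projections gives the network a per-pixel reference for what the scene looks like without the subject.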

コンピュータによる画像合成方法であって、
教師データ生成部が、教師多視点背景画像と３Ｄ形状モデルに基づく被写体を示す画像を含むＣＧデータとを重畳させて多視点の教師画像データを生成する教師データ生成ステップと、
学習パラメータ算出部が、前記教師画像データを用いて機械学習を行い、学習パラメータを算出する学習パラメータ算出ステップと、
画像合成部が、多視点撮影画像と、前記学習パラメータとを用いて、前記機械学習により前記多視点撮影画像に対するデプス画像を生成する画像合成ステップと、
を有する画像合成方法。 An image synthesizing method performed by a computer, the method comprising:
a teacher data generation step in which a teacher data generation unit generates multi-view teacher image data by superimposing a teacher multi-view background image and CG data including an image showing a subject based on a 3D shape model;
a learning parameter calculation step in which a learning parameter calculation unit performs machine learning using the teacher image data and calculates a learning parameter; and
an image synthesizing step in which an image synthesizing unit generates a depth image for a multi-viewpoint captured image by the machine learning, using the multi-viewpoint captured image and the learning parameter.

コンピュータに、
教師多視点背景画像と３Ｄ形状モデルに基づく被写体を示す画像を含むＣＧデータとを重畳させて多視点の教師画像データを生成する教師データ生成ステップと、
前記教師画像データを用いて機械学習を行い、学習パラメータを算出する学習パラメータ算出ステップと、
多視点撮影画像と、前記学習パラメータとを用いて、前記機械学習により前記多視点撮影画像に対するデプス画像を生成する画像合成ステップと、
を実行させるための画像合成プログラム。 An image synthesizing program for causing a computer to execute:
a teacher data generation step of generating multi-view teacher image data by superimposing a teacher multi-view background image and CG data including an image showing a subject based on a 3D shape model;
a learning parameter calculation step of performing machine learning using the teacher image data and calculating a learning parameter; and
an image synthesizing step of generating a depth image for a multi-viewpoint captured image by the machine learning, using the multi-viewpoint captured image and the learning parameter.