CN108830890B - Method for estimating scene geometric information from a single image using a generative adversarial network - Google Patents

Method for estimating scene geometric information from a single image using a generative adversarial network

Info

Publication number
CN108830890B
CN108830890B CN201810376281.9A CN201810376281A CN108830890B
Authority
CN
China
Prior art keywords
image
depth
pseudo
sample
depth image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810376281.9A
Other languages
Chinese (zh)
Other versions
CN108830890A (en)
Inventor
李俊
黄韬
张露娟
马震远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qichen Guangzhou Electronic Technology Co ltd
Original Assignee
Qichen Guangzhou Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qichen Guangzhou Electronic Technology Co., Ltd.
Priority to CN201810376281.9A
Publication of CN108830890A
Application granted
Publication of CN108830890B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a method for estimating scene geometric information from a single image using a generative adversarial network, comprising the following step: inputting an image of the scene and the depths of a small number of pixels in that image into a trained generative neural network to obtain a depth image of the scene. The depth of a pixel is the distance between the observer and the point in the scene corresponding to that pixel, and the depth image is the collection of the depths of all pixels in an image. Taking the scene image and the depths of a few corresponding pixels as input, the invention predicts (estimates) the depth image of the scene with a generative adversarial network under dual consistency constraints; the method is simple, effective and low in cost.

Description

Method for estimating scene geometric information from a single image using a generative adversarial network
Technical Field
The invention belongs to the field of computer image processing and relates to a method for estimating scene geometric information from a single image, in particular to a method for estimating scene geometric information from a single image using a generative adversarial network.
Background
Depth prediction and estimation are very important in engineering applications such as robotics, autonomous driving, augmented reality (AR) and 3D modeling. At present there are two main ways to acquire depth images: direct ranging and indirect ranging. Direct ranging acquires depth information directly with hardware devices. For example, time-of-flight (TOF) cameras emit continuous near-infrared pulses to measure the distance from objects in the target scene to the emitter; lidar scans objects in the scene with laser pulses to obtain the distance from the object surfaces to the sensor; and the Kinect uses light-coding technology, projecting a pattern onto the scene with an infrared emitter to obtain three-dimensional depth information. Each, however, has its limitations: TOF cameras are typically expensive and susceptible to noise; the three-dimensional information captured by lidar is uneven and sparse when mapped into the color image coordinate system, and the hardware is costly; the Kinect has a short measuring range and is easily affected by ambient light, producing a large amount of noise.
Indirect ranging estimates depth indirectly from one or more visible-light images of the same scene. Depending on the number of scene viewpoints, these methods fall into multi-view depth estimation, binocular (stereo) depth estimation and monocular depth estimation. Multi-view depth estimation typically uses a camera array to capture the same scene and exploits the redundant information between the viewpoint images to compute a depth image. It can yield a relatively accurate depth image of the scene, but the camera array is expensive, troublesome to configure and demanding to shoot with, so it is rarely used in practice. Binocular depth estimation computes depth information through stereo matching, using the disparity between two cameras arranged like human eyes. Monocular depth estimation uses only the video sequence or images of a single viewpoint.
Because of these limitations, depth estimation with a single camera has attracted strong interest: such cameras are small, inexpensive, energy-efficient and ubiquitous in consumer electronics.
In recent years, with the development of deep learning, researchers have increasingly studied monocular depth estimation with convolutional neural networks (CNNs). Saxena et al., using a supervised learning approach, model the depths of individual points and the relationships between the depths of different points with a Markov random field (MRF) built on multi-scale local and global image features.
CN107578436A discloses a monocular image depth estimation method based on a fully convolutional network (FCN). Its steps are: acquire training image data; feed the training data into the FCN and obtain feature maps from the successive pooling layers; upsample the feature map output by the last pooling layer to the size of the previous pooling layer's output and fuse the two; fuse the outputs of the pooling layers in this way from back to front to obtain the final predicted depth image; during training, optimize the FCN parameters with stochastic gradient descent (SGD); finally, input an RGB image requiring depth prediction into the trained FCN to obtain the corresponding depth prediction. The fully convolutional structure removes the fully connected layers, which effectively reduces the number of network parameters and mitigates the low resolution of the output caused by repeated convolution, but the method requires very large training sets and long training times.
Disclosure of Invention
To solve the above problems, the present invention provides a method for estimating scene geometric information from a single image using a generative adversarial network, the method comprising:
inputting the image of the scene and the depths of several pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel is the distance between the observer and the point in the scene corresponding to that pixel in the image, and the depth image is the collection of the depths of all pixels in an image;
the training of the generative neural network comprises the following steps:
step A: collecting a training data set: the training data set comprises a number of samples, each sample being an image and its corresponding depth image;
step B: constructing a generative adversarial network architecture comprising two generative neural networks (F and G) and two discriminative neural networks (D_X and D_Y);
step C: inputting the image in the sample and the depths of several pixels of its depth image to G to obtain a corresponding pseudo depth image; inputting the depth image in the sample to F to obtain a corresponding pseudo image; a pseudo image or pseudo depth image is data generated by a computer model rather than actually captured or measured;
step D: the discriminative neural network D_X discriminates the image in the sample and/or the pseudo image from step C, and the discriminative neural network D_Y discriminates the depth image in the sample and/or the pseudo depth image from step C;
step E: adjusting D_X and D_Y to reduce the discrimination loss in step D;
step F: calculating the difference loss between the depth image in the sample and the pseudo depth image generated by G in step C, and the difference loss between the image in the sample and the pseudo image generated by F;
step G: adjusting G and F to reduce the difference losses in step F and to increase the discrimination loss on the pseudo image and the pseudo depth image in step D;
step H: returning to step C and iterating until a preset iteration condition is met, and saving the generative neural network G at that point as the final generative neural network.
In an embodiment of the present invention, step C specifically comprises:
inputting the image in the sample and the depths of several pixels of its depth image to G to obtain a corresponding pseudo depth image, and then inputting the pseudo depth image to F to obtain a pseudo-restored image;
inputting the depth image in the sample to F to obtain a corresponding pseudo image, and then inputting the pseudo image and the depths of the several pixels of the depth image in the sample to G to obtain a pseudo-restored depth image; a pseudo-restored image or pseudo-restored depth image is data generated by a computer model whose input is itself data generated by another computer model.
In one embodiment of the invention, step F further comprises: calculating the difference loss between the depth image in the sample and the pseudo-restored depth image, and the difference loss between the image in the sample and the pseudo-restored image.
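The dual consistency just described can be summarized in a short sketch. The following Python/PyTorch fragment is illustrative only: the tensor shapes, the 4-channel concatenation of the image X with a dense map of the sparse depths Y_s, and the helper names are assumptions, not part of the patent.

```python
import torch

def dual_consistency_forward(G, F, x, y_s):
    """x: (N, 3, H, W) RGB image X; y_s: (N, 1, H, W) sparse depths Y_s (zeros where unmeasured)."""
    y_fake = G(torch.cat([x, y_s], dim=1))           # pseudo depth image      Y' = G(X, Y_s)
    x_restored = F(y_fake)                           # pseudo-restored image   F(Y')
    return y_fake, x_restored

def dual_consistency_backward(G, F, y, y_s):
    """y: (N, 1, H, W) measured depth image Y."""
    x_fake = F(y)                                    # pseudo image            X' = F(Y)
    y_restored = G(torch.cat([x_fake, y_s], dim=1))  # pseudo-restored depth   G(X', Y_s)
    return x_fake, y_restored
```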
In one embodiment of the invention, the generative adversarial network performs inference on the data according to the following Bayesian probabilistic model:
(The Bayesian probabilistic model appears as an equation image in the original publication.)
where X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, G is the generative neural network that generates a depth image from the image, F is the generative neural network that generates an image from the depth image, D_X is the discriminative neural network that discriminates the authenticity of images, D_Y is the discriminative neural network that discriminates the authenticity of depth images and outputs the probability that a depth image is real, Y' is the pseudo depth image generated by the generative neural network G, and X' is the pseudo-restored image generated by the generative neural network F.
In one embodiment of the present invention, the loss function of the generative neural networks G and F is:
L_G = L_GAN + λ_1·L_REC + λ_2·L_SSC,
L_GAN = E_Y[log D_Y(Y)] + E_X[log(1 - D_Y(G(X)))],
L_REC(X, G, F) = E_X[||X - F(G(X))||_1] + E_Y[||Y - G(F(Y))||_1],
(The definition of L_SSC appears as an equation image in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, L_G is the loss function of the generative neural networks G and F, L_GAN is the adversarial loss, L_REC is the restoration (reconstruction) loss, L_SSC is the loss between the measured depths of the several pixels of the depth image in the sample and the corresponding pixels of the pseudo depth image generated by G, λ_1 is the weight coefficient of L_REC and λ_2 is the weight coefficient of L_SSC. Preferably, λ_1 and λ_2 lie between 0 and 10.
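A minimal Python/PyTorch sketch of this generator objective follows. The adversarial term is written in the common non-saturating binary cross-entropy form, and because the exact L_SSC formula is only given as an image in the record, it is assumed here to be an L1 penalty restricted to the measured pixels (marked by a hypothetical mask tensor); the default weights follow the preferred values λ_1 = λ_2 = 10 given later in the description.

```python
import torch
import torch.nn.functional as nnf

def generator_loss(d_y, x, y, y_s, mask, y_fake, x_rec, y_rec, lam1=10.0, lam2=10.0):
    """Sketch of L_G = L_GAN + λ1·L_REC + λ2·L_SSC (names and L_SSC form assumed)."""
    # Adversarial term: push D_Y toward calling the pseudo depth image real.
    p_fake = d_y(y_fake)
    l_gan = nnf.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
    # Restoration (cycle) term: X should match F(G(X)) and Y should match G(F(Y)).
    l_rec = (x - x_rec).abs().mean() + (y - y_rec).abs().mean()
    # Sparse-sample consistency: generated depths must match the measured pixels Y_s.
    l_ssc = ((y_fake - y_s).abs() * mask).sum() / mask.sum().clamp(min=1)
    return l_gan + lam1 * l_rec + lam2 * l_ssc
```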
Further, the loss functions of the discriminative neural networks D_X and D_Y are:
(The discriminator loss functions appear as equation images in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, L_DX is the discrimination loss function of the discriminative neural network D_X, and L_DY is the discrimination loss function of the discriminative neural network D_Y.
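Because the discriminator loss formulas themselves appear only as images, the sketch below assumes the standard binary cross-entropy objectives that pair with the L_GAN term above; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as nnf

def discriminator_loss(d_x, d_y, x, y, x_fake, y_fake):
    """Assumed objectives for D_X and D_Y: real samples toward 1, generated samples toward 0."""
    def bce_real_fake(d, real, fake):
        p_real, p_fake = d(real), d(fake.detach())   # detach so the D update does not touch G or F
        return (nnf.binary_cross_entropy(p_real, torch.ones_like(p_real)) +
                nnf.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
    return bce_real_fake(d_x, x, x_fake), bce_real_fake(d_y, y, y_fake)
```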
Further, the generative neural networks G and F are fully convolutional neural networks comprising convolutional layers, residual network layers and deconvolution layers. Preferably, the residual network has 9 to 21 layers (residual modules).
The terms used in the present invention are explained as follows:
Scene: a specific picture formed by objective objects or things occurring at a certain time and place, generally an indoor or outdoor scene.
Observer: the sensor/processor device set used for scene measurement. A point (which may lie inside or outside the device) uniquely defined by the position and attitude of the device set in the 3-dimensional physical world serves as the reference measurement point for the depths referred to herein.
Depth: the depth of a point in the scene is defined as its geometric distance from the observer position.
Actual measurement point set: several points in the scene whose geometric distances to the observer are measured with an instrument; these distances are considered known in the present method.
Matrix: a two-dimensional data table arranged in rows and columns (J. J. Sylvester, "Additions to the articles in the September number of this journal, 'On a new class of theorems,' and on Pascal's theorem," The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 1850, 37: 363). By extension, image matrices (images for short) are defined as matrices.
Neural network: an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed, parallel information processing.
Generative network (Generative Net): herein, a neural network whose output takes the form of a data matrix.
Real/fake data: the data may be an image/matrix; real data is sensor data obtained by measuring a scene, and fake data is data of similar format output by a configured generative network.
FCN: Fully Convolutional Network, a neural network in which all layers are convolutional layers.
The invention has the following beneficial effects:
1. The invention provides a method for estimating scene geometric information from a single image using a generative adversarial network; it takes the scene image and a small amount of corresponding depth information in that image as input, requires little measurement data, is simple and effective, and reduces the cost of measuring scene depth.
2. The generative network is optimized with a residual network, which increases the number of layers of the generative network and improves its performance and accuracy.
Drawings
FIG. 1a is an example of an image, denoted X in the present invention;
FIG. 1b is an example of the measured depths of several pixels, denoted Y_s in the present invention;
FIG. 1c is an example of the estimated depth image, denoted Y' in the present invention;
FIG. 2 is a flow chart of the image depth estimation method of the present invention;
FIG. 3 is a basic schematic diagram of the image depth estimation of the present invention;
FIG. 4 is a basic structural diagram of the generative adversarial network of the present invention;
FIG. 5 is a basic structural diagram of the generative network in an embodiment of the present invention;
FIG. 6 is a basic structural diagram of the discriminative network in an embodiment of the present invention.
Detailed Description
To better understand the technical solution proposed by the present invention, the invention is further explained below with reference to the drawings and specific embodiments.
Generally, to acquire geometric information, in particular depth information, of a scene, people capture a depth image of the scene with a Kinect camera; however, the Kinect has a short measuring range and the depth information it acquires is sparse, so multiple measurements are needed to obtain the full depth information of the scene. We therefore wish to estimate the full depth information of the scene from an RGB image taken by an ordinary camera together with the sparse depth image captured by a Kinect. The ordinary camera and the Kinect must capture the RGB image and the depth image from the same position and angle.
The method comprises inputting the image of the scene and the depths of several pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel is the distance between the observer and the point in the scene corresponding to that pixel, and the depth image is the collection of the depths of all pixels in an image.
The specific implementation steps of the deployment phase are as follows:
Step 1: acquire an RGB image X of the scene (FIG. 1a) and measure in real time the depths Y_s of 300 to 400 pixels of the image (typically no more than 1% of all pixels of the image, FIG. 1b);
Step 2: input the RGB image X and the pixel depths Y_s into the trained generative network G;
Step 3: run the generative neural network G in a feed-forward pass;
Step 4: take the feed-forward output of G as the pseudo depth image Y' (FIG. 1c); Y' is then the prediction of the depth image Y corresponding to X.
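A minimal deployment-stage sketch is given below. It assumes a PyTorch generator G whose input is the RGB image concatenated with a dense sparse-depth channel (zeros at the unmeasured pixels); the function name and tensor shapes are illustrative, not taken from the patent.

```python
import torch

def estimate_depth(generator_g, rgb_image, sparse_depth):
    """rgb_image: (1, 3, H, W) tensor X; sparse_depth: (1, 1, H, W) tensor Y_s with zeros
    at unmeasured pixels (300-400 measured pixels, under 1% of the image)."""
    generator_g.eval()
    with torch.no_grad():                                            # feed-forward pass only
        y_hat = generator_g(torch.cat([rgb_image, sparse_depth], dim=1))
    return y_hat                                                     # pseudo depth image Y', the prediction of Y
```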
The generative adversarial network therefore has to be trained before the deployment phase.
As shown in FIG. 2, in an embodiment of the present invention the training of the generative neural network comprises the following steps:
Step A: collect a training data set: the training data set comprises a number of samples, each sample being a triplet (RGB image X, depth image Y, pixel depths Y_s);
Step B: construct a generative adversarial network architecture comprising two generative neural networks (F and G) and two discriminative neural networks (D_X and D_Y), where G is the network that generates the depth image in the deployment-phase steps above;
Step C: input X and Y_s from the sample to G to obtain the corresponding pseudo depth image Y' (the same operation as in the deployment phase); input the depth image Y from the sample to F to obtain the corresponding pseudo image X';
Step D: the discriminative neural network D_X discriminates between the image X in the sample and the pseudo image X' generated by F in step C, and the discriminative neural network D_Y discriminates between the depth image Y in the sample and the pseudo depth image Y' generated by G in step C;
Step E: adjust D_X and D_Y to reduce the discrimination loss in step D;
Step F: calculate the difference loss between the image X in the sample and the pseudo image X' generated by F in step C, and the difference loss between the depth image Y in the sample and the pseudo depth image Y' generated by G in step C;
Step G: adjust G and F to reduce the difference losses in step F and to increase the discrimination loss on the pseudo image and the pseudo depth image in step D;
Step H: return to step C and iterate until a preset iteration condition is met, and save the generative neural network G at that point as the final generative neural network for the deployment phase.
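The alternating updates of steps C through H can be sketched as the loop below, reusing the hypothetical helpers sketched earlier (dual_consistency_forward/backward, discriminator_loss, generator_loss). The optimizer, learning rate, batch format and epoch count are assumptions; the patent only specifies that D_X/D_Y and G/F are adjusted in turn until a preset iteration condition is met.

```python
import itertools
import torch

def train(G, F, D_X, D_Y, loader, epochs=200, lr=2e-4):
    """Alternating optimization over steps C-H; hyperparameters are assumed."""
    opt_g = torch.optim.Adam(itertools.chain(G.parameters(), F.parameters()), lr=lr)
    opt_d = torch.optim.Adam(itertools.chain(D_X.parameters(), D_Y.parameters()), lr=lr)
    for _ in range(epochs):
        for x, y, y_s, mask in loader:                       # batches of (X, Y, Y_s) triplets
            # Step C: generate the pseudo and pseudo-restored data.
            y_fake, x_rec = dual_consistency_forward(G, F, x, y_s)
            x_fake, y_rec = dual_consistency_backward(G, F, y, y_s)
            # Steps D-E: update D_X and D_Y to reduce their discrimination loss.
            l_dx, l_dy = discriminator_loss(D_X, D_Y, x, y, x_fake, y_fake)
            opt_d.zero_grad(); (l_dx + l_dy).backward(); opt_d.step()
            # Steps F-G: update G and F to reduce the difference losses and fool D_Y.
            l_g = generator_loss(D_Y, x, y, y_s, mask, y_fake, x_rec, y_rec)
            opt_g.zero_grad(); l_g.backward(); opt_g.step()
    return G                                                 # step H: keep G for deployment
```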
As shown in FIG. 3, Z represents the data abstraction corresponding to the actual scene (i.e. the color, depth, contrast, transparency and other information contained in the scene), X represents an image of the scene, Y represents a depth image, and Y_s represents the depths of several pixels of the depth image in the sample or measured in practice.
In one embodiment of the invention, the generative adversarial network performs inference on the data according to the following Bayesian probabilistic model:
(The Bayesian probabilistic model appears as an equation image in the original publication.)
where the random variable X represents the RGB image, the random variable Y represents the depth image, the random variable Y_s represents the measured depths of several pixels of the depth image in the sample, Y'_s represents the depths of those pixels in the generated pseudo depth image, G is the generative neural network that generates a depth image from the image, F is the generative neural network that generates an image from the depth image, D_X is the discriminative neural network that discriminates the authenticity of images and outputs the probability that an image is real, D_Y is the discriminative neural network that discriminates the authenticity of depth images and outputs the probability that a depth image is real, Y' is the pseudo depth image generated by the generative neural network G, and X' is the pseudo-restored image generated by the generative neural network F.
As shown in FIG. 4, in an embodiment of the present invention step C specifically comprises:
inputting the depth image in the sample to F to obtain the corresponding pseudo image, and then inputting the pseudo image together with the depths of the several pixels of the depth image in the sample to G to obtain a pseudo-restored depth image; a pseudo-restored image or pseudo-restored depth image is data generated by a computer model whose input is itself data generated by another computer model.
C_1 represents the loss between the pseudo depth image generated by the generative network and the measured depth image;
C_2 represents the loss between the depths of the pseudo pixels generated by the generative neural network G and the depths of the real (measured) pixels Y_s, which constrains G as supervision information;
C_3 represents the error between the pseudo-restored image generated by the generative neural network F and the image of the real scene. C_1, C_2 and C_3 jointly constrain the generative network G through the loss function.
In one embodiment of the present invention, the loss function of the generative neural networks G and F is:
L_G = L_GAN + λ_1·L_REC + λ_2·L_SSC,
L_GAN = E_Y[log D_Y(Y)] + E_X[log(1 - D_Y(G(X)))],
L_REC(X, G, F) = E_X[||X - F(G(X))||_1] + E_Y[||Y - G(F(Y))||_1],
(The definition of L_SSC appears as an equation image in the original publication.)
where E denotes expectation, i.e. E[f(X)] is the expectation of the function or random variable in brackets, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, L_G is the loss function of the generative neural networks G and F, L_GAN is the adversarial loss, L_REC is the restoration (reconstruction) loss, L_SSC is the loss between the measured depths of the several pixels of the depth image in the sample and the corresponding pixels of the pseudo depth image generated by G, λ_1 is the weight coefficient of L_REC and λ_2 is the weight coefficient of L_SSC. λ_1 and λ_2 lie between 0 and 10; preferably λ_1 = 10 and λ_2 = 10.
Correspondingly, the loss functions of the discriminative neural networks D_X and D_Y are:
(The discriminator loss functions appear as equation images in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, L_DX is the discrimination loss function of the discriminative neural network D_X, and L_DY is the discrimination loss function of the discriminative neural network D_Y.
In the generative network structure of the above embodiment, the generative neural networks G and F are fully convolutional networks comprising convolutional (down-sampling) layers, a residual network and deconvolution layers.
As shown in FIG. 5, in the above embodiment the generative neural networks G and F are fully convolutional neural networks comprising convolutional layers, residual network layers and deconvolution layers, specifically as follows:
The first layer is convolutional layer Conv1: 64 convolution kernels (filters) of size 4 × 4 with a stride of 1 pixel; edge padding is applied before the convolution, normalization after it, and a rectified linear unit (ReLU) serves as the nonlinear activation function.
The second layer is convolutional layer Conv2: 128 kernels of size 3 × 3 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a ReLU activation.
The third layer is convolutional layer Conv3: 256 kernels of size 3 × 3 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a ReLU activation.
After this down-sampling (convolution), the data enter a residual network for further convolution. The residual network consists of several consecutive residual modules, each containing two convolutional layers:
each convolutional layer has 256 kernels of size 3 × 3 with a stride of 1 pixel and edge padding of 1 pixel before the convolution; normalization is applied after the convolution, followed by a ReLU activation. Dropout with probability 0.5 is applied between the two convolutional layers. Before the ReLU activation of the second convolutional layer, the data entering the residual module are added to the data after the two convolutions, and the result is passed to the next residual module. Optionally, the residual network has 9 to 21 residual modules; preferably, it has 9.
After the residual network, the generative network up-samples the data through several deconvolution layers:
The first is deconvolution layer Deconv1: 128 kernels of size 3 × 3 with a stride of 1/2 pixel; edge padding before the deconvolution, normalization after it, followed by a ReLU activation.
The second is deconvolution layer Deconv2: 64 kernels of size 3 × 3 with a stride of 1/2 pixel; edge padding before the deconvolution, normalization after it, followed by a ReLU activation.
The third is deconvolution layer Deconv3: 3 kernels of size 7 × 7 with a stride of 1 pixel; edge padding before the deconvolution, normalization after it, and a tangent (tanh) function as the activation.
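The layer list above corresponds to the following PyTorch sketch. The use of instance normalization, the exact padding values, the 4-channel input (X concatenated with Y_s) and the single-channel depth output are assumptions where the text only says "normalization" and "edge padding"; input height and width are assumed divisible by 4, and depth values are assumed normalized to the tanh range.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3, 256-channel convolutions with dropout 0.5 in between; the skip
    connection is added before the second ReLU, as described above."""
    def __init__(self, ch=256):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, stride=1, padding=1)
        self.norm1 = nn.InstanceNorm2d(ch)
        self.norm2 = nn.InstanceNorm2d(ch)
        self.drop = nn.Dropout(0.5)

    def forward(self, x):
        h = torch.relu(self.norm1(self.conv1(x)))
        h = self.norm2(self.conv2(self.drop(h)))
        return torch.relu(x + h)                     # add the skip connection, then activate

class Generator(nn.Module):
    """Conv / residual / deconv generator following the layer sizes above."""
    def __init__(self, in_ch=4, out_ch=1, n_res=9):
        super().__init__()
        layers = [nn.ZeroPad2d((1, 2, 1, 2)), nn.Conv2d(in_ch, 64, 4, stride=1),            # Conv1
                  nn.InstanceNorm2d(64), nn.ReLU(True),
                  nn.Conv2d(64, 128, 3, stride=2, padding=1),                               # Conv2
                  nn.InstanceNorm2d(128), nn.ReLU(True),
                  nn.Conv2d(128, 256, 3, stride=2, padding=1),                              # Conv3
                  nn.InstanceNorm2d(256), nn.ReLU(True)]
        layers += [ResidualBlock(256) for _ in range(n_res)]                                # residual modules
        layers += [nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),  # Deconv1
                   nn.InstanceNorm2d(128), nn.ReLU(True),
                   nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),   # Deconv2
                   nn.InstanceNorm2d(64), nn.ReLU(True),
                   nn.ConvTranspose2d(64, out_ch, 7, stride=1, padding=3), nn.Tanh()]       # Deconv3 + tanh
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

Under these assumptions, G would be built as Generator(in_ch=4, out_ch=1) and F as Generator(in_ch=1, out_ch=3).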
As shown in FIG. 6, in an embodiment of the present invention the discriminative neural networks D_X and D_Y have the following structure:
The first layer is convolutional layer Conv1: 64 convolution kernels (filters) of size 7 × 7 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a LeakyReLU activation.
The second layer is convolutional layer Conv2: 128 kernels of size 4 × 4 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a LeakyReLU activation.
The third layer is convolutional layer Conv3: 256 kernels of size 4 × 4 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a LeakyReLU activation.
The fourth layer is convolutional layer Conv4: 512 kernels of size 4 × 4 with a stride of 1 pixel; edge padding before the convolution, normalization after it, followed by a LeakyReLU activation.
The fifth layer is convolutional layer Conv5: 1 kernel of size 4 × 4 with a stride of 1 pixel, with edge padding before the convolution.
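A corresponding sketch of the discriminator structure follows; under the assumptions used above, D_X would be instantiated with in_ch=3 (RGB) and D_Y with in_ch=1 (depth). The final Sigmoid, which turns the Conv5 response map into per-patch probabilities, and the use of instance normalization are assumptions.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Five-layer convolutional discriminator following the layer sizes above."""
    def __init__(self, in_ch):
        super().__init__()
        def block(ci, co, k, s):
            return [nn.Conv2d(ci, co, k, stride=s, padding=k // 2),
                    nn.InstanceNorm2d(co), nn.LeakyReLU(0.2, inplace=True)]
        self.net = nn.Sequential(
            *block(in_ch, 64, 7, 2),                    # Conv1: 64 kernels, 7x7, stride 2
            *block(64, 128, 4, 2),                      # Conv2: 128 kernels, 4x4, stride 2
            *block(128, 256, 4, 2),                     # Conv3: 256 kernels, 4x4, stride 2
            *block(256, 512, 4, 1),                     # Conv4: 512 kernels, 4x4, stride 1
            nn.Conv2d(512, 1, 4, stride=1, padding=2),  # Conv5: 1 kernel, 4x4, stride 1
            nn.Sigmoid())

    def forward(self, x):
        return self.net(x)                              # per-patch realness probabilities
```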
The depth estimation method, implemented as an executable instruction set for a processor circuit or chip, can be applied to intelligent sensing devices such as next-generation lidar and advanced cameras; it predicts (restores) the depth image of the scene corresponding to a small set of actually measured points, reducing the cost of acquiring a depth image of the scene compared with traditional methods.
The above are merely specific embodiments of the present invention and do not limit its scope; any alterations and modifications that do not depart from the spirit of the invention fall within the scope of the invention.

Claims (6)

1. A method for estimating scene geometric information from a single image using a generative adversarial network, the method comprising:
inputting the image of the scene and the depths of several pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel being the distance between the observer and the point in the scene corresponding to that pixel in the image, and the depth image being the collection of the depths of all pixels in an image;
wherein the training of the generative neural network comprises:
step A: collecting a training data set: the training data set comprises a number of samples, each sample being an image and a corresponding depth image;
step B: constructing a generative adversarial network architecture comprising two generative neural networks, F and G, and two discriminative neural networks, D_X and D_Y;
step C: inputting the image in the sample and the depths of several pixels of its depth image to G to obtain a corresponding pseudo depth image; inputting the depth image in the sample to F to obtain a corresponding pseudo image; a pseudo image or pseudo depth image being data generated by a computer model rather than actually captured or measured; the several pixels of the image in the sample and of its depth image being 300 to 400 pixels and no more than 1% of all pixels of the image;
step D: the discriminative neural network D_X discriminating the image in the sample and/or the pseudo image from step C, and the discriminative neural network D_Y discriminating the depth image in the sample and/or the pseudo depth image from step C;
step E: adjusting D_X and D_Y to reduce the discrimination loss in step D;
step F: calculating a difference loss between the depth image in the sample and the pseudo depth image generated by G in step C, and a difference loss between the image in the sample and the pseudo image generated by F;
step G: adjusting G and F to reduce the difference losses in step F and to increase the discrimination loss on the pseudo image and the pseudo depth image in step D;
step H: returning to step C and iterating until a preset iteration condition is met, and saving the generative neural network G at that point as the final generative neural network.
2. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein step C is specifically:
inputting the image in the sample and the depths of the several pixels of its depth image to G to obtain a corresponding pseudo depth image, and then inputting the pseudo depth image to F to obtain a pseudo-restored image;
inputting the depth image in the sample to F to obtain a corresponding pseudo image, and then inputting the pseudo image and the depths of the several pixels to G to obtain a pseudo-restored depth image; a pseudo-restored image or pseudo-restored depth image being data generated by a computer model whose input is itself data generated by another computer model.
3. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein the generative adversarial network performs inference on the data according to the following Bayesian probabilistic model:
(The Bayesian probabilistic model appears as an equation image in the original publication.)
where X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, G is the generative neural network that generates a depth image from the image, F is the generative neural network that generates an image from the depth image, D_Y is the discriminative neural network that discriminates the authenticity of the depth image and outputs the probability that the depth image is real, Y' is the pseudo depth image generated by the generative neural network G, and X' is the pseudo-restored image generated by the generative neural network F.
4. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1 or claim 3, wherein the loss function of the generative neural networks G and F is:
L_G = L_GAN + λ_1·L_REC + λ_2·L_SSC,
L_GAN = E_Y[log D_Y(Y)] + E_X[log(1 - D_Y(G(X)))],
L_REC(X, G, F) = E_X[||X - F(G(X))||_1] + E_Y[||Y - G(F(Y))||_1],
(The definition of L_SSC appears as an equation image in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, L_G is the loss function of the generative neural networks G and F, L_GAN is the adversarial loss of the generative adversarial network, L_REC is the loss between the images or depth images generated by the generative neural networks G and F and those in the sample, L_SSC is the loss between the measured depths of the several pixels of the depth image in the sample and the corresponding pixels of the pseudo depth image generated by G, λ_1 is the weight coefficient of L_REC, and λ_2 is the weight coefficient of L_SSC.
5. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein the loss functions of the discriminative neural networks D_X and D_Y are:
(The discriminator loss functions appear as equation images in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, L_DX is the discrimination loss function of the discriminative neural network D_X, and L_DY is the discrimination loss function of the discriminative neural network D_Y.
6. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein the generative neural network G or F is a fully convolutional neural network comprising convolutional layers, residual network layers and deconvolution layers.
CN201810376281.9A 2018-04-24 2018-04-24 Method for estimating scene geometric information from a single image using a generative adversarial network Active CN108830890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810376281.9A CN108830890B (en) 2018-04-24 2018-04-24 Method for estimating scene geometric information from a single image using a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810376281.9A CN108830890B (en) 2018-04-24 2018-04-24 Method for estimating scene geometric information from a single image using a generative adversarial network

Publications (2)

Publication Number Publication Date
CN108830890A CN108830890A (en) 2018-11-16
CN108830890B (en) 2021-10-01

Family

ID=64154785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810376281.9A Active CN108830890B (en) 2018-04-24 2018-04-24 Method for estimating scene geometric information from a single image using a generative adversarial network

Country Status (1)

Country Link
CN (1) CN108830890B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782397B (en) * 2018-12-13 2020-08-28 北京嘀嘀无限科技发展有限公司 Image processing method, generation type countermeasure network, electronic equipment and storage medium
CN109788270B (en) * 2018-12-28 2021-04-09 南京美乐威电子科技有限公司 3D-360-degree panoramic image generation method and device
CN111274946B (en) * 2020-01-19 2023-05-05 杭州涂鸦信息技术有限公司 Face recognition method, system and equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307A (en) * 2016-06-27 2016-11-23 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307A (en) * 2016-06-27 2016-11-23 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Depth Prediction from a Single Image with Conditional Adversarial Networks; Hyungjoo Jung et al.; 2017 IEEE International Conference on Image Processing (ICIP); IEEE; 2018-02-22; pp. 1717-1720, sections 1-3, figs. 2-3 *
Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks; Zhu Junyan et al.; 2017 IEEE International Conference on Computer Vision (ICCV); IEEE; 2017-12-25; full text *
Application of deep learning in image recognition; Li Chaobo et al.; Journal of Nantong University (Natural Science Edition); 2018-03-20; Vol. 17, No. 1; full text *

Also Published As

Publication number Publication date
CN108830890A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
Moreau et al. Lens: Localization enhanced by nerf synthesis
CN109598754B (en) Binocular depth estimation method based on depth convolution network
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
US11210804B2 (en) Methods, devices and computer program products for global bundle adjustment of 3D images
WO2019005999A1 (en) Method and system for performing simultaneous localization and mapping using convolutional image transformation
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
CN108510535A (en) A kind of high quality depth estimation method based on depth prediction and enhancing sub-network
CN108648161A (en) The binocular vision obstacle detection system and method for asymmetric nuclear convolutional neural networks
CN110598590A (en) Close interaction human body posture estimation method and device based on multi-view camera
CN108830890B (en) Method for estimating scene geometric information from single image by using generative countermeasure network
CN110910437B (en) Depth prediction method for complex indoor scene
CN110243390B (en) Pose determination method and device and odometer
CN113256698B (en) Monocular 3D reconstruction method with depth prediction
CN114049434A (en) 3D modeling method and system based on full convolution neural network
CN104424640A (en) Method and device for carrying out blurring processing on images
CN116258817B (en) Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
CN112330795A (en) Human body three-dimensional reconstruction method and system based on single RGBD image
CN111612898B (en) Image processing method, image processing device, storage medium and electronic equipment
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN110942484A (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN116468769A (en) Depth information estimation method based on image
JP2022027464A (en) Method and device related to depth estimation of video
CN110428461B (en) Monocular SLAM method and device combined with deep learning
CN113313740B (en) Disparity map and surface normal vector joint learning method based on plane continuity
CN114935316B (en) Standard depth image generation method based on optical tracking and monocular vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant