CN108830890B - Method for estimating scene geometric information from a single image using a generative adversarial network - Google Patents

Method for estimating scene geometric information from a single image using a generative adversarial network

Info

Publication number
CN108830890B
CN108830890B CN201810376281.9A CN201810376281A CN108830890B
Authority
CN
China
Prior art keywords
image
depth
pseudo
sample
depth image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810376281.9A
Other languages
Chinese (zh)
Other versions
CN108830890A (en)
Inventor
李俊
黄韬
张露娟
马震远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qichen Guangzhou Electronic Technology Co ltd
Original Assignee
Qichen Guangzhou Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qichen Guangzhou Electronic Technology Co., Ltd.
Priority to CN201810376281.9A
Publication of CN108830890A
Application granted
Publication of CN108830890B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a method for estimating scene geometric information from a single image using a generative adversarial network, comprising the following step: inputting an image of the scene and the depths of a small number of pixels in that image into a trained generative neural network to obtain a depth image of the scene. The depth of a pixel is the distance between the observer and the point in the scene corresponding to that pixel, and the depth image is the collection of the depths of all pixels in an image. Taking the scene image and the depths of a few corresponding pixels as input, the invention predicts (estimates) the depth image of the scene with a generative adversarial network under dual consistency constraints; the method is simple, effective and low in cost.

Description

Method for estimating scene geometric information from a single image using a generative adversarial network
Technical Field
The invention belongs to the field of computer image processing and relates to a method for estimating scene geometric information from a single image, in particular to a method for estimating scene geometric information from a single image using a generative adversarial network.
Background
Depth prediction and estimation are very important in engineering applications such as robotics, autonomous driving, augmented reality (AR) and 3D modeling. At present there are two main ways to acquire depth images: direct ranging and indirect ranging. Direct ranging acquires depth information directly with hardware devices. For example, time-of-flight (TOF) cameras emit continuous near-infrared pulses to measure the distance from objects in the target scene to the emitter; lidar scans objects in the scene with laser pulses to obtain the distance from the object surfaces to the sensor; and the Kinect uses light-coding technology, projecting a pattern onto the scene with an infrared emitter to obtain three-dimensional depth information. Each, however, has its limitations: TOF cameras are typically expensive and susceptible to noise; the three-dimensional information captured by lidar is uneven and sparse when mapped into the color image coordinate system, and the hardware is costly; the Kinect has a short measuring range and is easily affected by ambient light, producing a large amount of noise.
Indirect ranging estimates depth indirectly from one or more visible-light images of the same scene. Depending on the number of scene viewpoints, these methods fall into multi-view depth estimation, binocular (stereo) depth estimation and monocular depth estimation. Multi-view depth estimation typically uses a camera array to capture the same scene and exploits the redundant information between the viewpoint images to compute a depth image. It can yield a relatively accurate depth image of the scene, but the camera array is expensive, troublesome to configure and demanding to shoot with, so it is rarely used in practice. Binocular depth estimation computes depth information through stereo matching, using the disparity between two cameras arranged like human eyes. Monocular depth estimation uses only the video sequence or images of a single viewpoint.
Because of these limitations, depth estimation with a single camera has attracted strong interest: such cameras are small, inexpensive, energy-efficient and ubiquitous in consumer electronics.
In recent years, with the development of deep learning, researchers have increasingly studied monocular depth estimation with convolutional neural networks (CNNs). Saxena et al., using a supervised learning approach, model the depths of individual points and the relationships between the depths of different points with a Markov random field (MRF) built on multi-scale local and global image features.
CN107578436A discloses a monocular image depth estimation method based on a fully convolutional network (FCN). Its steps are: acquire training image data; feed the training data into the FCN and obtain feature maps from the successive pooling layers; upsample the feature map output by the last pooling layer to the size of the previous pooling layer's output and fuse the two; fuse the outputs of the pooling layers in this way from back to front to obtain the final predicted depth image; during training, optimize the FCN parameters with stochastic gradient descent (SGD); finally, input an RGB image requiring depth prediction into the trained FCN to obtain the corresponding depth prediction. The fully convolutional structure removes the fully connected layers, which effectively reduces the number of network parameters and mitigates the low resolution of the output caused by repeated convolution, but the method requires very large training sets and long training times.
Disclosure of Invention
To solve the above problems, the present invention provides a method for estimating scene geometric information from a single image using a generative adversarial network, the method comprising:
inputting the image of the scene and the depths of several pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel is the distance between the observer and the point in the scene corresponding to that pixel in the image, and the depth image is the collection of the depths of all pixels in an image;
the training of the generative neural network comprises the following steps:
step A: collecting a training data set: the training data set comprises a number of samples, each sample being an image and its corresponding depth image;
step B: constructing a generative adversarial network architecture comprising two generative neural networks (F and G) and two discriminative neural networks (D_X and D_Y);
step C: inputting the image in the sample and the depths of several pixels of its depth image to G to obtain a corresponding pseudo depth image; inputting the depth image in the sample to F to obtain a corresponding pseudo image; a pseudo image or pseudo depth image is data generated by a computer model rather than actually captured or measured;
step D: the discriminative neural network D_X discriminates the image in the sample and/or the pseudo image from step C, and the discriminative neural network D_Y discriminates the depth image in the sample and/or the pseudo depth image from step C;
step E: adjusting D_X and D_Y to reduce the discrimination loss in step D;
step F: calculating the difference loss between the depth image in the sample and the pseudo depth image generated by G in step C, and the difference loss between the image in the sample and the pseudo image generated by F;
step G: adjusting G and F to reduce the difference losses in step F and to increase the discrimination loss on the pseudo image and the pseudo depth image in step D;
step H: returning to step C and iterating until a preset iteration condition is met, and saving the generative neural network G at that point as the final generative neural network.
In an embodiment of the present invention, step C specifically comprises:
inputting the image in the sample and the depths of several pixels of its depth image to G to obtain a corresponding pseudo depth image, and then inputting the pseudo depth image to F to obtain a pseudo-restored image;
inputting the depth image in the sample to F to obtain a corresponding pseudo image, and then inputting the pseudo image and the depths of the several pixels of the depth image in the sample to G to obtain a pseudo-restored depth image; a pseudo-restored image or pseudo-restored depth image is data generated by a computer model whose input is itself data generated by another computer model.
In one embodiment of the invention, step F further comprises: calculating the difference loss between the depth image in the sample and the pseudo-restored depth image, and the difference loss between the image in the sample and the pseudo-restored image.
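The dual consistency just described can be summarized in a short sketch. The following Python/PyTorch fragment is illustrative only: the tensor shapes, the 4-channel concatenation of the image X with a dense map of the sparse depths Y_s, and the helper names are assumptions, not part of the patent.

```python
import torch

def dual_consistency_forward(G, F, x, y_s):
    """x: (N, 3, H, W) RGB image X; y_s: (N, 1, H, W) sparse depths Y_s (zeros where unmeasured)."""
    y_fake = G(torch.cat([x, y_s], dim=1))           # pseudo depth image      Y' = G(X, Y_s)
    x_restored = F(y_fake)                           # pseudo-restored image   F(Y')
    return y_fake, x_restored

def dual_consistency_backward(G, F, y, y_s):
    """y: (N, 1, H, W) measured depth image Y."""
    x_fake = F(y)                                    # pseudo image            X' = F(Y)
    y_restored = G(torch.cat([x_fake, y_s], dim=1))  # pseudo-restored depth   G(X', Y_s)
    return x_fake, y_restored
```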
In one embodiment of the invention, the generative adversarial network performs inference on the data according to the following Bayesian probabilistic model:
(The Bayesian probabilistic model appears as an equation image in the original publication.)
where X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, G is the generative neural network that generates a depth image from the image, F is the generative neural network that generates an image from the depth image, D_X is the discriminative neural network that discriminates the authenticity of images, D_Y is the discriminative neural network that discriminates the authenticity of depth images and outputs the probability that a depth image is real, Y' is the pseudo depth image generated by the generative neural network G, and X' is the pseudo-restored image generated by the generative neural network F.
In one embodiment of the present invention, the loss function of the generative neural networks G and F is:
L_G = L_GAN + λ_1·L_REC + λ_2·L_SSC,
L_GAN = E_Y[log D_Y(Y)] + E_X[log(1 - D_Y(G(X)))],
L_REC(X, G, F) = E_X[||X - F(G(X))||_1] + E_Y[||Y - G(F(Y))||_1],
(The definition of L_SSC appears as an equation image in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, L_G is the loss function of the generative neural networks G and F, L_GAN is the adversarial loss, L_REC is the restoration (reconstruction) loss, L_SSC is the loss between the measured depths of the several pixels of the depth image in the sample and the corresponding pixels of the pseudo depth image generated by G, λ_1 is the weight coefficient of L_REC and λ_2 is the weight coefficient of L_SSC. Preferably, λ_1 and λ_2 lie between 0 and 10.
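A minimal Python/PyTorch sketch of this generator objective follows. The adversarial term is written in the common non-saturating binary cross-entropy form, and because the exact L_SSC formula is only given as an image in the record, it is assumed here to be an L1 penalty restricted to the measured pixels (marked by a hypothetical mask tensor); the default weights follow the preferred values λ_1 = λ_2 = 10 given later in the description.

```python
import torch
import torch.nn.functional as nnf

def generator_loss(d_y, x, y, y_s, mask, y_fake, x_rec, y_rec, lam1=10.0, lam2=10.0):
    """Sketch of L_G = L_GAN + λ1·L_REC + λ2·L_SSC (names and L_SSC form assumed)."""
    # Adversarial term: push D_Y toward calling the pseudo depth image real.
    p_fake = d_y(y_fake)
    l_gan = nnf.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
    # Restoration (cycle) term: X should match F(G(X)) and Y should match G(F(Y)).
    l_rec = (x - x_rec).abs().mean() + (y - y_rec).abs().mean()
    # Sparse-sample consistency: generated depths must match the measured pixels Y_s.
    l_ssc = ((y_fake - y_s).abs() * mask).sum() / mask.sum().clamp(min=1)
    return l_gan + lam1 * l_rec + lam2 * l_ssc
```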
Further, the loss functions of the discriminative neural networks D_X and D_Y are:
(The discriminator loss functions appear as equation images in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, L_DX is the discrimination loss function of the discriminative neural network D_X, and L_DY is the discrimination loss function of the discriminative neural network D_Y.
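Because the discriminator loss formulas themselves appear only as images, the sketch below assumes the standard binary cross-entropy objectives that pair with the L_GAN term above; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as nnf

def discriminator_loss(d_x, d_y, x, y, x_fake, y_fake):
    """Assumed objectives for D_X and D_Y: real samples toward 1, generated samples toward 0."""
    def bce_real_fake(d, real, fake):
        p_real, p_fake = d(real), d(fake.detach())   # detach so the D update does not touch G or F
        return (nnf.binary_cross_entropy(p_real, torch.ones_like(p_real)) +
                nnf.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
    return bce_real_fake(d_x, x, x_fake), bce_real_fake(d_y, y, y_fake)
```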
Further, the generative neural networks G and F are fully convolutional neural networks comprising convolutional layers, residual network layers and deconvolution layers. Preferably, the residual network has 9 to 21 layers (residual modules).
The terms used in the present invention are explained as follows:
Scene: a specific picture formed by objective objects or things occurring at a certain time and place, generally an indoor or outdoor scene.
Observer: the sensor/processor device set used for scene measurement. A point (which may lie inside or outside the device) uniquely defined by the position and attitude of the device set in the 3-dimensional physical world serves as the reference measurement point for the depths referred to herein.
Depth: the depth of a point in the scene is defined as its geometric distance from the observer position.
Actual measurement point set: several points in the scene whose geometric distances to the observer are measured with an instrument; these distances are considered known in the present method.
Matrix: a two-dimensional data table arranged in rows and columns (J. J. Sylvester, "Additions to the articles in the September number of this journal, 'On a new class of theorems,' and on Pascal's theorem," The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 1850, 37: 363). By extension, image matrices (images for short) are defined as matrices.
Neural network: an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed, parallel information processing.
Generative network (Generative Net): herein, a neural network whose output takes the form of a data matrix.
Real/fake data: the data may be an image/matrix; real data is sensor data obtained by measuring a scene, and fake data is data of similar format output by a configured generative network.
FCN: Fully Convolutional Network, a neural network in which all layers are convolutional layers.
The invention has the following beneficial effects:
1. The invention provides a method for estimating scene geometric information from a single image using a generative adversarial network; it takes the scene image and a small amount of corresponding depth information in that image as input, requires little measurement data, is simple and effective, and reduces the cost of measuring scene depth.
2. The generative network is optimized with a residual network, which increases the number of layers of the generative network and improves its performance and accuracy.
Drawings
FIG. 1a is an example of an image, denoted X in the present invention;
FIG. 1b is an example of the measured depths of several pixels, denoted Y_s in the present invention;
FIG. 1c is an example of the estimated depth image, denoted Y' in the present invention;
FIG. 2 is a flow chart of the image depth estimation method of the present invention;
FIG. 3 is a basic schematic diagram of the image depth estimation of the present invention;
FIG. 4 is a basic structural diagram of the generative adversarial network of the present invention;
FIG. 5 is a basic structural diagram of the generative network in an embodiment of the present invention;
FIG. 6 is a basic structural diagram of the discriminative network in an embodiment of the present invention.
Detailed Description
To better understand the technical solution proposed by the present invention, the invention is further explained below with reference to the drawings and specific embodiments.
Generally, to acquire geometric information, in particular depth information, of a scene, people capture a depth image of the scene with a Kinect camera; however, the Kinect has a short measuring range and the depth information it acquires is sparse, so multiple measurements are needed to obtain the full depth information of the scene. We therefore wish to estimate the full depth information of the scene from an RGB image taken by an ordinary camera together with the sparse depth image captured by a Kinect. The ordinary camera and the Kinect must capture the RGB image and the depth image from the same position and angle.
The method comprises inputting the image of the scene and the depths of several pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel is the distance between the observer and the point in the scene corresponding to that pixel, and the depth image is the collection of the depths of all pixels in an image.
The specific implementation steps of the deployment phase are as follows:
Step 1: acquire an RGB image X of the scene (FIG. 1a) and measure in real time the depths Y_s of 300 to 400 pixels of the image (typically no more than 1% of all pixels of the image, FIG. 1b);
Step 2: input the RGB image X and the pixel depths Y_s into the trained generative network G;
Step 3: run the generative neural network G in a feed-forward pass;
Step 4: take the feed-forward output of G as the pseudo depth image Y' (FIG. 1c); Y' is then the prediction of the depth image Y corresponding to X.
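A minimal deployment-stage sketch is given below. It assumes a PyTorch generator G whose input is the RGB image concatenated with a dense sparse-depth channel (zeros at the unmeasured pixels); the function name and tensor shapes are illustrative, not taken from the patent.

```python
import torch

def estimate_depth(generator_g, rgb_image, sparse_depth):
    """rgb_image: (1, 3, H, W) tensor X; sparse_depth: (1, 1, H, W) tensor Y_s with zeros
    at unmeasured pixels (300-400 measured pixels, under 1% of the image)."""
    generator_g.eval()
    with torch.no_grad():                                            # feed-forward pass only
        y_hat = generator_g(torch.cat([rgb_image, sparse_depth], dim=1))
    return y_hat                                                     # pseudo depth image Y', the prediction of Y
```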
The generative adversarial network therefore has to be trained before the deployment phase.
As shown in FIG. 2, in an embodiment of the present invention the training of the generative neural network comprises the following steps:
Step A: collect a training data set: the training data set comprises a number of samples, each sample being a triplet (RGB image X, depth image Y, pixel depths Y_s);
Step B: construct a generative adversarial network architecture comprising two generative neural networks (F and G) and two discriminative neural networks (D_X and D_Y), where G is the network that generates the depth image in the deployment-phase steps above;
Step C: input X and Y_s from the sample to G to obtain the corresponding pseudo depth image Y' (the same operation as in the deployment phase); input the depth image Y from the sample to F to obtain the corresponding pseudo image X';
Step D: the discriminative neural network D_X discriminates between the image X in the sample and the pseudo image X' generated by F in step C, and the discriminative neural network D_Y discriminates between the depth image Y in the sample and the pseudo depth image Y' generated by G in step C;
Step E: adjust D_X and D_Y to reduce the discrimination loss in step D;
Step F: calculate the difference loss between the image X in the sample and the pseudo image X' generated by F in step C, and the difference loss between the depth image Y in the sample and the pseudo depth image Y' generated by G in step C;
Step G: adjust G and F to reduce the difference losses in step F and to increase the discrimination loss on the pseudo image and the pseudo depth image in step D;
Step H: return to step C and iterate until a preset iteration condition is met, and save the generative neural network G at that point as the final generative neural network for the deployment phase.
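The alternating updates of steps C through H can be sketched as the loop below, reusing the hypothetical helpers sketched earlier (dual_consistency_forward/backward, discriminator_loss, generator_loss). The optimizer, learning rate, batch format and epoch count are assumptions; the patent only specifies that D_X/D_Y and G/F are adjusted in turn until a preset iteration condition is met.

```python
import itertools
import torch

def train(G, F, D_X, D_Y, loader, epochs=200, lr=2e-4):
    """Alternating optimization over steps C-H; hyperparameters are assumed."""
    opt_g = torch.optim.Adam(itertools.chain(G.parameters(), F.parameters()), lr=lr)
    opt_d = torch.optim.Adam(itertools.chain(D_X.parameters(), D_Y.parameters()), lr=lr)
    for _ in range(epochs):
        for x, y, y_s, mask in loader:                       # batches of (X, Y, Y_s) triplets
            # Step C: generate the pseudo and pseudo-restored data.
            y_fake, x_rec = dual_consistency_forward(G, F, x, y_s)
            x_fake, y_rec = dual_consistency_backward(G, F, y, y_s)
            # Steps D-E: update D_X and D_Y to reduce their discrimination loss.
            l_dx, l_dy = discriminator_loss(D_X, D_Y, x, y, x_fake, y_fake)
            opt_d.zero_grad(); (l_dx + l_dy).backward(); opt_d.step()
            # Steps F-G: update G and F to reduce the difference losses and fool D_Y.
            l_g = generator_loss(D_Y, x, y, y_s, mask, y_fake, x_rec, y_rec)
            opt_g.zero_grad(); l_g.backward(); opt_g.step()
    return G                                                 # step H: keep G for deployment
```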
As shown in FIG. 3, Z represents the data abstraction corresponding to the actual scene (i.e. the color, depth, contrast, transparency and other information contained in the scene), X represents an image of the scene, Y represents a depth image, and Y_s represents the depths of several pixels of the depth image in the sample or measured in practice.
In one embodiment of the invention, the generative adversarial network performs inference on the data according to the following Bayesian probabilistic model:
(The Bayesian probabilistic model appears as an equation image in the original publication.)
where the random variable X represents the RGB image, the random variable Y represents the depth image, the random variable Y_s represents the measured depths of several pixels of the depth image in the sample, Y'_s represents the depths of those pixels in the generated pseudo depth image, G is the generative neural network that generates a depth image from the image, F is the generative neural network that generates an image from the depth image, D_X is the discriminative neural network that discriminates the authenticity of images and outputs the probability that an image is real, D_Y is the discriminative neural network that discriminates the authenticity of depth images and outputs the probability that a depth image is real, Y' is the pseudo depth image generated by the generative neural network G, and X' is the pseudo-restored image generated by the generative neural network F.
As shown in FIG. 4, in an embodiment of the present invention step C specifically comprises:
inputting the depth image in the sample to F to obtain the corresponding pseudo image, and then inputting the pseudo image together with the depths of the several pixels of the depth image in the sample to G to obtain a pseudo-restored depth image; a pseudo-restored image or pseudo-restored depth image is data generated by a computer model whose input is itself data generated by another computer model.
C_1 represents the loss between the pseudo depth image generated by the generative network and the measured depth image;
C_2 represents the loss between the depths of the pseudo pixels generated by the generative neural network G and the depths of the real (measured) pixels Y_s, which constrains G as supervision information;
C_3 represents the error between the pseudo-restored image generated by the generative neural network F and the image of the real scene. C_1, C_2 and C_3 jointly constrain the generative network G through the loss function.
In one embodiment of the present invention, the loss function of the generative neural networks G and F is:
L_G = L_GAN + λ_1·L_REC + λ_2·L_SSC,
L_GAN = E_Y[log D_Y(Y)] + E_X[log(1 - D_Y(G(X)))],
L_REC(X, G, F) = E_X[||X - F(G(X))||_1] + E_Y[||Y - G(F(Y))||_1],
(The definition of L_SSC appears as an equation image in the original publication.)
where E denotes expectation, i.e. E[f(X)] is the expectation of the function or random variable in brackets, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, L_G is the loss function of the generative neural networks G and F, L_GAN is the adversarial loss, L_REC is the restoration (reconstruction) loss, L_SSC is the loss between the measured depths of the several pixels of the depth image in the sample and the corresponding pixels of the pseudo depth image generated by G, λ_1 is the weight coefficient of L_REC and λ_2 is the weight coefficient of L_SSC. λ_1 and λ_2 lie between 0 and 10; preferably λ_1 = 10 and λ_2 = 10.
Correspondingly, the loss functions of the discriminative neural networks D_X and D_Y are:
(The discriminator loss functions appear as equation images in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, L_DX is the discrimination loss function of the discriminative neural network D_X, and L_DY is the discrimination loss function of the discriminative neural network D_Y.
In the generative network structure of the above embodiment, the generative neural networks G and F are fully convolutional networks comprising convolutional (down-sampling) layers, a residual network and deconvolution layers.
As shown in FIG. 5, in the above embodiment the generative neural networks G and F are fully convolutional neural networks comprising convolutional layers, residual network layers and deconvolution layers, specifically as follows:
The first layer is convolutional layer Conv1: 64 convolution kernels (filters) of size 4 × 4 with a stride of 1 pixel; edge padding is applied before the convolution, normalization after it, and a rectified linear unit (ReLU) serves as the nonlinear activation function.
The second layer is convolutional layer Conv2: 128 kernels of size 3 × 3 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a ReLU activation.
The third layer is convolutional layer Conv3: 256 kernels of size 3 × 3 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a ReLU activation.
After this down-sampling (convolution), the data enter a residual network for further convolution. The residual network consists of several consecutive residual modules, each containing two convolutional layers:
each convolutional layer has 256 kernels of size 3 × 3 with a stride of 1 pixel and edge padding of 1 pixel before the convolution; normalization is applied after the convolution, followed by a ReLU activation. Dropout with probability 0.5 is applied between the two convolutional layers. Before the ReLU activation of the second convolutional layer, the data entering the residual module are added to the data after the two convolutions, and the result is passed to the next residual module. Optionally, the residual network has 9 to 21 residual modules; preferably, it has 9.
After the residual network, the generative network up-samples the data through several deconvolution layers:
The first is deconvolution layer Deconv1: 128 kernels of size 3 × 3 with a stride of 1/2 pixel; edge padding before the deconvolution, normalization after it, followed by a ReLU activation.
The second is deconvolution layer Deconv2: 64 kernels of size 3 × 3 with a stride of 1/2 pixel; edge padding before the deconvolution, normalization after it, followed by a ReLU activation.
The third is deconvolution layer Deconv3: 3 kernels of size 7 × 7 with a stride of 1 pixel; edge padding before the deconvolution, normalization after it, and a tangent (tanh) function as the activation.
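The layer list above corresponds to the following PyTorch sketch. The use of instance normalization, the exact padding values, the 4-channel input (X concatenated with Y_s) and the single-channel depth output are assumptions where the text only says "normalization" and "edge padding"; input height and width are assumed divisible by 4, and depth values are assumed normalized to the tanh range.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3, 256-channel convolutions with dropout 0.5 in between; the skip
    connection is added before the second ReLU, as described above."""
    def __init__(self, ch=256):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, stride=1, padding=1)
        self.norm1 = nn.InstanceNorm2d(ch)
        self.norm2 = nn.InstanceNorm2d(ch)
        self.drop = nn.Dropout(0.5)

    def forward(self, x):
        h = torch.relu(self.norm1(self.conv1(x)))
        h = self.norm2(self.conv2(self.drop(h)))
        return torch.relu(x + h)                     # add the skip connection, then activate

class Generator(nn.Module):
    """Conv / residual / deconv generator following the layer sizes above."""
    def __init__(self, in_ch=4, out_ch=1, n_res=9):
        super().__init__()
        layers = [nn.ZeroPad2d((1, 2, 1, 2)), nn.Conv2d(in_ch, 64, 4, stride=1),            # Conv1
                  nn.InstanceNorm2d(64), nn.ReLU(True),
                  nn.Conv2d(64, 128, 3, stride=2, padding=1),                               # Conv2
                  nn.InstanceNorm2d(128), nn.ReLU(True),
                  nn.Conv2d(128, 256, 3, stride=2, padding=1),                              # Conv3
                  nn.InstanceNorm2d(256), nn.ReLU(True)]
        layers += [ResidualBlock(256) for _ in range(n_res)]                                # residual modules
        layers += [nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),  # Deconv1
                   nn.InstanceNorm2d(128), nn.ReLU(True),
                   nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),   # Deconv2
                   nn.InstanceNorm2d(64), nn.ReLU(True),
                   nn.ConvTranspose2d(64, out_ch, 7, stride=1, padding=3), nn.Tanh()]       # Deconv3 + tanh
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

Under these assumptions, G would be built as Generator(in_ch=4, out_ch=1) and F as Generator(in_ch=1, out_ch=3).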
As shown in FIG. 6, in an embodiment of the present invention the discriminative neural networks D_X and D_Y have the following structure:
The first layer is convolutional layer Conv1: 64 convolution kernels (filters) of size 7 × 7 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a LeakyReLU activation.
The second layer is convolutional layer Conv2: 128 kernels of size 4 × 4 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a LeakyReLU activation.
The third layer is convolutional layer Conv3: 256 kernels of size 4 × 4 with a stride of 2 pixels; edge padding before the convolution, normalization after it, followed by a LeakyReLU activation.
The fourth layer is convolutional layer Conv4: 512 kernels of size 4 × 4 with a stride of 1 pixel; edge padding before the convolution, normalization after it, followed by a LeakyReLU activation.
The fifth layer is convolutional layer Conv5: 1 kernel of size 4 × 4 with a stride of 1 pixel, with edge padding before the convolution.
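A corresponding sketch of the discriminator structure follows; under the assumptions used above, D_X would be instantiated with in_ch=3 (RGB) and D_Y with in_ch=1 (depth). The final Sigmoid, which turns the Conv5 response map into per-patch probabilities, and the use of instance normalization are assumptions.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Five-layer convolutional discriminator following the layer sizes above."""
    def __init__(self, in_ch):
        super().__init__()
        def block(ci, co, k, s):
            return [nn.Conv2d(ci, co, k, stride=s, padding=k // 2),
                    nn.InstanceNorm2d(co), nn.LeakyReLU(0.2, inplace=True)]
        self.net = nn.Sequential(
            *block(in_ch, 64, 7, 2),                    # Conv1: 64 kernels, 7x7, stride 2
            *block(64, 128, 4, 2),                      # Conv2: 128 kernels, 4x4, stride 2
            *block(128, 256, 4, 2),                     # Conv3: 256 kernels, 4x4, stride 2
            *block(256, 512, 4, 1),                     # Conv4: 512 kernels, 4x4, stride 1
            nn.Conv2d(512, 1, 4, stride=1, padding=2),  # Conv5: 1 kernel, 4x4, stride 1
            nn.Sigmoid())

    def forward(self, x):
        return self.net(x)                              # per-patch realness probabilities
```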
The depth estimation method, implemented as an executable instruction set for a processor circuit or chip, can be applied to intelligent sensing devices such as next-generation lidar and advanced cameras; it predicts (restores) the depth image of the scene corresponding to a small set of actually measured points, reducing the cost of acquiring a depth image of the scene compared with traditional methods.
The above are merely specific embodiments of the present invention and do not limit its scope; any alterations and modifications that do not depart from the spirit of the invention fall within the scope of the invention.

Claims (6)

1. A method for estimating scene geometric information from a single image using a generative adversarial network, the method comprising:
inputting the image of the scene and the depths of several pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel being the distance between the observer and the point in the scene corresponding to that pixel in the image, and the depth image being the collection of the depths of all pixels in an image;
wherein the training of the generative neural network comprises:
step A: collecting a training data set: the training data set comprises a number of samples, each sample being an image and a corresponding depth image;
step B: constructing a generative adversarial network architecture comprising two generative neural networks, F and G, and two discriminative neural networks, D_X and D_Y;
step C: inputting the image in the sample and the depths of several pixels of its depth image to G to obtain a corresponding pseudo depth image; inputting the depth image in the sample to F to obtain a corresponding pseudo image; a pseudo image or pseudo depth image being data generated by a computer model rather than actually captured or measured; the several pixels of the image in the sample and of its depth image being 300 to 400 pixels and no more than 1% of all pixels of the image;
step D: the discriminative neural network D_X discriminating the image in the sample and/or the pseudo image from step C, and the discriminative neural network D_Y discriminating the depth image in the sample and/or the pseudo depth image from step C;
step E: adjusting D_X and D_Y to reduce the discrimination loss in step D;
step F: calculating a difference loss between the depth image in the sample and the pseudo depth image generated by G in step C, and a difference loss between the image in the sample and the pseudo image generated by F;
step G: adjusting G and F to reduce the difference losses in step F and to increase the discrimination loss on the pseudo image and the pseudo depth image in step D;
step H: returning to step C and iterating until a preset iteration condition is met, and saving the generative neural network G at that point as the final generative neural network.
2. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein step C is specifically:
inputting the image in the sample and the depths of the several pixels of its depth image to G to obtain a corresponding pseudo depth image, and then inputting the pseudo depth image to F to obtain a pseudo-restored image;
inputting the depth image in the sample to F to obtain a corresponding pseudo image, and then inputting the pseudo image and the depths of the several pixels to G to obtain a pseudo-restored depth image; a pseudo-restored image or pseudo-restored depth image being data generated by a computer model whose input is itself data generated by another computer model.
3. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein the generative adversarial network performs inference on the data according to the following Bayesian probabilistic model:
(The Bayesian probabilistic model appears as an equation image in the original publication.)
where X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, G is the generative neural network that generates a depth image from the image, F is the generative neural network that generates an image from the depth image, D_Y is the discriminative neural network that discriminates the authenticity of the depth image and outputs the probability that the depth image is real, Y' is the pseudo depth image generated by the generative neural network G, and X' is the pseudo-restored image generated by the generative neural network F.
4. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1 or claim 3, wherein the loss function of the generative neural networks G and F is:
L_G = L_GAN + λ_1·L_REC + λ_2·L_SSC,
L_GAN = E_Y[log D_Y(Y)] + E_X[log(1 - D_Y(G(X)))],
L_REC(X, G, F) = E_X[||X - F(G(X))||_1] + E_Y[||Y - G(F(Y))||_1],
(The definition of L_SSC appears as an equation image in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, Y'_s is the depths of those pixels in the generated pseudo depth image, L_G is the loss function of the generative neural networks G and F, L_GAN is the adversarial loss of the generative adversarial network, L_REC is the loss between the images or depth images generated by the generative neural networks G and F and those in the sample, L_SSC is the loss between the measured depths of the several pixels of the depth image in the sample and the corresponding pixels of the pseudo depth image generated by G, λ_1 is the weight coefficient of L_REC, and λ_2 is the weight coefficient of L_SSC.
5. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein the loss functions of the discriminative neural networks D_X and D_Y are:
(The discriminator loss functions appear as equation images in the original publication.)
where E denotes expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the measured depths of several pixels of the depth image in the sample, L_DX is the discrimination loss function of the discriminative neural network D_X, and L_DY is the discrimination loss function of the discriminative neural network D_Y.
6. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein the generative neural network G or F is a fully convolutional neural network comprising convolutional layers, residual network layers and deconvolution layers.
CN201810376281.9A 2018-04-24 2018-04-24 Method for estimating scene geometric information from a single image using a generative adversarial network Active CN108830890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810376281.9A CN108830890B (en) 2018-04-24 2018-04-24 Method for estimating scene geometric information from a single image using a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810376281.9A CN108830890B (en) 2018-04-24 2018-04-24 Method for estimating scene geometric information from a single image using a generative adversarial network

Publications (2)

Publication Number Publication Date
CN108830890A CN108830890A (en) 2018-11-16
CN108830890B (en) 2021-10-01

Family

ID=64154785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810376281.9A Active CN108830890B (en) 2018-04-24 2018-04-24 Method for estimating scene geometric information from a single image using a generative adversarial network

Country Status (1)

Country Link
CN (1) CN108830890B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782397B (en) * 2018-12-13 2020-08-28 北京嘀嘀无限科技发展有限公司 Image processing method, generation type countermeasure network, electronic equipment and storage medium
CN109788270B (en) * 2018-12-28 2021-04-09 南京美乐威电子科技有限公司 3D-360-degree panoramic image generation method and device
CN111274946B (en) * 2020-01-19 2023-05-05 杭州涂鸦信息技术有限公司 Face recognition method, system and equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307A (en) * 2016-06-27 2016-11-23 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307A (en) * 2016-06-27 2016-11-23 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Depth Prediction from a Single Image with Conditional Adversarial Networks; Hyungjoo Jung et al.; 2017 IEEE International Conference on Image Processing (ICIP); IEEE; 2018-02-22; pp. 1717-1720, sections 1-3, figs. 2-3 *
Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks; Zhu Junyan et al.; 2017 IEEE International Conference on Computer Vision (ICCV); IEEE; 2017-12-25; full text *
Application of deep learning in image recognition; Li Chaobo et al.; Journal of Nantong University (Natural Science Edition); 2018-03-20; Vol. 17, No. 1; full text *

Also Published As

Publication number Publication date
CN108830890A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
Moreau et al. Lens: Localization enhanced by nerf synthesis
CN109598754B (en) Binocular depth estimation method based on depth convolution network
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
US11210804B2 (en) Methods, devices and computer program products for global bundle adjustment of 3D images
WO2019005999A1 (en) Method and system for performing simultaneous localization and mapping using convolutional image transformation
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
CN108510535A (en) A kind of high quality depth estimation method based on depth prediction and enhancing sub-network
CN108648161A (en) The binocular vision obstacle detection system and method for asymmetric nuclear convolutional neural networks
CN110598590A (en) Close interaction human body posture estimation method and device based on multi-view camera
CN108830890B (en) Method for estimating scene geometric information from single image by using generative countermeasure network
CN110910437B (en) Depth prediction method for complex indoor scene
CN110243390B (en) Pose determination method and device and odometer
CN113256698B (en) Monocular 3D reconstruction method with depth prediction
CN114049434A (en) 3D modeling method and system based on full convolution neural network
CN104424640A (en) Method and device for carrying out blurring processing on images
CN116258817B (en) Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
CN112330795A (en) Human body three-dimensional reconstruction method and system based on single RGBD image
CN111612898B (en) Image processing method, image processing device, storage medium and electronic equipment
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN110942484A (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN116468769A (en) Depth information estimation method based on image
JP2022027464A (en) Method and device related to depth estimation of video
CN110428461B (en) Monocular SLAM method and device combined with deep learning
CN113313740B (en) Disparity map and surface normal vector joint learning method based on plane continuity
CN114935316B (en) Standard depth image generation method based on optical tracking and monocular vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant