CN110782490A - Video depth map estimation method and device with space-time consistency - Google Patents

Video depth map estimation method and device with space-time consistency

Info

Publication number
CN110782490A
Authority
CN
China
Prior art keywords
depth map
frame
estimation
training
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910907522.2A
Other languages
Chinese (zh)
Other versions
CN110782490B (en)
Inventor
肖春霞
胡煜
罗飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910907522.2A priority Critical patent/CN110782490B/en
Publication of CN110782490A publication Critical patent/CN110782490A/en
Application granted granted Critical
Publication of CN110782490B publication Critical patent/CN110782490B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video depth map estimation method and device with space-time consistency. The method comprises: generating a training set containing a plurality of sequences in which the central frame serves as the target view and the preceding and following frames serve as source views; for static objects in the scene, constructing a framework that jointly trains monocular depth and camera pose estimation from unlabeled video sequences, including the depth map estimation network structure, the camera pose estimation network structure and the loss function of this part; for moving objects in the scene, cascading an optical flow network after the created framework to model motion in the scene, including building the optical flow estimation network structure and the loss function of this part; for the spatio-temporal consistency check of the depth map, proposing a loss function for the deep neural network; continuously optimizing the model by jointly training monocular depth and camera pose estimation and then training the optical flow network; and using the optimized model to estimate depth maps for consecutive video frames.

Description

Video depth map estimation method and device with space-time consistency
Technical Field
The invention belongs to the field of geometric understanding of video scenes and relates to techniques for estimating depth maps of video frames, in particular to a technical scheme for estimating depth maps of consecutive video frames with space-time consistency.
Background
Understanding the 3D geometry of a scene in video is a fundamental problem of visual perception and includes many basic computer vision tasks such as depth estimation, camera pose estimation and optical flow estimation. A depth map is an image that contains the distance from object surfaces in the scene to the viewpoint. Estimating depth is an important component of understanding the geometric relationships in a scene, so a general image-based method for extracting depth maps is highly desirable. Distance relationships help to provide richer representations of objects and environments and enable functions such as 3D modeling, object recognition and robotics. In computer vision systems, distance information supports practical applications such as image segmentation, object detection, object tracking and three-dimensional reconstruction.
Existing depth map estimation methods mainly include manual scanning with physical equipment, traditional mathematical methods, supervised deep learning methods and unsupervised deep learning methods. Each of these has drawbacks. Equipment-based scanning relies on manual acquisition with physical devices, but existing three-dimensional scanners (such as Kinect) are expensive and unsuitable for general application scenarios. Traditional mathematical methods yield depth estimates of low accuracy and generally cannot handle complex scenes effectively. Supervised deep learning methods rely on deep learning, a network architecture and a mathematical model to obtain results; they depend strongly on labeled datasets whose acquisition consumes considerable manpower and material resources, and they generally generalize poorly. Unsupervised deep learning methods and existing video depth estimation methods usually ignore the spatial and temporal discontinuity of the depth map, and large errors often arise in occluded regions or non-Lambertian surface regions.
Disclosure of Invention
To overcome the defects of existing methods, the invention provides a technical scheme for depth estimation of consecutive video frames with spatio-temporal consistency, so that the estimated depth map obtains clearer details in some regions while the temporal continuity between different video frames is enhanced, making the final result more accurate.
The technical scheme of the invention provides a video depth map estimation method with spatio-temporal consistency, comprising the following steps:
step 1, generating a training set, in which the length of each image sequence is fixed to 3 frames, the central frame is used as the target view and the two adjacent frames are used as source views, producing a plurality of sequences;
step 2, for static objects in the scene, constructing a framework that jointly trains monocular depth and camera pose estimation from unlabeled video sequences, including building the depth map estimation network structure, the camera pose estimation network structure and the loss function of this part;
step 3, for moving objects in the scene, cascading an optical flow network after the framework created in step 2 to model motion in the scene, including building the optical flow estimation network structure and the loss function of this part;
step 4, for the spatio-temporal consistency check of the depth map, proposing a loss function for the deep neural network;
step 5, optimizing the model, including jointly training monocular depth and camera pose estimation and then training the remaining optical flow network on this basis; and using the optimized model to estimate depth maps for consecutive video frames.
Moreover, in step 2 a depth map estimation network and an optical flow estimation network composed of an encoder and a decoder are adopted, and multi-scale depth prediction is performed through cross-layer connections.
Moreover, in step 2 the unlabeled video is used for unsupervised training, which includes combining the geometric characteristics of the moving three-dimensional scene for joint training, incorporating them into an image synthesis loss, and using image similarity as supervision to perform unsupervised learning on the static and dynamic scenes in the images respectively.
Moreover, in step 4 a spatial consistency loss is proposed, which constrains the difference between the flow from the t-th frame image to the (t+1)-th frame image and the flow from the (t+1)-th frame image back to the t-th frame image; and a temporal consistency loss is proposed, which constrains the difference between the composition of the flow from frame t-1 to frame t with the flow from frame t to frame t+1 and the flow taken directly from frame t-1 to frame t+1.
The invention also provides a corresponding device for realizing the video depth map estimation method with space-time consistency.
The invention has the following advantages: 1. The invention provides a video depth estimation scheme with better generalization. 2. The invention proposes a spatio-temporal consistency check and a new loss function, which increases the correlation between the depth maps of different video frames and solves the problem of excessive frame-to-frame error in the depth map results of consecutive video frames. 3. The depth map estimation results in low-texture, three-dimensionally blurred, occluded and similar regions of the scene are improved, thereby improving the accuracy of the overall depth map estimation.
Drawings
Fig. 1 is an overall flowchart framework diagram of a video depth map estimation method with spatio-temporal consistency according to an embodiment of the present invention.
FIG. 2 is an overall framework diagram for jointly training monocular depth and camera pose estimates from unlabeled video sequences in accordance with an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.
The invention provides a video depth map estimation method that combines depth estimation, optical flow estimation and camera pose estimation for joint training through the geometric characteristics of a moving three-dimensional scene, incorporates them into an image synthesis loss, uses image similarity as supervision to perform unsupervised learning on the static and dynamic parts of the images respectively, and proposes a new loss function to improve the results with respect to the spatial and temporal discontinuity of depth that frequently occurs in video depth map estimation. Referring to fig. 1, a method for estimating a video depth map with spatio-temporal consistency according to an embodiment of the present invention comprises the following steps:
step 1, a training set is made according to a public data set commonly used in the field of video depth estimation.
The implementation of step 1 in the embodiment is carried out as follows:
The KITTI dataset, commonly used in the field of video depth estimation, is adopted. KITTI is a computer vision image dataset for autonomous driving scenes, covering urban, rural, road and other scenes; its images contain up to more than ten vehicles and thirty pedestrians as well as various conditions such as occlusion and motion, and therefore carry rich image information. The specific processing fixes the length of each image sequence to 3 frames, takes the central frame as the target view and takes the ±1 frames (i.e. the two adjacent frames) as source views. From the images in the KITTI dataset, a total of 12000 sequences were obtained, of which 10800 were used for training and 1200 for validation.
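For illustration, a minimal Python sketch of this sequence-generation step is given below. The directory path, file pattern and the helper name build_sequences are assumptions for illustration and are not part of the patented preprocessing.

```python
# Illustrative sketch: grouping consecutive video frames into 3-frame training samples
# (central frame = target view, the +/-1 frames = source views).
# The directory layout and file extension are assumptions.
import os
from glob import glob

def build_sequences(frame_dir, seq_len=3):
    """Return a list of (target_frame, [source_frames]) tuples."""
    frames = sorted(glob(os.path.join(frame_dir, "*.png")))
    half = seq_len // 2
    sequences = []
    for i in range(half, len(frames) - half):
        target = frames[i]                                          # central frame as target view
        sources = frames[i - half:i] + frames[i + 1:i + half + 1]   # adjacent frames as source views
        sequences.append((target, sources))
    return sequences

if __name__ == "__main__":
    seqs = build_sequences("kitti/raw/2011_09_26_drive_0001/image_02/data")
    split = int(0.9 * len(seqs))                 # roughly the 10800 / 1200 split of the embodiment
    train_seqs, val_seqs = seqs[:split], seqs[split:]
    print(len(train_seqs), "training sequences,", len(val_seqs), "validation sequences")
```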
Step 2, constructing a framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences.
A framework for jointly training monocular depth and camera pose estimation from unlabeled video sequences is constructed for the static objects in a scene. The key supervisory signal for the depth and camera pose prediction convolutional neural networks in this step comes from the task of view synthesis: given an input view of a scene, new images of the scene are synthesized as seen from different camera poses.
The invention preferably adopts a depth map estimation network and an optical flow estimation network composed of an encoder and a decoder, and uses cross-layer connections for multi-scale depth prediction, which improves both the efficiency of the computation and the accuracy of the result.
The invention proposes unsupervised training with unlabeled video: the geometric characteristics of the moving three-dimensional scene are combined for joint training and incorporated into an image synthesis loss, and image similarity is used as supervision to train the static and dynamic scenes in the images respectively in an unsupervised manner. This saves a large amount of manpower and material resources and gives the invention greater universality.
Referring to fig. 2, the implementation of step 2 in the embodiment is described as follows:
(1) Building the depth map estimation network structure.
Since the depth map estimation network must learn and compute geometric relationships at the pixel level, the depth network mainly consists of two parts, an encoder and a decoder; the specific network structures of the encoder and decoder are shown in Table 1 and Table 2. The encoder uses convolutional layers as an efficient means of learning. The decoder consists of deconvolution layers, which upsample the spatial features back to the full scale of the input. In order to preserve global high-level features and local detail information at the same time, cross-layer connections are used between the encoder and the decoder, and multi-scale depth prediction is performed.
TABLE 1 Encoder network structure [table provided as an image in the original publication]
TABLE 2 Decoder network structure [table provided as an image in the original publication]
In the tables, Conv1, Conv1b, Conv2, Conv2b, ..., Conv7 and Conv7b are convolutional layers; Disp1 to Disp4 are the cross-layer (multi-scale disparity) outputs; Iconv1 to Iconv7 and Upconv1 to Upconv7 are deconvolution layers; k is the kernel size, s is the stride, and chns is the number of input and output channels of each layer; in and out give the downscaling factor of each layer relative to the input image (i.e. the ratio of the input size, respectively the output size, to the original image size); input indicates the input of each layer, where + denotes concatenation and a layer marked for upsampling is upsampled by a factor of 2.
The network structure is divided into 6 scales. The largest scale is that of the original image, and each subsequent scale is half the size of the previous one, so the feature map at the smallest scale has only one sixty-fourth of the original resolution but as many as 512 channels. Downsampling is performed with max pooling in the encoder part, and upsampling is performed with deconvolution layers in the decoder part. At each scale, the output of the encoder is passed through a cross-layer connection to the decoder of the corresponding scale; after the output feature map is concatenated with that of the corresponding decoder scale, the combined feature map is fed as input into the corresponding deconvolution layer.
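As an illustration of such an encoder-decoder structure with cross-layer connections and multi-scale outputs, a reduced PyTorch sketch is given below. The channel counts, the number of scales and the use of a sigmoid disparity head are assumptions and do not reproduce Tables 1 and 2.

```python
# Illustrative sketch of an encoder-decoder depth network with cross-layer (skip)
# connections and multi-scale disparity outputs; channel counts are assumptions.
import torch
import torch.nn as nn

def conv(cin, cout):    # stride-2 convolution: halves the spatial resolution
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True))

def upconv(cin, cout):  # transposed convolution: doubles the spatial resolution
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.ReLU(inplace=True))

class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = conv(3, 32), conv(32, 64)
        self.enc3, self.enc4 = conv(64, 128), conv(128, 256)
        self.dec4 = upconv(256, 128)
        self.dec3 = upconv(128 + 128, 64)   # decoder input is concatenated with encoder features
        self.dec2 = upconv(64 + 64, 32)
        self.dec1 = upconv(32 + 32, 16)
        # multi-scale disparity heads; sigmoid keeps the prediction bounded
        self.disp2 = nn.Sequential(nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())
        self.disp1 = nn.Sequential(nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):                    # x: (B, 3, H, W), H and W divisible by 16
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d4 = self.dec4(e4)
        d3 = self.dec3(torch.cat([d4, e3], dim=1))   # cross-layer connection
        d2 = self.dec2(torch.cat([d3, e2], dim=1))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return self.disp1(d1), self.disp2(d2)        # predictions at two scales
```

Depth can then be obtained from the predicted disparity, for example as depth = 1 / (a * disp + b) with fixed scaling constants; the exact conversion is an assumption here.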
(2) Building the camera pose estimation network structure.
The camera pose estimation network regresses the camera pose (the Euler angles of the camera rotation and the translation vector). Its main structure is similar to the encoder of the network in (1): eight convolutional layers are followed by a global average pooling layer (Pool) and finally a prediction layer (denoted Softmax in Table 3); the specific network structure is shown in Table 3. Batch Normalization and ReLU activation functions are used for all layers except the last prediction layer.
TABLE 3 Camera pose estimation network structure [table provided as an image in the original publication]
In Table 3, the 8 convolutional layers are denoted Conv1, Conv1b, Conv2, Conv2b, Conv3, Conv3b, Conv4 and Conv4b, respectively, and Fc1 and Fc2 are fully connected layers.
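For illustration, a PyTorch sketch of such a pose-regression network is given below. The channel widths, the fully connected head and the 0.01 output scaling are assumptions for illustration and do not reproduce Table 3 exactly.

```python
# Illustrative sketch of a pose network: a convolutional encoder followed by global
# average pooling and fully connected layers that regress a 6-DoF relative pose
# (3 Euler angles + 3 translation components) for each source view.
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self, num_source=2):
        super().__init__()
        self.num_source = num_source
        chans = [16, 32, 64, 128, 256, 256, 256, 256]      # 8 convolutional layers
        layers, cin = [], 3 * (1 + num_source)             # target + source views stacked on channels
        for cout in chans:
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                       nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
            cin = cout
        self.encoder = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)                # global average pooling
        self.fc1 = nn.Linear(256, 128)                     # Fc1
        self.fc2 = nn.Linear(128, 6 * num_source)          # Fc2: 6-DoF pose per source view

    def forward(self, target, sources):
        x = torch.cat([target] + sources, dim=1)
        x = self.pool(self.encoder(x)).flatten(1)
        pose = self.fc2(torch.relu(self.fc1(x)))
        # a small scaling keeps initial pose predictions close to the identity transform
        return 0.01 * pose.view(-1, self.num_source, 6)
```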
(3) Building the loss function of this part.
The depth network takes only the target view I_t as input and outputs a per-pixel depth map D_t. The camera pose network takes the target view (I_t) and an adjacent source view (e.g. I_{t+1}) as input and outputs the relative camera pose T_{t→t+1}. The outputs of the two networks are then used to inverse-warp the source view to reconstruct the target view, and the photometric error is used to train the convolutional neural networks. By using view synthesis as supervision, this framework can be trained from unlabeled video in an unsupervised manner.
Let <I_1, ..., I_n> denote the training image sequence, where n is the number of picture frames, i.e. there are n pictures I_1, ..., I_n in total. Here n is the number of pictures of the entire dataset, but each computation is carried out on three consecutive frames. In a specific implementation the computation could also be carried out on more than three frames at a time, but the amount of computation then increases.
One frame I_t is selected as the target view, and the remaining frames serve as source views I_s (1 ≤ s ≤ n, s ≠ t). The supervisory signal can be expressed as
L_rs = Σ_s Σ_p | I_t(p) - Î_s^rig(p) |
where p indexes the pixel coordinates and Î_s^rig denotes the view of the target frame synthesized from the source frame I_s via the rigid flow; the superscript rig indicates that this part only takes static rigid objects into account. The supervisory signal at this stage therefore comes from minimizing the difference between the synthesized view Î_s^rig and the original frame I_t. Here I_t(p) is the value of point p in picture I_t, Î_s^rig(p) is the value at the position of point p computed via the rigid flow, and L_rs is their difference; the invention requires L_rs to be as small as possible during training.
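A minimal sketch of this supervisory signal is given below, assuming the synthesized views Î_s^rig have already been produced by the projective warping described next; the function name is an assumption for illustration.

```python
# Illustrative sketch of the rigid view-synthesis loss L_rs: the mean absolute (photometric)
# difference between the target frame and each source view warped into the target view.
import torch

def rigid_view_synthesis_loss(target, warped_sources):
    """target: (B, 3, H, W); warped_sources: list of (B, 3, H, W) synthesized target views."""
    loss = torch.zeros((), device=target.device)
    for warped in warped_sources:
        loss = loss + torch.mean(torch.abs(target - warped))   # |I_t(p) - I_s^rig(p)|
    return loss
```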
A key component of this framework is a differentiable depth-image-based renderer that reconstructs the target view by sampling pixels from the source view, based on the predicted depth map D_t and the relative pose T_{t→s}. Let p_t denote the homogeneous coordinates of a pixel in the target view and K the camera intrinsic matrix; the projection of p_t onto the source view, p_s, is obtained by the formula
p_s ~ K T_{t→s} D_t(p_t) K^{-1} p_t
It should be noted that the projected coordinates p_s are continuous values. To obtain the value I_s(p_s) used to fill Î_s^rig(p_t), where D_t(p_t) is the depth of point p in frame t, I_s(p_s) is the value at position p_s in frame s, and Î_s^rig(p_t) is the predicted value for point p of the t-frame picture, a differentiable bilinear sampling mechanism is used: the values of the 4 pixels neighbouring p_s (top-left, top-right, bottom-left and bottom-right) are linearly interpolated to approximate I_s(p_s), i.e.
Î_s^rig(p_t) = I_s(p_s) = Σ_{i∈{t,b}, j∈{l,r}} w^{ij} I_s(p_s^{ij})
where w^{ij} is linearly proportional to the spatial proximity between p_s and p_s^{ij}, and Σ_{i,j} w^{ij} = 1. In other words, the bilinear interpolation approximates I_s(p_s) by linearly interpolating the values of the 4 pixels around p_s; t, b, l and r denote top, bottom, left and right respectively, and w^{ij} is the weight assigned to each neighbouring point. The pixel warping coordinates obtained here are decomposed into depth and camera pose through projective geometry.
The differentiable bilinear sampling mechanism (bilinear differentiable interpolation) is prior art, and its details are omitted in the present invention.
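By way of illustration, a compact PyTorch sketch of this projective warping followed by differentiable bilinear sampling (torch.nn.functional.grid_sample) is given below; the tensor shapes and the helper name inverse_warp are assumptions for illustration.

```python
# Illustrative sketch of p_s ~ K T_{t->s} D_t(p_t) K^{-1} p_t followed by bilinear sampling.
import torch
import torch.nn.functional as F

def inverse_warp(src_img, depth, pose, K):
    """src_img: (B,3,H,W); depth: (B,1,H,W); pose: (B,4,4) target->source; K: (B,3,3)."""
    B, _, H, W = src_img.shape
    device = src_img.device
    # homogeneous pixel grid p_t = (u, v, 1)
    v, u = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                          torch.arange(W, device=device, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(u)
    pix = torch.stack([u, v, ones], dim=0).view(1, 3, -1).expand(B, 3, H * W)
    # back-project: camera points = D_t(p_t) K^{-1} p_t
    cam = torch.inverse(K).bmm(pix) * depth.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)   # homogeneous
    # transform into the source frame and project: p_s ~ K [R|t] X
    src_cam = pose.bmm(cam_h)[:, :3]
    src_pix = K.bmm(src_cam)
    src_pix = src_pix[:, :2] / (src_pix[:, 2:3] + 1e-7)
    # normalize to [-1, 1] for grid_sample (the differentiable bilinear sampler)
    x = 2.0 * src_pix[:, 0] / (W - 1) - 1.0
    y = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([x, y], dim=2).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="zeros", align_corners=True)
```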
Step 3, after the framework created in step 2, cascading an optical flow network to model motion in the scene.
Step 2 ignores moving objects in the scene. After the optical flow estimation network is added, the depth map of moving objects can be effectively compensated and corrected to a certain extent, which improves the accuracy of the result.
The specific implementation process is described as follows:
(1) Building the optical flow estimation network structure. The remaining non-rigid flow, whose displacements are caused only by the relative motion of objects with respect to the world scene, is learned with the optical flow network. The framework of the optical flow estimation network is similar to the depth map estimation network structure in step 2 (the same network structure as in Tables 1 and 2 can be used, since both finally produce an output with the same resolution as the input picture) and likewise consists of two parts, an encoder and a decoder. The optical flow network is connected in cascade after the first-stage networks. For a given pair of image frames, the optical flow network uses the rigid flow f^rig_{t→s} output by the networks of step 2 and predicts the corresponding residual flow f^res_{t→s}; the final overall predicted flow f^full_{t→s} is
f^full_{t→s} = f^rig_{t→s} + f^res_{t→s}
The input of the optical flow network consists of several images concatenated along the channel dimension, including the source-frame and target-frame image pair I_s and I_t, the synthesized view Î_s^rig output by the networks of step 2, and the error between this synthesized view and the original image I_s.
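For illustration, a minimal sketch of this cascade is given below; the function name full_flow and the exact composition of the input channels are assumptions for illustration.

```python
# Illustrative sketch of the cascade: the flow network takes the frame pair together with the
# first-stage synthesized view and its error map (concatenated along channels) and predicts a
# residual flow, which is added to the rigid flow to give the full flow.
import torch

def full_flow(flow_net, rigid_flow, target, source, synthesized):
    """rigid_flow: (B,2,H,W) from depth + pose; synthesized: first-stage warped source view."""
    error = torch.abs(target - synthesized)                  # photometric error of stage 1
    inp = torch.cat([target, source, synthesized, error], dim=1)
    residual_flow = flow_net(inp)                            # non-rigid (residual) motion
    return rigid_flow + residual_flow                        # f_full = f_rig + f_res
```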
(2) Building the loss function of this part. The supervision of step 2 is extended to this stage with slight modifications (introducing the influence of the optical flow component on the scene flow). Step 2 mainly handles the static scene and ignores the processing of moving objects. To improve the robustness of the learning process to these factors, an optical flow network is incorporated to train the remaining flow (the optical flow part) beyond the rigid flow. Specifically, over the full predicted flow f^full_{t→s}, image warping is performed between any pair of target and source frames using the full-flow synthesized view Î_s^full instead of the previous Î_s^rig, which yields the warping loss of the full flow, L_fs. The concrete formula is
L_fs = Σ_s Σ_p | I_t(p) - Î_s^full(p) |
where Î_s^full(p) is the value at the position of point p computed via the full flow.
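A minimal sketch of flow-based warping and the resulting full-flow warp loss is given below; the helper name flow_warp and the flow sign convention (displacement from the target frame toward the source frame) are assumptions for illustration.

```python
# Illustrative sketch of warping a source frame with the full flow and computing L_fs.
import torch
import torch.nn.functional as F

def flow_warp(img, flow):
    """img: (B,C,H,W); flow: (B,2,H,W) pixel displacements from the target to the source frame."""
    B, _, H, W = img.shape
    v, u = torch.meshgrid(torch.arange(H, device=img.device, dtype=torch.float32),
                          torch.arange(W, device=img.device, dtype=torch.float32), indexing="ij")
    grid_x = (u.unsqueeze(0) + flow[:, 0]) * 2.0 / (W - 1) - 1.0
    grid_y = (v.unsqueeze(0) + flow[:, 1]) * 2.0 / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=3)
    return F.grid_sample(img, grid, padding_mode="zeros", align_corners=True)

def full_flow_warp_loss(target, source, flow):
    """L_fs: photometric error between the target and the source warped by the full flow."""
    return torch.mean(torch.abs(target - flow_warp(source, flow)))
```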
Step 4, proposing the loss function of the deep neural network.
For the spatio-temporal consistency check of the depth map, a loss function of the deep neural network is proposed, which prevents excessive errors between the results of consecutive video frames and at the same time improves, to a certain extent, the estimation results in low-texture, three-dimensionally blurred, occluded and similar regions of the scene.
The specific implementation process is described as follows:
the spatial consistency loss provided by the invention is realized by restricting the difference of flow values from a t frame image to a t +1 frame image and from a t +1 frame image to a t frame image, and the temporal consistency loss is realized by adding the difference restriction of the flow values from the t frame to the t +1 frame image and the flow values from the t-1 frame to the t +1 frame directly, wherein the specific formula is shown as the formula:
Figure BDA0002213717400000081
Figure BDA0002213717400000082
wherein the content of the first and second substances,
Figure BDA0002213717400000083
for the position of p-point at t-frame calculated by the overall stream of p-points from s-frame to t-frame, I s(p) is the position of the p point in the s frame image,
Figure BDA0002213717400000084
for the entire stream of t-1 frames to t frames,
Figure BDA0002213717400000085
for the entire stream of t frames to t +1 frames,
Figure BDA0002213717400000086
the overall stream is from t-1 frames to t +1 frames. L is ftFor differences between the stream values from t-frame to s-frame and the stream values from s-frame to t-frame, L fpIs the difference between the stream value from the t-1 frame to the t frame plus the stream value from the t frame to the t +1 frame and the stream value from the t-1 frame directly to the t +1 frame. Ideally these two values should be as small as possible so they are used as a loss function to train the network.
Pixels whose forward and backward flows are severely contradictory (i.e. whose consistency error is too large) are considered possible outliers. Since these regions violate the assumptions of photometric consistency and geometric consistency, they are handled here only through an additional smoothness loss. Accordingly, the full-flow warping loss L_fs and the spatio-temporal consistency losses L_ft and L_fp are weighted per pixel.
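For illustration, a sketch of the two consistency losses is given below, reusing the flow_warp helper sketched earlier; the exact composition form and the optional per-pixel weighting argument are assumptions consistent with the description above.

```python
# Illustrative sketch of the spatial (forward-backward) and temporal (flow-composition)
# consistency losses; flow fields can be warped with flow_warp just like images.
import torch

def spatial_consistency_loss(flow_fwd, flow_bwd, weight=None):
    """Constrains f_{t->t+1}(p) + f_{t+1->t}(p + f_{t->t+1}(p)) to be small (L_ft)."""
    bwd_at_fwd = flow_warp(flow_bwd, flow_fwd)        # backward flow sampled at the warped positions
    diff = torch.abs(flow_fwd + bwd_at_fwd).sum(dim=1, keepdim=True)
    return torch.mean(diff if weight is None else weight * diff)

def temporal_consistency_loss(flow_t1_t, flow_t_t2, flow_t1_t2, weight=None):
    """Constrains the composed t-1 -> t -> t+1 flow to match the direct t-1 -> t+1 flow (L_fp)."""
    t_t2_at_t1 = flow_warp(flow_t_t2, flow_t1_t)      # t -> t+1 flow sampled where t-1 pixels land
    diff = torch.abs(flow_t1_t + t_t2_at_t1 - flow_t1_t2).sum(dim=1, keepdim=True)
    return torch.mean(diff if weight is None else weight * diff)
```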
Step 5, setting the training parameters of the network and continuously optimizing the model according to the error of each epoch. During training the defined loss functions are required to decrease iteratively; the smaller they become, the more accurate the model. The optimized model can then be used to estimate depth maps for consecutive video frames.
In a specific implementation, the monocular depth and camera pose estimation can be trained jointly first, and the remaining optical flow network is then trained on this basis. Finally, trained network models for depth map estimation, camera pose estimation and optical flow estimation are obtained.
The specific implementation process is described as follows:
the invention mainly comprises three sub-networks, namely a depth map estimation network and a camera pose estimation network, which form the reconstruction of a static object together, and the optical flow estimation network structure is combined with the output of the previous stage to realize the positioning of a moving object. Although the network can be trained together in an end-to-end fashion, there is no guarantee that local gradient optimization will bring the network to an optimal point. Therefore, a segmented training strategy is employed while reducing memory and computation consumption. Firstly, training a depth map estimation network and a camera pose estimation network, determining weights, and then training an optical flow estimation network. The resolution of the trained input images is resize to 128 x 416, while random upscaling, cropping, recoloring, etc. methods are also used to prevent overfitting. The network optimization function adopts a common neural network optimization method Adam. The initial learning rate was set to 0.0002 and the mini-batch size (minimum batch size) was set to 4. The first and second stages of the training process converge with 30 and 200 epochs (iterations), respectively. Testing on the KITTI data set it should be understood that parts not elaborated on in this specification are prior art.
In the above process, the main characteristics are: a temporal consistency check of the depth map is proposed and the loss function of the deep neural network is improved; a spatio-temporal consistency check specifically for the video depth map is constructed inside the deep learning model, and the overall loss function is improved, preventing excessive errors between the results of consecutive video frames. At the same time, the estimation results in low-texture, three-dimensionally blurred, occluded and similar regions of the scene are improved to a certain extent.
In a specific implementation, the above process can be run automatically by means of software. An apparatus for running the above process should also fall within the protection scope of the present invention.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A video depth map estimation method with space-time consistency, characterized by comprising the following steps:
step 1, generating a training set, in which the length of each image sequence is fixed to 3 frames, the central frame is used as the target view and the two adjacent frames are used as source views, producing a plurality of sequences;
step 2, for static objects in the scene, constructing a framework that jointly trains monocular depth and camera pose estimation from unlabeled video sequences, including building the depth map estimation network structure, the camera pose estimation network structure and the loss function of this part;
step 3, for moving objects in the scene, cascading an optical flow network after the framework created in step 2 to model motion in the scene, including building the optical flow estimation network structure and the loss function of this part;
step 4, for the spatio-temporal consistency check of the depth map, proposing a loss function for the deep neural network;
step 5, optimizing the model, including jointly training monocular depth and camera pose estimation and then training the remaining optical flow network on this basis; and using the optimized model to estimate depth maps for consecutive video frames.
2. The video depth map estimation method with space-time consistency according to claim 1, characterized in that: in step 2, a depth map estimation network and an optical flow estimation network composed of an encoder and a decoder are adopted, and cross-layer connections are used for multi-scale depth prediction.
3. The video depth map estimation method with space-time consistency according to claim 1, characterized in that: in step 2, unsupervised training is performed with the unlabeled video, including combining the geometric characteristics of the moving three-dimensional scene for training, incorporating them into an image synthesis loss, and using image similarity as supervision to perform unsupervised learning on the static and dynamic scenes in the images respectively.
4. The video depth map estimation method with space-time consistency according to claim 1, 2 or 3, characterized in that: in step 4, a spatial consistency loss is proposed, which constrains the difference between the flow from the t-th frame image to the (t+1)-th frame image and the flow from the (t+1)-th frame image back to the t-th frame image; and a temporal consistency loss is proposed, which constrains the difference between the composition of the flow from frame t-1 to frame t with the flow from frame t to frame t+1 and the flow taken directly from frame t-1 to frame t+1.
5. An apparatus for video depth map estimation with space-time consistency, characterized in that it is configured to implement the video depth map estimation method with space-time consistency according to any one of claims 1 to 4.
CN201910907522.2A 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency Active CN110782490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910907522.2A CN110782490B (en) 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910907522.2A CN110782490B (en) 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency

Publications (2)

Publication Number Publication Date
CN110782490A true CN110782490A (en) 2020-02-11
CN110782490B CN110782490B (en) 2022-07-05

Family

ID=69383733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910907522.2A Active CN110782490B (en) 2019-09-24 2019-09-24 Video depth map estimation method and device with space-time consistency

Country Status (1)

Country Link
CN (1) CN110782490B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111311664A (en) * 2020-03-03 2020-06-19 上海交通大学 Joint unsupervised estimation method and system for depth, pose and scene stream
CN111402310A (en) * 2020-02-29 2020-07-10 同济大学 Monocular image depth estimation method and system based on depth estimation network
CN111583305A (en) * 2020-05-11 2020-08-25 北京市商汤科技开发有限公司 Neural network training and motion trajectory determination method, device, equipment and medium
CN111709982A (en) * 2020-05-22 2020-09-25 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN112085717A (en) * 2020-09-04 2020-12-15 厦门大学 Video prediction method and system for laparoscopic surgery
CN112270691A (en) * 2020-10-15 2021-01-26 电子科技大学 Monocular video structure and motion prediction method based on dynamic filter network
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
CN112801074A (en) * 2021-04-15 2021-05-14 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera
CN113222895A (en) * 2021-04-10 2021-08-06 河南巨捷电子科技有限公司 Electrode defect detection method and system based on artificial intelligence
CN113284173A (en) * 2021-04-20 2021-08-20 中国矿业大学 End-to-end scene flow and pose joint learning method based on pseudo laser radar
CN114359363A (en) * 2022-01-11 2022-04-15 浙江大学 Video consistency depth estimation method and device based on deep learning
CN114663347A (en) * 2022-02-07 2022-06-24 中国科学院自动化研究所 Unsupervised object instance detection method and unsupervised object instance detection device
CN114937125A (en) * 2022-07-25 2022-08-23 深圳大学 Reconstructable metric information prediction method, reconstructable metric information prediction device, computer equipment and storage medium
CN115131404A (en) * 2022-07-01 2022-09-30 上海人工智能创新中心 Monocular 3D detection method based on motion estimation depth
WO2022206020A1 (en) * 2021-03-31 2022-10-06 中国科学院深圳先进技术研究院 Method and apparatus for estimating depth of field of image, and terminal device and storage medium
CN115187638A (en) * 2022-09-07 2022-10-14 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
CN117115786A (en) * 2023-10-23 2023-11-24 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120127267A1 (en) * 2010-11-23 2012-05-24 Qualcomm Incorporated Depth estimation based on global motion
CN103002309A (en) * 2012-09-25 2013-03-27 浙江大学 Depth recovery method for time-space consistency of dynamic scene videos shot by multi-view synchronous camera
CN105100771A (en) * 2015-07-14 2015-11-25 山东大学 Single-viewpoint video depth obtaining method based on scene classification and geometric dimension
CN106599805A (en) * 2016-12-01 2017-04-26 华中科技大学 Supervised data driving-based monocular video depth estimating method
CN106612427A (en) * 2016-12-29 2017-05-03 浙江工商大学 Method for generating spatial-temporal consistency depth map sequence based on convolution neural network
CN107274445A (en) * 2017-05-19 2017-10-20 华中科技大学 A kind of image depth estimation method and system
CN107481279A (en) * 2017-05-18 2017-12-15 华中科技大学 A kind of monocular video depth map computational methods

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120127267A1 (en) * 2010-11-23 2012-05-24 Qualcomm Incorporated Depth estimation based on global motion
CN103002309A (en) * 2012-09-25 2013-03-27 浙江大学 Depth recovery method for time-space consistency of dynamic scene videos shot by multi-view synchronous camera
CN105100771A (en) * 2015-07-14 2015-11-25 山东大学 Single-viewpoint video depth obtaining method based on scene classification and geometric dimension
CN106599805A (en) * 2016-12-01 2017-04-26 华中科技大学 Supervised data driving-based monocular video depth estimating method
CN106612427A (en) * 2016-12-29 2017-05-03 浙江工商大学 Method for generating spatial-temporal consistency depth map sequence based on convolution neural network
CN107481279A (en) * 2017-05-18 2017-12-15 华中科技大学 A kind of monocular video depth map computational methods
CN107274445A (en) * 2017-05-19 2017-10-20 华中科技大学 A kind of image depth estimation method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANTONIO W. VIEIRA等: "STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences", 《PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS》, 30 September 2012 (2012-09-30), pages 1 - 8 *
TAK-WAI HUI等: "Dense depth map generation using sparse depth data from normal flow", 《2014 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)》, 29 January 2015 (2015-01-29), pages 3837 - 3841 *
JIANG HANQING et al.: "Spatio-temporally consistent depth recovery of dynamic scenes captured by multiple hand-held cameras", Journal of Computer-Aided Design & Computer Graphics, vol. 25, no. 2, 28 February 2013 (2013-02-28), pages 137 - 145 *
GE LIYUE et al.: "3D scene flow estimation based on optimized hierarchical segmentation of depth images", Journal of Nanchang Hangkong University (Natural Sciences), vol. 32, no. 2, 30 June 2018 (2018-06-30), pages 17 - 25 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402310A (en) * 2020-02-29 2020-07-10 同济大学 Monocular image depth estimation method and system based on depth estimation network
CN111402310B (en) * 2020-02-29 2023-03-28 同济大学 Monocular image depth estimation method and system based on depth estimation network
CN111311664A (en) * 2020-03-03 2020-06-19 上海交通大学 Joint unsupervised estimation method and system for depth, pose and scene stream
CN111311664B (en) * 2020-03-03 2023-04-21 上海交通大学 Combined unsupervised estimation method and system for depth, pose and scene flow
CN111583305A (en) * 2020-05-11 2020-08-25 北京市商汤科技开发有限公司 Neural network training and motion trajectory determination method, device, equipment and medium
CN111709982B (en) * 2020-05-22 2022-08-26 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN111709982A (en) * 2020-05-22 2020-09-25 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN112085717A (en) * 2020-09-04 2020-12-15 厦门大学 Video prediction method and system for laparoscopic surgery
CN112085717B (en) * 2020-09-04 2024-03-19 厦门大学 Video prediction method and system for laparoscopic surgery
CN112270691A (en) * 2020-10-15 2021-01-26 电子科技大学 Monocular video structure and motion prediction method based on dynamic filter network
CN112270691B (en) * 2020-10-15 2023-04-21 电子科技大学 Monocular video structure and motion prediction method based on dynamic filter network
CN112344922A (en) * 2020-10-26 2021-02-09 中国科学院自动化研究所 Monocular vision odometer positioning method and system
WO2022206020A1 (en) * 2021-03-31 2022-10-06 中国科学院深圳先进技术研究院 Method and apparatus for estimating depth of field of image, and terminal device and storage medium
CN113222895A (en) * 2021-04-10 2021-08-06 河南巨捷电子科技有限公司 Electrode defect detection method and system based on artificial intelligence
CN112801074B (en) * 2021-04-15 2021-07-16 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera
CN112801074A (en) * 2021-04-15 2021-05-14 速度时空信息科技股份有限公司 Depth map estimation method based on traffic camera
CN113284173B (en) * 2021-04-20 2023-12-19 中国矿业大学 End-to-end scene flow and pose joint learning method based on false laser radar
CN113284173A (en) * 2021-04-20 2021-08-20 中国矿业大学 End-to-end scene flow and pose joint learning method based on pseudo laser radar
CN114359363A (en) * 2022-01-11 2022-04-15 浙江大学 Video consistency depth estimation method and device based on deep learning
CN114663347A (en) * 2022-02-07 2022-06-24 中国科学院自动化研究所 Unsupervised object instance detection method and unsupervised object instance detection device
CN115131404A (en) * 2022-07-01 2022-09-30 上海人工智能创新中心 Monocular 3D detection method based on motion estimation depth
CN115131404B (en) * 2022-07-01 2024-06-14 上海人工智能创新中心 Monocular 3D detection method based on motion estimation depth
CN114937125A (en) * 2022-07-25 2022-08-23 深圳大学 Reconstructable metric information prediction method, reconstructable metric information prediction device, computer equipment and storage medium
CN115187638A (en) * 2022-09-07 2022-10-14 南京逸智网络空间技术创新研究院有限公司 Unsupervised monocular depth estimation method based on optical flow mask
WO2024051184A1 (en) * 2022-09-07 2024-03-14 南京逸智网络空间技术创新研究院有限公司 Optical flow mask-based unsupervised monocular depth estimation method
CN117115786A (en) * 2023-10-23 2023-11-24 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method
CN117115786B (en) * 2023-10-23 2024-01-26 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Also Published As

Publication number Publication date
CN110782490B (en) 2022-07-05

Similar Documents

Publication Publication Date Title
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN111739078B (en) Monocular unsupervised depth estimation method based on context attention mechanism
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
TWI739151B (en) Method, device and electronic equipment for image generation network training and image processing
CN111354030B (en) Method for generating unsupervised monocular image depth map embedded into SENet unit
CN113077505B (en) Monocular depth estimation network optimization method based on contrast learning
CN110910437B (en) Depth prediction method for complex indoor scene
CN109903315B (en) Method, apparatus, device and readable storage medium for optical flow prediction
CN111325784A (en) Unsupervised pose and depth calculation method and system
CN113850900B (en) Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction
CN110942484B (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN112270692B (en) Monocular video structure and motion prediction self-supervision method based on super-resolution
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN115187638B (en) Unsupervised monocular depth estimation method based on optical flow mask
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN115294282A (en) Monocular depth estimation system and method for enhancing feature fusion in three-dimensional scene reconstruction
CN111899295A (en) Monocular scene depth prediction method based on deep learning
CN116205962B (en) Monocular depth estimation method and system based on complete context information
CN115546273A (en) Scene structure depth estimation method for indoor fisheye image
CN117593702A (en) Remote monitoring method, device, equipment and storage medium
CN113610912A (en) System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction
CN115578260A (en) Attention method and system for direction decoupling for image super-resolution
CN115131418A (en) Monocular depth estimation algorithm based on Transformer
CN115272450A (en) Target positioning method based on panoramic segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant